BACKGROUND: There is notable heterogeneity in the clinical presentation of patients with COPD. To characterise this heterogeneity, we sought to identify subgroups of smokers by applying cluster analysis to data from the COPDGene study. METHODS: We applied a clustering method, k-means, to data from 10 192 smokers in the COPDGene study. After splitting the sample into training and validation sets, we evaluated three sets of input features across a range of values of k (the user-specified number of clusters). Stable solutions were tested for association with four COPD-related measures and five genetic variants previously associated with COPD at genome-wide significance. The results were confirmed in the validation set. FINDINGS: We identified four clusters that can be characterised as (1) relatively resistant smokers (ie, no/mild obstruction and minimal emphysema despite heavy smoking), (2) mild upper zone emphysema-predominant, (3) airway disease-predominant and (4) severe emphysema. All clusters were strongly associated with COPD-related clinical characteristics, including exacerbations and dyspnoea (p<0.001). We found strong genetic associations between the mild upper zone emphysema group and rs1980057 near HHIP, and between the severe emphysema group and rs8034191 in the chromosome 15q region (p<0.001). All significant associations were replicated at p<0.05 in the validation sample (12/12 associations with clinical measures and 2/2 genetic associations). INTERPRETATION: Cluster analysis identifies four subgroups of smokers that show robust associations with clinical characteristics of COPD and known COPD-associated genetic variants.
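
ILLUSTRATIVE SKETCH: The workflow summarised under METHODS (k-means on a training set, comparison of solutions across a range of k, then confirmation in a held-out set) can be sketched roughly as follows. This is a minimal illustration, not the study's implementation: the data are synthetic two-dimensional points rather than COPDGene measurements, the stability measure (mean pairwise Rand index across random restarts) is an assumption because the abstract does not state how stability was assessed, and the helper names `nearest`, `kmeans` and `rand_index` are hypothetical.

```python
import random
from itertools import combinations

def nearest(p, centroids):
    """Index of the centroid closest to point p (squared Euclidean distance)."""
    return min(range(len(centroids)),
               key=lambda j: sum((a - b) ** 2 for a, b in zip(p, centroids[j])))

def kmeans(points, k, seed, iters=100):
    """Plain Lloyd's algorithm; returns (centroids, labels)."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)  # initialise from k data points
    for _ in range(iters):
        labels = [nearest(p, centroids) for p in points]
        updated = []
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            # Keep the old centroid if a cluster happens to be empty.
            updated.append(tuple(sum(c) / len(members) for c in zip(*members))
                           if members else centroids[j])
        if updated == centroids:
            break
        centroids = updated
    return centroids, [nearest(p, centroids) for p in points]

def rand_index(a, b):
    """Fraction of point pairs that two labelings treat the same way."""
    pairs = list(combinations(range(len(a)), 2))
    agree = sum((a[i] == a[j]) == (b[i] == b[j]) for i, j in pairs)
    return agree / len(pairs)

# Synthetic stand-in data: three 2-D Gaussian blobs (NOT COPDGene data).
rng = random.Random(0)
centres = [(0.0, 0.0), (6.0, 0.0), (3.0, 6.0)]
data = [(rng.gauss(cx, 1.0), rng.gauss(cy, 1.0))
        for cx, cy in centres for _ in range(60)]
rng.shuffle(data)
train, valid = data[:120], data[120:]

# Assumed stability measure: mean pairwise Rand index over random restarts.
stability = {}
for k in range(2, 6):
    runs = [kmeans(train, k, seed)[1] for seed in range(4)]
    scores = [rand_index(x, y) for x, y in combinations(runs, 2)]
    stability[k] = sum(scores) / len(scores)
    print(f"k={k}: mean Rand index across restarts = {stability[k]:.3f}")

# Carry one training solution to the held-out set by nearest-centroid assignment.
centroids, _ = kmeans(train, 3, seed=0)
valid_labels = [nearest(p, centroids) for p in valid]
print("validation cluster sizes:", [valid_labels.count(j) for j in range(3)])
```

In this sketch a value of k whose restarts agree closely (Rand index near 1) would be treated as a stable solution and carried forward to association testing. The association tests with clinical measures and genetic variants are not shown.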