Genome-Wide Association of Familial Late-Onset Alzheimer's Disease Replicates BIN1 and CLU and Nominates CUGBP2 in Interaction with APOE

Late-onset Alzheimer's disease (LOAD) is the most common form of dementia in the elderly. The National Institute on Aging Late-Onset Alzheimer's Disease (NIA-LOAD) Family Study and the National Cell Repository for Alzheimer's Disease (NCRAD) conducted a joint genome-wide association study (GWAS) of multiplex LOAD families (3,839 affected and unaffected individuals from 992 families plus additional unrelated neurologically evaluated normal subjects) using the Illumina 610Quad panel. This cohort represents the largest family-based GWAS of LOAD to date, with analyses limited here to the European-American subjects. SNPs near APOE gave highly significant results (e.g., rs2075650, p = 3.2×10−81), but no other genome-wide significant evidence for association was obtained in the full sample. Analyses that stratified on APOE genotypes identified SNPs on chromosome 10p14 in CUGBP2 with genome-wide significant evidence for association within APOE ε4 homozygotes (e.g., rs201119, p = 1.5×10−8). Association in this gene was replicated in an independent sample consisting of three cohorts. There was evidence of association for recently-reported LOAD risk loci, including BIN1 (rs7561528, p = 0.009 with, and p = 0.03 without, APOE adjustment) and CLU (rs11136000, p = 0.023 with, and p = 0.008 without, APOE adjustment), with weaker support for CR1. However, our results provide strong evidence that association with PICALM (rs3851179, p = 0.69 with, and p = 0.039 without, APOE adjustment) and EXOC3L2 is affected by correlation with APOE, and thus may represent spurious association. Our results indicate that genetic structure coupled with ascertainment bias resulting from the strong APOE association affect genome-wide results and interpretation of some recently reported associations.
We show that a locus such as APOE, with large effects and strong association with disease, can lead to samples that require appropriate adjustment for this locus to avoid both false positive and false negative evidence of association. We suggest that similar adjustments may also be needed for many other large multi-site studies.

Published in the journal: PLoS Genet 7(2): e1001308. doi:10.1371/journal.pgen.1001308
Category: Research Article




Alzheimer's disease (AD, MIM 104300) is by far the most common form of dementia in the elderly. Late onset Alzheimer's disease (LOAD), defined by the onset of symptoms after age 60 years, has annual incidence rates increasing from 1% at 65–70 years to 6–8% at 85 years and older [1]. By age 85 years and up, prevalence is 10–30% or more [2]. While the underlying causes of LOAD are still unknown, there is ample evidence for genetic factors affecting risk, including high estimated heritability of LOAD (58–79%) [3], and evidence from both twin [4], [5] and family studies [6]–[9].

A small number of genes have been identified in which variation contributes to Alzheimer's disease risk. Multiplex early-onset Alzheimer's disease (EOAD) pedigrees [10]–[12] facilitated the identification of mutations in three genes: the amyloid precursor protein (APP) [13], presenilin 1 (PSEN1) [14] and presenilin 2 (PSEN2) [15]. In contrast to the success in familial EOAD, only one gene, APOE, is an unequivocally established “susceptibility” gene for LOAD [16], with the ε4 allele associated with increased risk in a dose-dependent manner and the ε2 allele with decreased risk [17]. There is incomplete lifetime penetrance even in the highest-risk APOE genotypes [18], and the fraction of genetic variance for LOAD risk attributed to APOE is estimated as only 10–20% [19], [20]. This, coupled with results of oligogenic segregation analyses supporting the presence of at least 4–6 additional major genes [21], [22], suggests that additional risk loci remain to be discovered.

Multiple approaches have been used to identify additional loci contributing to LOAD. Several regions have been implicated as a result of multiple linkage-based genome scans [23]–[32]. With the exception of the APOE gene region, there is only modest overlap among the chromosomal regions identified by different analyses [33], and it has been difficult to identify causal variants. Multiple genome-wide association studies (GWAS) of unrelated subjects have now also been carried out [34]–[44]. With the exception of single nucleotide polymorphisms (SNPs) near APOE, all associated SNPs in these studies have had small estimated effect sizes, with odds ratios reported in the range of 1.1 to 1.5, and also with little overlap among studies. Such estimated odds ratios are likely to be highly inflated, and the true effects much smaller [45], complicating replication and identification of causal variants. However, among recent GWAS, a small number of loci have shown some evidence for replication across samples, including clusterin (CLU), phosphatidylinositol binding clathrin assembly protein (PICALM) and complement component (3b/4b) receptor 1 (CR1) [42]–[44].

The use of densely affected families with LOAD, which are expected to carry higher frequencies of risk alleles, is an excellent alternative method of identifying additional genes contributing to LOAD. For example, the APOE-ε4 frequency is higher in LOAD cases with a positive family history than in sporadic LOAD cases [46], [47]. Compared to the more typical use of unrelated subjects, often without a family history of LOAD, family-based designs may enrich for variants with higher penetrance and consequent increase in odds ratios, and thus increase the power for their detection [48]. Such families can be used in both linkage and association-based designs, with appropriate correction for inclusion of related individuals [38].

Here we present results from a GWAS in multiplex LOAD families. Unlike many other studies, unaffected relatives were also evaluated and are included to increase the amount of genetic information and to provide additional phenotypes that can be used in subsequent analyses. A supplementary control group consisting of unrelated individuals was also recruited and underwent the same phenotypic evaluation. Thus, this cohort represents the largest family-based GWAS of LOAD to date, allowing us to explore issues related to stratification as well as providing a powerful approach for detailed modeling of the effects of APOE in the search for other novel risk loci in LOAD. The genotypic and phenotypic data generated in this study are part of the NIA-LOAD/NCRAD family study and are available to the research community through dbGaP. Biological samples from these well-characterized individuals and families are also available through NCRAD. Our results implicate a new region on chromosome 10p in individuals with the APOE ε4/ε4 genotype, and provide support for some of the recently implicated loci. They also suggest that sample structure and ascertainment bias related to the strong APOE association with AD risk are important confounders. This affects the interpretation of some of the recently implicated loci as well as of other GWAS of LOAD.


Subjects and Sample


The patient sample contained individuals from families as well as unrelated individuals; however, all patients with LOAD had a family history of Alzheimer's disease. All patients were recruited after providing informed consent and with approval by the relevant institutional review boards, and the study was conducted according to the principles expressed in the Declaration of Helsinki. Regardless of the source (NIA-LOAD Family Study or NCRAD), patients and families were required to meet the same study criteria. In the families, probands were required to have a diagnosis of definite or probable LOAD [49] with onset >60 years of age and a sibling with definite, probable or possible LOAD with a similar age at onset. A third biologically-related family member was required, who could have been a first-, second-, or third-degree relative of the affected sibling pairs, and who was 60 years of age or older if unaffected, or 50 years of age if diagnosed with LOAD or mild cognitive impairment [50]. In these families, additional relatives over age 50 years were recruited regardless of cognitive status. Persons deemed unaffected (controls) were required to have had documented cognitive testing and clinical examination to verify this clinical designation. The largest component of the dataset consisted of 607 families (1,516 affected, 1,306 unaffected) from the NIA-LOAD Family Study and 138 families from the National Cell Repository for Alzheimer Disease (NCRAD; 337 affected, 166 unknown intermediate phenotypes; see Figure S1). We also included pairs of affected siblings whose third family member was either too young, not sampled at the time of the investigation, or had died before an evaluation and blood sample could be obtained. 
Another 471 unrelated patients from the NIA-LOAD Family Study and NCRAD were included in order to enhance sample size and because they had a well-documented family history of dementia, although no other participating family members had as yet been examined at the time of the investigation.

Unrelated controls were ascertained through three sources: the NIA-LOAD Family Study (n = 794), and NCRAD (n = 144), with the NCRAD controls including 141 subjects from the University of Kentucky. The controls recruited by NIA-LOAD and NCRAD did not have a family history of LOAD in a first degree relative, while those recruited by the University of Kentucky were not excluded if they had a family history of LOAD. All controls demonstrated or had a documented history of normal cognitive function for age, and were evaluated in person or had neuropathology that did not provide any evidence of LOAD.


A minimal dataset was available for each person consisting of demographics, diagnosis, age at onset for cases, method of diagnosis, Clinical Dementia Rating Scale [51], and the presence of other relevant health problems. Each recruitment site used standard research criteria for the diagnosis of LOAD [49]. Participants with advanced disease, or those living in a remote location, who could not complete a detailed in-person evaluation contributed a blood sample, and the site investigator conducted a detailed review of medical records to document the presence or absence of LOAD based on the same criteria. The age at onset for patients with LOAD was the age at which the family first observed cognitive complaints. For controls, we used their age at the time of their examination confirming the absence of dementia. For deceased family members who had undergone a postmortem brain evaluation, results of neuropathology were used to document the diagnosis. In general the data from NCRAD were more limited because families were geographically scattered, requiring medical record review, telephone cognitive assessment, and neuropathology data from brain tissue. In total, neuropathological documentation was available for 306 cases from 199 families, and for 25 controls.

For the purpose of analyses, a clinical case was defined as any individual meeting NINCDS-ADRDA criteria for probable or possible AD [49]. We used the NINCDS-ADRDA criteria for definite AD when clinical and pathological criteria were met or CERAD pathological criteria [52] for AD when based on postmortem information alone. Individuals with unspecified dementia, mild cognitive impairment, or uncorroborated family reports of dementia were not used in the analyses. Controls were defined as any individual with no evidence of LOAD, as described above.


Genome-scan genotyping for all samples was provided by CIDR as a single project using the Illumina Infinium II assay protocol with hybridization to Illumina Human610Quadv1_B BeadChips (Illumina, San Diego, CA, USA). This array contained 592,532 SNPs with a mean spacing of 5.8 kb, and the minimum genotype completion rate for any sample released by CIDR was 98.3%. Blind duplicate reproducibility was 99.99% based on 118 paired samples. We used only these directly-genotyped data for analysis, without adding further genotypes via imputation. Genotyping of APOE polymorphisms (based on SNPs rs7412 and rs429358) for all samples was performed at PreventionGenetics. Genotyping was carried out in array tape [53] using allele-specific PCR with universal molecular beacons [54], [55]. DNA sequencing of positive control DNA samples was completed to assure correct assignment of alleles.
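As an illustration of how the two APOE SNPs determine the ε2/ε3/ε4 genotype, the following minimal sketch maps unphased genotypes at rs429358 and rs7412 to an APOE genotype label. It is not the laboratory's calling procedure; the function name and string encoding are hypothetical, alleles are assumed to be reported on the forward strand, and the very rare haplotype carrying C at rs429358 together with T at rs7412 is assumed to be absent.

```python
def apoe_genotype(rs429358: str, rs7412: str) -> str:
    """Map unphased two-letter genotypes at the APOE SNPs to a genotype label.

    Forward-strand haplotype definitions:
      e2 = T at rs429358 and T at rs7412
      e3 = T at rs429358 and C at rs7412
      e4 = C at rs429358 and C at rs7412
    Assumption: the rare C/T haplotype does not occur in the sample.
    """
    g1, g2 = rs429358.upper(), rs7412.upper()
    if len(g1) != 2 or len(g2) != 2 or set(g1 + g2) - {"C", "T"}:
        raise ValueError("expected two-letter genotypes made of C and T")
    n_e4 = g1.count("C")  # each C at rs429358 marks one e4 allele
    n_e2 = g2.count("T")  # each T at rs7412 marks one e2 allele
    if n_e4 + n_e2 > 2:
        raise ValueError("genotypes inconsistent with e2/e3/e4 haplotypes")
    n_e3 = 2 - n_e4 - n_e2
    return "/".join(["e2"] * n_e2 + ["e3"] * n_e3 + ["e4"] * n_e4)
```

For example, a subject heterozygous at rs429358 and homozygous C at rs7412 would be labeled e3/e4 under these assumptions.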

Data Analysis

Overarching approach

Our initial aim was to carry out a GWAS in the complete European-American (EA) component of the sample, with and without adjustment for APOE genotype, while accommodating the presence of related subjects in the sample. To this end, we had three goals: (1) to try to confirm recent reports of associated SNPs in the complete EA component of the sample [42]–[44]; (2) to determine if reports of residual association with SNPs near APOE that implicated other genes in that region [56]–[58] were robust to full adjustment for APOE genotype; and (3) to identify new associations in the genome scan, either in the presence or absence of APOE adjustment. Towards these goals, we carried out two primary genome scans, based on absence or presence of adjustment for APOE genotype, with this adjustment taking into account the full APOE genotype, and not just presence or absence of the ε4 allele, as is commonly done. The main analysis was based on a comparison of allele frequencies between cases and controls with adjustment for the effects of related individuals.

As a consequence of evaluating diagnostic indicators of the validity of the assumptions behind the statistical analyses, we also investigated possible sources and effects of confounding. The observations that led to these additional analyses included deviation from the expected genome-wide null distribution of the test statistics, absence of overlap in SNPs identified with evidence of association across APOE-genotype strata, identification of ethnic subgroups within the larger EA sample, and recognition that the APOE allele frequencies differed among these subgroups. We note that deviation from the expected genome-wide null distribution was less extreme than has been reported in some other recent investigations [42], [43], but was still sufficiently large to warrant investigation. Identification of the sources of confounding involved a number of additional genome scan analyses of the full sample. No correction for additional testing was imposed on these analyses, since their primary purpose was investigation of possible confounding effects, and not identification of associated SNPs.

Finally, it is important to note that there were complications introduced by use of a sample containing related individuals, coupled with the need to make comparisons with results from only the unrelated individuals as part of our broader investigations into confounding and covariate adjustment. The major advantage of the full sample is increased power: with the methods used and assuming a type-I error of 10−7, an allele frequency difference with ∼80% power of detection in the complete family-based sample had a power of only ∼20% in the smaller sample of unrelated subjects. However, use of the complete sample in some analyses as well as the unrelated subjects in other analyses introduced constraints, since identical analysis approaches were either not always possible or were statistically inappropriate. When analyses were carried out on different datasets or components of a dataset, the analytical methods were chosen to be as comparable as possible. The goal was to be able to address the same question, even if the details behind the method of analysis differed.

Quality assessment of data and samples

Family structure both within and across pedigrees was checked and confirmed using Relative ver. 1.1 [59] and Prest ver. 3.0.2 [60]. A genotyping error rate of 0.03 was used, with a likelihood ratio >1000 used to flag and review pairs with potential discrepancies from their stated relationships. For the analyses presented here, family structure was subsequently used only to identify a sample of unrelated individuals for use in a series of sub-analyses.

After stringent filtering on marker quality control indicators and eliminating monomorphic markers, 565,336 polymorphic autosomal markers from the Illumina panel were used for analysis. Incomplete genotyping (>2%) led to elimination of 1.12% of markers. Additional metrics such as deviation from compatibility with Hardy-Weinberg equilibrium (HWE) in unrelated individuals (0.2% of SNPs) were only used to further evaluate SNPs with evidence for association (p<0.00001), because presence of such disequilibrium is expected in regions with true association and can be informative in the search for gene-disease associations [61]–[63]. Of course such disequilibrium can also identify problem SNPs, but in the absence of evidence for association, such SNPs have little effect on the overall conclusions. No markers of greatest interest were eliminated because of such deviation from HWE, although a small number of such SNPs were eliminated from the tabulation of those SNPs yielding p<5×10−4.

Population structure

Two data sets of unrelated individuals were constructed, based on self-reported ethnic information combined with principal component analysis, described below. The first was drawn from the complete sample, and was of mixed origin, with all unrelated individuals used together. The sample was selected by choosing a random genotyped individual from each family. The second consisted of a sub-sample of individuals defined as “European-American-clustering”, based on the first analysis (henceforth referred to as European-Americans). From the latter sample of European-Americans, a dataset containing only unrelated cases and controls (CCun) was generated such that a single case was selected from each family. For this purpose, clinically defined cases were prioritized based on how AD was defined. Definite AD as defined by NINCDS-ADRDA [49] or CERAD [52] was selected over probable or possible AD. In instances where more than one individual met the most stringent criteria, the case with the lower age of onset was selected.
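The rule for choosing a single case per family when building CCun can be sketched as below. The record layout (keys 'id', 'diagnosis', 'onset') is hypothetical. The text specifies the age-at-onset tie-break only among cases meeting the most stringent criteria; applying the same tie-break among probable or possible cases is an assumption of this sketch.

```python
def select_case(cases):
    """Pick one case from a family: definite AD first, then lower age at onset.

    cases: list of dicts with hypothetical keys
      'id'        : subject identifier
      'diagnosis' : 'definite', 'probable' or 'possible'
      'onset'     : age at onset in years
    """
    # Definite AD outranks probable or possible AD, which are treated alike.
    rank = {"definite": 0, "probable": 1, "possible": 1}
    return min(cases, key=lambda c: (rank[c["diagnosis"]], c["onset"]))


family = [
    {"id": "A", "diagnosis": "definite", "onset": 75},
    {"id": "B", "diagnosis": "probable", "onset": 65},
    {"id": "C", "diagnosis": "definite", "onset": 70},
]
chosen = select_case(family)  # subject C: definite AD with the lower onset age
```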

The complete sample of unrelated individuals was used for initial investigation of population structure in the sample using genotypes from the Illumina panel and smartpca from the Eigensoft package [64]. Initial cluster analysis was based on a principal component analysis (PCA) of the complete unrelated sample, with ethnic-specific clusters delineated based on self-declared ethnicity. This initial analysis was used to refine the cluster location of the European-American sample within the larger sample, and to classify subjects with undeclared ethnicity to define the final sample used for further analysis. For evaluation of cluster-membership of all subjects, the SNP weights for each eigenvector were then applied to all remaining family members. Self-described non-Hispanic European-Americans that clustered as part of the main European-American cluster were subsequently flagged and reanalyzed separately, first using unrelated European-Americans and then applying the SNP weights to all European-Americans. This separate PCA of the European-American sample was prompted by results in the full sample, in which concerns arose about possible effects of additional stratification. This final PCA led us to identify, investigate, and delineate three subgroups within the European-American sample.

We hypothesized that these subgroups might represent individuals of northwestern European, southeastern European, and Ashkenazi Jewish ancestry, based on other studies [65] coupled with the population makeup near the collection sites. To evaluate this hypothesis, we used the 88 of 159 European-specific ancestry informative markers (AIMs) tailored to NW European (NW), SE European (SE), and Ashkenazi Jewish (AJ) ancestry [65] that were available in our marker panel, restricting analysis to the markers for which the minor allele frequency (MAF) was <0.4 in all three sub-populations. These were the markers that provided unambiguous matching of our alleles with those reported previously [65], and avoided potential allele mismatch due to unclear specification of allele calling procedures. For these markers, we obtained the population that best explained each subject by maximizing the likelihood over all markers under the assumption of independence of the markers and using the published allele frequencies [65]. The distributions of the difference in reported vs. observed marker allele frequencies in each cluster vs. known population were also investigated. Finally, after assigning subjects to their subgroup defined by their PCA clustering, allele frequencies were computed genome-wide for pairs of subpopulations, to determine whether known strong gene-frequency clines that characterize north-south gradients in Europe were apparent, such as the regions surrounding lactase and HLA [66], [67].
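The likelihood-based assignment of a subject to the best-fitting reference population can be sketched as follows. The sketch multiplies per-marker genotype probabilities across markers (independence, as stated above) and additionally assumes Hardy-Weinberg proportions within each population, which is an assumption of this illustration rather than a detail given in the text. The function name, data layout, and allele frequencies in the example are hypothetical.

```python
import math


def assign_population(genotypes, freqs):
    """Return (best_population, log_likelihood) for one subject.

    genotypes: dict marker -> count of the reference allele (0, 1 or 2)
    freqs:     dict population -> dict marker -> reference-allele frequency
    Assumes independent markers and Hardy-Weinberg proportions per population.
    """
    best_pop, best_ll = None, -math.inf
    for pop, pop_freqs in freqs.items():
        ll = 0.0
        for marker, g in genotypes.items():
            # Clamp frequencies away from 0 and 1 so the logarithm is defined.
            p = min(max(pop_freqs[marker], 1e-6), 1.0 - 1e-6)
            ll += (math.log(math.comb(2, g))
                   + g * math.log(p)
                   + (2 - g) * math.log(1.0 - p))
        if ll > best_ll:
            best_pop, best_ll = pop, ll
    return best_pop, best_ll


# Made-up frequencies for two markers in three reference populations.
example_freqs = {
    "NW": {"m1": 0.9, "m2": 0.1},
    "SE": {"m1": 0.5, "m2": 0.5},
    "AJ": {"m1": 0.1, "m2": 0.9},
}
best, _ = assign_population({"m1": 2, "m2": 0}, example_freqs)
```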

Kinship estimation

Kinship coefficients estimated from the data were used for several purposes. These included detecting cryptic relatedness, quantifying and comparing relatedness, and correcting for relatedness in statistical tests carried out on samples that included related individuals. We estimated kinship coefficients for pairs of individuals in the European-American sample with Kinship ver. 2.0, based on methods described in detail elsewhere [68]. In brief, for each pair of individuals, we maximized the likelihood for each of the nine condensed identity coefficients [69], and collapsed these into an estimate of the kinship coefficient. Use of a maximum likelihood estimate (MLE), rather than a moment estimate, provides the most accurate such kinship estimates [70], with superior power and control of type I error in hypothesis testing, although at modest increased computational cost. We estimated the kinship coefficients for four groups: the entire group of European-American subjects, including related individuals (referred to as the CCall sample), as well as the three subgroups (NW, SE and AJ) determined by the analyses described above. Each of these four groups included both related and unrelated cases and controls. The estimates for the individual subgroups were used only in analyses that incorporated subgroup information; in all other situations, the estimates from the CCall sample were used.
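The step of collapsing the nine condensed identity coefficients into a kinship coefficient follows the standard relation φ = Δ1 + (Δ3 + Δ5 + Δ7)/2 + Δ8/4, using the conventional Jacquard ordering of the coefficients. A minimal sketch, with a hypothetical function name and no claim to mirror the Kinship program itself:

```python
def kinship_from_identity(delta):
    """Collapse condensed identity coefficients (Delta1..Delta9) to kinship.

    delta: sequence of nine non-negative values summing to 1, in the
           conventional Jacquard order.
    """
    if len(delta) != 9:
        raise ValueError("expected nine condensed identity coefficients")
    if abs(sum(delta) - 1.0) > 1e-8 or min(delta) < 0:
        raise ValueError("coefficients must be non-negative and sum to 1")
    d1, _d2, d3, _d4, d5, _d6, d7, d8, _d9 = delta
    return d1 + 0.5 * (d3 + d5 + d7) + 0.25 * d8


# Non-inbred full siblings: Delta7 = 1/4, Delta8 = 1/2, Delta9 = 1/4.
full_sibs = kinship_from_identity([0, 0, 0, 0, 0, 0, 0.25, 0.5, 0.25])
```

For non-inbred full siblings this gives 0.25, and for unrelated individuals (Δ9 = 1) it gives 0.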

For the purpose of estimating kinship coefficients, we used 11,471 SNPs chosen to be maximally informative and relatively-uniformly spaced on the reference sequence. Use of higher numbers of markers adds little additional precision, while adding unnecessary computational burden [68]. The panel of SNPs was chosen by restricting use to those with >99% data completion and a MAF>0.05 in the whole data set, with a minimum interval between markers of 100 KB, and an attempt to maximize the MAF. Ultimately, 87% of SNPs used for this purpose had MAF>0.45 and 95% had MAF>0.4.
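One simple way to build such a panel is a greedy pass that visits eligible SNPs from highest to lowest MAF and keeps a SNP only if it lies at least 100 kb from every SNP already kept on the same chromosome. The text does not state how the panel was actually constructed, so the following is only a sketch under that assumption; the tuple layout and function name are hypothetical.

```python
def select_panel(snps, min_gap=100_000, min_call=0.99, min_maf=0.05):
    """Greedy selection of well-spaced, high-MAF markers.

    snps: iterable of (name, chrom, pos, maf, call_rate) tuples.
    Returns the chosen tuples sorted by chromosome and position.
    """
    eligible = [s for s in snps if s[4] > min_call and s[3] > min_maf]
    chosen = []
    # Visit the most informative (highest-MAF) markers first.
    for s in sorted(eligible, key=lambda s: -s[3]):
        far_enough = all(c[1] != s[1] or abs(c[2] - s[2]) >= min_gap
                         for c in chosen)
        if far_enough:
            chosen.append(s)
    return sorted(chosen, key=lambda s: (s[1], s[2]))


example = [
    ("a", "1", 1_000, 0.30, 1.000),
    ("b", "1", 50_000, 0.49, 1.000),
    ("c", "1", 200_000, 0.45, 0.995),
    ("d", "1", 260_000, 0.48, 0.980),   # fails the completion filter
    ("e", "2", 1_000, 0.04, 1.000),     # fails the MAF filter
]
panel = [s[0] for s in select_panel(example)]
```

The pairwise distance check is quadratic in the number of chosen markers, which is acceptable for a sketch but would need a positional index for a genome-wide panel.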

Case-control analysis, full sample

All tests of allele frequency differences in the CCall sample were carried out with the program cor_chi, using procedures that correct for relationships in the sample [68]. We used pair-wise kinship coefficients estimated from the marker data to correct the variance in a chi-square test of association, in a modification to the approach suggested by Bourgain et al [71]. Cases were defined as all affected individuals, and controls as all unaffected individuals, as described above under Phenotyping. This approach allows for variation in the realized IBD-sharing within families, as well as incorporating the effects of additional, unspecified relationships. This approach thus has somewhat better properties in the presence of related subjects than a test based only on pedigree-based expected kinship coefficients [68] or other relationship information from the pedigree structure alone [38]. The correction for kinship was carried out either under the assumption that the full European-American sample represents a single population, or under the assumption that each of three identified sub-populations represents a separate population.
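The idea of correcting the variance of an allele-frequency comparison with pairwise kinship coefficients can be illustrated with a generic sketch. It is not the cor_chi implementation and not the exact estimator of [68] or [71]: it uses the simple pooled mean as the allele frequency estimate and the standard result that, for per-individual allele fractions (allele count divided by 2), the covariance between individuals i and j is p(1−p)φij, with φii = 1/2 for a non-inbred individual. All names are hypothetical.

```python
def kinship_corrected_chi2(dosage, is_case, kin):
    """1-df test of a case-control allele frequency difference with kinship.

    dosage : allele counts (0, 1 or 2), one per individual
    is_case: booleans, True for cases and False for controls
    kin    : n x n kinship matrix as nested lists; diagonal entries are
             self-kinship (0.5 for non-inbred individuals)
    """
    n = len(dosage)
    n_case = sum(1 for c in is_case if c)
    n_ctrl = n - n_case
    if n_case == 0 or n_ctrl == 0:
        raise ValueError("need at least one case and one control")
    y = [d / 2.0 for d in dosage]                 # per-individual allele fraction
    w = [1.0 / n_case if c else -1.0 / n_ctrl for c in is_case]
    diff = sum(wi * yi for wi, yi in zip(w, y))   # case minus control frequency
    p = sum(y) / n                                # pooled frequency (simple mean)
    quad = sum(w[i] * kin[i][j] * w[j] for i in range(n) for j in range(n))
    var = p * (1.0 - p) * quad                    # Var(diff) = p(1-p) w' K w
    if var <= 0:
        raise ValueError("marker is monomorphic or kinship matrix is invalid")
    return diff * diff / var
```

With unrelated individuals (kinship matrix equal to 0.5 on the diagonal and 0 elsewhere) the variance reduces to p(1−p)[1/(2n_case) + 1/(2n_control)], the usual allelic test; positive kinship within the case or control group enlarges the variance and so reduces the statistic.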

For our primary genome scans, we carried out an analysis of the full sample without adjusting for APOE since this is a common approach [34], [42], as well as an analysis that controlled for APOE genotype effects. The method of analysis that corrected for the existence of relatives did not permit use of a covariate-adjusted model, which is the most common approach for controlling for specific genotype effects. Instead, we carried out a stratified analysis (e.g., structured association), which is also a well-established approach [72]. It is less often used than the former approach because of the need for large sample sizes, but the large size of the full sample available to us made this possible. The consequent number of strata that could be used also allowed a more subtle accommodation for known differences in LOAD risk among APOE genotypes [73] than simply adjusting for presence or absence of the ε4 allele. Attention to appropriate use of covariates is more critical for detection of small than large contributions to risk and to avoid spurious conclusions about association, so that this fuller adjustment for APOE genotype was expected to provide the better model in this context. The strata were based on four APOE genotype groups consisting of ε4/ε4, ε4/ε3, ε3/ε3, and the combined sample of ε3/ε2 and ε2/ε2 (full adjustment). The ε3/ε2 and ε2/ε2 genotypes were combined because they are each relatively infrequent low-risk genotypes for AD [17]. The ε4/ε2 genotype was omitted because of small sample size in analysis alone, and because it was unclear how to combine it with other genotype(s) to compensate for the small sample size. Analyses were performed within each stratum, with results also combined across strata in a Cochran-Mantel-Haenszel 1 df test. Results were combined across strata by weighting allele frequency differences and their variances proportional to sample sizes, as well as by assuming equal weights across strata. 
For APOE stratified analyses, for which there were relatives in different strata, this did not fully adjust for the effect of relationships across strata, resulting in modest genome-wide inflation, λ, of the observed test statistic relative to that expected, as quantified by the observed vs. expected median [74]. We therefore corrected the cross-strata tests by dividing by λ as a genomic control correction factor [74].
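The two operations described above, combining per-stratum allele frequency differences with either sample-size or equal weights and then rescaling by λ, can be sketched as follows. The combined statistic here is a generic weighted-difference test and is not claimed to be identical to the Cochran-Mantel-Haenszel statistic used in the study; strata are treated as independent, which is exactly the simplification that the genomic control step is meant to compensate for. Function names are hypothetical.

```python
import statistics

# Median of the chi-square distribution with 1 degree of freedom.
CHI2_1DF_MEDIAN = 0.454936


def combine_strata(diffs, variances, sizes=None):
    """Combine per-stratum allele frequency differences into one 1-df statistic.

    diffs, variances: per-stratum difference and its variance
    sizes: per-stratum sample sizes for proportional weights;
           None gives equal weights across strata.
    """
    k = len(diffs)
    if sizes is None:
        weights = [1.0 / k] * k
    else:
        total = float(sum(sizes))
        weights = [s / total for s in sizes]
    d = sum(w * x for w, x in zip(weights, diffs))
    v = sum(w * w * x for w, x in zip(weights, variances))
    return d * d / v


def genomic_control(chi2_values):
    """Return (lambda, adjusted statistics) using the median-based estimate."""
    lam = statistics.median(chi2_values) / CHI2_1DF_MEDIAN
    return lam, [x / lam for x in chi2_values]
```

In practice λ is estimated from the full genome-wide set of test statistics, so that the small number of truly associated SNPs has little influence on the median.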

Case-control analysis, unrelated sample

The CCun dataset was also used in the primary analyses to provide a comparison with the analysis of the full sample including the pedigree data, including different effects of adjusting for covariates. Analyses were carried out both without and with adjustment for APOE genotype. The primary analyses were based on logistic regression, using case status as the dependent variable with an additive model for the test SNP (0, 1 or 2 copies of the minor allele). Logistic regression, as opposed to an allele-based test, was chosen because it is insensitive to small deviations from HWE, and can include APOE genotypes as covariates, rather than requiring a less-ideal stratified analysis. The CCun dataset was too small for a stratified analysis, since the small within-stratum sample sizes made this statistically inappropriate. Therefore, to parallel the analysis of CCall as closely as possible given the different data structure, we included the number of APOE ε4 and ε2 alleles as covariates in an additive model (full-adjustment). This approach efficiently captures most available haplotype information with the cost of only two degrees of freedom [75], [76], and sufficiently captured the effects of APOE, as measured by residual association in the APOE region. For a small number of diagnostic comparisons of the tests used, we also carried out analyses with an allele-based test analogous to that used for analysis of CCall, but without the kinship correction, since there was little evidence of cryptic relatedness. Finally, further analyses were undertaken to probe interaction of SNP effects within the APOE ε4/ε4 genotype, in order to parallel results obtained in the CCall sample within ε4 homozygotes. For these analyses, we carried out focused analyses of specific SNPs, using logistic regression with APOE and test SNP main effects plus an interaction effect, since the analyses based on CCall suggested a statistical interaction.
The main interest in these analyses was in the coefficient for interaction.
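To make the two covariate codings concrete, the sketch below builds the predictor rows that would be passed to any logistic regression routine. The additive row follows the description above (intercept, SNP dosage, number of ε4 alleles, number of ε2 alleles). The interaction row codes APOE as an indicator of the ε4/ε4 genotype, which is an assumption of this sketch, since the exact APOE coding in the interaction model is not stated. The function name and genotype string format are hypothetical.

```python
def design_rows(snp_dosage, apoe):
    """Build predictor rows for the full-adjustment and interaction models.

    snp_dosage: copies of the test SNP minor allele (0, 1 or 2)
    apoe: APOE genotype string such as 'e3/e4'
    Returns (additive_row, interaction_row).
    """
    alleles = apoe.split("/")
    if len(alleles) != 2 or any(a not in ("e2", "e3", "e4") for a in alleles):
        raise ValueError("expected an APOE genotype such as 'e3/e4'")
    n_e4 = alleles.count("e4")
    n_e2 = alleles.count("e2")
    # Full adjustment: intercept, SNP, number of e4 alleles, number of e2 alleles.
    additive = [1.0, float(snp_dosage), float(n_e4), float(n_e2)]
    # Interaction model (assumed coding): intercept, SNP, e4/e4 indicator,
    # and their product, whose coefficient is the quantity of interest.
    is_e44 = 1.0 if n_e4 == 2 else 0.0
    interaction = [1.0, float(snp_dosage), is_e44, snp_dosage * is_e44]
    return additive, interaction
```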

Adequacy and interpretation of analysis models

Adequacy of case-control tests was evaluated by computing λ [74], and by verifying the entire distribution of resulting p-values against expected quantiles. We used quantile difference plots, which better facilitate visual evaluation of the adequacy of the full distribution of results than do standard QQ plots, by plotting the difference in the negative logarithm of the observed and expected p-values against the negative logarithm of the expected p-values. Deviation from the null distribution over any part of the distribution is an indicator of possible violation of assumptions behind the test, and therefore affects interpretation of the results. A conservative Bonferroni correction was used initially to adjust for multiple testing, resulting in a threshold of p = 8.8×10−8 (−log10(p) = 7.05) as a significance threshold that retains a genome-wide significance level of 0.05, given the number of markers used for analysis. Plotting and evaluation of analysis results was carried out with R and GnuPlot. In addition, for the analyses based on the CCun sample, we carried out permutation tests to define empirical confidence bounds under the null distribution. For these analyses we permuted the disease status, but kept the genome scan genotyping intact in order to maintain the LD structure in the genotype data. An equivalent estimate of the confidence bounds of the test results could not be obtained for the CCall sample because of the difficulty of carrying out the permutations under the constraints of relationships in the sample. However, since the CCall sample is larger than the CCun sample, with greater power, the confidence bounds for the CCun sample can be taken as a conservative estimate for the CCall sample.
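The Bonferroni threshold and the coordinates of a quantile difference plot can be computed as in the following sketch. The choice of rank/(n+1) for the expected p-value quantiles is an assumption of this illustration, as the plotting positions used in the study are not stated; function names are hypothetical.

```python
import math


def bonferroni_threshold(alpha, n_tests):
    """Per-test significance threshold that keeps the family-wise level at alpha."""
    return alpha / n_tests


def quantile_difference(pvalues):
    """Coordinates for a quantile difference plot.

    Returns a list of (x, y) with
      x = -log10(expected p-value)
      y = -log10(observed p-value) - x
    so that points near y = 0 agree with the null distribution.
    """
    n = len(pvalues)
    points = []
    for rank, p in enumerate(sorted(pvalues), start=1):
        expected = rank / (n + 1)          # assumed plotting position
        x = -math.log10(expected)
        points.append((x, -math.log10(p) - x))
    return points


# 565,336 markers at a genome-wide level of 0.05.
threshold = bonferroni_threshold(0.05, 565_336)
```

With 565,336 markers this gives a threshold of about 8.8×10−8, matching the value quoted above.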

Analysis of confounding in the NIA-LOAD/NCRAD sample

To evaluate possible sources and effects of confounding and different approaches to correcting for confounding, we carried out additional analyses in both the CCall and CCun samples. Evaluation of the efficacy of each approach to correcting for confounding was based on examination of the genome-wide p-value distribution, as a metric of adequacy of the correction. For comparison to the full-adjustment for APOE, we carried out a stratified analysis (CCall) and a covariate analysis (CCun) based only on presence/absence of APOE ε4 (ε4-adjustment), since this is a common approach for APOE adjustment [42], [43], [57]. Results were combined across strata as described above. In addition to genome-wide distributions of p-values, particular attention was paid to the effects of this intermediate adjustment on evidence for association with SNPs in the APOE region because of interest in additional potential risk loci in this region [56]–[58]. In one set of analyses of CCall, we carried out a stratified analysis based on the three identified European-American sub-populations, using kinship coefficients estimated separately within each group. In one analysis of CCun we included covariate adjustment for the loadings on the first four European-American-specific principal components [64] as an evaluation of correction for stratification in the sample.

Direct evaluation of the APOE ε4 allele as a source of confounding was obtained through two analyses, after eliminating from the analysis all SNPs within 300 KB of APOE. We examined ε4 allele frequency as a function of PCA loading. We also carried out a genome-wide case-only analysis to compare SNP allele frequencies in ε4 carriers vs. non-carriers among cases from the CCall sample. This eliminated association attributable to case status so that residual association, manifested as either genome-wide deviation of the test results from that expected under the null distribution or for specific SNPs of interest, could therefore be attributed to APOE-associated confounding or similar sources of population structure.
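
The case-only comparison can be illustrated with a simple allele-count test. The sketch below applies a 1-df chi-square test to a 2×2 table of allele counts in ε4 carriers versus non-carriers. It assumes unrelated individuals and therefore omits the kinship correction needed for the CCall sample; the function name and the counts are hypothetical.

```python
import math

def allele_chi_square(carrier_counts, noncarrier_counts):
    """1-df chi-square test on a 2x2 table of allele counts.

    Each argument is (count of allele A, count of allele B) among cases
    in one APOE group. Relatedness between subjects is ignored here.
    """
    a, b = carrier_counts
    c, d = noncarrier_counts
    n = a + b + c + d
    chi2 = n * (a * d - b * c) ** 2 / ((a + b) * (c + d) * (a + c) * (b + d))
    # Upper-tail probability of a 1-df chi-square: P(X > x) = erfc(sqrt(x / 2)).
    p_value = math.erfc(math.sqrt(chi2 / 2.0))
    return chi2, p_value

# Hypothetical allele counts among cases, for illustration only.
chi2, p = allele_chi_square(carrier_counts=(670, 330), noncarrier_counts=(618, 382))
print("chi2 =", round(chi2, 3), "p =", round(p, 4))
```

Because only cases enter the table, a small p-value cannot reflect association with disease status; it points instead to correlation between the SNP and APOE ε4 in the ascertained sample.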

Bioinformatics Analysis

We performed bioinformatics analysis to identify genes that are located near the top SNP signals from our own GWAS, or genes that are biologically related to the genes at the SNP locations. For this purpose, we used SNAP and GRAIL to identify candidate genes. SNAP identifies genes, extending in both directions until r2<0.5, while GRAIL searches for genes that are in the SNP region and that are biologically related to each other based on the published literature.

Replication Sample and Analysis

As an independent replication of the CUGBP2 association identified in the main analysis among APOE ε4 homozygotes, we examined a combined sample consisting of a Caribbean Hispanic cohort, and subjects from the combined Washington University case-control dataset and the Alzheimer's Disease Neuroimaging Initiative (WU-ADNI) [77]. These datasets were chosen because all had been genotyped on one of the Illumina platforms and shared multiple SNPs. The Caribbean Hispanic sample comprised 549 cases and 544 controls from two studies, including the Washington Heights-Inwood Columbia Aging Project (WHICAP) study [78] and the Caribbean Hispanic family study of familial AD [79]. The WU-ADNI data set comprised 788 EA cases and 643 EA controls. For the replication analysis, we used a conservative sample of 231 cases and 187 controls from the Caribbean Hispanic sample and 386 cases and 386 controls from the WU-ADNI sample, restricting subjects to homozygotes for each of the APOE ε4 and ε3 alleles, respectively, to avoid the heterogeneity caused by pooling different APOE genotypes that was identified in our primary analysis. While the Caribbean Hispanic sample is ethnically different than the European-American NIA-LOAD sample, this is advantageous since it reduces the probability of inflated evidence for association due simply to the shared ancestry of repeated samples from the same population [80]. SNP genotypes were from the Illumina HumanHap 650Y panel (Caribbean Hispanics) and several different Illumina platforms (WU-ADNI). We evaluated all 24 SNPs in CUGBP2 that were genotyped in all of these samples as well as in the NIA LOAD sample. APOE genotyping was based on the same method as that for the NIA-LOAD cohort, or as described elsewhere [57].

We tested possible association of SNPs in CUGBP2 with LOAD risk in joint analyses across cohorts. For our primary analysis, we analyzed only rs201119 in the independent replication cohorts because this SNP gave the strongest genome-wide-significant evidence for association in the NIA-LOAD/NCRAD sample in the APOE ε4/ε4 stratum. For this analysis we did not apply a multiple testing correction because it was the single primary SNP tested for replication. In a second analysis we carried out analysis of all 24 SNPs that were available in both the replication and original cohorts, using a Bonferroni correction for multiple testing. In a final analysis, we combined the CCun component of the original NIA LOAD/NCRAD cohort with the replication sample, and carried out the same joint analysis using all of the cohorts. The analysis model used was motivated by the APOE genotype-specific association identified in the main sample, which suggested an interaction between APOE ε4/ε4 and rs201119: we used logistic regression with an additive model for cohort, number of SNP alleles, APOE genotype (ε3/ε3 vs. ε4/ε4), and an interaction between the SNP and APOE, testing for both a SNP main effect and an interaction with APOE genotype. The component of the analysis of interest here was the interaction coefficient, given the original results that suggested such an interaction.
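
To make the structure of this interaction model concrete, the following sketch fits a logistic regression with a SNP main effect, an APOE (ε4/ε4 vs. ε3/ε3) main effect, and their product, using a plain Newton-Raphson fit. It is a minimal illustration under stated assumptions: the cohort term is omitted for brevity, the function names are hypothetical, and the cell counts are invented rather than taken from the study.

```python
import math

def solve(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            f = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= f * m[col][c]
    x = [0.0] * n
    for i in range(n - 1, -1, -1):
        s = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - s) / m[i][i]
    return x

def fit_logistic(X, y, iters=25):
    """Maximum-likelihood logistic regression by Newton-Raphson."""
    k = len(X[0])
    beta = [0.0] * k
    for _ in range(iters):
        grad = [0.0] * k
        hess = [[0.0] * k for _ in range(k)]
        for xi, yi in zip(X, y):
            eta = sum(b * v for b, v in zip(beta, xi))
            p = 1.0 / (1.0 + math.exp(-eta))
            w = p * (1.0 - p)
            for a in range(k):
                grad[a] += xi[a] * (yi - p)
                for c in range(k):
                    hess[a][c] += w * xi[a] * xi[c]
        step = solve(hess, grad)
        beta = [b + s for b, s in zip(beta, step)]
    return beta

def design_row(snp_count, apoe_e4e4):
    """Intercept, SNP allele count, APOE indicator, and SNP x APOE product."""
    return [1.0, float(snp_count), float(apoe_e4e4), float(snp_count * apoe_e4e4)]

# Invented (cases, controls) per (SNP allele count, APOE e4/e4 indicator) cell.
cells = {(0, 0): (30, 70), (1, 0): (35, 65), (0, 1): (50, 50), (1, 1): (70, 30)}
X, y = [], []
for (snp, apoe), (n_case, n_ctrl) in cells.items():
    X += [design_row(snp, apoe)] * (n_case + n_ctrl)
    y += [1] * n_case + [0] * n_ctrl

beta = fit_logistic(X, y)
print("interaction coefficient:", round(beta[3], 4))
print("interaction odds ratio:", round(math.exp(beta[3]), 3))
```

The fourth coefficient is the quantity of interest: its exponential is the ratio of the per-allele odds ratio in ε4/ε4 subjects to that in ε3/ε3 subjects, so a value above 1 indicates a stronger SNP effect among ε4 homozygotes.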


Sample Used for Analysis

The final genotyped NIA-LOAD/NCRAD cohort consisted of 5,220 subjects. The complete sample was ethnically diverse, with 4,232 who were self-declared European-Americans, and the remainder comprising 180 self-declared African American subjects, 309 Hispanic subjects, 300 subjects with other backgrounds, and 199 subjects with no self-reported race and ethnic information. Some individuals clustered with a group other than their self-reported group, leaving 3,839 individuals (Table 1) that clustered as European-Americans based on a principal components analysis of all unrelated subjects (Figure 1) and were used in the CCall sample. Of the 3,839 European-American subjects, 993 cases and 884 controls were used in the CCun sample. As expected in a geographically distributed sample from North America, the fraction of subjects from any one self-reported ethnic group varied across collection site.

Fig. 1. Principal components analysis of the complete sample, based on all ethnicities.
Red: European-American subjects. PC1 and PC2: first and second principal component.

Tab. 1. European-American–clustering individuals used for analysis.
Definite, probable, and possible AD diagnoses were 28%, 64%, and 8% of the total sample, respectively, and 34.6%, 63.9%, and 1.5% among affected individuals used in the CCun sample.

Population Structure

The European-American-specific principal components (PCs) revealed substructure within the sample. Although apparent with the first two principal components (PC1 and PC2), three subgroups were most clearly defined by the first and fourth principal components (Figure 2). Estimated fractions of each subpopulation varied across sites, with the NW group the largest (90.2% of the sample) (Table 2). A few subjects fell between the main clusters, and were excluded in subsequent subgroup analyses (Figure 2). Subgroup assignments were strongly supported by likelihood computations based on European subgroup-specific AIMs, and by comparison of allele frequencies in the three groups with those of the AIMs. Large between-group allele frequency differences between the NW and other groups near lactase on chromosome 2 and HLA on chromosome 6 [81] further supported these subgroup assignments (e.g., allele frequency differences >0.55 for SNPs near lactase), as did overall comparison of allele frequency differences between pairs of populations. Although the median allele frequency difference was relatively low (<0.04) for all three pairs of populations (Figure 3A), 7%, 9% and 12% of the markers had a substantial allele frequency difference of >0.1 in the NW-SE, AJ-SE, and NW-AJ comparison, respectively. These larger allele frequency differences coupled with varying fractions of cases from the different contributing sites (Table 2) predispose to confounding.

Fig. 2. First four principal components (PCs) in the European-American sample alone.
Colors represent inferred ancestry. Black: northwest (NW) Europe; green: southeast (SE) Europe; cyan: Ashkenazi Jewish (AJ); magenta: indeterminate (omitted from subpopulation analyses).

Fig. 3. Cumulative distribution of absolute value of allele frequency differences between subpopulations and APOE genotypes.
Panel A: subjects from NW and SE (dotted line), AJ and SE (dashed line), and NW and AJ (solid line) groups. Panel B: cumulative distribution of European-American PC4 values as a function of APOE genotype for ε4 homozygotes (dotted line), ε4 heterozygotes (dashed line); genotypes with no ε4 (solid line). In panel A, the horizontal axis is truncated at 0.25 despite a few rare allele frequency differences that extend to 0.59; in panel B the vertical axis is only presented for the upper quartile of the distributions, where the curves are differentiated.

Tab. 2. Estimated subpopulation membership for the 17 largest individual contributing sites.
Sites that each contributed >50 European-American subjects.

APOE allele frequencies also differed among the three sub-groups, along with a higher fraction of cases relative to the subgroup sample size drawn from the AJ and SE sub-groups than the NW sub-group (Table 3). The allele frequencies in the unrelated controls varied in a manner that is consistent with a known north-south ε4 allele frequency gradient, with higher ε4 allele frequencies in northern than southern European populations [82]–[84], and with lower ε4 frequencies reported in Jewish populations [85], [86]. In these unrelated controls, the ε4 allele frequency was higher in subjects of NW ancestry (0.139) than in subjects of SE (0.109) or AJ (0.092) ancestry, with the same allele frequency patterns also apparent in the unrelated (control) family members, and in the affected individuals (cases). The cumulative distribution of European-American PC4 values in the whole European-American sample differed among APOE genotypes in a manner that was also consistent with existence of sub-structure (Figure 3B), with similar results observed in the NW group alone (not shown).

Tab. 3. APOE allele frequencies in European-American subjects.
Number of alleles.

Case-Control Analyses

Genome scans, no APOE adjustment

As expected, SNPs near APOE provided the strongest genome-wide evidence for association in the unadjusted analyses (Figure 4A, Table 4). In the primary analyses, SNP rs2075650 in TOMM40, which is in strong linkage disequilibrium with rs429358 in our sample (D′ = 0.70; r2 = 0.45; using the 884 unrelated controls), and which tags the APOE ε4 allele, gave highly significant results in analysis of both the CCall and CCun samples (p = 3.2×10−81 and p = 6.3×10−77, respectively). The secondary analyses gave similar results with rs2075650 in the ethnically stratified analysis for the unweighted combined results, the NW sample, and the AJ sample (p = 1.2×10−15, p = 3.2×10−73, and p = 3.7×10−8, respectively). In each of these analyses, six additional SNPs near APOE also provided very strong support for this association (e.g., in the CCall sample, p-values ranged from p = 4.9×10−10 to p = 2.9×10−24). Only in the small SE sample was the evidence for association with rs2075650 merely suggestive (p = 0.03), consistent with the reduced inflation, in this sample, of the ε4 frequency in cases relative to that observed in the other subgroups (Table 3). For this SE sample, rs7007878 on chromosome 8 at ∼29 MB provided the strongest evidence of association (p = 6.5×10−6).

Fig. 4. Genome scan of European-American subjects.
Panel A: CCall sample analyzed as a single population; panel B: stratified analysis of CCall sample that accounts for three subpopulations (NW, SE, AJ); panel C: stratified analysis of CCall sample across four APOE genotypes; panel D: CCun sample, with covariate adjustment for the number of ε2 and ε4 alleles. Plots have been truncated at −log10p = 10 on the vertical axis to more easily visualize results for most of the genome. Multiple SNPs near APOE on chromosome 19 yielded −log10p≫10 in the analyses that did not control for APOE (Panels A and B, see text for details), and are represented by a single triangle at the top of each such panel. Horizontal line shows genome-wide significance level.

Tab. 4. SNPs with strongest evidence for association under each analysis condition.

Other than SNPs in the APOE region, the region with the strongest evidence for association in the primary analysis spanned 109.2–109.8 MB on chromosome 8, in which several SNPs (e.g., rs1975804, rs1679666, rs1789964) came close to achieving genome-wide significance in either the CCun or CCall sample (Tables S1, S2). Both samples gave similar results: p-values 1.6×10−6 to 3.3×10−7 in the CCall sample, and 9×10−6 to 5.7×10−7 in the CCun sample. A few regions of the genome yielded marginally stronger evidence than the unadjusted analysis for association in the analysis that stratified on ethnic subgroup (Figure 4B), but no regions other than the APOE region reached genome-wide significance. A portion of this sample has been used previously to investigate 29 SNPs as part of focused followup analyses [38], [40]. However, only two of these previously-investigated SNPs overlap with our current study, and neither of these SNPs gave significant results in either the earlier [38] or current analyses.

Genome scans, APOE full adjustment

In contrast to unadjusted analysis, the analyses based on a full-adjustment for APOE genotype identified no SNPs with genome-wide significance in either the full CCun or CCall samples (Figure 4C, 4D). Complete adjustment for APOE genotype accounted for most association in the APOE region. Evidence for association with rs2075650, which had the strongest evidence for association in the unadjusted analysis, fell precipitously after adjustment: only modest evidence for association remained for the CCun sample (p = 6.6×10−4, Table S3) and evidence for association was eliminated in the CCall sample (p = 0.15).

Analysis of individual APOE genotype strata led to identification of one novel region with genome-wide-significant evidence of association on chromosome 10p14 (Figure 5A, Table 4). SNP rs201119 provided strong genome-wide-significant evidence of association within the APOE ε4/ε4 stratum (p = 1.5×10−8). Surrounding SNPs also gave strong results within this stratum (Figure 5A, Table S4), including rs201099, which also reached genome-wide significance (p = 8.3×10−8). Even stronger genome-wide-significant results were obtained when analysis was confined to the APOE ε4/ε4 individuals in the NW subgroup (p = 6.6×10−9 and 2.2×10−8 for rs201119 and rs201099, respectively), eliminating the possibility that this association was explained by the existence of the AJ and SE subjects in the sample. SNPs on chromosomes 8 and 6 gave suggestive evidence for association in the APOE ε3/ε4 (Figure 5B, p = 1.0×10−6, Table S5) and APOE ε3/ε3 (Figure 5C, p = 1.2×10−6, Table S6) strata. Finally, a region on chromosome 1 was identified with strong evidence for association in the APOE ε3/ε2+ε2/ε2 stratum (Figure 5D, p = 8.4×10−7).

Fig. 5. Stratified analysis of APOE-defined subgroups of all European-American subjects.
Panel A: ε4/ε4 genotype; B: ε3/ε4 genotype; C: ε3/ε3 genotype; and D: ε2/ε2+ε2/ε3 combined genotype. Horizontal line shows genome-wide significance level.

Bioinformatics results

Table 4 lists genes located on or near the SNPs with the strongest p-value for each analysis. Several genes emerged (Table 4), in addition to APOE and related genes (e.g., APOC1). The most promising candidate is CUGBP2 (CUG triplet repeat, RNA binding protein 2; 11,087,265–11,509,495 bp) on chromosome 10p14, which was associated with the top SNP, rs201119, identified from the APOE ε4/ε4 restricted analysis. This SNP is in the middle of this gene, as is SNP rs201099, also with genome-wide significant evidence for association.

Replication analysis of CUGBP2

The primary analysis of CUGBP2 in the replication cohorts supported rs201119 as associated with LOAD in the presence of APOE ε4/ε4 (Table 5). Targeted analysis gave a significant interaction effect with APOE (p = 0.048, OR = 1.43, 95% CI 1.0–2.03 for the same allele as showed a higher frequency in cases than controls in the CCall ε4/ε4 sample). The NIA-LOAD/NCRAD CCun sample similarly gave significant evidence for an interaction with the same model (p = 0.00016), consistent with the results from the stratified analysis of the larger CCall sample. Among the 24 SNPs evaluated in the joint analyses of the replication cohorts, 6 SNPs provided nominal p-values below 0.05 for interaction with APOE, with p = 5.6×10−4 for rs62209 (OR = 1.75, 95% CI 1.27–2.41), which is significant at the 1% level after Bonferroni correction. SNP rs201099, which also provided genome-wide significant evidence for association in the NIA-LOAD scan, provided nominal evidence of an interaction in the replication sample (p = 9.1×10−3). In the joint analysis of the NIA LOAD/NCRAD, Caribbean Hispanic, Washington University, and ADNI samples, 12 SNPs were significant for interaction with APOE at a nominal 5% level, and 6 SNPs remained significant after Bonferroni correction (p<2.08×10−3, Table 5). The cohorts used for the replication analysis had similar patterns of linkage disequilibrium between pairs of SNPs in CUGBP2 (Figure S2), further supporting the evidence for replication. These results in joint analysis of the replication samples provide further support for a statistical interaction of SNPs in CUGBP2 with APOE ε4/ε4 in conferring AD risk.
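
The Bonferroni threshold quoted for the 24 CUGBP2 SNPs can be checked with a short calculation. The sketch below assumes a family-wise level of 0.05 divided equally among 24 tests, and compares two of the replication interaction p-values reported above against that threshold. The function and variable names are hypothetical.

```python
def bonferroni_threshold(alpha, n_tests):
    """Per-test significance threshold under a Bonferroni correction."""
    return alpha / n_tests

# 24 SNPs in CUGBP2 were shared by the replication and original cohorts.
threshold_24 = bonferroni_threshold(0.05, 24)
print("per-SNP threshold:", round(threshold_24, 5))

# Interaction p-values reported for the replication cohorts.
interaction_p = {"rs62209": 5.6e-4, "rs201099": 9.1e-3}
passes = {snp: p < threshold_24 for snp, p in interaction_p.items()}
print(passes)
```

With these inputs the per-SNP threshold is about 2.08×10−3, so rs62209 passes the corrected 5% threshold while rs201099 remains nominal only.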

Tab. 5. Significance of SNPs in CUGBP2 in prediction of disease risk.
Allele depicted is the allele with higher frequency in the APOE ε4/ε4 cases than ε4/ε4 controls, as denoted on the Illumina TOP strand.

Replication of prior GWAS results

SNPs in four genes that were recently implicated [42]–[44] each provided initial nominal evidence of association in our dataset in analysis of the unadjusted data (Table 6), with the same allele associated with disease risk in our sample as was previously reported. An additional SNP, rs597668, near EXOC3L2 [44], was also considered, but because of its proximity to APOE, was examined in less detail. Two of the three SNPs highlighted initially with strongest evidence for association - rs3818361 in CR1 and rs11136000 in CLU - each reached nominally significant (p<0.05) evidence of association in the unstratified or unadjusted analyses of the CCun and CCall samples. SNP rs3851179 in PICALM achieved evidence for association at this nominal level only in analysis of CCall. The gene encoding bridging integrator 1 (BIN1) was also noted initially at a lower threshold of significance [42] with stronger results in a recent replication analysis [44], and gave nominally significant results in the unstratified and unadjusted analyses of our sample for rs7561528 in both CCun and CCall, and for rs744373 in CCun (Table 6). In addition, five of the six additional SNPs that were in these four genes in our genotyping panel achieved nominal significance in the unstratified analysis of the CCall sample, and two SNPs achieved nominally significant results in unadjusted analysis of the CCun sample (Table 6).

Tab. 6. P-values for candidate SNPs based on genes previously reported with genome-wide significant results.
Bold: nominal significance p<0.05.


The distribution of p-values obtained from the unadjusted genome scans deviated from a uniform distribution, suggesting the presence of uncorrected confounding. This effect was mild near the median test value (λ = 0.97 for CCun, λ = 1.02 for CCall) but more apparent in the tails of the distribution, providing evidence for potential confounding in analysis of both samples (Figure 6A, 6B; magenta points; Figure S2). Some of the deviation from the null distribution is likely to be attributable to the greater sensitivity to HWD for the allele-based tests than for logistic regression in CCun (Figures S2, S3, CCun results). However, deviation from the null distribution in the direction of an increased type I error over the nominal level was especially marked in the upper 0.1% of the tail of the distribution for the unadjusted analysis of the CCun sample even under analysis with logistic regression (Figure S4), and in the upper 1% for the CCall sample (Figure 6A, 6B, magenta points). The excess fraction of small p-values was not explained by SNPs in the APOE region (Figure 6C), some of which had, as expected, much more extreme p-values. This deviation from the null distribution was not explained by inadequate correction for relationships in the CCall sample since the same excess pattern of extreme p-values occurred in the analysis of both the CCun and CCall samples, and over a wider range of p-values when the CCun sample was analyzed with a chi-square test instead of with logistic regression (Figure S2). Control for test statistic inflation was also not achieved by incorporation of the first four principal components as covariates [64] (Figure 6A, grey points; Figure S3), or by restricting analysis to the more uniform NW group (Figure 6B, grey points).

Fig. 6. Quality control evaluation of association tests in the CCun and CCall samples.
Panels A, B: Quantile difference plots for association tests excluding SNPs in the APOE region; and panel C: −log10(p) for the same analyses for the 95 SNPs in the APOE region. For panels A and B, results are shown, for N tests, as the difference of the ith of N ordered observed (−log10(pi)) and expected (−log10(i/N)) quantiles plotted against the expected quantiles. A: results for the CCun sample, with grey: PCA adjusted; magenta: unadjusted analysis; cyan: ε4 adjustment; black: full adjustment. B: results for the sample containing related individuals; grey: unadjusted analysis of NW subgroup; magenta: unadjusted analysis of CCall; cyan: ε4-stratified analysis of CCall; black: full adjustment. C: UN depicts results for analysis of CCun; REL depicts results for analysis of the larger sample, in both cases for the same four conditions and colors as in panels A and B.

Two sources of evidence suggested that an important source of potential confounding was APOE genotype. The first was the effect of adjustment for APOE genotype, which had a notable effect on the distribution of resulting genome-wide p-values. Simple adjustment of APOE through binary ε4-status yielded a distribution of p-values that was closer to a uniform distribution than was obtained from unadjusted analysis. However, deviation from the expected null distribution was still evident (Figure 6A, 6B, cyan points), and there was still evidence for association with SNP rs2075650 near APOE (Figure 6C) in both the unrelated and related samples (p = 1.5×10−9 for CCun, and p = 1.2×10−7 for CCall). The full APOE adjustment achieved the best control of the null distribution of p-values (Figure 6A, 6B, black points), and produced close to the expected uniform distribution of p-values under the null distribution (Figure S3). Adding the PCs as covariates alone did not produce the desired distribution of p-values (Figure S4, Table S7), and adding them to the full APOE adjustment in the CCun sample did not improve the distribution of p-values beyond the APOE adjustment alone (Table S8 versus Table S3). The full APOE adjustment also eliminated all statistically significant association with SNPs in the APOE region (Figure 6C), and evidence for adequate genomic control within each APOE stratum was reasonable (λ = 0.997, 1.02, 1.009, 1.003 for the ε4/ε4, ε4/ε3, ε3/ε3 and ε3/ε2+ε2/ε2 strata, respectively). A second source of evidence for confounding or population stratification was obtained from the results from the case-only analysis: the genome-wide distribution of p-values from the allele frequency comparison in ε4 carriers vs. non-carriers in the case-only sample also showed an overall deviation from the expected null distribution in the direction of an excess of small p-values (Figure 7). This indicates that there are many markers that are correlated with APOE in the highly-ascertained case sample.

Fig. 7. Quantile difference plot of tests of allele frequency differences in APOE ε4-carrier versus non-carrier cases.
SNPs in the APOE region are not included.

Effect of confounding on replication SNPs

Association between LOAD and a subset of the replication SNPs showed evidence of APOE-induced confounding. Evidence for association with SNPs in PICALM was highly sensitive to adjustment for APOE genotype (Table 6), suggesting the possibility of confounding. Results from the case-only analysis supported this interpretation: differences in allele frequencies among cases who were ε4 carriers vs. non-carriers were nominally significant (p<0.05) or close to significant for all of the four SNPs evaluated in PICALM: p = 0.0047, 0.0184, 0.0524, and 0.0222 for rs541458, rs543293, rs7941541, and rs3851179, respectively. Furthermore, for all four SNPs, the pattern of allele frequencies in cases and controls, and among the individual APOE genotypes in the cases, was consistent with such confounding: the allele that was associated with higher risk of case-status in the original case-control analysis always had the highest allele frequency among ε4/ε4 cases and the lowest allele frequency among ε3/ε3 cases, with the allele frequency intermediate in the ε3/ε4 cases. For example, for rs3851179, the major allele, C, had a frequency of 0.655 in cases and 0.634 in controls, and had allele frequencies of 0.618, 0.665, and 0.67 in ε3/ε3, ε3/ε4, and ε4/ε4 cases, respectively. Similar results were obtained for rs597668 near EXOC3L2, with evidence for association in the absence of APOE adjustment (p = 0.0007 in CCall) weakening considerably with full adjustment for APOE (p = 0.67), and with very strong evidence for allele frequency differences between ε4 carriers and non-carriers in the case-only analysis (p = 6.79×10−8). In contrast, there were no significant allele frequency differences identified in case-only analyses for the SNPs in Table 6 for CR1, CLU and BIN1 (p = 0.31–0.74).

The evidence for association with SNPs in some of these four genes was also dependent on ethnic stratification, suggesting a further or alternative source of confounding. For all tested SNPs in CR1, evidence for association in the CCall sample, while modest in each sub-population, was consistent across subpopulations and therefore strengthened in analysis that stratified on ethnic subgroup (Table 6). Evidence for association with rs7561528 in BIN1 also remained present in the ethnically-stratified analysis. In contrast, evidence for association with CLU and PICALM was only present in the unstratified analysis of the full sample, and in the NW group, with no support from the AJ and SE samples or from the ethnically-stratified analysis. Adjustment for ethnic subgroup did not fully correct for residual sources of correlation, as judged by the quantile difference plot for the NW sample (Figure 6B), suggesting that residual sources of correlation exist, even after correcting for relationship information. Similarly, evaluation of the quantile difference plot for the analysis of the CCun sample, using principal component loadings to correct for possible ethnic variability, also failed to produce the desired genome-wide quantile difference plot (Figure 6A).


Analysis of the NIA-LOAD/NCRAD sample indicates that unraveling susceptibility to LOAD is complex even when individuals from genetically-loaded multiplex families are included. As with other studies, support for the association between LOAD and SNPs near APOE was strong. By taking advantage of this association, we were able to identify a potential novel locus, CUGBP2, on chromosome 10p14 with genome-wide significant evidence of association within the highest-risk APOE ε4/ε4 stratum, with replication in an independent sample. We also found support for association with recently-reported SNPs in CLU and BIN1, and to a lesser extent with CR1. However, we found that the strong APOE association also introduced a source of structure into the sample that had effects that were detectable through standard evaluation of analysis results. Our results provide strong evidence that this correlation with APOE explains the association in this sample with some, but not all, previously-noted SNPs, including PICALM and the recently-proposed association near EXOC3L2, both of which have significantly different allele frequencies in AD cases who are carriers vs. non-carriers of the APOE ε4 allele.

Detection of true risk loci in a GWAS of LOAD requires careful attention to potential sampling biases [87]. Large samples such as ours are necessary for detecting modest associations, but such samples usually involve multiple collection sites, introducing the potential for confounding or other complications. Consistent with this, across our participating sites we found variability in the numbers of cases and controls, the fraction of underlying identifiable ethnic subgroups, differences among subgroups in terms of APOE genotype frequencies, and differences in APOE genotype distributions as a function of an indicator of genetic differentiation. None of this is surprising, given the history of US colonization and immigration coupled with differentiation among European populations [81], [88]. Other large samples in Europe and other locations are likely to have similar issues, as suggested by genome-wide inflation factors reported by recent studies [42], [43] that were higher than those in our study. Appropriate accommodation for confounding or structure when it is present can provide both protection against false positive associations, as well as increased power to detect associations that are confined to a subset of the sample, as we have demonstrated as part of our investigations surrounding the influence of APOE on our results. We also found that common methods failed to provide the necessary correction for APOE-induced associations, including use of principal components adjustment [64] and genomic control [74]. Together these observations have important implications for interpretation of results from other large combined samples.

Accommodation for APOE genotype was key for obtaining appropriate genomic control in our sample. Incorporation of individual APOE genotypes, as opposed to the more typical use of presence or absence of ε4, resulted in the closest approximation to a uniform distribution of p-values over a wide range of the test results. This likely reduced false positive association results, since such control must be achieved before accepting evidence of association. Not only were our genome-wide results affected by adjustment for APOE genotypes, but the support for some SNP associations from previous studies was similarly affected. For the SNPs that were most sensitive to APOE adjustment, the allele frequencies differed among cases as a function of APOE genotype, suggesting a relatively simple diagnostic for identifying SNPs that require adjustment for APOE as part of the analysis. For such SNPs, a full adjustment for APOE genotype may be critical for genomic control, in part because of APOE allele frequency differences among populations [82], [89]. These differences could lead to structure in the ascertained sample through variability in disease risk or survival in underlying subpopulations, as seen across the subpopulations identified in this sample. Such structure may thus represent a corollary to confounding through ascertainment of cases, possibly related to the effects discussed by Voight and Pritchard [90]. Alternatively, it may represent statistical interaction resulting from population stratification, which can create mild linkage disequilibrium (LD) between many markers that are on different chromosomes, with the strongest such LD occurring between loci with the largest frequency differences across populations. Such genome-wide effects of population stratification have recently been demonstrated both in simulated data and in breast cancer, where there is association, detectable in cases, between SNPs in LCT and genome-wide SNPs, with a similar genome-wide shift in the distribution of p-values [91]. Similar adjustments for loci with strong effects may also be important in other diseases with strong risk loci.
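The way stratification induces LD between unlinked loci can be made concrete with a two-population mixture. In the sketch below, the loci are assumed to be independent within each population, so any disequilibrium in the pooled sample arises purely from mixing. Under that assumption D equals m(1 − m)(p1 − p2)(q1 − q2), which is largest when both loci differ strongly in frequency between populations. The function name and all frequencies are hypothetical.

```python
def mixture_ld(m, p1, p2, q1, q2):
    """Disequilibrium D between two unlinked loci in a two-population mixture.

    m      : fraction of the sample drawn from population 1
    p1, p2 : frequency of allele A at the first locus in populations 1 and 2
    q1, q2 : frequency of allele B at the second locus in populations 1 and 2
    Within each population the loci are assumed independent (D = 0).
    """
    # Haplotype and allele frequencies in the pooled sample
    freq_ab = m * p1 * q1 + (1 - m) * p2 * q2
    freq_a = m * p1 + (1 - m) * p2
    freq_b = m * q1 + (1 - m) * q2
    return freq_ab - freq_a * freq_b

# Hypothetical frequencies, for illustration only
d_large = mixture_ld(0.5, 0.10, 0.20, 0.30, 0.70)  # large difference at the second locus
d_small = mixture_ld(0.5, 0.10, 0.20, 0.30, 0.32)  # small difference at the second locus
closed_form = 0.5 * 0.5 * (0.10 - 0.20) * (0.30 - 0.70)
print(d_large, d_small, closed_form)
```

Because a locus such as APOE shows marked frequency differences among populations, it is a natural candidate for this kind of induced correlation with other frequency-differentiated SNPs.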

Stratification on APOE genotype facilitated the identification of a novel region with genome-wide significant evidence for association on chromosome 10p14, which replicated in a second sample consisting of three additional cohorts. This region was identified only in the APOE ε4/ε4 stratum or in a logistic analysis that contrasted ε4 and ε3 homozygotes in a model with an interaction term with APOE. The relative infrequency of ε4 homozygotes means that these results will need to be further investigated in other large data sets to determine their importance. Data sets that consist of high-risk families, such as our sample and the NIMH AD sample [92], may be preferable in such analyses, since such sample ascertainment may have contributed to the detection of this locus through the resulting presence of a relatively high fraction of APOE ε4 homozygotes. It is also worth noting that an earlier linkage analysis of a subset of the families used here, based on the Illumina 6K mapping panel, obtained lod scores for rs1537626 of 2.35 in the whole sample and 1.6 in an analysis that retained only APOE ε4-positive cases. This SNP is within 10 cM of rs201119 [93]. However, rs1537626 was not on the marker panel used here, nor was rs201119 on the earlier 6K marker panel, preventing further comparison of results. It is also possible that analysis within the high-risk APOE ε4/ε4 genotype improved detection of this region in the current study by increasing the within-genotype penetrance, possibly by affecting age-at-onset. If so, this would be similar to the strategy of identifying risk or age-at-onset modifier loci on a background of a single, early-onset AD mutation [94]–[96]. The implicated region on chromosome 10p14 contains the genes CUGBP2 and PITRM1. CUGBP2 has one isoform that is expressed predominantly in neurons, and experimental evidence suggests its involvement in apoptosis in the hippocampus [97]; both observations are consistent with a role in the pathogenesis of Alzheimer's disease. PITRM1 can degrade the amyloid β-protein derived from APP when it accumulates in mitochondria [98].
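The logic of a stratified analysis, in which an effect is visible only within one APOE genotype class, can be sketched with per-stratum allelic odds ratios. This is an illustrative simplification rather than the method used in this study: it uses Woolf's approximation for the confidence interval, treats alleles as independent, and ignores relatedness. The function name and all counts are hypothetical.

```python
import math

def allelic_odds_ratio(case_counts, control_counts):
    """Allelic odds ratio with a Woolf 95% confidence interval.

    Each argument is (risk_allele_count, other_allele_count).
    Returns (odds_ratio, ci_lower, ci_upper, z_score).
    """
    a, b = case_counts
    c, d = control_counts
    if min(a, b, c, d) <= 0:
        raise ValueError("all four cells must be positive")
    log_or = math.log((a * d) / (b * c))
    se = math.sqrt(1 / a + 1 / b + 1 / c + 1 / d)
    return (math.exp(log_or),
            math.exp(log_or - 1.96 * se),
            math.exp(log_or + 1.96 * se),
            log_or / se)

# Hypothetical allele counts by APOE stratum, for illustration only
strata = {
    "e3/e3": ((200, 800), (195, 805)),
    "e3/e4": ((210, 790), (200, 800)),
    "e4/e4": ((120, 280), (30, 170)),
}
results = {name: allelic_odds_ratio(cases, controls)
           for name, (cases, controls) in strata.items()}
for name, (odds, lower, upper, z) in results.items():
    print(f"{name}: OR = {odds:.2f} (95% CI {lower:.2f}-{upper:.2f}), z = {z:.2f}")
```

In these invented counts, only the ε4/ε4 stratum has a confidence interval that excludes 1, so an unstratified analysis dominated by the larger strata could miss the signal.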

Our results both support and refute recently proposed associations with SNPs in several genes [42]–[44]. Evidence for association with SNPs previously reported in each of BIN1, CLU, and CR1 was relatively robust to APOE adjustment within this European-American sample, with evidence for BIN1 and CR1 also obtained in an analysis that conditioned on ethnic background. Recent reports by others that include portions of the sample we used here also report evidence for association with PICALM [99], [100], but did not report the results of quality control analyses that would allow evaluation of the adequacy of correction for confounding. In our analyses, with correction for sources of confounding, evidence for association with SNPs in PICALM and EXOC3L2 was much less convincing than for the other three loci because of its exquisite sensitivity to APOE adjustment. One interpretation of this sensitivity, offered in an analysis that includes a subset of the current sample [99], is that the statistical interaction is indicative of biological interaction. However, the differences in SNP allele frequencies across APOE strata within cases that we showed here, coupled with information demonstrating the existence of population stratification, raise concerns that the original associations for these latter SNPs may represent confounding or other aspects of sample or population structure. This could include linkage disequilibrium with APOE, even for unlinked markers. Further investigation in genetically more diverse populations will still be necessary to clarify the role even of SNPs with positive evidence for association, because shared history can lead to spurious replication in samples drawn from the same population [80].

The results presented here and in other GWAS reports of LOAD underscore the view that such studies do not necessarily identify the specific genetic alterations contributing to disease risk. Rather, they are useful in identifying genes or gene pathways involved in disease pathogenesis or risk. In that sense, GWAS represents a method of screening the genome for genes that may also contain rare variants. While the large number of subjects in current GWAS provides a benefit in terms of perceived statistical power, it comes at a price. For example, despite the very low p-values representing genome-wide statistical significance, the effect sizes in most recent GWAS involving LOAD are small. It has also been suggested that different significance thresholds as a function of sample size are needed in order to balance power against the false-discovery rate [101], with very large studies requiring more stringent thresholds. This means that subtle differences in the genetic architecture of either the cases or the controls become more important with increasing sample sizes. In this situation, some of the “significant” differences in allele frequency may also represent differences in ancestral origins rather than disease phenotype-genotype associations, and would likely not lead to further biological insights. As we have shown here, genetic variability within European-American groups exists and can affect analyses of association. Moving forward, GWAS in LOAD should take greater care to control for population stratification or APOE genotypes prior to drawing firm conclusions about associations. In this sense, bigger studies of LOAD, or of other diseases with similarly influential risk loci, may not always be better if the increases in sample size result in added data structure or confounding.
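The point that a fixed p-value carries less evidential weight as sample size grows can be illustrated with an approximate Bayes factor of the general form associated with the Bayes factor approach cited above [101]. The sketch assumes the form ABF = sqrt((V + W)/V) · exp(−z²W / (2(V + W))), where V is the variance of the effect estimate and W is the variance of a zero-centred normal prior on the effect. The prior standard deviation, the assumed 1/n scaling of V, and the function name are illustrative choices, not values from this study.

```python
import math

def approximate_bayes_factor(z, variance, prior_variance):
    """Approximate Bayes factor for the null versus the alternative.

    z              : Wald statistic (effect estimate divided by its standard error)
    variance       : squared standard error V of the estimated log odds ratio
    prior_variance : variance W of a zero-centred normal prior on the log odds ratio
    Values above 1 favour the null; values below 1 favour association.
    """
    shrinkage = prior_variance / (variance + prior_variance)
    scale = math.sqrt((variance + prior_variance) / variance)
    return scale * math.exp(-0.5 * z * z * shrinkage)

# Same z-score (hence same p-value) at increasing sample sizes
z_fixed = 5.0
prior_w = 0.2 ** 2   # hypothetical prior standard deviation of 0.2
abfs = []
for n in (1_000, 10_000, 100_000):
    v = 4.0 / n      # hypothetical variance, assumed to shrink as 1/n
    abfs.append(approximate_bayes_factor(z_fixed, v, prior_w))
    print(f"n = {n:>7}: ABF (null vs. alternative) = {abfs[-1]:.3g}")
```

Under these assumptions the Bayes factor moves toward the null as n increases while z is held fixed, which is one way to motivate more stringent p-value thresholds for very large studies.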



1. Mayeux R (2003) Epidemiology of neurodegeneration. Annu Rev Neurosci 26: 81–104.
2. Fratiglioni L, De Ronchi D, et al. (1999) Worldwide prevalence and incidence of dementia. Drugs & Aging 15: 365–375.
3. Gatz M, et al. (2006) Role of genes and environments for explaining Alzheimer disease. Arch Gen Psychiatry 63: 168–174.
4. Bergem ALM, et al. (1997) Apolipoprotein E type epsilon 4 allele, heritability and age at onset in twins with Alzheimer's disease. Clin Genet 52: 408–413.
5. Meyer JM, et al. (1998) Multiple threshold model for the onset of Alzheimer's disease in the NAS-NRC twin panel. Am J Med Genet 81: 92–97.
6. Akesson HO (1969) A Population Study of Senile and Arteriosclerotic Psychoses. Hum Hered 19: 546.
7. Breitner JCS, et al. (1986) Familial Aggregation in Alzheimer Dementia. 1. A Model for the Age-Dependent Expression of an Autosomal Dominant Gene. J Psychiatr Res 20: 31–43.
8. Mohs RC, et al. (1987) Alzheimers-Disease - Morbid Risk among 1st-Degree Relatives Approximates 50-Percent by 90 Years of Age. Arch Gen Psychiatry 44: 405–408.
9. Lautenschlager NT, et al. (1996) Risk of dementia among relatives of Alzheimer's disease patients in the MIRAGE study: What is in store for the oldest old? Neurology 46: 641–650.
10. Van Broeckhoven C, Van den Berghe A, et al. (1987) Failure of Familial Alzheimers-Disease to Segregate with the A4-Amyloid Gene in Several European Families. Nature 329: 153–155.
11. Bird TD, et al. (1988) Familial Alzheimer's disease in American descendants of the Volga Germans: probable genetic founder effect. Ann Neurol 23: 25–31.
12. St. George-Hyslop PH, et al. (1989) Familial Alzheimers-Disease - Progress and Problems. Neurobiol Aging 10: 417–425.
13. Goate A, et al. (1991) Segregation of a missense mutation in the amyloid precursor protein gene with familial Alzheimer's disease. Nature 349: 704–706.
14. Sherrington R, et al. (1995) Cloning of a gene bearing missense mutations in early-onset familial Alzheimer's disease. Nature 375: 754–760.
15. Levy-Lahad E, et al. (1995) Candidate gene for the chromosome 1 familial Alzheimer's disease locus. Science 269: 973–977.
16. Corder EH, et al. (1993) Gene dose of apolipoprotein E type 4 allele and the risk of Alzheimer's disease in late onset families. Science 261: 921–923.
17. Corder EH, et al. (1994) Protective Effect of Apolipoprotein-E Type-2 Allele for Late-Onset Alzheimer-Disease. Nat Genet 7: 180–184.
18. Myers RH, et al. (1996) Apolipoprotein E epsilon 4 association with dementia in a population-based study: The Framingham study. Neurology 46: 673–677.
19. Bennett CL, et al. (1995) Evidence that the APOE locus influences rate of disease progression in late-onset familial Alzheimer's-disease but is not causative. Am J Med Genet B Neuropsychiatr Genet 50: 1–6.
20. Slooter AJC, et al. (1998) Risk estimates of dementia by apolipoprotein E genotypes from a population-based incidence study: The Rotterdam study. Arch Neurol 55: 964–968.
21. Daw EW, et al. (1999) Multipoint oligogenic analysis of age-at-onset data with applications to Alzheimer's disease pedigrees. Am J Hum Genet 64: 839–851.
22. Daw EW, et al. (2000) The number of trait loci in late-onset Alzheimer disease. Am J Hum Genet 66: 196–204.
23. Pericak-Vance MA, et al. (2000) Identification of Novel Genes in Late-Onset Alzheimer's Disease. Exp Gerontol 35: 1343–1352.
24. Blacker D, et al. (2003) Results of a high-resolution genome screen of 437 Alzheimer's Disease families. Hum Mol Genet 12: 23–32.
25. Farrer LA, et al. (2003) Identification of multiple loci for Alzheimer disease in a consanguineous Israeli-Arab community. Hum Mol Genet 12: 415–422.
26. Scott WK, et al. (2003) Ordered-subsets linkage analysis detects novel Alzheimer disease loci on chromosomes 2q34 and 15q22. Am J Hum Genet 73: 1041–1051.
27. Wijsman EM, et al. (2004) Evidence for a novel late-onset Alzheimer's disease locus on chromosome 19p13.2. Am J Hum Genet 75: 398–409.
28. Bertram L, et al. (2005) Family-based association between Alzheimer's disease and variants in UBQLN1. N Engl J Med 352: 884–894.
29. Rademakers R, et al. (2005) Linkage and association studies identify a novel locus for Alzheimer disease at 7q36 in a Dutch population-based sample. Am J Hum Genet 77: 643–652.
30. Hahs DW, et al. (2006) A genome-wide linkage analysis of dementia in the Amish. Am J Med Genet B Neuropsychiatr Genet 141B: 160–166.
31. Lee JH, et al. (2006) Expanded genomewide scan implicates a novel locus at 3q28 among Caribbean Hispanics with familial Alzheimer disease. Arch Neurol 63: 1591–1598.
32. Li Y, et al. (2006) DAPK1 variants are associated with Alzheimer's disease and allele-specific expression. Hum Mol Genet 15: 2560–2568.
33. Butler AW, et al. (2009) Meta-analysis of linkage studies for Alzheimer's disease-A web resource. Neurobiol Aging 30: 1037–1047.
34. Coon KD, et al. (2007) A high-density whole-genome association study reveals that APOE is the major susceptibility gene for sporadic late-onset Alzheimer's disease. J Clin Psychiatry 68: 613–618.
35. Grupe A, et al. (2007) Evidence for novel susceptibility genes for late-onset Alzheimer's disease from a genome-wide association study of putative functional variants. Hum Mol Genet 16: 865–873.
36. Reiman EM, et al. (2007) GAB2 alleles modify Alzheimer's risk in APOE epsilon4 carriers. Neuron 54: 713–720.
37. Abraham R, et al. (2008) A genome-wide association study for late-onset Alzheimer's disease using DNA pooling. BMC Medical Genomics 1: 44.
38. Bertram L, et al. (2008) Genome-wide Association Analysis Reveals Putative Alzheimer's Disease Susceptibility Loci in Addition to APOE. Am J Hum Genet 83: 623–632.
39. Beecham GW, et al. (2009) Genome-wide Association Study Implicates a Chromosome 12 Risk Locus for Late-Onset Alzheimer Disease. Am J Hum Genet 84: 35–43.
40. Carrasquillo MM, et al. (2009) Genetic variation in PCDH11X is associated with susceptibility to late-onset Alzheimer's disease. Nat Genet 41: 192–198.
41. Poduslo SE, et al. (2009) Genome screen of late-onset Alzheimer's extended pedigrees identifies TRPC4AP by haplotype analysis. Am J Med Genet B Neuropsychiatr Genet 150B: 50–55.
42. Harold D, et al. (2009) Genome-wide association study identifies variants at CLU and PICALM associated with Alzheimer's disease. Nat Genet 41: 1088–1093.
43. Lambert JC, et al. (2009) Genome-wide association study identifies variants at CLU and CR1 associated with Alzheimer's disease. Nat Genet 41: 1094–1099.
44. Seshadri S, et al. (2010) Genome-wide Analysis of Genetic Loci Associated With Alzheimer Disease. JAMA 303: 1832–1840.
45. Zhong H, et al. (2010) Correcting “Winner's Curse” in Odds Ratios from Genomewide Association Findings for Major Complex Human Diseases. Genet Epidemiol 34: 78–91.
46. Jarvik GP, et al. (1996) Influence of apolipoprotein E genotype on the transmission of Alzheimer disease in a community-based sample. Am J Hum Genet 58: 191–200.
47. Slooter AJC, et al. (1997) Genetic epidemiology of Alzheimer disease. Epidemiol Rev 19: 107–119.
48. McCarthy M, et al. (1998) Sib-pair collection strategies for complex diseases. Genet Epidemiol 15: 317–340.
49. McKhann G, et al. (1984) Clinical diagnosis of Alzheimer's disease: report of the NINCDS-ADRDA Work Group under the auspices of Department of Health and Human Services Task Force on Alzheimer's Disease. Neurology 34: 939–944.
50. Petersen RC, et al. (1999) Mild cognitive impairment - Clinical characterization and outcome. Arch Neurol 56: 303–308.
51. Hughes CP, et al. (1982) A new clinical scale for the staging of dementia. Br J Psychiatry 140: 566–572.
52. Mirra SS, et al. (1991) The Consortium to Establish a Registry for Alzheimer's Disease (CERAD). Part II. Standardization of the neuropathologic assessment of Alzheimer's disease. Neurology 41: 479–486.
53. Chudyk A, et al. (2006) Soluble HLA class I molecules exert differentiated influence on renal graft condition. Transplant Proc 38: 90–93.
54. Myakishev MV, et al. (2001) High-throughput SNP genotyping by allele-specific PCR with universal energy-transfer-labeled primers. Genome Res 11: 163–169.
55. Hawkins JR, et al. (2002) Miniaturized sealed-tube allele-specific PCR. Hum Mutat 19: 543–553.
56. Bekris LM, et al. (2008) Multiple SNPs within and surrounding the apolipoprotein E gene influence cerebrospinal fluid apolipoprotein E protein levels. J Alzheimers Dis 13: 255–266.
57. Potkin SG, et al. (2009) Hippocampal Atrophy as a Quantitative Trait in a Genome-Wide Association Study Identifying Novel Susceptibility Genes for Alzheimer's Disease. PLoS ONE 4: e6501. doi:10.1371/journal.pone.0006501.
58. Roses AD (2010) An Inherited Variable Poly-T Repeat Genotype in TOMM40 in Alzheimer Disease. Arch Neurol 67: 536–541.
59. Göring HHH, et al. (1997) Relationship estimation in affected sib pair analysis of late-onset diseases. Eur J Hum Genet 5: 69–77.
60. Sun L, et al. (2002) Enhanced pedigree error detection. Hum Hered 54: 99–110.
61. Nielsen DM, et al. (1998) Detecting marker-disease association by testing for Hardy-Weinberg disequilibrium at a marker locus. Am J Hum Genet 63: 1531–1540.
62. Wittke-Thompson JK, et al. (2005) Rational inferences about departures from Hardy-Weinberg equilibrium. Am J Hum Genet 76: 967–986.
63. Zou GY, et al. (2006) The merits of testing Hardy-Weinberg equilibrium in the analysis of unmatched case-control data: A cautionary note. Ann Hum Genet 70: 923–933.
64. Price AL, et al. (2006) Principal components analysis corrects for stratification in genome-wide association studies. Nat Genet 38: 904–909.
65. Price AL, et al. (2008) Discerning the ancestry of European Americans in genetic association studies. PLoS Genet 4: e236. doi:10.1371/journal.pgen.0030236.
66. Harvey CB, et al. (1998) Lactase haplotype frequencies in Caucasians: association with the lactase persistence/non-persistence polymorphism. Ann Hum Genet 62: 215–223.
67. Hollox EJ, et al. (2001) Lactase haplotype diversity in the Old World. Am J Hum Genet 68: 160–172.
68. Choi Y, et al. (2009) Case-control Association Testing in the Presence of Unknown Relationships. Genet Epidemiol 33: 668–678.
69. Jacquard A (1972) Genetic Information Given by a Relative. Biometrics 28: 1101–1114.
70. Milligan BG (2003) Maximum-likelihood estimation of relatedness. Genetics 163: 1153–1167.
71. Bourgain C, et al. (2003) Novel case-control test in a founder population identifies P-selectin as an atopy-susceptibility locus. Am J Hum Genet 73: 612–626.
72. Pritchard JK, et al. (2001) Case-control studies of association in structured or admixed populations. Theor Popul Biol 60: 227–237.
73. Farrer LA, et al. (1997) Effects of age, sex, and ethnicity on the association between apolipoprotein E genotype and Alzheimer disease. A meta-analysis. APOE and Alzheimer Disease Meta Analysis Consortium. JAMA 278: 1349–1356.
74. Devlin B, et al. (1999) Genomic control for association studies. Biometrics 55: 997–1004.
75. Chapman JM, et al. (2003) Detecting disease associations due to linkage disequilibrium using haplotype tags: A class of tests and the determinants of statistical power. Hum Hered 56: 18–31.
76. Yu CE, et al. (2007) Comprehensive analysis of APOE and selected proximate markers for late-onset Alzheimer's disease: Patterns of linkage disequilibrium and disease/marker association. Genomics 89: 655–665.
77. Cruchaga C, et al. (2010) SNPs Associated with Cerebrospinal Fluid Phospho-Tau Levels Influence Rate of Decline in Alzheimer's Disease. PLoS Genet 6: e1001101. doi:10.1371/journal.pgen.1001101.
78. Tang MX, et al. (1998) The APOE-epsilon 4 allele and the risk of Alzheimer disease among African Americans, Whites, and Hispanics. JAMA 279: 751–755.
79. Romas SN, et al. (2002) Familial Alzheimer disease among Caribbean Hispanics - A reexamination of its association with APOE. Arch Neurol 59: 87–91.
80. Rosenberg NA, et al. (2009) Replication of Genetic Associations as Pseudoreplication due to Shared Genealogy. Genet Epidemiol 33: 479–487.
81. Heath SC, et al. (2008) Investigation of the fine structure of European populations with applications to disease association studies. Eur J Hum Genet 16: 1413–1429.
82. Lucotte G, et al. (1997) Pattern of gradient of apolipoprotein E allele *4 frequencies in Western Europe. Hum Biol 69: 253–262.
83. Corbo RM, et al. (1999) Apolipoprotein E (APOE) allele distribution in the world. Is APOE*4 a ‘thrifty’ allele? Ann Hum Genet 63: 301–310.
84. Singh PP, et al. (2006) APOE distribution in world populations with new data from India and the UK. Ann Hum Biol 33: 279–308.
85. Rosenmann H, et al. (2003) An association study of the codon 72 polymorphism in the pro-apoptotic gene p53 and Alzheimer's disease. Neurosci Lett 340: 29–32.
86. Dresner-Pollak R, et al. (2009) Estrogen Receptor Beta Gene Variant Is Associated with Vascular Dementia in Elderly Women. Genet Test Mol Biomarkers 13: 339–342.
87. Little J, et al. (2009) Strengthening the reporting of genetic association studies (STREGA): an extension of the STROBE Statement. Hum Genet 125: 131–151.
88. Nelis M, et al. (2009) Genetic Structure of Europeans: A View from the North-East. PLoS ONE 4: e5472. doi:10.1371/journal.pone.0005472.
89. Zekraoui L, et al. (1997) High frequency of the apolipoprotein E *4 allele in African pygmies and most of the African populations in sub-Saharan Africa. Hum Biol 69: 575–581.
90. Voight BF, et al. (2005) Confounding from cryptic relatedness in case-control association studies. PLoS Genet 1: e32. doi:10.1371/journal.pgen.0010032.
91. Bhattacharjee S, et al. (2010) Using Principal Components of Genetic Variation for Robust and Powerful Detection of Gene-Gene Interactions in Case-Control and Case-Only Studies. Am J Hum Genet 86: 331–342.
92. Blacker D, et al. (2001) Results of a high resolution genome screen in 443 Alzheimer's disease families: the NIMH Genetics Initiative. Am J Hum Genet 69: 498.
93. Lee JH, et al. (2008) Analyses of the National Institute on Aging Late-Onset Alzheimer's Disease Family Study: Implication of Additional Loci. Arch Neurol 65: 1518–1526.
94. Pastor P, et al. (2003) Apolipoprotein E ε4 modifies Alzheimer's disease onset in an E280A PS1 Kindred. Ann Neurol 54: 163–169.
95. Wijsman EM, et al. (2005) APOE and other loci affect age-at-onset in Alzheimer's disease families with PS2 mutation. Am J Med Genet B Neuropsychiatr Genet 132B: 14–20.
96. Marchani EE, et al. (2010) Evidence for three loci modifying age-at-onset of Alzheimer's disease in early-onset PSEN2 families. Am J Med Genet B Neuropsychiatr Genet 153B: 1031–1041.
97. Pacini A, et al. (2005) NAPOR-3 RNA binding protein is required for apoptosis in hippocampus. Brain Res Mol Brain Res 140: 34–44.
98. Falkevall A, et al. (2006) Degradation of the amyloid beta-protein by the novel mitochondrial peptidasome, PreP. J Biol Chem 281: 29096–29104.
99. Jun G, et al. (2010) Meta-analysis confirms CR1, CLU, and PICALM as Alzheimer disease risk loci and reveals interactions with APOE genotypes. Arch Neurol Sept. 3 Epub ahead of print.
100. Carrasquillo MM, et al. (2010) Replication of CLU, CR1, and PICALM Associations With Alzheimer Disease. Arch Neurol 67: 961–964.
101. Wakefield J (2009) Bayes Factors for Genome-Wide Association Studies: Comparison with P-values. Genet Epidemiol 33: 79–86.

Article published in the journal PLOS Genetics, 2011, Issue 2.
