ISSN: 2165-7890
Research Article - (2013) Volume 3, Issue 1
Background: Twin and family studies have indicated a strong genetic component in autism spectrum disorders, and genetic studies have revealed highly heterogeneous risk factors. The range and severity of the symptom presentation also vary in the spectrum. Thus, symptom-based phenotypes are putatively more closely related to the underlying biology of autism than the end-state diagnosis.
Methods: We performed principal component analysis (PCA) on Autism Diagnostic Interview-Revised algorithm data for 117 Finnish families and 594 families from the Autism Genetic Research Exchange (AGRE). The resulting continuous component scores were used as quantitative phenotypes in family-based association analysis. In addition, K-means clustering was performed to cluster and visualize the results of the PCA. Unaffected siblings were included in the study.
Results: The components were interpreted as Social Component (SC), Communication Component (CC) and Restricted and Stereotyped Behavior Component (RSBC). K-means clustering showed that, especially in SC, the range of symptom severity was increased by the siblings. The association of neuroligin 1 with SC was stronger than in a previous study where only the end-state diagnosis was used. In RSBC, the range of symptom severity of siblings overlapped greatly with that of patients, which could explain why no association of reelin was found in previous studies in which only the end-state diagnosis was used. A significant association of reelin with RSBC was now found in the Finnish families (Bonferroni-corrected p=0.029 for rs362644). Although the Finnish sample is isolated and genetically very homogeneous compared to the heterogeneous background of the AGRE families, many single-nucleotide polymorphisms in reelin showed modest association with RSBC in the AGRE sample, too.
Conclusions: This study demonstrates how the quantitative phenotypes can affect the association analyses, and yields further support to the use of siblings in the study of complex neuropsychiatric disorders.
Keywords: Autism; ADI-R; Principal component analysis; K-means clustering; Neuroligin 1; Reelin; Quantitative phenotype; Association analysis
ADI-R: Autism Diagnostic Interview-Revised; AGRE: Autism Genetic Research Exchange; ASDs: Autism Spectrum Disorders; CC: Communication Component; DSM-IV: Diagnostic and Statistical Manual of Mental Disorders, fourth edition; FIN: Finnish (data sample); ICD-10: International Classification of Diseases, tenth revision; NLGN1: Neuroligin 1; PC: Principal Component; PCA: Principal Component Analysis; PDD-NOS: Pervasive Developmental Disorder Not Otherwise Specified; RELN: Reelin; RSBC: Restricted and Stereotyped Behavior Component; SC: Social Component; SNP: Single-nucleotide Polymorphism
Autism is a neurodevelopmental disorder characterized by deficits in reciprocal social interactions and communication, and restricted and stereotyped behavior. It has been estimated to affect 20.6 per 10,000 individuals, while Autism Spectrum Disorders (ASDs) altogether have been estimated to affect approximately 60-70 per 10,000 [1]. In Finland, the prevalence of autism is 41 per 10,000, and that of ASDs is 84 per 10,000, according to DSM-IV [2,3]. Although twin and family studies have indicated a strong genetic component in ASDs [4,5], resulting in a heritability estimate of over 90% [6], only 10-15% of the cases can be explained by a variety of single-gene disorders [7].
The symptoms of ASDs have been described as belonging to a complex continuum, where each symptom can vary from mild to severe [8]. Constantino [7] also emphasized in his recent publication that it is important to understand the social abnormality of autism as a quantitative trait instead of a categorically defined disorder. Autism-related traits have also been reported in close relatives of individuals with ASDs [6], supporting the idea that ASDs are one end of a continuous spectrum of symptoms that is also present in the general population. Genetic studies using quantitative traits, and including siblings in the analysis, have indeed shown enhancement in specific linkage signals [9,10]. In addition, it has been demonstrated that the same genes that predispose to complex disorders could also be responsible for the phenotypic variation in the general population [7]. Thus, the use of quantitative traits and siblings in the genetic analyses of complex disorders can be essential.
The Autism Diagnostic Interview-Revised (ADI-R) is a widely used clinical diagnostic instrument in the assessment of autism [11]. Therefore, it would be valuable to have a simple and practical way of using the ADI-R results as phenotypic data in genetic analyses. In the ADI-R algorithm, the information received from the ADI-R is gathered together, and the diagnosis according to ADI-R is calculated from the sums of the different subareas of the algorithm. Although this is one way to reduce the data in the ADI-R algorithm, there may be more efficient ways for the purposes of genetic analyses. Principal Component Analysis (PCA) is a computational tool that reduces the dimensions of the data while maximizing the amount of variation preserved, and forms orthogonal, independent principal components [12]. Thus, the first few principal components contain most of the variation in the data, and in addition, the components are independent of each other. This could be important, as the predisposing genetic factors may also be independent of each other. Furthermore, the result is not influenced by any assumptions made beforehand.
In some previous studies, PCA has been used to reduce the dimensions of ADI-R data. However, these studies have either focused only on a certain domain of ADI-R [13,14], or used ADI-R items to form clusters based on the principal components and intracorrelating items [15]. Those clusters, or in some studies a subset of them, have been used in genetic analyses with enhanced results [16-19]. Sutcliffe et al. [20] used the clusters to explore phenotypic correlates with the coding variants in their study. In addition, factor analysis and clustering methods have been used to form models explaining the symptom spectrum of ASDs [21-24].
In this study, we have performed a principal component analysis of ADI-R algorithm data, resulting in independent principal components. The scores of the components were then used in genetic analyses. Our study differs from previous studies in many aspects: only ADI-R algorithm data was used, instead of the whole interview data; all subareas of the symptomatology were studied; the results of the PCA were used directly in genetic analyses without any modifications; and finally, siblings were included in the study. No previous study has included all of these aspects. In addition, our aim was not to build a model to describe the whole spectrum of symptoms in ASDs, which perhaps would have required a different approach, as used by Van Lang et al. [22], Frazier et al. [23] and Lecavalier et al. [24], but to use a robust and practical method to construct independent quantitative phenotypes from the widely used ADI-R for genetic analyses.
In addition to performing PCA, we interpreted the principal components by calculating correlations between the components and ADI-R items. We also used the components in clustering individuals with k-means clustering [25], to visualize the results of PCA, and to further validate the interpretations of the components. In addition, to test the reliability of the components, we calculated Cronbach’s alphas for the ADI-R items correlating with the components, and compared them with the Cronbach’s alphas of the items belonging to the subareas of the ADI-R algorithm. Finally, we used the principal component scores, describing the positions of the individuals along the principal components, as quantitative phenotypes in family-based association analyses of previously genotyped candidate genes. All analyses were first conducted in a Finnish sample, after which they were repeated in a sample received from the Autism Genetic Research Exchange (AGRE) [26]. However, it is noteworthy that the Finnish sample is isolated and genetically very homogeneous [27], whereas the AGRE sample has been gathered from all over the United States, and it is only required that one of the parents is English-speaking [28].
Ethics statement
For the Finnish sample, the ethics committee of Helsinki University Central Hospital approved the research protocol. Also, all participants or their parents have signed a written consent form.
Finnish sample
The Finnish sample included 117 families with autism. They were recruited through Finnish university and central hospitals, and the final diagnoses were established by a multidisciplinary group headed by an experienced child neurologist. Only individuals fulfilling both ICD-10 and DSM-IV criteria were included. The ADI-R (version published in 1994 [11]) was administered in Finnish to all individuals included in the PCA. Inter-rater reliability had not been established for all interviewers. However, all but one of the interviewers had been trained together in the use of ADI-R, and they also conducted interviews together in the training phase. In total, the study sample included 177 individuals from 117 families, for whom ADI-R data was available. Of these individuals, 120 were diagnosed with autism, 10 with a Pervasive Developmental Disorder Not Otherwise Specified (PDD-NOS), and five with Asperger syndrome. The remaining 42 individuals with available ADI-R data were siblings of the affected individuals, who were asked to participate in the study, whether or not any symptoms had been noticed earlier.
AGRE sample
AGRE is one of the world’s largest shared genetic resources for the study of autism. DNA, clinical and medical information has been gathered from families throughout the United States. In the AGRE families, one of the parents must be English-speaking. Unfortunately, clinical diagnoses are not available for AGRE participants. Instead, the diagnostic assessment in the AGRE is based on the ADI-R, which is based on both ICD-10 and DSM-IV. All individuals in the AGRE sample with the same version of the ADI-R (referred to as the 1995 version in AGRE) were included in this study, resulting in a total of 1395 individuals from 594 families. Of these, 1075 individuals had a diagnosis of autism and the rest belonged to the “Not Quite Autism”, “Broad Spectrum”, or “Not Met” category. These categories are described in detail on the AGRE website [29], except for the “Not Met” category, which consists of individuals who did not meet the criteria for any of AGRE’s affected status categories. In our study, all individuals in the “Not Met” category were siblings of affected individuals.
Autism Diagnostic Interview–Revised
ADI-R is a widely used structured diagnostic interview used by both clinicians and researchers. It is administered to the primary caregivers of individuals with a suspected autism diagnosis. The interview contains 111 questions concerning the behavior and development of the child, and provides data on all three symptom domains in ASDs. The ADI-R algorithm has been designed to gather together the information received from the ADI-R, and it contains four subareas concerning social behavior, communication, restricted and repetitive behaviors, and development. The diagnosis according to ADI-R is then based on certain threshold values of the sums of the subareas of the ADI-R algorithm. Although the earlier version of ADI-R was used in this study, the algorithms for both versions [11,30] comprise the same items. In our study, the values of individual items were used exactly as they are coded in the algorithm, as explained in detail on the AGRE website [31]. Missing values were coded as -1. ADI-R takes at least two hours to complete, which has limited its applicability in large studies. Therefore, we decided to use the ADI-R algorithm instead of the whole interview, as in the future it could be possible to use only the questions in the algorithm for research purposes.
Principal component analysis and interpretation of the principal components
Principal Component Analysis (PCA) was then performed on the data, consisting of all items in the ADI-R algorithm. PCA is a statistical method that performs a linear transformation of the data into a new orthogonal basis. The new variables, called Principal Components (PCs), are linear combinations of the original variables selected to preserve as much of the variance as possible, and ordered by decreasing variance. PCA is performed by calculating the singular value decomposition of the data matrix. PCA is an excellent tool to reduce the dimensions of the data, as the first principal components often explain most of the variance in the data. In addition, the orthogonality of the new basis means that the PCs are independent of (uncorrelated with) each other, which is often useful. Factor analysis has also been used to find the main factors of ADI-R data [24], and PCA and factor analysis are often confused. While PCA maximizes the amount of variation preserved, the aim in factor analysis is to estimate the interdependence between variables, and thereby find common factors, which, however, do not necessarily contain most of the variation in the data, and are not necessarily independent of each other.
PCA was performed for both Finnish and AGRE data. In the Finnish sample, the number of principal components used in the subsequent analyses was chosen according to the commonly used Kaiser’s criterion [32], where only principal components with an eigenvalue greater than one are included in the analyses. In the AGRE sample, the same number of components was chosen. The correlations (c) between the algorithm items and the most significant (eigenvalue>1) principal components were calculated, and the components were interpreted according to the correlations. A correlation threshold of 0.2 was chosen as it is statistically significant for the Finnish sample (p<0.01), and of course, for the much larger AGRE sample too [33]. The statistical significance of the correlations is further increased as the use of ordinal data reduces the amount of correlation observed [34]. It should be noted that in PCA, the direction of the axes is chosen arbitrarily. Thus, if all algorithm items are correlated negatively with the first principal component in the Finnish data and the same items are correlated positively in the AGRE data, it means that the direction of the axis has been chosen differently in the two analyses, while the component is actually related to the items in the same way in both data sets. Therefore, the correlation threshold actually means that all items with a correlation coefficient above 0.2 or below -0.2 were chosen for the interpretation.
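The SVD-based computation described above can be sketched as follows. This is a minimal illustration in Python with NumPy (the study itself used MATLAB), run on randomly generated toy data. The function name `pca_svd`, the toy item codes, and the choice to centre but not rescale the items are assumptions made for the example, not details taken from the study.

```python
import numpy as np

def pca_svd(data):
    """PCA via singular value decomposition of the column-centred data matrix.

    data: array of shape (n_individuals, n_items).
    Returns (scores, loadings, eigenvalues), ordered by decreasing variance.
    """
    X = np.asarray(data, dtype=float)
    centered = X - X.mean(axis=0)          # remove the mean of every item
    U, s, Vt = np.linalg.svd(centered, full_matrices=False)
    scores = U * s                         # positions of individuals along each PC
    loadings = Vt.T                        # one orthonormal direction per column
    eigenvalues = s**2 / (X.shape[0] - 1)  # variance carried by each PC
    return scores, loadings, eigenvalues

# Toy data only: 50 hypothetical individuals, 6 hypothetical items coded 0-3.
rng = np.random.default_rng(0)
toy = rng.integers(0, 4, size=(50, 6)).astype(float)

scores, loadings, eigenvalues = pca_svd(toy)
explained = eigenvalues / eigenvalues.sum()  # share of variance per component
```

In this sketch the columns of `scores` are mutually uncorrelated and the columns of `loadings` are orthonormal, which is the property the text refers to when it calls the components independent.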
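The component selection and interpretation steps can be illustrated with a short Python sketch on synthetic data. It assumes that the items are standardized before PCA, so that an eigenvalue of 1 corresponds to the variance of a single item; the text does not state whether this was done. All function names and the toy data are hypothetical.

```python
import numpy as np

def kaiser_pca(data):
    """PCA on standardized items, keeping components with eigenvalue > 1."""
    X = np.asarray(data, dtype=float)
    # Assumption: items are standardized, so the eigenvalues sum to the item count.
    Z = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)
    U, s, Vt = np.linalg.svd(Z, full_matrices=False)
    eigenvalues = s**2 / (Z.shape[0] - 1)
    keep = eigenvalues > 1.0               # Kaiser's criterion
    return Z, (U * s)[:, keep], eigenvalues

def item_component_correlations(Z, scores):
    """Pearson correlation of every item (rows) with every kept component (columns)."""
    n_items = Z.shape[1]
    full = np.corrcoef(Z, scores, rowvar=False)
    return full[:n_items, n_items:]

def interpreting_items(corr, threshold=0.2):
    """Items with |correlation| above the threshold; the axis sign is ignored."""
    return np.abs(corr) > threshold

# Toy data only: two hypothetical latent traits, each measured by four noisy items.
rng = np.random.default_rng(1)
latent = rng.normal(size=(200, 2))
toy = np.hstack([
    latent[:, [0]] + 0.5 * rng.normal(size=(200, 4)),
    latent[:, [1]] + 0.5 * rng.normal(size=(200, 4)),
])

Z, scores, eigenvalues = kaiser_pca(toy)
corr = item_component_correlations(Z, scores)
mask = interpreting_items(corr)
# Flipping the arbitrary direction of the axes leaves the selected items unchanged.
mask_flipped = interpreting_items(item_component_correlations(Z, -scores))
```

Thresholding the absolute value is what makes the interpretation insensitive to the arbitrary direction of each axis.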
Visualization and validation of the data
In addition to the correlation analysis, the most significant PCs were used in clustering individuals into three clusters to visualize the data, and to further validate the interpretations of the components. The clustering was performed using the k-means algorithm, which minimizes the sum, over all clusters, of the distances between the points and the centroid of their cluster. Squared Euclidean distances were used in this study. Some relevant information, such as the number of individuals with an autism diagnosis according to ADI-R, sex, and age, was calculated for each of the clusters.
Principal components carry the maximal amount of variation preserved, and are independent of each other. The ADI-R scores are computed for each domain, as listed in the diagnostic criteria for research (World Health Organization, 1992) [11]. Thus, they do not necessarily carry the maximal amount of variation, nor are they necessarily independent of each other. However, for genetic analyses, both of these properties are useful, and therefore, we performed PCA on ADI-R algorithm data. It is also important that the PCs have good reliability. We calculated Cronbach’s alpha [35], a reliability measurement, for the most significant PCs by using the answers to the significantly correlating items, and compared them to the ADI-R algorithm domains by using the answers to the items that belong to the different subareas of the ADI-R algorithm. MATLAB® [36] was used in PCA, correlation analysis, clustering and reliability measurements.
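As an illustration of the clustering step, the following Python sketch implements the standard k-means iteration (Lloyd's algorithm) with squared Euclidean distances. It is a generic example on toy data, not the MATLAB routine used in the study. The initialization from randomly chosen data points, the iteration cap, and all names are assumptions.

```python
import numpy as np

def kmeans(points, k, n_iter=100, seed=0):
    """Basic k-means (Lloyd's algorithm) using squared Euclidean distances."""
    rng = np.random.default_rng(seed)
    X = np.asarray(points, dtype=float)
    # Assumption: initial centroids are k distinct, randomly chosen data points.
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iter):
        # Squared Euclidean distance from every point to every centroid.
        d2 = ((X[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
        labels = d2.argmin(axis=1)
        # Move each centroid to the mean of its points (unchanged if it has none).
        new_centroids = np.array([
            X[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
            for j in range(k)
        ])
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return labels, centroids

# Toy data only: three hypothetical groups in a two-component score space.
rng = np.random.default_rng(2)
toy_scores = np.vstack([
    rng.normal(loc=(0.0, 0.0), scale=0.5, size=(30, 2)),
    rng.normal(loc=(5.0, 0.0), scale=0.5, size=(30, 2)),
    rng.normal(loc=(0.0, 5.0), scale=0.5, size=(30, 2)),
])

labels, centroids = kmeans(toy_scores, k=3)
# Objective being minimized: summed squared distance of points to their centroid.
within_ss = ((toy_scores - centroids[labels]) ** 2).sum()
total_ss = ((toy_scores - toy_scores.mean(axis=0)) ** 2).sum()
```

Because k-means only finds a local minimum that depends on the starting centroids, practical implementations usually repeat the procedure from several initializations and keep the solution with the smallest objective value.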
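For reference, Cronbach's alpha for a set of k items is k/(k-1) multiplied by one minus the ratio of the summed item variances to the variance of the total score. A minimal Python sketch of this standard formula is given below; the function name and the toy answers are hypothetical, and the study's own calculations were done in MATLAB.

```python
import numpy as np

def cronbach_alpha(items):
    """Cronbach's alpha for an (n_individuals, k_items) array of item scores."""
    X = np.asarray(items, dtype=float)
    k = X.shape[1]
    item_variances = X.var(axis=0, ddof=1).sum()   # sum of the k item variances
    total_variance = X.sum(axis=1).var(ddof=1)     # variance of the summed score
    return k / (k - 1) * (1.0 - item_variances / total_variance)

# Toy answers only: perfectly consistent items versus two uncorrelated items.
consistent = np.array([[0, 0, 0], [1, 1, 1], [2, 2, 2], [3, 3, 3]])
unrelated = np.array([[0, 0], [1, 0], [0, 1], [1, 1]])

alpha_consistent = cronbach_alpha(consistent)  # identical items give alpha = 1
alpha_unrelated = cronbach_alpha(unrelated)    # uncorrelated items give alpha = 0
```

The two toy cases mark the ends of the usual range: items that always agree yield an alpha of 1, while items that share no variance yield an alpha of 0.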
Family-based association analysis
Finally, we applied the information provided by PCA to genetic analyses. We performed family-based association analysis with quantitative phenotypes for previously genotyped candidate genes in Finnish population [37-41] (and unpublished data), including 290 Single Nucleotide Polymorphisms (SNPs) in 23 genes (Supplementary table 1). All SNPs in the Finnish sample were tested with each of the three PCs. The SNPs are located in known autism susceptibility genes, as well as genes in regions identified in several Finnish genome-wide linkage and association scans for ASDs. Genotyping was performed using allele specific primer extension on microarrays [42], Homogenous MassEXTEND (hME) and iPLEX (Sequenom Inc., San Diego, CA, USA) technology using the MassARRAY Platform, as specified by manufacturer’s instructions. Association analyses in the AGRE sample were performed for the two genes, which resulted in most significant associations in the Finnish sample. The SNPs for the AGRE families were extracted from genome-wide SNP data from Affymetrix 5.0 chips genotyped at the Broad Institute [43].
For the family-based association analyses, a linear regression of phenotype on genotype was performed, and permutation was used to correct for the dependence between related individuals [44]. The permutation procedure is also robust for non-normal phenotype distributions. The principal component scores, describing the positions of the individuals along the PCs, were used in family-based association analysis as quantitative phenotypes. The association analyses were carried out with the PLINK software package, by using the QFAM option [45-49]. Physical positions of SNPs were obtained from NCBI dbSNP build 125. Haploview was used to check the haplotype blocks of the most significantly associating SNPs, and Bonferroni correction was used to correct for multiple testing of the data.
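The logic of a permutation-based regression test followed by a Bonferroni correction can be sketched in Python as below. This is a deliberately simplified illustration: it permutes genotypes freely among individuals treated as unrelated, so it ignores family structure and does not reproduce the QFAM procedure in PLINK, whose permutation scheme accounts for relatedness. All names, the number of permutations, and the toy data are assumptions.

```python
import numpy as np

def regression_slope(genotype, phenotype):
    """Least-squares slope of phenotype on genotype (allele count 0, 1 or 2)."""
    g = genotype - genotype.mean()
    return float(g @ (phenotype - phenotype.mean()) / (g @ g))

def permutation_p_value(genotype, phenotype, n_perm=2000, seed=0):
    """Empirical two-sided p-value obtained by shuffling genotypes.

    Simplification: individuals are treated as unrelated (no family structure).
    """
    rng = np.random.default_rng(seed)
    g = np.asarray(genotype, dtype=float)
    y = np.asarray(phenotype, dtype=float)
    observed = abs(regression_slope(g, y))
    hits = sum(
        abs(regression_slope(rng.permutation(g), y)) >= observed
        for _ in range(n_perm)
    )
    return (hits + 1) / (n_perm + 1)

def bonferroni(p_value, n_tests):
    """Bonferroni-corrected p-value, capped at 1."""
    return min(1.0, p_value * n_tests)

# Toy data only: a hypothetical SNP with a strong effect on a component score.
rng = np.random.default_rng(3)
toy_genotype = rng.integers(0, 3, size=100)
toy_phenotype = toy_genotype + rng.normal(scale=0.1, size=100)
toy_p = permutation_p_value(toy_genotype, toy_phenotype)

corrected = bonferroni(0.000101, 290)
```

Multiplying the empirical p-value listed for rs362644 in the Finnish sample (0.000101) by 290 gives approximately 0.029, which agrees with the Bonferroni-corrected value reported in the abstract. That the correction was made over the 290 SNPs is an inference from these numbers rather than something stated explicitly in the text.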
PCA of ADI-R algorithm data resulted in three PCs with an eigenvalue greater than 1 in the Finnish sample. In the AGRE sample, the same number of components as for the Finnish sample was selected, although the eigenvalue for the fourth component was slightly over 1. In the Finnish sample, the selected components accounted for 61% of variation in the data, while in AGRE data they accounted for 47% of variation. Although only the first three PCs were used in this study for practical reasons, it is possible that the PCs with smaller eigenvalues could also contain relevant phenotypic information, and they should be investigated in the future. The eigenvalues for both samples are presented in supplementary table 2.
Interpreting the principal components
The interpretations of the three PCs were formed based on the correlation analysis. Correlation coefficients with an absolute value greater than 0.20 are shown in bold in table 1. The PC correlating with items concerning social interactions was named Social Component (SC), and the PC correlating with items concerning communication was named Communication Component (CC). Correspondingly, the third principal component was named Restricted and Stereotyped Behavior Component (RSBC).
Despite the similar themes of the components in the two samples, there were some differences too. First, in the Finnish data, the SC accounted for 37% and CC for 20% of variation in the data, whereas in the AGRE sample the SC accounted for 17% and the CC for 26% of variation. RSBC was the third PC in both samples and accounted for 4% of total variance, both in the Finnish and AGRE samples. Second, there were some differences in the contents of the components, as can be seen in table 1. For example, more items correlated with the SC in the Finnish sample than in the AGRE sample, and different items in the two samples had the biggest correlation with the RSBC: hand and finger movements in the Finnish data and compulsions/rituals in the AGRE data. Finally, only one of the items significantly correlated with more than one PC in the Finnish data, whereas in AGRE data three items correlated with two components.
ADI-R algorithm item | F-SC | A-SC | F-CC | A-CC | F-RSBC | A-RSBC | ADI-R domain |
---|---|---|---|---|---|---|---|
Direct gaze | -0.19 | 0.21 | 0.03 | -0.05 | -0.07 | -0.03 | SOC |
Social smiling | -0.22 | 0.22 | 0.03 | -0.08 | -0.07 | 0.04 | SOC |
Range of facial expressions | -0.22 | 0.21 | -0.00 | -0.07 | -0.05 | 0.05 | SOC |
Imaginative play with peers | -0.19 | 0.20 | 0.05 | -0.01 | -0.03 | 0.01 | SOC |
Interest in children | -0.23 | 0.19 | -0.02 | -0.10 | -0.02 | 0.10 | SOC |
Response to other children’s approaches | -0.20 | 0.18 | -0.03 | -0.09 | -0.03 | 0.06 | SOC |
Group play with peers or friends | -0.21 | 0.17 | 0.03 | -0.01 | -0.00 | -0.01 | SOC |
Showing and directing attention | -0.23 | 0.20 | 0.01 | -0.12 | -0.01 | 0.10 | SOC |
Offering to share | -0.17 | 0.13 | -0.01 | -0.08 | 0.06 | 0.05 | SOC |
Seeking to share own enjoyment with others | -0.22 | 0.21 | -0.00 | -0.10 | -0.05 | 0.10 | SOC |
Use of other’s body | -0.08 | 0.11 | -0.02 | -0.11 | 0.28 | -0.13 | SOC |
Offers comfort | -0.20 | 0.17 | 0.00 | -0.11 | -0.08 | 0.09 | SOC |
Quality of social overtures | -0.22 | 0.22 | 0.03 | -0.11 | -0.17 | -0.00 | SOC |
Inappropriate facial expressions | -0.10 | 0.14 | 0.04 | -0.05 | 0.20 | -0.17 | SOC |
Appropriateness of social responses | -0.20 | 0.19 | 0.02 | -0.09 | -0.10 | 0.05 | SOC |
Pointing to express interest | -0.19 | 0.17 | 0.04 | -0.13 | -0.12 | 0.06 | COM |
Conventional instrumental gestures | -0.20 | 0.20 | 0.02 | -0.12 | -0.15 | 0.07 | COM |
Nodding | -0.17 | 0.16 | -0.02 | -0.16 | -0.22 | 0.15 | COM |
Headshaking | -0.20 | 0.17 | -0.03 | -0.15 | -0.17 | 0.18 | COM |
Spontaneous imitation of actions | -0.21 | 0.12 | -0.01 | -0.08 | -0.09 | 0.07 | COM |
Imaginative play | -0.20 | 0.15 | 0.05 | -0.09 | -0.03 | 0.03 | COM |
Imitative social play | -0.18 | 0.17 | 0.02 | -0.08 | -0.10 | 0.04 | COM |
Social chat | -0.00 | 0.22 | 0.39 | 0.31 | 0.01 | 0.22 | COM |
Reciprocal conversation | -0.02 | 0.23 | 0.42 | 0.38 | 0.03 | 0.23 | COM |
Stereotyped utterances | 0.01 | 0.21 | 0.50 | 0.41 | -0.00 | -0.03 | COM |
Inappropriate questions | 0.07 | 0.12 | 0.34 | 0.34 | -0.16 | -0.13 | COM |
Pronominal reversal | 0.04 | 0.17 | 0.36 | 0.36 | -0.01 | -0.03 | COM |
Neologisms/idiosyncratic language | 0.09 | 0.09 | 0.28 | 0.27 | -0.04 | -0.08 | COM |
Circumscribed interests | -0.01 | 0.11 | 0.15 | 0.13 | 0.14 | -0.19 | RSB |
Unusual preoccupations | -0.12 | 0.06 | 0.04 | -0.03 | 0.29 | -0.25 | RSB |
Verbal rituals | -0.04 | 0.12 | 0.21 | 0.13 | 0.10 | -0.15 | RSB |
Compulsions/rituals | -0.13 | 0.09 | 0.03 | 0.03 | 0.24 | -0.43 | RSB |
Hand and finger movements | -0.13 | 0.13 | -0.01 | -0.09 | 0.57 | -0.27 | RSB |
Complex mannerisms | -0.16 | 0.11 | 0.01 | -0.10 | 0.24 | -0.33 | RSB |
Repetitive use of objects | -0.17 | 0.14 | 0.01 | -0.06 | 0.29 | -0.35 | RSB |
Unusual sensory interests | -0.16 | 0.11 | 0.01 | -0.09 | 0.10 | -0.29 | RSB |
Age when parents first noticed | -0.05 | 0.02 | -0.00 | -0.02 | -0.00 | -0.02 | DEV |
Age when abnormality first evident | -0.08 | 0.04 | 0.00 | -0.02 | -0.04 | -0.01 | DEV |
Interviewer’s judgement on age when developmental abnormalities first manifest | -0.04 | 0.02 | -0.01 | -0.01 | 0.01 | -0.01 | DEV |
Age of first single words | -0.01 | 0.03 | -0.04 | -0.06 | 0.02 | 0.04 | DEV |
Age of first phrases | -0.07 | 0.03 | -0.02 | -0.06 | 0.02 | 0.04 | DEV |
F=Finnish sample, A=AGRE sample, SC=Social Component, CC=Communication Component, RSBC=Restricted and Stereotyped Behavior Component. (Correlation coefficients of absolute value over 0.20 are bold).
Table 1: ADI-R algorithm item correlations with the three most significant principal components in Finnish and AGRE samples.
Figure 1: Clusters based on PCA and k-means clustering of ADI-R data in Finnish and AGRE samples. Although cluster 1 contains mostly unaffected individuals according to ADI-R scores (see table 2), especially in the Finnish sample, there is a lot of variation in the symptom severity in that cluster. Thus, the inclusion of unaffected siblings increases the range of the symptom severity, increasing also the power of association analyses with quantitative phenotypes. The direction of the SC in AGRE sample has been inverted for visualization purposes, as the SC was in the opposite direction in the two samples (in PCA the direction of the axes is chosen arbitrarily).
Variable | FIN C1 | AGRE C1 | FIN C2 | AGRE C2 | FIN C3 | AGRE C3 |
---|---|---|---|---|---|---|
1. Number of individuals | 41 | 342 | 50 | 448 | 86 | 544 |
2. Speaking individuals (%) | 97.56% | 98.25% | 0.00% | 0.00% | 100.00% | 100.00% |
3. Average age (years) | 12.668 | 8.4229 | 13.201 | 7.4542 | 12.795 | 9.6887 |
4. Males (%) | 60.97% | 74.24% | 75.99% | 77.55% | 75.58% | 78.25% |
5. Social interaction (avg) | 3.9756 | 11.099 | 25.8 | 22.824 | 23.291 | 23.623 |
6. Verbal communication (avg) | 4.2927 | 9.6637 | -1 | -2 | 17.93 | 18.007 |
7. Non-verbal communication (avg) | -0.97561 | -1.8889 | 13.04 | 12.277 | -1 | -2 |
8. Repetitive/stereotyped behavior (avg) | 1.6829 | 4.4298 | 5.78 | 5.0536 | 7.1512 | 6.5257 |
9. Abnormality of development (avg) | 2.0488 | 3.3012 | 4.14 | 4.6116 | 3.4651 | 4.1452 |
10. Diagnosis according to ADI-R (%) | 9.76% | 42.98% | 94.00% | 91.07% | 94.19% | 95.59% |
FIN=Finnish sample, AGRE=AGRE sample, C1=Cluster 1, C2=Cluster 2, C3=Cluster 3
Number of individuals, percentage of speaking individuals, average age of individuals, and percentage of males are provided for all clusters in both samples. In addition, the average score (of all individuals in a cluster) for all subareas in the ADI-R algorithm is provided for all clusters in both samples.
Table 2: Descriptive phenotypic information of the three clusters in Finnish and AGRE samples (shown in figures 1 and 2).
The differences between the two samples could be explained by various things. Most importantly, the population of Finland is isolated and genetically homogeneous, whereas the AGRE sample has been gathered all over the United States, and the subjects are only required to have one English-speaking parent. Thus, it is reasonable to assume that the AGRE sample is both genetically and phenotypically more heterogeneous than the Finnish sample. This can lead to stronger cultural differences, both within AGRE sample and between Finnish and AGRE sample.
Visualizing the data by k-means clustering
The results of the k-means clustering are shown in figures 1 and 2, and the calculated properties of the clusters are shown in table 2. Visually, two clusters can be clearly recognized in the data, one consisting of verbal and the other of non-verbal individuals. However, we chose to use three clusters instead of two, because, especially in the Finnish sample, the k-means clustering was then able to separate almost all individuals who were unaffected according to ADI-R scores into a separate cluster (cluster 1). Likewise, in the AGRE sample, most of the unaffected individuals belonged to cluster 1, but about 40% of the individuals in that cluster also had a diagnosis of autism. Clusters 2 and 3 contained mostly affected individuals, according to ADI-R, in both samples (Table 2).
The clustering supported our interpretations of the components, as the component that we had named social component could be used to distinguish subjects based on their social skills (according to the social domain of the ADI-R algorithm), and the same applied to the communication component (Figure 1). In RSBC, there was a large overlap in the range of values of all three clusters, as shown in figure 2, but according to the repetitive/stereotyped behavior domain of the ADI-R algorithm, the severity of symptoms in restricted and stereotyped behavior was lowest in cluster 1. Correspondingly, the range of the symptom presentation was smallest in cluster 1 (Figure 2). Due to the differences in the samples, the orientation of the principal components differs between the Finnish and AGRE samples by approximately 30 degrees.
Figure 2: Visualization of the variation in the restricted and stereotyped behavior component in the Finnish sample. In RSBC, the range of the symptom severity of unaffected individuals (according to ADI-R) overlaps greatly with that of patients. If only the end-state diagnosis is used in family-based association analyses, this could lead to negative results, because the unaffected individuals do not differ much from the affected ones.
Component | FIN PCA | FIN ADI-R | Between FIN PCA & ADI-R | AGRE PCA | AGRE ADI-R | Between AGRE PCA & ADI-R |
---|---|---|---|---|---|---|
SC | 0.9681 | 0.9545 | 0.7506 | 0.8474 | 0.8732 | 0.6434 |
CC | 0.9197 | 0.8269 | 0.6331 | 0.7449 | 0.7278 | 0.6014 |
RSBC | 0.7579 | 0.7263 | 0.4470 | 0.8053 | 0.4638 | 0.6785 |
FIN=Finnish sample, AGRE=AGRE sample, PCA=Cronbach’s alpha for significantly correlating items in the principal components (PCA), ADI-R=Cronbach’s alpha for the equivalent ADI-R algorithm subareas, Between=Cronbach’s alphas between the principal component scores and equivalent ADI-R algorithm subarea scores.
Table 3: Cronbach’s alphas for the significantly correlating items in the principal components, and equivalent ADI-R algorithm subareas in Finnish and AGRE samples.
Furthermore, K-means clustering showed that the unaffected individuals (according to ADI-R) also had variation in the symptom severity. Especially in SC, the range of the symptom severity was increased by the siblings (Figure 1), which could give more power to the association analysis with quantitative phenotypes. In RSBC, the range of the symptom severity of the unaffected individuals overlapped greatly with that of patients (Figure 2). In family-based association analyses where only the end-state diagnosis is used, this could lead to negative results because of the similar range of symptom severity for the affected and unaffected individuals.
Measuring the reliability of the principal components with Cronbach’s alpha
Interestingly, the Cronbach’s alphas for the ADI-R subareas were greater in the Finnish sample than in the AGRE sample, as shown in table 3. The weaker consistency of the subarea content could indicate that the AGRE sample is phenotypically more heterogeneous than the Finnish sample. For the SC and CC, the Cronbach’s alphas of the PCs were also greater in the Finnish sample than in the AGRE sample, indicating better internal consistency of these PCs in the Finnish sample; for the RSBC, however, the alpha in table 3 is greater in the AGRE sample (0.81) than in the Finnish sample (0.76). In the Finnish families, for all of the three PCs, the alphas were greater for the items correlating with the PCs than for the ones included in the equivalent ADI-R algorithm subareas. This indicates that the data content of the principal components was internally more consistent than the data in the ADI-R algorithm subareas, and thus, the reliability of the components was better. In the AGRE sample, the CC and RSBC also had greater alphas than the equivalent ADI-R algorithm subareas, whereas the alpha of SC was slightly lower.
In addition, the principal component scores and the equivalent ADI-R algorithm subarea totals were compared with Cronbach’s alphas. They varied from 0.45 to 0.75, indicating that the principal components do not describe exactly the same thing as the subareas. In particular, the RSBC consisted of many items that differ from those in the equivalent subarea of the algorithm (Table 1). This explains why the range of symptom severity overlapped so much in RSBC in all three clusters (Figure 2), although the cluster averages based on the ADI-R restricted and stereotyped behavior subarea differed from each other (Table 2).
Comparison of our methodology and results concerning PCA with others’ studies
Tadevosyan-Leyfer et al. [15] performed PCA on the questions or items common to ADI and ADI-R, and then used the results to cluster the items. However, they then removed items based on multiple criteria and performed the PCA again on the remaining items. No reliability scores were used to show the superiority of the modified PCs in comparison to the original ones. We did not want to remove any data, because doing so would lose some of the variation in the data. Besides, the correlation of an item with many components does not mean that it contains no important information for the components. Furthermore, the PCs are, by definition, independent of each other. In our analysis, the first three principal components accounted for 61% of variation in the Finnish data and 47% of variation in the AGRE data, while in Tadevosyan-Leyfer’s study, six clusters of items accounted for about 41% of variation. However, these values are not directly comparable, because they used the original ADI-R questions, while we used the questions in the ADI-R algorithm.
In some studies [22-24], the aim has been to find an optimal model describing the symptomatology of autism. Thus, the researchers wanted to make a model that would be simple to understand and would best describe the data, whereas we wanted to use existing phenotype data in a more efficient way for the purposes of association analyses. Different goals lead to different methods, as a model that describes the whole spectrum of the symptomatology well does not necessarily describe a certain phenotype as well.
CHR^a | SNP^b | EMP1^c | BP POSITION^d | SAMPLE |
---|---|---|---|---|
7 | rs362731 | 0.003643 | 102996264 | Finnish |
7 | rs362732 | 0.008088 | 102997304 | Finnish |
7 | rs123712 | 0.003961 | 103018526 | Finnish |
7 | rs2249372 | 0.004579 | 103025446 | Finnish |
7 | rs362644 | 0.000101 | 103039099 | Finnish |
7 | rs6958081 | 0.00439 | 103079061 | Finnish |
7 | rs3919520 | 0.002824 | 103079723 | Finnish |
7 | rs13239238 | 0.00345 | 103165523 | Finnish |
7 | rs1963647 | 0.001698 | 103231266 | Finnish |
7 | rs262375 | 0.006105 | 103240293 | Finnish |
7 | rs4727583 | 0.006032 | 103386180 | AGRE |
7 | rs4727582 | 0.006333 | 103386042 | AGRE |
7 | rs2109697 | 0.00717 | 103020445 | AGRE |
7 | rs6943822 | 0.007598 | 103385907 | AGRE |
7 | rs10247439 | 0.008886 | 103143532 | AGRE |
^a Chromosome
^b Single Nucleotide Polymorphism
^c Empirical p-value given by PLINK based on one million permutations
^d Base Pair Position from NCBI dbSNP build 125
Table 4: SNPs within RELN showing association with RSBC (p<0.01).
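The empirical p-values in Table 4 come from permutation testing in PLINK. As a rough illustration of the general idea only, the sketch below runs a plain label-permutation test between a genotype coding and a quantitative trait and reports the commonly used estimate (R + 1) / (N + 1), where R is the number of permuted statistics at least as extreme as the observed one and N is the number of permutations. It is not PLINK's family-based procedure; the function names, the test statistic and the toy data are assumptions made for this example.

```python
import random


def abs_covariance_sum(x, y):
    """Absolute value of the summed cross-products of deviations of x and y."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    return abs(sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y)))


def empirical_p(genotypes, phenotypes, n_perm=2000, seed=0):
    """Permutation p-value for association between genotype and trait.

    The trait values are shuffled across individuals, which ignores family
    structure (a simplification compared with a family-based test).
    """
    rng = random.Random(seed)
    observed = abs_covariance_sum(genotypes, phenotypes)
    shuffled = list(phenotypes)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(shuffled)
        if abs_covariance_sum(genotypes, shuffled) >= observed:
            hits += 1
    # Adding one to numerator and denominator keeps the estimate above zero.
    return (hits + 1) / (n_perm + 1)


# Toy data: allele counts (0, 1, 2) and a trait that rises with them.
toy_genotypes = [0, 0, 0, 0, 1, 1, 1, 1, 2, 2, 2, 2]
toy_trait = [1, 2, 1, 2, 5, 6, 5, 6, 9, 10, 9, 10]
print(empirical_p(toy_genotypes, toy_trait))
```

With this estimate, the smallest p-value that N permutations can produce is 1 / (N + 1), so one million permutations cannot resolve p-values much below about 0.000001.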
Factor analysis has also been used to find factors from ADI-R data [24]. However, we chose to use PCA for three reasons. We wanted to preserve the maximum amount of variation in the resulting components, and we wanted the resulting components to be independent of each other. Both properties are important when the aim is to use the components in genetic analyses. In factor analysis, the aim is not to preserve the maximal amount of variation, but to find factors consisting of items that correlate with each other as much as possible. The resulting factors are not necessarily independent either. In addition, in factor analysis, the factors are often rotated to find the best possible explanation of the factors. We did not want to subjectively influence the data by rotating the factors to correspond to any assumptions made beforehand.
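The two properties of PCA named above, that the leading components capture as much variance as possible and that the component scores are uncorrelated, can be checked numerically. The following is a minimal sketch using NumPy on synthetic data, not the ADI-R data or the software used in the study. Whether the original analysis standardized the items is not stated here, so the sketch only centers them; the function name `pca` and the simulated score matrix are assumptions.

```python
import numpy as np


def pca(data, n_components):
    """Return (scores, explained_variance_ratio) for the leading components."""
    # Center each column so the components describe variation around the mean.
    centered = data - data.mean(axis=0)
    # Singular value decomposition; singular values come sorted in
    # descending order, so the first columns are the leading components.
    u, s, vt = np.linalg.svd(centered, full_matrices=False)
    scores = u[:, :n_components] * s[:n_components]
    ratio = s ** 2 / np.sum(s ** 2)
    return scores, ratio[:n_components]


# Synthetic stand-in for an individuals-by-items score matrix:
# one shared latent trait plus item-specific noise.
rng = np.random.default_rng(0)
latent = rng.normal(size=(200, 1))
data = latent @ np.ones((1, 6)) + rng.normal(scale=0.5, size=(200, 6))

scores, ratio = pca(data, 3)

# Correlation matrix of the component scores; off-diagonal entries
# should be zero up to floating-point error.
corr = np.corrcoef(scores, rowvar=False)

print("explained variance ratio:", ratio)
print("cumulative:", ratio.sum())
```

The cumulative ratio corresponds to the kind of figure quoted for the first three components, and the unrotated scores are what would be carried forward as quantitative phenotypes.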
Family-based association analysis
Finally, the continuous PC scores were used as quantitative phenotypes in family-based association analysis. It was first performed for the Finnish sample, where two genes could be distinguished: Neuroligin 1 (NLGN1) and Reelin (RELN). Then the association analysis was repeated in the AGRE sample for 144 SNPs within NLGN1 and 103 SNPs within RELN. As we used existing data, the SNP panels in the Finnish and AGRE data did not contain overlapping SNPs. However, both SNP sets have been designed to tag common variation in the genes examined. Particularly in RELN, where the associated SNPs were located in the same haplotype block in the Finnish and AGRE samples, these SNPs tag the same genetic risk factors.
In the Finnish sample, the SNP rs1488545 in NLGN1 showed association with the SC (p=0.0004). Neuroligins are a family of genes encoding postsynaptic cell adhesion molecules, and mutations in the family have been reported before in ASDs and mental retardation [50,51]. The association of NLGN1 with SC improved from a previously reported association with the end-state diagnosis of autism in the same sample (formerly p=0.002 for rs1488545) [39], perhaps due to the use of the quantitative phenotype and inclusion of siblings. In contrast, the lack of association of NLGN1 in the AGRE sample, and the fact that the result in the Finnish sample did not endure Bonferroni correction, indicate that it might be a false positive. However, the different results in the two samples could also be explained by differences in the SC and in the phenotypic and genetic backgrounds of the samples.
Several SNPs in RELN showed association with RSBC (p<0.01) in the Finnish sample. The SNP rs362644 also endured the Bonferroni correction (uncorrected p=0.0001 and Bonferroni-corrected p=0.029). In the embryonic brain, RELN is involved in neuron guidance and radial cell migration, whereas in the adult brain it is implicated in synaptic plasticity and neurotransmission [52]. It is located at a well-replicated linkage peak for ASDs on chromosome 7q22 [53,54], and in addition to association between RELN and autism [55,56], association between RELN and schizophrenia and other neuropsychiatric disorders [52] has been reported before. Post-mortem studies of individuals with ASDs also indicate that RELN might have a role in the pathophysiology of autism [57,58].
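For readers unfamiliar with the correction, the Bonferroni method multiplies each uncorrected p-value by the number of tests performed and caps the result at 1. The sketch below is a minimal Python illustration with made-up p-values; the number of tests used in the study is not restated in this section, so the inputs and the function name `bonferroni` are hypothetical.

```python
def bonferroni(p_values):
    """Bonferroni-corrected p-values: p times the number of tests, capped at 1."""
    n_tests = len(p_values)
    return [min(1.0, p * n_tests) for p in p_values]


# Made-up uncorrected p-values for three hypothetical tests.
uncorrected = [0.01, 0.04, 0.5]
corrected = bonferroni(uncorrected)

# A test "endures" the correction if its corrected p-value stays below 0.05.
for p, p_corr in zip(uncorrected, corrected):
    print(p, p_corr, p_corr < 0.05)
```

Because the correction treats all tests as independent, it is conservative when neighboring SNPs are correlated, which is one reason a result can be suggestive without enduring it.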
Interestingly, in the same sample, no association of RELN was found in previous studies, in which only the end-state diagnosis was used. This could be due to the large overlap in the range of the symptom severity between patients and unaffected siblings. Our finding indicates that RELN is associated particularly with restricted and stereotyped behavior, whether it takes place in affected individuals or their siblings. Thus, this is an interesting example where the use of quantitative phenotypes is extremely important, and it further supports the demand for new analysis techniques in the research of complex neuropsychiatric disorders [59].
Many SNPs in RELN were associated with the RSBC in the AGRE sample too, although none of them endured the Bonferroni correction (after correction p>0.05). However, it is not entirely unexpected that the association of RELN with RSBC was stronger in the Finnish sample than in the AGRE sample. The Finnish population is isolated and genetically very homogeneous [27], while AGRE families have a heterogeneous background, as the families are from all over the United States, and it is only required that one of the parents is English-speaking [28]. The heterogeneity of the sample decreases the power of the association analyses and, in addition, the genetic background of the disorder in the AGRE sample could be slightly different from that of the homogeneous Finnish sample. Indeed, in a recent review, autism was described as a behavioral manifestation of tens, or perhaps hundreds, of genetic and genomic disorders [60]. Furthermore, the AGRE sample has a bias towards multiplex families [28], whereas the Finnish sample consists mostly of simplex families, and it has been reported that there are differences in the rates of copy number variations and autism-related traits in non-affected family members of simplex and multiplex families [28]. This gives further support to the possibility of slightly different genetic backgrounds of the Finnish and AGRE samples. In addition, there were small differences in the content of RSBC between the two samples. In the AGRE data, the biggest correlation in RSBC was with the item compulsions/rituals, whereas in the Finnish sample, it was with the item hand and finger movements. Cuccaro et al. [61] have reported a study which suggests that these two items might belong to two different components within restricted and repetitive behavior: repetitive sensory motor actions and resistance to change. All SNPs within RELN associated with RSBC (p<0.01), in either the Finnish or the AGRE sample, are shown in Table 4.
Not much is known about the role of RELN in the neurobiological background of ASDs. Therefore, our findings indicating an association of RELN with restricted and stereotyped behavior might prove important in planning future studies, both in molecular genetics and in animal model studies of complex neuropsychiatric disorders. Age-dependent decrease in prepulse inhibition of startle has been reported in heterozygous reeler mice, a mouse model with an autosomal recessive mutation in RELN [62], and it has been demonstrated that adults with autism exhibit a decrease in prepulse inhibition, which is correlated with increased ratings of restricted and repetitive behavior [63].
We showed in this study that PCA of ADI-R algorithm items led to relevant PCs, as evaluated by correlation analysis, K-means clustering and reliability measurements. Especially in the Finnish sample, the unaffected individuals were clearly separated into one cluster by the K-means clustering. In addition, K-means clustering showed that siblings also had variation in the symptom severity. Especially in SC, the total range of the symptom severity was increased by siblings. The association of NLGN1 with SC was increased, compared to a previous study where only the end-state diagnosis was used. In RSBC, the range of the symptom severity of siblings overlapped greatly with that of patients, which could explain why no association of RELN was found in previous studies, in which only the end-state diagnosis was used, but a significant association of RELN with RSBC was now found in the Finnish families (Bonferroni-corrected p=0.029 for rs362644). Thus, our study also shows that in some situations, it is extremely important to use quantitative phenotypes.
We also compared the findings in the isolated and homogeneous Finnish sample to a much larger, but also much more heterogeneous AGRE sample. The association of RELN with RSBC in the Finnish sample was stronger than in the AGRE sample, which could be explained by the heterogeneity of the AGRE sample and differences in the genetic backgrounds of the samples. In addition, the small differences in the content of the RSBC could also affect the results of the association analysis.
In future studies, the role of RELN in restricted and stereotyped behavior should be further studied, both in ASDs and in other neuropsychiatric disorders like schizophrenia, Tourette syndrome and obsessive-compulsive disorder. In addition, PCs with smaller eigenvalues can also contain important phenotypic information, and should be investigated. Finally, the proposed approach should be used in broader genome-wide association analyses.
We gratefully acknowledge the resources provided by the Autism Genetic Resource Exchange (AGRE) Consortium* and the participating AGRE families. The Autism Genetic Resource Exchange is a program of Autism Speaks, and is supported, in part, by grant 1U24MH081810 from the National Institute of Mental Health to Clara M. Lajonchere (PI).
*The AGRE Consortium: Dan Geschwind, M.D., Ph.D., UCLA, Los Angeles, CA; Maja Bucan, Ph.D., University of Pennsylvania, Philadelphia, PA; W. Ted Brown, M.D., Ph.D., F.A.C.M.G., N.Y.S. Institute for Basic Research in Developmental Disabilities, Staten Island, NY; Rita M. Cantor, Ph.D., UCLA School of Medicine, Los Angeles, CA; John N. Constantino, M.D., Washington University School of Medicine, St. Louis, MO; T. Conrad Gilliam, Ph.D., University of Chicago, Chicago, IL; Martha Herbert, M.D., Ph.D., Harvard Medical School, Boston, MA; Clara Lajonchere, Ph.D., Cure Autism Now, Los Angeles, CA; David H. Ledbetter, Ph.D., Emory University, Atlanta, GA; Christa Lese-Martin, Ph.D., Emory University, Atlanta, GA; Janet Miller, J.D., Ph.D., Cure Autism Now, Los Angeles, CA; Stanley F. Nelson, M.D., UCLA School of Medicine, Los Angeles, CA; Gerard D. Schellenberg, Ph.D., University of Washington, Seattle, WA; Carol A. Samango-Sprouse, Ed.D., George Washington University, Washington, D.C.; Sarah Spence, M.D., Ph.D., UCLA, Los Angeles, CA; Matthew State, M.D., Ph.D., Yale University, New Haven, CT; Rudolph E. Tanzi, Ph.D., Massachusetts General Hospital, Boston, MA.