Developing an efficient imputation pipeline to construct near-complete genome variant information in GWAS datasets

Project Leader: Aarno Palotie

PostDoc: Priit Palta

The project aims to use population-specific whole-genome and whole-exome sequence data as a backbone for imputing low-frequency variants in Estonian and Finnish population cohort GWAS data, and to use the imputed data in association analyses of register-based diagnostic outcomes such as cancers and comorbidities.
Over the past eight years, genome-wide variant data have been accumulated from large sample collections at all NIASC sites. Currently the two performance sites, Tartu and Helsinki, have accumulated GWAS data from 70 000 Finnish and 20 000 Estonian individuals, and whole-exome or whole-genome sequence data from 16 000 Finnish and 1 800 Estonian individuals. These large datasets provide a substantial resource for association studies. To use all genome-wide variant data efficiently, we would also like to include low-frequency variants in the outcome association analysis. To achieve this, we have to impute the non-genotyped variants in the GWAS data. Although HapMap and 1000 Genomes data provide a foundation and a standardized imputation backbone, there is increasing evidence that these panels are not sufficient for low-frequency variants. Population-specific sequence data substantially improve imputation accuracy for variants with a population frequency under 5%. Estonia and Finland are historically and linguistically closely related. Comparing low-frequency variant association data between these two populations is thus especially interesting and potentially beneficial. As replication is challenging for low-frequency variants, we hypothesize that similarly imputed datasets from two related populations would be helpful, because haplotypes are more likely to be shared between them.
The Haplotype Reference Consortium, led by Goncalo Abecasis, Jonathan Marchini and Richard Durbin, is currently constructing a haplotype catalogue based on available whole-genome data. This will further improve our imputation accuracy. However, as most of the haplotype project uses low-coverage sequence data (2-6x), the calling accuracy for rare variants will remain limited. This is especially challenging for variants that are rare in the general European population but are enriched through bottleneck effects in either the Finnish or the Estonian population. As is well documented, the Finnish bottleneck effects are strong, resulting in enrichment of some low-frequency variants that are very rare elsewhere. Some of these variants contribute to disease phenotypes but are so rare in most populations that they are not within reach of disease association studies. However, when enriched in an isolate like Finland, their frequency might be boosted to 0.5-5%, as demonstrated in Figure 2 below, so that they become analyzable targets for disease association. Notably, within the range of 0.5-5% population frequency in Finland there is an excess of loss-of-function (LoF) variants. LoF variants are of special interest in association studies as they represent human knockouts. In our recent study by Lim et al. (PLoS Genetics 2014 Jul 31;10(7):e1004494) we analyzed 83 LoF variants enriched in Finland and linked them to National Health Record data. We identified several disease associations, including a LoF variant in the LPA gene that is protective against coronary heart disease. Protective LoF variants point to interesting potential drug targets and are thus of special value.