3. |
- Anava, Sarit, et al.
(author)
-
Illuminating Genetic Mysteries of the Dead Sea Scrolls
- 2020
-
In: Cell. - : CELL PRESS. - 0092-8674 .- 1097-4172. ; 181:6, pp. 1218-
-
Journal article (peer-reviewed)
- The discovery of the 2,000-year-old Dead Sea Scrolls had an incomparable impact on the historical understanding of Judaism and Christianity. "Piecing together" scroll fragments is like solving jigsaw puzzles with an unknown number of missing parts. We used the fact that most scrolls are made from animal skins to "fingerprint" pieces based on DNA sequences. Genetic sorting of the scrolls illuminates their textual relationship and historical significance. Disambiguating the contested relationship between Jeremiah fragments supplies evidence that some scrolls were brought to the Qumran caves from elsewhere; significantly, they demonstrate that divergent versions of Jeremiah circulated in parallel throughout Israel (ancient Judea). Similarly, patterns discovered in non-biblical scrolls, particularly the Songs of the Sabbath Sacrifice, suggest that the Qumran scrolls represent the broader cultural milieu of the period. Finally, genetic analysis divorces debated fragments from the Qumran scrolls. Our study demonstrates that interdisciplinary approaches enrich the scholar's toolkit.
|
|
4. |
- Ausmees, Kristiina, et al.
(author)
-
An empirical evaluation of genotype imputation of ancient DNA
- 2022
-
In: G3. - : Oxford University Press. - 2160-1836. ; 12:6
-
Journal article (peer-reviewed)
- With capabilities of sequencing ancient DNA to high coverage often limited by sample quality or cost, imputation of missing genotypes presents a possibility to increase the power of inference as well as cost-effectiveness for the analysis of ancient data. However, the high degree of uncertainty often associated with ancient DNA poses several methodological challenges, and performance of imputation methods in this context has not been fully explored. To gain further insights, we performed a systematic evaluation of imputation of ancient data using Beagle v4.0 and reference data from phase 3 of the 1000 Genomes project, investigating the effects of coverage, phased reference, and study sample size. Making use of five ancient individuals with high-coverage data available, we evaluated imputed data for accuracy, reference bias, and genetic affinities as captured by principal component analysis. We obtained genotype concordance levels of over 99% for data with 1× coverage, and similar levels of accuracy and reference bias at levels as low as 0.75×. Our findings suggest that using imputed data can be a realistic option for various population genetic analyses even for data in coverage ranges below 1×. We also show that a large and varied phased reference panel as well as the inclusion of low- to moderate-coverage ancient individuals in the study sample can increase imputation performance, particularly for rare alleles. In-depth analysis of imputed data with respect to genetic variants and allele frequencies gave further insight into the nature of errors arising during imputation, and can provide practical guidelines for postprocessing and validation prior to downstream analysis.
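The genotype concordance reported in this abstract can be illustrated with a minimal sketch. It assumes genotypes coded as alternate-allele counts (0, 1, 2) and treats missing calls as skipped; the function name and the toy data are hypothetical and are not taken from the paper's evaluation pipeline.

```python
def genotype_concordance(truth, imputed):
    """Fraction of compared sites where the imputed genotype equals the
    high-coverage ("truth") genotype.

    Genotypes are alternate-allele counts (0, 1, 2). None marks a site that
    is missing in either call set; such sites are skipped. This coding is an
    assumption made for illustration.
    """
    compared = [(t, i) for t, i in zip(truth, imputed)
                if t is not None and i is not None]
    if not compared:
        raise ValueError("no overlapping genotyped sites")
    matches = sum(1 for t, i in compared if t == i)
    return matches / len(compared)


# Hypothetical toy data: 8 sites, one missing in the truth set, one mismatch.
truth = [0, 1, 2, 0, 1, None, 2, 0]
imputed = [0, 1, 2, 0, 2, 1, 2, 0]
concordance = genotype_concordance(truth, imputed)
print(f"concordance = {concordance:.3f}")
```

Overall concordance is dominated by common homozygous-reference sites, so an evaluation of rare alleles would need to stratify this measure by allele frequency.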
|
|
5. |
- Ausmees, Kristiina
(author)
-
Efficient computational methods for applications in genomics
- 2019
-
Licentiate thesis (other academic/artistic)
- During the last two decades, advances in molecular technology have facilitated the sequencing and analysis of ancient DNA recovered from archaeological finds, contributing to novel insights into human evolutionary history. As more ancient genetic information has become available, the need for specialized methods of analysis has also increased. In this thesis, we investigate statistical and computational models for analysis of genetic data, with a particular focus on the context of ancient DNA. The main focus is on imputation, or the inference of missing genotypes based on observed sequence data. We present results from a systematic evaluation of a common imputation pipeline on empirical ancient samples, and show that imputed data can constitute a realistic option for population-genetic analyses. We also discuss preliminary results from a simulation study comparing two methods of phasing and imputation, which suggest that the parametric Li and Stephens framework may be more robust to extremely low levels of sparsity than the parsimonious Browning and Browning model. An evaluation of methods to handle missing data in the application of PCA for dimensionality reduction of genotype data is also presented. We illustrate that non-overlapping sequence data can lead to artifacts in projected scores, and evaluate different methods for handling unobserved genotypes. In genomics, as in other fields of research, increasing sizes of data sets are placing larger demands on efficient data management and compute infrastructures. The last part of this thesis addresses the use of cloud resources for facilitating such analysis. We present two different cloud-based solutions, and exemplify them on applications from genomics.
|
|
6. |
- Ausmees, Kristiina
(author)
-
Methodology and Infrastructure for Statistical Computing in Genomics : Applications for Ancient DNA
- 2022
-
Doctoral thesis (other academic/artistic)
- This thesis concerns the development and evaluation of computational methods for analysis of genetic data. A particular focus is on ancient DNA recovered from archaeological finds, the analysis of which has contributed to novel insights into human evolutionary and demographic history, while also introducing new challenges and the demand for specialized methods. A main topic is that of imputation, or the inference of missing genotypes based on observed sequence data. We present results from a systematic evaluation of a common imputation pipeline on empirical ancient samples, and show that imputed data can constitute a realistic option for population-genetic analyses. We also develop a tool for genotype imputation that is based on the full probabilistic Li and Stephens model for haplotype frequencies and show that it can yield improved accuracy on particularly challenging data. Another central subject in genomics and population genetics is that of data characterization methods that allow for visualization and exploratory analysis of complex information. We discuss challenges associated with performing dimensionality reduction of genetic data, demonstrating how the use of principal component analysis is sensitive to incomplete information and performing an evaluation of methods to handle unobserved genotypes. We also discuss the use of deep learning models as an alternative to traditional methods of data characterization in genomics and propose a framework based on convolutional autoencoders that we exemplify on the applications of dimensionality reduction and genetic clustering. In genomics, as in other fields of research, increasing sizes of data sets are placing larger demands on efficient data management and compute infrastructures. The final part of this thesis addresses the use of cloud resources for facilitating data analysis in scientific applications. We present two different cloud-based solutions, and exemplify them on applications from genomics.
|
|
7. |
- Babiker, Hiba, et al.
(author)
-
Genetic variation and population structure of Sudanese populations as indicated by 15 Identifiler sequence-tagged repeat (STR) loci.
- 2011
-
In: Investigative Genetics. - : Springer Science and Business Media LLC. - 2041-2223. ; 2:1
-
Journal article (peer-reviewed)
- BACKGROUND: There is substantial ethnic, cultural and linguistic diversity among the people living in east Africa, Sudan and the Nile Valley. The region around the Nile Valley has a long history of succession of different groups, coupled with demographic and migration events, potentially leading to genetic structure among humans in the region. RESULTS: We report the genotypes of the 15 Identifiler microsatellite markers for 498 individuals from 18 Sudanese populations representing different ethnic and linguistic groups. The combined power of exclusion (PE) was 0.9999981, and the combined match probability was 1 in 7.4 × 10^17. The genotype data from the Sudanese populations was combined with previously published genotype data from Egypt, Somalia and the Karamoja population from Uganda. The Somali population was found to be genetically distinct from the other northeast African populations. Individuals from northern Sudan clustered together with those from Egypt, and individuals from southern Sudan clustered with those from the Karamoja population. The similarity of the Nubian and Egyptian populations suggests that migration, potentially bidirectional, occurred along the Nile River Valley, which is consistent with the historical evidence for long-term interactions between Egypt and Nubia. CONCLUSION: We show that despite the levels of population structure in Sudan, standard forensic summary statistics are robust tools for personal identification and parentage analysis in Sudan. Although some patterns of population structure can be revealed with 15 microsatellites, a much larger set of genetic markers is needed to detect fine-scale population structure in east Africa and the Nile Valley.
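The combined forensic statistics quoted in this abstract are conventionally obtained by multiplying per-locus values under the assumption that loci are independent (the product rule). The sketch below shows that arithmetic only; the per-locus numbers are hypothetical and are not the Sudanese estimates, and the abstract does not state how the authors computed their values.

```python
from math import prod


def combined_power_of_exclusion(per_locus_pe):
    """Combined PE = 1 - product over loci of (1 - PE_i).

    Assumes independent loci (product rule); an illustrative assumption.
    """
    return 1.0 - prod(1.0 - pe for pe in per_locus_pe)


def combined_match_probability(per_locus_mp):
    """Combined match probability = product of per-locus match probabilities.

    Also assumes independence between loci.
    """
    return prod(per_locus_mp)


# Hypothetical per-locus values for three loci (not from the paper).
pe_values = [0.5, 0.6, 0.7]
mp_values = [0.1, 0.05, 0.02]

cpe = combined_power_of_exclusion(pe_values)
cmp_ = combined_match_probability(mp_values)
print(f"combined PE = {cpe:.4f}")
print(f"combined match probability = 1 in {1 / cmp_:.0f}")
```

With 15 loci the same products shrink or grow rapidly, which is how a value such as 1 in 7.4 × 10^17 can arise from modest per-locus figures.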
|
|
8. |
- Blum, Michael G. B., et al.
(author)
-
Deep Divergences of Human Gene Trees and Models of Human Origins
- 2011
-
In: Molecular biology and evolution. - : Oxford University Press (OUP). - 0737-4038 .- 1537-1719. ; 28:2, pp. 889-898
-
Journal article (peer-reviewed)
- Two competing hypotheses are at the forefront of the debate on modern human origins. In the first scenario, known as the recent Out-of-Africa hypothesis, modern humans arose in Africa about 100,000-200,000 years ago and spread throughout the world by replacing the local archaic human populations. By contrast, the second hypothesis posits substantial gene flow between archaic and emerging modern humans. In the last two decades, the young time estimates (between 100,000 and 200,000 years) of the most recent common ancestors for the mitochondrion and the Y chromosome provided evidence in favor of a recent African origin of modern humans. However, the presence of very old lineages for autosomal and X-linked genes has often been claimed to be incompatible with a simple, single origin of modern humans. Through the analysis of a public DNA sequence database, we find, similar to previous estimates, that the common ancestors of autosomal and X-linked genes are indeed very old, living, on average, respectively, 1,500,000 and 1,000,000 years ago. However, contrary to previous conclusions, we find that these deep gene genealogies are consistent with the Out-of-Africa scenario provided that the ancestral effective population size was approximately 14,000 individuals. We show that an ancient bottleneck in the Middle Pleistocene, possibly arising from an ancestral structured population, can reconcile the contradictory findings from the mitochondrion on the one hand, with the autosomes and the X chromosome on the other hand.
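A back-of-the-envelope check shows why an effective size of about 14,000 is compatible with the ages quoted in this abstract. Under the standard neutral coalescent for a constant-size diploid population, the expected time to the most recent common ancestor of a large sample is roughly 4N generations for autosomes and 3N for the X chromosome (assuming an equal sex ratio). The 25-year generation time below is an assumption, and the function is a hypothetical illustration rather than the authors' inference method, which also models a bottleneck.

```python
def expected_tmrca_years(n_e, generation_years, ploidy_factor, sample_size=None):
    """Expected TMRCA in years under a constant-size neutral coalescent.

    ploidy_factor is 4 for autosomes and 3 for X-linked loci (equal sex
    ratio assumed). If sample_size is given, the finite-sample correction
    (1 - 1/n) is applied; otherwise the large-sample limit is returned.
    """
    generations = ploidy_factor * n_e
    if sample_size is not None:
        generations *= 1.0 - 1.0 / sample_size
    return generations * generation_years


N_E = 14_000           # effective size stated in the abstract
GENERATION_YEARS = 25  # assumed generation time, not given in the abstract

autosomal = expected_tmrca_years(N_E, GENERATION_YEARS, ploidy_factor=4)
x_linked = expected_tmrca_years(N_E, GENERATION_YEARS, ploidy_factor=3)
print(f"autosomal: {autosomal:,} years")
print(f"X-linked:  {x_linked:,} years")
```

Under these assumptions the two expectations come out at 1.4 and 1.05 million years, the same order as the 1.5 and 1.0 million years reported for autosomal and X-linked genes.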
|
|
9. |
- Breton, Gwenna, et al.
(author)
-
Comparison of sequencing data processing pipelines and application to underrepresented African human populations
- 2021
-
In: BMC Bioinformatics. - : BioMed Central (BMC). - 1471-2105. ; 22:1
-
Journal article (peer-reviewed)
- Background: Population genetic studies of humans make increasing use of high-throughput sequencing in order to capture diversity in an unbiased way. Sequencing technologies and bioinformatic tools are abundant, and the number of available genomes is increasing. Studies have evaluated and compared some of these technologies and tools, such as the Genome Analysis Toolkit (GATK) and its "Best Practices" bioinformatic pipelines. However, studies often focus on a few genomes of Eurasian origin in order to detect technical issues. We instead surveyed the use of the GATK tools and established a pipeline for processing high-coverage full genomes from a diverse set of populations, including Sub-Saharan African groups, in order to reveal challenges from human diversity and stratification. Results: We surveyed 29 studies using high-throughput sequencing data, and compared their strategies for data pre-processing and variant calling. We found that processing of data is very variable across studies and that the GATK "Best Practices" are seldom followed strictly. We then compared three versions of a GATK pipeline, differing in the inclusion of an indel realignment step and with a modification of the base quality score recalibration step. We applied the pipelines to a diverse set of 28 individuals. We compared the pipelines in terms of count of called variants and overlap of the callsets. We found that the pipelines resulted in similar callsets, in particular after callset filtering. We also ran one of the pipelines on a larger dataset of 179 individuals. We noted that including more individuals at the joint genotyping step resulted in different counts of variants. At the individual level, we observed that the average genome coverage was correlated with the number of variants called. Conclusions: We conclude that applying the GATK "Best Practices" pipeline, including their recommended reference datasets, to underrepresented populations does not lead to a decrease in the number of called variants compared to alternative pipelines. We recommend aiming for coverage of > 30X if identifying most variants is important, and working with large sample sizes at the variant calling stage, also for underrepresented individuals and populations.
|
|
10. |
- Breton, Gwenna, et al.
(author)
-
Comparison of sequencing data processing pipelines and application to underrepresented human populations
-
In: BMC Bioinformatics. - 1471-2105.
-
Journal article (peer-reviewed)
- Population genetic studies of humans make increasing use of high-throughput sequencing in order to capture human diversity in an unbiased way. Sequencing technologies and bioinformatic tools are abundant, and the number of available genomes is increasing. Studies have evaluated and compared some of these technologies and tools, such as the Genome Analysis Toolkit (GATK) and its "Best Practices" bioinformatic pipelines. However, studies often focus on a few genomes of Eurasian origin in order to detect technical issues. We instead surveyed the use of the GATK tools and established a pipeline for processing high-coverage full genomes from a diverse set of populations, including Sub-Saharan African groups, in order to reveal challenges from human diversity and stratification. We started by surveying 29 studies using high-throughput sequencing data, and compared their strategies for data pre-processing and variant calling. We found that processing of data is very variable across studies, that the GATK "Best Practices" are seldom followed strictly and that processing pipelines are often not reported in full detail. We then compared three versions of the GATK pipeline, differing in the inclusion of an indel realignment step and with a modification of the base quality score recalibration step. We applied the pipelines to a diverse set of 28 individuals. We compared the pipelines in terms of count of called variants and overlap of the callsets. We found that the pipelines resulted in similar callsets, in particular after callset filtering. We also ran one of the pipelines on a larger dataset of 179 individuals. We noted that including more individuals at the joint genotyping step resulted in different counts of variants. At the individual level, we observed that the average genome coverage was correlated with the number of variants called. We conclude that applying the GATK "Best Practices" pipeline, including their recommended reference datasets, to underrepresented populations does not lead to a decrease in the number of called variants compared to alternative pipelines. We recommend aiming for a coverage of >30X, and working with large sample sizes at the variant calling stage, also for underrepresented individuals and populations.
|
|