1.1 INTRODUCTION
Modern human genetic diversity is the result of the emergence of new variants by mutation, the demographic history of humanity as a whole, and selective effects that have acted to adapt different populations to their environments. Extant patterns of diversity at the global level are now considered mainly the legacy of the evolution and dispersal of anatomically modern humans as described by the Out of Africa model. On top of this, more local processes, such as migration, admixture, adaptation, isolation, and drift, have molded the gene pools of local human groups, generating a kaleidoscopic distribution of variants across space.
Human genetic diversity has long been explored at the protein level, by characterizing individuals and populations for electrophoretically or serologically detectable variants (Cavalli-Sforza et al., 1994). In just a few decades of DNA sequencing, the milestone at which approximately 0.1% of living humans will have had their genomes resequenced to some degree is being reached, while resequencing of the genomes of our ancestors and other hominins is reshaping our understanding of human history (Shendure et al., 2017). This has produced catalogs of genetic variants that are growing at a remarkable pace and are freely available to the scientific community. These variations can be of several types, from simple substitutions that do not affect sequence length, to those that result in minor length differences, to those that affect multiple genes and multiple chromosomes (Kitts et al., 2014). In this chapter, we discuss some key aspects of genetic variation: its molecular bases, quantitative impact, and population distribution. The immense bibliography produced on these issues in recent years prevents any comprehensive presentation of the works, and the references in the text should all be considered illustrative. We gave preference to seminal reviews, as recent as possible, in which the reader will find pointers to a wealth of additional reading.
Sequence variation is of scientific interest to a variety of disciplines. Population geneticists analyze genetic diversity to work out phenomena as diverse as the descent of human groups (including introgression from archaic hominins [Sankararaman et al., 2014]) and the effects, duration, and intensity of natural selection on different portions of the genome, possibly in response to specific environmental conditions (Itan et al., 2010; Brown 2012; Yi et al., 2010; Hancock et al., 2011; Perry 2014). Genetic mapping of Mendelian traits in humans could only be pursued by linking specific traits to genetic variations spontaneously present in segregating pedigrees (Strachan and Read, 2011). Historically, the need to advance the mapping of human genes has been a main driver for improving the description of genetic variation in all parts of the genome. Additionally, the investigation of relationships between variation and phenotype has leveraged the available catalogs of variants along at least three main lines: to analyze the association between variant alleles and phenotypes in cohorts of unrelated cases and controls, according to the so-called Genome-wide Association Study (GWAS) approach (Altshuler et al., 2008; Mackay et al., 2009; Rosenberg et al., 2010); to obtain a compilation of coding variants by sequencing massive numbers of exomes (Lek et al., 2016); and to establish precise relationships between the presence of specific variant alleles and the level of expression of genes in a large array of tissues (GTEx Consortium 2017).
Progress in the description of DNA variation has been heavily dependent on the scaling up of the power of typing technologies (including genotyping arrays, exome capture, and massively parallel sequencing). However, its impact on everyday practice would have been minor without the parallel development of suitable, easily accessible, and searchable catalogs. Different databases have been implemented, each tailored to the specific features of a different type of variant. These have now been integrated into genome browsers that allow the occurrence and organization of variants to be visualized on the genome reference sequence (see Table 1.1).
The main leap forward toward a genome-wide description of variation at the level of DNA sequence was produced by the 1000 Genomes Project (1KGP), launched to discover, genotype, and provide accurate haplotype information on all forms of human DNA polymorphism in multiple human populations. Specifically, the goal was to characterize over 95% of variants that have an allele frequency of 1% or higher (the classical definition of polymorphism) in each of five major population groups (populations with ancestry from Europe, East Asia, South Asia, West Africa, and the Americas) (The 1000 Genomes Project Consortium 2010). Over its course, the Project grew both in the number of populations (and hence subjects) and in depth of sequencing, reaching 2504 subjects at a mean depth of 7.4x (The 1000 Genomes Project Consortium 2015). The results were reported separately for molecularly distinct sources of genetic variation. In the rest of this text we will keep this distinction, referring mainly to the results of this study.
1.2 SINGLE-NUCLEOTIDE POLYMORPHISMS (SNPS)
An SNP is a variation, typically of a single base position in DNA, in which the less common form (allele) has a frequency of at least 1% in the population (see above). In practice, the acronym SNP has now been extended to include variants that have been observed in a single instance (singletons), variants consisting of changes in several consecutive nucleotides, and variants consisting of the presence/absence of one or a few nucleotides (small indels). The majority of SNPs are biallelic, that is, there is only a reference and an alternative base at the variable position. Only multiple mutational hits can generate SNPs with three or four allelic forms. Note that the reference allele is the base represented in the genome reference sequence and is not necessarily the most frequent or the ancestral allele. To identify the ancestral allele, a comparison with an outgroup (nonhuman) species is necessary to determine which allele is shared between the two. This may reveal that the ancestral allele is the reference, or the alternative, or, in some cases, a third allele not observed in humans.
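The two criteria described above, the 1% frequency threshold and the outgroup comparison, can be illustrated with a minimal Python sketch. The function names and the toy data are hypothetical and chosen only for illustration; the sketch assumes a single outgroup base per site, which is a simplification, since inferring the ancestral state in practice usually draws on more than one outgroup species.

```python
from collections import Counter


def allele_frequencies(alleles):
    """Return a dict mapping each observed allele to its frequency.

    `alleles` is a list with one entry per sampled chromosome.
    """
    counts = Counter(alleles)
    total = sum(counts.values())
    return {allele: n / total for allele, n in counts.items()}


def minor_allele_frequency(freqs):
    """Frequency of the second most common allele (0.0 if monomorphic)."""
    ordered = sorted(freqs.values(), reverse=True)
    return ordered[1] if len(ordered) > 1 else 0.0


def is_polymorphism(freqs, threshold=0.01):
    """Classical definition: the less common allele reaches at least 1%."""
    return minor_allele_frequency(freqs) >= threshold


def classify_ancestral(ref, alts, outgroup_base):
    """Say which human allele, if any, matches the outgroup base.

    Assumption: the outgroup base is taken as the ancestral state.
    """
    if outgroup_base == ref:
        return "reference"
    if outgroup_base in alts:
        return "alternative"
    return "not observed in humans"


# Toy site: 200 sampled chromosomes, 3 of which carry G instead of A.
site = ["A"] * 197 + ["G"] * 3
freqs = allele_frequencies(site)
print(is_polymorphism(freqs))                 # minor allele frequency is 3/200
print(classify_ancestral("A", ["G"], "G"))    # outgroup carries the alternative
```

In this toy example the reference allele A is the most frequent one, yet the outgroup carries G, so the alternative allele would be inferred as ancestral. This is the situation the text warns about: the reference allele is not necessarily the ancestral one.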