Garrido-Ramos MA (ed): Repetitive DNA.
Genome Dyn. Basel, Karger, 2012, vol 7, pp 1–28
______________________
The Repetitive DNA Content of Eukaryotic Genomes
I. López-Flores · M.A. Garrido-Ramos
Departamento de Genética, Facultad de Ciencias, Universidad de Granada, Granada, Spain
______________________
Abstract
Eukaryotic genomes are composed of both unique and repetitive DNA sequences. These latter form families of different classes that may be organized in tandem or may be dispersed within genomes with a moderate to high degree of repetitiveness. The repetitive DNA fraction may represent a high proportion of a particular genome due to correlation between genome size and abundance of repetitive sequences, which would explain the differences in genomic DNA contents of different species. In this review, we analyze repetitive DNA diversity and abundance as well as its impact on genome structure, function, and evolution.
Copyright © 2012 S. Karger AG, Basel
The Repetitive Fraction of Eukaryotic Genomes
Pioneering work by Britten and Kohne [1] revealed that, in addition to unique sequences, eukaryotic genomes contain large quantities of repetitive DNA, classified into moderately or highly repetitive sequences according to their degree of repetitiveness. Later, the repetitive DNA sequences were grouped according to other criteria such as their organization (tandemly arrayed or dispersed) or their functional role. Although repetitive DNA sequences include several types of RNA or protein-coding sequences, most of the repetitive part of the genome was earlier considered “junk DNA” with no known function. Today, with many genomes completely sequenced and more than 40 years of background research, we have ample information on the significance of the repetitive DNA within eukaryotic genomes, and concepts are changing. Figure 1 shows a classification of the several types of repetitive DNA according to an organizational criterion, which has been followed in this review. Among tandem repetitive DNA, there are moderately repetitive DNAs, such as ribosomal RNA (rRNA) and protein-coding gene families or short tandem telomeric repeats, as well as highly repetitive non-coding microsatellite and satellite DNAs, including centromeric DNA. Among dispersed repeats, transposable elements (TEs) such as DNA transposons and retrotransposons (mainly long terminal repeat (LTR) retrotransposons and long interspersed elements, LINEs) stand out, constituting a fraction of highly repetitive DNA as a whole. In addition, genomes contain retrotransposed sequences such as short interspersed elements (SINEs; moderately to highly repetitive DNA), retrogenes and retropseudogenes, as well as several gene families composed of dispersed members (moderately repetitive DNA). Moreover, many genomes are characterized by segmental duplications (SDs), duplicated DNA fragments greater than 1 kb, with both dispersed and tandem organization.
Gene Families
Gene families are groups of paralogous genes, typically exhibiting related sequences and functions. A gene family is produced when a single gene is copied one or more times by a gene-duplication event, such as whole-genome duplication (ancient polyploidy is common in plant lineages and is considered a key factor in eukaryote evolution) and SD (see below). Over time, duplications may occur several times and produce many copies of a particular gene. Family sizes range from 2 members up to several hundred [2]. Depending on their organization, gene families are classified into dispersed and tandem gene families. Dispersed gene families include, for example, the olfactory receptor genes of mammals (forming the largest known multigene family in the human genome: 802 genes, 388 potentially functional and 414 apparent pseudogenes), the MADS box genes, the fatty acid-binding protein genes, and the tRNA genes (see [3] for references). Among tandem gene families, some examples are globins, histones, and rRNA genes.
Ribosomal RNA genes (rDNA) are probably the best-known example of a multigene family. rRNA plays a vital role in protein synthesis, as it constitutes the main structural and catalytic component of the ribosomes. In most eukaryotes, rDNA consists of tandemly arrayed repeat units, containing 3 of the 4 genes encoding nuclear rRNA, located in the nucleolar organizer region (NOR) on 1 or more chromosomes. Each repeat unit contains the 28S large subunit, the 18S small subunit, and the 5.8S genes, as well as 2 external transcribed spacers (ETS), 2 internal transcribed spacers (ITS1 and ITS2), and a large non-transcribed spacer (NTS). Thus, the nuclear rRNA genes are typically arranged as a 5′-ETS-18S-ITS1-5.8S-ITS2-28S-ETS-3′ transcription unit, organized in tandem repeats and separated by the NTS. The ETS plus the NTS constitute the intergenic spacer (IGS). This is known as the major rDNA family. The number of repeat units varies between eukaryotes, from 39 to 19,300 in animals and from 150 to 26,000 in plants [4]. The different components forming rDNA are known to evolve generally at different rates. The 18S rDNA is among the slowest-evolving genes found in living organisms, in contrast to the spacers, which are rapidly evolving sequences (they are not subject to selective constraints), with the NTS evolving faster than the ITSs and ETSs [2]. The 28S rRNA gene also evolves relatively slowly. Because the components of the rRNA gene complex evolve at different rates, they have different phylogenetic utilities. The 18S and 28S rRNA genes allow the inference of phylogenetic history across a broad taxonomic range, whereas the spacers can be useful in determining relationships between closely related species, sometimes intraspecific relationships, and at times have been suitable for population studies. Nucleotide sequences of spacers are very similar among repeats of the same species but differ greatly between species.
The model of concerted evolution, in which the individual repeats do not evolve independently, should explain this observation (see below). Instead, molecular drive tends to homogenize repeated sequences within genomes and among the genomes of an entire species, leading to divergence between species [5]. However, nucleotide sequences of the rRNA coding regions are almost identical between closely related species, and they are similar even among distantly related species. This similarity should be maintained by strong purifying selection operating on the coding regions. Thus, we can explain the entire set of observations concerning the rRNA gene family in terms of mutation, homogenization, and purifying selection [3]. The fourth rRNA gene is the gene encoding 5S rRNA, which forms another family known as the minor rDNA family, comprising tandem repetitions of the gene separated by an NTS. In most eukaryotes, the 5S rRNA genes are found at another location of the nuclear genome, although, for example, in sturgeons the 2 rDNA families are on the same chromosome pair, and in some species of protozoa, fungi, and algae the 5S ribosomal genes are located between the 28S and the 18S genes (within the IGS) [6]. The 5S rRNA genes were also believed to undergo concerted evolution. However, it has been found recently that the 5S genes located at different loci might evolve by the birth-and-death evolution model. This model predicts that new genes in a family are formed by gene duplication (diversification), and some of these duplicate genes specialize (differentiate) and are maintained in the genome for a long period of time, while others are inactivated or deleted in different species (pseudogenization) [3]. In this sense, Freire et al. [7] found that the 5S genes of mussels showed a mixed mechanism, involving the generation of genetic diversity through birth-and-death, followed by a process of local homogenization resulting from concerted evolution in order to maintain the genetic identities of the different 5S genes.
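The homogenization of repeats described above can be illustrated with a toy simulation. The Python sketch below is only a minimal illustration under strong assumptions and is not a model taken from the cited studies: each repeat is reduced to a variant label, there is no mutation or selection, and homogenization is represented by repeatedly overwriting one randomly chosen repeat with a copy of another (a conversion-like copying step). The function name homogenize and all parameter values are hypothetical.

```python
import random


def homogenize(copies, steps, rng):
    """Toy homogenization of a tandem array (illustrative assumption).

    At each step one randomly chosen repeat (the acceptor) is overwritten
    by a copy of another randomly chosen repeat (the donor). Returns the
    number of distinct variants present in the array after each step.
    """
    copies = list(copies)
    history = []
    for _ in range(steps):
        # Pick two different positions in the array.
        donor, acceptor = rng.sample(range(len(copies)), 2)
        copies[acceptor] = copies[donor]
        history.append(len(set(copies)))
    return history


rng = random.Random(0)   # fixed seed so the run is repeatable
start = list(range(20))  # 20 repeats, each initially a distinct variant
history = homogenize(start, steps=2000, rng=rng)
print("distinct variants after first and last step:", history[0], history[-1])
```

Because a copying step can only remove variants from the array and never create new ones, the number of distinct variants cannot increase in this sketch. Adding mutation would introduce new variants, and the balance between the two processes would then determine how similar the repeats within one array remain.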
Histone genes provide another widely known example of tandemly arrayed genes. Histones are highly conserved eukaryotic proteins that have a crucial role in the function and formation of the nucleosome. There are 5 major histone genes (H1, H2A, H2B, H3, and H4), which are separated from each other by non-coding IGSs. Each major histone gene includes some minor variant forms. Some variants originate from changes in only a few amino acids (for example, mouse H3.1 and H3.2 differ in only 1 amino acid), while other variants originate from changes affecting larger portions of the protein (e.g. mouse H3.1/H3.2 and H3.3) [8, 9]. The number of histone genes varies between species. For example, the yeast Saccharomyces cerevisiae has 2 copies of each major histone gene, whereas some urchin species contain up to 1,000 copies. Although histone genes are generally arranged in tandem arrays, in some species they are clustered but not tandemly organized (e.g. the mouse genome contains 2 clusters located on different chromosomes) or found scattered across different chromosomes (e.g. in Caenorhabditis elegans and in Zea mays) [8]. In Drosophila melanogaster, the 5 major genes are arranged in a repeating unit which is tandemly repeated 110 times on chromosome 2L. In addition, variant histone genes are located in other parts of the fly genome [3].
Among higher eukaryotic species, H4 and H3 proteins are highly conserved and even distantly related species such as animals and plants have very similar protein sequences. For example, only 3 out of 135 residues differentiate animal and plant H3 protein [3]. This high sequence identity might indicate that multigene families encoding histones evolve by concerted evolution. Nevertheless, histone genes as well as other multigene families (such as the major histocompatibility complex or MHC, immunoglobulin, and olfactory receptor genes) evolve primarily by the birth-and-death model of evolution [3, 8, 10]. This model promotes genetic diversity under recurrent gene duplication events and ...