Genome assembly of the cichlid fish Astatotilapia latifasciata with focus in population genomics of B chromosome polymorphism Maryam Jehangir Botucatu, July, 2017 UNIVERSIDADE ESTADUAL PAULISTA “Julio de Mesquita Filho” INSTITUTO DE BIOCIÊNCIAS DE BOTUCATU GENOME ASSEMBLY OF THE CICHLID FISH ASTATOTILAPIA LATIFASCIATA WITH FOCUS IN POPULATION GENOMICS OF B CHROMOSOME POLYMORPHISM MARYAM JEHANGIR DR. CESAR MARTINS Dissertation presented to the Institute of Biosciences, Campus of Botucatu, UNESP, to obtain the title of Master in the Postgraduate Program -University graduate in Biological Sciences (Genetics). Botucatu- SP 2017 Palavras-chave: Cromossomos B; Evolução; Montagem do genoma; Peixe cíclido; Polimorfismo. Jehangir, Maryam. Genome assembly of the cichlid fish astatotilapia latifasciata with focus in population genomics of B chromosome polymorphism / Maryam Jehangir. - Botucatu, 2017 Dissertação (mestrado) - Universidade Estadual Paulista "Júlio de Mesquita Filho", Instituto de Biociências de Botucatu Orientador: Cesar Martins Coorientador: Guilherme Targino Valente Capes: 20204000 1. Peixe - Genética. 2. Ciclídeos. 3. Cromossomos - Polimorfismo. 4. Genoma. 5. Genômica. 6. Citogenética. DIVISÃO TÉCNICA DE BIBLIOTECA E DOCUMENTAÇÃO - CÂMPUS DE BOTUCATU - UNESP BIBLIOTECÁRIA RESPONSÁVEL: ROSEMEIRE APARECIDA VICENTE-CRB 8/5651 FICHA CATALOGRÁFICA ELABORADA PELA SEÇÃO TÉC. AQUIS. TRATAMENTO DA INFORM. Acknowledgements In the name of Allah, the Most Gracious and the Most Merciful, for the strengths and His blessings in completing this thesis. I would like to thank all the people who contributed in some way to the work described in this thesis. First and foremost, I thank my academic advisor, Professor Cesar Martins, for accepting me into his group. During my tenure, he gave me intellectual freedom in my work, supporting my attendance at various conferences, engaging me in new ideas, and demanding a high quality of work in all my endeavors. My appreciation to my co-advisor professor Guilherme Targino Valente, for beneficial ideas and knowledge regarding my research project. Additionally, I would like to thank my committee members for their interest in my work. Every result described in this thesis was accomplished with the help and support of fellow lab mates and collaborators especially, Syed Farhan Ahmad, Erica Ramos and Adauto Lima Cardoso. I greatly benefited from their keen scientific insight, their knack for solving seemingly intractable practical difficulties, and their ability to put complex ideas into simple terms. I am indebted to UNESP administration, for providing much needed assistance with administrative tasks, reminding us of impending deadlines, and keeping our work running smoothly. I am grateful for the FAPESP (process number: 2014/17683-6), which provides me research resources to pursue my master project. I would like to acknowledge the Department of Genetics for arrangement of different courses and and the high-quality seminars, benefited greatly from these. I must express my gratitude to Syed Farhan Ahmad, my husband, for his continued support, encouragement and collaborating in my research. Finally, I would like to acknowledge my mother, father, sisters and brother for their love and prayers. Also not forgetting my daughter, Inshirah Farhan, her cute smile gives me more courage. Maryam Jehangir (Master student) Contents list 1. Introduction.............................................................................................................................….....7 1.1. B chromosome.......................................................................................................................…….7 1.2. Cichlid fish as a model organism..............................................................................................…..9 1.3. Astatotilapia latifasciata as a model species for B chromosomes studies...........................….…..9 1.4. Next generation sequencing (NGS) as a powerful tool to reconstruct the Astatotilapia latifasciata genome.............................................................................….....…......................….….....10 1.4.1. Illumina Sequencing....................................…...........................…...................................……12 1.4.2. De novo genome assembly...............................................................................................……..13 1.5. DNA polymorphism..............................................................................................…...............….13 1.5.1. Single nucleotide polymorphism (SNP)..............................................................................…..14 1.5.2. Insertion or deletion polymorphism(INDELs).....................................................................…..14 1.5.3. Genome rearrangement.....................................................................................................….....15 2. Objectives...........................................................................................................................…..…...17 2.1. General objective..............................................................................................................….…...17 2.2. Specific objectives................................................................................................................…....17 3. Material and methods...........................................................................................................….....18 3.1. The model organism, Astatotilapia latifasciata..................................................................….…..18 3.2. Chromosome preparation, DNA sampling and sequencing..........................................………....18 3.3. Illumina Next-Generation Sequencing.....…….................................................................….…..19 3.4. Pre-processing step of NGS data.......................................................................................……...19 3.5. Genome assemblies and quality evaluation.........…............................................................….…19 3.6. Structural annotation of genes............................................................................................……..20 3.7. Genome diversity and genome structural variations analysis.........................................….….....21 3.8. Analysis of B localized sequences....................................................................................….…...22 3.9. Primer design, probes construction and FISH mapping of Ihhb and 45S rRNA genes.….…..…23 4. Results.............................................…............................................................…..........………......24 4.1. Illumina next generation sequencing.....................…......................................................….........24 4.2. De novo based B- and B+ genome assemblies......…....................…..........................….............24 5. Discussion.................................................................…..................................................………....26 6. Chapter 1…………………………………….……………………………………………...…....27 7. Thesis conclusion……………………….…………………………………….…………………..52 8. Recommendations........................................................................................….…...........………..53 9. Supplementary materials…………………………………………………….………..…….......54 10. References.................................................................................................….....................…...…64 Abstract B chromosomes (Bs) are additional to the standard regular chromosome set (As), and present in all groups of eukaryotes. A reference genome is key to understand genomics aspects of an organism. Here, we present the de novo genome assembly of the cichlid fish A. latifasciata: a well known model to study Bs. The assembly of A. latifasciata genome has not been performed so far. The main focus of this study is to analyze and assemble the A. latifasciata genome with no B (B-) and with B (B+) chromosomes. The assembled draft B- and B+ genomes comprised of 774 Mb and 781 Mb with 1.8 Mb and 2.5Mb of N50 value of scaffolds respectively, and spanning 23,391 number of genes. High coverage data with Illumina sequencing was obtained for males and females with 0B, 1B and 2B chromosomes to provide information regarding the population polymorphism of these genomes. We observed a high scale genomic diversity in all analyzed genomes showing a high rate/frequency of population polymorphism with no evident effect of B chromosome presence. However, the B specific single nucleotide polymorphisms were found in the sequences that were located on B chromosome. While, the whole-genome rearrangements (inter chromosomal translocations) were detected in B+ genome, and structural variations including insertions, deletions, inversions and duplications were predicted in a representative genomic region of B chromosome. These results bring an evidence that existence of Bs in a genome should favour the accumulations of mutations and structural polymorphisms in the amlified genomic regions present on B chromosmes. In addition, we also performed the coverage based sequence study coupled with FISH mapping which revealed: 1) the existence of high copy number of inactive Indian Hedgehog b (Ihhb) gene on B chromosome emerging as a pseudogene after series of duplication events ultimately becoming a major structural component of B; 2) B chromosome have incorporated the entire 45S RNA cluster (18S ribosomal RNA, internal transcribed spacer 1, 5.8S ribosomal RNA, internal transcribed spacer 2, and 28S ribosomal RNA) from the A complement set. The assembly of A. latifasciata genome will serve as a reference for genetic analysis and the approach presented in this paper opens the perspective to advance understanding B chromosomes biology. Keywords: Genome Assembly, Cichlid fish, B chromosome, Genome, Sequencing, polymorphism, evolution 7 1. Introduction 1.1. B chromosomes B chromosomes (Bs) were first reported more than a century ago by E. B. Wilson in the leaf-footed plant bug insect (Metapodius) (Wilson, 1907). Bs are accessories to the standard regular chromosome set (As) and also named as extra chromosomes that are present in some individuals of species. B chromosomes behave different from normal set of chromosomes because they do not pair or undergo recombination with the A chromosomes during meiosis. They are inherited clonally and do not obey Mendelian law (Jones and Houben, 2003). Two key factors are involve in maintenance of Bs, its transmission rate (i.e., drive) and effects on fitness. It is believed unlikely that young extra chromosomes lacking drive or beneficial effects (even being neutral) might invade a population and become B chromosomes (Camacho et al. 1997; Camacho, 2005). Today more than 15% of eukaryotic species including 500 animal species have been reported to posses B chromosomes (Bs) (Camacho, 2005). Within the same population not all individuals carry B chromosomes and their number can differ between individuals (e.g. Vulpes vulpes, 2n = 34 As + 0–8 Bs; Rattus rattus, 2n = 42 + 0–5 Bs). Some species have different morphological types of Bs exist within a single species . They can surprisingly exceed the number of As in some species (e.g. Zea mays, 2n = 20 As + 0–34 Bs) (Liehr et al. 2008). In most species which carry Bs, the mitotic transmission of Bs during growth and development is normal and hence all cells carry the same number of Bs within the individual. However, several exceptional studies stated that some Bs are mitotically unstable and can vary in numbers in specific tissues and/or organs. For example, in the grasses (Aegilops speltoides and A. mutica), Bs exist in aerial organs but not in roots (Mendelson and Zohary, 1972; Ohta, 1996). Bs can also be distributed differently between genders from species to species. Normally Bs exist in both gender, but sometime the frequency of Bs are higher in one sex. In some species the Bs are present either in males only (eg. Moenkhausia sanctaefilomenae) (Portela-Castro et al. 2000) or females only (eg. Astyanax scabripinnis paranae) (Maistro et al. 1992; Mizoguchi and Martins- Santos, 1997). B chromosomes are composed of repetitive DNAs such as rRNA genes (Houben et al. 2005; Ruiz-Estévez et al. 2012), tandemly arranged repetitive elements (Potapov et al. 1990), LINEs (long interspersed nuclear elements), SINEs (short interspersed nuclear elements) (Peppers et al. 1997), interstitial telomeric sequences (Wurster et al. 1988; Szczerbal et al. 2003). The heterochromatic 8 nature of B chromosomes gives the idea that these elements were mediated genetically inert and their presence were not needed for survival or reproduction of the individuals (Camacho, 2005). But recently comparative sequence analysis of the A and B chromosomes of rye plant (Secale cereale) and cichid fish (Astatotilapia latifaciata) claim the inert nature of supernumerary chromosomes. These studies concluded that B chromosome has gained a diverse range of repeat sequences and protein-coding genes (Martis et al. 2012; Valente et al. 2014). Different studies have revealed that B chromosomes keep transcriptionally active DNA sequences that could play some role in variety of functions, such as the discovery of proto- oncogenes and tumor-suppressor genes in the B chromosomes of canid species(Graphodatsky et al. 2005; Makunin et al. 2014), H3 and H4 histone genes in those of the migratory locust (Teruel et al. 2010), other protein-coding genes in the B chromosomes of a cichlid fish and the Siberian roe deer (Capreolus pygargus) (Trifonov et al. 2013; Yoshida et al. 2011). In addition, Valente et al. investigated the gene content of B chromosomes in cichlid fish (Astatotilapia latifasciata) accomplice with different functions. The numerical frequency of Bs carrying individuals shows phenotypic effects in some species (Jones, 1982; Green, 1990). The small number of the B chromosomes seems to have no impact on the phenotype while in a high number they can effect the phenotype (Bosemark, 1957b; Gonzalez-Sanchez. et al. 2004). B chromosomes are linked with both negative and positive effects. In grasshopper (Myrmeleotettix maculatus), they likely prevent animal development (Harvey and Hewitt, 1979) and sperm dysfunction (Hewitt et al. 1987). A positive behavior was also found in some organisms for example as in rice (Oryza sativa), Bs play role on plant height, weight of grain, and length of its panicle (Cheng et al. 2000). The presence of B chromosomes in maize (Zea mays L), alters the recombination frequency of A chromosomes (Rhoades, 1968). The B chromosome originated as a by-product of A chromosome evolution of either the same or related species (Camacho et al. 2000). The study on Bs origin of Canidae identified that it carry several chromosomal regions of domestic dog that show co-hybridization to wild canid B chromosomes (Becker et al. 2015). Recently, the rise of next generation sequencing confessed that the B chromosomes of fish species Astatolilapia latifasciata and Astyanax paranae were evolved from multiple As (Silva et al. 2014; Valente et al. 2014). Sequencing of rye B chromosome showed that the B chromosome was originated from A chromosomes 3R and 7R(Martis et al. 2012). Kao et al. (2015) using the Random Amplified Polymorphic DNA (RAPD) technology, revealed that four short repetitive sequences were found to locate on both A and B chromosomes. 9 1.2. Cichlid fish as a model organism Cichlid fishes belong to the most rich species family of Perciformes and represent one of the best model for studying different genetic fields, such as evolution and cytogenetics (Kocher, 2004; Ijiri et al. 2007; Poletto et al. 2010a, b). There is approximately 3000 species of cichlid fishes spread through out the different regions, from Central and South America, Africa to Madagascar, the Middle East, and Southern India. An intense level of adaptive radiation has characterized the evolution of cichlids, where closely 2,000 species have been diversified in the last 10 million years (Stiassny, 1991; Kocher, 2004). The evolutionary tree of this family are divided into four subfamilies: Etroplinae (cichlid from Madagascar and South Asia - India and Sri Lanka), Ptychochrominae (Madagascar), Cichlinae (species of the Neotropics) and Pseudocrenilabrinae (cichlids from African). Further more, Cichlinae and Pseudocrenilabrinae comprise of more diverse species (Stiassny, 1991; Sparks and Smith, 2004; Genner, 2005). The African-Neotropical cichlids represent an ancestral group of Madagascar and Indian Cichlids which is regarded as the most basal group of divergence (Smith et al. 1994). Although more than 60% of the species present a karyotype with 2n = 48, the diploid number ranges from 2n = 32 to 2n = 60. African cichlids have a modal diploid number of 44 chromosomes, whereas the Neotropical cichlids have 2n = 48 chromosomes (Poletto et al. 2010). In the genomic level, different cichlids genomes (Oreochromis niloticus, Astatotilapia burtoni, Neolamprologus Brichardi, Pundamilia nyerereie and Metriaclima zebra) have been sequenced to interpret its role and evolutionary pathways (Brawand et al. 2014). These genomes are freely available online in different databases on (available http://cichlid.umd.edu/cichlidlabs / kocherlab / bouillabase.html). The presence of B chromosomes in cichlids is of particular interest. In our field of concern, Astatotilapia latifasciata is one of the appropriate model for examination of B chromosomes function and evolution. 1.3. Astatotilapia latifasciata as a model species for B chromosomes studies Cichlid fishes are considered a well known model to study B chromomsomes. Bs have been determined in 7 South Americans (Feldberg and Bertollo, 1984; Martins-Santos et al. 1995; 10 Feldberg et al. 2004) and 14 African (Poletto et al. 2010b; Fantinatti et al. 2011; Yoshida et al. 2011; Kuroiwa et al. 2014) species of cichlids. Among the African species, B chromosomes were first described in Astatotilapia latifasciata from Lake Nawampasa, a satellite lake of the Lake Kyoga, part of Lake Victoria system (Poletto et al. 2010a). B chromosomes in A. latifasciata have been studied with a focus on cytogenetics. Different cytogenetic techniques such as mapping ribosomal DNA sequences, the repetitive element SATA, BACs (Bacterial Artificial Chromosomes), genomic DNA fraction containing highly repeating (C0t- 1 DNA) and comparative genomic hybridization (CGH) indicated that no differences specific to sex were detected and these chromosomes have many repetitive DNA sequences. One or two similar Bs were observed in both sexes of A. latifasciata (Poletto et al. 2010a; Fantinatti et al. 2011). Large scale genomic analysis in the cichlid fish A. latifasciata (Valente et al. 2014) studied the B chromosome origin of the species. A better concept was drawn to understand the gene content, genome pattern and biological role of B chromosome by the application of NGS. The B chromosome of A. latifasciata composed of mostly fragmented genes with a few largely intact. Among these high intact sequences, genes involved with cell cycle (microtubule organization, kinetochore structure, recombination and progression through the cell cycle) were detected and may play a role in driving the transmission of the B chromosome in their hosts (Valente et al. 2014). 1.4. Next generation sequencing (NGS) as a powerful tool to reconstruct the Astatotilapia latifasciata The automated Sanger method is considered as a ‘first-generation’ technology, and newer methods are referred to as next-generation sequencing (NGS). Next-generation sequencing (also known as massively parallel sequencing) technologies can be used as a means for in-depth genomic analysis (Lin Liu et al. 2012; Wold et al. 2008). There are few common methodologies to be performed during sequencing such as preparation of template, imaging and sequencing and analysis of data. Single DNA molecules templates or clonally amplified templates are required for the preparation of NGS process. The two well known approaches “sequencing by synthesis” and “sequencing by ligation” are applied to sequence DNA. The former is based on numerous DNA polymerase-dependent step while later describe the use of DNA ligase instead of DNA polymerase. The integration of imaging techniques and these sequencing approaches range from measuring bioluminescent signals to four-colour imaging of single molecular events. Due to generation of huge amount of data by NGS, information 11 technology is becoming a substantial demand in terms of quality control, tracking and data storage (Metzker, 2005; Fan et al. 2006; Pop et al. 2008). Many challenges have arisen for bioinformatics to analyze the short-reads and gene variation effectively due to short-read strategy of NGS (Wold and Myers, 2008; Yang et al. 2009). Advances in bioinformatics field such as assembly and alignments would increase the chance of accurate and error free interpretation of short reads (Pop and Salzberg, 2008). Normally a run of typical NGS platform produces tens to hundreds of Gbp short reads. Finally, a raw data of terabytes is generated in an average next generation sequencing experiment, hence causing problems to manage and analyze the data. Creation of bioinformatics tools, advancement in management and development of data storage can lead to successful and useful application of NGS. Another important problem for bioinformatics analysis is the variabilities among the different NGS platforms. Because of differences in read length and data format, many factors such as data processing, sequence quality scoring, assembly and alignment of softwares need to be diversified accordingly etc. Various features of NGS is the reason of coexistence of multiple platforms in the marketplace with some having prominent benefits for specific applications over others. Commercially available technologies of NGS are Roche/454, Illumina/Solexa, Life/APG and Helicos BioSciences, the Polonator instrument and the near-term technology of Pacific Biosciences, who aim to bring their sequencing device to the market (Branton et al. 2008). NGS technologies have rapidly changed our minds in respect to various scientific aspects such as clinical, basic and applied research. In some aspects, the potential of NGS is comparable to the early days of PCR, with one’s imagination being the primary limitation to its use. In some cases NGS can exceed of one billion short reads per instrument run (Wold et al. 2008; Wang et al. 2009). Due to innovation in sequencing technology, progress in genomics has been progressed at a rapid pace. With more and more organisms being sequenced, a big stream of genetic data is overloading the world every day. Not only do these studies provide the knowledge for basic research, but also they afford immediate application benefits (Lin Liu et al. 2012). Next generation sequencing is now a useful tool in routine applications to be used for wide areas of biology, enabling researchers to uncover important biological mysteries (Soon et al. 2012). The arrival of different NGS platforms is opening opportunities to produce large number of sequence reads at low cost with wide range of applications. These include variant discovery by resequencing targeted regions of interest or whole genomes, de novo assemblies of bacterial and lower eukaryotic genomes, cataloguing the transcriptomes of cells, tissues and organisms (RNA–seq), genome-wide profiling of epigenetic marks and chromatin structure using other seq-based methods (ChIP–seq, methyl–seq and DNase– 12 seq), and species classification and/or gene discovery by metagenomics studies (Petrosino et al. 2009). Although large-scale genomic analyzes have already been explored in understanding B chromosome biology, such approach can be applied to a wide range of research questions as sequencing genomes, comparative biology studies, public health, epidemiology, physiology, and gene expression. High-scale analysis provides detailed information on the composition of genomes and representing important tools to advance over structural and functional analysis of cells, tissues and organisms. 1.4.1. Illumina sequencing The Illumina platform employs cyclic reversible termination (CRT) chemistry for DNA sequencing. The process relies on growing nascent DNA strands complementary to template DNA strands with modified nucleotides, while tracking the emitted signal of each newly added nucleotide. Each nucleotide has a 3’ removable block and is attached to one of four different fluorophores depending on its type. Sequencing occurs in repetitive cycles, each consisting of three steps: (a) extension of a nascent strand by adding a modified nucleotide; (b) excitation of the fluorophores using two different lasers, one that excites the A and C labels and one that excites the G and T labels; (c) cleavage of the fluorophores and removal of the 3’ block in preparation for the next synthesis cycle. Using this approach, each cycle interrogates a new position along the template strands. Illumina provides a large number of intermediate input files that can be used to perform a fine-scale analysis of distortion factors. The most useful among them are imaging files, intensity files, and sequencing files (Naiara and Michael, 2012). Illumina Sequencing increasing throughput exponentially over first-generation Sanger sequencing. There are certain drawbacks of second generation instruments despite of several useful benefits. Amplification of template DNA is required before sequencing which result in biased coverage. Another major disadvantage of these technologies is the production of short length reads making assembly and related analysis problematic. Average length of Illumina reads is 100 bp is not appropriate for an ideal assembly as a theoretical model assumes that reduction of read lengths from 1,000 bp to 100 bp may result a sixfold or more decline in continuity (Kingsford et al. 2010). The new chemistry for Illumina equipment promises reads of 300bp, but the quality of longer reads is lower. 13 1.4.2. De novo genome assembly In bioinformatics, de novo genome assembly refers as the procedure of aligning short DNA fragments(reads) together into longer segments (contigs or scaffolds) and further joined into linkage groups or placed on chromosomes (Ellegren, 2014; Simpson and Pop, 2015). Adequate overlap between the sequence reads in the genome at every position is essential for an accurate assembly. Longer reads may cause more overlap hence decreasing the depth of unnecessary raw reads. Paired end sequencing technology is more efficient in assembly because of its accurate placement of reads (Robert et al. 2014; Ellergren, 2014). A well assembled genome is needed to detect vital functional and structural genomic elements and essential for better analysis of genetic variations (Mahul et al. 2015). Large scale genomic data were previously obtained for the cichlid fish A. latifascita using Illumina NGS platform (Valente et al. 2014). Although it was generated a high coverage (over 30x of coverage) of 0B (genomes with no B chromosome) and 2B (genome with 2B chromosomes) genomes, the data were not enough to reconstruct the genome of the species. Although the generation of high amount of nucleotide sequence data becomes available with the NGS technologies, the assembly of genomes continues to be one of the central problems of bioinformatics. This owes, in large part, to the technologies that provide ‘reads’ of short sequences of DNA, from which the genome is inferred. Larger sets of data, and changes in the properties of reads such as length and errors, bring with them new challenges for assembly (Henson et al. 2014). 1.5. DNA polymorphism DNA polymorphisms are broad type of genetic variations among individuals of same species or a different species. In another words, if a few alleles of a polymorphic site has the most widely recognized variation among them happening with under 99% recurrence in the population(Cavalli- Sforza, 1971; Schork et al. 2000). Substantial variation can be found at different points in the genome while performing a comparative analysis of genomic DNA sequences of two individuals on equivalent chromosome. Genes can comprise of many polymorphisms which have impact on many phenotypes traits such as hair color and height. Some of these polymorphisms may cause disease susceptibility, affect drug responses and considered important in other medical aspects. Although several polymorphisms occur outside of genes and do not show any effect (Bentley, 2000; Barnes and Grey, 2003). 14 DNA polymorphism emerges as a result of mutation in the DNA level. The different types of polymorphism are specified according to the type of mutation that they are generated in the sequence. Polymorphism is broadly divided into different types, include single nucleotide polymorphism (SNPs), insertions and deletions (INDELs), and other larger rearrangements such as (inversion, translocation and transversion) (Botstein et al. 1980; Schork et al. 2000). 1.5.1. Single Nucleotide Polymorphism (SNP) Single nucleotide polymorphism(SNP) is simplest type of polymorphism results from alteration of a single nucleotide base adenine (A), thymine (T), cytosine (C) or guanine (G) by other nucleotide and bring mutation in genome. SNPs are the most common form of genetic variation, accounting for 90 percent of all human genetic variations. For instance, the nucleotide diversity (the SNP rate) between humans is about 0.1%, while this rate can be as high as 3-5% in some fish and other sea organisms (Riva et al. 2002). Few studies have predicted an average range of 0.3-1,000 kb of genome can detect SNPs, however most of the data was based on specific genes and therefore fail to address the question of SNP frequency. There is no clear cut evidence for the correct estimation of SNPs occurrence in the rest of genomes (Halushka et al. 1999; Cargill et al 1999). Most SNPs have only 2 alleles (biallelic SNP) and they may occur in both coding and non coding regions of the genome (99% not in genes). The SNPs located in coding regions are little harmful and inherit changes to DNA while the remaining non-coding SNPs become silent, harmless and does not effect (Strittmatter et al. 1996). 1.5.2. Insertion or deletion Polymorphism(INDELs) Insertions or deletions (INDELs) is another common form of genetic polymorphism, as the name indicates it result when a section of DNA is either deleted or inserted. This type of polymorphism mostly occurs due to the existence of nucleotide pattern or variable number of repeated base (Cooper et al. 1999). The size of repeat base pattern vary from few (two, three or four) nucleotides repeats known as 'micro-satellites' up to several hundreds base pairs called 'mini- satellites or 'variable tandem repeats' (VNTRs). Many alleles are referred as 'highly polymorphic' due to frequent number of repeat polymorphisms having several different repeat sizes. This is very much helpful to study population genetics because the probability of having the same number of 15 repeats in two individuals from, suppose, different population (healthy vs diseased, ethnic groups) may be much lower (Shriver et al. 1997). 1.5.3. Genome rearrangement A large amount of DNA segments are displaced as a result of a unique category of mutation known as genome rearrangement. This happens during the process of chromosomal breakdown at certain sites also called breakpoints, followed by reassembling of chromosomal pieces in a incorrect order. The genome rearrangements can have deleterious effects or even lethal if they occur in the coding sequence, making the gene inactive. On other hand, a rearrangement whose breakpoint fall in non-functional region is likely thought to be neutral in effects (Mathieu, 2001). The related studies conducted on genome rearrangement until now has concluded two categories: (i) intra-chromosomal rearrangements, have subtypes: transpositions, reversals/inversions, and block-interchanges/generalized transpositions, and (ii) inter-chromosomal rearrangements, have subtypes: translocations, fusions and fissions. Transpositions shift a segment from one location to another on a chromosome or, equivalently, exchange two adjacent and non- overlapping segments on the chromosome. Inversions or Reversals reverse a chromosomal segment and also exchange its strands. Exchanging two non-overlapping segments which are not necessary adjacent on chromosome is called generalized transposition or Block-interchanges. The exchange of a telomeric segment of one chromosome with that of other chromosome is termed as translocations. Fusions combine two separate smaller chromosomes into a single bigger one and fissions involve breakage of a chromosome into two smaller ones (Alekseyev, 2008; Alekseyev and Pevzner, 2008). Advanced and low-error analytical strategies should be designed to use DNA polymorphism for latest genetic applications. Approximately all DNA polymorphisms can be captured now with the help of modern development of methods for detection of structural variants (Svs) and SNPs including NGS. Still, there are many challenges to meet for highly accurate identification of DNA polymorphisms since short-read sequencing strategy is difficult for analysis to infer chromosomal context (Kidd et al. 2008; Graubert et al. 2007; Emerson et al. 2008). DNA polymorphism analysis were also studied in B chromosomes of rye (Secale cereale) (Martis et al. 2012) and Grasshopper (Eyprepocnemis plorans) (Muñoz-Pajares et al. 2011). Numerical polymorphisms have been described in a large number of maize landraces (Mcclintock et al. 1981; Chiavarino et al. 1995; Naranjo et al. 1995; Rosato et al. 1998), with differences in the number of B's being one of the main factors contributing to intraspecific genome-size variation. http://www.genetics.org/content/177/2/895#ref-63 http://www.genetics.org/content/177/2/895#ref-48 http://www.genetics.org/content/177/2/895#ref-15 http://www.genetics.org/content/177/2/895#ref-46 http://www.genetics.org/content/177/2/895#ref-46 16 Silva & Yonenaga-Yassuda (1998) reported a conspicuous heterogeneity of size, morphology, constitutive heterochromatin patterns and localization of telomeric sequences of B chromosomes for the rodent Nectomys, which allowed them to suggest differences in the composition of these chromosomes. A considerable amount of chromosome variability involving the presence of Bs and polymorphisms of autosome pairs was detected microteiid lizard (Nothobachia ablephara) (Pellegrino et al. 1999). Reports on the nucleotide level polymorphism of Bs are still extremely scarce , therefore our work is a novel contribution to understand the association and significance of polymorphism in B chromosome. 17 2. Objectives 2.1. General Objective  Reconstruction of A. latifasciata genomes (B+ and B-) to analyze B chromosome polymorphism and genomic diversity 2.2. Specific Objectives  Generation of high coverage data using Illumina sequencing  Assembly of B+ and B- genomes of A. latifasciata  Prediction of genes (structural annotation)  Analysis of genome diversity  Study of B chromosomes polymorphism  Search of B specific genes in A. latifasciata genome 18 3. Material and Methods 3.1. The model organism, Astatotilapia latifasciata The cichlid fish A. latifasciata (Pseudocrenilabrinae) is native from the lakes Kyoga and Nawampasa (East Africa) and classical and molecular cytogenetic studies detected the presence of one or two B chromosomes in some individuals of populations farmed in Brazil for aquarium hobbyist market (Poletto et al. 2010a; Fantinatti et al. 2011). The presence of B chromosomes in A. latifasciata associated to the release of whole genome sequences of several other African cichlids (The International Cichlid Genome Consortium, 2006) (Brawand et al. 2014) makes this species an excellent model for genomic studies. Furthermore, A. latifasciata can be easily maintained in captivity allowing directed crosses and sampling of tissues for chromosome, DNA, RNA and protein analysis. Currently, populations of A. latifasciata without B chromosomes (0B) and with 1 (1B) or 2 B chromosomes (2B) have being maintained at the fish facility of the Integrative Genomics Laboratory at Institute of Biosciences/UNESP, Botucatu, SP, Brazil. The use of these animals for biological samples were held according to ethical conventions utilized by Brazilian College for animal experiments accepted by ethic committee of the Biosciences Institute, UNESP-Sao Paulo State University. 3.2. Chromosome preparation, DNA sampling and sequencing The liver tissues of A. latifasciata samples were collected and karyotyped by classical chromosome preparation protocols employed for fish to check the presence of 0, 1 or 2B chromosomes. Additionally, these samples from males and females with 0B, 1B or 2 B chromosome genomes were checked via polymerase chain reaction (PCR) and real time PCR (Fantinatti et al. 2016). High quality genomic DNA (gDNA) from liver tissues samples of A. latifasciata males and females with 0B, 1B and 2B chromosomes selected for next generation sequencing (NGS). Seven individuals including males and females with 0B, 1B and 2B chromosomes were subject to Illumina sequencing using the service sequencing facility of University of Maryland, USA ( "http://www.ibbr.umd.edu/facilities/sequencing) . Additionally NGS were also performed in a MiSeq Illumina equipment available to our use in the Institute of Biosciences/UNESP, Botucatu. All the generated data are already available at Sacibase (www.sacibase.ibb.unesp.br). http://www.sacibase.ibb.unesp.br/ 19 3.3. Illumina Next-Generation Sequencing For the Illumina libraries, gDNA from the individuals were sheared to an average size of 350-550 bp using an S220 focused ultrasonicator (Covaris Inc., Woburn, MA). Separate libraries were constructed for the gDNA samples using the TruSeq DNA sample preparation kit ver.2 rev.C (Illumina Inc., San Diego, CA). Paired end (100-190 bp) reads sequencing of each library were performed in separate lanes on an Illumina HiSeq 1000 sequencer. 3.4. Pre-processing step of NGS data At the point when the sequences data are prepared, its need to checks whether a set of sequence reads in a .fastq file exhibit any bizarre qualities (which might indicate either low sequence quality, or interesting biological features in sample). The reads quality was checked by the software FastQC (website: http://www.bioinformatics.babraham.ac.uk/projects/fastqc/) and the ambiguous data need to filtered out, just leaving only a small fraction of usable, unique read pairs for assembly. The filtration of the data was done according to the requirement of genome assemblies by FASTX (http://hannonlab.cshl.edu/fastx_toolkit/) tool kit (-q 30, -p 90 parameters). The illumina adapters checking using blastn search (e-value ≤10e -5 and 90% of identity were the cutoff parameters), which reads with some hit were eliminated through customized python programming script. The not paired reads (singletons) were discarded by Pairfq software for de novo assembling. Libraries related to one fragment paired-end were merged and make two datasets for de novo assembling. While the sequences coverage( defines as a the average number of reads per site) of all samples were calculated by coverage formula given as:coverage = (read count x read length ) / total genome size, Read Counts= Total reads of genome, Read length= Expected length of reads, Total genome= Total genome size / close related genome size (Metriaclima. Zebra) 3.5. Genome assemblies and quality evaluation For choosing appropriate assembly software, it is important to consider both the amount of sequencing data and which computational resources are available (Schatz et al. 2010a). For that 20 reason we used Assamblathon2 (Bradnam et al. 2013) recommended strategy for fish genome, using the Velvet (v 1.2.08) (Zerbino and Birney, 2008) and SOAPdenovo2 softwares (Luo et al. 2012). To estimates the best k-mer length for genome de novo assembly, we ran KMERGENIE (Chikhi and Medvedev, 2014) on pre-processed reads with putative k values of 31-91. The optimal values of k were predicted to be 71-mer and using this k value for assembling. The clean reads comes from pre-proceesing step consisting of all B- and B+ samples having male and female were inserted in to Velvet (Zerbino and Birney, 2008) and SOAPdenovo2 assemblers to construct scaffold level de novo assemblies. The two different assembler were used to compare both assembler generated assemblies and at the end choose the best possible assembly. Velvet assembler with following parameter settings for command line run; "-ins_length 500 , exp-cov auto, -unused_reads yes, read_trkg yes" were used for both assemblies. Other assembler SOAPdenovo was also set for genome assemblies by using parameters: “insert length 350, 500 and 550, asm_flags=3” were setted up to high coverage cleaned reads. Both assemblers generated a scaffold level assemblies (B+ and B-). To close the gaps within scaffolds of B+ and B- assemblies to be more accurate assemblies, GAPFILLER software (Boetzer and Pirovano, 2012) was operated. The program was executed by ('–m’ = 80, ‘-t’= 10, '-g' = 5) settings. The final gap closed scaffolds assemblies were constructed and used for further analysis. QUAST software (Gurevich et al. 2013) was applied for evaluation of the generated scaffolds before and after gaps filled to computing several metric values (length, number, length variation, N50, gap length). This software was executed in two seperate runs, one for independent evaluation and other with to compare our genomes with closest reference, M. zebra genome. The program was executed for eukaryotic genes comparison. We therefore, evaluated assembly using two common quality measures: the contig N50 length, the assembly size. Finally, the B- scaffold level genome assembly was selected for future analysis and used as a reference. 3.6. Structural annotation of genes The GeneMark-ES (Lukashin and Borodovsky, 1998) was used for eukaryotic ab-initio gene prediction, and identification of a set of 453 core genes that are supposed to be highly conserved in all eukaryotes, using CEGMA pipeline (version 2.4) . For identification of protein coding genes in assembled genome, We used three approaches: homology-based, de novo and transcript sequences-based by using a pipeline MAKER v2.31.8 21 (Cantarel et al. 2008); we used repetitive elements of custom libraries of fish and Metazoan elements, Danio rerio CDS and proteome files from (http://www.ensembl.org/Danio_rerio), A. latifasciata assembled transcriptomes accessed from (http://sacibase.ibb.unesp.br/), gene prediction based on Lamprey training set and Blastn to NCBI database (National Center for Biotechnology Information – http://www.ncbi.nlm.nih.gov/). 3.7. Genome diversity and genome structural variations analysis The genomic diversity(Identification of unique and common SNPs & INDELs among different individuals) was determined using the following procedures. The Illumina reads generated for the different A. latifasciata genomes (males and females with 0B, 1B and 2B chromosomes) were first filtered for alignments by FASTX (http://hannonlab.cshl.edu/fastx_toolkit/) tool kit (-q 20, -p 80 parameters). The trim reads were aligned against our draft B- assembly as a reference using Bowtie2 (Langmead and Steven, 2012) --very-sensitive” option. Nucleotide polymorphism was identified using SAMtools tool to search for genome variations among the samples. Nucleotide polymorphism was identified using SAMtools (http://samtools.sourceforge.net/) to search for genome variations among the samples. The output files (VCF format) were subjected to VCFtools (vcf-stats and vcf-compare) (http://vcftools.sourceforge.net/) for statistical analysis to discover the frequency of single nucleotide polymorphism (SNPs), insertion/deletion (INDELs) and to compare these factors among indiviuals to find out the shared, unique or B related population polymorphism . Filtration was carried out to eliminate the lower quality (Q ≤ 20 and DP ≥ 100) SNPs and INDELs using vcffilter (https://github.com/vcflib/). Parameters for each step of this analysis were established according to the standard requirements. Similar approach was followed to detect polymorphisms in genic sequences (CDS, exons and introns) of A. latifasciata B+ and B- genomes aligned against de novo transcriptome assembly (Marques et al. unpublished). The structural variations (translocations) were detected in 1B sequencing data by Delly (Rausch et al. 2012). Delly integrates short insert paired-ends and split-read alignments to accurately delineate genomic rearrangements throughout the genome. This pipeline was applied to both B+ and B- reads in Bam format (generated using bowtie tool) considered as sample and control respectively. Denovo genome assembly of A. latifasciata was utilized as reference to locate the variations on the scaffolds regions. The detected translocations (breakpoints) were visualized by ClicO (Cheong et al. 2015), an online web-service based on Circos (Martin et al. 2009). To uncover 22 nature of these transclocation, We extracted randemly a few of these regions (515-520 MB) and subjected to NCBI Blast (https://blast.ncbi.nlm.nih.gov) for annotation. More specifically, structural variations such as deletions, insertions, transversions, inversions and duplications in genomic regions related to B chromosome (B block: B+ reads having higher coverage than B- reads) were analyzed by inGAP-sv tool (Qi et al. 2011). InGAP-sv detects the structural variations (SVs) on the basis of paired end mapped reads pattern and coverage of depth strategies. We applied this pipeline to B+ sam file generated using BWA (Li and Durbin, 2009) and A. latifasciata genome was used as a reference. After the SAM file was loaded into inGAP-sv, user-defined threshold of mapping quality (default value: 20) was applied to filter non- uniquely mapped reads. Illustrations of paired-end mapping (PEM) patterns for different types of identified SVs were generated according to Qi et al (2011). 3.8. Analysis of B localized sequences The nucleotide sequences of several previously identified B chromosomes genes of vertebrates (Makunin et al. 2014) supplementary (Table 1) were retrieved from the NCBI database (https://www.ncbi.nlm.nih.gov/). Consensus sequences constructions were obtained using Geneious v. 4.8.5 software (Drummond et al. 2009) for genes with more than one sequence available. All final sequences were used as queries against the A. latifasciata genome in a standard blastn search. Number of hits, E values and percent identity were considered to further proceed with B related analysis. This analysis confirmed the existence of Ihhb (Indian Hedgehog B gene) and 45S rRNA (18S ribosomal RNA, internal transcribed spacer 1, 5.8S ribosomal RNA, internal transcribed spacer 2, and 28S ribosomal RNA) gene sequences in A. latifasciata genome while the remaining genes were eliminated from the project because of partial or complete absence. The generated Illumina high coverage reads of all B- (0B male and 0B female) and B+ (1B Males, 1B Female and 2B Male) samples were aligned against both reference genes 45S rRNA and Ihhb of the cichlids Oreochromis aureus and Lithocromis rubripinnis respectively, using paired-end mode of Bowtie2 (http://bowtie-bio.sourceforge.net/bowtie2/manual.shtml) with the –very sensitive option. The output aligned files were converted to binary format and indexed using samtools. Each file was normalized using RPKM package of deeptools (Ramírez et al. 2014) to fix bias of initial coverage. These files were then visualized by integrated genome browser IGB (http://bioviz.org/igb/index.html) to compare coverage of both genes in B- and B+ samples. SNPs at different sites of nucleotide reads were detected. According to Marques et al. (unpublished) a total https://blast.ncbi.nlm.nih.gov/ 23 of 36 libraries were generated from 18 samples of transcriptomes of A. latifasciata. These transcriptomes were divided into triplicates of gonads, brain and muscle of male and female individuals. The uploaded transcriptomics data (http://sacibase.ibb.unesp.br/) and genomics data (aligned files) were visualized and screened. We did BLAST alignment of the genes against transcriptome assembly to locate them on specific scaffolds. The B specific SNPs were searched for simultaneous viewing all tracks of available data against transcriptome assembly as a reference. 3.9. Primer design, probes construction and FISH mapping of Ihhb and 45S rRNA genes Primers listed in the supplementary (Table 2) were constructed using PrimerQuest (http://www.idtdna.com/primerquest/home/index) tool, checked for specificity by Primer-Blast (http://www.ncbi.nlm.nih.gov/tools/primer-blast/) and evaluated by Primer Stat (http://www.bioinformatics.org/sms2/pcr_primer_stats.html). Genomic DNA was subjected to polymerase chains reaction (PCR) to obtain DNA segments to be used as probes for fluorescent in situ hybridization (FISH) mapping. DNA fragments obtained by PCR were sequenced (Sanger et al. 1977) using an ABI Prism 3100 automatic DNA sequencer (Applied Biosystems, Foster City, CA, USA) with a Dynamic Terminator Cycle Sequencing Kit (Applied Biosystems) as per the manufacturers’ instructions. Nucleic acid sequences were subjected to BLAST (Altschul et al. 1990) searches at the NCBI database to check for similarities to deposited sequences of corresponding genes. Probes were labelled with digoxigenin-11-dUTP (Roche Applied Science) and the signal was detected with anti-digoxigenin-rhodamine (Roche Applied Science). FISH was performed using the protocol described by Pinkel et al. (1986) with modifications (Cabral-de-Mello et al. 2012). The slides were denatured in 70% formamide/2xSSC, pH 7, for 36 s, and dehydrated in an ice-cold ethanol series (70%, 85%, and 100%). The images were captured with an Olympus DP71 digital camera coupled to a BX61 Olympus microscope and were optimized for brightness and contrast using Adobe Photoshop CS2. http://www.bioinformatics.org/sms2/pcr_primer_stats.html http://www.ncbi.nlm.nih.gov/tools/primer-blast/ 24 4. Results 4.1. Illumina next generation sequencing A total of seven B- and B+ samples were sequenced using Illumina HiSeq and Miseq platforms respectively. Each sample has different coverage of the corresponding genomes over the entire length (848,776,495bp) of the M. zebra genome. Each sample has been given a name such as: “M1-5” having different male samples and “F1-2” for female samples; 0-2 for B-, 1B and 2B, respectivelly; in (Table 1). Table.1. Illumina data obtained for A latifasciata including mapped over M. zebra genome. Samples Read length Coverage Coverage after filtration Total reads Remained reads after filtration Reference M1-0B 101 47.7x 38x 401,017,570 323226972(80%) Valente et al. 2014 F1-0B 191 75.5x 42.5x 337,349,994 190214688(56%) Present work M2-1B 35-250 2x 1.6x 716,030,6 5911265 (82%) Present work M3-1B 101 16.8x 13.4x 143,441,264 114064786(79%) Present work M4-1B 101 43.1x 34.8x 366,602,572 296161988(80%) Present work F2-1B 191 70.2x 40.1x 313,818,884 179286280(57%) Present work M5-2B 101 43.6x 30x 306,823,512 254993955 (83%) Valente et al. 2014 The raw data was then filtered and illlumina adapters were trimmed and a total of filtrated 513,441,660 (77.05%) of B- reads with 80.5x coverage and 850,418,274 (74.7%) with 120x coverage of B+ reads were remained after filtration (Table 1). 4.2. De novo based B- and B+ genome assemblies The B- and B+ draft assemblies were 771,316,069 bp and 781,068,509 bp of genome size respectively, which indicates that 84% of the B- and 80% of B+ genomes (This values correspond to the total used reads by assembler during assembly) are captured in scaffolds. The de novo assembly yielded 197,652 number of scaffolds for the B+ genome, with N50 of 25.54 kb, and 218,259 number of scaffolds for B- genome with N50 of 18.640 kb being the longest one with 23.86 kb and 23.36 kb respectively in (Table 2 and Supplementary Table 3). Our scaffold level genome assemblies corresponding to the closed reference M. zebra genome consisting of 25 848,776,495 bp cumulative length and GC contents of 40.5% given in (Supplementary Figure 1a,b). About (75.329%) and (74.556%) of M. zebra sequences had a significant hit in the B+ and B- genome assemblies respectively. The scaffold assembling gave a total of 3,998 and 3789 gaps per 100-kb of the assembled genomes(B- and B+) filled to 2,076 and 1884 respectively. The comparative analysis by QUAST (Gurevich et al. 2013) indicated that there was an improvement in assemblies after the gaps filling. Total length of B- and B+ genomes increased up to 774,140,944 bp and 782,153,984 bp respectively. Additional statistics is given in Supplementary Table 3 and Table 2. Table 2. QUAST evaluations statistic of B+ genome. Statistic B+ assembly B+ assembly after gaps filled # of contigs 322328 - Total Length of contigs 767906325 - N50 of contigs 7997 - GC (%) contigs 40.49 - Largest contig 101668 - # of scaffolds (>=500bp) 78064 78078 # scaffold(>= 0 bp) 197652 197652 Total length of scaffolds(>= 0 bp) 781068509 782153984 Total length of scaffolds(>=500bp) 756378434 757469399 Largest scaffold 238637 238740 GC (%) 40.48 40.51 Scaffolds N50 25546 25546 Scaffolds NG50 21574 21623 Scaffolds N75 11565 11570 Scaffolds NG75 7368 7429 Scaffolds L50 8086 8095 Scaffolds LG50 10075 10059 Scaffolds L75 19087 19109 Scaffolds LG75 26612 26517 # unaligned scaffold 22147 20980 Unaligned length 30818517 4751825 Genome fraction (%) 75.329 76.255 Duplication ratio 1.134 1.125 # N's per 100 kbp 3789.43 1884.55 Largest alignment 173293 180244 Remaining results are given in the manuscript attached as a chapter 1 in this thesis. The attached manuscript will be reviewed and submitted to publication in scientific journal. 26 5. Discussion The chapter 1 (manuscript) encompasses the discussion as a separate section. 27 6. Chapter 1 De novo genome assembly of the cichlid fish Astatotilapia latifasciata with focus in B chromosome. M. Jehangira, S. F. Ahmada, A. L. Cardosoa, G. T. Valenteb , C. Martinsa aDepartment of Morphology, Institute of Bioscience, UNESP -São Paulo State University, Botucatu, SP, Brazil bBioprocess and Biotechnology Department, Agronomical Science Faculty, UNESP Sao Paulo State University, Botucatu SP, Brazil. Email:maryam.bioinfo.unesp@gmail.com Abstract B chromosomes (Bs) are additional to the standard regular chromosome set (As), and present in all groups of eukaryotes. The origin and role of Bs have been the objective of genome-wide studies. A reference genome is key to understand genomics aspects of an organism. Here, we present the de novo genome assembly of the cichlid fish A. latifasciata: a well known model to study Bs. The assembled draft genome comprised of 774 Mb with 1.8 Mb of N50 value of scaffolds and spanning 23,391 protein coding genes. High coverage data with Illumina sequencing was obtained for males and females with 0B, 1B and 2B chromosomes to provide information regarding the population polymorphism of these genomes. We observed a high scale genomic diversity in all analyzed genomes showing a high rate/frequency of population polymorphism with no evident effect of B chromosome presence. However, the B specific single nucleotide polymorphisms were found in the sequences specially located on B chromosome. Only whole-genome rearrangements (inter chromosomal translocations) were detected in B+ genome, and structural variations including insertions, deletions, inversions and duplications were predicted in a representative genomic region of B chromosome. These results bring an evidence that existence of Bs in a genome should favour the accumulations of mutations and structural polymorphisms in the amlified genomic regions present on B chromosomes. In addition, we also performed the coverage based sequence study coupled with FISH mapping which revealed: 1) the existence of high copy number of inactive Indian Hedgehog b (Ihhb) gene on B chromosome emerging as pseudogene after series of duplication events ultimately becoming a major structural component of B; 2) B chromosome have incorporated the entire 45S RNA cluster (18S ribosomal RNA, internal transcribed spacer 1, 5.8S ribosomal RNA, internal transcribed spacer 2, and 28S ribosomal RNA) from the A complement mailto:maryam.bioinfo.unesp@gmail.com 28 set. The assembly of A. latifasciata genome will serve as a reference for genetic analysis and the approach presented in this paper opens the perspective to advance understanding B chromosomes biology. Keywords: Genome Assembly, Cichlid fish, B chromosome, Genome, Sequencing, polymorphism, evolution Introduction B chromosomes (Bs are accessories to the standard regular chromosome set (As) that are present in some individuals of more than 15% of eukaryotic species. These extra chromosomes do not recombine with members of the basic A chromosomes complement and do not pursue the rules of the Mendelian segregation law (Jones and Houben, 2003). They are mostly heterochromatic, comprising of a large amount of repetitive DNA and their presence are not needed for survival or reproduction of the individuals (Camacho, 2005). But recently different studies have revealed that B chromosomes keep transcriptionally active DNA sequences that could play some role in variety of functions (Graphodatsky et al. 2005; Makunin et al. 2014; Teruel et al. 2010; Trifonov et al. 2013; Yoshida et al. 2011). Several studies have also focused to investigate the origin of B chromosome in various species of plants and animals, and found it be a derivative of standard A chromosomes. (Alfenito and Birchler, 1993; Martis et al. 2012; Silva et al. 2014; Valente et al. 2014). Chromosome biology, has gained remarkable advancement due to the recent development of genomics. The applications of genomics, such as next generation sequencing (NGS) to the study of B chromosome is on the rise. Recently, NGS technology has contributed to many studies involving B chromosome analysis in plants, animals, and fungi (Camacho, 2005). The use of this technology joined with increasing information of genome organization together open a new chapter in B chromosome studies (Makunin et al. 2014). The development of NGS has made a key impact in assembling genomes of diverse organisms and studying their genomics features such as genetic polymorphism and variations (Imelfort et al. 2009; Mahul et al. 2015). Approximately all DNA polymorphisms can now be captured with the help of modern development of methods for detection of structural variants (SVs) and SNPs using NGS data (Kidd et al. 2008). Despite an increase in many evidences about their origin, the gene content and pattern of evolution of Bs, these genomic elements still remain an open question for the majority of species. One of them is a cichlid fish (Astatotilapia latifasciata). Among the African species, B 29 chromosomes were first described in Astatotilapia latifasciata from Lake Nawampasa, a satellite lake of the Lake Kyoga system (Poletto et al. 2010). Previous analysis applied to the B chromosome in A. latifasciata were based on cytogenetics and comparative genomics studies. Cytogenetics confirmed the both sexes of A. latifasciata can have either one or two similar B chromosomes enriched with many repetitive DNA sequences with no sex-specific differences (Poletto et al. 2010; Fantinatti et al. 2011). While the cytogenomics revealed that the B chromosome of A. latifasciata has intact genes and degenerated sequences derived from most of the A chromosome set; some of the intact genes are potentially active for transcription. (Valente et al. 2014). However, a complete overview regarding certain genomic features of the B chromosomes was not accomplished due to lack of an assembled genome. Here, we present a de novo genome assembly of A. latifasciata, and its annotation and genetic level polymorphisms. This genomic assembly not only contributes to uncover the important aspects of B chromosome but will also provide an extra resource for studying genomics features such as variations, genes identification and sequence mapping in closely related species. Further, in this study, we have explored the genomic diversity of different individuals and discussed the association of structural variations with B chromosome. The detection of extensively distributed interchromosomal rearrangements allow us to uncover unknown evolutionary breakpoints that occurred in the A. latifasciata genome with B chromosome. We have also performed an analysis of B chromosome sequences originated from A complement and their physical mapping. 30 Methods Chromosome preparation, DNA sampling and next generation sequencing data The kidney tissues of A. latifasciata samples were collected and karyotyped by classical chromosome preparation protocols employed for fish to check the presence of 0, 1 or 2B chromosomes. Additionally, these samples from males and females with 0B, 1B or 2 B chromosome genomes were checked via polymerase chain reaction (PCR) and real time PCR (Fantinatti et al. 2016). High quality genomic DNA (gDNA) from liver tissues samples of A. latifasciata males and females with 0B, 1B and 2B chromosomes selected for next generation sequencing (NGS). Seven individuals including males and females with 0B, 1B and 2B chromosomes were subjected to Illumina sequencing using the service sequencing facility of University of Maryland, USA (http://www.ibbr.umd.edu/facilities/sequencing). For the Illumina libraries, gDNA from the individuals were sheared to an average size of 350-550 bp using an S220 focused ultrasonicator (Covaris Inc., Woburn, MA). Separate libraries were constructed for the gDNA samples using the TruSeq DNA sample preparation kit ver.2 rev.C (Illumina Inc., San Diego, CA). Paired-end (100- 190 bp) sequencing of each library were performed in separate lanes on an Illumina HiSeq 1000 plataform. Additionally, one sample 1B male was sequenced by Miseq paltform equipment available in the Institute of Biosciences/UNESP, Botucatu. The reads quality was checked by the software FastQC (website: http://www.bioinformatics.babraham.ac.uk/projects/fastqc/) and filtration of the data was done according to the requirement of genome assemblies by FASTX (http://hannonlab.cshl.edu/fastx_toolkit/) tool kit (-q 30, -p 90 parameters). The illumina adapters checking using blastn search (e-value ≤10e -5 and 90% of identity were the cutoff parameters), which reads with some hit were eliminated through customized python programming script. The not paired reads (singletons) were discarded by Pairfq software for de novo assembling. Libraries related to one fragment paired-end were merged and make two datasets for de novo assembling. The coverage of all samples were calculated by coverage formula given as: coverage = (read count x read length)/total genome size, being read count = total reads of genome, read length = expected length of reads and the total genome = total genome size/close related genome size (Metriaclima zebra). 31 Genome assemblies and quality evaluation For choosing appropriate assembly software, it is important to consider both the amount of sequencing data and which computational resources are available (Schatz et al. 2010a). For that reason we used Assamblathon2 (Bradnam et al. 2013) recommended strategy for fish genome, using the Velvet (v 1.2.08) (Zerbino and Birney, 2008) and SOAPdenovo2 softwares (Luo et al. 2012). To estimates the best k-mer length for genome de novo assembly, we ran KMERGENIE (Chikhi and Medvedev, 2014) on pre-processed reads with putative k values of 31-91. The optimal values of k were predicted to be 71-mer and using this k value for assembling. The clean reads comes from pre-proceesing step consisting of all B- and B+ samples having male and female were inserted in to Velvet (Zerbino and Birney, 2008) and SOAPdenovo2 assemblers to construct scaffold level de novo assemblies. The two different assembler were used to compare both assembler generated assemblies and at the end choose the best possible assembly. Velvet assembler with following parameter settings for command line run; "-ins_length 500 , exp-cov auto, -unused_reads yes, read_trkg yes" were used for both assemblies. Other assembler SOAPdenovo was also set for genome assemblies by using parameters: “insert length 350, 500 and 550, asm_flags=3” were setted up to high coverage cleaned reads. Both assemblers generated a scaffold level assemblies (B+ and B-). To close the gaps within scaffolds of B+ and B- assemblies to be more accurate assemblies, GAPFILLER software (Boetzer and Pirovano, 2012) was operated. The program was executed by ('–m’ = 80, ‘-t’= 10, '-g' = 5) settings. The final gap closed scaffolds assemblies were constructed and used for further analysis. QUAST software (Gurevich et al. 2013) was applied for evaluation of the generated scaffolds before and after gaps filled to computing several metric values (length, number, length variation, N50, gap length). This software was executed in two separated runs, one for independent evaluation and other with to compare our genomes with closest reference, M. zebra genome. The program was executed for eukaryotic genes comparison. We therefore, evaluated assembly using two common quality measures: the contig N50 length, the assembly size. Finally, the B- scaffold level genome assembly was selected for future analysis and used as a reference. Structural annotation of genes 32 The GeneMark-ES (Lukashin and Borodovsky, 1998) was used for eukaryotic ab-initio gene prediction, and identification of a set of 453 core genes that are supposed to be highly conserved in all eukaryotes, using CEGMA pipeline (version 2.4) (Parra et al. 2007). For identification of protein coding genes in assembled genome, We used three approaches: homology-based, de novo and transcript sequences-based by using a pipeline MAKER v2.31.8 (Cantarel et al. 2008); we used repetitive elements of custom libraries of fish and Metazoan elements, Danio rerio CDS and proteome files from (http://www.ensembl.org/Danio_rerio), A. latifasciata assembled transcriptomes accessed from (http://sacibase.ibb.unesp.br/), gene prediction based on Lamprey training set and Blastn to NCBI database (National Center for Biotechnology Information – http://www.ncbi.nlm.nih.gov/). Genome diversity and genome structural variations analysis The Illumina reads generated for the different A. latifasciata genomes (males and females with 0B, 1B and 2B chromosomes) were first filtered for alignments by FASTX (http://hannonlab.cshl.edu/fastx_toolkit/) tool kit (-q 20, -p 80 parameters). The trim reads were aligned against our draft B- assembly as a reference using Bowtie2 (Langmead and Steven, 2012) --very-sensitive” option. Nucleotide polymorphism was identified using SAMtools tool to search for genome variations among the samples. Nucleotide polymorphism was identified using SAMtools (http://samtools.sourceforge.net/) to search for genome variations among the samples. The output files (VCF format) were subjected to VCFtools (vcf-stats and vcf-compare) (http://vcftools.sourceforge.net/) for statistical analysis to discover the frequency of single nucleotide polymorphism (SNPs), insertion/deletion (INDELs) and to compare these factors among indiviuals to find out the shared, unique or B related population polymorphism. Filtration was carried out to eliminate the lower quality (Q ≤ 20 and DP ≥ 100) SNPs and INDELs using vcffilter (https://github.com/vcflib/). Parameters for each step of this analysis were established according to the standard requirements. Similar approach was followed to detect polymorphisms in genic sequences (CDS, exons and introns) of A. latifasciata B+ and B- genomes aligned against de novo transcriptome assembly (Marques et al. unpublished). The structural variations (translocations) were detected in 1B sequencing data by Delly (Rausch et al. 2012). Delly integrates short insert paired-ends and split-read alignments to accurately delineate genomic rearrangements throughout the genome. This pipeline was applied to both B+ and B- reads in Bam format (generated using bowtie tool) considered as sample and 33 control respectively. Denovo genome assembly of A. latifasciata was utilized as reference to locate the variations on the scaffolds regions. The detected translocations (breakpoints) were visualized by ClicO (Cheong et al. 2015), an online web-service based on Circos (Martin et al. 2009). To uncover nature of these transclocation, We extracted randemly a few of these regions (515-520 MB) and subjected to NCBI Blast (https://blast.ncbi.nlm.nih.gov) for annotation. More specifically, structural variations such as deletions, insertions, transversions, inversions and duplications in genomic regions related to B chromosome (B block: B+ reads having higher coverage than B- reads) were analyzed by inGAP-sv tool (Qi et al. 2011). InGAP-sv detects the structural variations (SVs) on the basis of paired end mapped reads pattern and coverage of depth strategies. We applied this pipeline to B+ sam file generated using BWA (Li and Durbin, 2009) and A. latifasciata genome was used as a reference. After the SAM file was loaded into inGAP-sv, user-defined threshold of mapping quality (default value: 20) was applied to filter non- uniquely mapped reads. Illustrations of paired-end mapping (PEM) patterns for different types of identified SVs were generated according to Qi et al (2011). 3.8. Analysis of B localized sequences The nucleotide sequences of several previously identified B chromosomes genes of vertebrates (Makunin et al. 2014) supplementary (Table 1) were retrieved from the NCBI database (https://www.ncbi.nlm.nih.gov/). Consensus sequences constructions were obtained using Geneious v. 4.8.5 software (Drummond et al. 2009) for genes with more than one sequence available. All final sequences were used as queries against the A. latifasciata genome in a standard blastn search. Number of hits, E values and percent identity were considered to further proceed with B related analysis. This analysis confirmed the existence of Ihhb (Indian Hedgehog B gene) and 45S rRNA (18S ribosomal RNA, internal transcribed spacer 1, 5.8S ribosomal RNA, internal transcribed spacer 2, and 28S ribosomal RNA) gene sequences in A. latifasciata genome while the remaining genes were eliminated from the project because of partial or complete absence. The generated Illumina high coverage reads of all B- (0B male and 0B female) and B+ (1B Males, 1B Female and 2B Male) samples were aligned against both reference genes 45S rRNA and Ihhb of the cichlids Oreochromis aureus and Lithocromis rubripinnis respectively, using paired-end mode of Bowtie2 (http://bowtie-bio.sourceforge.net/bowtie2/manual.shtml) with the –very sensitive option. The output aligned files were converted to binary format and indexed using samtools. Each file was normalized using RPKM package of deeptools (Ramírez et al. 2014) to fix bias of initial coverage. https://blast.ncbi.nlm.nih.gov/ 34 These files were then visualized by integrated genome browser IGB (http://bioviz.org/igb/index.html) to compare coverage of both genes in B- and B+ samples. SNPs at different sites of nucleotide reads were detected. According to Marques et al. (unpublished) a total of 36 libraries were generated from 18 samples of transcriptomes of A. latifasciata. These transcriptomes were divided into triplicates of gonads, brain and muscle of male and female individuals. The uploaded transcriptomics data (http://sacibase.ibb.unesp.br/) and genomics data (aligned files) were visualized and screened. We did BLAST alignment of the genes against transcriptome assembly to locate them on specific scaffolds. The B specific SNPs were searched for simultaneous viewing all tracks of available data against transcriptome assembly as a reference. Primer design, probes construction and FISH mapping of Ihhb and 45S rRNA genes Primers listed in the supplementary (Table 2) were constructed using PrimerQuest (http://www.idtdna.com/primerquest/home/index) tool, checked for specificity by Primer-Blast (http://www.ncbi.nlm.nih.gov/tools/primer-blast/) and evaluated by Primer Stat (http://www.bioinformatics.org/sms2/pcr_primer_stats.html). Genomic DNA was subjected to polymerase chains reaction (PCR) to obtain DNA segments to be used as probes for fluorescent in situ hybridization (FISH) mapping. DNA fragments obtained by PCR were sequenced (Sanger et al. 1977) using an ABI Prism 3100 automatic DNA sequencer (Applied Biosystems, Foster City, CA, USA) with a Dynamic Terminator Cycle Sequencing Kit (Applied Biosystems) as per the manufacturers’ instructions. Nucleic acid sequences were subjected to BLAST (Altschul et al. 1990) searches at the NCBI database to check for similarities to deposited sequences of corresponding genes. Probes were labelled with digoxigenin-11-dUTP (Roche Applied Science) and the signal was detected with anti-digoxigenin-rhodamine (Roche Applied Science). FISH was performed using the protocol described by Pinkel et al. (1986) with modifications (Cabral-de-Mello et al. 2012). The slides were denatured in 70% formamide/2xSSC, pH 7, for 36 s, and dehydrated in an ice-cold ethanol series (70%, 85%, and 100%). The images were captured with an Olympus DP71 digital camera coupled to a BX61 Olympus microscope and were optimized for brightness and contrast using Adobe Photoshop CS2. http://www.bioinformatics.org/sms2/pcr_primer_stats.html http://www.ncbi.nlm.nih.gov/tools/primer-blast/ 35 Results De novo assembly and gene predictions A total of seven B- and B+ samples were sequenced using Illumina HiSeq and Miseq platforms respectively. Each sample has different coverage of the corresponding genomes over the entire length (848,776,495bp) of the M. zebra genome. Each sample has been given a name such as; “M1- 5” and “F1-2” for males and females, respectively; 0-2 for B-, 1B and 2B, respectivelly; S1-3 for different individuals of male samples (Table 1). Table.1. Illumina data obtained for A latifasciata including mapped over M. zebra genome. Samples Read length Coverage Coverage after filtration Total reads Remained reads after filtration Reference M1-0B 101 47.7x 38x 401,017,570 323226972(80%) Valente et al. 2014 F1-0B 191 75.5x 42.5x 337,349,994 190214688(56%) Present work M2-1B 35-250 2x 1.6x 716,030,6 5911265 (82%) Present work M3-1B 101 16.8x 13.4x 143,441,264 114064786(79%) Present work M4-1B 101 43.1x 34.8x 366,602,572 296161988(80%) Present work F2-1B 191 70.2x 40.1x 313,818,884 179286280(57%) Present work M5-2B 101 43.6x 30x 306,823,512 254993955 (83%) Valente et al. 2014 The raw data was then filtered and illlumina adapters were trimmed and a total of filtrated 513,441,660 (77.05%) of B- reads with 80.5x coverage and 850,418,274 (74.7%) with 120x coverage of B+ reads were remained after filtration (Table 1). The B- assembling to accomplish primary assembly of A. latifasciata genome having 771,316,069 bp, which indicates that 84% (This value corresponds to the total used reads by assembler during assembly) of genome was captured in scaffolds. The de novo assembly yielded 218,259 scaffolds with N50 of 18.640 kb (Supplementary Table 3) being the longest one with 23.36 kb. A. latifasciata draft genome has correspond to the M. zebra genome consisting of 848,776,495 bp cumulative length, GC contents of 40.5% (Supplementary Figure 1a,b). About 75.329% of M. zebra sequences had a significant hit against our performed assembly. The scaffold assembling gave a total of 3,998 gaps per 100-kb of the assembled genome filled to 2,076. The comparative analysis indicated that there was an improvement in assembly after the gaps filling. 36 Total length of genome slightly increased up to 774,140,944 bp. Additional statistics is given in Supplementary Table 3. We used the CEGMA pipeline (Parra et al. 2007) to evaluate the completeness of our assembly. The total number of complete core eukaryotic genes (CEGs) are 183 with the percentage of 73%, and number of partially complete CEGs are with the percentage of 94% supplementary data (Table 4). The ab-initio gene model predicted 71,917 genes of eukaryotes, of them 23,391 are protein coding genes. The 23,391 annotated genes in the A. latifasciata genome contain 204,656 exons and 177,219 with longest gene size is 66982 bp. On average, there are 7 exons and 6 introns per gene. Additional information (size, number and length) of major structure genome componets are in (Figure 1; Supplementary Data Table 5). All the structural annotation has been uploaded to Sacibase. Figure 1. Structural annotations of A. latifasciata genome. The bar graph shows the number of major genomic structural component. 37 Genomic diversity analysis In the genomes of six individuals (B- and B+), it was identified total 2,395,658 raw SNPs and 888,060 INDELs with respect to reference B- assembly. All the individuals carried higher number of SNPs than INDELs as given in (Supplementary Figure 2). These SNPs were forwarded to filtration and detected total of 17,875 high quality SNPs in all genomes (Figure 2a). Out of these total SNPs, the genome of 1B-female called significantly high number of SNPs (5,181). In case of B+ individuals signaled 11,978 SNPs against B- reference genome. Although we mapped the same reads of B- samples (male and female) against the same reference, still there appeared many SNPs of 0B male and 0B female 3,800 and 2,097 respectively. Furthermore, Comparative analysis of different SNPs combinations from five different samples confirm unique and shared SNPs variation (Figure 2b). We found that 1B-female has shared a higher number of SNPs with all other five individuals genomes and 2,936 SNPs are shared among all individual genomes (Figure 2b). Furthermore, we made use of genic sequences of Astatotilapia latifasciata B+ and B- genomes to determine the mutation frequency. 38 Figure 2. Population genomic polymorphism.Venn diagram using different colors congruent ellipses, each ellipse showing different individual and the overlapping curves among and between samples representing shared SNPs while the points outside the boundary represents unique SNPs. The bar graphs show the frequency of high quality filtered SNPs. Figures (a) and (c) represent the diversity analysis of genomic data aligned against our assembled genome, (b) and (d) represent genomic data aligned against the transcriptome assembly. To analyze the differences in SNP frequencies of genic sequences, which should reflect the presence or absence of selective pressure (Martis et al. 2012). This analysis was restricted to only 4 samples (two from each B+ and B- both male and female) in (Figure 2c). We found that all four 39 samples shared a total number of 2,047 SNPs (16.4%) and 0B male recorded the highest 1,766 (14.1%) unique SNPs as represented in the venn diagram (Figure 2d). Our results (Figure 2c) suggest the males were comparatively under lower SNPs rate than female with no evident effect of B sequences. Whole Genome rearrangement and structural variations Structural variations of the genome involve kilobase to megabase-sized deletions, duplications, insertions, inversions, and complex combinations of rearrangements (Korbel et al. 2007). A total of 625 interchromosomal translocations (breakpoints) were detected in the whole genome of 1B individual (Figure 3). We annotated a few of these regions with identified translocations, most of them were fragmented genes and non coding RNAs (Supplementary Figures 3, 4). Genomic regions related to B chromosome or B blocks was also subjected to reads-orientation based on SVs detection method. Interestingly, we found duplications, insertions and inversions at different sites in B blocks (Figure 4). These results contribute to understand the mechanism of evolutionary process of B chromosome. However, these polymorphic events might also be noticed in B- genomes in different number and pattern than B+ genomes. Several sequence duplications detected in the B block indicated that these sequences from A chromosome accumulated on B chromosome due to frequent duplication events. Sequence analysis and physical mapping of B sequences Blastn result summarized the highest identity and number of hits for Ihhb and 45S rRNA genes in complete set of B chromosome genes of vertebrates confirming their existence in the genome of A. latifaciata (Supplementary Table 5). The bowtie2 alignments of B + and B- Illumina reads from A. latifasciata shows higher coverage from the B+ than the B- samples for both genes (Figures 5a). The higher coverage of these genes in B+ sequence data shows their duplicated copies presence on B chromosomes. The FISH mapping (the probes had 98-99% identity with Ihhb and 45S rRNA genes of different vertebrates) revealed extensive markings of Ihhb over the two B chromosomes in 2B metaphases, representing its repetitive nature. The duplicated copies of Ihhb gene emerged as a major structural component of B chromosomes in FISH results (Figure 6). The available RNA-seq data was analyzed for both genes, the absence of Ihhb gene transcripts indicated 40 that it was not transcribed. We constructed separate probes for 18S, 5.8S and 28S rRNAs to investigate if the complete cluster of 45S rRNA has moved from A complement to B chromosome. Positive sites of 18S rRNA were observed over the pericentromeric and subtelomeric areas of the B chromosome and on pericentromeric regions of some A chromosomes. The 5.8S rRNA probe were marked on pericentomeric regions of B chromosome and in telomeric and subtelomeric regions of autosomes. The 28S rRNA produced signals on telomeric regions and centromeric regions of A chromosomes and that of B chromosomes. The results of mapping all three probes on B chromosome provide evidence that Bs have incorporated the entire 45S rRNA cluster from the A complement set. While transcript analysis indicated that there were some transcript found for 45S rRNS gene. We also conducted a survey to screen SNPs, INDELs at both genomics and transcriptomics level. The Ihhb gene has encountered few B specific SNPs and INDELs. A high number of population SNPs were found in 45S rRNA cluster at genomic level but these population SNPs were located in spacer Dna, the 18S, 28 and 5.8 regions were no SNPs found viewed in Figure 5b,c and Supplementary data Figure 6 & 7. 41 Figure 3. Circos visualization of whole genome rearrangements of B+ genome. The outermost purple ring represents the A. latifaciata de novo assembled genome in Mega bases. The second fragmented ring represents those scaffolds with B+ rearrangement. The red links show genomic regions of B+ reads with Intra-chromosomal rearrangements(translocations) or breakpoints. 42 Figure 4. Different structural variations including duplications, deletions, inversions and transversions in a randomly selected B block (lower line) against the reference genome (upper line) Illustrations of paired end mapping (PEM) patterns for different types of SVs are described as follow: Grey links indicate normally mapped read pairs with proper read orientation and distance. Light blue links represent read pairs with proper read orientation but longer distance, which may indicate a deletion event in the query sequence. Green links represent read pairs with proper orientation but shorter distance, and thus indicate an insertion. Dark blue links show read pairs with abnormal orientation, in which paired ends are mapped to the wrong strand(s). Yellow lines indicate single-end mapped reads (SE reads), in which only one of the paired reads is mapped. For a small insertion (< the insert size), a fraction of paired reads (in green) that span the insertion is mapped too closely in the reference. The insertion is surrounded by a set of single-end mapped reads (in yellow). A translocation is represented by two sets of distantly mapped pairs and one set of inverted mapped pair (in dark blue). An inversion causes the paired reads to change the orientation, and both ends will map to the same strand. 43 Segmental tandem duplication is represented by one set of distantly mapped reads and one set of inverted mapped reads. Figure 5. Sequence analysis of B related genes. (a) Reads coverage of Ihhb and 45S rRNA gene sequences. Notice the scale bar on left to differentiate the coverage rate between 0B and 2B samples. (b) Number of population and B specific SNPs are shown as red and blue respectively in the bars for both genes. (c) Model representation of genes to elaborate their structure and localize the positions of SNPs in regions with comparison of B+ and B- individuals. 44 Figure 6. FISH mapping of Ihhb and 45S rRNA genes. Images of metaphases stained with DAPI, probes and merged are shown for each sequence. Note the signals markings on the B chromosomes. 45 Discussion Our analysis bring several contributions including: 1) generation of a A. latifasciata reference genome; 2) screening the high coverage sequencing data of different individuals for population polymorphism, genetic diversity and to discover B specific structural variations; 3) searching different B-linked genes in the de novo genome of A. latifaciata and show a relative copy number of selected genes between B+ and B- samples by coverage analysis; 4) physically mapping these genes on B chromosomes and validate the sequence analysis. De novo scaffolding of A. latifasciata genome In the present study, we obtained a high deep dataset of B- and B+ individuals for genome assembling of A. latifasciata. The evolutionary aspects of fish genomes, especially cichlids, are being extensively studied (Brawand et al. 2014). Our motivation to sequence the first A. latifasciata genome stemmed from the fact that the species is rapidly becoming important for B chromosome evolution and good model for genetic diversity with other cichlid species. The comparison of our work with other african cichlid fish assemblies presented by the International Cichlid Genome Consortium project (Brawand et al. 2014), in terms of N50, genome size, genes number and % of GC in Table 2, showed corresponding values with other cichlids. It indicated that our reconstructed genomes can be assumed as the possible de novo draft assembly of our model species performed. According to Assemblathon 2 project (Bradnam et al. 2013) the results should not be only relied on the basis of a single assembly and parameter, therefore we also reconstructed our genomes by SOAPdeno2 (Luo et al. 2012) assembler to assure much better results. However, we neglected the genome given as output by SOAPdenovo2 because of its lower accuracy as compared to Velvet. 46 Table 2. Comparison of the current assembly to others african cichlid fish assemblies presented by the International Cichlid Genome Consortium project (Brawand et al. 2014). Locality Species African Riverine Lake Victoria Lake Tanganyika Lake Malawi Latimeria chalumnae Gasteosteus aculeatus A. latifasciata (Our draft assembly) O. niloticus A. burtoniP. nyererei N. bricherdi M. zebra Estimated genome size(Gb) 1.01 0.923 0.993 0.98 0.946 2.85 0.53 0.771 Scaffold N50(kb) 2.8 1.2 2.5 4.4 3.7 0.92 10.8 1.8 % GC content 40.42 40.51 40.6 40.44 40.54 41.15 44.6 40.5 Protein coding gene count 24,559 23,436 20,611 20,119 21,673 19,033 20,787 23, 391 Genomic diversity analysis between B- and B+ individual genomes The Illumina reads coverage obtained for several samples (including males and females and B+ and B- genotypes) were useful to generate a population genomics scenario for the species, and made it possible to analyze the variants specific to B chromosomes and comparative analysis of genomic diversity between B+ and B- genomes. Previous studies have utilized the same strategy to to screen whole genome for distribution of SNPs and INDELs using illumina sequencing data and proposed this method to be useful in understanding genomic diversity (Loh et al. 2008; Gudbjartsson et al. 2015). Our study is in attempt to deal with investigation of the similarities and differences between the two genomes (B+ and B-) of the same species using the same approach as described in aforementioned reports. B chromosome studies might be interesting to analyze different polymorphism such as SNPs, INDELs, duplications and rearrangements. B chromosome is originated by duplications and expansion of DNA copies from A chromosomal set (Kellis et al. 2004; Martis et al. 2012; Valente et al. 2014). We explored the genomics data in order to present a unique brief description about the whole genome polymorphism concerned with B chromosomes, considering that the hypothesis that B+ genome would probably have accumulated many polymorphic DNA sites as compared to B- genome. Our results indicated a higher polymorphism (SNPs and INDELs) in B+ female as compared to B- female at genomic level. However, in other B+ individuals, no association of B chromosome was detected to effect the frequency of SNPs. We applied a similar approach suggested by (Martis et al. (2012) to find out the presence or absence of selective pressure on the genic sequences. Our analysis concluded no effect either on selective 47 pressure with respect to the B chromosome existence in the genome (B is not related to the higher frequency of SNPs in A. latifasciata). This is in contrast to the findings of Martis et al (2012), which identified a higher frequency of SNPs due to the presence of B chromosome signifying the lower selective pressure on B genic sequences. For better and precise understanding the relationship of polymorphism and variation with supernumerary chromosomes, we needed a more specific analysis, which rather focus on individuals DNA sequences or genes. This way we carried out an analysis specific to Ihhb gene and interestingly found B specific SNPs as discussed laterin next section. The whole genome diversity analysis was a novel strategy to study the B related and population based polymorphism in A. latifasciata and has never been performed. This approach was aimed to generalize an overall view about the mutual similarities and differences between the B+ and B- genomes. Exploring duplicates of Ihhb and 45S rRNA genes The findings of our study are helpful to add another evidence in explaining chromosomal origin of the supernumerary elements. As previously described by numerous reports that the B is derived from autosomes (Alfenito and Birchler 1993; Houben et al. 2001; Peng et al. 2005; Martis et al. 2012), it has been proposed that all autosomes have contributed sequences to the B chromosome of A. latifasciata through gene duplication (Valente et al. 2014). Among the genes discovered on B chromosome of vertebrates, there is a morphogenesis-related gene named indian hedgehog b (Ihhb) gene detected in the genome of the cichlid Lithocromis rubripinnis (Yoshida et al. 2011). This study found more than 40 copies of the Ihhb paralogs on B chromosomes and a single copy of Ihhb ortholog on the A chromosomes by qPCR in L. rubripinnis. Our genome analysis of this gene in A. latifasciata followed by FISH mapping provide a novelty of its organization and duplication on B chromosome. In general, B chromosomes are rich in several classes of repetitive DNA, including 5S and 45S ribosomal DNA (rDNA), satellite DNA, histone genes, small nuclear DNA, mobile elements, and organellar sequences (Friebe et al. 1995; Camacho, 2005; Houben et al. 2013). The presence of 18S rRNA gene sequences in the B chromosomes suggests a possible origin from the A chromosomes, which carries rRNA genes. The B chromosomes of A. latifasciata have accumulated 18S rRNA gene sequences on the subtelomeric and pericentromeric regions (Fantinatti et al. 2011). Based on previous description of ribosomal genes there rises a hypothesis that the complete 45S ribosomal DNA (rDNA) cluster would have started amplifying its copies from A complement and moved to B chromosomes during initial stages 48 of its evolution and ultimately lead to formation of proto-B chromosome. From the sequencing data analysis, we found that some regions of the cluster indeed showed a higher coverage in B+ samples as compared to B-, however no distinct differences were found in overall coverage. Our FISH mapping results of 18S rRNA and 28S rRNA confirmed the previous findings made by Poletto et al. (2010) and Fantinatti et al. (2011) . Further, the 5.8S rRNA marks on B chromosome enabled our hypothesis about the movement of whole cluster from A to B chromosome to be correct. A search for the transcripts of both Ihhb and 45S rRNA genes was performed in RNA-seq data of different tissues including brain, gonads and muscle to check if their B copies are expressed or inactive. We find transcripts of 45S rRNA in some tissues, which prove that the cluster is transcribed in A. latifasciata genome. Similarly, this analysis suggests that Ihhb gene is not transcribed and evolved as a pseudogene as a result of its series of sequence duplication. Previous studies also confirmed the occurence of pseudogenization of B-located gene-like fragments with only 15% of these pseudogenes transcribed in rye plant (Banaei-Moghaddam et al. 2013). Genome-wide rearrangements, structural variations and Polymorphisms in B+ genome. One of the key aspects in the B chromosomes studies is to understand the nature of genetic polymorphisms/variations and their evolutionary importance in the origin of B chromosome (Martis et al. 2012; Muñoz-Pajares et al. 2011). Although, most of the genomic studies are focused to identify single-nucleotide polymorphisms (SNPs) or small insertion deletion (INDELs), different types of structural variations (SVs) including complex rearrangements, duplications, inversions, large deletions and insertions can also be detected using a variety of computational approaches (Keane et al. 2014). Indeed, the SVs play a major role in evolutionary processes such as adaptation (Iskow et al. 2012) and speciation (Noor et al. 2001). A study reported by Fan and Meyer (2014) gives a detailed overview of different kinds of genomic variation across four recently diverged cichlid lineages and postulate on the correspondence of SVs for one of the largest adaptive radiations in vertebrates. A remarkable contribution to understand the evolutionary biology of chromosome is achieved by revelation of SVs in many organisms (Bickhart and Liu, 2014; Keane et al. 2014). The identification of many rearrangements has also been applied to explain the mechanism of sex chromosomes evolution (Rogers, 2015). There are several reports that superpose B and sex chromosomes, with evidences of sex chromosomes origin from domesticated B chromosomes (Martins et al. 2011) and also the B presence influencing the sex development https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4165313/#B11 http://journal.frontiersin.org/article/10.3389/fgene.2014.00326/full#B6 http://journal.frontiersin.org/article/10.3389/fgene.2014.00326/full#B6 http://journal.frontiersin.org/article/10.3389/fgene.2014.00326/full#B13 http://journal.frontiersin.org/article/10.3389/fgene.2014.00326/full#B9 https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3753381/#def1 49 (Yoshida et al. 2011). Therefore, B and sex chromosomes have become one of the major focuses in chromosome biology studies because they offer an excellent model to investigate mechanisms of chromosome changes during evolution that, in some cases, are incipient. In this sense, we have carried out the whole genome rearrangement analysis to understand the molecular mechanism of B chromosome evolution. In addition to the classical application of NGS to recover whole genomes, the paired-end approach can locate chromosomal breakpoints (Chen et al. 2010). Several repo