Genome assembly of the cichlid fish Astatotilapia latifasciata with focus in
population genomics of B chromosome polymorphism

Maryam Jehangir 

Botucatu, July, 2017


UNIVERSIDADE ESTADUAL PAULISTA

“Julio de Mesquita Filho”

INSTITUTO DE BIOCIÊNCIAS DE BOTUCATU

GENOME ASSEMBLY OF THE CICHLID FISH ASTATOTILAPIA LATIFASCIATA
WITH FOCUS IN POPULATION GENOMICS OF B CHROMOSOME

POLYMORPHISM

MARYAM JEHANGIR

DR. CESAR MARTINS 

Dissertation  presented  to  the  Institute  of
Biosciences,  Campus  of  Botucatu,  UNESP,  to
obtain  the  title  of  Master  in  the  Postgraduate
Program  -University  graduate  in  Biological
Sciences (Genetics).

Botucatu- SP


2017

Palavras-chave: Cromossomos B; Evolução; Montagem do

genoma; Peixe cíclido; Polimorfismo.

Jehangir, Maryam.

   Genome assembly of the cichlid fish astatotilapia

latifasciata with focus in population genomics of B

chromosome polymorphism / Maryam Jehangir. - Botucatu,

2017

   Dissertação (mestrado) - Universidade Estadual Paulista

"Júlio de Mesquita Filho", Instituto de Biociências de

Botucatu

   Orientador: Cesar Martins

   Coorientador: Guilherme Targino Valente

   Capes: 20204000

   1. Peixe - Genética. 2. Ciclídeos. 3. Cromossomos -

Polimorfismo. 4. Genoma. 5. Genômica. 6. Citogenética.

DIVISÃO TÉCNICA DE BIBLIOTECA E DOCUMENTAÇÃO - CÂMPUS DE BOTUCATU - UNESP

BIBLIOTECÁRIA RESPONSÁVEL: ROSEMEIRE APARECIDA VICENTE-CRB 8/5651

FICHA CATALOGRÁFICA ELABORADA PELA SEÇÃO TÉC. AQUIS. TRATAMENTO DA INFORM.


Acknowledgements

In the name of Allah, the Most Gracious and the Most Merciful, for the strengths and His blessings

in completing this thesis.

I would like to thank all the people who contributed in some way to the work described in this thesis.

First and foremost, I thank my academic advisor, Professor Cesar Martins, for accepting me into his

group. During my tenure, he gave me intellectual freedom in my work, supporting my attendance at

various conferences, engaging me in new ideas, and demanding a high quality of work in all my

endeavors. My appreciation to my co-advisor professor Guilherme Targino Valente, for beneficial

ideas and knowledge regarding my research project. 

Additionally, I would like to thank my committee members for their interest in my work. 

Every result described in this thesis was accomplished with the help and support of fellow lab mates

and collaborators especially, Syed Farhan Ahmad, Erica Ramos and Adauto Lima Cardoso. I greatly

benefited from their keen scientific insight, their knack for solving seemingly intractable practical

difficulties, and their ability to put complex ideas into simple terms. 

I am indebted to UNESP administration, for providing much needed assistance with administrative

tasks, reminding us of impending deadlines, and keeping our work running smoothly. 

I  am  grateful  for  the  FAPESP (process  number:  2014/17683-6),  which  provides  me  research

resources to pursue my master project. 

I would like to acknowledge the Department of Genetics for arrangement of different courses and

and the high-quality seminars, benefited greatly from these.

I  must  express  my  gratitude  to  Syed  Farhan  Ahmad,  my  husband,  for  his  continued  support,

encouragement and collaborating in my research. Finally, I would like to acknowledge my mother,

father,  sisters  and brother  for  their  love  and prayers.  Also not  forgetting  my daughter,  Inshirah

Farhan, her cute smile gives me more courage.

Maryam Jehangir

(Master student)


Contents list

1. Introduction.............................................................................................................................….....7

1.1. B chromosome.......................................................................................................................…….7

1.2. Cichlid fish as a model organism..............................................................................................…..9

1.3. Astatotilapia latifasciata as a model species for B chromosomes studies...........................….…..9

1.4. Next generation sequencing (NGS) as a powerful tool to reconstruct the Astatotilapia 

latifasciata genome.............................................................................….....…......................….….....10

1.4.1. Illumina Sequencing....................................…...........................…...................................……12

1.4.2. De novo genome assembly...............................................................................................……..13

1.5. DNA polymorphism..............................................................................................…...............….13

1.5.1. Single nucleotide polymorphism (SNP)..............................................................................…..14

1.5.2. Insertion or deletion polymorphism(INDELs).....................................................................…..14

1.5.3. Genome rearrangement.....................................................................................................….....15

2. Objectives...........................................................................................................................…..…...17

2.1. General objective..............................................................................................................….…...17

2.2. Specific objectives................................................................................................................…....17

3. Material and methods...........................................................................................................….....18

3.1. The model organism, Astatotilapia latifasciata..................................................................….…..18

3.2. Chromosome preparation, DNA sampling and sequencing..........................................………....18

3.3. Illumina Next-Generation Sequencing.....…….................................................................….…..19

3.4. Pre-processing step of NGS data.......................................................................................……...19

3.5. Genome assemblies and quality evaluation.........…............................................................….…19

3.6. Structural annotation of genes............................................................................................……..20

3.7. Genome diversity and genome structural variations analysis.........................................….….....21

3.8. Analysis of B localized sequences....................................................................................….…...22

3.9. Primer design, probes construction and FISH mapping of Ihhb and 45S rRNA genes.….…..…23

4. Results.............................................…............................................................…..........………......24

4.1. Illumina next generation sequencing.....................…......................................................….........24

4.2. De novo based B- and B+ genome assemblies......…....................…..........................….............24

5. Discussion.................................................................…..................................................………....26

6. Chapter 1…………………………………….……………………………………………...…....27

7. Thesis conclusion……………………….…………………………………….…………………..52

8. Recommendations........................................................................................….…...........………..53

9. Supplementary materials…………………………………………………….………..…….......54

10. References.................................................................................................….....................…...…64


Abstract

B chromosomes (Bs) are additional to the standard regular chromosome set (As), and present in all

groups of eukaryotes. A reference genome is key to understand genomics aspects of an organism.

Here, we present the  de novo genome assembly of the cichlid fish  A. latifasciata: a well known

model to study Bs. The assembly of A. latifasciata genome has not been performed so far. The main

focus of this study is to analyze and assemble the A. latifasciata genome with no B (B-) and with B

(B+) chromosomes. The assembled draft  B- and B+ genomes comprised of 774 Mb and 781 Mb

with 1.8 Mb and 2.5Mb of N50 value of scaffolds respectively, and spanning 23,391 number of

genes. High coverage data with Illumina sequencing was obtained for males and females with 0B,

1B and 2B chromosomes to provide information regarding the population polymorphism of these

genomes. We observed a high scale  genomic diversity  in  all  analyzed genomes showing a high

rate/frequency  of  population  polymorphism with  no  evident  effect  of  B  chromosome presence.

However, the B specific single nucleotide polymorphisms were found in the sequences that were

located  on  B  chromosome.  While,  the  whole-genome  rearrangements  (inter  chromosomal

translocations) were detected in B+ genome, and structural variations including insertions, deletions,

inversions and duplications were predicted in a representative genomic region of B chromosome.

These results bring an evidence that existence of Bs in a genome should favour the accumulations of

mutations and structural polymorphisms in the amlified genomic regions present on B chromosmes.

In addition,  we also performed the coverage based sequence study coupled with FISH mapping

which revealed: 1) the existence of high copy number of inactive Indian Hedgehog b (Ihhb) gene on

B chromosome emerging as a pseudogene after series of duplication events ultimately becoming a

major structural component of B; 2) B chromosome have incorporated the entire 45S RNA cluster

(18S ribosomal RNA, internal transcribed spacer 1, 5.8S ribosomal RNA, internal transcribed spacer

2, and 28S ribosomal RNA) from the A complement set. The assembly of A. latifasciata genome will

serve  as  a  reference  for  genetic  analysis  and  the  approach  presented  in  this  paper  opens  the

perspective to advance understanding B chromosomes biology.

Keywords: Genome Assembly, Cichlid fish, B chromosome, Genome, Sequencing, polymorphism,

evolution


7

1. Introduction

1.1. B chromosomes

B chromosomes (Bs) were first reported more than a century ago by E. B. Wilson in the

leaf-footed plant bug insect (Metapodius) (Wilson, 1907). Bs are accessories to the standard regular

chromosome set (As) and also named as extra chromosomes that are present in some individuals of

species. B chromosomes behave different from normal set of chromosomes because they do not pair

or undergo recombination with the A chromosomes during meiosis. They are inherited clonally and

do not obey Mendelian law (Jones and Houben, 2003). Two key factors are involve in maintenance

of Bs, its transmission rate (i.e., drive) and effects on fitness. It is believed unlikely that young extra

chromosomes lacking drive or beneficial effects (even being neutral) might invade a population and

become B chromosomes (Camacho et al. 1997; Camacho, 2005).

Today  more  than  15%  of  eukaryotic  species  including  500  animal  species  have  been

reported  to  posses  B chromosomes (Bs)  (Camacho, 2005).  Within  the  same population  not  all

individuals  carry B chromosomes and their  number can differ  between individuals  (e.g.  Vulpes

vulpes, 2n = 34 As + 0–8 Bs;  Rattus rattus,  2n = 42 + 0–5  Bs).  Some species  have different

morphological types of Bs exist within a single species . They can surprisingly exceed the number

of As in some species  (e.g. Zea mays, 2n = 20 As + 0–34 Bs) (Liehr et al. 2008). 

In  most  species  which  carry  Bs,  the  mitotic  transmission of  Bs  during  growth  and

development  is  normal and hence all  cells  carry the same number of Bs within the individual.

However, several exceptional studies stated that some Bs are mitotically unstable and can vary in

numbers in specific tissues and/or organs. For example, in the grasses (Aegilops speltoides and A.

mutica), Bs exist in aerial organs but not in roots (Mendelson and Zohary, 1972; Ohta, 1996). Bs

can also be distributed differently between genders from species to species. Normally Bs exist in

both gender, but sometime the frequency of Bs are higher in one sex. In some species the Bs are

present either in males only (eg.  Moenkhausia sanctaefilomenae) (Portela-Castro et al. 2000) or

females only (eg.  Astyanax scabripinnis paranae) (Maistro et al. 1992; Mizoguchi and Martins-

Santos, 1997).

B chromosomes are composed of repetitive DNAs such as rRNA genes (Houben et al. 2005;

Ruiz-Estévez et al. 2012), tandemly arranged repetitive elements (Potapov et al. 1990), LINEs (long

interspersed nuclear elements), SINEs (short interspersed nuclear elements) (Peppers et al. 1997),

interstitial telomeric sequences (Wurster et al. 1988; Szczerbal et al. 2003). The heterochromatic


8

nature of B chromosomes gives the idea that these elements were mediated genetically inert and

their presence were not needed for survival or reproduction of the individuals (Camacho, 2005). But

recently comparative sequence analysis of the A and B chromosomes of rye plant (Secale cereale)

and cichid fish (Astatotilapia latifaciata) claim the inert nature of supernumerary chromosomes.

These studies concluded that B chromosome has gained a diverse range of repeat sequences and

protein-coding genes (Martis et al. 2012; Valente et al. 2014).

 Different studies have revealed that B chromosomes keep transcriptionally active DNA

sequences  that  could  play  some  role  in  variety  of  functions,  such  as  the  discovery  of  proto-

oncogenes and tumor-suppressor genes in the B chromosomes of canid species(Graphodatsky et al.

2005; Makunin et al. 2014), H3 and H4 histone genes in those of the migratory locust
 
(Teruel et al.

2010), other protein-coding genes in the B chromosomes of a cichlid fish and the Siberian roe deer

(Capreolus  pygargus)  (Trifonov  et  al.  2013;  Yoshida  et  al.  2011).  In  addition,  Valente  et  al.

investigated  the  gene  content  of  B  chromosomes  in  cichlid  fish  (Astatotilapia latifasciata)

accomplice with different functions. 

The  numerical  frequency  of  Bs  carrying  individuals  shows  phenotypic  effects  in  some

species (Jones, 1982; Green, 1990). The small number of the B chromosomes seems to have no

impact on the phenotype while in a high number they can effect the phenotype (Bosemark, 1957b;

Gonzalez-Sanchez. et al. 2004). B chromosomes are linked with both negative and positive effects.

In grasshopper (Myrmeleotettix maculatus), they likely prevent animal development (Harvey and

Hewitt, 1979) and sperm dysfunction (Hewitt et al. 1987). A positive behavior was also found in

some organisms for example as in rice (Oryza sativa), Bs play role on plant height, weight of grain,

and length of its panicle (Cheng et al. 2000). The presence of B chromosomes in maize (Zea mays

L), alters the recombination frequency of A chromosomes (Rhoades, 1968).

The B chromosome originated as a by-product of A chromosome evolution of either the

same or related species (Camacho et al. 2000). The study on Bs origin of Canidae identified that it

carry several chromosomal regions of domestic dog that show co-hybridization to wild canid B

chromosomes (Becker et al. 2015). Recently, the rise of next generation sequencing confessed that

the B chromosomes of fish species  Astatolilapia latifasciata and  Astyanax paranae were evolved

from multiple As (Silva et al. 2014; Valente et al. 2014). Sequencing of rye B chromosome showed

that the B chromosome was originated from A chromosomes 3R and 7R(Martis et al. 2012). Kao et

al. (2015) using the Random Amplified Polymorphic DNA (RAPD) technology, revealed that four

short repetitive sequences were found to locate on both A and B chromosomes. 


9

1.2. Cichlid fish as a model organism

Cichlid fishes belong to the most rich species family of Perciformes and represent one of the

best model for studying different genetic fields, such as evolution and cytogenetics (Kocher, 2004;

Ijiri et al.  2007;  Poletto et al. 2010a, b). There is approximately 3000 species of cichlid fishes

spread through out the different regions, from Central and South America, Africa to Madagascar,

the Middle East, and Southern India. An intense level of adaptive radiation has characterized the

evolution of cichlids, where closely 2,000 species have been diversified in the last 10 million years

(Stiassny, 1991; Kocher, 2004).

The evolutionary tree of this family are divided into four subfamilies: Etroplinae (cichlid

from Madagascar and South Asia - India and Sri Lanka), Ptychochrominae (Madagascar), Cichlinae

(species  of  the  Neotropics)  and  Pseudocrenilabrinae  (cichlids  from  African).  Further  more,

Cichlinae and Pseudocrenilabrinae comprise of more diverse species  (Stiassny, 1991; Sparks and

Smith,  2004;  Genner,  2005).  The  African-Neotropical  cichlids  represent  an  ancestral  group  of

Madagascar and Indian Cichlids which is regarded as the most basal group of divergence (Smith et

al. 1994). 

Although more than 60% of  the species  present  a  karyotype with 2n = 48,  the diploid

number ranges from 2n = 32 to 2n = 60. African cichlids have a modal diploid number of 44

chromosomes, whereas the Neotropical cichlids have 2n = 48 chromosomes (Poletto et al. 2010). In

the  genomic  level,  different  cichlids  genomes  (Oreochromis  niloticus,  Astatotilapia  burtoni,

Neolamprologus Brichardi, Pundamilia nyerereie and Metriaclima zebra) have been sequenced to

interpret  its  role  and  evolutionary  pathways  (Brawand  et  al.  2014). These  genomes  are  freely

available online in different databases on (available http://cichlid.umd.edu/cichlidlabs / kocherlab /

bouillabase.html). 

The presence of B chromosomes in cichlids is of particular interest. In our field of concern,

Astatotilapia  latifasciata is  one  of  the  appropriate  model  for  examination  of  B  chromosomes

function and evolution. 

1.3. Astatotilapia latifasciata as a model species for B chromosomes studies

Cichlid fishes are considered a well known model to study B chromomsomes. Bs have been

determined  in  7  South  Americans  (Feldberg  and  Bertollo,  1984; Martins-Santos  et  al.  1995;


10

Feldberg et al. 2004) and 14 African   (Poletto et al. 2010b; Fantinatti et al. 2011; Yoshida et al.

2011; Kuroiwa et al. 2014) species of cichlids. Among the African species, B chromosomes were

first  described in  Astatotilapia  latifasciata from Lake Nawampasa,  a  satellite  lake of  the  Lake

Kyoga, part of Lake Victoria system (Poletto et al. 2010a). 

B chromosomes in A. latifasciata have been studied with a focus on cytogenetics. Different

cytogenetic techniques such as mapping ribosomal DNA sequences, the repetitive element SATA,

BACs (Bacterial Artificial Chromosomes), genomic DNA fraction containing highly repeating (C0t-

1 DNA) and comparative genomic hybridization (CGH) indicated that no differences specific to sex

were detected and these chromosomes have many repetitive DNA sequences. One or two similar Bs

were observed in both sexes of A. latifasciata (Poletto et al. 2010a; Fantinatti et al. 2011). 

Large scale genomic analysis in the cichlid fish A. latifasciata (Valente et al. 2014) studied

the B chromosome origin of the species. A better concept was drawn to understand the gene content,

genome  pattern  and  biological  role  of  B  chromosome  by  the  application  of  NGS.  The  B

chromosome of  A. latifasciata composed of mostly fragmented genes with a few largely intact.

Among these  high  intact  sequences,  genes involved  with  cell  cycle  (microtubule  organization,

kinetochore structure, recombination and progression through the cell cycle) were detected and may

play a role in driving the transmission of the B chromosome in their hosts (Valente et al. 2014). 

1.4. Next generation sequencing (NGS) as a powerful tool to reconstruct the Astatotilapia 

latifasciata

The automated Sanger method is considered as a ‘first-generation’ technology, and newer

methods are referred to as next-generation sequencing (NGS). Next-generation sequencing (also

known as massively parallel sequencing) technologies can be used as a means for in-depth genomic

analysis (Lin Liu et al. 2012; Wold et al. 2008).

There  are  few  common  methodologies  to  be  performed  during  sequencing  such  as

preparation  of  template,  imaging  and sequencing  and analysis  of  data.  Single  DNA molecules

templates or clonally amplified templates are required for the preparation of NGS process. The two

well  known approaches “sequencing by synthesis” and “sequencing by ligation” are applied to

sequence DNA. The former is  based on numerous DNA polymerase-dependent step while later

describe the use of DNA ligase instead of DNA polymerase. The integration of imaging techniques

and  these  sequencing  approaches  range  from  measuring  bioluminescent  signals  to  four-colour

imaging of single molecular events. Due to generation of huge amount of data by NGS, information


11

technology is becoming a substantial demand in terms of quality control, tracking and data storage

(Metzker, 2005; Fan et al. 2006; Pop et al. 2008).

Many  challenges  have  arisen  for  bioinformatics  to  analyze  the  short-reads  and  gene

variation effectively due to short-read strategy of NGS (Wold and Myers, 2008; Yang et al. 2009).

Advances in bioinformatics field such as assembly and alignments would increase the chance of

accurate and error free interpretation of short reads (Pop and Salzberg, 2008). Normally a run of

typical NGS platform produces tens to hundreds of Gbp short reads. Finally, a raw data of terabytes

is  generated  in  an  average  next  generation  sequencing  experiment,  hence  causing  problems  to

manage and analyze the data. Creation of bioinformatics tools, advancement in management and

development  of  data  storage  can  lead  to  successful  and  useful  application  of  NGS.  Another

important  problem for  bioinformatics  analysis  is  the  variabilities  among  the  different  NGS

platforms.  Because  of  differences  in  read  length  and  data  format,  many  factors  such  as  data

processing, sequence quality scoring, assembly and alignment of softwares need to be diversified

accordingly etc. Various features of NGS is the reason of coexistence of multiple platforms in the

marketplace  with  some  having  prominent  benefits  for  specific  applications  over  others.

Commercially  available  technologies  of  NGS  are  Roche/454,  Illumina/Solexa,  Life/APG  and

Helicos BioSciences, the Polonator instrument and the near-term technology of Pacific Biosciences,

who aim to bring their sequencing device to the market (Branton et al. 2008).

NGS technologies have rapidly changed our minds in respect to various scientific aspects

such as clinical, basic and applied research. In some aspects, the potential of NGS is comparable to

the early days of PCR, with one’s imagination being the primary limitation to its use. In some cases

NGS can exceed of one billion short reads per instrument run (Wold et al. 2008; Wang et al. 2009).

Due to innovation in sequencing technology, progress in genomics has been progressed at a rapid

pace. With more and more organisms being sequenced, a big stream of genetic data is overloading

the world every day. Not only do these studies provide the knowledge for basic research, but also

they afford immediate application benefits (Lin Liu et al. 2012). Next generation sequencing is now

a useful tool in routine applications to be used for wide areas of biology, enabling researchers to

uncover important biological mysteries (Soon et al. 2012). The arrival of different NGS platforms is

opening opportunities to produce large number of sequence reads at low cost with wide range of

applications. These include variant discovery by resequencing targeted regions of interest or whole

genomes,  de  novo assemblies  of  bacterial and  lower  eukaryotic  genomes,  cataloguing  the

transcriptomes of cells,  tissues and organisms (RNA–seq),  genome-wide profiling of epigenetic

marks and chromatin structure using other seq-based methods (ChIP–seq, methyl–seq and DNase–


12

seq), and species classification and/or gene discovery by metagenomics studies (Petrosino et al.

2009). 

Although  large-scale  genomic  analyzes  have  already  been  explored  in  understanding  B

chromosome  biology,  such  approach  can  be  applied  to  a  wide  range  of  research  questions  as

sequencing genomes,  comparative biology studies, public health,  epidemiology, physiology, and

gene expression. High-scale analysis provides detailed information on the composition of genomes

and  representing important tools to advance over structural and functional analysis of cells, tissues

and organisms.

1.4.1. Illumina sequencing

The Illumina  platform employs cyclic  reversible  termination  (CRT)  chemistry  for  DNA

sequencing. The process relies on growing nascent DNA strands complementary to template DNA

strands  with  modified  nucleotides,  while  tracking  the  emitted  signal  of  each  newly added

nucleotide.  Each nucleotide  has  a  3’ removable  block and  is  attached to  one  of  four different

fluorophores depending on its type. Sequencing occurs in repetitive cycles, each consisting of three

steps:  (a)  extension of  a  nascent  strand by adding a  modified nucleotide;  (b)  excitation of  the

fluorophores using two different lasers, one that excites the A and C labels and one that excites the

G and T labels; (c) cleavage of the fluorophores and removal of the 3’ block in preparation for the

next synthesis cycle. Using this approach, each cycle interrogates a new position along the template

strands. Illumina provides a large number of intermediate input files that can be used to perform a

fine-scale analysis of distortion factors. The most useful among them are imaging files, intensity

files, and sequencing files (Naiara and Michael, 2012).

Illumina  Sequencing  increasing  throughput  exponentially  over  first-generation  Sanger

sequencing. There are certain drawbacks of second generation instruments despite of several useful

benefits.  Amplification  of  template  DNA is  required  before  sequencing  which  result  in  biased

coverage. Another major disadvantage of these technologies is the production of short length reads

making assembly and related analysis problematic. Average length of Illumina reads is 100 bp is not

appropriate for an ideal assembly as a theoretical model assumes that reduction of read lengths from

1,000 bp to 100 bp may result a sixfold or more decline in continuity (Kingsford et al. 2010). The

new chemistry for Illumina equipment promises reads of 300bp, but the quality of longer reads is

lower.


13

1.4.2. De novo genome assembly

In bioinformatics, de novo genome assembly refers as the procedure of aligning short DNA

fragments(reads) together into longer segments (contigs or scaffolds) and further joined into linkage

groups or placed on chromosomes (Ellegren, 2014; Simpson and Pop, 2015). Adequate overlap

between the sequence reads in the genome at every position is essential for an accurate assembly.

Longer reads may cause more overlap hence decreasing the depth of unnecessary raw reads. Paired

end sequencing technology is more efficient in assembly because of its accurate placement of reads

(Robert et al. 2014; Ellergren, 2014). A well assembled genome is needed to detect vital functional

and structural genomic elements and essential for better analysis of genetic variations (Mahul et al.

2015). 

Large scale genomic data were previously obtained for the cichlid fish A. latifascita using

Illumina NGS platform (Valente et al. 2014). Although it was generated a high coverage (over 30x

of coverage) of 0B (genomes with no B chromosome) and 2B (genome with 2B chromosomes)

genomes,  the  data  were  not  enough  to  reconstruct  the  genome  of  the species.  Although  the

generation  of  high  amount  of  nucleotide  sequence  data  becomes available  with  the  NGS

technologies,  the  assembly  of  genomes  continues  to  be  one  of  the central  problems  of

bioinformatics. This owes, in large part, to the technologies that provide ‘reads’ of short sequences

of DNA, from which the genome is inferred. Larger sets of data, and changes in the properties of

reads such as length and errors, bring with them new challenges for assembly (Henson et al. 2014).

1.5. DNA polymorphism

DNA polymorphisms are broad type of genetic variations among individuals of same species

or a different species. In another words, if a few alleles of a polymorphic site has the most widely

recognized variation among them happening with under 99% recurrence in the population(Cavalli-

Sforza,  1971;  Schork et  al.  2000). Substantial  variation can be found at  different  points  in  the

genome while performing a comparative analysis of genomic DNA sequences of two individuals on

equivalent chromosome. Genes can comprise of many polymorphisms which have impact on many

phenotypes traits such as hair color and height. Some of these polymorphisms may cause disease

susceptibility, affect drug responses and considered important in other medical aspects. Although

several polymorphisms occur outside of genes and do not show any effect (Bentley, 2000; Barnes

and Grey, 2003). 


14

DNA polymorphism emerges as a result of mutation in the DNA level. The different types of

polymorphism  are  specified  according  to  the  type  of  mutation  that  they  are  generated  in  the

sequence.  Polymorphism  is  broadly  divided  into  different  types,  include  single  nucleotide

polymorphism (SNPs), insertions and deletions (INDELs), and other larger rearrangements such as

(inversion, translocation and transversion) (Botstein et al. 1980; Schork et al. 2000).

1.5.1. Single Nucleotide Polymorphism (SNP)

Single  nucleotide  polymorphism(SNP)  is  simplest  type  of  polymorphism  results  from

alteration of a single nucleotide base adenine (A), thymine (T), cytosine (C) or guanine (G) by other

nucleotide and bring mutation in genome. SNPs are the most common form of genetic variation,

accounting for 90 percent of all human genetic variations. For instance, the nucleotide diversity (the

SNP rate) between humans is about 0.1%, while this rate can be as high as 3-5% in some fish and

other sea organisms (Riva et al. 2002). Few studies have predicted an average range of 0.3-1,000 kb

of genome can detect SNPs, however most of the data was based on specific genes and therefore

fail  to  address  the  question  of  SNP frequency.  There  is  no  clear  cut  evidence  for  the  correct

estimation of SNPs occurrence in the rest of genomes (Halushka et al. 1999; Cargill et al 1999).

Most SNPs have only 2 alleles (biallelic SNP) and they may occur in both coding and non coding

regions of the genome (99% not in genes). The SNPs located in coding regions are little harmful

and inherit changes to DNA while the remaining non-coding SNPs become silent, harmless and

does not effect (Strittmatter et al. 1996).

1.5.2. Insertion or deletion Polymorphism(INDELs)

Insertions or deletions (INDELs) is another common form of genetic polymorphism, as the

name  indicates  it  result  when  a  section  of  DNA is  either  deleted  or  inserted.  This  type  of

polymorphism mostly  occurs  due  to  the  existence  of  nucleotide  pattern  or  variable  number  of

repeated base (Cooper et al. 1999). The size of repeat base pattern vary from few (two, three or

four) nucleotides repeats known as 'micro-satellites' up to several hundreds base pairs called 'mini-

satellites or 'variable tandem repeats' (VNTRs). Many alleles are referred as 'highly polymorphic'

due to frequent number of repeat polymorphisms having several different repeat sizes. This is very

much helpful to study population genetics because the probability of having the same number of


15

repeats in two individuals from, suppose, different population (healthy vs diseased, ethnic groups)

may be much lower (Shriver et al. 1997). 

1.5.3. Genome rearrangement

A large amount of DNA segments are displaced as a result of a unique category of mutation

known as genome rearrangement. This happens during the process of chromosomal breakdown at

certain sites also called breakpoints, followed by reassembling of chromosomal pieces in a incorrect

order. The genome rearrangements can have deleterious effects or even lethal if they occur in the

coding sequence, making the gene inactive. On other hand, a rearrangement whose breakpoint fall

in non-functional region is likely thought to be neutral in effects (Mathieu, 2001).

The  related  studies  conducted  on  genome  rearrangement  until  now  has  concluded  two

categories:  (i)  intra-chromosomal  rearrangements,  have  subtypes:  transpositions,

reversals/inversions, and block-interchanges/generalized transpositions, and (ii) inter-chromosomal

rearrangements, have subtypes: translocations, fusions and fissions. Transpositions shift a segment

from one location to another on a chromosome or, equivalently, exchange two adjacent and non-

overlapping segments on the chromosome. Inversions or Reversals reverse a chromosomal segment

and also exchange its strands. Exchanging two non-overlapping segments which are not necessary

adjacent on chromosome is called generalized transposition or Block-interchanges. The exchange of

a telomeric segment of one chromosome with that of other chromosome is termed as translocations.

Fusions combine two separate smaller chromosomes into a single bigger one and fissions involve

breakage of a chromosome into two smaller ones (Alekseyev, 2008; Alekseyev and Pevzner, 2008).

Advanced and low-error analytical strategies should be designed to use DNA polymorphism

for latest genetic applications. Approximately all DNA polymorphisms can be captured now with

the help of modern development of methods for detection of structural variants (Svs) and SNPs

including NGS. Still, there are many challenges to meet for highly accurate identification of DNA

polymorphisms since short-read sequencing strategy is difficult for analysis to infer chromosomal

context (Kidd et al. 2008; Graubert et al. 2007; Emerson et al. 2008).

DNA polymorphism analysis were also studied in B chromosomes of rye (Secale cereale)

(Martis  et  al.  2012)  and  Grasshopper (Eyprepocnemis  plorans)  (Muñoz-Pajares  et  al. 2011).

Numerical polymorphisms have been described in a large number of maize landraces (Mcclintock et

al. 1981;  Chiavarino et al. 1995;  Naranjo et al. 1995;  Rosato et al. 1998), with differences in the

number of B's being one of the main factors contributing to intraspecific genome-size variation.

http://www.genetics.org/content/177/2/895#ref-63
http://www.genetics.org/content/177/2/895#ref-48
http://www.genetics.org/content/177/2/895#ref-15
http://www.genetics.org/content/177/2/895#ref-46
http://www.genetics.org/content/177/2/895#ref-46


16

Silva  &  Yonenaga-Yassuda  (1998)  reported  a  conspicuous  heterogeneity  of  size,  morphology,

constitutive heterochromatin patterns and localization of telomeric sequences of B chromosomes for

the  rodent  Nectomys,  which  allowed  them to  suggest  differences  in  the  composition  of  these

chromosomes. A considerable amount of chromosome variability involving the presence of Bs and

polymorphisms  of  autosome  pairs  was  detected  microteiid  lizard  (Nothobachia  ablephara)

(Pellegrino et al. 1999).  Reports on the nucleotide level polymorphism of Bs are still extremely

scarce , therefore our work is a novel contribution to understand the association and significance of

polymorphism in B chromosome. 


17

2. Objectives

2.1. General Objective

 Reconstruction of  A. latifasciata genomes (B+ and B-) to analyze B chromosome

polymorphism and genomic diversity

2.2. Specific Objectives

 Generation of high coverage data using Illumina sequencing

 Assembly of B+ and B- genomes of A. latifasciata 

 Prediction of genes (structural annotation)

 Analysis of genome diversity

 Study of B chromosomes polymorphism

 Search of B specific genes in A. latifasciata  genome


18

3. Material and Methods

3.1. The model organism, Astatotilapia latifasciata

The cichlid fish  A. latifasciata (Pseudocrenilabrinae) is native from the lakes Kyoga and

Nawampasa (East Africa) and classical and molecular cytogenetic studies detected the presence of

one or  two B chromosomes in  some individuals  of  populations  farmed in Brazil  for  aquarium

hobbyist market (Poletto et al. 2010a; Fantinatti et al. 2011).

The presence of B chromosomes in A. latifasciata associated to the release of whole genome

sequences of several other African cichlids (The International Cichlid Genome Consortium, 2006)

(Brawand et al. 2014) makes this species an excellent model for genomic studies. Furthermore, A.

latifasciata can be easily maintained in captivity allowing directed crosses and sampling of tissues

for chromosome, DNA, RNA and protein analysis. Currently, populations of A. latifasciata without

B chromosomes (0B) and with 1 (1B) or 2 B chromosomes (2B) have being maintained at the fish

facility of the Integrative Genomics Laboratory at Institute of Biosciences/UNESP, Botucatu, SP,

Brazil. The use of these animals for biological samples were held according to ethical conventions

utilized  by  Brazilian  College  for  animal  experiments  accepted  by  ethic  committee  of  the

Biosciences Institute, UNESP-Sao Paulo State University.

3.2. Chromosome preparation, DNA sampling and sequencing

The  liver  tissues  of  A.  latifasciata samples  were  collected  and  karyotyped  by  classical

chromosome  preparation  protocols  employed  for  fish  to  check  the  presence  of  0,  1  or  2B

chromosomes. Additionally, these samples from males and females with 0B, 1B or 2 B chromosome

genomes were checked via polymerase chain reaction (PCR) and real time PCR (Fantinatti et al.

2016). High quality genomic DNA (gDNA) from liver tissues samples of A. latifasciata males and

females with 0B, 1B and 2B chromosomes selected for next generation sequencing (NGS).

Seven individuals including males and females with 0B, 1B and 2B chromosomes were

subject to Illumina sequencing using the service sequencing facility of University of Maryland,

USA ( "http://www.ibbr.umd.edu/facilities/sequencing) . Additionally NGS were also performed in

a MiSeq Illumina equipment available to our use in the Institute of Biosciences/UNESP, Botucatu.

All the generated data are already available at Sacibase (www.sacibase.ibb.unesp.br).

http://www.sacibase.ibb.unesp.br/


19

3.3. Illumina Next-Generation Sequencing

For the Illumina libraries, gDNA from the individuals were sheared to an average size of

350-550 bp using an S220 focused ultrasonicator (Covaris Inc., Woburn, MA). Separate libraries

were constructed for the gDNA samples using the TruSeq DNA sample preparation kit ver.2 rev.C

(Illumina Inc.,  San Diego, CA). Paired end (100-190 bp) reads sequencing of each library were

performed in separate lanes on an Illumina HiSeq 1000 sequencer. 

3.4. Pre-processing step of NGS data

At the point when the sequences data are prepared,  its need to checks whether a set  of

sequence  reads  in  a  .fastq  file  exhibit  any  bizarre  qualities  (which  might  indicate  either  low

sequence quality, or interesting biological features in sample). The reads quality was checked by the

software  FastQC  (website:  http://www.bioinformatics.babraham.ac.uk/projects/fastqc/)  and  the

ambiguous data need to filtered out, just leaving only a small fraction of usable, unique read pairs

for  assembly.  The  filtration  of  the  data  was  done  according  to  the  requirement  of  genome

assemblies by FASTX (http://hannonlab.cshl.edu/fastx_toolkit/) tool kit (-q 30, -p 90 parameters).

The illumina adapters  checking using blastn search (e-value ≤10e -5 and 90% of identity were the

cutoff  parameters), which  reads  with  some  hit  were  eliminated  through  customized  python

programming script. The not paired reads (singletons) were discarded  by Pairfq software for  de

novo assembling. Libraries related to one fragment paired-end were merged and make two datasets

for de novo assembling. While the sequences coverage( defines as a the average number of reads

per site) of all samples were calculated by coverage formula given as:coverage = (read count x read

length ) / total genome size, Read Counts= Total reads of genome, Read length= Expected length of

reads, Total genome= Total genome size / close related genome size (Metriaclima. Zebra)

3.5. Genome assemblies and quality evaluation

For choosing appropriate assembly software, it is important to consider both the amount of

sequencing data and which computational resources are available (Schatz et al. 2010a). For that


20

reason we used Assamblathon2 (Bradnam et al. 2013) recommended strategy for fish genome, using

the Velvet (v 1.2.08) (Zerbino and Birney, 2008) and SOAPdenovo2 softwares (Luo et al. 2012). 

To estimates the best k-mer length for genome  de novo  assembly, we ran KMERGENIE

(Chikhi and Medvedev, 2014) on pre-processed reads with putative k values of 31-91. The optimal

values of k were predicted to be 71-mer and using this k value for assembling.  The clean reads

comes from pre-proceesing step  consisting of all B- and B+ samples having male and female were

inserted  in  to  Velvet  (Zerbino  and  Birney,  2008) and  SOAPdenovo2  assemblers   to  construct

scaffold  level  de  novo  assemblies.  The  two  different  assembler  were  used  to  compare  both

assembler generated assemblies and at the end choose the best possible assembly. Velvet assembler

with  following  parameter  settings  for  command  line  run;  "-ins_length  500  ,  exp-cov  auto,

-unused_reads yes, read_trkg yes" were used for both assemblies. Other assembler SOAPdenovo

was  also  set  for  genome  assemblies  by  using  parameters:  “insert  length  350,  500  and  550,

asm_flags=3” were setted up to high coverage cleaned reads. Both assemblers generated a scaffold

level assemblies (B+ and B-). To close the gaps within scaffolds of B+ and B- assemblies to be

more accurate assemblies, GAPFILLER software (Boetzer and Pirovano, 2012) was operated. The

program was executed by ('–m’  = 80, ‘-t’= 10, '-g' = 5) settings. The final gap closed scaffolds

assemblies were constructed and used for further analysis. QUAST software (Gurevich et al. 2013)

was applied for evaluation of the generated scaffolds before and after gaps filled to computing

several  metric  values  (length,  number,  length  variation,  N50,  gap  length).  This  software  was

executed  in  two seperate  runs,  one  for  independent  evaluation  and other  with  to  compare  our

genomes with closest reference, M. zebra genome. The program was executed for eukaryotic genes

comparison. We therefore, evaluated assembly using two common quality measures: the contig N50

length, the assembly size. Finally, the B- scaffold level genome assembly was selected for future

analysis and used as a reference.

3.6. Structural annotation of genes 

The GeneMark-ES (Lukashin and Borodovsky, 1998) was used for eukaryotic ab-initio gene

prediction, and identification of a set of 453 core genes that are supposed to be highly conserved in

all eukaryotes, using CEGMA pipeline (version 2.4) .

For identification of protein coding genes in assembled genome, We used three approaches:

homology-based,  de novo  and transcript  sequences-based by using a  pipeline  MAKER v2.31.8


21

(Cantarel  et  al.  2008);  we  used  repetitive  elements  of  custom libraries  of  fish  and  Metazoan

elements,  Danio  rerio CDS  and  proteome  files  from (http://www.ensembl.org/Danio_rerio),  A.

latifasciata assembled transcriptomes accessed from (http://sacibase.ibb.unesp.br/), gene prediction

based on Lamprey training set and Blastn to NCBI database (National Center for Biotechnology

Information – http://www.ncbi.nlm.nih.gov/).

3.7. Genome diversity and genome structural variations analysis

The  genomic  diversity(Identification  of  unique  and  common  SNPs  &  INDELs  among

different individuals) was determined using the following procedures. The Illumina reads generated

for the different  A. latifasciata  genomes (males and females with 0B, 1B and 2B chromosomes)

were first filtered for alignments by FASTX (http://hannonlab.cshl.edu/fastx_toolkit/) tool kit (-q

20, -p 80 parameters). The trim reads were aligned against our draft B- assembly as a reference

using Bowtie2 (Langmead and Steven, 2012) --very-sensitive” option. Nucleotide polymorphism

was identified using SAMtools tool to search for genome variations among the samples. Nucleotide

polymorphism  was  identified  using  SAMtools  (http://samtools.sourceforge.net/)  to  search  for

genome variations among the samples. The output files (VCF format) were subjected to VCFtools

(vcf-stats and vcf-compare) (http://vcftools.sourceforge.net/) for statistical analysis to discover the

frequency of single nucleotide polymorphism (SNPs), insertion/deletion (INDELs) and to compare

these  factors  among  indiviuals  to  find  out  the  shared,  unique  or  B  related  population

polymorphism . Filtration was carried out to eliminate the lower quality (Q  ≤ 20 and DP ≥ 100)

SNPs and  INDELs using  vcffilter  (https://github.com/vcflib/).  Parameters  for  each  step  of  this

analysis were established according to the standard requirements. Similar approach was followed to

detect polymorphisms in genic sequences (CDS, exons and introns) of  A. latifasciata B+ and B-

genomes aligned against de novo transcriptome assembly (Marques et al. unpublished).

The structural  variations  (translocations)  were  detected  in  1B sequencing data  by Delly

(Rausch  et  al.  2012).  Delly  integrates  short  insert  paired-ends  and  split-read  alignments  to

accurately delineate genomic rearrangements throughout the genome. This pipeline was applied to

both B+ and B- reads  in  Bam format  (generated using bowtie  tool)  considered as  sample and

control respectively. Denovo genome assembly of A. latifasciata was utilized as reference to locate

the variations on the scaffolds regions. The detected translocations (breakpoints) were visualized by

ClicO (Cheong et al. 2015), an online web-service based on Circos (Martin et al. 2009).  To uncover


22

nature of these transclocation, We extracted randemly a few of these regions (515-520 MB) and

subjected to NCBI Blast (https://blast.ncbi.nlm.nih.gov) for annotation.

More  specifically,  structural  variations  such  as  deletions,  insertions,  transversions,

inversions and duplications in genomic regions related to B chromosome (B block: B+ reads having

higher coverage than B- reads) were analyzed by inGAP-sv tool (Qi et al. 2011). InGAP-sv detects

the structural variations (SVs) on the basis of paired end mapped reads pattern and coverage of

depth strategies. We applied this pipeline to B+ sam file generated using BWA (Li and Durbin,

2009) and  A. latifasciata genome was used as a reference.  After the SAM file was loaded into

inGAP-sv, user-defined threshold of mapping quality (default value: 20) was applied to filter non-

uniquely mapped reads. Illustrations of paired-end mapping (PEM) patterns for different types of

identified SVs were generated according to Qi et al (2011). 

3.8. Analysis of B localized sequences

The  nucleotide  sequences  of  several  previously  identified  B  chromosomes  genes  of

vertebrates (Makunin et al. 2014) supplementary (Table 1) were retrieved from the NCBI database

(https://www.ncbi.nlm.nih.gov/). Consensus sequences constructions were obtained using Geneious

v. 4.8.5 software (Drummond et al. 2009) for genes with more than one sequence available. All final

sequences  were used as  queries  against  the  A. latifasciata genome in a  standard blastn search.

Number of hits, E values and percent identity were considered to further proceed with B related

analysis. This analysis confirmed the existence of Ihhb (Indian Hedgehog B gene) and 45S rRNA

(18S  ribosomal  RNA,  internal  transcribed  spacer  1,  5.8S  ribosomal  RNA,  internal  transcribed

spacer 2, and 28S ribosomal RNA) gene sequences in A. latifasciata genome while the remaining

genes  were eliminated  from the project  because of  partial  or  complete  absence.  The generated

Illumina high coverage reads of all B- (0B male and 0B female) and B+ (1B Males, 1B Female and

2B Male) samples were aligned against both reference genes 45S rRNA and  Ihhb of the cichlids

Oreochromis aureus and  Lithocromis rubripinnis respectively, using paired-end mode of Bowtie2

(http://bowtie-bio.sourceforge.net/bowtie2/manual.shtml)  with  the  –very  sensitive  option.  The

output aligned files were converted to binary format and indexed using samtools. Each file was

normalized using RPKM package of deeptools (Ramírez et al. 2014) to fix bias of initial coverage.

These  files  were  then  visualized  by  integrated  genome  browser  IGB

(http://bioviz.org/igb/index.html) to compare coverage of both genes in B- and B+ samples. SNPs at

different sites of nucleotide reads were detected. According to Marques et al. (unpublished) a total

https://blast.ncbi.nlm.nih.gov/


23

of  36  libraries  were  generated  from  18  samples  of  transcriptomes  of  A.  latifasciata.  These

transcriptomes  were  divided  into  triplicates  of  gonads,  brain  and  muscle  of  male  and  female

individuals.  The uploaded transcriptomics  data  (http://sacibase.ibb.unesp.br/)  and genomics  data

(aligned  files)  were  visualized  and  screened.  We  did  BLAST alignment  of  the  genes  against

transcriptome assembly to locate them on specific scaffolds. The B specific SNPs were searched for

simultaneous viewing all tracks of available data against  transcriptome assembly  as a reference. 

3.9. Primer design, probes construction and FISH mapping of Ihhb and 45S rRNA genes

Primers  listed  in  the  supplementary  (Table  2) were  constructed  using  PrimerQuest

(http://www.idtdna.com/primerquest/home/index)  tool,  checked  for  specificity  by  Primer-Blast

(http://www.ncbi.nlm.nih.gov/tools/primer-blast/)  and  evaluated  by  Primer  Stat

(http://www.bioinformatics.org/sms2/pcr_primer_stats.html).  Genomic  DNA  was  subjected  to

polymerase chains reaction (PCR) to obtain DNA segments to be used as probes for fluorescent in

situ hybridization (FISH) mapping. DNA fragments obtained by PCR were sequenced (Sanger et al.

1977) using an ABI Prism 3100 automatic DNA sequencer (Applied Biosystems, Foster City, CA,

USA)  with  a  Dynamic  Terminator  Cycle  Sequencing  Kit  (Applied  Biosystems)  as  per  the

manufacturers’ instructions. Nucleic acid sequences were subjected to BLAST (Altschul et al. 1990)

searches at the NCBI database to check for similarities to deposited sequences of corresponding

genes. Probes were labelled with digoxigenin-11-dUTP (Roche Applied Science) and the signal was

detected with anti-digoxigenin-rhodamine (Roche Applied Science). FISH was performed using the

protocol described by Pinkel et al. (1986) with modifications (Cabral-de-Mello et al. 2012). The

slides were denatured in 70% formamide/2xSSC, pH 7, for 36 s, and dehydrated in an ice-cold

ethanol series (70%, 85%, and 100%). The images were captured with an Olympus DP71 digital

camera coupled to a BX61 Olympus microscope and were optimized for brightness and contrast

using Adobe Photoshop CS2.

http://www.bioinformatics.org/sms2/pcr_primer_stats.html
http://www.ncbi.nlm.nih.gov/tools/primer-blast/


24

4. Results

4.1. Illumina next generation sequencing

A total  of  seven  B-  and B+ samples  were  sequenced  using  Illumina  HiSeq and Miseq

platforms respectively. Each sample has different coverage of the corresponding genomes over the

entire length (848,776,495bp) of the M. zebra genome. Each sample has been given a name such as:

“M1-5” having different male samples and “F1-2” for female samples; 0-2 for  B-, 1B and 2B,

respectivelly; in (Table 1).

Table.1. Illumina data obtained for A latifasciata including mapped over M. zebra genome.

Samples Read length Coverage Coverage
after

filtration

 Total reads Remained reads 

after filtration

Reference

M1-0B 101 47.7x 38x 401,017,570 323226972(80%) Valente et al. 2014

F1-0B 191 75.5x 42.5x 337,349,994 190214688(56%) Present work

M2-1B 35-250 2x 1.6x 716,030,6 5911265 (82%) Present work

M3-1B 101 16.8x 13.4x 143,441,264 114064786(79%) Present work

M4-1B 101 43.1x 34.8x 366,602,572 296161988(80%) Present work

F2-1B 191 70.2x 40.1x 313,818,884 179286280(57%) Present work

M5-2B 101 43.6x 30x 306,823,512 254993955 (83%) Valente et al. 2014

The raw data was then filtered and illlumina adapters were trimmed and a total of filtrated

513,441,660  (77.05%)  of  B-  reads  with  80.5x  coverage  and  850,418,274  (74.7%)  with  120x

coverage of B+ reads were remained after filtration (Table 1).

4.2. De novo based B- and B+ genome assemblies

The B- and B+ draft assemblies were 771,316,069 bp and 781,068,509 bp of genome size

respectively, which indicates that 84% of the B- and 80% of B+ genomes (This values correspond

to  the  total  used  reads  by  assembler  during  assembly)  are  captured  in  scaffolds.  The de  novo

assembly yielded 197,652 number of scaffolds for the B+ genome, with N50 of  25.54  kb, and

218,259 number of scaffolds for B- genome with N50 of  18.640  kb  being the longest one with

23.86 kb and  23.36 kb respectively in (Table 2 and Supplementary Table 3). Our scaffold level

genome  assemblies  corresponding  to  the  closed  reference  M.  zebra genome  consisting  of


25

848,776,495 bp cumulative length and GC contents of 40.5% given in (Supplementary Figure 1a,b).

About (75.329%) and (74.556%)  of  M. zebra sequences  had a significant hit in the B+ and B-

genome assemblies respectively. 

The scaffold assembling gave a total of 3,998 and 3789 gaps per 100-kb of the assembled

genomes(B- and B+) filled to 2,076 and 1884 respectively. The comparative analysis by QUAST

(Gurevich et al. 2013) indicated that there was an improvement in assemblies after the gaps filling.

Total  length  of  B-  and  B+  genomes  increased  up  to  774,140,944  bp  and  782,153,984  bp

respectively. Additional statistics is given in Supplementary Table 3 and Table 2. 

Table 2. QUAST evaluations statistic of B+ genome.

Statistic B+ assembly B+ assembly after gaps filled

# of contigs 322328 -

Total Length of contigs 767906325 -

N50 of contigs 7997 -

GC (%) contigs 40.49 -

Largest contig 101668 -

# of scaffolds (>=500bp) 78064 78078

# scaffold(>= 0 bp) 197652 197652

Total length of scaffolds(>= 0 bp) 781068509 782153984

Total length of scaffolds(>=500bp) 756378434 757469399

Largest scaffold 238637 238740

GC (%) 40.48 40.51

Scaffolds N50 25546 25546

Scaffolds NG50 21574 21623

Scaffolds N75 11565 11570

Scaffolds NG75 7368 7429

Scaffolds L50 8086 8095

Scaffolds LG50 10075 10059

Scaffolds L75 19087 19109

Scaffolds LG75 26612 26517

# unaligned scaffold 22147 20980

Unaligned length 30818517 4751825

Genome fraction (%) 75.329 76.255

Duplication ratio 1.134 1.125

# N's per 100 kbp 3789.43 1884.55

Largest alignment  173293 180244

Remaining results are given in the manuscript attached as a chapter 1 in this thesis. The attached

manuscript will be reviewed and submitted to publication in scientific journal.


26

5. Discussion

The chapter 1 (manuscript) encompasses the discussion as a separate section.


27

6. Chapter 1

De novo genome assembly of the cichlid fish Astatotilapia latifasciata with focus in B

chromosome.

M. Jehangira, S. F. Ahmada, A. L. Cardosoa, G. T. Valenteb , C. Martinsa 
aDepartment of Morphology, Institute of Bioscience, UNESP -São Paulo State University, Botucatu,

SP, Brazil 
bBioprocess and Biotechnology Department, Agronomical Science Faculty, UNESP Sao Paulo State

University, Botucatu SP, Brazil. Email:maryam.bioinfo.unesp@gmail.com

Abstract

B chromosomes (Bs) are additional to the standard regular chromosome set (As), and present in all

groups of eukaryotes. The origin and role of Bs have been the objective of genome-wide studies. A

reference genome is key to understand genomics aspects of an organism. Here, we present the de

novo genome assembly of the cichlid fish  A. latifasciata: a well known model to study Bs. The

assembled draft genome comprised of 774 Mb with 1.8 Mb of N50 value of scaffolds and spanning

23,391 protein coding genes. High coverage data with Illumina sequencing was obtained for males

and females with 0B, 1B and 2B chromosomes to provide information regarding the population

polymorphism of  these  genomes.  We observed  a  high  scale  genomic  diversity  in  all  analyzed

genomes showing a high rate/frequency of population polymorphism with no evident effect of B

chromosome presence. However, the B specific single nucleotide polymorphisms were found in the

sequences  specially  located  on  B  chromosome.  Only   whole-genome  rearrangements  (inter

chromosomal  translocations)  were  detected  in  B+  genome,  and  structural  variations  including

insertions, deletions, inversions and duplications were predicted in a representative genomic region

of B chromosome. These results bring an evidence that existence of Bs  in a genome should favour

the  accumulations  of  mutations  and structural  polymorphisms in  the  amlified  genomic  regions

present on B chromosomes. In addition,  we also performed the coverage based sequence study

coupled with FISH mapping which revealed: 1) the existence of high copy number of inactive

Indian  Hedgehog  b  (Ihhb)  gene  on  B  chromosome  emerging  as  pseudogene  after  series  of

duplication events ultimately becoming a major structural component of B; 2) B chromosome have

incorporated the entire 45S RNA cluster (18S ribosomal RNA, internal transcribed spacer 1, 5.8S

ribosomal RNA, internal transcribed spacer 2, and 28S ribosomal RNA) from the A complement

mailto:maryam.bioinfo.unesp@gmail.com


28

set. The assembly of  A. latifasciata genome will serve as a reference for genetic analysis and the

approach presented in this paper opens the perspective to advance understanding B chromosomes

biology.

Keywords: Genome Assembly, Cichlid fish, B chromosome, Genome, Sequencing, polymorphism,

evolution

Introduction

B chromosomes (Bs are accessories to the standard regular chromosome set (As) that are

present in some individuals of more than 15% of eukaryotic species. These extra chromosomes do

not recombine with members of the basic A chromosomes complement and do not pursue the rules

of the Mendelian segregation law (Jones and Houben, 2003).  They are mostly heterochromatic,

comprising of a large amount of repetitive DNA and their presence are not needed for survival or

reproduction of the individuals (Camacho, 2005). But recently different studies have revealed that B

chromosomes keep transcriptionally active DNA sequences that could play some role in variety of

functions (Graphodatsky et al. 2005; Makunin et al. 2014; Teruel et al. 2010; Trifonov et al. 2013;

Yoshida et al. 2011). Several studies have also focused to investigate the origin of B chromosome in

various species of plants and animals, and found it be a derivative of standard A chromosomes.

(Alfenito and Birchler, 1993; Martis  et al. 2012; Silva et al. 2014; Valente et al. 2014). 

Chromosome biology, has gained remarkable advancement due to the recent development of

genomics. The applications of genomics, such as next generation sequencing (NGS) to the study of

B chromosome is on the rise. Recently, NGS technology has contributed to many studies involving

B chromosome analysis in plants, animals, and fungi (Camacho, 2005). The use of this technology

joined  with  increasing  information  of  genome organization  together  open  a  new chapter  in  B

chromosome studies (Makunin et al. 2014). The development of NGS has made a key impact in

assembling genomes of diverse organisms and studying their genomics features such as genetic

polymorphism and variations (Imelfort et al.  2009; Mahul et al.  2015). Approximately all DNA

polymorphisms can now be captured with the help of modern development of methods for detection

of structural variants (SVs) and SNPs using NGS data (Kidd et al. 2008). 

Despite an increase in many evidences about their origin, the gene content and pattern of

evolution of Bs, these genomic elements still remain an open question for the majority of species.

One  of  them  is  a  cichlid  fish  (Astatotilapia  latifasciata).  Among  the  African  species,  B


29

chromosomes were first described in  Astatotilapia latifasciata from Lake Nawampasa, a satellite

lake of the Lake Kyoga system (Poletto et al. 2010). Previous analysis applied to the B chromosome

in  A.  latifasciata were  based  on  cytogenetics  and  comparative  genomics  studies.  Cytogenetics

confirmed the both sexes of  A. latifasciata can have either one or two similar B chromosomes

enriched with many repetitive DNA sequences with no sex-specific differences (Poletto et al. 2010;

Fantinatti et al. 2011). While the cytogenomics revealed that the B chromosome of A. latifasciata

has intact genes and degenerated sequences derived from most of the A chromosome set; some of

the intact genes are potentially active for transcription. (Valente et al. 2014). However, a complete

overview regarding certain genomic features of the B chromosomes was not accomplished due to

lack of an assembled genome.

Here,  we present  a  de novo genome assembly of  A. latifasciata,  and its  annotation and

genetic level polymorphisms. This genomic assembly not only contributes to uncover the important

aspects of B chromosome but will also provide an extra resource for studying genomics features

such as variations, genes identification and sequence mapping in closely related species. Further, in

this  study,  we  have  explored  the  genomic  diversity  of  different  individuals  and  discussed  the

association of structural variations with B chromosome. The detection of extensively distributed

interchromosomal  rearrangements  allow  us  to  uncover  unknown  evolutionary  breakpoints  that

occurred in the A. latifasciata genome with B chromosome. We have also performed an analysis of

B chromosome sequences originated from A complement and their physical mapping. 


30

Methods

Chromosome preparation, DNA sampling and next generation sequencing data

The kidney tissues of  A. latifasciata samples were collected and karyotyped by classical

chromosome  preparation  protocols  employed  for  fish  to  check  the  presence  of  0,  1  or  2B

chromosomes. Additionally, these samples from males and females with 0B, 1B or 2 B chromosome

genomes were checked via polymerase chain reaction (PCR) and real time PCR (Fantinatti et al.

2016). High quality genomic DNA (gDNA) from liver tissues samples of A. latifasciata males and

females with 0B, 1B and 2B chromosomes selected for next generation sequencing (NGS). Seven

individuals  including males  and females  with  0B,  1B and 2B chromosomes  were  subjected  to

Illumina  sequencing  using  the  service  sequencing  facility  of  University  of  Maryland,  USA

(http://www.ibbr.umd.edu/facilities/sequencing).  For  the  Illumina  libraries,  gDNA  from  the

individuals were sheared to an average size of 350-550 bp using an S220 focused ultrasonicator

(Covaris Inc., Woburn, MA). Separate libraries were constructed for the gDNA samples using the

TruSeq DNA sample preparation kit ver.2 rev.C (Illumina Inc., San Diego, CA). Paired-end (100-

190 bp) sequencing of each library were performed in separate lanes on an Illumina HiSeq 1000

plataform.  Additionally,  one  sample  1B  male  was  sequenced  by  Miseq  paltform  equipment

available in the Institute of Biosciences/UNESP, Botucatu. 

The  reads  quality  was  checked  by  the  software  FastQC  (website:

http://www.bioinformatics.babraham.ac.uk/projects/fastqc/)  and  filtration  of  the  data  was  done

according  to  the  requirement  of  genome  assemblies  by  FASTX

(http://hannonlab.cshl.edu/fastx_toolkit/) tool kit (-q 30, -p 90 parameters). The illumina adapters

checking using blastn search (e-value ≤10e -5 and 90% of identity were the cutoff parameters),

which reads with some hit were eliminated through customized python programming script. The not

paired reads (singletons)  were discarded  by Pairfq software for  de novo  assembling.  Libraries

related to one fragment paired-end were merged and make two datasets for de novo assembling.

 The coverage of all samples were calculated by coverage formula given as: coverage =

(read count x read length)/total genome size, being read count = total reads of genome, read length

= expected length of reads and the total genome = total genome size/close related genome size

(Metriaclima zebra).


31

Genome assemblies and quality evaluation

For choosing appropriate assembly software, it is important to consider both the amount of

sequencing data and which computational resources are available (Schatz et al. 2010a). For that

reason we used Assamblathon2 (Bradnam et al. 2013) recommended strategy for fish genome, using

the Velvet (v 1.2.08) (Zerbino and Birney, 2008) and SOAPdenovo2 softwares (Luo et al. 2012). 

To estimates the best k-mer length for genome  de novo  assembly, we ran KMERGENIE

(Chikhi and Medvedev, 2014) on pre-processed reads with putative k values of 31-91. The optimal

values of k were predicted to be 71-mer and using this k value for assembling.  The clean reads

comes from pre-proceesing step  consisting of all B- and B+ samples having male and female were

inserted  in  to  Velvet  (Zerbino  and  Birney,  2008) and  SOAPdenovo2  assemblers   to  construct

scaffold  level  de  novo  assemblies.  The  two  different  assembler  were  used  to  compare  both

assembler generated assemblies and at the end choose the best possible assembly. Velvet assembler

with  following  parameter  settings  for  command  line  run;  "-ins_length  500  ,  exp-cov  auto,

-unused_reads yes, read_trkg yes" were used for both assemblies. Other assembler SOAPdenovo

was  also  set  for  genome  assemblies  by  using  parameters:  “insert  length  350,  500  and  550,

asm_flags=3” were setted up to high coverage cleaned reads. Both assemblers generated a scaffold

level assemblies (B+ and B-). To close the gaps within scaffolds of B+ and B- assemblies to be

more accurate assemblies, GAPFILLER software (Boetzer and Pirovano, 2012) was operated. The

program was executed by ('–m’  = 80, ‘-t’= 10, '-g' = 5) settings. The final gap closed scaffolds

assemblies were constructed and used for further analysis. QUAST software (Gurevich et al. 2013)

was applied for evaluation of the generated scaffolds before and after gaps filled to computing

several  metric  values  (length,  number,  length  variation,  N50,  gap  length).  This  software  was

executed in two separated runs,  one for independent  evaluation and other with to compare our

genomes with closest reference, M. zebra genome. The program was executed for eukaryotic genes

comparison. We therefore, evaluated assembly using two common quality measures: the contig N50

length, the assembly size. Finally, the B- scaffold level genome assembly was selected for future

analysis and used as a reference.

Structural annotation of genes 


32

The GeneMark-ES (Lukashin and Borodovsky, 1998) was used for eukaryotic ab-initio gene

prediction, and identification of a set of 453 core genes that are supposed to be highly conserved in

all eukaryotes, using CEGMA pipeline (version 2.4) (Parra et al. 2007).

For identification of protein coding genes in assembled genome, We used three approaches:

homology-based,  de novo  and transcript  sequences-based by using a  pipeline  MAKER v2.31.8

(Cantarel  et  al.  2008);  we  used  repetitive  elements  of  custom libraries  of  fish  and  Metazoan

elements,  Danio  rerio CDS  and  proteome  files  from (http://www.ensembl.org/Danio_rerio),  A.

latifasciata assembled transcriptomes accessed from (http://sacibase.ibb.unesp.br/), gene prediction

based on Lamprey training set and Blastn to NCBI database (National Center for Biotechnology

Information – http://www.ncbi.nlm.nih.gov/).

Genome diversity and genome structural variations analysis

The Illumina reads generated for the different  A. latifasciata  genomes (males and females

with  0B,  1B  and  2B  chromosomes)  were  first  filtered  for  alignments  by  FASTX

(http://hannonlab.cshl.edu/fastx_toolkit/)  tool kit  (-q 20,  -p 80 parameters). The trim reads were

aligned against our draft B- assembly as a reference  using Bowtie2 (Langmead and Steven, 2012)

--very-sensitive” option. Nucleotide polymorphism was identified using  SAMtools tool to search

for  genome  variations  among  the  samples.  Nucleotide  polymorphism  was  identified  using

SAMtools (http://samtools.sourceforge.net/) to search for genome variations among the samples.

The  output  files  (VCF  format)  were  subjected  to  VCFtools  (vcf-stats  and  vcf-compare)

(http://vcftools.sourceforge.net/)  for  statistical  analysis  to  discover  the  frequency  of  single

nucleotide polymorphism (SNPs), insertion/deletion (INDELs) and to compare these factors among

indiviuals  to  find  out  the  shared,  unique or  B related  population polymorphism. Filtration  was

carried out to eliminate the lower quality (Q ≤ 20 and DP ≥ 100) SNPs and INDELs using vcffilter

(https://github.com/vcflib/). Parameters for each step of this analysis were established according to

the  standard  requirements.  Similar  approach  was  followed  to  detect  polymorphisms  in  genic

sequences (CDS, exons and introns) of A. latifasciata B+ and B- genomes aligned against de novo

transcriptome assembly (Marques et al. unpublished).

The structural  variations  (translocations)  were  detected  in  1B sequencing data  by Delly

(Rausch  et  al.  2012).  Delly  integrates  short  insert  paired-ends  and  split-read  alignments  to

accurately delineate genomic rearrangements throughout the genome. This pipeline was applied to

both B+ and B- reads  in  Bam format  (generated using bowtie  tool)  considered as  sample and


33

control respectively. Denovo genome assembly of A. latifasciata was utilized as reference to locate

the variations on the scaffolds regions. The detected translocations (breakpoints) were visualized by

ClicO (Cheong et al. 2015), an online web-service based on Circos (Martin et al. 2009).  To uncover

nature of these transclocation, We extracted randemly a few of these regions (515-520 MB) and

subjected to NCBI Blast (https://blast.ncbi.nlm.nih.gov) for annotation.

More  specifically,  structural  variations  such  as  deletions,  insertions,  transversions,

inversions and duplications in genomic regions related to B chromosome (B block: B+ reads having

higher coverage than B- reads) were analyzed by inGAP-sv tool (Qi et al. 2011). InGAP-sv detects

the structural variations (SVs) on the basis of paired end mapped reads pattern and coverage of

depth strategies. We applied this pipeline to B+ sam file generated using BWA (Li and Durbin,

2009) and  A. latifasciata genome was used as a reference.  After the SAM file was loaded into

inGAP-sv, user-defined threshold of mapping quality (default value: 20) was applied to filter non-

uniquely mapped reads. Illustrations of paired-end mapping (PEM) patterns for different types of

identified SVs were generated according to Qi et al (2011). 

3.8. Analysis of B localized sequences

The  nucleotide  sequences  of  several  previously  identified  B  chromosomes  genes  of

vertebrates (Makunin et al. 2014) supplementary (Table 1) were retrieved from the NCBI database

(https://www.ncbi.nlm.nih.gov/). Consensus sequences constructions were obtained using Geneious

v. 4.8.5 software (Drummond et al. 2009) for genes with more than one sequence available. All final

sequences  were used as  queries  against  the  A. latifasciata genome in a  standard blastn search.

Number of hits, E values and percent identity were considered to further proceed with B related

analysis. This analysis confirmed the existence of Ihhb (Indian Hedgehog B gene) and 45S rRNA

(18S  ribosomal  RNA,  internal  transcribed  spacer  1,  5.8S  ribosomal  RNA,  internal  transcribed

spacer 2, and 28S ribosomal RNA) gene sequences in A. latifasciata genome while the remaining

genes  were eliminated  from the project  because of  partial  or  complete  absence.  The generated

Illumina high coverage reads of all B- (0B male and 0B female) and B+ (1B Males, 1B Female and

2B Male) samples were aligned against both reference genes 45S rRNA and  Ihhb of the cichlids

Oreochromis aureus and  Lithocromis rubripinnis respectively, using paired-end mode of Bowtie2

(http://bowtie-bio.sourceforge.net/bowtie2/manual.shtml)  with  the  –very  sensitive  option.  The

output aligned files were converted to binary format and indexed using samtools. Each file was

normalized using RPKM package of deeptools (Ramírez et al. 2014) to fix bias of initial coverage.

https://blast.ncbi.nlm.nih.gov/


34

These  files  were  then  visualized  by  integrated  genome  browser  IGB

(http://bioviz.org/igb/index.html) to compare coverage of both genes in B- and B+ samples. SNPs at

different sites of nucleotide reads were detected. According to Marques et al. (unpublished) a total

of  36  libraries  were  generated  from  18  samples  of  transcriptomes  of  A.  latifasciata.  These

transcriptomes  were  divided  into  triplicates  of  gonads,  brain  and  muscle  of  male  and  female

individuals.  The uploaded transcriptomics  data  (http://sacibase.ibb.unesp.br/)  and genomics  data

(aligned  files)  were  visualized  and  screened.  We  did  BLAST alignment  of  the  genes  against

transcriptome assembly to locate them on specific scaffolds. The B specific SNPs were searched for

simultaneous viewing all tracks of available data against  transcriptome assembly as a reference. 

Primer design, probes construction and FISH mapping of Ihhb and 45S rRNA genes

Primers  listed  in  the  supplementary  (Table  2) were  constructed  using  PrimerQuest

(http://www.idtdna.com/primerquest/home/index)  tool,  checked  for  specificity  by  Primer-Blast

(http://www.ncbi.nlm.nih.gov/tools/primer-blast/)  and  evaluated  by  Primer  Stat

(http://www.bioinformatics.org/sms2/pcr_primer_stats.html).  Genomic  DNA  was  subjected  to

polymerase chains reaction (PCR) to obtain DNA segments to be used as probes for fluorescent in

situ hybridization (FISH) mapping. DNA fragments obtained by PCR were sequenced (Sanger et al.

1977) using an ABI Prism 3100 automatic DNA sequencer (Applied Biosystems, Foster City, CA,

USA)  with  a  Dynamic  Terminator  Cycle  Sequencing  Kit  (Applied  Biosystems)  as  per  the

manufacturers’ instructions. Nucleic acid sequences were subjected to BLAST (Altschul et al. 1990)

searches at the NCBI database to check for similarities to deposited sequences of corresponding

genes. Probes were labelled with digoxigenin-11-dUTP (Roche Applied Science) and the signal was

detected with anti-digoxigenin-rhodamine (Roche Applied Science). FISH was performed using the

protocol described by Pinkel et al. (1986) with modifications (Cabral-de-Mello et al. 2012). The

slides were denatured in 70% formamide/2xSSC, pH 7, for 36 s, and dehydrated in an ice-cold

ethanol series (70%, 85%, and 100%). The images were captured with an Olympus DP71 digital

camera coupled to a BX61 Olympus microscope and were optimized for brightness and contrast

using Adobe Photoshop CS2.

http://www.bioinformatics.org/sms2/pcr_primer_stats.html
http://www.ncbi.nlm.nih.gov/tools/primer-blast/


35

Results

De novo assembly and gene predictions

A total of seven B- and B+ samples were sequenced using Illumina HiSeq and Miseq platforms

respectively.  Each sample has  different  coverage of  the corresponding genomes over the entire

length (848,776,495bp) of the M. zebra genome. Each sample has been given a name such as; “M1-

5” and “F1-2” for males and females, respectively; 0-2 for  B-, 1B and 2B, respectivelly; S1-3 for

different individuals of male samples (Table 1).

Table.1. Illumina data obtained for A latifasciata including mapped over M. zebra genome.

Samples Read length Coverage Coverage
after

filtration

 Total reads Remained reads 

after filtration

Reference

M1-0B 101 47.7x 38x 401,017,570 323226972(80%) Valente et al. 2014

F1-0B 191 75.5x 42.5x 337,349,994 190214688(56%) Present work

M2-1B 35-250 2x 1.6x 716,030,6 5911265 (82%) Present work

M3-1B 101 16.8x 13.4x 143,441,264 114064786(79%) Present work

M4-1B 101 43.1x 34.8x 366,602,572 296161988(80%) Present work

F2-1B 191 70.2x 40.1x 313,818,884 179286280(57%) Present work

M5-2B 101 43.6x 30x 306,823,512 254993955 (83%) Valente et al. 2014

The raw data was then filtered and illlumina adapters were trimmed and a total of filtrated

513,441,660  (77.05%)  of  B-  reads  with  80.5x  coverage  and  850,418,274  (74.7%)  with  120x

coverage of B+ reads were remained after filtration (Table 1). The B- assembling to accomplish

primary assembly of A. latifasciata genome having 771,316,069 bp, which indicates that 84% (This

value corresponds to the total used reads by assembler during assembly) of genome was captured in

scaffolds. The de novo assembly yielded 218,259 scaffolds with N50 of 18.640 kb (Supplementary

Table 3) being the longest one with 23.36 kb.

A.  latifasciata  draft  genome  has  correspond to  the  M.  zebra genome  consisting  of

848,776,495 bp cumulative  length,  GC contents  of  40.5% (Supplementary  Figure  1a,b).  About

75.329% of M. zebra sequences had a significant hit against our performed assembly. The scaffold

assembling gave a total of 3,998 gaps per 100-kb of the assembled genome filled to 2,076. The

comparative analysis indicated that there was an improvement in assembly after the gaps filling.


36

Total length of genome slightly increased up to 774,140,944 bp. Additional statistics is given in

Supplementary Table 3. 

We used the CEGMA pipeline (Parra et al. 2007) to evaluate the completeness of our assembly. The

total number of complete core eukaryotic genes (CEGs) are 183 with the percentage of 73%, and

number of partially complete CEGs are with the percentage of 94% supplementary data (Table 4).

The ab-initio gene model predicted 71,917 genes of eukaryotes, of them 23,391 are protein coding

genes.  The  23,391  annotated  genes  in  the  A.  latifasciata  genome  contain  204,656  exons  and

177,219 with longest gene size is 66982 bp. On average, there are 7 exons and 6 introns per gene.

Additional  information  (size,  number  and length)  of  major  structure  genome componets  are  in

(Figure  1;  Supplementary  Data  Table  5).  All  the  structural  annotation  has  been  uploaded  to

Sacibase.

Figure 1. Structural annotations of A. latifasciata genome. The bar graph shows the number of major genomic structural

component.


37

Genomic diversity analysis

In the genomes of six individuals (B- and B+), it was identified total 2,395,658 raw SNPs

and 888,060  INDELs with respect  to  reference B- assembly.  All  the individuals  carried  higher

number of SNPs than INDELs as given in (Supplementary Figure 2). These SNPs were forwarded

to filtration and detected total of 17,875 high quality SNPs in all genomes (Figure 2a). Out of these

total SNPs, the genome of 1B-female called significantly high number of SNPs (5,181). In case of

B+ individuals signaled 11,978 SNPs against B- reference genome. Although we mapped the same

reads of B- samples (male and female) against the same reference, still there appeared many SNPs

of 0B male and 0B female 3,800 and 2,097 respectively. Furthermore, Comparative analysis of

different SNPs combinations from five different samples confirm unique and shared SNPs variation

(Figure  2b).  We found that 1B-female has shared a higher number of SNPs with all  other five

individuals  genomes  and  2,936  SNPs  are  shared  among  all  individual  genomes  (Figure  2b).

Furthermore, we made use of genic sequences of Astatotilapia latifasciata B+ and B- genomes to

determine the mutation frequency. 


38

Figure  2.  Population  genomic  polymorphism.Venn  diagram using  different  colors  congruent  ellipses,  each  ellipse

showing different individual and the overlapping curves among and between samples representing shared SNPs while

the  points outside the boundary represents unique SNPs. The bar graphs show the frequency of high quality filtered

SNPs. Figures (a) and (c) represent the diversity analysis of genomic data aligned against our assembled genome, (b)

and (d) represent genomic data aligned against the transcriptome assembly. 

To analyze the differences in SNP frequencies of genic sequences, which should reflect the

presence or absence of selective pressure (Martis et al. 2012). This analysis was restricted to only 4

samples (two from each B+ and B- both male and female) in (Figure 2c). We found that all four


39

samples shared a total number of 2,047 SNPs (16.4%) and 0B male recorded the highest 1,766

(14.1%) unique  SNPs as  represented in  the venn diagram (Figure 2d).  Our results  (Figure  2c)

suggest the males were comparatively under lower SNPs rate than female with no evident effect of

B sequences.

Whole Genome rearrangement and structural variations

Structural  variations  of  the  genome  involve  kilobase  to  megabase-sized  deletions,

duplications,  insertions,  inversions,  and complex combinations  of  rearrangements (Korbel  et  al.

2007). A total of 625 interchromosomal translocations (breakpoints) were detected in the whole

genome  of  1B  individual  (Figure  3).  We  annotated  a  few  of  these  regions  with  identified

translocations, most of them were fragmented genes and non coding RNAs (Supplementary Figures

3, 4). Genomic regions related to B chromosome or B blocks was also subjected to reads-orientation

based on SVs detection method. Interestingly, we found duplications, insertions and inversions at

different  sites in B blocks  (Figure 4).  These results  contribute to understand the mechanism of

evolutionary  process  of  B  chromosome.  However,  these  polymorphic  events

might also be noticed in B- genomes in different number and pattern than B+ genomes. Several

sequence duplications detected in the B block indicated that these sequences from A chromosome

accumulated on B chromosome due to frequent duplication events. 

Sequence analysis and physical mapping of B sequences 

Blastn result summarized the highest identity and number of hits for  Ihhb and 45S rRNA

genes in  complete set  of B chromosome genes of vertebrates confirming their  existence in the

genome of A. latifaciata (Supplementary Table 5). The bowtie2 alignments of B + and B- Illumina

reads from A. latifasciata shows higher coverage from the B+ than the B- samples for both genes

(Figures 5a). The higher coverage of these genes in B+ sequence data shows their duplicated copies

presence on B chromosomes. The FISH mapping (the probes had 98-99% identity with  Ihhb and

45S rRNA genes  of different  vertebrates)  revealed extensive markings  of  Ihhb over  the two B

chromosomes in 2B metaphases, representing its repetitive nature.  The duplicated copies of Ihhb

gene emerged as a major structural component of B chromosomes in FISH results (Figure 6).  The

available RNA-seq data was analyzed for both genes, the absence of Ihhb gene transcripts indicated


40

that  it  was  not  transcribed.  We constructed  separate  probes  for  18S,  5.8S  and  28S  rRNAs  to

investigate if the complete cluster of 45S rRNA has moved from A complement to B chromosome.

Positive sites of 18S rRNA were observed over the pericentromeric and subtelomeric areas of the B

chromosome and on pericentromeric regions of some A chromosomes. The 5.8S rRNA probe were

marked on pericentomeric regions of B chromosome and in telomeric and subtelomeric regions of

autosomes. The 28S rRNA produced signals on telomeric regions and centromeric regions of A

chromosomes  and  that  of  B  chromosomes.  The  results  of  mapping  all  three  probes  on  B

chromosome provide evidence that Bs have incorporated the entire 45S rRNA cluster from the A

complement set. While transcript analysis indicated that there were some transcript found for 45S

rRNS gene.

We also conducted a survey to screen SNPs, INDELs at both genomics and transcriptomics

level.  The  Ihhb gene  has  encountered  few  B  specific  SNPs  and  INDELs. A high  number  of

population SNPs were found in 45S rRNA cluster at genomic level but these population SNPs were

located in spacer Dna, the 18S, 28 and 5.8 regions were no SNPs found viewed in Figure 5b,c and

Supplementary data Figure 6 & 7. 


41

Figure 3. Circos visualization of whole genome rearrangements of B+ genome. The outermost purple ring represents the

A. latifaciata de novo assembled genome in Mega bases. The second fragmented ring represents those scaffolds with B+

rearrangement.  The  red  links  show  genomic  regions  of  B+  reads  with  Intra-chromosomal

rearrangements(translocations) or breakpoints.


42

Figure 4. Different structural variations including duplications, deletions, inversions and transversions in a randomly

selected B block (lower line) against the reference genome (upper line) Illustrations of paired end mapping (PEM)

patterns for different types of SVs are described as follow: Grey links indicate normally mapped read pairs with proper

read orientation and distance. Light blue links represent read pairs with proper read orientation but longer distance,

which may indicate a deletion event in the query sequence. Green links represent read pairs with proper orientation but

shorter distance, and thus indicate an insertion. Dark blue links show read pairs with abnormal orientation, in which

paired ends are mapped to the wrong strand(s). Yellow lines indicate single-end mapped reads (SE reads), in which only

one of the paired reads is mapped. For a small insertion (< the insert size), a fraction of paired reads (in green) that span

the insertion is mapped too closely in the reference. The insertion is surrounded by a set of single-end mapped reads (in

yellow). A translocation is represented by two sets of distantly mapped pairs and one set of inverted mapped pair (in

dark blue). An inversion causes the paired reads to change the orientation, and both ends will map to the same strand.


43

Segmental tandem duplication is represented by one set of distantly mapped reads and one set of inverted mapped

reads.

Figure 5. Sequence analysis of B related genes. (a) Reads coverage of Ihhb and 45S rRNA gene sequences. Notice the

scale bar on left to differentiate the coverage rate between 0B and 2B samples. (b) Number of population and B specific

SNPs are shown as red and blue respectively in the bars for both genes. (c) Model representation of genes to elaborate

their structure and localize the positions of SNPs in regions with comparison of B+ and B- individuals.  


44

Figure 6. FISH mapping of Ihhb and  45S rRNA genes. Images of  metaphases stained with DAPI, probes and merged

are shown for each sequence. Note the signals markings on the B chromosomes.


45

Discussion

Our analysis  bring  several  contributions  including:  1)  generation  of  a  A.  latifasciata reference

genome; 2) screening the high coverage sequencing data of different individuals for population

polymorphism,  genetic  diversity  and  to  discover  B  specific  structural  variations;  3)  searching

different B-linked genes in the de novo genome of A. latifaciata and show a relative copy number of

selected genes between B+ and B- samples by coverage analysis; 4)  physically mapping these

genes on B chromosomes and validate the sequence analysis.

De novo scaffolding of A. latifasciata genome

In the present study, we obtained a high deep dataset of B- and B+ individuals for genome

assembling of A. latifasciata.  The evolutionary aspects of fish genomes, especially cichlids, are

being extensively studied (Brawand et al. 2014). Our motivation to sequence the first A. latifasciata

genome stemmed from the fact that the species is rapidly becoming important for B chromosome

evolution and good model for genetic diversity with other cichlid species. The comparison of our

work with other  african  cichlid  fish assemblies  presented  by the  International  Cichlid  Genome

Consortium project (Brawand et al. 2014),  in terms of N50, genome size, genes number and % of

GC in Table 2, showed corresponding values with other cichlids. It indicated that our reconstructed

genomes can be assumed as the possible de novo draft assembly of our model species performed.

According to Assemblathon 2 project (Bradnam et al. 2013) the results should not be only relied on

the basis  of  a  single assembly and parameter,  therefore we also reconstructed our  genomes by

SOAPdeno2 (Luo et al. 2012) assembler to assure much better results. However, we neglected the

genome given as output by SOAPdenovo2 because of its lower accuracy as compared to Velvet. 


46

Table 2. Comparison of the current assembly to others african cichlid fish assemblies presented by the International

Cichlid Genome Consortium project (Brawand et al. 2014).

Locality Species African Riverine Lake 
Victoria

Lake 
Tanganyika

Lake 
Malawi

Latimeria 
chalumnae

Gasteosteus
aculeatus

A. latifasciata

(Our  draft

assembly)

O. niloticus   A. burtoniP. nyererei N. bricherdi M. zebra

Estimated 
genome size(Gb)

1.01             0.923 0.993 0.98 0.946 2.85 0.53 0.771

Scaffold 
N50(kb)

2.8                1.2 2.5 4.4 3.7 0.92 10.8 1.8

% GC content 40.42          40.51 40.6 40.44 40.54 41.15 44.6 40.5

Protein coding 
gene count

24,559      23,436 20,611 20,119 21,673 19,033 20,787 23, 391

Genomic diversity analysis between B- and B+ individual genomes

The Illumina reads coverage obtained for several samples (including males and females and B+ and

B- genotypes) were useful to generate a population genomics scenario for the species, and made it

possible to analyze the variants specific to B chromosomes and comparative analysis of genomic

diversity between B+ and B- genomes. Previous studies have utilized the same strategy to to screen

whole genome for distribution of SNPs and INDELs using illumina sequencing data and proposed

this method to be useful in understanding genomic diversity (Loh et al. 2008; Gudbjartsson et al.

2015). Our study is in attempt to deal with investigation of the similarities and differences between

the  two  genomes  (B+  and  B-)  of  the  same  species  using  the  same  approach  as  described  in

aforementioned  reports.  B  chromosome  studies  might  be  interesting  to  analyze  different

polymorphism  such  as  SNPs,  INDELs,  duplications  and  rearrangements.  B  chromosome  is

originated by duplications and expansion of DNA copies from A chromosomal set (Kellis et al.

2004; Martis et al. 2012; Valente et al. 2014). We explored the genomics data in order to present a

unique brief description about the whole genome polymorphism concerned with B chromosomes,

considering  that  the  hypothesis  that  B+  genome  would  probably  have  accumulated  many

polymorphic DNA sites as compared to B- genome. Our results indicated a higher polymorphism

(SNPs and INDELs) in B+ female as compared to B- female at genomic level. However, in other

B+ individuals, no association of B chromosome was detected to effect the frequency of SNPs. We

applied a similar approach suggested by (Martis et al. (2012) to find out the presence or absence of

selective pressure on the genic sequences.  Our analysis  concluded no effect either  on selective


47

pressure with respect to the B chromosome existence in the genome (B is not related to the higher

frequency of SNPs in  A. latifasciata). This is in contrast to the findings of Martis et al (2012),

which identified a higher frequency of SNPs due to the presence of B chromosome signifying the

lower selective pressure on B genic sequences. For better and precise understanding the relationship

of  polymorphism and  variation  with  supernumerary  chromosomes,  we needed  a  more  specific

analysis, which rather focus on individuals DNA sequences or genes. This way we carried out an

analysis specific to  Ihhb gene and interestingly found B specific SNPs as discussed laterin next

section.  The whole genome diversity  analysis  was a  novel  strategy to  study the  B related and

population based polymorphism in A. latifasciata and has never been performed. This approach was

aimed to generalize an overall view about the mutual similarities and differences between the B+

and B- genomes.

Exploring duplicates of Ihhb and 45S rRNA genes

The findings of our study are helpful to add another evidence in explaining chromosomal

origin of the supernumerary elements. As previously described by numerous reports that the B is

derived from autosomes (Alfenito and Birchler 1993; Houben et al. 2001; Peng et al. 2005; Martis

et  al.  2012),  it  has  been  proposed  that  all  autosomes  have  contributed  sequences  to  the  B

chromosome of  A. latifasciata through gene duplication (Valente et al. 2014). Among the genes

discovered on B chromosome of vertebrates, there is a morphogenesis-related gene named indian

hedgehog b (Ihhb) gene detected in the genome of the cichlid Lithocromis rubripinnis (Yoshida et

al. 2011). This study found more than 40 copies of the  Ihhb paralogs on B chromosomes and a

single  copy of  Ihhb ortholog on the  A chromosomes by qPCR in  L.  rubripinnis.  Our  genome

analysis  of  this  gene  in  A.  latifasciata  followed  by  FISH  mapping  provide  a  novelty  of  its

organization and duplication on B chromosome. In general,  B chromosomes are rich in several

classes of repetitive DNA, including 5S and 45S ribosomal DNA (rDNA), satellite DNA, histone

genes,  small  nuclear  DNA,  mobile  elements,  and  organellar  sequences  (Friebe  et  al.  1995;

Camacho,  2005;  Houben  et  al.  2013).  The  presence  of  18S  rRNA gene  sequences  in  the  B

chromosomes suggests a possible origin from the A chromosomes, which carries rRNA genes. The

B chromosomes of A. latifasciata have accumulated 18S rRNA gene sequences on the subtelomeric

and pericentromeric regions (Fantinatti et al. 2011). Based on previous description of ribosomal

genes there rises a hypothesis that the complete 45S ribosomal DNA (rDNA) cluster would have

started amplifying its copies from A complement and moved to B chromosomes during initial stages


48

of its evolution and ultimately lead to formation of proto-B chromosome. From the sequencing data

analysis, we found that some regions of the cluster indeed showed a higher coverage in B+ samples

as compared to B-,  however no distinct  differences were found in overall  coverage.  Our FISH

mapping results of 18S rRNA and 28S rRNA confirmed  the previous findings made by Poletto et

al. (2010) and Fantinatti et al. (2011) . Further, the 5.8S rRNA marks on B chromosome enabled our

hypothesis about the movement of whole cluster from A to B chromosome to be correct. 

A search for the transcripts of both Ihhb and 45S rRNA genes was performed in RNA-seq

data of different tissues including brain, gonads and muscle to check if their B copies are expressed

or  inactive.  We find  transcripts  of  45S rRNA in  some tissues,  which  prove  that  the  cluster  is

transcribed  in  A.  latifasciata  genome.  Similarly,  this  analysis  suggests  that  Ihhb gene  is  not

transcribed and evolved as a pseudogene as a result of its series of sequence duplication. Previous

studies also confirmed the occurence of pseudogenization of  B-located gene-like fragments with

only 15% of these pseudogenes transcribed in rye plant (Banaei-Moghaddam et al. 2013). 

Genome-wide rearrangements, structural variations and Polymorphisms in B+ genome. 

One of the key aspects in the B chromosomes studies is to understand the nature of genetic

polymorphisms/variations and their evolutionary importance in the origin of B chromosome (Martis

et  al.  2012;  Muñoz-Pajares et  al. 2011).  Although, most of the genomic studies are focused to

identify single-nucleotide polymorphisms (SNPs) or small insertion deletion (INDELs), different

types  of structural  variations (SVs)  including complex rearrangements,  duplications,  inversions,

large deletions and insertions can also be detected using a variety of  computational approaches

(Keane et al. 2014). Indeed, the SVs play a major role in evolutionary processes such as adaptation

(Iskow et al. 2012) and speciation (Noor et al. 2001). A study reported by Fan and Meyer (2014)

gives a detailed overview of different kinds of genomic variation across four recently diverged

cichlid  lineages  and  postulate  on  the  correspondence  of  SVs  for  one  of  the  largest  adaptive

radiations  in  vertebrates.  A remarkable  contribution  to  understand  the  evolutionary  biology  of

chromosome is achieved by revelation of SVs in many organisms (Bickhart and Liu, 2014; Keane et

al.  2014).  The  identification  of  many  rearrangements  has  also  been  applied  to  explain  the

mechanism of sex chromosomes evolution (Rogers, 2015). There are several reports that superpose

B  and  sex  chromosomes,  with  evidences  of  sex  chromosomes  origin  from  domesticated  B

chromosomes  (Martins  et  al.  2011)  and  also  the  B  presence  influencing  the  sex  development

https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4165313/#B11
http://journal.frontiersin.org/article/10.3389/fgene.2014.00326/full#B6
http://journal.frontiersin.org/article/10.3389/fgene.2014.00326/full#B6
http://journal.frontiersin.org/article/10.3389/fgene.2014.00326/full#B13
http://journal.frontiersin.org/article/10.3389/fgene.2014.00326/full#B9
https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3753381/#def1


49

(Yoshida et al. 2011). Therefore, B and sex chromosomes have become one of the major focuses in

chromosome biology studies because they offer an excellent model to investigate mechanisms of

chromosome changes during evolution that, in some cases, are incipient. In this sense, we have

carried out the whole genome rearrangement analysis to understand the molecular mechanism of B

chromosome evolution.

 In addition to the classical application of NGS to recover whole genomes, the paired-end approach

can locate chromosomal breakpoints (Chen et al.  2010). Several repo