FACULDADE DE CIÊNCIAS AGRÁRIAS E VETERINÁRIAS UNIVERSIDADE ESTADUAL PAULISTA CÂMPUS DE JABOTICABAL GENOME-WIDE ASSOCIATION STUDY OF REPRODUCTION TRAITS IN NELORE CATTLE, INCLUDING ADDITIONAL PHENOTYPIC INFORMATION FROM NON-GENOTYPED ANIMALS Thaise Pinto de Melo Zootecnista 2015 FACULDADE DE CIÊNCIAS AGRÁRIAS E VETERINÁRIAS UNIVERSIDADE ESTADUAL PAULISTA CÂMPUS DE JABOTICABAL GENOME-WIDE ASSOCIATION STUDY OF REPRODUCTION TRAITS IN NELORE CATTLE, INCLUDING ADDITIONAL PHENOTYPIC INFORMATION FROM NON-GENOTYPED ANIMALS Thaise Pinto de Melo Advisor: Dr. Roberto Carvalheiro Co-Advisor: Prof. Dra. Lucia Galvão de Albuquerque Dissertation presented to The Faculdade de Ciências Agrárias e Veterinárias Unesp, Câmpus de Jaboticabal, in partial fulfilment of requeriments for the degree of Mestre em Genética e Melhoramento Animal (Master in Animal Breeding and Genetics) 2015 Melo, Thaise Pinto de M528g Genome-wide association study of reproduction trais in Nelore cattle, including additional phenotypic information from non-genotyped animals / Thaise Pinto de Melo. – – Jaboticabal, 2015 xi, 73 p. : il. ; 28 cm Dissertação (mestrado) - Universidade Estadual Paulista, Faculdade de Ciências Agrárias e Veterinárias, 2015 Orientador: Roberto Carvalheiro Banca examinadora: Idalmo Garcia Pereira, Henrique Nunes de Oliveira Bibliografia 1. Age at first calving. 2. Heifer rebreeding. 3. GWAS. I. Título. II. Jaboticabal-Faculdade de Ciências Agrárias e Veterinárias. CDU 636.082:636.2 Ficha catalográfica elaborada pela Seção Técnica de Aquisição e Tratamento da Informação – Serviço Técnico de Biblioteca e Documentação - UNESP, Câmpus de Jaboticabal. DADOS CURRICULARES DO AUTOR THAISE PINTO DE MELO - was born in Natal, Rio Grande do Norte, Brazil, on February 16th 1991, as single child of Maria Necy de Melo and José Coelho Pinto. She started her undergraduate in Zootecnia (Animal Science) at Federal University of Rio Grande do Norte, Natal, on February 2008. During her undergraduate course she had involvement in university extension and scientific projects as internship or volunteer member. She started her animal breeding studies in one of these projects, which was focused in survival analyses of simulated data, under advice of Prof. Elizângela Emídio Cunha. After graduating, she started a M.Sc. within postgraduate program in Animal Breeding and Genetics, receiving financial support from Capes and FAPESP. In her second year of Master course she conducted a research project in Canada under advice of Prof. Flavio Schramm Schenkel, receiving financial support from FAPESP. In February 2015, she defended her Master dissertation under advice of Dr. Roberto Carvalheiro and Prof. Lucia Galvão de Albuquerque. Epigraphy Quando você partir, em direção a Ítaca, que sua jornada seja longa repleta de aventuras, plena de conhecimento. Não tema Laestrigones e Cíclopes nem o furioso Poseidon; você não irá encontrá-los durante o caminho, se você não carrega-los em sua alma, se sua alma não os colocar diante de seus passos. Espero que sua estrada seja longa. Que sejam muitas as manhãs de verão, e que o prazer de ver os primeiros portos traga uma alegria nunca vista. Procura visitar os empórios da Fenícia e recolha o que há de melhor. Vá as cidades do Egito, e aprenda com um povo que tem tanto a ensinar. Não perca Ítaca de vista, pois chegar lá é o seu destino. Mas não apresse os seus passos; é melhor que a jornada demore muitos anos e seu barco só ancore na ilha quando você já estiver enriquecido com o que conheceu no caminho. Não espere que Ítaca lhe dê mais riquezas. Ítaca já lhe deu uma bela viagem; sem Ítaca, você jamais teria partido. Ela já lhe deu tudo, e nada mais pode lhe dar. Se, no final, você achar que Ítaca é pobre, não pense que ela lhe enganou. Porque você tornou-se um sábio, e viveu uma vida intensa, e este é o significado de Ítaca. Konstantinos Kaváfis DEDICATION To God, for the omnipresence in my life, for hearing me, advising me and being with me in all moments. To my family, specially my parents for their support and care, for understanding my decisions and for giving me support for the realization of my dreams, I dedicate this dissertation. ACKNOWLEDGEMENTS I would like to thank God for too many things, so I just say, thanks God for everything. You are wonderful and merciful. I thank my parents for all support. I thank Leandro for the patience in these two years, for hearing and helping me in many moments. I thank my childhood friends, especially Brenda and Heloisa for their friendship and fellowship. My “Zoo friends” from UFRN, specially Alane and Thayana, who are great friends until today. My “salinha’s friends” and friends from exact sciences (FCAV), for supporting me and for the interesting discussions. My Jaboticabal’s friends, Kamila, Giovana (and family) and Luciana, for friendship and support me. I wish to thank to all friends and colleagues I met during this master course. I would like to express my gratitude to all professors of postgraduate program on Genetics and Animal Breeding at FCAV (Genética e Melhoramento Animal), for all transmitted teaching. I thank to the Faculdade de Ciências Agrárias e Veterinárias - Unesp Jaboticabal (FCAV / Unesp) and all staffs for all support provided to the development of this dissertation. The support and incentive of the postgraduate program on Genetics and Animal Breeding at FCAV/Unesp is gratefully acknowledged. I thank the financial support from Coordenação de Aperfeiçoamento de pessoal de nível superior (CAPES) and Fundação de Amparo à Pesquisa (FAPESP), process no 2013/17396-4 and no 2014/09603-2. Also financial supports from CNPq (no 559631/2009-0) and FAPESP (no 2009/16118-5) are acknowledged. I thank DeltaGen® breeding program for providing the data used in this study. I thank the University of Guelph and all staffs for all support given in my research internship. I’m really grateful to Prof. Flavio Schenkel, for all support, attention, suggestions to improve this work, and for spent his time transmitting me valuable knowledge. I thank all friends that I made in Canada, specially Mrs. Elvira, Mr. Elmer and Mrs. Linda and Mr. Dave, for the care, patience and for receiving me so lovely. I would like to thank Prof. Elizângela, my advisor in Animal Science undergraduate course, who always challenge me, since the Quantitative Genetics classes, what awoke in me the love by genetics. I’m also grateful to my forage Professor, Emerson, who warned me about my gift to teach, what encouraged me to follow the academic area. I would like to express my gratitude to all members of the internal and external examining comitees: Dr. Henrique Nunes de Oliveira, Dr. Idalmo Pereira Garcia and Dr. Danísio Prado Munari for their valuable suggestions and contributions to improve this dissertation. I’m thankful to my advisors, Dr. Roberto Carvalheiro for his patience, spent his valuable time teaching me the main part of genomic knowledge that I have now and for supporting me in the conduction of this work, and Dra. Lucia Galvão de Albuquerque for her valuable contribution for this work and for having accepted me in her study group. i CONTENTS Page CHAPTER 1 – GENERAL CONSIDERATIONS ............................................... ii Abstract ......................................................................................................... ii Resumo ......................................................................................................... iii Introduction ..................................................................................................... 1 References ..................................................................................................... . 5 CHAPTER 2 – IMPORTANCE OF INCORPORATING PHENOTYPIC INFORMATION OF NON-GENOTYPED ANIMALS IN GENOME-WIDE ASSOCIATION STUDIES ............................................................................... 9 Abstract ....................................................................................................... 9 Introduction ................................................................................................ 11 Material and Methods ................................................................................ 12 Results and Discussion ............................................................................. 18 Conclusions ............................................................................................... 28 References ................................................................................................ 28 CHAPTER 3 – GENOME WIDE ASSOCIATION STUDY OF HEIFER REBREEDING IN NELORE CATTLE ....................................................................................... 34 Abstract ..................................................................................................... 34 Introduction ................................................................................................ 35 Material and Methods ................................................................................ 36 Results and Discussion ............................................................................. 40 Conclusions ............................................................................................... 47 References ................................................................................................ 47 Appendix ................................................................................................... 53 ii GENOME WIDE ASSOCIATION OF REPRODUCTIVE TRAITS IN NELORE CATTLE, INCLUDING PHENOTYPIC INFORMATION FROM NON- GENOTYPED ANIMALS ABSTRACT – This dissertation was divided in three chapters, the first one is a literature review about the subject that will be discussed in subsequent chapters. In the second chapter, a genome wide association study (GWAS) for age at first calving (AFC) in Nelore cattle was performed, using real and simulated data, aiming to 1) assess if additional phenotypic information from non-genotyped animals affect QTL mapping of AFC; 2) evaluate, by simulation, if this additional phenotypic information contributes to detect QTLs more precisely for a low heritable complex trait, and with few available genotypes. The third chapter presents a GWAS for heifer rebreeding (HR) in Nelore cattle. In chapter two, GWA studies were performed using Bayes C and weighted single step GBLUP (WssGBLUP) methods and the top 10 marker windows (1Mb) that explained the larger proportion of variance for AFC were identified and further explored. Two scenarios were investigated, one including all females with available phenotypic information (SI scenario, with 43,482 females), and the other including just the females with available genotype (SII scenario, with 1,813 females). Three iterations were performed in WssGBLUP, recomputing the animals and SNPs effect in each subsequent iteration. It was simulated a population mimicking the parameters and the structure of the real dataset. Two different disequilibrium linkage levels (low and high) between adjacent markers were simulated. In chapter three, the data consisted of 142,878 HR phenotypic records and 2,923 genotypes. The GWAS was performed with WssGBLUP method using three different weightings (iterations) for the SNP effects. Total genetic variances were calculated for the top 10 1Mb SNP-windows, detected by each iteration. On each subsequent iteration, the genetic variance was distributed for a smaller number of SNPs, and the SNP effects were recomputed. Genes possibly associated with HR were searched to reinforce the suggestive importance of the detected windows. The results from chapter two revealed that considering or not additional phenotypic information influenced more the SNP effect estimates than the applied method of GWAS. Although most of genomic regions indicated by different analyses were not the same, some coincidence was observed. Some identified regions presented previously reported QTLs for reproductive traits. The results of simulated data indicated that including all available phenotypic information, even of non- genotyped animals, can improve QTL detection of complex low heritability traits. In chapter three, the GWA analyses detected a total of 21 different windows that harbored 13 QTLs previously reported in the literature for different reproductive (or related) traits. The top 10 marker-windows contained 182 annotated genes. Some of them were associated with pathways of reproductive traits. Evidence was found in this study that important candidate genes affecting HR in Nelore cattle were identified. Keywords: age at first calving, Bayes C, GWAS, heifers rebreeding, weighted single step GBLUP iii ASSOCIAÇÃO GENÔMICA AMPLA DE CARACTERÍSTICAS REPRODUTIVAS EM BOVINOS DA RAÇA NELORE, INCLUINDO INFORMAÇÃO FENOTÍPICA DE ANIMAIS NÃO GENOTIPADOS RESUMO – Esta dissertação foi dividida em três capítulos, o primeiro é uma revisão de literatura sobre o assunto que será discutido nos capítulos seguintes. No segundo capítulo, foi realizado um estudo de associação genômica ampla para a característica idade ao primeiro parto (IPP) em gado Nelore, que objetivou: 1) avaliar se a informação fenotípica adicional dos animais não genotipados afeta o mapeamento de QTLs da IPP; avaliar, por simulação, se esta informação fenotípica adicional contribui para detectar QTLs mais precisamente para uma característica complexa, com poucos genótipos disponíveis. O terceiro capítulo apresenta um estudo de associação para a característica reconcepção de primíparas (RP) em gado Nelore, cujo objetivo foi detectar importantes regiões genômicas (QTLs) associadas com esta característica. No capítulo dois, estudos de associação foram realizados utilizando as metodologias Bayes C e “Weighted single step GBLUP” (WssGBLUP) e as 10 janelas de marcadores (de 1Mb) que explicaram a maior proporção da variância para IPP foram identificadas e exploradas. Dois cenários foram investigados, um incluindo todas as fêmeas com informação fenotípica disponível (cenário SI – 43.482 fêmeas), e outro incluindo apenas as fêmeas com genótipo disponível (cenário SII – 1.813 fêmeas). Três iterações foram realizadas no método WssGBLUP, sendo recomputados os efeitos dos animais e dos SNPs a cada iteração. Foi simulada uma população com parâmetros e estrutura similares aos dos dados reais, com dois diferentes níveis de desequilíbrio de ligação (alto e baixo) entre os marcadores adjacentes. No capítulo três, os dados consistiram de 142.878 registros fenotípicos de RP e 2.923 genótipos. Os estudos de associação foram realizados com o método WssGBLUP usando três diferentes pesos (iterações) para os efeitos dos SNPs. Para cada iteração subsequente a variância genética foi distribuída para um número menor de SNPs, e os efeitos dos SNPs foram recomputados. Genes possivelmente associados com RP foram pesquisados para reforçar a importância sugestiva das janelas detectadas. Os resultados do capítulo dois indicaram que considerar ou não a informação fenotípica adicional influenciou mais os efeitos estimados dos SNPs do que o método aplicado nos estudos de associação. Embora a maioria das regiões genômicas indicadas pelas diferentes análises não terem sido as mesmas, alguma coincidência foi observada. Algumas regiões associadas com IPP apresentaram QTLs previamente reportados para características reprodutivas. Os resultados dos dados simulados indicaram que incluir toda a informação fenotípica disponível, mesmo dos animais não genotipados, pode melhorar a habilidade de detecção de QTLs de características complexas de baixa herdabilidade. Os estudos de associação do capítulo três detectaram um total de 21 diferentes janelas, as quais abrigaram 13 QTLs previamente reportados na literatura para diferentes características reprodutivas (ou associadas). As top 10 janelas de marcadores continham 182 genes descritos. Alguns deles estavam associados com vias metabólicas de algumas características reprodutivas. Foi encontrada evidência iv neste estudo que importantes genes candidatos afetando a RP em gado Nelore foram identificados. Palavras-chave: Bayes C, idade ao primeiro parto, GWAS, reconcepção de primíparas, weighted single step GBLUP 1 CHAPTER 1. GENERAL CONSIDERATIONS Introduction Brazil is one of the world’s largest beef meat exporters (USDA, 2014). Most of the Brazilian meat production comes from Zebu cattle, especially Nelore breed that is the most populous breed in the country. Nelore cattle are well adapted to climatic conditions and to Brazil’s production systems, but a common problem related to Zebu cattle is the low reproductive efficiency. Reproductive traits are associated with the success of beef cattle operations (CAMPOS et al., 2005). Age at first calving (AFC) is one of the reproductive traits used as selection criteria for Nelore females and affect directly the herd productivity. AFC is associated with animals’ puberty, and its reduction is associated with reduction on production costs, due to cows entering in the productive age earlier (NÚÑEZ-DOMINGUEZ et al., 1991, PEROTTO et al., 2006). Pötter et al. (1998) listed the following benefits of reducing AFC: economic return is faster, increase the quantity of live weight per hectare and the number of females in reproduction stage. An important aspect is that the AFC information is regularly obtained by the farmers, i.e. there is no need of additional costs for its obtainment. Heifers rebreeding (HR) is also an important reproductive trait because heifers that conceive in breeding season soon after the first calving allow a fast return on investments. Furthermore, this is a delicate stage of female life, since they require energy for maintenance, lactation and, in some cases, for growth or for recovering their body condition score (NAAB, 2004). HR is been considered in different breeding programs (RILEY et al., 2010; BOLIGON & ALBUQUERQUE, 2012), because heifers usually present low rebreeding rates (CORRÊA et al., 2001). AFC and HR are considered complex traits because they are most probably controlled by a large number of genes. Despite the economic importance of these reproductive traits, their selection is difficult because they 2 present low heritability, are expressed just in female (for which the selection intensity is lower than males) and late in female life. So, the genetic evaluations based just in phenotype and pedigree records provide expected breeding values (EBV) with low accuracy for these traits, even when considering information from correlated traits as, for example, scrotal circumference. Trying to overcome this, genome-wide association studies (GWAS) of reproductive traits have been performed (SASAKI et al., 2013) with the hope to detect genomic regions associated to the trait, and use this information as a tool to possibly increase genetic gain. GWAS consist basically in the detection of statistical associations between the trait of interest and any of the markers (usually Single Nucleotide Polymorphism - SNP) contained in high-density panels (Goddard & Hayes, 2009). There are many methods used to perform GWAS, one of these is the GBLUP (genomic BLUP), that was initially applied to genomic selection studies by multiple step procedures (CHRISTENSEN & LUND, 2010 and MIZSTAL et al., 2009). However, it has a limitation for using phenotypes or pseudo- phenotypes (e.g. deregressed EBVs) only from genotyped animals. Mizstal et al. (2009) proposed a single step procedure, named Single Step GBLUP (ssGBLUP), in which a relationship matrix (G) is estimated using markers information, in contrast with classic BLUP, in which a relationship matrix (A) is based on pedigree. The ssGBLUP combine information from genotyped and non-genotyped animals, which are represented by G and A matrices, respectively, in an H matrix (LEGARRA et al., 2009 and CHRISTENSEN & LUND, 2010). The ssGBLUP has been applied in studies with different species – dairy cattle (VANRADEN, 2012); chicken (CHEN et al., 2011) and swine (FORNI et al., 2011) – and had presented satisfactory results (CHRISTENSEN et al., 2012 and GAO et al., 2012). Since ssGBLUP method has improved the EBVs accuracy in genomic selection studies, it is expected that the inclusion of phenotypic information from non-genotyped animals could also increase the accuracy to predict the QTL regions in GWAS. Dikmen et al. (2013) used ssGBLUP to identify QTLs associated to rectal temperature in Hostein Cattle. Tiezzi et al. (2015) also used ssGBLUP methodology to perform GWAS for clinical mastitis in first parity of US Holstein cows. Despite the benefits of this method, it assumes equal variance 3 for all markers, i.e, all markers with the same weight, which is particularly limiting when dealing with traits that present markers with more pronounced effects than others. Trying to overcome this problem, Wang et al. (2012) proposed the “Weighted Single Step GBLUP” (wssGBLUP) method that allows to combine pedigree, phenotypic and genotypic information in a single step, weighting the effect of the SNPs according to their supposedly relevance for the trait of interest. In WssGBLUP it is not necessary to restrict information from non- genotyped animals or to include pseudo-phenotypes. This is a useful method for complex models, as non-linear, using multiple traits or, when there are many animals with available phenotype, but few genotyped, a common situation in real dataset. Wang et al. (2014) used WssGBLUP in a GWA study for body weight in broiler chickens. They concluded that each GWAS tested method presented some weakness, and their efficiency would depend on the number of animals with phenotype and genotype available, and the proportion of SNPs that explains the genetic variance. As ssGBLUP, the Bayesian methods were developed primarily for GS studies, but also have been adapted to GWAS. The Bayes A and Bayes B (MEUWISSEN et al., 2001) assume that variances of each marker can vary. Gianola et al. (2009) verified some statistical problems with these methods and suggested some modifications. The authors suggested for Bayes A to consider the linkage disequilibrium that exists between the markers and QTL(s), to reduce the priors influence, attributing the same variance to all markers and non-informative priors for scale and degrees of freedom parameters, and/or combine these two last options in a single procedure. For Bayes B, they suggested to attribute zero value to some SNP effects, instead of attributing zero value for their variances. These suggestions were implemented by Habier et al. (2011) that tried to reduce the scale parameter influence, considering the same variance for all SNPs (Bayes C), and treated a priori the scale parameter as unknown (Bayes D). However, these authors suggested another modification in those original models that was the π estimation, the proportion of SNPs with zero-effect, since π is assumed known in Bayes A and Bayes B. They argued that it is better to estimate π from the data because π influences the shrinkage of SNP effects. This change formulated the Bayes Cπ and Bayes Dπ methods. 4 Van den Berg et al. (2013), working with simulated data, compared the Bayes C and Bayes Cπ methods and concluded that under certain situations, as a trait with low heritability, low number of records and/or many QTLs affecting the trait, the Bayes C can achieve better results than Bayes Cπ, because in this situations the π is not well estimated, so it is better to fix it. With the crescent number of GWA studies in literature, several new QTLs are detected for many traits in different species. In QTLdb database (HU et al., 2013) 16,919 QTLs were reported for many traits in cattle until Feb. 13, 2015. However, few of these QTLs have been validated or reproduced by other studies. Therefore, even though there is evidence of important genomic regions being identified, these results should be carefully interpreted (FRAGOMENI et al., 2014). Known genes associated with the concern trait in the region could be a good evidence of a true QTL for real data set (SAATCHI et al., 2014). Despite advances in genomic and GWA studies, QTL detection with real data is still challenging. However, with simulated data these regions can be previously known, providing a more reliable manner to evaluate the usefulness of a method. For instance, Wang et al. (2012) used simulated data in GWA studies to assess the accuracy, precision and computation time for different methods in genomic predictions and QTL detection. Vitezica et al. (2011) also performed simulation studies to assess the predictive ability of single and multiple-step methods. Van den Berg et al. (2013), in a simulation study, evaluated the influence of different factors to detect true QTL using two methodologies. Thus, using simulated data to test the power of a method to detect true QTLs with known positions seems to be a helpful strategy. The present study was developed with the following specific objectives: 1) to assess if additional phenotypic information from non-genotyped animals affect QTL mapping of AFC in Nelore cattle; 2) to evaluate, by simulation, if this additional phenotypic information contributes to detect QTLs more precisely for a low heritable complex trait, with few available genotypes; 3) to search for genomic regions (QTLs) associated with heifer rebreeding (HR) in Nelore cattle. References 5 AGUILAR, I.; MISZTAL, I.; JOHNSON, D. L.; LEGARRA, A.; TSURUTA, S.; LAWLOR, T. J. Hot topic: A unified approach to utilize phenotypic, full pedigree, and genomic information for genetic evaluation of Holstein final score. Journal of Dairy Science, v. 93, n. 2, p. 743-752, 2010. BOLIGON, A. A.; ALBUQUERQUE L.G. Genetic parameters and relationships between heifers rebreeding and hip height in Nellore cattle. Revista Brasileira de Zootecnia, v.41, n.3, p.598-602, 2012 CAMPOS, W. E.; SAUERESSIG, M. G.; SATURNINO, H. M.; de SOUZA, B. M.; AMARAL, T. B.; FERREIRA, F. Manejo reprodutivo em gado de corte. Planaltina, DF: Embrapa Cerrados: 2005. 54 p. (Embrapa. Documentos, 134). CHEN, C. Y.; MISZTAL, I.; AGUILAR, I.; LEGARRA, A.; MUIR, W. M. Effect of different genomic relationship matrices on accuracy and scale. Journal of Animal Science, v. 89, p. 2673-2679, 2011. CHRISTENSEN, O. F.; LUND, M. S. Genomic prediction when some animals are not genotyped. Genetics Selection Evolution, v. 42, n. 2, p. 1-8, 2010. CHRISTENSEN, O. F.; MADSEN, P.; NIELSEN, B.; OSTERSEN, T.; SU, G. Single-step methods for genomic evaluation in pigs. Animal, v. 6, n. 10, p. 1565-1571, 2012. CORRÊA, E.S.; EUCLIDES FILHO, K.; ALVES, R.G.O., et al. Desempenho reprodutivo em um sistema de produção de gado de corte. Campo Grande: EMBRAPA/CNPGC, 2001. 33p. (Boletim de Pesquisa, 13). DIKMEN, S.; COLE, J. B.; NULL, D. J.; HANSEN, P. J. Genome-wide association mapping for identification of quantitative trait loci for rectal temperature during heat stress in Holstein cattle. PLoS ONE, v. 8, n. 7, 2013. DOYLE, S. P.; GOLDEN, B. L.; GREEN, R. D.; BRINKS, J. S. Additive genetic parameter estimates for heifer pregnancy and subsequent reproduction in Angus females. Journal of Animal Science, v.78, p.2091-2098, 2000. FORNI, S.; AGUILAR, I.; MISZTAL, I. Different genomic relationship matrices for single-step analysis using phenotypic, pedigree and genomic information. Genetics Selection Evolution, v. 43, n. 1, p. 1-7, 2011. FRAGOMENI, B. O.; MISZTAL, I.; LOURENCO, D. L.; AGUILAR, I.; OKIMOTO, R.; MUIR, W. M. Changes in variance explained by top SNP windows over generations for three traits in broiler chicken. Frontiers in Genetics, v. 5, n. 332, p. 1-7, 2014. 6 GAO, H.; CHRISTENSEN, O. F.; MADSEN, P.; NIELSEN, U. S.; ZHANG, Y.; LUND, M. S.; SU, G. Comparison on genomic predictions using three GBLUP methods and two single-step blending methods in the Nordic Holstein population. Genetics Selection Evolution, v. 44, n. 8, p. 1-8, 2012. GIANOLA, D.; DE LOS CAMPOS, G.; HILL, W. G.; MANFREDI, E.; FERNANDO, R. Additive Genetic Variability and the Bayesian Alphabet. Genetics, v. 183, p. 347-363, 2009. GUARINI, A. R.; NEVES, H. H. R.; SCHENKEL, F. S.; CARVALHEIRO, R.; OLIVEIRA, J. A.; QUEIROZ, S. A. Genetic relationship among reproductive traits in Nellore cattle. Animal, p. 1-6, 2014. HABIER, D.; FERNANDO, R.L.; KIZILKAYA, K.; GARRICK, D. J. Extension of the Bayesian alphabet for genomic selection. BMC Bioinformatics, v. 12, n. 186, p. 1-12, 2011. HU, Z. H.; PARK, C. A.; WU, X. L.; REECY, J. M. Animal QTLdb: an improved database tool for livestock animal QTL/association data dissemination in the post-genome era. Nucleic Acids Research, v. 41, p. 871-879, 2013. LEGARRA, A.; AGUILAR, I.; MISZTAL, I. A relationship matrix including full pedigree and genomic information. Journal of Dairy Science, v. 92, n. 9, p. 4656-4663, 2009. MEUWISSEN, T. H. E.; HAYES, B. J.; GODDARD, M. E. Prediction of total genetic value using genome-wide dense marker maps. Genetics, v. 157, n. 4, p. 1819-1829, 2001. MISZTAL, I.; LEGARRA, A.; AGUILAR, I. Computing procedures for genetic evaluation including phenotypic, full pedigree, and genomic information. Journal of Dairy Science, v. 92, n. 9, p. 4648-4655, 2009. NATIONAL ASSOCIATION OF ANIMAL BREEDERS. Rebreeding First-Calf Heifers. United States, 2004. 3 p. (Angus Beef Bulletin). NUÑEZ-DOMINGUEZ, R.; CUNDIFF, L. V.; DICKERSON, G. E.; GREGORY, K. E.; KOCH, R. M. Lifetime production of beef heifers calving first at two vs. three years of age. Journal of Animal Science, v. 69, p. 3467-3479, 1991. PEROTTO, D.; MIYAGI, A. P.; SOUZA, J. C.; MOLETTA, J. L.; FREITAS, J. A. Estudos de características reprodutivas de animais da raça Canchim, criados a pasto, no estado do Paraná, Brasil. Archives of Veterinary Science, v. 11, n. 2, p. 1-6, 2006. 7 PÖTTER, L.; LOBATO, J. F. P.; MIELITZ NETTO, C. G. A. Produtividade de um modelo de produção para novilhas de corte primíparas aos dois, três e quatro anos de idade. Revista Brasileira de Zootecnia, v. 27, n. 3, p. 613-619, 1998. RILEY, D. G.; CHASE, C. C. JR.; COLEMAN, S. W.; OLSON, T. A.; RANDEL, R. D. Evaluation of tropically-adapted straightbred and crossbred beef cattle: Heifer age and size at first conception and characteristics of their first calves. Journal of Animal Science. v. 88, n.10, p. 3173-82, 2010. SAATCHI, M.; SCHNABEL, R. D.; TAYLOR, J. F.; GARRICK, D. J. Large-effect pleiotropic or closely linked QTL segregate within and across ten US cattle breeds. BMC Genomics, v. 15, n. 442, p. 1-16, 2014. STRANGER, B.E.; STAHL, E. A.; RAJ, T. Progress and Promise of Genome- Wide Association Studies for Human Complex Trait Genetics. Genetics, v. 187, n. 2, p. 367-383, 2011. TIEZZI, F.; PARKER-GADDIS, K. L.; COLE, J. B.; CLAY, J. S.; MALTECCA, C. A Genome-Wide Association Study for Clinical Mastitis in First Parity US Holstein Cows Using Single-Step Approach and Genomic Matrix Re-Weighting Procedure. PLoS ONE, v. 10, n. 2, p. 1-15, 2015. USDA, Livestock and poultry: world markets and trade. Oct, 2014. Available in: . Accessed in Nov. 2014. VAN DEN BERG, I.; FRITZ, S.; BOICHARD, D. QTL fine mapping with Bayes C (π): a simulation study. Genetics Selection Evolution, v. 45, n. 19, 2013. VANRADEN, P. M. Avoiding bias from genomic pre-selection in converting daughter information across countries. In INTERNATIONAL BULL EVALUATION SERVICE, 2., 2012, Verona, Italy. Proceedings... Verona, n. 45, 2012, p. 1-5. VITEZICA, Z. G.; AGUILAR, I.; MISZTAL, I.; LEGARRA, A. Bias in genomic predictions for populations under selection. Genetics Research (Cambridge), v.93, p.357-366, 2011. WANG, H.; MISZTAL, I.; AGUILAR, I.; LEGARRA, A.; FERNANDO, R. L.; VITEZICA, Z.; OKIMOTO, R.; WING, T.; HAWKEN, R.; MUIR, W. M. Genome-wide association mapping including phenotypes from relatives without genotypes in a single-step (ssGWAS) for 6-week body weight in broiler chickens. Frontiers in Genetics, v. 5, n. 134, 2014. 8 WANG, H.; MISZTAL, I.; AGUILAR, I.; LEGARRA, A.; MUIR, W. M. Genome- wide association mapping including phenotypes from relatives without genotypes. Genetics Research, v. 94, n. 2 p. 73-83, 2012. 9 CHAPTER 2. IMPORTANCE OF INCORPORATING PHENOTYPIC INFORMATION OF NON-GENOTYPED ANIMALS IN A GENOME-WIDE ASSOCIATION STUDY OF A COMPLEX TRAIT ABSTRACT: Despite the advances in methods applied to genome-wide association studies (GWAS), QTL detection with real data is still challenging, especially in unfavorable scenarios, like those posed by low heritable complex traits, with few available genotyped animals, as is currently the case for age at first calving (AFC) in Nelore cattle. The weighted single-step GBLUP (WssGBLUP) method for GWAS combines simultaneously different sources of information to estimate marker effects, allowing using phenotypic information even from non-genotyped animals. It is not clear, however, in which extend the inclusion of phenotypes from relatives without genotypes can contribute to detect QTLs more precisely. The objectives of this study were: 1) to assess if additional phenotypic information from non-genotyped animals affect QTL mapping of AFC in Nelore cattle; 2) to evaluate, by simulation, if this additional phenotypic information contributes to detect QTLs more precisely for a low heritable complex trait, with few available genotyped animals. The real and simulated data analyses were performed using the WssGBLUP and Bayes C methods. For the WssGBLUP method, two scenarios were tested, including (SI) or not (SII) the phenotypic information from non-genotyped animals, using different weights for the markers. The Bayes C method was tested just under SII. The number of records and the structure of the simulated population mimicked the real data. The results of the real data analyses showed that the use of phenotypes from non-genotyped animals, in addition to phenotypes from genotyped animals, affected the SNP effect estimates and the mapping of the most important genomic regions. Besides some coincidence, the most important genomic regions indicated by the analyses, either considering or ignoring phenotypes from non-genotyped animals, were not the same. The results from simulate data indicated that the inclusion of all available phenotypic information, even from non-genotyped animals, can provide improvement in the detection of QTLs in GWAS of low heritable complex traits. For populations presenting low levels of linkage disequilibrium, the improvement is expected to be milder. 10 Keywords: age at first calving, beef cattle, Nelore, GWAS, simulation, WssGBLUP 11 2.1. Introduction Age at first calving (AFC) is closely associated with the success of beef cattle production systems and is still a bottleneck in many Bos indicus breeds as, for example, in Nelore cattle. Genetic gain of AFC is challenging mainly because the trait is sex limited and presents low heritability, especially if the heifers are lately exposed for reproduction (ATENCIO, 2000). Trying to overcome this, genome-wide association studies (GWAS) of AFC have been performed (SASAKI et al., 2013) with the hope to detect genomic regions associated to the trait, and use this information as a tool to possibly increase genetic gain. There are different methods available to perform GWAS as, for example, Bayesian multiple-regression methods (FERNANDO & GARRICK, 2013). Besides allowing including all markers simultaneously in the analyses, the Bayesian methods allow to consider different prior distributions for the marker effects, what is advocated to be appealing especially for traits affected by large quantitative trait loci (QTL). Van den BERG et al. (2013), in a simulation study, tested the adequacy of using Bayesian multiple-regression methods for QTL mapping, and observed feasibility of Bayes C (HABIER et al., 2011) to detect large QTLs, especially for traits with medium to high heritability and with a large number of records. The single-step GBLUP method (ssGBLUP) (MIZSTAL et al., 2009, CHRISTENSEN & LUND, 2010), originally used in plant and animal breeding for the prediction of breeding values, has also been used in GWAS (WANG et al., 2012, WANG et al., 2014). Despite the benefits of ssGBLUP compared to the multiple-step approach (AGUILAR et al., 2010, CHEN et al., 2011, FORNI et al., 2011), it is based on an infinitesimal model, i.e. a model in which all markers are assumed to have an small effect. This model is particularly limiting when working with traits that present markers with more pronounced effect than others. Given this limitation, Wang et al. (2012) proposed the "Weighted Single- Step GBLUP" (WssGBLUP) method, which allows combining pedigree, phenotypic and genotypic information in a single-step, weighting the marker 12 effects according to their importance for the trait of interest. Furthermore, in WssGBLUP, as in ssGBLUP, it is not necessary to restrict the use of phenotypes just from genotyped animals or to use pseudo-phenotypes like in multiple-step methods. Therefore, the method is expected to be advantageous when working with complex models, such as non-linear or multiple-trait; or in datasets with many phenotypes, but few available genotyped animals, a common situation in real datasets (LOURENCO et al., 2014). Despite the advances in methods applied to GWAS, QTL detection with real data is still challenging, especially in unfavorable scenarios, like those posed by low heritable complex traits, with few available genotyped animals, as is currently the case for AFC in Nelore cattle. It is not clear in which extend the inclusion of phenotypes from relatives without genotypes can contribute to detect QTLs more precisely in this unfavorable scenario. Aiming to elucidate this, the present study was developed with the following specific objectives: 1) to assess if additional phenotypic information from non-genotyped animals affect QTL mapping of AFC in Nelore cattle; 2) to evaluate, by simulation, if this additional phenotypic information contributes to detect QTLs more precisely for a low heritable complex trait, with few available genotyped animals. 2.2. Material and Methods 2.2.1. Real data 2.2.1.1. Phenotypes and pedigree Phenotypic information of AFC (in days) was obtained from the DeltaGen® Nelore breeding program database. This database contains records from animals raised under tropical pasture in different commercial herds located in the southeast, west and central regions of Brazil. During the mating season, usually in the rainy period, the heifers are either artificially inseminated or naturally mated. In general, the first mating of heifers occurs at about 26 months of age, although some herds expose heifers earlier, at around 14-18 months of age. Two scenarios were established to perform the GWAS for AFC. The first 13 including all cows with observed AFC (SI), totaling 43,482 females, and another considering only cows with observed AFC and available genotype (SII), totaling 1,813 females. The pedigree file considered in both scenarios had 237,602 animals. 2.2.1.2. Genotyped animals A total of 1,829 Nelore females were genotyped with the Illumina Bovine HD panel (Illumina®, San Diego, CA, USA). These females were born between 2007 and 2009 and were chosen among contemporary heifers that were exposed in a precocious age (14-18 months), pertaining to two main farms of the commercial breeding program. The genotyped heifers that did not get pregnant in the anticipated mating season were also exposed in the regular mating season, at about 26 months of age. The heifers that did not conceived a calf after being exposed to these two mating seasons, were not considered in this study. The minimum, maximum and average values of observed AFC of the genotyped females were equal to 698, 1,200 and 1,050 days, respectively, and the proportion of heifers considered precocious (AFC < 900 days) was equal to 28.2%. The quality control (QC) of the genotypes were performed considering the following exclusion criteria: i) for the single nucleotide polymorphism markers (SNPs): from non-autosomal regions, mapped to the same position, presenting a p-value for the Hardy-Weinberg equilibrium test lower than 10-5, with GC score lower than 0.7, call rate lower than 0.95, minor allele frequency lower than 0.02, and highly correlated (r2 > 0.995) with other SNPs within a sliding window containing 100 consecutive markers (only one marker from each pair of highly correlated SNPs was removed); ii) for samples: with call rate lower than 0.9 and being a replicated genotype. The remaining number of SNPs and samples after QC were 333,878 and 1,813, respectively. 2.2.2. Simulated data The number of phenotypes and genotypes available in the simulation study mimics the information available in the real dataset. The phenotypes and genotypes were simulated using the software QMSim v.1.10 (SARGOLZAEI & SCHENKEL, 2013). Ten replicates were performed of a hypothetical trait with 14 heritability equal to 0.14 and phenotypic variance of 1.0. This heritability was chosen for being the estimate obtained with the real data. 2.2.2.1. Simulated population Two different populations were simulated, differing in the number of historical generations, to produce different levels of linkage disequilibrium (LD). In both cases, a first historical population was generated from generation zero to 1,000, with constant size of 1,000 animals. Then, from generation 1,001, a second historical population was simulated with a gradual reduction in the number of animals (from 1,000 to 200), producing a "bottleneck effect" and, consequently, genetic drift and LD. This gradual reduction was made over 1,020 or 2,020 generations, resulting in a population with a lower (LLD) or a higher (HLD) level of LD, respectively. The remaining parameters of the simulation process were the same for both populations (LLD and HLD). The 200 animals (from which 100 are females) of the latest generation of the historical population were selected for the expanded population. The size of this population mimicked the effective size of the real population (BRITO et al., 2013). For the expanded population, it was considered a mating system based on random union of gametes, absence of selection and an exponential growth of the number of females (every generation generates twice the number of females of the previous generation), with a replacement rate of 100% every generation and an average of five products per dam. Six generations of expansion were generated, resulting in 16,000 animals from which 8,000 were females. After the expansion process, 240 males and 6,000 females of the last generation were randomly selected, comprising the founder animals of the selection population. This population was span over 15 generations. At each generation of the selection population, the selected males and females were randomly mated, generating a single product with equal probability of being a male or a female. The replacement rate of sires and dams was kept constant at 20% and the selection criterion was the expected breeding values. Through the 15 generations, a total of 90,000 animals were generated, mimicking the number of available phenotypes in the real dataset, being 50% females whose phenotypic data were used for GWAS. The genotypes of 2,000 females of the last three generations (13, 14 and 15) were randomly selected for GWAS. 15 2.2.2.2. Simulated genome and phenotype It was assumed that QTLs explain 100% of genetic variance. The simulated genome presented a total length of 2,333 cM, 735,293 markers and 7,000 QTLs. The number of markers and QTLs per chromosome ranged from 46,495 to 12,931 and from 121 to 438, respectively, and they were both randomly distributed over 29 autosomes. All markers were bi-allelic, mimicking SNPs present in bovine commercial panels. For QTL, the amount of alleles for loci ranged (randomly) from 2 to 4. Minor allele frequencies (MAF) were assumed equal for markers and QTL alleles. QTL allele effects were sampled from a gamma distribution with a shape parameter equal to 0.4 (HAYES & GODDARD, 2001). It was considered a mutation rate of 10-4 for markers and QTLs, in the historical populations. A total of 335,000 markers (with MAF greater or equal than 0.02) and 1,000 QTL were randomly selected from the last generation of the historical population to generate genotypic data for the selection population. The average distance between adjacent markers was 0.007 cM. Although the “genetic architecture” of reproduction traits is unknown, the simulation parameters used in this study aimed to mimic a polygenic complex trait affected by many genes of small effects and few with more pronounced effects. The phenotypes of the animals were comprised by the sum of the QTL effects plus an error term sampled from a normal distribution with zero mean and variance equal to 0.86. 2.2.3. Linkage disequilibrium analyses The LD between any two loci within the same chromosome was assessed using the r2 measure (HILL & ROBERTSON, 1968) using the software SnppldHD (SARGOLZAEI, University of Guelph, Canada). The pattern of LD decay of the real and simulated data was compared aiming to evaluate the adequacy of the simulation process (LLD population) to mimic the real scenario. 2.2.4. Statistical analyses For real and simulated data two statistical methods were used, namely WssGBLUP (WANG et al., 2012) and Bayes C (HABIER et al., 2011). Although 16 the comparison of methods was not part of our objective, Bayes C was also used for being indicated as a feasible method for QTL detection (VAN DEN BERG et al., 2013). For the real data, the WssGBLUP method adopted was based on the following model: y=Xβ+Zaa+e, where � is a vector of phenotypic observations (AFC, in days), X is an incidence matrix relating phenotypes to fixed effects, β is a vector of fixed effects, including contemporary group (formed by concatenating the classes for herd, year, season and weaning and yearling management groups) and age of dam as covariate (linear and quadratic effects), Za is an incidence matrix that relates animals to phenotypes, a is the vector of additive direct genetic effects, and e is the vector of residuals. The covariance between a and e was assumed equal to zero and their variances were considered, respectively, equal to Hσa 2 and Iσe 2, where σa 2 and σe 2 are the additive direct and the residual variances, respectively, H is the matrix which combines pedigree and genomic information (AGUILAR et al., 2010), and I is an identity matrix. As previously stated, the above model was applied considering all the available phenotypes in SI and only the phenotypes of the genotyped cows in SII. For a fairly comparison between scenarios, as fixed effects would not be as good estimated in SII as in SI, the phenotypic observations in SII were pre-adjusted for the fixed effect estimates obtained in a previous “regular” BLUP analysis considering all the available phenotypes. Thus, the β vector of the GWAS in SII contained just an intercept, and the y vector the pre-adjusted AFC. The “regular” BLUP analysis assumed the same model described above with the exception that the variance of a was assumed equal to Aσa 2, where A is the regular numerator relationship matrix. For the simulated data, the same WssGBLUP model was used, except that just an intercept was considered as fixed effect. As no contemporary group effect was simulated, there was no necessity to pre-adjust the phenotypes in the scenario SII, for the simulated data. . For both scenarios (SI and SII) and datasets (real and simulated), the solutions of SNP effects (û) were obtained such as in VanRaden et al. (2009) and Stranden & Garrick (2009): u�=DZ’[ZDZ’]-1a�g, where D is a diagonal matrix with weights for SNPs, Z is a matrix relating genotypes of each locus, and a�g is 17 the vector of predicted breeding values of genotyped animals. The D matrix and the SNP and animal effects were iteratively recomputed following the method described by Wang et al. (2012) as “ssGBLUP/S2”. Three iterations (w1, w2 and w3) were performed for each scenario, resulting in an increasing shrinkage from w1 to w3 for the SNPs explaining lower variance. Actually, w1 represented the situation where the same weight (w1=1) was given to all SNPs, and served as the basis to initially calculate the proportion of variance explained by each SNP, that was subsequently used to calculate the weights of the next iteration. Therefore, SIw1 refers to scenario SI using w1 weight, SIw2 to scenario SI using w2 weight, and so forth. An equivalent notation was adopted for scenario SII. The BLUPF90 family programs (MISZTAL et al., 2012) were used to run the analyses. Bayes C analyses were based on the following model (HABIER et al., 2011): y=1μ+∑ gibiδi n i=1 +e, where the vectors y and e are the same as previously described, 1 is a vector of ones, � is the overall mean, gi is a vector with the genotypes of the animals for the ith SNP, bi is the allele substitution effect of the ith SNP, �� is an indicator variable (0, 1) sampled from a binomial distribution with parameters n and π, where n is the number of SNPs and π is the fraction of SNPs not included in the model. For the real dataset, a prior beta distribution with parameters α=108 and β=1010 was assumed for π so that, in practice, π was almost fixed to 0.99 (LEGARRA et al., 2014). For the simulated datasets, the π value assumed two possible values, almost fixed to 0.99, as used in real data, or almost fixed to 0.999, assuming a prior beta distribution with parameters α=108 and β=1011. This strategy was adopted, instead of letting π be estimated from the data (Bayes Cπ), because it was found to give better results in QTL detection (VAN DEN BERG et al., 2013). A scaled inverse chi- squared prior distribution was assumed for the variance of SNP effects (σ2g) and the residual variance (σ2e). The SNP genotypes were coded as the number of copies of one of the SNP alleles, i.e., 0, 1 or 2. For Bayes C, only the scenario SII was evaluated because the adopted implementation did not allow the inclusion of phenotypes from non-genotyped animals. For the same reason stated in scenario SII of WssGBLUP (real data), the vector y in Bayes C contained the pre-adjusted phenotypes of AFC. The Bayes C analyses were performed using the Markov 18 chain Monte Carlo algorithm implemented in the software GS3 (LEGARRA et al., 2014), running a single chain with 550,000 iterations, a burn-in period of 50,000 and a thin interval of 50 iterations. 2.2.5 Criteria for comparison For the real data, GWAS results were compared based on SNP effects estimates and on proportions of variance explained by SNPs within consecutive 1Mb windows. A total of 2,525 windows were considered, spanning over all the autosomes, with an average density of 132±44 SNPs per window. The top 10 windows, that captured the highest proportion of variance explained by the markers, were identified for the different scenarios and methods. The QTLdb database (HU et al., 2013) was consulted to assess the existence of previouslly described QTLs, related to reproduction traits, overlapping the top 10 windows and their neighbouring 1Mb windows (the next to the left and to the right), using the UMD3.1 bovine genome assembly (ZIMIN et al., 2009) as the reference map. The presence of a previous described QTL in the QTLdb database was double checked in the original listed references. Recent scientific publications were also inspected by hand, with the same purpose. When the study used a different reference assembly, the alignment with UMD3.1 was made using the SNPchiMp database (NICOLAZZI et al., 2014), prior to inferring. For the simulated data, the following statistics were calculated: number of QTLs explaining 1% or more of the genetic variance (topQTL); number of top 1Mb marker windows, accounting for the highest proportion of variance explained by the markers (topMRKw) - this number was enforced to be equal to topQTL; sum of the percentages of genetic variances explained by all the topQTL (Pvar_topQTL) and top marker windows (Pvar_topMRKw); highest percentage of genetic variance explained by a topQTL (Pvar_1stQTL) and a top marker window (Pvar_1stMRKw); and the number of true QTLs (NtrueQTL) i.e., the number of topQTL identified by a topMRKw distant no more than 1Mb from a true QTL position. 2.3. Results and Discussion 19 2.3.1. Linkage disequilibrium The LD decay presented by the simulated population with higher level of LD (HLD) was closer to the pattern presented by taurine breeds (PÉREZ O’BRIEN et al., 2014), and was a consequence of the high number of historical generations (from 1,001 to 3,020) simulated to produce the “bottleneck effect”. For the LLD simulated population, the LD was slightly higher than the LD presented by the real data, but it was similar to the levels of LD observed by Espigolan et al. (2013) in another set of Nelore animals, indicating adequacy of the LLD population to represent real indicine populations. Figure 1. Linkage disequilibrium (LD) decay of real data and two simulated populations (HLD and LLD) 2.3.2. Simulated data The simulation process resulted in an average number of QTLs explaining 1% or more of the genetic variance (topQTL) equal to 16.7, for the HLD population (Table 1). Together, the topQTL explained 29.74% of the genetic variance, with the most important QTL explaining, on average, 5.07%. As presented in Table 1, the top marker windows were able to capture a smaller 20 proportion of the genetic variance, in comparison with the topQTL, except for the analyses with a stronger shrinkage (using π=0.999 and w3). These results are partially related to the imperfect LD between markers and QTLs, and also due to false positive signals being captured by the markers. As expected, a stronger shrinkage of the SNPs explaining lower variance (from w1 to w3 or, equivalently, from π=0.99 to π=0.999) resulted in a higher variance being accounted by the top marker windows. The method that produced the stronger shrinkage was the Bayes C (π=0.999), were the genetic variance captured by the SNPs was distributed among approximately 334 markers (Table 1). In general, irrespective of the method and scenario, the GWAS presented poor ability to map the topQTLs (maximum 2.9 true QTLs – Table 1), indicating that the available phenotypic and genotypic information from the simulated population was not large enough to map the QTLs properly. These results are in agreement with Van den Berg et al. (2013), who also observed in a simulation study that the QTLs were poorly identified in low heritable traits, with large number of QTLs and few records. Table 1. Average (SD), over ten replicates, of marker and QTL related statistics, for the Bayes C and weighted single step GBLUP (WssGBLUP) analyses of the simulated population presenting high level of LD. aGWAS using (SI) or ignoring (SII) phenotypic information of non-genotyped animals, applying different weights (w1, w2 and w3) for the SNP effects in the WssGBLUP method. And using π=0.99 and π=0.999 in the Bayes C method. bGenetic variance (%) explained by the sum of variances accounted by top marker windows (Pvar_topMRKw) and by the NtopQTLs (Pvar_topQTL). cMaximum genetic variance (%) explained by a top marker window (Pvar_1stMRKw) and by a topQTL (Pvar_1stQTL). dNumber of true QTLs explaining 1% or more of the genetic variance (NtopQTL), and number of NtopQTLs identified by a top marker window distant no more than 1Mb from a NtopQTL (NtrueQTL). Method/Scenarioa Pvar_topMRKw(%)b Pvar_1stMRKw(%)c NtrueQTLd Bayes C (π=0.99) 7.78 (0.99) 1.19 (0.63) 2.20 (1.23) Bayes C (π=0.999) 46.13 (14.13) 16.45 (17.86) 1.90 (1.29) WssGBLUP/SIw1 5.16 (0.74) 0.54 (0.17) 2.90 (1.66) WssGBLUP/SIw2 17.03 (3.11) 2.29 (0.94) 2.90 (1.79) WssGBLUP/SIw3 38.07 (6.27) 7.30 (3.30) 1.30 (1.16) WssGBLUP/SIIw1 5.30 (0.63) 0.59 (0.19) 2.00 (1.49) WssGBLUP/SIIw2 18.76 (1.70) 2.65 (0.93) 1.90 (1.29) WssGBLUP/SIIw3 39.31 (7.47) 7.82 (5.80) 0.90 (0.74) True values Pvar_topQTL(%)b Pvar_1stQTL (%)c NtopQTLd 29.74 (4.88) 5.07 (2.36) 16.7 (2.83) 21 As can be seen in Table 1, analyses using different weights (w1 and w2) presented similar ability to map the QTLs. Also, a higher percentage of variance being explained by the top marker windows, as a result of a stronger shrinkage, was not necessarily associated to a better performance in mapping the QTLs. For the HLD population, the WssGBLUP analyses using w1 and w2 outperformed the analysis using w3 in terms of true QTL detected. Using a higher π also did not improve the ability of Bayes C to detect QTLs (Table 1). The use of phenotypic information from non-genotyped animals contributed to detect QTLs more precisely, for the HLD population. The analysis with the best result considering additional phenotypic information was able to detect, on average, 17.4% (2.9 out of 16.7) of the topQTL, whereas the WssGBLUP analyses ignoring this information were able to detect at best 12.0% of the topQTL. Within the methods, a relatively large variability over replicates was observed for the NtrueQTL parameter. For instance, in the HLD population, the minimum and maximum NtrueQTL values presented by the WssGBLUP/SIw2 analysis were, respectively, equal to 0 out of 14 topQTL and 6 out of 19 topQTL. These values were equal to 1/18 and 4/16 for Bayes C (π=0.99) and equal to 0/16 and 4/17 for WssGBLUP/SIIw1, respectively. Besides this variability, the number of times a method beat the others in QTL detection reinforced the superiority of WssGBLUP/SIw2 over the other methods, for the HLD population. It won 5 times, drew 3 and lost 2 out of the 10 replicates, based on the NtrueQTL criteria (data not shown). For the LLD population (Table 2), the importance of using additional phenotypic information became less evident. This could be explained because the genotyped animals are in some extend related with the animals presenting just available phenotype. For presenting low level of LD, the size of identic by descendent haplotype segments shared is reduced and, as a consequence, the inclusion of additional phenotypic information does not help the detection of true QTLs. For the LLD population the number of QTLs explaining 1% or more of the genetic variance (topQTL) was equal to 15.4 (Table 2). Together, the topQTL 22 explained 24.32% of the genetic variance, with the most important QTL explaining, on average, 3.19%. In comparison with HLD, all methods and scenarios from the LLD population presented poorer ability to detect true QTLs. In addition, the LLD population presented a different pattern of the Bayes C results, compared to HLD population. The average NtrueQTL value of the analysis assuming π=0.999 was higher than the value presented by the analysis using π=0.99. These results reinforce that the adequacy of a method to detect QTLs is dependent on the pattern of LD presented by the population of interest. As observed in Table 2, although slightly better on average, the “SI” analyses presented similar NtrueQTL values than their counterpart “SII” analyses. The analysis with the best result, WssGBLUP/SIw2, considering additional phenotypic information was able to detect, on average, 13.6% (2.1 out of 15.4) of the topQTL. Table 2. Average (SD), over ten replicates, of marker and QTL related statistics, for the Bayes C and weighted single step GBLUP (WssGBLUP) analyses of the simulated population presenting low level of LD. aGWAS using (SI) or ignoring (SII) phenotypic information of non-genotyped animals, applying different weights (w1, w2 and w3) for the SNP effects in the WssGBLUP method. And using π=0.99 and π=0.999 in the Bayes C method. bGenetic variance (%) explained by the sum of variances accounted by top marker windows (Pvar_topMRKw) and by the NtopQTLs (Pvar_topQTL). cMaximum genetic variance (%) explained by a top marker window (Pvar_1stMRKw) and by a topQTL (Pvar_1stQTL). dNumber of true QTLs explaining 1% or more of the genetic variance (NtopQTL), and number of NtopQTLs identified by a top marker window distant no more than 1Mb from a NtopQTL (NtrueQTL). 2.3.2. Real data Method/Scenarioa Pvar_topMRKw(%)b Pvar_1stMRKw(%)c NtrueQTLd Bayes C (π=0.99) 3.95 (0.58) 0.42 (0.10) 1.40 (0.97) Bayes C (π=0.999) 36.71 (12.35) 12.39 (11.93) 1.80 (1.62) WssGBLUP/SIw1 2.73 (0.53) 0.25 (0.08) 2.00 (1.70) WssGBLUP/SIw2 10.58 (1.59) 1.45 (0.37) 2.10 (1.29) WssGBLUP/SIw3 26.80 (4.93) 4.51 (1.31) 1.20 (0.63) WssGBLUP/SIIw1 2.82 (0.35) 0.26 (0.04) 1.90 (1.20) WssGBLUP/SIIw2 12.05 (1.72) 1.69 (0.38) 1.90 (1.10) WssGBLUP/SIIw3 31.60 (3.76) 5.66 (2.49) 1.10 (0.88) True values Pvar_topQTL(%)b Pvar_1stQTL (%)c topQTLd 24.32 (4.92) 3.19 (0.61) 15.4 (2.32) 23 As observed in simulated data, the Figures 2A to 2C indicate that adding phenotypic information of non-genotyped animals influenced the SNP effect estimates also in real data. Although correlated, the WssGBLUP SNP effect estimates were not the same among scenarios SI and SII. The influence of using or not the phenotypes from non-genotyped animals became stronger as the shrinkage on SNPs explaining lower variance was higher. For the SNPs with more pronounced effects, the WssGBLUP SNP effect estimates from SI were more similar to those from SII in w1 (Figure 2A) than in w2 (Figure 2B) and w3 (Figure 2C). This occurred because within each scenario the weights of further iterations were calculated as a function of the SNP effects estimated in the previous iteration (WANG et al., 2012). Consequently, the differences became larger in each every subsequent iteration. As observed by Wang et al. (2012) and in our simulation study, the total genetic variance was distributed for a smaller number of SNPs as the weights and subsequently the animal and SNP effects were recomputed in the WssGBLUP method. As a consequence, solutions from w1 (Figures 2A, 2D and 2G) were more similar to those expected from a trait following an infinitesimal model, and solutions from w3 (Figures 2C, 2F and 2I) were closer to what would be expected for an oligogenic trait. Unfortunately, when analysing real data, the WssGBLUP method does not allow inferring which weight led to better estimates. Wang et al. (2012) recognized that their proposed method calculates the weights in a suboptimal manner and suggested some refinements using, for example, Bayesian methods. 24 Figure 2. Scatter plots of SNP effect estimates for GWA analyses of age at first calving in Nelore cattle, using Bayes C and weighted single step GBLUP method, considering (SI) or ignoring (SII) phenotypes from non-genotyped animals, under different weights (w1, w2 and w3) for the SNPs 25 As stressed by Gianola et al. (2009), unrevealing the “genetic architecture” underlying the traits is not trivial even with the adoption of Bayesian methods, as the priors dominate the inference when we have few observations and want to estimate simultaneously a huge number of SNP effects. The simulation results from our study reinforce this statement, where the prior value assumed for π determined the shrinkage on SNP effect estimates. Letting the π be estimated from the data seems not to be a good strategy for GWAS purpose (VAN DEN BERG et al., 2013). The Bayes C solutions were more similar to the WssGBLUP solutions of scenario SII (Figures 2G to 2I), where the same phenotypic information was used, than those from the scenario SI (Figures 2D to 2F). For instance, the adherence between Bayes C and WssGBLUP SIIw1 solutions was very high (Figure 2G). This result suggests that, for the present study, considering or not additional phenotypic information influenced more SNP effect estimates than difference in the method applied for GWAS. This evidence was not so strong in the simulated data, mainly for the LLD population. The sigmoidal shape of Figures 2E, 2F, 2H and 2I indicated that the weights w2 and w3, applied in the WssGBLUP method, resulted in more SNPs with their effects shrunk to zero than Bayes C with π≈0.99. As also observed in the simulation study, a stronger shrinkage (w2 and w3) on SNPs explaining lower variance redistributed the variance and resulted in larger variances being explained by the most important regions (Table 3). In SI, the proportion of variance explained by the top 10 windows was equal to 1.83%, 7.18% and 23.65% for w1, w2 and w3, respectively. These figures were equal to 1.84%, 8.07% and 34.70% in SII. For the Bayes C method, the top 10 windows explained 2.65% of the genetic variance. Although the most important genomic regions indicated by the different analyses were not the same, some coincidence was observed. Considering all the seven analyses, 31 different windows were indicated as top 10 (Table 3). The most common windows between the analyses were: 4 of chromosome 7 (7/4), indicated as top 10 in 6 analyses; 5/115, 8/107 and 18/5, indicated as top 10 in 5 analyses; 17/50 and 23/27, indicated as top 10 in 4 out of 7 analyses. WssGBLUP/SIIw2 and Bayes C were the only analyses which indicated the six most common regions within the top 10 windows. 26 Based on the number of QTLs previously described in the literature, the WssGBLUP/SIw2 outperformed the other methods, presenting 5 out of the 10 top windows with a previous described QTL, within the window or in a neighbouring window, associated to reproduction traits. This result is an indicative that the use of phenotypic information from non-genotyped animals contributed to detect QTLs more precisely, also for the real data. 27 Table 3. Top 10 windows explaining the highest proportion of variance of age at first calving in GWAS using Bayes C and weighted single step GBLUP method considering (SI) or ignoring (SII) phenotypes from non-genotyped animals, under different weights (w1, w2 and w3) for the SNPs. 28 Although the simulated results indicated poor ability to detect QTLs, evidence was found in the literature that important regions were identified by the real data analyses, as presented in Table 4. Table 4. Mainly reproductive traits reported by different authors for different breeds in the most important regions (Ch, Chromossome and wi, Window) detected by Bayes C (π=0.99) and wSSGBLUP (w1, w2 and w3) methods. Ch/wi Traits reported in the region/Breed Authors 6/16 Maternal stillbirth later/ Nordic cattle Hoglund et al. (2012) 6/108 Calving Difficult (Direct)/ Nordic cattle Hoglund et al. (2012) 25/33 Calving easy traits/Holstein Sahana et al. (2011) 18/5 Calving index/Nordic cattle Hoglund et al. (2012) 3/12 Calving easy traits/Holstein Sahana et al. (2011) 6/44 Calving easy (direct)/Maine-Anjou Saatchi et al. (2014) 4/77 Calving easy (direct)/Angus Body condition score*/Brahman and an Australian crossbreed herd Saatchi et al. (2014) Porto-Neto et al. (2014) 14/63 Age at first calving/Hanwoo Hyeong et al. (2014) 1/27 Age at first calving/Hanwoo Hyeong et al. (2014) 10/92 Heifer pregnancy/Angus Peters et al. (2013) 22/2 Semen volume/Holstein Calving traits/Holstein Non-return of daughters at 56 days after insemination/Holstein Druet et al. (2009) Thomasen et al. (2008) Scrooten et al. (2004) *Body condition score was listed for indirectly affecting females reproductive performance (HERD & SPROTT, 1996). There was also evidence of important genes being located. For instance, Waters et al. (2014), studying the transcriptional regulation process in the uterine endometrium of beef heifers, under a special dietary supplementation, detected many genes, which were clustered in similar functional groups. One of these genes is the TSPO, which was grouped in a cluster described as “potentially co-regulated genes with reproductive function and steroid 29 biosynthesis”. Considering the importance of steroids to reproductive female regulation process, this could be an important candidate gene associated with AFC in Nelore cattle. The TSPO gene is located within the window 115 of BTA5 (UMD3.1), which was identified as a top window by all the analyses, except those using w1 (Table 3). 2.4. Conclusions Additional phenotypic information from non-genotyped animals influences the GWAS results. Besides some coincidence, the most important genomic regions for AFC in Nelore cattle, indicated by the analyses considering or ignoring phenotypes from non-genotyped animals, were not the same. The results from simulated data indicate that the inclusion of all available phenotypic information, even from non-genotyped animals, can provide a small improvement in the detection of QTLs in GWAS of low heritable complex traits. For populations presenting low levels of LD, the improvement is expected to be milder. 2.5. References AGUILAR, I.; MISZTAL, I.; JOHNSON, D. L.; LEGARRA, A.; TSURUTA, S.; LAWLOR, T. J. Hot topic: A unified approach to utilize phenotypic, full pedigree, and genomic information for genetic evaluation of Holstein final score. Journal of Dairy Science, v. 93, n. 2, p. 743-752, 2010. ATENCIO, A.M. Predicción genética de la fertilidad em la hembra cebu. In: Congresso Internacional de Transferência Tecnológica Agropecuária, 20, 2000, Assunción. Anales... Assunción: Congresso Internacional de Transferência Tecnológica Agropecuária, 2000. p.29-42. BRITO, F.V.; SARGOLZAEI, M.; BRACCINI NETO, J.; COBUCI, J.A.; PIMENTEL, C.M.; BARCELLOS, J.; SCHENKEL, F.S. In-depth pedigree analysis in a large Brazilian Nellore herd. Genetics and Molecular Research, v.12(4), p.5758-5765, 2013. 30 CHEN, C. Y.; MISZTAL, I.; AGUILAR, I.; LEGARRA, A.; MUIR, W. M. Effect of different genomic relationship matrices on accuracy and scale. Journal of Animal Science, v. 89, p. 2673-2679, 2011. CHRISTENSEN, O. F.; LUND, M. S. Genomic prediction when some animals are not genotyped. Genetics Selection Evolution, v. 42, n. 2, p. 1-8, 2010. DRUET, T.; FRITZ, S.; SELLEM, E.; BASSO, B.; GÉRARD, O.; SALAS- CORTES, L.; HUMBLOT, P.; DRUART, X.; EGGEN, A. Estimation of genetic parameters and genome scan for 15 semen characteristics traits of Holstein bulls. Journal of Animal Breeding and Genetics, v. 126, p. 269-277, 2009. ESPIGOLAN, R.; BALDI, F.; BOLIGON, A. A.; SOUZA, F. R. P.; GORDO, D. G. M.; TONUSSI, R. L.; CARDOSO, D. F.; OLIVEIRA, H. N.; TONHATI, H.; SARGOLZAEI, M.; SCHENKEL, F. S.; CARVALHEIRO, R.; FERRO, J. A.; ALBUQUERQUE, L. G. STUDY OF WHOLE GENOME LINKAGE DISEQUILIBRIUM IN NELLORE CATTLE. BMC Genomics, v. 14, p. 1-8, 2013. FERNANDO, R. L.; GARRICK, D. Bayesian methods applied to GWAS. In GONDRO, C.; van der WERF, J.; HAYES, B. GENOME-WIDE ASSOCIATION STUDIES AND GENOMIC PREDICTIONS. Springer: Humana Press. 2013, chap.10, p. 237-274. FORNI, S.; AGUILAR, I.; MISZTAL, I. Different genomic relationship matrices for single-step analysis using phenotypic, pedigree and genomic information. Genetics Selection Evolution, v. 43, n. 1, p. 1-7, 2011. GIANOLA, D.; DE LOS CAMPOS, G.; HILL, W. G.; MANFREDI, E.; FERNANDO, R. Additive Genetic Variability and the Bayesian Alphabet. Genetics, v. 183, p. 347-363, 2009. HABIER, D.; FERNANDO, R. L.; KIZILKAYA, K.; GARRICK, D. J. Extension of the Bayesian alphabet for genomic selection. BMC Bioinformatics, v. 12, n. 186, p. 1-12, 2011. HAYES, B. J.; GODDARD, M. E. The distribution of the effects of genes affecting quantitative traits in livestock. Genetics Selection Evolution, v. 33, p. 209-229, 2001. HERD, D. B.; SPROTT, L. R. Body condition, nutrition and reproduction of beef cows. College Station, Texas: Texas Agricultural Extension Service. 1996, 12 p. HILL, W. G.; ROBERTSON, A. Linkage Disequilibrium in Finite Populations. Theoretical and Applied Genetics, v. 38, n. 6, p. 226-231, 1968. 31 HÖGLUND, J. K.; GULDBRANDTSEN, B.; LUND, M. S.; SAHANA, G. Analyzes of genome-wide association follow-up study for calving traits in dairy cattle. BMC Genetics, v. 13, n. 71, 2012. HYEONG, K.-E.; IQBAL, A.; KIM, J.J. Genome Wide Association Study on Age at First Calving Using High Density Single Nucleotide Polymorphism Chips in Hanwoo (Bos taurus coreanae). Asian-Australasian Journal of Animal Sciences. v. 27, n.10, 2014. HU, Z. H.; PARK, C. A.; WU, X. L.; REECY, J. M. Animal QTLdb: an improved database tool for livestock animal QTL/association data dissemination in the post-genome era. Nucleic Acids Research, v. 41, p. 871-879, 2013. LEGARRA, A.; RICARD, A.; FILANGI, O. GS3. Manual. France. [S.I.: s.n.], 2014, 24p. Available in: . Accessed in Oct. 2014. LOURENCO, D. A. L.; MISZTAL, I.; TSURUTA, S., AGUILAR, I.; EZRA, E.; RON, M.; SHIRAK, A.; WELLER, J. I. Methods for genomic evaluation of a relatively small genotyped dairy population and effect of genotyped cow information in multiparity analyses. Journal of Dairy Science, v. 97, p. 1742- 1752, 2014. MISZTAL, I. BLUPF90 - a flexible mixed model program in Fortran 90. Manual. Animal and Dairy Science, University of Georgia, Athens, GA, USA. [S.I.: s.n.], 2012, 25p. Available in: . Accessed in Feb. 2014. MEUWISSEN, T. H. E.; HAYES, B. J.; GODDARD, M. E. Prediction of total genetic value using genome-wide dense marker maps. Genetics, v. 157, n. 4, p. 1819-1829, 2001. NICOLAZZI, E. L.; PICCIOLINI, M.; STROZZI, F.; SCHNABEL, R. D.; LAWLEY, C.; PIRANI, A.; BREW, F.; STELLA, A. SNPchiMp: A database to disentangle the SNPchip jungle in bovine livestock. BMC Genomics, v. 15, n. 123, p. 1-6, 2014. PÉREZ O’BRIEN, A. M.; MÉSZÁROS, G.; UTSUNOMIYA, Y. T.; SONSTEGARD, T. S.; GARCIA, J. F.; TASSELL, C. P. V.; CARVALHEIRO, R.; SILVA, M. V. B.; SÖLKNER, J. Linkage disequilibrium levels in Bos indicus and Bos taurus cattle using medium and high density SNP chip data and different 32 minor allele frequency distributions. Livestock Science, v. 166, p. 121-132, 2014. PETERS, S. O.; KIZILKAYA, K.; GARRICK, D. J.; FERNANDO, R. L.; REECY, J. M.; WEABER, R. L.; SILVER, G. A.; THOMAS, M. G. Heritability and Bayesian genome-wide association study of first service conception and pregnancy in Brangus heifers. Journal of Animal Science, v. 91, p. 605-612, 2013. PORTO-NETO, L. R.; REVERTER, A.; PRAYAGA, K.C.; CHAN, E. K. F.; JOHNSTON, D. J.; HAWKEN, R. J.; FORDYCE, G.; GARCIA, J. F.; SONSTEGARD, T. S.; BOLORMAA, S.; GODDARD, M. E.; BURROW, H. M.; HENSHALL, J. M.; LEHNERT, S. A.; BARENDSE, W. The Genetic Architecture of Climatic Adaptation of Tropical Cattle. PLoS ONE, v. 9, n. 11, p. 1-22, 2014. SAATCHI, M.; SCHNABEL, R. D.; TAYLOR, J. F.; GARRICK, D. J. Large-effect pleiotropic or closely linked QTL segregate within and across ten US cattle breeds. BMC Genomics, v. 15, n. 442, p. 1-16, 2014. SAHANA, G.; GULDBRANDTSEN, B.; LUND, M. S. Genome-wide association study for calving traits in Danish and Swedish Holstein cattle. Journal of Dairy Science, v. 94, p. 479-486, 2011. SARGOLZAEI, M.; SCHENKEL, F. S. QMSim: User’s Guide [S.I.: s.n.], 2013, p.77. SASAKI, S.; IBI, T.; IKEDA, S.; SUGIMOTO, Y. A genome-wide association study reveals a quantitative trait locus for age at first calving in delta/notch-like EGF repeat containing on chromosome 2 in Japanese Black cattle. Animal Genetics, v. 45, n. 2, p. 285-287, 2013. SCHROOTEN, C.; BINK, M. C. A. M.; BOVENHUIS, H. Whole genome scan to detect chromosomal regions affecting multiple traits in dairy cattle. Journal of Dairy Science, v. 87, p. 3550-3560, 2004. STRANDEN, I.; GARRICK, D. J. Technical note: Derivation of equivalent computing algorithms for genomic predictions and reliabilities of animal merit. Journal of Dairy Science, v. 92, n. 6, p. 2971-2975, 2009. THOMASEN, J. R.; GULDBRANDTSEN, B.; SØRENSEN, P.; THOMSEN, B.; LUND, M. S. Quantitative trait loci affecting calving traits in Danish Holstein cattle. Journal of Dairy Science, v. 91, p. 2098-2105, 2008. 33 VAN DEN BERG, I.; FRITZ, S.; BOICHARD, D. QTL fine mapping with Bayes C (π): a simulation study. Genetics Selection Evolution, v. 45, n. 19, 2013. VANRADEN, P. M.; VAN TASSELL, C. P.; WIGGANS, G. R.; SONSTEGARD, T. S.; SCHNABEL, R. D.; TAYLOR, J. F.; SCHENKEL, F. S. Invited review: Reliability of genomic predictions for North American Holstein bulls. Journal of Dairy Science, v. 92, n. 1, p. 16-24, 2009. WANG, H.; MISZTAL, I.; AGUILAR, I., LEGARRA, A.; MUIR, W. M. Genome- wide association mapping including phenotypes from relatives without genotypes. Genetics Research, v. 94, n. 2 p. 73-83, 2012. WANG, H.; MISZTAL, I.; AGUILAR, I.; LEGARRA, A.; FERNANDO, R. L.; VITEZICA, Z.; OKIMOTO, R.; WING, T.; HAWKEN, R.; MUIR, W. M. Genome- wide association mapping including phenotypes from relatives without genotypes in a single-step (ssGWAS) for 6-week body weight in broiler chickens. Frontiers in Genetics, v. 5, n. 134, 2014. WATERS, S. M.; COYNE, G. S.; KENNY, D. A.; MORRIS, D. G. Effect of dietary n-3 polyunsaturated fatty acids on transcription factor regulation in the bovine endometrium. Molecular Biology Reports, v. 41, p. 2745-2755, 2014. ZIMIN, A. V.; DELCHER, A. L.; FLOREA, L.; KELLEY, D. R.; SCHATZ, M. C.; PUIU, D.; HANRAHAN, F.; PERTEA, G.; VAN TASSELL, C. P.; SONSTEGARD, T. S.; MARÇAIS, G.; ROBERTS, M.; SUBRAMANIAN, P.; YORKE, J. A.; SALZBERG, S. L. A whole-genome assembly of the domestic cow, Bos taurus. Genome Biology, v. 10, i. 4, 2009. 34 CHAPTER 3. GENOME WIDE ASSOCIATION STUDY OF HEIFER REBREEDING IN NELORE CATTLE ABSTRACT: A genome wide association study (GWAS) was conducted aiming to find important genomic regions (QTLs) associated with heifer rebreeding (HR) in Nelore cattle. The data consisted of 142,878 HR records and 2,923 high-density (777K SNPs) genotypes. The GWAS was performed using the weighted single step GBLUP method, under three different weightings for the SNPs. Genetic variances explained by consecutive no-overlapping 1Mb SNP- windows were calculated, for each of the three GWAS. The top 10 windows, that captured the highest proportion of variance explained by the markers, were further investigated. To reinforce their suggestive importance, scientific papers were consulted to inspect if any identified top window overlapped with a previously described QTL related to bovine reproductive traits. A possible association of the genes, within the top 10 windows, with HR were also investigated. From the three performed GWAS, a total of 21 different windows were detected, that harbored 13 QTLs previously reported in literature for different reproductive (or related) traits. The top 10 marker-windows contained 182 annotated genes. Some of them were associated with pathways of reproductive traits. Evidence was found in the present study that important candidate genes affecting HR in Nelore cattle were identified. Keywords: beef cattle, genes, GWAS, QTL, reproductive trait 35 3.1 Introduction Reproductive traits are closely associated with the success of beef cattle productive system (CAMPOS et al., 2005). Some of the reproduction traits are included in breeding programs aiming to produce a more profitable herd. Heifer rebreeding (HR) is an important reproductive trait because heifers that conceive in breeding season soon after the first calving allow a fast return on investments. Furthermore, this is a delicate stage of female life, because it requires energy to its maintenance, lactation, and in some cases to growth (NAAB, 2004) or to recover its body condition score. HR is been considered in different breeding programs (RILEY et al., 2010; BOLIGON & ALBUQUERQUE, 2012), because heifers usually present low rebreeding rates (CORRÊA et al., 2001). Despite the economic importance of HR, its selection is difficult because HR presents low heritability (DOYLE et al., 2000; GUARINI et al., 2014), it is expressed late in life and just in females (which present lower selection intensity compared to males). So, the genetic evaluations based just in phenotype and pedigree records usually provide low accurate genetic proofs, even when correlated traits are considered in the analyses. Using genomic information from high-density Single Nucleotide Polymorphism (SNP) panels have been applied to perform genome wide association studies (GWAS). A methodology that has been used to GWAS is the “Weighted Single Step GBLUP (WssGBLUP)” (WANG et al., 2012), that allows to combine pedigree, phenotypes and genotypes information in a single step, in which different weights can be attributed to each marker. With the crescent number of GWA studies in literature, several new QTLs are detected for many traits in different species. In QTLdb database (HU et al., 2013) 16,919 QTLs were reported for many traits in cattle until Feb. 13, 2015. However, few of these QTLs have been validated or reproduced by other studies. Therefore, even though there is evidence of important genomic regions 36 being identified, these results should be carefully interpreted (FRAGOMENI et al., 2014). Known genes associated with the trait of interest in the region could be a good evidence of a true QTL (SAATCHI et al., 2014). The purpose of this study was to find important genomic regions (QTLs) associated with heifer rebreeding (HR) in Nelore cattle. 3.2. Material and Methods 3.2.1. Phenotypic and pedigree information Phenotypic information of HR was obtained from Alliance Nelore database. After editing, the data contained HR records of 142,878 Nelore heifers, born from 1984 to 2010, and raised in 188 different commercial farms located in the southeast, west and central regions of Brazil and in Paraguay. The feeding system adopted by these farms consisted basically of tropical pastures, mineral salt and water ad libitum. In the dry season, the animals usually receive mineral supplementation. During the mating season, which lasts approximately 90 days and occurs usually in the rainy period, the heifers are either artificially inseminated or naturally mated. A total of 44.1% of the first-calves (62,996) and 44.3% of the second-calves (35,539) were born from AI-sired heifers. In general, the first mating of heifers occurs at about 26 months of age, although some herds expose heifers earlier, at around 14-18 months of age, in an anticipated mating season. In the data used, 46.6% (66,643) of the heifers were earlier exposed, from which 21.4% (14,287) conceived. Those that were exposed and did not conceive earlier had a second chance at about 26 months of age, in the regular mating season, and all the heifers that did not get pregnant at this period were culled, including those that were exposed for the first time. HR was defined as success (1) or failure (0) for heifers that calved or not, respectively, since they had previously produced the first calf. In the editing process, were excluded heifers presenting age at first and second calving lower or greater than 21 and 40 months, and 32 and 53 months, respectively, and 37 calving interval lower than 11 months. Contemporary groups (CG) were defined concatenating information of herd, year and season of birth, and weaning and yearling management groups of the heifers. CG with less than five heifers and without variability for HR, i.e. composed by animals with the same categorical response, were excluded from the data. After editing, 78,389 (54.9%) and 64,489 (45.1%) heifers presented success and failure to re-conceive, respectively. The 142,878 heifers were daughters of 2,391 different sires and 108,440 dams. The pedigree file contained 223,195 animals distributed over five generations. 3.2.2. Genomic information A total of 2,925 Illumina Bovine HD genotypes (Illumina®, San Diego, CA, USA) were used. Genotypes came from 2,212 Nelore heifers, from 12 different herds, and from 713 Nelore sires that had on average 73.6 progeny evaluated for HR. The genotyped heifers and sires were born from 2002 to 2009 and 1965 to 2006, respectively. The quality control (QC) of the genotypes were performed considering the following exclusion criteria: i) for the single nucleotide polymorphism markers (SNPs): from non-autosomal regions, mapped to the same position, presenting a p-value for the Hardy-Weinberg equilibrium test lower than 10-5, with GC score lower than 0.15, call rate lower than 0.95, and minor allele frequency lower than 0.02; ii) for samples: with call rate lower than 0.9 and being a replicate. The remaining number of SNPs and samples after QC were 409,376 and 2,923, respectively. 3.2.3. Statistical analysis The SNP marker effects were estimated using the weighted single-step GBLUP method (WssGBLUP) proposed by Wang et al. (2012). This method was chosen for allowing combining pedigree, phenotypic and genomic information in a single-step, weighting the marker effects according to their importance for the trait of interest. The WssGBLUP is claimed to be 38 advantageous when working with datasets with many phenotypes, but few available genotypes (WANG et al., 2012, 2014), as is the case in the present study. To run the analysis, the BLUPF90 family programs (MISZTAL et al., 2012) were used. Firstly, predicted breeding values were obtained based on the following threshold animal model (GIANOLA & FOULLEY, 1983): eaZXy a ++β= , where y is a vector of underlying liabilities for HR, β is a vector of fixed effects of CG, a is a vector of random additive direct genetic effects (breeding values), X and Za are incidence matrices relating elements in β and a to y, respectively, and e is the vector of random residuals. The underlying liabilities for HR were defined as follows: HR = 0 if y ≤ t1; and HR = 1 if y > t1, where t1 is the threshold corresponding to the discontinuity in the observed scale of HR. The covariance between a and e was assumed equal to zero and their variances were considered, respectively, equal to 2 aHσ and 2 eIσ , where 2 aσ and 2 eσ are the additive direct and the residual variances, respectively, H is the matrix which combines pedigree and genomic information (AGUILAR et al., 2010), and I is an identity matrix. As the variable in the underlying distribution is not observable, the parameterization 2 eσ = 1 was adopted (SORENSEN & GIANOLA 2002). The parameter estimates of the threshold model were obtained under a Bayesian framework using the Gibbs sampling program THRGIBBS1F90 (TSURUTA & MISZTAL, 2006). The default prior distributions were assumed for the variance components and for the fixed and random effects. The Gibbs sampler was run in a single chain of 500,000 iterations, with a burn-in of 50,000 and a thin interval of 50 iterations, totalling 9000 posterior samples for each parameter being estimated. The posterior means of the samples were used as the parameter estimates. The convergence of Monte Carlo chains of 2 aσ and heritability were evaluated using the software postGibbsf90 (MISZTAL et al., 2012) and the R package BOA (SMITH, 2008). The solutions of SNP effects (û) were then obtained according to VanRaden et al. (2009) and Stranden & Garrick (2009) equation: [ ] g 1â'ZDZ'DZû −= , where D is a diagonal matrix with weights for SNP effects, Z 39 is a matrix relating genotypes of each locus, and gâ is the vector of predicted breeding values of genotyped animals. The D matrix and the SNP effects were iteratively recomputed following the method described by Wang et al. (2012) as “ssGBLUP/S1”. In the first iteration, the diagonal elements of D (di) were assumed equal to 1 (i.e. the same weight for all the markers). For the subsequent iterations, di was calculated as: di = ûi 2pi(1-pi), where ûi is the allele substitution effect of the ith marker, estimated from the previous iteration, and pi is the allele frequency of the second allele of the ith marker. Prior to re- computing û, the D matrix was normalized to enforce the total genetic variance to be constant across iterations. Three iterations were performed resulting in an increasing shrinkage from iteration 1 to 3 for the markers explaining lower variance and, as a consequence, in an increasing proportion of variance being explained by the remaining markers. According to Wang et al. (2014), 3 iterations are sufficient to reduce the noise of unimportant markers, i.e. to shrink their effects towards zero. 3.2.4. QTL mapping The attempt to identify quantitative trait locus (QTL) affecting HR was made based in the proportion of variance explained by SNPs within non- overlapping consecutive 1Mb windows. As the effect of a QTL may be spread over a number of neighbouring SNPs, due to linkage disequilibrium, the presence of a previous described QTL was also inspected in the neighbouring 1Mb windows of the top ones. A total of 2,523 windows were considered, spanning over all the autosomes, with an average density of 162±48 SNPs per window. The top 10 windows, that captured the highest proportion of variance explained by the markers, were assumed as important genomic regions. To reinforce their suggestive importance, scientific papers and the cattle QTLdb database (HU et al., 2013) were consulted to inspect if any identified top window overlapped with a previously described QTL related to bovine reproductive traits. The UMD3.1 bovine genome assembly (ZIMIN et al., 2009) was used as the reference to build the 1Mb windows and to consult the QTLdb database. 40 3.2.5. Candidate genes Annotated genes located within the top 10 1Mb windows, from the three iterations of WssGBLUP (Table 2), were further inspected. The list of genes was provided by the NCBI Map Viewer tool (www.ncbi.nlm.nih.gov/mapview/), using the annotation release 103 and the Bos taurus UMD 3.1 as the primary assembly. The Panther database v.9.0 (MI et al., 2013) was then used to find biological processes related to these genes. Finally, it was investigated if there was evidence that these processes were associated with reproductive traits. 3.3 Results and Discussion The estimated parameters, genetic additive variance and residual variance converged, as showed in supplementary material (Figure S1A and S1B). Their convergence was evaluated by the heritability pattern. Table S1 present some convergence statistics and Geweke test. The effective sample size and independent chain size for the additive variance were 149.3 and 110, respectively. All the statistics evidenced adequacy of the chain size. It was detected 21 different top 10 marker-windows using w1, w2 and w3 weights (Table 1). There were five common windows between w1 and w2 weights, three between w3 and w1 and between w3 and w2 (in bold). Eight regions presented some association with reproductive traits (one star) and six regions presented some association with sexual precocity traits (two stars). It was observed an increasing in the proportion of variance explained by the top 10 marker windows from w1 to w3. The sum of this proportion was equal to 1.53, 3.63 and 4.66 for w1, w2 and w3, respectively. It is important to emphasize that the applied algorithm to calculate the weights in the WssGBLUP method iterated just on SNP effects (S1). As observed by Wang et al. (2012), the S1 algorithm promotes a slower “thinning” than S2, which recalculates the SNP and animal effects in each iteration. This effect can be better observed in 41 Figure 1, which presents the Manhattan plots of the variance explained by SNP windows for the three iterations. Although it was observed a stronger shrinkage in each subsequent iteration for the SNPs explaining lower variance, there were no clear peaks, even increasing the weights. This result indicates the polygenic nature of HR and can also be associated to the lack of sufficient information in the data to properly identify the important genomic regions affecting HR. Table 1. Top 10 windows explaining the highest proportion of variance of heifer rebreeding in GWAS using weighted single step GBLUP method using three iterations (w1, w2 and w3). w1 w2 w3 Rank Ch/Wi1 Pvar2 Ch/Wi Pvar Ch/Wi Pvar 1st 17/050 0.250 17/050 0.630 17/050 0.694 2nd 25/023* 0.179 11/075* 0.404 06/059 0.502 3rd 14/081 0.160 04/007 0.386 11/073* 0.486 4th 23/025** 0.151 06/079 0.354 26/043* 0.444 5th 20/004* 0.146 20/004* 0.328 20/004* 0.440 6th 06/079 0.138 26/043* 0.328 14/022** 0.434 7th 14/065* 0.136 29/014* 0.313 06/079 0.428 8th 10/089** 0.125 25/023* 0.308 19/033* 0.419 9th 05/012** 0.124 06/111 0.297 22/043** 0.407 10th 04/007 0.122 11/072 0.280 09/068** 0.407 1Ch=chromosome; Wi=1Mb window within the chromosome; 2pvar=proportion of variance explained by the SNPs within the window. The most common windows (ranked as top 10 in all analyses) are highlighted in bold. *Window (or neighbouring window) with a previous described QTL for bovine reproduction traits. **Window (or neighbouring window) with a previous described QTL for bovine sexual precocity traits. 42 Figure 1. Manhattan Plots of a genomic wide association study on heifer rebreeding with three iterations (w1, w2 and w3). 43 The studies that reported QTLs associated with reproductive traits within the top 10 windows are presented in Table 2. Three of the QTLs are directly associated with HR: 56-d nonreturn rate for Holstein heifers (HOGLUND et al., 2009), age at puberty in Brahman and Tropical composite herds (HAWKEN et al., 2012) and length in days of interval from first to last insemination for Holstein cows (HOGLUND et al., 2009). Many calving traits were reported in the same important regions detected for HR. This could be explained because most of the studies involved taurine breeds, known to present calving problems. Despite fat thickness is not a reproductive trait it is associated with the female body condition, which affects the ability to conceive. Table 2. The most important regions associated with heifer rebreeding that contained previously reported QTLs associated with reproductive traits for different breeds. Ch/wi1 Traits reported in the region/Breed Authors2 26/43 Direct first calving easy and maternal calving survival in first lactation/ Holstein Sahana et al. (2011) 20/4 Calving easy direct and calving easy maternal/ Angus, Hereford, Simmental and Red Angus Saatchi et al. (2012) 19/33 Calving traits/Holstein Gestation length/US Holstein and Italian Brown Sahana et al. (2011) Malteca et al. (2011) 9/68 Scrotal circumference/ Angus Fertility index/ Holstein McClure et al. (2010) Sahana et al. (2010) 23/25 Scrotal circumference/ Angus McClure et al. (2010) 25/23 Maternal calving survival/Holstein Sahana et al. (2011) 14/65 Fat thickness/ Angus McClure et al. (2010) 5/12 Scrotal circumference/ Angus McClure et al. (2010) 29/14 Scrotal circumference/ Angus McClure et al. (2010) 44 22/43 56-d nonreturn rate for heifers/ Holstein Hoglund et al. (2009) 14/22 Age at puberty3/Brahman and Tropical composite Hawken et al. (2012) 11/75 Calving easy maternal/Hereford Saatchi et al. (2014) 10/89 Length in days of interval from first to last insemination for cows/Holstein Hoglund et al. (2009) 1Ch/Window= Chromosome/Window. 2Studies that detected regions or significant SNPs overlapping or neighboring the top 10 windows, using the UMD3.1 bovine genome assembly. 3Defined as age in days at first observed corpus luteum after frequently ovarian ultrasound scans. In the 21 different top 10 marker-windows (and neighboring windows) associated with HR (Table 1) there were 271 annotated genes, including LOC genes according to NCBI Map Viewer. Excluding these LOC genes, remained 182 described genes (Supplementary material, Table S2). The software Panther (v.9.0) (MI et al., 2013) classified these genes by biological process in 12 different categories, as showed in Figure 2. Most of the genes were associated with metabolic and cellular processes (37.9% of the total). Figure 2. Biological processes of genes (and number of genes associated) located in the top 10 windows for heifer rebreeding. Two genes were detected associated with reproduction process, the Transforming growth factor beta 3 (TGFB3) and von Willebrand C domain- 45 containing protein (VWC2) genes. TGFB3 gene located in BTA10 at 88Mb is associated with polycystic ovary syndrome (PCOS) in human and cattle females (HATZIRODOS et al., 2011). Females with PCOS present the activity of this