UNIVERSIDADE ESTADUAL PAULISTA “JÚLIO DE MESQUITA FILHO”
FACULDADE DE MEDICINA

Marília Rodrigues Silva Passos

GENETIC DIVERSITY AND STRUCTURAL VARIANTS OF THE ALPHA BLOCK OF THE HUMAN MAJOR HISTOCOMPATIBILITY COMPLEX

Dissertation presented to the Faculdade de Medicina, Universidade Estadual Paulista “Júlio de Mesquita Filho”, Botucatu Campus, to obtain the degree of Master in Pathology.

Advisor: Prof. Dr. Erick da Cruz Castelli

Botucatu 2020

Keywords: Alpha block; HLA; MHC; NGS; Structural variants.

Catalog record: Passos, Marília Rodrigues Silva. Diversidade genética e variantes estruturais do bloco alfa do complexo principal de histocompatibilidade humano / Marília Rodrigues Silva Passos. - Botucatu, 2020. Dissertation (master's) - Universidade Estadual Paulista "Júlio de Mesquita Filho", Faculdade de Medicina de Botucatu. Advisor: Erick da Cruz Castelli. Capes: 21103003. 1. MHC class I genes. 2. HLA antigens. 3. Histocompatibility. 4. Genetic variation.
Catalog record prepared by the Seção Téc. Aquis. Tratamento da Inform., Divisão Técnica de Biblioteca e Documentação, Câmpus de Botucatu, UNESP. Librarian in charge: Rosangela Aparecida Lobo, CRB 8/7500.

DEDICATION

To my family: my son Miguel Schitini Silva Passos, my husband João Vitor Schitini Passos, and our two angels in heaven. To my parents, Valdicéia Alves R. Silva and Luiz Antonio da Silva Junior, my brother Luiz Leonardo R. Silva, my godmother Neusa Maria R. F. da Silva, and our Ana Luiza.
ACKNOWLEDGMENTS

I first thank God for life, for the opportunities, for His protection, and for the wonderful people who have crossed my path; for the chance to study and to try to be a better person in the world.

To my parents, Valdicéia and Luiz Antônio, or Val and Luizão, for the upbringing and affection I have received throughout my life. The hardship you overcame to raise us in a simple house, in a simple neighborhood in the countryside, taught me to value family and that growth takes effort and dedication. All your effort is reflected in the wonderful life you lead, and all the love you feel for my son Miguel is reflected in our little one's smile.

To the best friend, companion, and husband in the world, João Vitor, who was willing to look after our family on his own so that I could fulfill the dream of a master's degree. There were many nights when I did not come home, trips to conferences, worrying diagnoses, and stressful weeks, and you were there with me, never letting me give up. Our home, our son, and our whole family were cared for by you during these three years, the pillar you have always been for us. I love you.

To my “friend-professors” from my undergraduate years, Lígia Maria Micai Gomide, Messias Miranda Junior, Renato Paschoal Prado, Thiago Cândido, and Bruna Ventura de Campos Camargo, for the encouragement and help during the years we spent together, and even more after I decided to take on this new challenge. You were essential to my entry into the scientific world. My thanks forever!

To my advisor, Dr. Erick da Cruz Castelli, for accepting me into his team and introducing me to the wonderful world of Bioinformatics and Immunogenetics, always teaching with great patience and caring for all his students as if they were his own children. I am sure I will miss this greatly during our “break period”. I will always be grateful for everything.

To my colleagues at the Laboratório de Genética Molecular e Bioinformática, for all the moments of joy, sadness, concentration, and worry.
Everything we learned and taught together has made me a better person, taking in a little of what each of you carries in your heart. Thálitta and Andreia, thank you for the warm welcome and for the partnership we have always had; you are excellent educators and will always have a place in my family. Emiliana and Heloisa, thank you for all the rough patches we went through together, for all the concern, for everything we overcame, and for always welcoming me and celebrating my achievements as if they were your own. Juliana Lara, Michele, Iane, and Camila, you will always be my mentors, and I will never tire of being proud of you. Finally, to all my companions in this laboratory, thank you very much.

To the Programa de Pós-Graduação em Patologia of the Faculdade de Medicina de Botucatu (FMB-UNESP), for the opportunity to learn, and to the entire faculty for what they taught in the program's courses.

This study was financed in part by the Coordenação de Aperfeiçoamento de Pessoal de Nível Superior - Brasil (CAPES) - Finance Code 001.

Finally, to all the members of the examining committee, who made themselves available to be here and to contribute to my training and to the improvement of this work. To everyone who contributed directly or indirectly to this achievement. Thank you very much!

EPIGRAPH

“Success in life is measured not by the path you have conquered, but by the difficulties you overcame along the way.” Abraham Lincoln

SUMMARY

The human MHC has about 224 genes, most of which are related in some way to the immune system. The alpha block of the human MHC comprises important genes of the HLA complex, such as HLA-F, HLA-G, and HLA-A. HLA-F and HLA-G are non-classical HLA genes with low variability, tissue-restricted expression, and an immunomodulatory function, whereas HLA-A is a highly polymorphic gene that is important for antigen presentation. This segment also harbors structural variants, such as Alu elements and CNVs.
HLA-G and HLA-A have previously been described in terms of frequency and variability in small samples, and the diversity of HLA-F in worldwide populations is unknown. Here we apply a bioinformatics pipeline that optimizes the mapping of short reads to HLA genes and allows reliable haplotypes to be determined, in order to survey the genetic diversity of the MHC alpha block in 1323 Brazilians, together with three structural variants, AluyHF, AluyHG, and esv3608493, in a highly admixed sample from the largest Brazilian city, São Paulo. HLA-F is the most conserved gene, with 82.65% of chromosomes encoding the same HLA-F protein and only one additional frequent protein. HLA-A proved highly polymorphic, with 96 genomic sequences encoding 59 proteins, including rare ones. Strong LD was found between AluyHF and HLA-F, but not with specific HLA-F alleles. AluyHG, in contrast, was in strong LD with a specific HLA-G allele, G*01:01:01:01/UTR-1, and esv3608493 was associated only with the HLA-A*23 and A*24 alleles. We also detected frequent haplotypes in our sample, notably the close relationship between HLA-G and HLA-A alleles. For this reason, the evidence of balancing selection in the two genes may not be independent, since high heterozygosity at one locus would increase heterozygosity at the other. These data provide valuable information on the variability and structure of extended haplotypes in the alpha-block region of the HLA complex, which may contribute to clinical and evolutionary studies.

ABSTRACT

The human MHC has about 224 genes, most of which are related in some way to the immune system. The alpha block of the human MHC comprises important genes of the HLA complex, such as HLA-F, HLA-G, and HLA-A.
HLA-F and HLA-G are non-classical HLA genes with low variability, tissue-restricted expression, and an immunomodulatory function, whereas HLA-A is a highly polymorphic gene that is important for antigen presentation. This segment also harbors structural variants, such as Alu elements and CNVs. HLA-G and HLA-A have previously been described in terms of frequency and variability in small samples, and the diversity of HLA-F in worldwide populations is unknown. Here we apply a bioinformatics pipeline that optimizes the mapping of short reads to HLA genes and allows reliable haplotypes to be determined, in order to survey the genetic diversity of the MHC alpha block in 1323 Brazilians, together with three structural variants, AluyHF, AluyHG, and nsv823470, in a highly admixed sample from the largest Brazilian city, São Paulo. HLA-F is the most conserved gene, with 82.65% of chromosomes encoding the same HLA-F protein and only one additional frequent protein. The HLA-A gene proved highly polymorphic, with 96 genomic sequences encoding 59 proteins, including rare ones. Strong LD was found between AluyHF and HLA-F, but not with specific HLA-F alleles. AluyHG, in contrast, was in strong LD with a specific HLA-G allele, G*01:01:01:01/UTR-1, and nsv823470 was associated only with the HLA-A*23 and A*24 alleles. We also detected frequent haplotypes in our sample, notably the close relationship between HLA-G and HLA-A alleles. For this reason, the evidence of balancing selection in the two genes may not be independent, since high heterozygosity at one locus would increase heterozygosity at the other. These data provide valuable information on the variability and structure of extended haplotypes in the alpha-block region of the HLA complex, which may contribute to clinical and evolutionary studies.

List of illustrations

Chapter 1. Literature Review
Figure 1.
Human genomic map of the location of the MHC class I genes, divided into the segregation blocks Beta, Kappa, and Alpha. The image shows the HLA-A, -G, and -F genes (the subjects of the present study), as well as the MICA and MICB genes and the pseudogenes HLA-H and -J.
Figure 2. Gene composition of the alpha block of the human MHC and the structural variants present in it.

Chapter 2. Article
Figure 3. A map of the human MHC alpha block and the genes and structural variants evaluated in a large admixed Brazilian sample from São Paulo.
Figure 4. Linkage disequilibrium across the alpha-block, considering 638 biallelic SNPs, and the structural variants AluyHF, AluyHG, and nsv823470.

List of tables

Chapter 2. Article
Table 1. HLA-F CDS alleles in an admixed sample from São Paulo (city), Brazil.
Table 2. HLA-F coding alleles in an admixed sample from São Paulo (city), Brazil.
Table 3. HLA-G CDS alleles in an admixed sample from São Paulo (city), Brazil.
Table 4. HLA-G genomic alleles in an admixed sample from São Paulo (city), Brazil.
Table 5. HLA-A CDS alleles in an admixed sample from São Paulo (city), Brazil.
Table 6. HLA-A genomic alleles in an admixed sample from São Paulo (city), Brazil.
Table 7. HLA-F genomic alleles and the AluyHF haplotypes in an admixed sample from São Paulo (city), Brazil.
Table 8. HLA-G genomic alleles and the AluyHG haplotypes in an admixed sample from São Paulo (city), Brazil.

Supplementary material
Table S1.
List of variants observed across the HLA-F gene in 1323 Brazilians from São Paulo, and their frequencies.
Table S2. List of variants observed across the HLA-G gene in 1323 Brazilians from São Paulo, and their frequencies.
Table S3. List of variants observed across the HLA-A gene in 1323 Brazilians from São Paulo, and their frequencies.
Table S4. List of all HLA-G / HLA-A haplotypes observed in 1323 Brazilians from São Paulo and their frequencies.
Table S5. List of all HLA-G / HLA-A haplotypes observed in 1323 Brazilians from São Paulo and their frequencies.

List of abbreviations and acronyms

MHC Major Histocompatibility Complex
Mb Megabases (10^6 bases)
HLA Human Leukocyte Antigen
β Beta
α Alpha
Ig Immunoglobulin
MICA Major histocompatibility complex class I chain-related gene A
MICB Major histocompatibility complex class I chain-related gene B
NK Natural Killer
IPD Immuno Polymorphism Database
IMGT ImMunoGeneTics information system
3’UTR 3’ Untranslated region
5’UTR 5’ Untranslated region
HIV-1 Human immunodeficiency virus type 1
AIDS Acquired immunodeficiency syndrome
DM-1 Type 1 diabetes mellitus
i.e.
id est (that is)
SINE Short Interspersed Nuclear Element
bp (or pb) base pairs
RNA Ribonucleic acid
LINE Long Interspersed Nuclear Element
PCR Polymerase chain reaction
DNA Deoxyribonucleic acid
SNP Single nucleotide polymorphism
Kb Kilobases (10^3 bases)
CNV Copy Number Variation
LD Linkage disequilibrium
ILT2 Human immunoglobulin-like transcript 2
ILT4 Human immunoglobulin-like transcript 4
KIR3DL2 Killer cell immunoglobulin-like receptor 3DL2
KIR2DS4 Killer cell immunoglobulin-like receptor 2DS4
HCG4B HLA complex group 4B
WGS Whole Genome Sequencing
SABE Saúde, Bem Estar e Envelhecimento (Health, Well-being and Aging)
CEGH-CEL Centro de Estudos do Genoma Humano e Células-Tronco (Human Genome and Stem Cell Research Center)
EDTA Ethylenediaminetetraacetic acid
Hg38 Human genome version 38
BAM Binary Alignment Map
GATK Genome Analysis Toolkit
GVCF Genomic Variant Call Format
RBP ReadBackedPhasing
VCF Variant Call Format
MAF Minimum allele frequency
CDS Coding Sequence
STR Short tandem repeats
NGS Next generation sequencing
REDOME Registro Nacional de Doadores de Medula Óssea (Brazilian National Bone Marrow Donor Registry)
Chr6 Chromosome 6
SNPid SNP identifier

CONTENTS

CHAPTER 1
1. Literature Review
1.1. The human MHC
1.2. HLA-A
1.3. HLA-G
1.4. HLA-F
1.5.
Alu insertions
1.6. Selective forces
1.7. References
CHAPTER 2
2. Article
2.1. Introduction
2.2. Methods
2.3. Results
2.4. Discussion
SUPPLEMENTARY MATERIAL
Table S1. List of variants observed across the HLA-F gene in 1323 Brazilians from São Paulo, and their frequencies
Table S2. List of variants observed across the HLA-G gene in 1323 Brazilians from São Paulo, and their frequencies
Table S3. List of variants observed across the HLA-A gene in 1323 Brazilians from São Paulo, and their frequencies
Table S4.
List of all HLA-G/HLA-A haplotypes observed in 1323 Brazilians from São Paulo and their frequencies
Table S5. List of all HLA-G/HLA-A haplotypes observed in 1323 Brazilians from São Paulo and their frequencies
2.5. References
3. CONCLUSION

CHAPTER 1

1. LITERATURE REVIEW

1.1. The human MHC

The major histocompatibility complex (MHC) is located on the short arm of chromosome 6 (6p21.3), in a segment of approximately 3.5 Mb containing about 224 genes. These genes and the molecules they encode, as well as their chromosomal organization, are present in all jawed vertebrates (gnathostomes), in an arrangement that has been conserved for more than 500 million years (KLEIN; SATO, 1998). The MHC was discovered through studies of transplant rejection in mice, a discovery that earned George Davis Snell, Jean Baptiste Dausset, and Baruj Benacerraf the Nobel Prize in 1980 (RAJU, 1999). These researchers observed that skin grafts between isogenic mice were not rejected, whereas grafts between different strains were rejected by the recipient. In humans, the discovery came from three groups of patients: (i) those who had received multiple blood transfusions, (ii) patients who had undergone kidney transplants, and (iii) multiparous women.
Analysis of blood from these patients revealed, respectively, antibodies that recognized the donors' blood cells, the donors' renal cells, and paternal cells. For this reason, the human MHC was named Human Leukocyte Antigens (HLA) (TERASAKI, 2007; THORSBY, 2009).

Most MHC genes are related in some way to the immune system (GOLDBERG; RIZZO, 2015). Some genes of this complex are particularly variable, and their products are essential for the development of the immune response. Based on molecular structure and function, these genes were divided into three classes. Class I and class II genes are also known as HLA genes, since they act in antigen presentation to lymphocytes. These genes, particularly those whose encoded molecule is responsible for antigen presentation, are among the most polymorphic in the human genome (SHIINA et al., 2009).

The main class I genes directly involved in presenting antigens to T lymphocytes during the immune response are HLA-A, HLA-B, and HLA-C, also called classical class I genes (KLEIN; SATO, 2000). These genes are extremely polymorphic and expressed in most nucleated cells, together accounting for more than 12,000 known alleles (http://www.ebi.ac.uk/ipd/imgt/hla/stats.html). The most variable is HLA-B, with 4,828 alleles, followed by HLA-A (3,968 alleles) and HLA-C (3,579 alleles). These three genes encode the alpha chain of a membrane heterodimer responsible for antigen presentation (the HLA molecule). The alpha chains bind non-covalently to a β2-microglobulin molecule, encoded by a gene located outside the HLA complex, on chromosome 15.
The HLA class I molecule has five domains: two peptide-binding domains, α1 and α2, which together form the peptide-binding groove; an immunoglobulin (Ig) domain, α3; a transmembrane domain; and the cytoplasmic tail. Most nucleated human cells express classical HLA class I molecules on their surface, together with a repertoire of endogenous peptides bound to these molecules. CD8+ T lymphocytes must distinguish normal peptides from those foreign to the organism (AMIGORENA, 2016; BRUTKIEWICZ, 2016), and HLA class I molecules thus take part in the recognition of self and non-self. To this end, antigens are held in the peptide-binding groove by non-covalent interactions with the peptide sequence, including hydrogen bonds formed between conserved amino acids of the HLA molecule and amino acids of the peptide (located at the ends of the groove) (JENSEN, 2007).

The variability of classical HLA genes is directly related to the capacity to present different peptides, at both the individual and the population level. The antigen-presenting capacity of HLA molecules therefore plays an important role in viral infections. In addition, several HLA alleles are associated with a range of pathological contexts, such as autoimmune diseases, infectious diseases, neoplasms, and transplant outcome (OGAHARA et al., 1998; VANDIEDONCK et al., 2009).

The same genomic region as the classical genes harbors HLA-G, HLA-E, and HLA-F, known as non-classical HLA class I genes. The class I region of the human MHC also contains the MICA and MICB genes (Major histocompatibility complex class I chain-related genes A and B). These genes encode molecules that are structurally similar to HLA class I molecules. However, MICA and MICB neither bind β2-microglobulin nor present peptides to T lymphocytes.
Expression of these molecules is induced under cellular stress and occurs mainly on the surface of fibroblasts, intestinal epithelium, and tumors. They interact with the NKG2D receptor present on NK cells and T lymphocytes, modulating the activity of these cells. MICA, for example, is expressed in a range of tissues, mainly endocrine tissues, liver, gallbladder, lung, and testis (http://www.proteinatlas.org/ENSG00000204520-MICA/tissue).

The MHC is composed of highly polymorphic blocks generated by a series of duplications, deletions, and recombination events during mammalian and primate evolution. The region is divided into the Beta, Kappa, and Alpha blocks (Figure 1); the Alpha block comprises the HLA-A, HLA-G, and HLA-F genes, in addition to the structural variants discussed later.

Figure 1. Human genomic map of the location of the MHC class I genes, divided into the segregation blocks Beta, Kappa, and Alpha. The image shows the HLA-A, -G, and -F genes (the subjects of the present study), as well as the MICA and MICB genes and the pseudogenes HLA-H and -J.

1.2. HLA-A

According to the IPD-IMGT/HLA database, version 3.38.0, there are 18,691 recognized alleles for the HLA class I loci. HLA-A has 8 exons, the eighth also being referred to as the 3’ untranslated region (3’UTR), plus the 5’UTR, with 5,735 alleles encoding 3,629 different full-length proteins (IPD-IMGT/HLA). Most of this variability lies in exons 2 and 3, which encode the α1 and α2 domains that form the peptide-binding groove.
Because this database, which tracks and curates all known HLA gene sequences, generally requires characterization of only exons 2 and 3 for a new allele sequence to be submitted, and does not include the complete promoter and 3’UTR of most HLA genes, it is unclear whether the regulatory segments and the other exons of class I genes are as polymorphic as exons 2 and 3. Moreover, few studies have characterized the variability of these regulatory regions, even partially.

Associations of HLA-A alleles with autoimmune diseases, infectious diseases, and cancer are well documented in the literature, notably the association of HLA-A*74:01 with protection against HIV-1 infection and with slow progression to AIDS. The HLA-A*02 allele group has been linked to many pathologies, such as Alzheimer's disease and Hashimoto's thyroiditis; in addition, A*24:02 is associated with predisposition to type 1 diabetes mellitus (DM-1), and A*11:01, A*32:01, and A*66:01 with protection against this disease (CARVALHO DOS SANTOS et al., 2013; LIMA et al., 2019). Because they take part in the recognition of self and non-self, HLA class I genes are also related to transplant outcome, which depends on donor-recipient compatibility.

1.3. HLA-G

HLA-G belongs to the group of non-classical genes of the human MHC, which are conserved relative to the classical HLA genes. Although the molecule produced by this gene has the same structure as those of the classical genes, it is little involved in antigen presentation, and its main function is immunomodulation of T cells and Natural Killer (NK) cells.
Immunomodulation of NK cells at the maternal-fetal interface enables the growth of a semi-allogeneic fetus; however, HLA-G expression in tumor cells allows them to escape immune surveillance by inhibiting the action of T cells against the target cell. Under pathological conditions, HLA-G is expressed in colon cancer, breast cancer, renal cancer, and bladder carcinoma, among others, whereas under non-pathological conditions its expression is restricted to a few tissues, such as cornea, thymus, and placenta (CAROSELLA et al., 2008; DUNKER et al., 2008; CASTELLI et al., 2008; GERAGHTY et al., 1992; DONADI et al., 2011). Several studies have detailed its structure and variability in Brazil and in other populations, and there is no evidence that the 19 different proteins encoded by its 69 described alleles (IPD-IMGT/HLA) differ in function (CASTELLI et al., 2017; CASTELLI et al., 2010; CASTELLI et al, 2007; CASTELLI et al., 2010; CASTELLI et al., 2011; CASTELLI et al., 2014; LIMA et al., 2016; FELICIO et al., 2014; DONADI et al, 2011). However, the HLA-G promoter and 3’UTR include many transcriptional and post-transcriptional regulatory elements, and some polymorphic sites in these segments have been associated with different HLA-G expression profiles.
Considering that variability in the HLA-G coding region is low and that only a few full-length HLA-G proteins are actually detected worldwide (CASTELLI et al., 2014, 2017), and that the main biological functions of the molecule, including dimerization and interaction with leukocyte receptors, are apparently conserved for all frequent full-length HLA-G proteins, the magnitude of HLA-G production under normal or pathological conditions depends on the makeup of the regulatory regions (promoter and 3'UTR) and on the factors secreted in the microenvironment (CASTELLI et al., 2009; MANASTER et al., 2012; MOREAU; FLAJOLLET; CAROSELLA, 2009; PORTO et al., 2015; TAN et al., 2007; YIE et al., 2008; DONADI et al., 2011).

1.4. HLA-F

HLA-F is also a non-classical gene of the human MHC, but it has been little explored in terms of function and genetic diversity. The gene is currently known to have low variability, with only 44 alleles and 6 proteins (HLA-F*01:01, HLA-F*01:02, HLA-F*01:03, HLA-F*01:04, HLA-F*01:05, and HLA-F*01:06) described in the IPD-IMGT/HLA database, version 3.38.0. Because HLA-F shares the characteristics of the other non-classical genes, i.e., tissue-restricted or low expression, low polymorphism, and interaction of the HLA-F protein with the inhibitory receptors ILT and KIR (GOODRIDGE et al., 2013; LEPIN et al., 2000), its function is also believed to be related to immunomodulation. So far, HLA-F expression under physiological conditions has been detected in the thymus, spleen, and tonsil (LEPIN et al., 2000; WAINWRIGHT; BIRO; HOLMES, 2000). Under pathological conditions, HLA-F expression has been described in several types of cancer and autoimmune diseases (ISHIGAMI et al., 2015; JUCAUD et al., 2016), generally in relation to disease prognosis. The non-classical HLA genes HLA-G and HLA-E are important during pregnancy and transplantation.
However, few studies have described the relationship of HLA-F with these conditions. Apart from one study that evaluated HLA-F variability in a sample of the Brazilian population (LIMA et al., 2016), no studies have assessed the diversity of this gene in worldwide populations. The variability of this gene may therefore be considerably underestimated in the literature and in HLA variant databases.

1.5. Structural variants

Alu insertions are repeated sequences in the genome and the most abundant members of a family of retrotransposons called SINEs (Short Interspersed Nuclear Elements); they carry a restriction site for the enzyme AluI (BATZER; DEININGER, 2002; HOUCK; RINEHART; SCHMID, 1979). An Alu element is approximately 300 bp long and consists of two nearly identical arms of around 150 bases each, the right monomer being about 31 bases longer than the left, separated by an adenine (A)-rich region (STRACHAN; READ, 1999). The A-rich motif at the center of the structure usually contains the repeat A5TACA6 (BATZER; DEININGER, 2002). These authors also note that Alu sequences have a terminal oligo(dA) tail of variable length. Alu insertions are additionally flanked by short thymine- and adenine-rich (T + A) repeats, reflecting their chromosomal insertion (JURKA, 1997). The sequence also contains a promoter for RNA polymerase III (BATZER; DEININGER, 2002), required for its transcription.

A large number of Alu insertions located on different chromosomes have been used in population studies, and they are estimated to contribute to about 1% of human genetic diseases (DUNN et al., 2002).
These markers are usually found in introns, untranslated regions of genes, and intergenic regions, reaching more than one million copies that are unevenly distributed and represent 10% of the genome (BATZER; DEININGER, 2002). These mobile elements derive ancestrally from 7SL RNA, which is involved in binding the signal peptide that directs protein transport to the endoplasmic reticulum (TIAN et al., 2008). Their transposition involves reverse transcription of an RNA polymerase III transcript, probably by a reverse transcriptase encoded by a LINE (Long Interspersed Nuclear Element) sequence (MATHIAS et al., 1991), since Alu elements lost the sequence encoding this enzyme over the course of evolution (BATZER; DEININGER, 2002). These authors also highlight the importance of conserved regions (TTAAA) flanking the Alu sequence, which must be recognized by endonucleases for integration into the genome to occur.

These elements are classified into subfamilies based on their mutation pattern, genetic age, and sequence differences accumulated during evolution. The sequences found in humans include older ones shared with other primates, called AluJ and AluS (JURKA; SMITH, 1988), and some human-specific sequences, notably the most recent insertions, called AluY (BATZER et al., 1996). Alu families are conventionally distinguished by sequence differences, such as a few diagnostic mutations (JURKA; SMITH, 1988; BATZER et al., 1996; KAPITONOV; JURKA, 1996).
Among the Alu families described above, AluY is the most useful for population studies because its members have not yet reached complete fixation in the genome (KULSKI et al., 2001). They are dimorphic in the human population, i.e., either present or absent (DUNN et al., 2002), and they are identical by descent (DEININGER; BATZER, 1999). Moreover, they can be typed quickly and easily by polymerase chain reaction (PCR) followed by electrophoresis (ROY-ENGEL et al., 2001). The presence or absence of these Alu insertions yields DNA markers with advantages over SNPs and other markers: (1) the probability of two identical retrotranspositions occurring at the same chromosomal site is almost zero; (2) the ancestral state of each polymorphic Alu insertion is known, i.e., absence of the insertion is the ancestral state, which allows the origins of the different alleles to be established precisely (TIAN et al., 2008), since there are no mechanisms for removing these Alu insertions from the genome. Therefore, two individuals carrying the same insertion at the same locus share a common ancestor that carried that insertion.

At least five different AluY sequences have been identified in the human MHC class I region: AluYMICB, AluYTF, AluYHJ, AluyHF, and AluyHG (DUNN et al., 2002). AluYMICB lies in the first intron of the MICB gene (KULSKI et al., 2002). AluYTF is located between the HLA-B and HLA-E genes, about 505 kb from the latter (DUNN et al., 2003). AluyHG is located approximately 20 kb from the 3’ end of the HLA-G gene (KULSKI et al., 2001) and was recently explored in a study from our group (SANTOS et al., 2013). AluYHJ is located approximately 190 kb from HLA-G (DUNN et al., 2002). AluyHF lies approximately 130 kb telomeric to HLA-G and close to the HLA-F gene (DUNN et al., 2002).
Figure 2 highlights the alpha block, which spans about 235 kb and contains the HLA-A, -G, and -F genes together with the structural variants that are the main focus of this work. The genetic diversity of AluyHF, AluyHG, and AluYHJ and their patterns of association with HLA-A were first reported in Australian and Japanese populations (DUNN et al., 2002). Later, association studies between Alu insertions and HLA-A and HLA-B were carried out in a northeastern Thai population (DUNN et al., 2005) and in Malaysia (DUNN et al., 2007), and another study related five Alu markers to the MHC class I region (KULSKI; DUNN, 2005). Although these earlier studies exist, the data available at the time were of low resolution, i.e., only a few exons of each HLA gene were explored, generally in search of already known variants.

These markers are highly informative and contribute to the understanding of the evolutionary history of the genes surrounding them. They may also indicate the presence of specific haplotypes, some of which are associated with particular phenotypes and with susceptibility to disease. For example, in a previous study from our group (SANTOS et al., 2013), the AluyHG marker accompanied an HLA-G haplotype related to high production of this regulatory molecule.

Figure 2. Gene composition of the human MHC alpha block and the structural variants present in it.

Knowledge of the variability profile of AluY insertions along the human MHC may refine the determination of extended haplotypes of this gene complex, making it possible, for example, to determine which alleles and haplotypes are associated with a phenotype of interest.
In addition, the correlation between these markers and the polymorphisms detected in class I genes may help to distinguish intrinsic factors, such as recombination, from extrinsic ones, such as adaptive selection, in the variability observed in the human MHC, and to determine which alleles are older or share a common origin. Furthermore, by determining extended haplotypes, we can estimate which HLA alleles accompany the presence of these Alu markers. The analysis of Alu insertions is a fast and inexpensive method that can indirectly indicate the presence of relevant HLA alleles.

Another structural variant in the MHC alpha block is a copy number variation (CNV) between HLA-G and HLA-A (Figure 2), which deletes more than 60 kb in a region 5 kb upstream of HLA-A. This CNV (esv3608493) includes the HLA-H and HCG4B genes, as shown in Figure 2, which may or may not be present in a given individual. This CNV has been associated with lower expression of HCG4B and HLA-A and influences chronic obstructive pulmonary disease and asthma (CHEN et al., 2017; OLIVEIRA et al., 2018b; PNG et al., 2019).

1.6. Selective forces

Balancing selection has been well documented for the HLA-A and HLA-G genes. However, questions remain about the action of purifying selection on the HLA-G coding region and about a possible hitchhiking effect between these two genes, in which the evolutionary forces acting on one of them could influence the allele frequencies of the other. In this study, as part of a larger project that aims to characterize all genes and structural variants in the human MHC region and the variability found in each gene, we evaluated the genetic diversity observed in the human MHC alpha block and how this diversity correlates with the structural variants present in it.

1.7. References

AMIGORENA, S. Antigen presentation: from cell biology to physiology. Immunological Reviews, v. 272, n. 1, p. 5-7, Jul 2016.
BATZER, M. A.; DEININGER, P. L.
Alu Repeats and Human Genomic Diversity. Nature Reviews Genetics, v. 3, n. 5, p. 370-379, 2002.
BATZER, M. A. et al. Standardized nomenclature for Alu repeats. Journal of Molecular Evolution, v. 42, n. 1, p. 3-6, Jan 1996.
BRUTKIEWICZ, R. R. Cell Signaling Pathways That Regulate Antigen Presentation. Journal of Immunology, v. 197, n. 8, p. 2971-2979, Oct 2016.
CAROSELLA, E. D.; MOREAU, P.; LEMAOULT, J.; ROUAS-FREISS, N. HLA-G: from biology to clinical benefits. Trends in Immunology, v. 29, n. 3, p. 125-132, Mar 2008.
CASTELLI, E. C. et al. A comprehensive study of polymorphic sites along the HLA-G gene: implication for gene regulation and evolution. Molecular Biology and Evolution, v. 28, n. 11, p. 3069-3086, Nov 2011.
CASTELLI, E. C. et al. HLA-E coding and 3' untranslated region variability determined by next-generation sequencing in two West-African population samples. Human Immunology, v. 76, n. 12, p. 945-953, Dec 2010.
CASTELLI, E. C. et al. HLA-G alleles and HLA-G 14 bp polymorphisms in a Brazilian population. Tissue Antigens, v. 70, n. 1, p. 62-68, Jul 2007.
CASTELLI, E. C. et al. HLA-G polymorphism and transitional cell carcinoma of the bladder in a Brazilian population. Tissue Antigens, v. 72, n. 2, p. 149-157, Aug 2008.
CASTELLI, E. C. et al. HLA-G variability and haplotypes detected by massively parallel sequencing procedures in the geographically distinct population samples of Brazil and Cyprus. Molecular Immunology, v. 83, p. 110-126, Mar 2017.
CASTELLI, E. C. et al. Insights into HLA-G Genetics Provided by Worldwide Haplotype Diversity. Frontiers in Immunology, v. 5, Oct 2014.
CASTELLI, E. C. et al. The genetic structure of 3' untranslated region of the HLA-G gene: polymorphisms and haplotypes. Genes and Immunity, v. 11, n. 2, p. 134-141, Mar 2010.
DEININGER, P. L.; BATZER, M. A. Alu repeats and human disease. Molecular Genetics and Metabolism, v. 67, n. 3, p. 183-193, Jul 1999.
DONADI, E. A. et al.
Implications of the polymorphism of HLA-G on its function, regulation, evolution and disease association. Cellular and Molecular Life Sciences, v. 68, n. 3, p. 369-395, Feb 2011.
DUNKER, K.; SCHLAF, G.; BUKUR, J. et al. Expression and regulation of non-classical HLA-G in renal cell carcinoma. Tissue Antigens, v. 72, n. 2, p. 137-148, Aug 2008.
DUNN, D. S.; CHOY, M. K.; PHIPPS, M. E.; KULSKI, J. K. The distribution of major histocompatibility complex class I polymorphic Alu insertions and their associations with HLA alleles in a Chinese population from Malaysia. Tissue Antigens, v. 70, n. 2, p. 136-143, Aug 2007.
DUNN, D. S. et al. Dimorphic Alu element located between the TFIIH and CDSN genes within the major histocompatibility complex. Electrophoresis, v. 24, n. 16, p. 2740-2748, Aug 2003.
DUNN, D. S. et al. Polymorphic Alu insertions and their associations with MHC class I alleles and haplotypes in the northeastern Thais. Annals of Human Genetics, v. 69, pt. 4, p. 364-372, Jul 2005.
DUNN, D. S. et al. The association between HLA-A alleles and young Alu dimorphisms near the HLA-J, -H, and -F genes in workshop cell lines and Japanese and Australian populations. Journal of Molecular Evolution, v. 55, n. 6, p. 718-726, Dec 2002.
FELICIO, L. P. et al. Worldwide HLA-E nucleotide and haplotype variability reveals a conserved gene for coding and 3' untranslated regions. Tissue Antigens, v. 83, n. 2, p. 82-93, Feb 2014.
GERAGHTY, D. E. et al. The HLA class I gene family includes at least six genes and twelve pseudogenes and gene fragments. Journal of Immunology, v. 149, n. 6, p. 1934-1946, Sep 1992.
GOLDBERG, A. C.; RIZZO, L. V. MHC structure and function – antigen presentation. Part 1. Einstein (São Paulo), v. 13, n. 1, p. 103-106, Mar 2010.
HOUCK, C. M.; RINEHART, F. P.; SCHMID, C. W. A ubiquitous family of repeated DNA sequences in the human genome. Journal of Molecular Biology, v.
132, n. 3, p. 289-306, Aug 1979.
JENSEN, P. E. Recent advances in antigen processing and presentation. Nature Immunology, v. 8, n. 10, p. 1041-1048, Oct 2007.
JURKA, J. Sequence patterns indicate an enzymatic involvement in integration of mammalian retroposons. Proceedings of the National Academy of Sciences of the United States of America, v. 94, n. 5, p. 1872-1877, Mar 1997.
JURKA, J.; SMITH, T. A fundamental division in the Alu family of repeated sequences. Proceedings of the National Academy of Sciences of the United States of America, v. 85, n. 13, p. 4775-4778, Jul 1988.
KAPITONOV, V.; JURKA, J. The age of Alu subfamilies. Journal of Molecular Evolution, v. 42, n. 1, p. 59-65, Jan 1996.
KLEIN, J.; SATO, A. Birth of the major histocompatibility complex. Scandinavian Journal of Immunology, v. 47, n. 3, p. 199-209, Mar 1998.
KLEIN, J.; SATO, A. The HLA system. First of two parts. New England Journal of Medicine, v. 343, n. 10, p. 702-709, Sep 2000.
KLEIN, J.; SATO, A. The HLA system. Second of two parts. New England Journal of Medicine, v. 343, n. 11, p. 782-786, Sep 2000.
KULSKI, J. K. et al. The association between HLA-A alleles and an Alu dimorphism near HLA-G. Journal of Molecular Evolution, v. 53, n. 2, p. 114-123, Aug 2001.
KULSKI, J. K.; DUNN, D. S. Polymorphic Alu insertions within the Major Histocompatibility Complex class I genomic region: a brief review. Cytogenetic and Genome Research, v. 110, n. 1-4, p. 193-202, Jul 2005.
KULSKI, J. K. et al. Alu polymorphism within the MICB gene and association with HLA-B alleles. Immunogenetics, v. 53, n. 10-11, p. 975-979, Feb 2002. Epub 26 Jan 2002.
LIMA, T. H. et al. HLA-F coding and regulatory segments variability determined by massively parallel sequencing procedures in a Brazilian population sample. Human Immunology, v. 77, n. 10, p. 841-853, Oct 2016.
MATHIAS, S. L. et al. Reverse transcriptase encoded by a human transposable element. Science, v. 254, n. 5039, p. 1808-1810, Dec 1991.
OGAHARA, S. et al.
Effect of mismatched combinations of HLA-A antigens on graft survival in the transplanted kidney. Transplantation Proceedings, v. 30, n. 7, p. 3500-3501, Nov 1998.
RAJU, T. N. The Nobel chronicles. 1980: George Davis Snell (1903-96); Jean Baptiste Dausset (b 1916); Baruj Benacerraf (b 1920). Lancet, v. 354, n. 9191, p. 1738, Nov 1999.
ROY-ENGEL, A. M. et al. Alu insertion polymorphisms for the study of human genomic diversity. Genetics, v. 109, n. 1, p. 279-290, Sep 2001.
SANTOS, K. E. et al. Insights on the HLA-G evolutionary history provided by a nearby Alu Insertion. Molecular Biology and Evolution, v. 30, n. 11, p. 2423-2434, Nov 2013.
SHIINA, T. et al. The HLA genomic loci map: expression, interaction, diversity and disease. Journal of Human Genetics, v. 541, p. 10-39, Jan 2009.
STRACHAN, T.; READ, A. P. Human molecular genetics 2. 2. ed. New York: Wiley-Liss, 1999.
TERASAKI, P. I. A brief history of HLA. Immunologic Research, v. 38, n. 1-3, p. 139-148, 2007.
THORSBY, E. A short history of HLA. Tissue Antigens, v. 74, n. 2, p. 101-116, Aug 2009.
TIAN, W. et al. Polymorphic insertions in 5 Alu loci within the major histocompatibility complex class I region and their linkage disequilibria with HLA alleles in four distinct populations in mainland China. Tissue Antigens, v. 72, n. 6, p. 559-567, Dec 2008.
VANDIEDONCK, C. et al. Association of HLA-A in autoimmune myasthenia gravis with thymoma. Journal of Neuroimmunology, v. 210, n. 1-2, p. 120-123, May 2009.

CHAPTER 2

2. ARTICLE

GENETIC DIVERSITY AND STRUCTURAL VARIANTS IN THE HUMAN MHC ALPHA-BLOCK

Marilia Rodrigues Silva Passos1, Thálitta H. A. Lima2, Andreia S. Souza2, Nayane dos Santos B. Silva1, Heloisa S. Andrade2, Camila Ferreira B. Castro3, Michel S. Naslavsky4, Marília Scliar4, Eduardo Antônio Donadi5, Nicolas Vince6, Diogo Meyer7, Celso T. Mendes-Junior8, Erick C. Castelli1

1 Universidade Estadual Paulista (UNESP), Faculdade de Medicina de Botucatu, Brazil.
2 Universidade Estadual Paulista (UNESP), Instituto de Biociências de Botucatu, Brazil.
3 Centro Universitário Sudoeste Paulista – UniFSP, Brazil.
4 Centro de Pesquisas sobre o Genoma Humano e Células Tronco, Instituto de Biociências, Universidade de São Paulo (USP), Brazil.
5 School of Medicine of Ribeirão Preto, University of São Paulo, Ribeirão Preto, State of São Paulo, Brazil.
6 Nantes Université, Centrale Nantes, CHU Nantes, Inserm, Centre de Recherche en Transplantation et Immunologie, UMR 1064, ITUN, F-44000 Nantes, France.
7 Faculdade de Saúde Pública, Escola de Enfermagem, Universidade de São Paulo (USP), Brazil.
8 Departamento de Química, Faculdade de Filosofia, Ciências e Letras de Ribeirão Preto, Universidade de São Paulo, Ribeirão Preto, SP, Brazil.

Contact: Erick C. Castelli, Departamento de Patologia, Faculdade de Medicina Unesp, Botucatu – SP, CEP 18618970, Brazil. erick.castelli@unesp.br, +55 14 3880 1696

Abstract

The MHC (Major Histocompatibility Complex) alpha block comprises important HLA genes such as HLA-F and HLA-G, both non-classical HLA loci with low variability and restricted tissue expression, and HLA-A, a highly polymorphic gene important for antigen presentation. This segment also harbors structural variants such as Alu elements and CNVs. HLA-G and HLA-A have been described as targets of balancing selection in different scenarios: the coding region for HLA-A and the regulatory segments for HLA-G. Moreover, HLA-F diversity in worldwide populations is unknown. Here we applied a bioinformatics pipeline that optimizes short-read mapping for HLA genes and allows the determination of reliable haplotypes to survey the genetic diversity of the MHC alpha block in 1323 Brazilians. We also evaluated three structural variants, AluyHF, AluyHG, and esv3608493. These individuals constitute a highly admixed sample from the largest Brazilian city, São Paulo.
Our data revealed few HLA-F and HLA-G sequences despite the large sample size. HLA-F is the most conserved gene, with 82.65% of the chromosomes encoding the same HLA-F protein and only one additional frequent protein. HLA-A was highly polymorphic, with 96 genomic sequences encoding 59 proteins, including rare ones. We also assessed linkage disequilibrium (LD) and haplotypes along the MHC alpha block. AluyHF was in LD only with HLA-F, but not with any particular allele. AluyHG was in LD with G*01:01:01:01/UTR-1, while esv3608493 was associated only with alleles HLA-A*23 and A*24. We also detected frequent haplotypes in our sample, which highlight the close relationship between HLA-G and HLA-A alleles. Because of that, the evidence of balancing selection in these two genes might not be independent, since high heterozygosity at one locus would increase heterozygosity at the other. These data provide a valuable resource of variants, frequencies, and haplotypes in HLA genes of admixed samples for clinical and evolutionary studies.

2.1. Introduction

The MHC (Major Histocompatibility Complex) alpha block comprises three HLA genes important for antigen presentation and immune modulation (Figure 1). At the telomeric end lies the HLA-F gene, a non-classical HLA gene with low variability and low expression levels in different tissues (LEPIN et al., 2000; LIMA et al., 2016). The IPD-IMGT/HLA database version 3.38.0 describes only 44 alleles for HLA-F, encoding 6 different proteins. HLA-F may interact with inhibitory receptors such as ILT2, ILT4, KIR3DL2, and KIR2DS4, which may modulate NK and T cell activity (GOODRIDGE et al., 2013; LEPIN et al., 2000).
HLA-F expression has been detected under physiological conditions in the thymus, spleen, and amygdala (LEPIN et al., 2000; WAINWRIGHT; BIRO; HOLMES, 2000), and under pathological conditions in various types of cancer and autoimmune diseases, usually associated with a poor prognosis (ISHIGAMI et al., 2015; JUCAUD et al., 2016). HLA-F genetic diversity has been explored with molecular techniques able to detect any variant mainly in Brazil and Benin (LIMA et al., 2016; SONON et al., 2018) and, because of that, its genetic diversity might be underestimated.

In the center of the alpha block (Figure 1) lies the HLA-G locus, another non-classical HLA gene important for immune modulation. HLA-G was first identified at the maternal-fetal interface, especially on the placental cytotrophoblast, where immunotolerance contributes to the maintenance of the fetus. HLA-G is constitutively expressed on the surface of trophoblast cells and in the cornea, pancreas, and thymus (KING et al., 2000). In endocrine tissues, HLA-G is regulated according to growth and inflammatory stimuli and can exhibit regulatory functions in human pancreatic islets, including in autoimmune progression (CIRULLI et al., 2006). HLA-G can inhibit the cytotoxicity of Natural Killer (NK) cells and CD8+ T lymphocytes through interaction with NK and T cell surface receptors (CAROSELLA et al., 2008). HLA-G binding to these receptors modulates the activity of cytotoxic T cells and NK cells, inducing the expression of inhibitory receptors with tyrosine motifs (CAROSELLA et al., 2003).

HLA-G is conserved at the nucleotide and protein levels, with 69 alleles encoding 19 full-length proteins according to the IPD-IMGT/HLA database version 3.38.0. However, when different population samples are studied, few distinct HLA-G proteins are detected worldwide. Most HLA-G alleles differ by intronic or synonymous mutations and are present in all population samples at different frequencies (CASTELLI et al., 2014, 2017; SONON et al., 2018).
The HLA-G promoter and 3’ untranslated region (3’UTR) enclose many transcriptional (MOREAU; FLAJOLLET; CAROSELLA, 2009) and post-transcriptional regulatory elements (MANASTER et al., 2012; PORTO et al., 2015; TAN et al., 2007; YIE et al., 2008), respectively. Therefore, some HLA-G polymorphic sites in regulatory sequences have been associated with different expression profiles (HVIID, 2006; MOREAU; FLAJOLLET; CAROSELLA, 2009; YIE et al., 2008). HLA-G deserves special attention in pregnancy, as the molecules encoded by this gene play a role in immunosuppression and homeostasis maintenance during pregnancy (HVIID, 2006). Because trophoblastic cells do not express classical class I HLA molecules, they are potential targets for NK cell-mediated lysis. However, the interaction between trophoblast HLA-G and NK inhibitory receptors hampers trophoblast cell lysis (PONTE et al., 1999). HLA-G also activates the production of cytokines that promote vascular remodeling at the maternal-fetal interface (KIRWAN et al., 2002).

At the centromeric end of the MHC alpha block lies the HLA-A locus. HLA-A is a classical class I gene important for antigen presentation. It is one of the most polymorphic genes in the human genome, with 5,735 different alleles according to the IPD-IMGT/HLA database version 3.38.0. HLA-A genetic diversity has been studied in the Brazilian population, and the HLA-A promoter sequence is highly variable, which is probably related to different expression profiles for each of the main HLA-A allele groups (LIMA et al., 2019). The most common HLA-A allele group is HLA-A*02, which accounts for more than 920 alleles already described (IPD-IMGT/HLA), some of them related to diseases such as Alzheimer's disease, Hashimoto thyroiditis, and type 1 diabetes mellitus (NOBLE et al., 2010; SHIINA; INOKO; KULSKI, 2004).
Another frequent HLA-A allele group is HLA-A*24, with some of its alleles (A*24:02), together with alleles of other groups (A*11:01, A*32:01, and A*66:01), associated with type 1 diabetes mellitus (DM-1) (CARVALHO DOS SANTOS et al., 2013; LIMA et al., 2019). It is noteworthy that these alleles are relatively frequent in Brazil (CARVALHO DOS SANTOS et al., 2013; LIMA et al., 2019). There is a functional link between the high nucleotide and protein diversity of HLA-A and antigen presentation, since different HLA-A molecules would present different subsets of peptides (BUHLER; NUNES; SANCHEZ-MAZAS, 2016). As antigen presentation is the main function of the classical class I HLA molecules such as HLA-A, natural selection is thought to have shaped the genetic variability of the classical HLA genes, increasing variability at the peptide-binding site and thus allowing a great diversity of antigen presentation. Likewise, since classical class I molecules should be constitutively expressed in virtually all somatic cells, and since class I molecules are expressed in a codominant fashion (FISCHER; MAYR, 2001; KLEIN; SATO, 2000a, 2000b), the regulatory regions of these genes are expected to be less variable, which would avoid excessive variation in gene expression. However, variation in the regulatory regions may modify this scenario. Polymorphisms in these regions are believed to affect expression levels substantially, leading to unbalanced expression among the different alleles of an individual.

Balancing selection has been well documented for both HLA-A and HLA-G, but in different ways. While the most polymorphic segment of HLA-A is the coding sequence, which is associated with the presence of many frequent and divergent coding alleles in worldwide populations, the most polymorphic segments of HLA-G are the regulatory ones.
In fact, many studies have detected signatures of balancing selection at the HLA-G promoter and 3’UTR, while the coding segment is conserved (GINEAU et al., 2015). However, it is not clear whether balancing selection is indeed operating on the HLA-G regulatory segments or whether these results reflect a hitchhiking effect caused by the selective pressures acting on HLA-A.

The MHC alpha block contains some structural variants, such as two Alu insertions, AluyHF and AluyHG. Alu insertions are repeated sequences of approximately 300 bp distributed across the genome. Each element comprises two nearly identical arms separated by an adenine-rich (A) region, a variable-length oligo(dA) tail, short thymine and adenine (T + A) repeats flanking the sequence, and a promoter for RNA polymerase III, which is required for its transcription. These markers are usually found in introns, untranslated gene regions, and intergenic regions (as in this case), and they account for more than 10% of the genome, with a non-uniform distribution. Among the Alu families, AluY is the most useful for population genetics studies because its members are identical by descent and have not yet reached complete fixation in the genome, being dimorphic in the population. The presence or absence of these insertions results in DNA markers that have advantages over SNPs and other markers: (1) the probability of two identical elements inserting at the same chromosomal site is almost zero; (2) the ancestral state is known. These properties allow the origins of the different alleles to be established precisely, since there are no mechanisms for removing these Alu insertions from the genome. Therefore, two individuals with the same insertion at the same locus share a common ancestor carrying that insertion (BATZER; DEININGER, 2002; DUNN et al., 2002; JURKA, 1997; KULSKI et al., 2001; MATHIAS et al., 1991).
Previous studies report associations of AluyHF and AluyHG with HLA-A in Australian and Japanese populations (DUNN et al., 2002), and AluyHG accompanies an HLA-G haplotype related to high HLA-G production, G*01:01:01:01/UTR-01 (SANTOS et al., 2013). Another structural variant in the MHC alpha block is a copy number variation (CNV) between HLA-G and HLA-A (Figure 1), which deletes more than 60 kb in a region 5 kb upstream of HLA-A. This CNV (esv3608493, nsv823469, nsv823470) includes the genes HLA-H and HCG4B, which may or may not be present in a given individual. The loss of this segment has been associated with lower HCG4B and HLA-A expression and influences chronic obstructive pulmonary disease and asthma (CHEN et al., 2017; OLIVEIRA et al., 2018b; PNG et al., 2019).

Here we evaluate the genetic diversity of HLA-F, -G, and -A, as well as the presence or absence of AluyHF, AluyHG, and esv3608493, in 1323 individuals living in the largest Brazilian city, São Paulo, by using Whole Genome Sequencing (WGS) and a bioinformatics workflow developed to optimize short-read alignment for HLA genes. We then evaluate linkage disequilibrium along the MHC alpha-block and the relationship among genes and structural variants, defining the most common haplotypes in our highly admixed population.

2.2. Methods

2.2.1. Samples and sequencing

We evaluated polymorphisms in the MHC alpha-block in 1323 DNA samples from individuals from the city of São Paulo. These individuals are part of the SABE project (Saúde, Bem Estar e Envelhecimento; Health, Well-Being, and Aging), led by the Centro de Estudos do Genoma Humano e Células-Tronco (CEGH-CEL), University of São Paulo (IB/USP). Peripheral blood was drawn into ethylenediaminetetraacetic acid (EDTA) tubes, and DNA was extracted from leukocytes using the QIAsymphony automated purification system (Qiagen, Hilden, Germany) with the lysis/isopropanol precipitation protocol, as recommended by the manufacturer.
The quality and concentration of DNA were evaluated with NanoDrop® (Thermo Fisher Scientific, Carlsbad, California, USA). Sequencing was performed by the CEGH-CEL using sonication (Covaris, Inc., Woburn, Massachusetts, USA) and Illumina HiSeq (Illumina, San Diego, California, USA). Reads were then aligned against the human reference genome hg38 using Isaac Aligner.

2.2.2. MHC alpha-block analysis

We used samtools to extract the reads mapped to the MHC segment, together with the unmapped ones, and to convert them to paired FASTQ files. We resynchronized the FASTQ files with a local Perl script. We used hla-mapper version 3.05 to produce BAM files with HLA-F-, HLA-G-, and HLA-A-specific alignments against the hg38 reference genome (CASTELLI et al., 2018). This software has previously been used to evaluate HLA-A, HLA-G, HLA-E, and HLA-F in different population samples (BUTTURA et al., 2019; CASTELLI et al., 2015, 2017; LIMA et al., 2016, 2019; RAMALHO et al., 2017; SONON et al., 2018). Genotypes were inferred with the Genome Analysis Toolkit (GATK, version 4.1.2), using HaplotypeCaller in GVCF mode followed by GenotypeGVCFs with default parameters (DEPRISTO et al., 2011; MCKENNA et al., 2010; VAN DER AUWERA et al., 2013). Variant refinement was then performed using vcfx (available at www.castelli-lab.net/apps/vcfx). This software introduces missing alleles in genotypes with low likelihoods (function checkpl) and in unbalanced genotypes (function checkad), and also annotates each variant with some quality control parameters (function evidence). To ensure that only high-quality genotypes are passed forward to the subsequent imputation and haplotyping steps, all variants not flagged as “PASS” by vcfx evidence were manually inspected and, if necessary, removed. Since hla-mapper version 3 reports the depth of coverage at each HLA locus, we used the HLA-H depth of coverage to genotype esv3608493 as presenting none, one, or two copies of this segment.
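The depth-based CNV genotyping described above can be illustrated with a minimal sketch. The text does not state how the HLA-H depth was converted into a copy number, so the normalization against a reference depth (for example, the depth at a locus expected to be diploid), the cutoffs, and the function name `genotype_cnv_from_depth` are all assumptions made for illustration only.

```python
def genotype_cnv_from_depth(target_depth, reference_depth, het_bounds=(0.25, 0.75)):
    """Classify a segment as 0, 1, or 2 copies from its relative depth.

    target_depth: mean depth of coverage over the segment (e.g. HLA-H).
    reference_depth: mean depth over a region assumed to be diploid.
    het_bounds: hypothetical ratio cutoffs separating 0/1 and 1/2 copies.
    """
    if reference_depth <= 0:
        raise ValueError("reference depth must be positive")
    ratio = target_depth / reference_depth
    low, high = het_bounds
    if ratio < low:
        return 0  # almost no reads: segment absent on both chromosomes
    if ratio < high:
        return 1  # roughly half the expected depth: one copy
    return 2      # depth close to the diploid expectation: two copies


# Example with made-up depths for three samples (reference depth of 30x).
calls = [genotype_cnv_from_depth(d, 30.0) for d in (0.5, 15.0, 31.0)]
```

In practice the cutoffs would need to be checked against the observed distribution of depth ratios in the sample, which is expected to cluster around three values.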
A local Perl script was used to genotype AluyHF and AluyHG in all samples. This script considers the distance observed between known sequences surrounding the insertion site and determines whether each sample presents none, one, or two copies of the Alu insertion.

We inferred haplotypes by combining two computational strategies. First, GATK ReadBackedPhasing (RBP) inferred the phase between close heterozygous variants, with the minimal PhaseQualityThreshold set to 500 (25x the default value). This procedure resulted in phase sets, i.e., blocks of variants with known phase within each block but unphased with respect to each other. Multi-allelic variants, indels, and missing alleles are not considered by RBP. The second step consisted of inferring the final haplotypes considering these phase sets. For that, we used an in-house software named PHASEX (available upon request). PHASEX uses Shapeit4 (DELANEAU; MARCHINI; ZAGURY, 2011) to phase all bi-allelic variants considering the phase sets, in 50 independent runs, and compares the results afterward. Samples presenting the same pair of haplotypes in at least 48 runs (P > 0.95) are fixed as completely phased samples and passed forward to the next round. This procedure is repeated until the number of samples with P > 0.95 no longer increases. The haplotypes defined in this step are then passed forward to the next one, in which Beagle 4.1 is used to infer the final set of haplotypes, now including the multi-allelic variants. Here we also used 50 independent runs for each round, fixing the samples in which the same haplotypes are inferred in at least 48 independent runs (P > 0.95). The final set of haplotypes comprises the ones with P > 0.95 after the final run. Shapeit4 and Beagle also imputed the bi-allelic and multi-allelic missing alleles, respectively. We removed all singletons before the haplotyping procedure. This step is necessary because the phase of a singleton is ambiguous by definition.
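The run-agreement rule used in the haplotyping rounds (a sample is fixed when the same haplotype pair appears in at least 48 of 50 independent runs) can be sketched as follows. This is not the PHASEX code, which is not shown in the text; the function name `consensus_pair` and the representation of haplotypes as plain strings are hypothetical choices for illustration.

```python
from collections import Counter


def consensus_pair(runs, min_runs=48):
    """Return the haplotype pair seen in at least `min_runs` runs, else None.

    runs: one (haplotype_1, haplotype_2) tuple per independent phasing run
    for a single sample. The order of the two haplotypes within a pair is
    arbitrary, so each pair is sorted before counting.
    """
    if not runs:
        return None
    counts = Counter(tuple(sorted(pair)) for pair in runs)
    pair, n_runs = counts.most_common(1)[0]
    return pair if n_runs >= min_runs else None


# Toy example: 48 runs agree (in either orientation) and 2 runs disagree.
agreeing = [("ACG", "ATG")] * 30 + [("ATG", "ACG")] * 18
runs = agreeing + [("ACG", "ACG")] * 2
fixed = consensus_pair(runs)
```

Samples for which the function returns `None` would stay unresolved and be phased again in the following round, with the fixed samples acting as known haplotypes.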
When possible, singletons were manually inserted in the final VCF file by comparing the RBP results and the BAM files. The procedure described above was performed for HLA-F, HLA-G, and HLA-A separately. Then, a joint VCF file was produced with the AluyHF, HLA-F, HLA-G, AluyHG, HLA-H CNV, and HLA-A data, and another haplotyping round was performed. This procedure allowed us to identify the extended haplotypes of 1306 individuals (98.7% of the original sample).

The final (phased) VCF file was converted into two complete sequences for each locus and each individual with vcfx fasta, using hg38 chromosome 6 as a template and substituting the called nucleotide at each variable position. Then, a local Perl script identified the closest known HLA allele for each sequence, considering the IPD-IMGT/HLA database (ROBINSON et al., 2015), version 3.38.0. When the observed sequence was identical to a known sequence, the same name was used. When the observed sequence was not identical to any sequence already reported, we classified it as “unknown”. Linkage disequilibrium (LD) was assessed using Haploview 4.2, considering variants with a minor allele frequency (MAF) greater than 1%.

2.3. Results

HLA-F genetic diversity among Brazilians

We detected 16 different HLA-F CDS sequences in this sample (Table 1). The list of variants in exons, introns, and regulatory segments, and their frequencies, is available in Table S1. The most frequent CDS alleles were F*01:01:01 (64.81%) and F*01:01:02 (17.64%), both encoding the same HLA-F protein, F*01:01. The HLA-F molecule F*01:01 has a summed frequency of 82.65%, and F*01:03 has a frequency of 14.93%. Thus, in this large admixed Brazilian sample, there are only two frequent HLA-F molecules, indicating that HLA-F is indeed conserved at the DNA and protein levels. Nine new HLA-F CDS sequences were detected in our sample, with a summed frequency of 1.58%. One of these new sequences was detected in 22 individuals.
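The step from CDS-level allele frequencies to protein-level frequencies reported above amounts to truncating each allele name to its first two fields and summing. The sketch below illustrates this under stated assumptions: the helper names are hypothetical, the regular expression only covers names of the form shown in this text, and an expression suffix such as N is assumed to be carried over to the shortened name. Only the two F*01:01 frequencies quoted in the text are used, so the total is lower than the 82.65% reported, which also includes rarer synonymous alleles listed in Table 1.

```python
import re
from collections import defaultdict

# Gene, first two fields, optional further fields, optional suffix (e.g. N).
_ALLELE = re.compile(r"([A-Z0-9]+\*\d+:\d+)(?::\d+)*([A-Z]?)")


def protein_level(allele):
    """Shorten an allele name to two fields, keeping any expression suffix."""
    match = _ALLELE.fullmatch(allele)
    if match is None:
        raise ValueError(f"unrecognized allele name: {allele}")
    return match.group(1) + match.group(2)


def collapse_to_proteins(cds_frequencies):
    """Sum CDS-level frequencies (in %) over alleles encoding the same protein."""
    totals = defaultdict(float)
    for allele, frequency in cds_frequencies.items():
        totals[protein_level(allele)] += frequency
    return dict(totals)


# Frequencies quoted in the text for the two most frequent HLA-F CDS alleles.
cds = {"F*01:01:01": 64.81, "F*01:01:02": 17.64}
proteins = collapse_to_proteins(cds)
```

The same collapse applies to the HLA-G and HLA-A tables, where several CDS or genomic alleles also map to a single protein.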
When the genomic sequence is taken into account (exons + introns), we found only 19 alleles compatible with known sequences, as described in Table 2, with a summed frequency of 92.78%. The new genomic sequences are mostly related to rare intronic variants and to the number of repeats in STRs. Although not shown in this study, the HLA-F promoter and 3’UTR haplotypes, and their association with coding alleles, are compatible with the ones previously described for other Brazilian and African samples (BUTTURA et al., 2019; LIMA et al., 2016; SONON et al., 2018).

HLA-G genetic diversity among Brazilians

We detected 27 different HLA-G CDS sequences in this sample (Table 3), 20 of them compatible with known HLA-G sequences and with a summed frequency of 99.51%. The list of variants in exons, introns, and regulatory segments, and their frequencies, is available in Table S2. The most frequent CDS alleles were G*01:01:01 (42.75%) and G*01:01:02 (17.18%). When the alleles in Table 3 are converted into protein sequences, there are nine variants. The most frequent was G*01:01 (68.09%), followed by G*01:04 (15.94%), G*01:03 (8.04%), G*01:06 (5.02%), G*01:05N (truncated protein, 2.11%), and four other rare molecules (G*01:08, G*01:09, G*01:11, and G*01:21N). In terms of protein and nucleotide diversity, HLA-G is more polymorphic than HLA-F.

When the genomic sequence is taken into account (exons + introns), we found only 27 alleles compatible with known sequences, as described in Table 4, with a summed frequency of 96.71%. The new genomic sequences are mostly related to rare intronic variants, but some are even more frequent than alleles reported in Table 4. The most frequent HLA-G genomic sequences were G*01:01:01:01 (24.62%) and G*01:01:02:01 (16.16%). This large sample confirms the existence of rare HLA-G alleles already reported in the IPD-IMGT/HLA database, such as G*01:01:22:01, G*01:01:19, G*01:09, and G*01:21N.
Although not shown in this study, the HLA-G promoter and 3’UTR haplotypes, and their association with coding alleles, are compatible with the ones previously described for other Brazilian and African samples (CASTELLI et al., 2014, 2017; SONON et al., 2018).

HLA-A genetic diversity among Brazilians

HLA-A is one of the most polymorphic genes in the human genome, with thousands of alleles reported in the IPD-IMGT/HLA database. In this large Brazilian sample, we found 70 different CDS sequences and one new and rare sequence (Table 5). There were copies of each allele group, including rare ones such as A*80. The list of variants is described in Table S3. The most frequent CDS sequences were A*02:01:01, A*24:02:01, A*03:01:01, and A*01:01:01, but at least 19 sequences present a frequency higher than 1%. At the protein level, this Brazilian sample encodes 59 different HLA-A proteins.

When the genomic sequence is taken into account (exons + introns), we found only 96 alleles compatible with known sequences, as described in Table 6, with a summed frequency of 98.71%. The most frequent HLA-A genomic sequences were A*02:01:01:01 (21.08%), A*01:01:01:01 (8.61%), and A*24:02:01:01 (8.23%), but many alleles present high frequencies. Here we confirm the presence of uncommon alleles such as A*23:17:01:01 and A*80:01:01:02, all presenting many copies in our sample, and also the existence of very rare HLA-A alleles such as A*24:104, A*30:29, and A*68:17:01. Although not shown in this study, the HLA-A promoter and 3’UTR haplotypes, and their association with coding alleles, are compatible with the ones previously described for another Brazilian sample (LIMA et al., 2019).

Linkage Disequilibrium across the MHC alpha block

We observed high linkage disequilibrium across the MHC alpha block considering 638 biallelic SNPs and 3 structural variants: AluyHF, AluyHG, and esv3608493. There were 3 large segregation blocks (Figure 2).
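For readers unfamiliar with the LD statistics underlying plots such as Figure 2, the sketch below computes D, D', and r² for one pair of biallelic markers from phased haplotypes. It shows the textbook definitions only and is not the Haploview implementation; the 0/1 allele coding, the function name, and the toy data are assumptions.

```python
def pairwise_ld(haplotypes):
    """Return (D, D', r2) for two biallelic markers.

    `haplotypes` is a list of (a, b) tuples, one per chromosome, with each
    allele coded 0 or 1 (hypothetical coding, e.g. 1 = Alu insertion present).
    """
    n = len(haplotypes)
    p_a = sum(a for a, _ in haplotypes) / n
    p_b = sum(b for _, b in haplotypes) / n
    p_ab = sum(1 for a, b in haplotypes if a == 1 and b == 1) / n
    d = p_ab - p_a * p_b
    denom = p_a * (1 - p_a) * p_b * (1 - p_b)
    if denom == 0:
        raise ValueError("both markers must be polymorphic")
    # D' rescales |D| by its maximum possible value given the allele frequencies.
    if d >= 0:
        d_max = min(p_a * (1 - p_b), (1 - p_a) * p_b)
    else:
        d_max = min(p_a * p_b, (1 - p_a) * (1 - p_b))
    d_prime = abs(d) / d_max if d_max > 0 else 0.0
    r2 = d * d / denom
    return d, d_prime, r2

# Toy data: complete association versus no association.
complete = [(1, 1)] * 5 + [(0, 0)] * 5
independent = [(1, 1), (1, 0), (0, 1), (0, 0)]
```

Complete association between the two markers yields D' = 1 and r² = 1, whereas the balanced four-haplotype example yields D = 0.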
The first one encompasses the Alu insertion AluyHF and most of the HLA-F variants, corresponding to a segment of 13 Kb. The LD between AluyHF and non-HLA-F variants is low. The second block encompasses a few 3’ HLA-F variants and all HLA-G SNPs up to AluyHG and the CNV esv3608493, in a segment of 104 Kb. The transition between blocks coincides with these two structural variants. The third block, of 114 Kb, includes esv3608493 and the majority of the HLA-A variants. Because of this LD pattern, we opted to track the association between the AluyHF insertion and HLA-F alleles, between the AluyHG insertion and HLA-G alleles, and between esv3608493 and HLA-A alleles. In addition, since there are many HLA-G/HLA-A pairs with high LD, we also tracked the association between HLA-G and HLA-A alleles.

AluyHF and HLA-F haplotypes

The frequency of the AluyHF insertion was 16.08%, mostly associated with the alleles F*01:01:01:08 and F*01:01:01:18 (Table 7). Nevertheless, many copies of both F*01:01:01:08 and F*01:01:01:18 are associated with the absence of the AluyHF insertion. Thus, although there is high LD between this Alu element and HLA-F (Figure 2), this Alu element cannot be used to track specific HLA-F variants.

AluyHG and HLA-G haplotypes

The frequency of the AluyHG insertion was 32.39%, mostly associated with the allele G*01:01:01:01 and with other uncommon HLA-G alleles derived from G*01:01:01:01, such as G*01:01:01:08 (Table 8).

The CNV nsv823470 and HLA-A haplotypes

The loss of the 60 Kb segment related to CNV esv3608493 has a frequency of 15.77% in our Brazilian sample. All A*23 and A*24 alleles described in Table 6 (and only them) are associated with the loss of this segment.

HLA-G and HLA-A haplotypes

There were 159 HLA-G/HLA-A haplotypes, 60 of them occurring just once and 13 occurring twice. The most frequent haplotype was G*01:01:01:01 / A*02:01:01:01, but at least 21 haplotypes present a frequency higher than 1%. The full table of haplotypes is available as Table S4.
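Summary counts such as those reported above (number of distinct haplotypes, haplotypes seen once or twice, and haplotypes above 1%) can be obtained by simple tallying once each chromosome carries a haplotype label. The following sketch assumes a flat list of labels; the function name and the toy list are hypothetical and do not reproduce the study data.

```python
from collections import Counter

def summarize_haplotypes(haplotypes, min_freq=0.01):
    """Summarize a list of haplotype labels, one label per chromosome."""
    if not haplotypes:
        raise ValueError("at least one haplotype is required")
    counts = Counter(haplotypes)
    total = sum(counts.values())
    return {
        "distinct": len(counts),
        "once": sum(1 for c in counts.values() if c == 1),
        "twice": sum(1 for c in counts.values() if c == 2),
        "above_min_freq": sorted(h for h, c in counts.items() if c / total > min_freq),
        "most_frequent": counts.most_common(1)[0][0],
    }

# Toy input (hypothetical counts, not the study data).
toy = (["G*01:01:01:01/A*02:01:01:01"] * 5
       + ["G*01:06/A*01:01:01:01"] * 2
       + ["G*01:05N/A*30:01:01:01"])
summary = summarize_haplotypes(toy)
```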
Among the associations in which one HLA-G allele is mostly associated with the same HLA-A allele or allele group, with minor deviations, we may list: G*01:01:01:04 with A*29/A*74, G*01:01:01:05 with A*03, G*01:01:01:06 with A*80, G*01:01:01:08 with A*30:02, G*01:01:02:02 with A*66, G*01:01:03 with A*11, G*01:01:15 with A*03:01:01:15, G*01:01:17 with A*32:01:01:01, G*01:04:04 with A*23, the null allele G*01:05N with A*30:01:01:01, G*01:06 with A*01:01:01:01, and others.

MHC alpha block common haplotypes

Here we evaluated the common haplotypes observed when considering the genomic sequences of HLA-F, HLA-G, and HLA-A. Since HLA-F and HLA-A are 200 Kb apart (Figure 1), with weak LD between them (Figure 2), we found 268 extended haplotypes, 120 of them occurring just once and 66 occurring twice. The most frequent haplotype was F*01:01:01:01 / G*01:01:01:01 / A*02:01:01:01, with a frequency of 10.29%. Table S5 describes all the haplotypes and their frequencies.

2.4. Discussion

Here we evaluated the genetic diversity of the HLA alpha block in highly admixed samples from Brazil, focusing on the HLA-F, HLA-G, and HLA-A loci, and three structural variants: AluyHF, AluyHG, and esv3608493. Since the HLA region is prone to mapping and genotyping errors because of NGS short reads (BRANDT et al., 2015; CASTELLI et al., 2018), we used the hla-mapper software to optimize read mapping as described elsewhere (CASTELLI et al., 2018; LIMA et al., 2019).

This is the largest survey of HLA-F genetic variability so far. HLA-F diversity has been poorly explored in worldwide populations, with a smaller and different Brazilian sample and a sample from Benin having been evaluated using a similar methodology, describing many new and frequent HLA-F alleles (LIMA et al., 2016; SONON et al., 2018) that were subsequently confirmed by independent analysis (PAGANINI et al., 2019).
This large admixed Brazilian sample confirms that HLA-F is indeed conserved, with 82.65% of the chromosomes encoding the same HLA-F protein and only one additional frequent protein (Tables 1 and 2). Moreover, many alleles that were recently described, such as F*01:01:02:09, are quite frequent among Brazilians, reaching nearly 5%. It should be mentioned that many HLA-F alleles differ only by the number of TG repeats in a Short Tandem Repeat (STR) located in the 3’UTR region (LIMA et al., 2016; PAGANINI et al., 2019b). Because of that, the frequencies displayed in Table 2 might be biased towards specific allele sizes predicted by the GATK HaplotypeCaller. The high number of unknown HLA-F genomic sequences (Table 2) might be a consequence of this issue. Only cloning and Sanger sequencing would confirm the presence of each of those alleles and their frequencies, which is neither practical nor cost-effective for this sample size. Nevertheless, this issue does not affect the results in Table 1, which shows the frequency of each exonic sequence found in Brazil.

Since Brazilians are highly admixed, with major European, African, and Amerindian contributions, but also with recent Asian contributions in the last century, we may expect to find many different alleles in Brazil. However, this was not the case for HLA-F. Considering the exonic sequences, we found only three frequent ones (Table 1), two of them differing only by synonymous mutations (F*01:01:01 and F*01:01:02). Recently, we described F*01:01:02 as a high-expressing allele because it carries a unique and highly divergent promoter sequence (BUTTURA et al., 2019), and the frequency of this allele in Brazil was 17.65%. Although we did not present data for the HLA-F regulatory sequences, we have included promoter and 3’UTR variants in this analysis (Table S1), and this sample presents the same pattern observed previously (LIMA et al., 2016; SONON et al., 2018).
We also noticed that at least 1.58% of the chromosomes present an unknown HLA-F CDS sequence, and some of them encode different HLA-F molecules. Thus, considering the genomic and CDS sequences, and in spite of the HLA-F conservation, HLA-F diversity is greater than the one reported in the IPD-IMGT/HLA database. We will clone and sequence these new sequences shortly.

The only structural variant presenting LD with HLA-F was AluyHF (Figure 2). The insertion was associated mainly with alleles F*01:01:01:08 and F*01:01:01:18, but other copies of these alleles occur in the absence of AluyHF. Therefore, AluyHF does not mark the presence of any specific allele. Nevertheless, the STR issue might be influencing these results.

HLA-G is the most studied non-classical HLA gene. Many manuscripts have addressed HLA-G diversity in worldwide populations (ALVAREZ et al., 2009; ARNAIZ-VILLENA et al., 2018; CARLINI et al., 2013; CASTELLI et al., 2014, 2017; EMMERY et al., 2017; GINEAU et al., 2015; MENDES-JUNIOR et al., 2013; NILSSON et al., 2018; OLIVEIRA et al., 2018a; RIBEYRE et al., 2018; SABBAGH et al., 2014; SANTOS et al., 2013; SONON et al., 2018; SVENDSEN et al., 2018; ZIDI et al., 2016), and many others have addressed its association with diseases. However, here we present the largest dataset of HLA-G variants in a population-based study of admixed samples. This Brazilian sample presented many HLA-G genomic sequences, with frequencies compatible with previous studies (CASTELLI et al., 2017; CASTELLI; MENDES; DONADI, 2007). In this sample we found many rare alleles such as G*01:09 and G*01:21N (Tables 3 and 4), but the protein polymorphism is still restricted to five frequent versions: G*01:01, G*01:03, G*01:04, G*01:05N, and G*01:06, with G*01:01 encoded by 68.09% of the chromosomes.
However, these nucleotide exchanges seem to modify the main HLA-G features, including peptide binding, membrane expression, and binding to ILT and KIR receptors (DONADI et al., 2011; UROSEVIC; TROJAN; DUMMER, 2002). Interestingly, there was one individual homozygous for the truncated G*01:05N allele, as already reported in other populations (CASTRO et al., 2000), demonstrating that there may be other immunomodulatory mechanisms allowing the survival of individuals not producing the membrane-bound full-length HLA-G molecule, or that the soluble isoforms are sufficient for HLA-G function (LE DISCORDE et al., 2005).

The AluyHG insertion occurs about 20 Kb downstream of HLA-G and is included in the second large segregation block observed in Figure 2. This insertion is quite frequent in worldwide populations and has been previously associated with the haplotype PROMO-G010101a/G*01:01:01:01/UTR-1 (SANTOS et al., 2013), which in turn is considered a high-expressing haplotype (CARLINI et al., 2013; DURMANOVA et al., 2019; MARTELLI-PALOMINO et al., 2013; PORAS et al., 2017). Our data confirm that 84% of the chromosomes carrying the AluyHG insertion also carry G*01:01:01:01. Moreover, the previous association of G*01:01:01:01 with PROMO-G010101a and UTR-1 is maintained, together with all other associations previously described in worldwide populations (reviewed in CASTELLI et al., 2014, 2017; DONADI et al., 2011).

The admixed nature of our sample may also explain the high number of different HLA-A genomic alleles and HLA-A proteins observed (Tables 5 and 6). We found at least one copy of each HLA-A allele group, including some rare ones such as A*80, with the exception of A*43. This is not a methodological issue; it reflects the low frequency of A*43 around the world, with an estimated frequency in Brazil of 0.007% according to the Brazilian Donor Registry (REDOME) (www.allelefrequencies.net).
Since there are thousands of HLA-A alleles described in the IPD-IMGT/HLA database, with mutations covering most of the HLA-A sequence, and since many studies have evaluated HLA-A diversity because of its importance for transplantation outcome, we did not find many new HLA-A alleles in our sample. New genomic alleles represent only 1.28% of the sequences (Table 6), most of them carrying intronic mutations and all encoding known proteins. The number of new CDS sequences (all encoding known proteins) was quite low, only 0.42%. Considering this sample size, and the fact that this population presents European, African, Amerindian, and Asian components, it is possible that the IPD-IMGT/HLA database now covers most of the HLA-A genetic diversity observed in worldwide populations, at least at the protein level (Table 5).

The CNV esv3608493 represents a deletion of about 60 Kb that includes the HLA-H pseudogene and the HCG4B gene, which encodes an ncRNA, about 5 Kb upstream of HLA-A. This CNV has been described as influencing HLA-A expression (CHEN et al., 2017; OLIVEIRA et al., 2018a; PNG et al., 2019). Here we tracked this CNV by evaluating read depth at the HLA-H gene using hla-mapper. This CNV has been explored recently as allele H*Del (CARLINI et al., 2016; PAGANINI et al., 2019), with a frequency of 17% in the 1000 Genomes sample, which is compatible with the one observed in Brazil (15.77%). It should be mentioned that the previous study does not include Brazilian samples. Because of that, and combining the sample sizes, we have more than 3,500 individuals from worldwide populations confirming that the CNV loss is related only to HLA-A*23 and A*24 alleles. These HLA-A alleles cluster together because they present similar coding sequences and identical regulatory sequences (LIMA et al., 2019).
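The read-depth strategy mentioned above for tracking esv3608493 can be illustrated with a minimal sketch that compares the mean depth over the target (HLA-H) with the mean depth over a diploid control region. The thresholds (0.25 and 0.75), the function name, and the choice of a control region are assumptions for illustration; they are not the parameters used in this study or by hla-mapper.

```python
def copy_number_from_depth(target_depth, control_depth, low=0.25, high=0.75):
    """Classify the number of copies (0, 1, or 2) of a segment from read depth.

    `target_depth` is the mean depth over the segment of interest and
    `control_depth` the mean depth over a region assumed to be diploid.
    The `low` and `high` ratio cut-offs are hypothetical.
    """
    if control_depth <= 0:
        raise ValueError("control depth must be positive")
    ratio = target_depth / control_depth
    if ratio < low:
        return 0  # homozygous deletion
    if ratio < high:
        return 1  # heterozygous deletion
    return 2      # no deletion

# Toy depths (hypothetical values).
calls = [copy_number_from_depth(t, 40.0) for t in (0.0, 20.0, 41.0)]
```

In practice, the cut-offs would need to be calibrated on the depth distribution of the sequencing batch, since capture efficiency and mapping quality vary across the HLA region.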
HLA-A*24 (and probably A*23) is a high-expressing lineage because of many mechanisms, including a low level of methylation (RAMSURAN et al., 2015, 2017). Moreover, a previous study demonstrated that the promoters of A*23 and A*24 present fewer CpG sites (LIMA et al., 2019), and the CNV loss may further reduce the number of CpG sites upstream of HLA-A, favoring HLA-A expression. Thus, like AluyHG, esv3608493 marks the presence of high-expressing alleles. This CNV is also related to G*01:04 alleles, since 86.13% of the chromosomes with the CNV loss also present a G*01:04 allele, configuring two frequent haplotypes, G*01:04:01/CNVloss/A*24 and G*01:04:04/CNVloss/A*23. Other low-frequency combinations such as G*01:04:04/A*36:01 do not present the CNV loss (Table S5).

There are previous studies addressing the relationship between HLA-F, HLA-G, and HLA-A alleles, but using low-resolution typing methods (CARLINI et al., 2016; PAGANINI et al., 2019). Because of that, we will not compare our results with the previous ones, although they are similar when we consider the same resolution level. Our study aimed to measure the linkage disequilibrium between genes and markers in the HLA alpha block and to evaluate the extended haplotypes. The association of these markers with nearby SNPs can reveal large blocks of segregation between neighboring loci, as was observed for each structural variant and its neighboring genes (Figure 2). Because of the distance between HLA-F and HLA-A (Figure 1) and the presence of many recombination sites in this segment (MIRETTI et al., 2005), we detected several extended haplotypes, as described in Table S5. However, there was a close relationship between HLA-G and HLA-A, which are 100 Kb apart (Table S4). There are many frequent haplotypes, and some HLA-G alleles (as discussed for G*01:04 earlier) follow specific HLA-A alleles.
Other examples include the association of G*01:06 with A*01:01:01:01, G*01:01:02:02 with A*66:01:01:01, and G*01:05N with A*30:01:01:01 or A*30:95 (Table S4).

HLA-A is a well-known target of natural selection (BITARELLO; FRANCISCO; MEYER, 2016; DOS SANTOS FRANCISCO et al., 2015; MEYER; THOMSON, 2001), considering that balancing selection has been shaping the HLA-A coding variability (MACK et al., 2009; SOLBERG et al., 2008; TU et al., 2007), increasing heterozygosity and thereby improving antigen presentation. Because of that, high heterozygosity is expected for HLA-A and was indeed observed in our sample, with a positive and significant Tajima’s D considering the entire HLA-A locus (Tajima’s D = 2.9005). Balancing selection has already been reported for HLA-G, especially at the regulatory segments (promoter and 3’UTR) (CASTELLI et al., 2011; DONADI et al., 2011; GINEAU et al., 2015; TAN; SHON; OBER, 2005). However, there is no strong functional evidence that different HLA-G regulatory regions are indeed associated with different HLA-G expression profiles, other than a SNP at position -725 (OBER et al., 2006).

Our data demonstrate a close relationship between HLA-G and HLA-A (Table S4). Therefore, natural selection acting on one locus will probably influence frequencies at the other locus, as a hitchhiking effect. Because of that, high heterozygosity in HLA-A may lead to high heterozygosity in HLA-G, and vice versa. For HLA-G, this high heterozygosity would be detected in intronic segments because most of the HLA-G alleles differ only by intronic mutations. Using a 150-bp window approach to calculate Tajima’s D across HLA-G, we detected Tajima’s D over 3.00 in just one promoter region, coinciding with position -725, and in introns 1, 4, and 5, while exonic segments and most of the promoter sequence presented negative or low positive values (data not shown).
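A windowed Tajima’s D calculation such as the one described above can be sketched as follows. The code implements the standard Tajima (1989) estimator on aligned haplotype sequences and is not the software used in this study; the function names, the non-overlapping 150-bp step, and the toy sequences are assumptions.

```python
from collections import Counter
from math import sqrt

def tajimas_d(seqs):
    """Tajima's D for equal-length aligned sequences; None when undefined."""
    n = len(seqs)
    if n < 4:
        return None  # the variance term is zero for n < 4
    seg_sites = 0
    pi = 0.0
    for column in zip(*seqs):
        counts = Counter(column)
        if len(counts) > 1:
            seg_sites += 1
            # Fraction of sequence pairs that differ at this site.
            pi += (n * n - sum(c * c for c in counts.values())) / (n * (n - 1))
    if seg_sites == 0:
        return None
    a1 = sum(1 / i for i in range(1, n))
    a2 = sum(1 / i**2 for i in range(1, n))
    b1 = (n + 1) / (3 * (n - 1))
    b2 = 2 * (n * n + n + 3) / (9 * n * (n - 1))
    c1 = b1 - 1 / a1
    c2 = b2 - (n + 2) / (a1 * n) + a2 / a1**2
    e1 = c1 / a1
    e2 = c2 / (a1**2 + a2)
    variance = e1 * seg_sites + e2 * seg_sites * (seg_sites - 1)
    return (pi - seg_sites / a1) / sqrt(variance)

def windowed_tajimas_d(seqs, window=150, step=150):
    """Return (start, D) for consecutive windows along the alignment."""
    length = len(seqs[0])
    return [(start, tajimas_d([s[start:start + window] for s in seqs]))
            for start in range(0, length, step)]

# Toy alignment: intermediate-frequency variants in the first window only.
toy = ["ATAA", "ATAA", "TAAA", "TAAA"]
toy_windows = windowed_tajimas_d(toy, window=2, step=2)
```

Windows dominated by intermediate-frequency variants give positive values, as expected under balancing selection or hitchhiking with a balanced locus, whereas windows dominated by singletons give negative values.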
Except for position -725, the full HLA-G promoter and 3’UTR presented a Tajima’s D of 0.4834 and 0.3953, respectively. Since most of the HLA-G alleles differ only in intron sequences, and different intron sequences are associated with specific HLA-A allele groups, high heterozygosity in HLA-A would lead to high heterozygosity in the HLA-G introns as a hitchhiking effect. For instance, G*01:01:01:04 and G*01:01:01:05 are one mutation apart, and the former is associated with HLA-A*29 and HLA-A*74, while the latter is associated with HLA-A*03 (Table S4). However, we cannot rule out balancing selection at HLA-G position -725, since it was the second strongest signal observed across HLA-G after intron 5.

In conclusion, here we report the full variability of HLA-F, HLA-G, and HLA-A in a large admixed Brazilian sample and the relationship among the observed alleles, together with their associations with three structural variants. The Alu element AluyHG and the CNV esv3608493 are both associated with high-expressing alleles, the first with G*01:01:01:01 and the second with A*23 and A*24. Our data suggest a close relationship between HLA-A and HLA-G, with both loci showing evidence of balancing selection (intronic segments for HLA-G, the entire locus for HLA-A). Thus, these phenomena might not be independent, and the observations for HLA-G might be a hitchhiking effect of selection on HLA-A. This is the largest dataset of variants and full-resolution HLA alleles (in the alpha block) for an admixed sample, providing a valuable resource of variants, frequencies, and haplotypes in HLA genes of admixed samples for clinical and evolutionary studies.

Figure 3. A map of the human MHC alpha block and the genes and structural variants evaluated in a large admixed Brazilian sample from São Paulo.

Figure 4. Linkage disequilibrium across the alpha block, considering 638 biallelic SNPs and the structural variants AluyHF, AluyHG, and nsv823470.

Table 1.
HLA-F CDS alleles in an admixed sample from São Paulo (city), Brazil.

HLA-F CDS sequence (a)    Frequency (2n = 2,646)
F*01:01:01                0.6481
F*01:01:02                0.1765
F*01:03:01                0.1493
F*01:01:04                0.0011
F*01:01:05                0.0008
F*01:02                   0.0026
F*01:04:01                0.0057
unknown                   0.0158
(a) Alleles are listed in decreasing order of frequency.

Table 2. HLA-F genomic alleles in an admixed sample from São Paulo (city), Brazil.

HLA-F genomic sequence (allele) (a)    Frequency (2n = 2,646)
F*01:01:01:09                          0.1716
F*01:01:01:08                          0.1584
F*01:01:01:01                          0.1534
F*01:03:01:01                          0.0835
F*01:01:01:18                          0.0608
F*01:01:01:17                          0.0537
F*01:03:01:03                          0.0518
F*01:01:02:09                          0.0480
F*01:01:02:08                          0.0393
F*01:01:02:10                          0.0393
F*01:01:02:07                          0.0215
F*01:01:02:11                          0.0185
F*01:01:01:11                          0.0166
F*01:04:01:02                          0.0057
F*01:01:01:13                          0.0023
F*01:01:04                             0.0011
F*01:03:01:04                          0.0011
F*01:01:05                             0.0008
F*01:01:01:19                          0.0004
unknown                                0.0722
(a) Alleles are listed in decreasing order of frequency.

Table 3. HLA-G CDS alleles in an admixed sample from São Paulo (city), Brazil.

HLA-G CDS sequence (a)    Frequency (2n = 2,646)
G*01:01:01                0.4275
G*01:01:02                0.1718
G*01:04:01                0.0921
G*01:03:01                0.0804
G*01:04:04                0.0650
G*01:01:03                0.0521
G*01:06                   0.0502
G*01:05N                  0.0211
G*01:01:22                0.0098
G*01:01:17                0.0049
G*01:01:15                0.0042
G*01:01:14                0.0038
G*01:01:09                0.0026
G*01:01:19                0.0023
G*01:04:05                0.0023
G*01:01:12                0.0019
G*01:08:02                0.0011
G*01:11                   0.0011
G*01:09                   0.0004
G*01:21N                  0.0004
unknown                   0.0049
(a) Alleles are listed in decreasing order of frequency.

Table 4. HLA-G genomic alleles in an admixed sample from São Paulo (city), Brazil.

HLA-G genomic sequence (allele) (a)    Frequency (2n = 2,646)
G*01:01:01:01                          0.2462
G*01:01:02:01                          0.1616
G*01:04:01:01                          0.0876
G*01:01:01:05                          0.0865
G*01:03:01:02                          0.0789
G*01:04:04                             0.0642
G*01:01:03:03                          0.0517
G*01:06                                0.0502
G*01:01:01:04                          0.0468
G*01:01:01:08                          0.0230
G*01:05N                               0.0200
G*01:01:22:01                          0.0098
G*01:01:02:02                          0.0060
G*01:01:01:06                          0.0053
G*01:01:17                             0.0049
G*01:01:15                             0.0042
G*01:01:14                             0.0038
G*01:01:01:09                          0.0034
G*01:04:01:02                          0.0030
G*01:01:19                             0.0023
G*01:04:05                             0.0023
G*01:01:12                             0.0019
G*01:08:02                             0.0011
G*01:11                                0.0011
G*01:01:03:04                          0.0004
G*01:09                                0.0004
G*01:21N                               0.0004
unknown                                0.0329
(a) Alleles are listed in decreasing order of frequency.

Table 5. HLA-A CDS alleles in an admixed sample from São Paulo (city), Brazil.

HLA-A CDS sequence (a)    Frequency (2n = 2,646)
A*02:01:01                0.2139
A*24:02:01                0.0915
A*03:01:01                0.0907
A*01:01:01                0.0869
A*11:01:01                0.0522
A*23:01:01                0.0518
A*31:01:02                0.0378
A*29:02:01                0.0374
A*32:01:01                0.0344
A*26:01:01                0.0295
A*30:02:01                0.0295
A*33:01:01                0.0219
A*68:02:01                0.0212
A*68:01:02                0.0200
A*30:01:01                0.0193
A*68:01:01                0.0166
A*02:05:01                0.0155
A*74:01:01                0.0113
A*33:03:01                0.0102
A*25:01:01                0.0094
A*02:02:01                0.0091
A*02:11:01                0.0087
A*66:01:01                0.0079
A*24:03:01                0.0076
A*34:02:01                0.0068
A*23:17:01                0.0064
A*03:02:01                0.0042
A*36:01                   0.0042
A*29:01:01                0.0038
A*80:01:01                0.0038
A*02:04                   0.0034
A*26:08:01                0.0034
A*02:06:01                0.0030
A*30:04:01                0.0023
A*30:95                   0.0015
A*01:02                   0.0011
A*26:02:01                0.0011
A*26:03:01                0.0011
A*31:15                   0.0011
A*66:02                   0.0011
A*02:09:01                0.0008
A*02:12                   0.0008
A*11:12                   0.0008
A*24:02:13                0.0008
A*33:03:23                0.0008
A*69:01:01                0.0008
A*01:03:01                0.0004
A*02:01:04                0.0004
A*02:01:15                0.0004
A*02:07:01                0.0004
A*02:14                   0.0004
A*02:17:02                0.0004
A*02:20:01                0.0004
A*02:221                  0.0004
A*02:334                  0.0004
A*02:52                   0.0004
A*02:66                   0.0004
A*11:02:01                0.0004
A*23:05                   0.0004
A*24:104                  0.0004
A*24:218                  0.0004
A*29:02:07                0.0004
A*30:29                   0.0004
A*31:01:13                0.0004
A*31:04:01                0.0004
A*32:02                   0.0004
A*68:01:24                0.0004
A*68:02:02                0.0004
A*68:17:01                0.0004
unknown                   0.0042
(a) Alleles are listed in decreasing order of frequency.

Table 6. HLA-A genomic alleles in an admixed sample from São Paulo (city), Brazil.

HLA-A genomic sequence (a)    Frequency (2n = 2,646)
A*02:01:01:01                 0.2109
A*01:01:01:01                 0.0862
A*24:02:01:01                 0.0824
A*03:01:01:01                 0.0658
A*11:01:01:01                 0.0518
A*23:01:01:01                 0.0514
A*31:01:02:01                 0.0348
A*32:01:01:01                 0.0336
A*26:01:01:01                 0.0280
A*29:02:01:01                 0.0276
A*03:01:01:05                 0.0215
A*33:01:01:01                 0.0208
A*68:02:01:01                 0.0197
A*30:01:01:01                 0.0193
A*30:02:01:01                 0.0185
A*68:01:01:02                 0.0163
A*02:05:01:01                 0.0147
A*68:01:02:01                 0.0117
A*74:01:01:01                 0.0113
A*25:01:01:01                 0.0091
A*02:11:01:01                 0.0087
A*29:02:01:02                 0.0087
A*33:03:01:01                 0.0079
A*66:01:01:01                 0.0076
A*68:01:02:02                 0.0076
A*24:03:01:01                 0.0068
A*34:02:01:01                 0.0068
A*23:17:01:01                 0.0064
A*02:02:01:01                 0.0057
A*30:02:01:02                 0.0057
A*03:02:01                    0.0042
A*24:02:01:04                 0.0042
A*30:02:01:03                 0.0042
A*36:01                       0.0042
A*29:01:01:01                 0.0038
A*80:01:01:02                 0.0038
A*02:04                       0.0034
A*26:08:01:01                 0.0034
A*02:02:01:02                 0.0030
A*02:06:01:01                 0.0030
A*24:02:01:05                 0.0026
A*31:01:02:04                 0.0026
A*30:04:01:01                 0.0023
A*33:03:01:03                 0.0023
A*03:01:01:07                 0.0019
A*02:01:01:05                 0.0015
A*03:01:01:03                 0.0015
A*30:95                       0.0015
A*01:02                       0.0011
A*26:02:01                    0.0011
A*26:03:01:01                 0.0011
A*29:02:01:03                 0.0011
A*31:15                       0.0011
A*66:02                       0.0011
A*68:02:01:03                 0.0011
A*02:01:01:08                 0.0008
A*02:09:01:01                 0.0008
A*02:12                       0.0008
A*11:12                       0.0008
A*24:02:13                    0.0008
A*26:01:01:02                 0.0008
A*33:03:23                    0.0008
A*68:01:02:04                 0.0008
A*69:01:01:01                 0.0008
A*01:01:01:23                 0.0004
A*01:03:01:02                 0.0004
A*02:01:01:71                 0.0004
A*02:01:04                    0.0004
A*02:01:15                    0.0004
A*02:02:01:04                 0.0004
A*02:07:01:01                 0.0004
A*02:14                       0.0004
A*02:17:02:01                 0.0004
A*02:20:01                    0.0004
A*02:221                      0.0004
A*02:334                      0.0004
A*02:52                       0.0004
A*02:66                       0.0004
A*11:01:01:06                 0.0004
A*11:02:01:01                 0.0004
A*23:05                       0.0004
A*24:02:01:12                 0.0004
A*24:02:01:36                 0.0004
A*24:03:01:02                 0.0004
A*24:104                      0.0004
A*24:218                      0.0004
A*26:01:01:13                 0.0004
A*29:02:07                    0.0004
A*30:02:01:05                 0.0004
A*30:29                       0.0004
A*31:01:13                    0.0004
A*31:04:01:02                 0.0004
A*32:01:01:09                 0.0004
A*32:02                       0.0004
A*68:02:02                    0.0004
A*68:17:01                    0.0004
unknown                       0.0128
(a) Alleles are listed in decreasing order of frequency.

Table 7. HLA-F genomic alleles and the AluyHF haplotypes in an admixed sample from São Paulo (city), Brazil.

HLA-F genomic sequence (allele)    AluyHF     Frequency (2n = 2,612)
F*01:01:01:01                      Absent     0.1531
F*01:01:01:01                      Present    0.0004
F*01:01:01:08                      Present    0.1325
F*01:01:01:08                      Absent     0.0245
F*01:01:01:09                      Absent     0.1723
F*01:01:01:11                      Absent     0.0168
F*01:01:01:13                      Absent     0.0019
F*01:01:01:17                      Absent     0.0536
F*01:01:01:17                      Present    0.0004
F*01:01:01:18                      Absent     0.0410
F*01:01:01:18                      Present    0.0195
F*01:01:01:19                      Absent     0.0004
F*01:01:01:21                      Absent     0.0004
F*01:01:02:07                      Absent     0.0214
F*01:01:02:08                      Absent     0.0394
F*01:01:02:09                      Absent     0.0482
F*01:01:02:10                      Absent     0.0356
F*01:01:02:10                      Present    0.0027
F*01:01:02:11                      Absent     0.0188
F*01:01:04                         Present    0.0011
F*01:01:05                         Present    0.0008
F*01:03:01:01                      Absent     0.0850
F*01:03:01:03                      Absent     0.0521
F*01:03:01:04                      Absent     0.0011
F*01:04:01:02                      Absent     0.0057
F*01:05                            Absent     0.0004
unknown                            -          0.0708

Table 8. HLA-G genomic alleles and the AluyHG haplotypes in an admixed sample from São Paulo (city), Brazil.

HLA-G genomic sequence    AluyHG     Frequency (2n = 2,612)
G*01:01:01:01             Present    0.2377
G*01:01:01:01             Absent     0.0092
G*01:01:01:04             Absent     0.0364
G*01:01:01:04             Present    0.0103
G*01:01:01:05             Absent     0.0869
G*01:01:01:05             Present    0.0004
G*01:01:01:06             Absent     0.0050
G*01:01:01:08             Present    0.0226
G*01:01:01:09             Present    0.0023
G*01:01:01:09             Absent     0.0004
G*01:01:02:01             Absent     0.1394
G*01:01:02:01             Present    0.0203
G*01:01:02:02             Absent     0.0061
G*01:01:03:03             Absent     0.0528
G*01:01:12                Present    0.0019
G*01:01:14                Absent     0.0038
G*01:01:15                Absent     0.0038
G*01:01:17                Present    0.0046
G*01:01:19                Absent     0.0019
G*01:01:22:01             Present    0.0092
G*01:01:22:01             Absent     0.0004
G*01:03:01:02             Absent     0.0789
G*01:03:01:02             Present    0.0004
G*01:04:01:01             Absent     0.0861
G*01:04:01:01             Present    0.0015
G*01:04:01:02             Absent     0.0031
G*01:04:04                Absent     0.0647
G*01:04:05                Absent     0.0023
G*01:05N                  Absent     0.0199
G*01:06                   Absent     0.0505
G*01:06                   Present    0.0004
G*01:08:02                Present    0.0011
G*01:09                   Absent     0.0004
G*01:11                   Absent     0.0011
G*01:21N                  Absent     0.0004
unknown                   -          0.0337

SUPPLEMENTARY MATERIAL

Table S1. List of variants observed across the HLA-F gene in 1323 Brazilians from São Paulo, and their frequencies.

hg38 chr6 position    SNPid           Reference    Reference frequency    Alternative 1    Alternative 1 frequency    Alternative 2    Alternative 2 frequency    Alternative 3    Alternative 3 frequency    Alternative 4    Alternative 4 frequency
29723031              rs981327114     A            0.9996                 AG               0.0004
29723061              rs17875378      G            0.9773                 A                0.0227
29723079              rs1035728209    A            0.9996                 C                0.0004
29723167              .               A            0.9996                 G                0.0004
29723226              rs56044823      C            0.9509                 T                0.0491
29723241              rs41559212      G            0.9996                 A                0.0004
29723242              rs1362126       G            0.6141                 A                0.3859
29723262              rs373360195     C            0.9981                 T                0.0019
29723270              rs888682815     C            0.9996                 T                0.0004
29723313              rs1362125       T            0.5412                 A                0.4588
29723320              rs2075682       A            0.8212                 T                0.1788
29723363              rs2072896       C            0.8212                 G                0.1788
29723392              rs191310737     G            0.9996                 A                0.0004
29723420              rs771810518     A            0.9996                 AT               0.0004
29723464              rs372241614     A            0.9996                 G                0.0004
29723478              rs750462794     C            0.9996                 T                0.0004
29723501              rs17875379      C            0.9974                 T                0.0026
29723526              rs2076183       G            0.8212                 A                0.1788
29723608              rs373876297