DESENVOLVIMENTO DE UMA FERRAMENTA COMPUTACIONAL AMIGÁVEL PARA ANÁLISE DE DADOS DE SEQUENCIAMENTO GENÔMICO DE BAIXA COBERTURA E APLICAÇÃO EM ESTUDO DE GENÉTICA DE POPULAÇÕES MARCUS VINICIUS NIZ ALVAREZ BOTUCATU – SP 2024 UNIVERSIDADE ESTADUAL PAULISTA “Júlio de Mesquita Filho” INSTITUTO DE BIOCIÊNCIAS DE BOTUCATU DESENVOLVIMENTO DE UMA FERRAMENTA COMPUTACIONAL AMIGÁVEL PARA ANÁLISE DE DADOS DE SEQUENCIAMENTO GENÔMICO DE BAIXA COBERTURA E APLICAÇÃO EM ESTUDO DE GENÉTICA DE POPULAÇÕES MARCUS VINICIUS NIZ ALVAREZ ORIENTADOR: PAULO EDUARDO MARTINS RIBOLLA Tese submetida ao Programa de Pós- Graduação em Ciências Biológicas (Genética) do Instituto de Biociências de Botucatu da Universidade Estadual Paulista “Júlio de Mesquita Filho” para obtenção do título de Doutor em Ciências Biológicas (Genética). BOTUCATU – SP 2024 DEDICATÓRIA Dedico este trabalho à minha mãe Cecília e meu pai Raul (in memoriam) que sempre acreditaram em mim e me apoiaram até nos momentos mais difíceis. Não há palavras para expressar o tamanho do meu amor e eterna gratidão por vocês. Pai, a vida te levou pouco antes deste momento, mas seu legado vive em mim, e é com determinação que carrego adiante seus ensinamentos e amor que permearam tua existência. À minha irmã Danielle Amanda que amo incondicionalmente. Você é a pessoa que mais confio neste mundo. O laço entre nós é o que me permite voar sem medo de cair. Sei que compartilharemos muitas risadas, conquistas e momentos juntos que estão por vir. À minha amada Isabelle Tenori, que ilumina minha vida a cada dia que passo com você. Seu sorriso faz meu mundo ficar mais bonito. Juntos, me sinto invencível, completo e profundamente feliz. Meu coração sempre será teu. Nosso amor transcende a compreensão. Aos meus avós Luzia, Maria, Milciades (in memoriam) e Raul (in memoriam). Em vocês, sempre encontrei sabedoria, apoio e aconchego. Cada lembrança com vocês é um tesouro inestimável que carregarei para sempre no meu coração. A toda minha família, tios John e Milciades Jr., tias Ana e Fernanda, primos Ítalo e Igor e prima Lara. Sou grato por todo o carinho e suporte que vocês me proporcionam. Aos meus amigos Filipe e Daniel que são como verdadeiros irmãos para mim. É como dizem: “Amizade verdadeira é raro encontrar, difícil de acabar e impossível de esquecer”. Aos cães e gatos que fizeram ou fazem parte da minha vida. “Casa cheia de pelo, coração cheio de amor”. AGRADECIMENTOS Agradeço ao programa “O Mundo de Beakman” e a TV Cultura por terem despertado o encanto pela ciência em mim, uma criança que estava apenas começando a entender as maravilhas ao seu redor. Agradeço aos meus colegas de laboratório do grupo Pangene, em especial ao Diego pelas imensas contribuições acadêmicas ao longo desses anos, mas também pelas valiosas conversas sobre altos e baixos da vida, intercaladas com muita nostalgia e videogames. Agradeço ao Samir não apenas pelas colaborações, mas também pelo compartilhamento de suas experiências de vida e conselhos, que foram fundamentais para me ajudar a trilhar meu caminho. Um agradecimento especial a minha amiga Lorena que desde o mestrado trouxe muita cooperatividade, boas risadas e grandes desabafos nos momentos mais difíceis dessa nossa jornada acadêmica. Agradeço ao meu orientador Prof. Dr. Paulo Ribolla, com todo meu respeito e gratidão. Obrigado por acreditar em mim, por ter compartilhado seu inestimável conhecimento comigo, mas também por ter me desafiado para que eu buscasse o meu próprio. Você é uma inspiração, mas também um grande amigo que sempre soube como iluminar meu caminho não só como pesquisador, mas como ser humano. Agradeço imensamente a UNESP por ter sido minha casa nos últimos 11 anos. Sinto orgulho e o privilégio ter me formado e amadurecido aqui. Agradeço a todos os Professores e Servidores e que contribuíram para minha formação acadêmica e possibilitaram meu sonho de ser um cientista. Agradeço a Coordenação de Aperfeiçoamento de Pessoal de Nível Superior (CAPES) pela bolsa concedida processo 88887.817610/2023-00. CAPES sempre promovendo e tornando possível o desenvolvimento científico do nosso país. EPÍGRAFE “Não são nossas habilidades que revelam quem realmente somos, são as nossas escolhas.” Alvo Dumbledore RESUMO Desenvolvimento de uma ferramenta computacional amigável para análise de dados de sequenciamento genômico de baixa cobertura e aplicação em estudo de genética de populações Os avanços nas tecnologias de sequenciamento genômico nos últimos anos têm reduzido consideravelmente o custo desta técnica. No entanto, projetos de larga escala continuam apresentando custo elevado, muitas vezes impraticável. Uma das alternativas é o sequenciamento genômico de baixa cobertura (lcWGS). Este estudo teve como objetivo avaliar o desempenho da genotipagem em diversos cenários diferentes usando um programa específico para dados de lcWGS que foi desenvolvido contendo pipeline completo, otimizado para a genotipagem por sequenciamento de baixa cobertura. Foram testados parâmetros como profundidade de sequenciamento e tamanho amostral para avaliação do desempenho da genotipagem desta ferramenta. Os resultados demonstraram que dados de lcWGS processados adequadamente tem confiabilidade satisfatória e grande potencial no processo de genotipagem em larga escala. Subsequentemente, a ferramenta desenvolvida intitulada LCSeqTools foi aplicada em um segundo estudo, no qual foram utilizadas amostras de Nyssorhynchus darlingi coletados no município de Mâncio Lima, estado do Acre, para detecção de inversões cromossômicas. O objetivo foi testar a capacidade e precisão de detecção de inversões cromossômicas utilizando dados de lcWGS, além de contribuir para melhor entendimento das inversões do genoma de Ny. darlingi, bem como a estrutura e dinâmica do genoma dessa espécie. Dez inversões cromossômicas foram detectadas no genoma de Ny. darlingi e a análise de genômica comparativa revelou nenhuma inversão em sintenia com inversões conhecidas de Anopheles gambiae. Palavras-chave: Genômica; Bioinformática; Genética de populações; Inversões cromossômicas; SNPs; Mosquito; ABSTRACT Development of a user-friendly computational tool for analyzing low-coverage genomic sequencing data and application in population genetics studies Advances in genomic sequencing technologies in recent years have significantly reduced the cost of this technique. However, large-scale projects still present high costs, often impractical. An alternative strategy is low-coverage whole-genome sequencing (lcWGS). This study aimed to evaluate the genotyping performance in various scenarios using a specific program for lcWGS data that was developed with a complete workflow optimized for low- coverage sequencing genotyping. Parameters such as sequencing depth and sample size were tested to evaluate the performance of this tool. The results demonstrated that properly processed lcWGS data have satisfactory genotyping reliability and great potential for large- scale genotyping studies. Subsequently, the developed tool, entitled LCSeqTools, was applied in a second study, in which samples of Nyssorhynchus darlingi collected in the municipality of Mâncio Lima, Acre state, were used to detect chromosome inversions. The objective was to test the capacity and accuracy of detecting chromosome inversions using lcWGS data, as well as to contribute to a better understanding of the inversions in the genome of Ny. darlingi, as well as the structure and dynamics of this species genome. Ten chromosome inversions were detected in the genome of Ny. darlingi, and comparative genomic analysis revealed no inversions in synteny with known inversions of Anopheles gambiae. Keywords: Genomics; Bioinformatics; Population genetics; Chromosome inversions; SNPs; Mosquito; LISTA DE ILUSTRAÇÕES Figura 1. Histórico do custo de sequenciamento de DNA ao longo dos anos..........................16 Figura 2. Representação esquemática de estratégias de sequenciamento genômico................17 Figura 3. Representação esquemática do processo de imputação.............................................18 Figura 4. Descrição das etapas de interação com programas de computador por linhas de comando (esquerda) ou interface gráfica (direita)....................................................................19 Figura 5. Número de mortes por malária reportado em 2022...................................................21 Figura 6. Classificação de Risco dos municípios segundo a incidência de Malária.................22 Figura 7. Ny. darlingi fêmea adulta.......................................................................................... 23 Figura 8. Representação esquemática de Inversões Cromossômicas........................................24 Figura 9. Representação esquemática do cariótipo de Ny. Darlingi.........................................25 CAPÍTULO 1 Figure 1. Contrast analysis for false positive SNP rate.............................................................35 Figure 2. Contrast analysis for MAF percentage change..........................................................36 Figure 3. Contrast analysis for Homozygosity percentage change...........................................37 Figure 4. Contrast analysis for Genotype concordance rate.....................................................38 CAPÍTULO 2 Figure 1. Chromosome Inversion Detection Analysis for Ny. darlingi dataset........................55 Figure 2. Inversion Genotyping Analysis................................................................................. 57 Figure 3. Chromosome Inversion Detection Analysis for An. gambiae dataset.......................58 Figure 4. Genome Synteny Analysis for Ny. darlingi with An. gambiae and An. albimanus..59 Figure 5. Chromosome Inversion Synteny Analysis................................................................60 APÊNDICE A Figure A1. Line graph for false positive and true positive SNP rate (CM dataset). Horizontal facets (legend on the right corner) represent sequencing depth................................................66 Figure A2. Line graph for false positive and true positive SNP rate (BF dataset). Horizontal facets (legend on the right corner) represent sequencing depth................................................67 Figure A3. Q-Q Plot and histogram of residuals for false positive SNP rate...........................68 Figure A4. Boxplots for MAF percentage change (CM dataset)..............................................70 Figure A5. Boxplot for MAF percentage change (BF dataset).................................................71 Figure A6. Q-Q Plot and histogram of residuals for MAF percentage change........................72 Figure A7. Boxplot for homozygosity percentage change (CM dataset).................................73 Figure A8. Boxplot for homozygosity percentage change (BF dataset)...................................74 Figure A9. Q-Q Plot and histogram of residuals for Homozigosity percentage change...........75 Figure A10. Boxplot for genotype concordance rate (CM dataset)..........................................77 Figure A11. Boxplot for genotype concordance rate (BF dataset)...........................................78 Figure A12. Q-Q Plot and histogram of residuals for genotype concordance rate...................81 LISTA DE TABELAS CAPÍTULO 1 Table 1. Samples from the Anopheles gambiae 1000 Genomes Consortium project genotype panel used for simulations.........................................................................................................32 CAPÍTULO 2 Table 1. Ny. darlingi chromosome inversion coordinate estimates based on Genome-wide association test..........................................................................................................................56 APÊNDICE A Table A1. Linear regression estimates for false positive SNP rate...........................................68 Table A2. Linear regression estimates for false positive SNP rate, including MAF ranges coefficient..................................................................................................................................69 Table A3. Marginal Means and contrasts estimates for false positive SNP rate......................69 Table A4. Linear regression estimates for MAF percentage change........................................72 Table A5. Marginal Means and contrasts estimates for MAF percentage change....................72 Table A6. Linear regression estimates for homozigosity percentage change...........................75 Table A7. Marginal Means and contrasts estimates for Homozigosity percentage change......76 Table A8. Linear regression estimates for Genotype concordance rate (Minor Allele Homozygotes)........................................................................................................................... 79 Table A9. Linear regression estimates for Genotype concordance rate (Heterozygotes).........79 Table A10. Linear regression estimates for Genotype concordance rate. (Major Allele Homozygotes)........................................................................................................................... 80 Table A11. Marginal Means and contrasts estimates for Genotype concordance rate (Minor Allele Homozygotes)................................................................................................................ 82 Table A12. Marginal Means and contrasts estimates for Genotype concordance rate (Heterozygotes)......................................................................................................................... 83 Table A13. Marginal Means and contrasts estimates for Genotype concordance rate (Major Allele Homozygotes)................................................................................................................ 84 Table A14. Summary of models used in the linear regression analyses...................................85 LISTA DE SÍMBOLOS bp: Pares de base. Exemplo: 1kbp equivale a 1.000 pares de base. p-value: Nível descritivo do teste estatístico. Q1: Primeiro quartil (ou 25º percentil). Q3: Terceiro quartil (ou 75º percentil). Q-Q: Quantil-Quantil. r²: Coeficiente de correlação do desequilíbrio de ligação. R²: Coeficiente de determinação. ρ: Coeficiente de correlação de Spearman. SUMÁRIO 1. INTRODUÇÃO GERAL......................................................................................................16 1.1. Sequenciamento Genômico de Baixa Cobertura: Aplicações e Desafios.....................16 1.2. Bioinformática Amigável: Facilitando o Acesso à Análise Genética para Usuários....19 1.3. Malária no Brasil........................................................................................................... 20 1.4. Principal vetor de malária no Brasil: Nyssorhynchus darlingi......................................22 1.5. Inversões cromossômicas no genoma de Ny. darlingi..................................................24 2. OBJETIVOS.........................................................................................................................27 CAPÍTULO 1 – LCSeqTools – A user friendly tool for cost-effective genotyping by sequencing.................................................................................................................................28 Abstract................................................................................................................................28 Background.......................................................................................................................... 29 Methods................................................................................................................................ 30 Results.................................................................................................................................. 34 Discussion............................................................................................................................ 39 Conclusion............................................................................................................................41 References............................................................................................................................ 41 CAPÍTULO 2 – Chromosome Inversion detection using low-coverage sequencing data (lcWGS).................................................................................................................................... 44 Abstract................................................................................................................................44 Background.......................................................................................................................... 45 Methods................................................................................................................................ 46 Results.................................................................................................................................. 49 Discussion............................................................................................................................ 50 Conclusion............................................................................................................................51 References............................................................................................................................ 52 Figures and tables.................................................................................................................55 3. CONSIDERAÇÕES FINAIS................................................................................................61 REFERÊNCIAS........................................................................................................................62 GLOSSÁRIO............................................................................................................................ 64 APÊNDICE A – Material Suplementar do Capítulo 1.............................................................65 INTRODUÇÃO GERAL - 16 ___________________________________________________________________________ 1. INTRODUÇÃO GERAL 1.1. Sequenciamento Genômico de Baixa Cobertura: Aplicações e Desafios Muitos estudos envolvendo sequenciamento genômico tem sido conduzidos nos últimos anos. Com o desenvolvimento das tecnologias de sequenciamento de genoma tem resultado em reduções consideráveis no custo da técnica (Figura 1). No entanto, projetos em larga escala que demandam por grandes quantidades de amostras sequenciadas continuam apresentando custo elevado, muitas vezes, impraticáveis. Por conta disso, alternativas têm sido adotadas para que o uso de dados genômicos em larga escala seja viável, auxiliando no entendimento da estrutura genética de populações. Figura 1. Histórico do custo de sequenciamento de DNA ao longo dos anos. Fonte: WETTERSTRAND. DNA Sequencing Costs: Data from the NHGRI Genome Sequencing Program. Disponível em: . Acesso: 28 mai. 2024. (Adaptado). Existem diversas estratégias de sequenciamento do genoma atualmente, por exemplo: sequenciamento de genoma completo (do inglês, whole genome sequencing, WGS); sequenciamento de fragmentos de DNA associado a sítios de restrição (do inglês, Restriction site associated DNA sequencing, RADseq). Uma das alternativas de baixo custo para sequenciamento em larga escala é o sequenciamento de baixa cobertura do genoma completo http://www.genome.gov/sequencingcostsdata INTRODUÇÃO GERAL - 17 ___________________________________________________________________________ (do inglês, low-coverage whole genome sequencing, lcWGS). Há uma ambiguidade no uso do termo cobertura e profundidade de sequenciamento na literatura. No presente estudo, foi utilizado a padronização dos termos de acordo com ao Figura 2, na qual as leituras (do inglês, reads) estão representadas em cores de acordo com o tipo de sequenciamento. O termo profundidade de sequenciamento refere-se a quantidade de vezes que uma região do genoma foi lida no processo de sequenciamento e o termo cobertura do sequenciamento refere-se a porcentagem do genoma que foi efetivamente sequenciado, independente do número de vezes. Quanto menor a profundidade de sequenciamento, menor é a confiabilidade do genótipo identificado (SIMS et al., 2014; RUBINACCI et al., 2020). No WGS tradicional, o processo de sequenciamento resultará em amostras com alta profundidade e alta cobertura. No entanto, o volume de DNA total sequenciado será elevado, aumentando consideravelmente o custo da corrida de sequenciamento, além de outros fatores como número de amostras e o tamanho do genoma da espécie. O RADSeq resultará em um sequenciamento com alta profundidade, mas a cobertura é significantemente menor, consequentemente a densidade de polimorfismos detectados será muito menor. Figura 2. Representação esquemática de estratégias de sequenciamento genômico. Fonte: Imagem gerada pelo autor. O lcWGS aliado a técnica de imputação de genótipos confere informações genômicas suficientes para seleção de marcadores com confiabilidade aceitável e custo reduzido INTRODUÇÃO GERAL - 18 ___________________________________________________________________________ (GORJANC et al., 2017; RUBINACCI et al., 2020). A acurácia na detecção de variantes é reduzida em sequenciamento genômico de baixa cobertura, tendendo apresentar taxa de falso positivo elevada, mas isso é atenuado quando a informação entre as amostras é combinada, proporcionando bom poder de identificação de variantes comuns (SIMS et al., 2014; RUBINACCI et al., 2020). A técnica de imputação de genótipos é uma técnica que apresenta ótimo custo- benefício para aumentar o poder em análises genômicas, tais como estudos de associação ampla de genoma, seleção genômica, entre outros. Existem diversos programas e algoritmos de imputação disponíveis atualmente, os quais se baseiam em recuperar informação de genótipos faltantes utilizando estimativas de haplótipos da população (Figura 3), etapa conhecida como faseamento (do inglês, Phasing). Figura 3. Representação esquemática do processo de imputação. Fonte: Imagem gerada pelo Autor. A inferência de genótipos por imputação, tanto para painéis de genotipagem quanto para genotipagem por sequenciamento, demonstra ser uma técnica com bons resultados acurados, possibilitando o uso de sequenciamento de genoma completo de baixa cobertura para descoberta de variantes com uma redução substancial no custo quando comparada com o WGS padrão (PASANIUC et al., 2012; RUSTAGI et al., 2017) Li et al. (2011) demonstraram que variantes raras em amostras de WGS de baixa cobertura apresentam maior dificuldade de serem detectadas por conta da dificuldade de distinguir alelos raros genuínos de erros de sequenciamento. A quantidade de variantes identificadas é superior quando a proporção de polimorfismos na população que segregou entre os indivíduos sequenciados é maior. O presente estudo desenvolveu um programa com interface amigável aos usuários que executa um pipeline completo de genotipagem por sequenciamento, tratamento de dados e INTRODUÇÃO GERAL - 19 ___________________________________________________________________________ imputação otimizados para dados de sequenciamento genômico de baixa cobertura, formando um kit de ferramentas completo e de fácil utilização que contribuirá com o desenvolvimento de futuros projetos que dependam de alternativas de custo reduzido para sequenciamento em larga escala. 1.2. Bioinformática Amigável: Facilitando o Acesso à Análise Genética para Usuários Diversos programas de bioinformática, inclusive para análise de dados de sequenciamento, têm sido desenvolvidos ao longo dos últimos anos. Repositórios de código fonte e binários como GitHub, Sourceforge e LaunchPad, disponibilizam diversos programas e workflows para diferentes aplicações bioinformáticas. Na maioria dos casos, são programas desenvolvidos para serem executados em interfaces de linhas de comandos. Embora poderosas, esse tipo de interface não é tão amigável para usuários sem experiência em linguagens de programação, especialmente se comparadas às interfaces gráficas que são projetadas para serem mais acessíveis e convidativas aos usuários. Figura 4. Descrição das etapas de interação com programas de computador por linhas de comando (esquerda) ou interface gráfica (direita). Fonte: Imagem gerada pelo Autor. Programas como QIAGEN CLC Genomics Workbench, Geneious e Unipro UGENE são importantes por disponibilizarem um conjunto de ferramentas poderosas para lidar com grande volume de dados biológicos e, ao mesmo tempo, fornecer uma interface amigável que pode economizar tempo e potencializar a acessibilidade da ferramenta por pesquisadores de laboratório experimental sem experiência em interfaces de linha de comando, por exemplo. A INTRODUÇÃO GERAL - 20 ___________________________________________________________________________ implementação de uma interface gráfica amigável aos usuários é extremamente importante para aproximação da ferramenta e dos usuários sem experiência em linhas de comando. Quanto mais intuitiva e amigável for a interface do programa de computador, maior o potencial para cativar novos usuários. Existem diversas formas e com diferentes possibilidades para implementar interface gráfica em programas de computador. Uma das formas é através do framework Qt (disponível em https://www.qt.io/). Um framework é um conjunto de códigos de vários projetos que promovem uma funcionalidade genérica, permitindo ao desenvolvedor dedicar seus esforços na resolução de problemas específicos do programa em vez de reescrever códigos comuns para funcionalidades genéricas. O Qt é um framework para desenvolvimento de software em linguagem C++, permitindo inclusive o desenvolvimento em múltiplas plataformas como Linux, Windows, macOS, Android, entre outras. Além disso, o Qt é disponibilizado através de licenças gratuitas como GPL e LGPLv3 (disponível em https://www.qt.io/licensing/), sendo ideal para casos de projetos com distribuição de código aberto com propósito acadêmico/estudantil. Com base no descrito, o projeto teve como objetivo desenvolver um programa utilizando o framework Qt, implementando uma interface gráfica amigável aos usuários que facilitará a execução do pipeline completo para genotipagem por sequenciamento utilizando dados de sequenciamento genômico de baixa cobertura, resultando em um arquivo do tipo VCF (do inglês, Variant call format), o qual poderá ser utilizado em diversos programas de análises estatísticas e genômica populacional, inclusive programas que também possuem interface amigável aos usuários. 1.3. Malária no Brasil A malária é considerada a enfermidade transmitida por artrópode mais impactante em países em desenvolvimento. De acordo com World Malaria Reports (2023), a estimativa de casos no mundo é de 249 milhões de casos de malária e 608 mil mortes em 2022, aproximadamente 94% concentradas na África (Figura 5). Além da África, esta doença afeta outras populações pobres em regiões tropicais e subtropicais, devido às condições ambientais favoráveis para o desenvolvimento do agente causador da doença bem como do seu transmissor (SNOW et al., 2005). https://www.qt.io/licensing/ https://www.qt.io/ INTRODUÇÃO GERAL - 21 ___________________________________________________________________________ Figura 5. Número de mortes por malária reportado em 2022 Fonte: Imagem gerada pelo autor, dados conforme World Malaria Reports (WHO, 2023) O Brasil é um país com alta incidência de malária, foram 131 mil casos registrados em 2022 segundo o Boletim Epidemiológico da Secretaria de Vigilância em Saúde de 2024 (BRASIL, 2024). Cerca de 99,9% dos casos registrados estão concentrados na região amazônica (Figura 6). Essa doença é caracterizada por desencadear acessos periódicos de febres intensas que debilitam profundamente o indivíduo infectado. O ciclo de transmissão é composto por protozoários (Reino Protozoa) do gênero Plasmodium spp. que são transmitidos ao homem através da picada de mosquitos do gênero Anopheles spp. Há cinco espécies e duas subespécies dentro do gênero Plasmodium spp que podem causar a malária humana: Plasmodium vivax, P. falciparum, P. malariae, P. ovale curtisi, P. ovale wallikeri e P. knowlesi (SU, 2010). INTRODUÇÃO GERAL - 22 ___________________________________________________________________________ Figura 6. Classificação de Risco dos municípios segundo a incidência de Malária. Fonte: BRASIL, Boletim Epidemiológico nº55. 2024. Disponível em: . Acesso em: 28 mai. 2024 As principais diferenças entre esses agentes patogênicos são infecção, sintomas e terapias utilizadas para o tratamento. Do ponto de vista epidemiológico, a principal diferença é a mortalidade: casos de P. falciparum não tratados possuem elevados índices de óbitos, sendo a espécie mais letal (WORLD HEALTH ORGANIZATION, 2023). Aproximadamente 500 espécies de anofelinos já foram identificadas, 70 são vetores de parasitos e destes, cerca de 20 são importantes transmissores da malária ao homem (SERVICE, 2008). 1.4. Principal vetor de malária no Brasil: Nyssorhynchus darlingi Nyssorhynchus darlingi é o principal vetor da malária no Brasil, altamente suscetível aos plasmódios humanos e, capaz de transmitir a doença dentro e fora das moradias, mesmo em baixa densidade (Figura 7). Os criadouros deste mosquito são caracteristicamente representados por coleções de águas límpidas, com certa profundidade, sombreadas, dotadas de vegetação pobre em sais e matéria orgânica (FORATTINI, 2002). Além disso, é notavelmente antropofílico, pois na medida em que o ambiente natural se transforma em antrópico, ou desmatado, a população local de Ny. darlingi tenderá a coabitar com o homem, https://www.gov.br/saude/pt-br/centrais-de-conteudo/publicacoes/boletins/epidemiologicos/edicoes/2024/boletim-epidemiologico-volume-55-no-01/view https://www.gov.br/saude/pt-br/centrais-de-conteudo/publicacoes/boletins/epidemiologicos/edicoes/2024/boletim-epidemiologico-volume-55-no-01/view INTRODUÇÃO GERAL - 23 ___________________________________________________________________________ invadindo-lhe os domicílios, traduzindo a capacidade de adaptação do mosquito ali presente e potencializando seu papel de vetor (ROZENDAAL, 1990). Figura 7. Ny. darlingi fêmea adulta Fonte: James Gathany | CDC. Disponível em: . Acesso em: 28 de mai. 2024 Na Amazônia, é o vetor que melhor e mais rapidamente se beneficia das alterações causadas pelo ser humano no ambiente silvestre (CONSOLI & LOURENÇO-DE-OLIVEIRA, 1994). O controle do vetor é realizado com borrifação de inseticida e, uso de mosquiteiros ao entardecer e durante a noite, horário de pico da atividade hematofágica do vetor (BAIA-DA- SILVA et al., 2019). O uso de inseticida apresenta periculosidade à saúde de habitantes locais e de funcionários que o manejam, além de causar seleção de indivíduos resistentes (VEZENEGHO et al., 2009). Sendo assim, o estudo do comportamento e dinâmica biológica do vetor, sua dispersão e interação com humanos é de grande importância para a entomologia médica e para a saúde pública. Até o presente momento, poucos estudos foram conduzidos utilizando dados de sequenciamento genômico para Ny. darlingi, especialmente WGS. Alvarez et al. (2022) demonstraram que utilizando dados de lcWGS associados à imputação, foi possível identificar SNPs estatisticamente significativos associados ao comportamento de picada do mosquito, além de demonstrar sinais de estratificação populacional em escala microgeográfica. Mais estudos que utilizem WGS são necessários para caracterizar a nível molecular o processo de adaptação e evolução do mosquito a fim de estabelecer estratégias de alta precisão de controle vetorial, bem como rastrear as mudanças na dinâmica populacional para processos de intervenção. Além disso, uma das vantagens de se utilizar estratégias de https://agencia.fapesp.br/discovery-paves-the-way-for-blocking-malaria-transmission-in-brazil/31979 https://agencia.fapesp.br/discovery-paves-the-way-for-blocking-malaria-transmission-in-brazil/31979 INTRODUÇÃO GERAL - 24 ___________________________________________________________________________ sequenciamento genômico completo é a possibilidade de explorar os dados em busca de testar múltiplas hipóteses biológicas simultaneamente. 1.5. Inversões cromossômicas no genoma de Ny. darlingi Inversões cromossômicas são variantes estruturais que ocorrem quando um segmento do cromossomo rotaciona 180º em sua posição cromossômica original (Figura 8). Inversões podem apresentar tamanhos variados, podendo alcançar uma parte substancial de um braço cromossômico, bem como podem ter cerca de apenas alguns milhares de pares de base de comprimento. Regiões adjacentes aos pontos de quebra das inversões cromossômicas apresentam taxa de recombinação fraca ou praticamente nula, sendo regiões particularmente valiosas e importantes para inferir o histórico evolutivo das inversões (CORBETT-DETIG et al, 2019). São componentes fundamentais no processo de evolução, por exemplo, em Anopheles gambiae desempenham importante papel no processo de adaptação a diferentes condições ambientais (CORBETT-DETIG et al., 2019; AYALA et al., 2019), bem como associadas ao tamanho do corpo, resistência ao estresse térmico e de dessecação, consistentes com um papel importante das inversões na tolerância do mosquito à aridez (CHENG et al., 2018). Figura 8. Representação esquemática de Inversões Cromossômicas. Fonte: Imagem gerada pelo Autor. INTRODUÇÃO GERAL - 25 ___________________________________________________________________________ O genoma de Ny. darlingi apresenta uma composição cromossômica de 2n=6, constituído por dois pares de cromossomos autossômicos e um par de cromossomos sexuais (Figura 9). Dentre os cromossomos autossômicos, 2 é metacêntrico e 3 é submetacêntrico. Os cromossomos sexuais são X e Y, sendo acrocêntrico e puntiforme (totalmente heterocromático), respectivamente (RAFAEL & TADEI, 1998; RAFAEL & TADEI, 2000). Figura 9. Representação esquemática do cariótipo de Ny. darlingi Fonte: Imagem gerada pelo Autor, dados conforme Rafael & Tadei (2000) Em Ny. darlingi, já foram descritas inversões em cromossomos politênicos de glândulas salivares de mosquitos do sudeste do Brasil e região amazônica, sendo inversões localizadas nos cromossomos X, braços 2L e 2R do cromossomo 2 e braços 3R e 3L do cromossomo 3 (KREUTZER et al., 1972; TADEI & SANTOS, 1982; RAFAEL et al., 2010; CORNEL et al., 2016). É possível observar sinais de inversões cromossômicas utilizando dados de SNPs oriundos de sequenciamento genômico completo. Múltiplas abordagens já foram descritas na literatura, possibilitando utilizar diferentes parâmetros genômicos para tal processo (CÁCERES & GONZÁLEZ, 2015). Por exemplo, é possível identificar alterações no padrão de desequilíbrio de ligação entre SNPs por da taxa de recombinação fraca em regiões adjacentes ao ponto de quebra das inversões. Outro maneira utilizável é a redução de dimensões por análise de escalonamento multidimensional (MDS) ou análise de componentes principais (PCA). Ao aplicar esses métodos de redução de dimensões, a variabilidade genética causada pela fixação de alelos por conta do aumento do desequilíbrio de ligação nas regiões das inversões cromossômicas pode ser identificadas através da formação de agrupamentos genéticos ou pela inspeção dos dos autovetores utilizados na transformação dos dados no INTRODUÇÃO GERAL - 26 ___________________________________________________________________________ PCA, onde é possível relacionar a importância de cada SNP com cada componente principal avaliado. Até o presente momento, não há estudos que descrevem inversões cromossômicas para a espécie Ny. darlingi a nível molecular utilizando tecnologia de sequenciamento genômico completo. OBJETIVOS - 27 ___________________________________________________________________________ 2. OBJETIVOS O objetivo deste projeto foi desenvolver um programa intuitivo e autônomo para a genotipagem de dados de lcWGS, avaliar sua eficiência por meio de simulações computacionais, e aplicar o método desenvolvido a estudos de genômica populacional e comparativa utilizando amostras de Ny. darlingi coletadas no município Mâncio Lima no estado do Acre. CAPÍTULO 1 - 28 ___________________________________________________________________________ CAPÍTULO 1 – LCSeqTools – A user friendly tool for cost-effective genotyping by sequencing. Authors: Marcus Vinicius Niz Alvarez, Diego Peres Alonso, Paulo Eduardo Martins Ribolla ═════════════════════════════════════════════════════ Abstract Introduction: Over the past decade, low-coverage whole-genome sequencing (lcWGS) has been used as an alternative approach for efficient and cost-effective genotyping strategy. The combination of lcWGS and genotype imputation is a known method for improving data accuracy. We proposed in this study a user-friendly genotyping tool that performs a full genotyping by sequencing pipeline optimized for lcWGS data and does not require haplotype reference panels. This is mainly important for non-model organisms that lack reference data. We also tested the pipeline in different scenarios and compared the genotyping performance with traditional WGS data. Methods: We generated in silico lcWGS sequencing data using a high coverage publicly available data of Anopheles gambiae samples as reference. Different levels of sequencing depth {0.5,1,2,3,4,5,10} and sample size {10,25,50,100,200} were tested. We tested three combinations: traditional genotyping pipeline without imputation and NextSeq500 platform (REF), LCSeqToos pipeline with NextSeq500 (NS5) and LCSeqTools with NovaSeq6000 (NS6). We estimated parameters for each simulation iteration as a benchmark: False positive SNP rate, MAF percentage change, individual homozygosity percentage change and genotype concordance rate. Results: Both N5 and N6 pipeline combinations showed significant lower false positive SNPs, lower change in allele frequencies estimates and higher genotype concordance rate in all low coverage scenarios when compared to REF. Lower or equal homozygosity percentage change was also __________________________________________________________________________________________ Conforme a instrução normativa nº1 de 2018 (Resolução UNESP no. 92/2012), este capítulo segue o formato de artigo científico adotado pela revista escolhida, não estando sujeito às normas ABNT aplicáveis aos demais capítulos desta tese. CAPÍTULO 1 - 29 ___________________________________________________________________________ observed, but only at 4x and 5x sequencing depth, for lower depth scenarios, the homozygosity change was greater than REF, but sample size also affects it. Conclusion: LCSeqTools pipeline significantly improves the genotyping performance of lcWGS sequencing data and provides an alternative reliable and cost-effective strategy for low budget projects aiming population genomics studies using biallelic SNPs. Keywords: Bioinformatics, Genomics, Biostatistics, Genotyping, Cost-effectiveness Background Population genomics studies over the past decade have used low-coverage whole- genome sequencing (lcWGS) as an alternative approach for efficient and cost-effective genotyping by sequencing. Despite the significant cost reduction of whole-genome sequencing over the last few years (WETTERSTRAND, K.A., 2022; PRESTON, J. et al, 2021), the sequencing cost per base pair is still expensive for some applications, especially for studies demanding a high number of samples or species with large genomes. lcWGS is an alternative approach, in which by sacrificing sequencing depth one can sequence a massive number of samples with the same budget. However, the major lcWGS drawback caused by the reduction in sequencing depth is the lower genotyping reliability, given that the probability of identifying heterozygous loci in diploid individuals is lower and the probability of assuming false positives as genuine variants is higher. (NIELSEN, R. et al, 2011). lcWGS combined with genotype imputation has shown great results to improve genotyping accuracy, as imputation algorithms have been improved in recent years (HUI, R. et al, 2020), surpassing even SNP arrays in terms of performance and accuracy in some scenarios (DAVIES, R.W. et al. 2021; RUBINACCI, S. et al. 2021). Nevertheless, factors such as sample size and average sequencing depth cause significant changes on imputation performance, therefore, they are important variables to be considered in the experimental design stage (LOU, R.N. et al, 2021). CAPÍTULO 1 - 30 ___________________________________________________________________________ Although the lcWGS-based genotyping accuracy is lower when compared to traditional WGS, many population level parameters can still be inferred with good reliability from the analyzed sample, such as allele frequency and homozygosity. lcWGS combined with imputation has shown to be a highly efficient approach by presenting reliable estimates of population genetic parameters (LOU, R.N. et al, 2021) and also greater statistical power in genome-wide association studies when compared to SNP array genotyping method (ZHANG, Z. et al, 2022; DENG, T. et al, 2021). This is especially important for genome-wide association studies, which often need large sample sizes to obtain sufficient statistical power. Therefore, the lcWGS can be a great alternative for projects that depend on large sample sizes, but reduced budget, such as epidemiological studies. Currently, several model organisms have haplotype reference panels that are useful to improve imputation performance. However, many other organisms still lack any type of genomic data. In these cases, it is necessary to use an approach in which the haplotypes present in the sample itself are used as a reference in the imputation stage. The goal of this study is to provide a user-friendly tool that executes a complete genotyping pipeline with customizable parameters for lcWGS data of non-model organisms. We also evaluate the genotyping performance under different scenarios of sequencing depth and sample sizes using simulated lcWGS data. Methods Genotyping by sequencing pipeline The LCSeqTools pipeline consists of six main steps: Sequence quality control, sequence alignment, variant calling, pre-imputation variant filtering, imputation and post- imputation variant filtering. The only required input files are the raw sequencing files in modern FASTQ format and a reference genome in FASTA format. Sequencing quality statistics are estimated using FASTQC v0.11.9 (ANDREWS, S., 2010) and displayed with the program graphical interface. The sequence trimming and filtering is performed by CAPÍTULO 1 - 31 ___________________________________________________________________________ Trimmomatic v0.39 (BOLGER , A. M., LOHSE, M., & USADEL, B., 2014) using predefined or custom parameters by the user. The pre-processed sequences are mapped to the reference genome using Burrows-Wheeler Aligner v0.7.17 program (LI, H., DURBIN, R., 2009). The properly mapped sequences are submitted to biallelic-only SNP calling using SAMtools v1.10 and BCFtools v1.10 programs (LI, H., 2011). Pre-imputation variant filtering is performed by LCSeqTools through the integration of LCVCFtools v1.0.2 (ALVAREZ, M.V.N., 2021; ALVAREZ, M.V.N., 2022). The pre-imputation variant filtering method was implemented specifically for low coverage data. Genotypes were omitted according to genotype quality and sequencing depth parameters, keeping only genotype likelihood data used as a priori probability in the imputation process. SNP are removed by allele-depth-based MAF estimate and missing data rate per sample and per SNP. Missing genotypes are imputed using the BEAGLE v4.1 (BROWNINGS. R., BROWNING B. L., 2016) with GTGL method. An optional post-imputation variant filtering is performed using BCFtools v1.10 program, removing genotypes according to a posteriori probability threshold. Genotyping performance evaluation The pipeline genotyping performance was evaluated through low-coverage whole genome sequencing simulations under different depths and sample sizes scenarios. Public genome-wide genotyping data of Anopheles gambiae mosquito was used as a reference to generate in silico low-coverage sequencing data. The simulation used a high coverage genome-wide variants panel of the mosquito An. gambiae from the first phase of the "The Anopheles gambiae 1000 Genomes Consortium" (ANOPHELES GAMBIAE 1000 GENOMES CONSORTIUM et al, 2017) available at EMBL-EBI (ERZ373588). Genotypes from two independent sample groups in the An. gambiae data were used as reference in the sequencing simulations. Two datasets were chosen using the sample size criteria, selecting the datasets with the biggest number of contemporaneous individuals, therefore, samples from the Republic of Cameroon (CM) and Burkina Faso (BF), represented in Table 1. CAPÍTULO 1 - 32 ___________________________________________________________________________ Table 1. Samples from the Anopheles gambiae 1000 Genomes Consortium project genotype panel used for simulations. Country Region Sex N Total Republic of Cameroon (CM) Daiguene F 69 232 Gado-Badzere F 55 Mayos F 87 Zembe-Borongo F 21 Burkina Faso (BF) Bana F 37 127Pala F 44 Sourukoudinga F 46 F: Female. N: Region sample size. Total: Country sample size The in silico sample sequencing was generated using the art_illumina program from ART software package (HUANG, W. et al., 2012). Illumina NextSeq500 and NovaSeq6000 real sequencing data were used to calibrate art_illumina to reproduce sequencing data from these equipment. To calibrate the NextSeq500 quality profile and read error probabilities, publicly available sequencing data from the National Center for Biotechnology Information – NCBI (SAYERS, E.W. et al, 2023) Sequence Read Archive (SRA), identified as Bioproject PRJNA683015, was used. To calibrate the NovaSeq6000 quality profile and reading error probabilities, publicly available sequencing data from the NCBI SRA, identified as Bioproject PRJNA744883, was used. The genome reference used is available at NCBI as AgamP3 from An. gambiae (GCF_000005575.2), the reference used in the “Anopheles gambiae 1000 Genomes” project. The Y chromosome, mitochondrial genome and unplaced scaffold genomic regions were discarded for the analysis, using only the 2L, 2R, 3L, 3R and X chromosomes. The tested parameters were: genotyping method, average sequencing depth and sample size. For genotyping methods, three different pipelines were used: NextSeq500 sequencing data with standard variant calling pipeline without imputation as reference group (REF), NovaSeq6000 sequencing data with default LCSeqTools variant calling pipeline (NS6), NextSeq500 sequencing data with default LCSeqTools variant calling pipeline (NS5). The average sequencing depth (DP) levels generated were: DP={0.5,1,2,3,4,5,10}. DP=0.5 level was considered as extremely-low coverage and DP=10 as high coverage, both were used only for descriptive analysis. The sample sizes (N) used were: N={10,25,50,100,200}. Each simulation iteration selects random unique individual genotyping data from the reference. Statistical Analysis CAPÍTULO 1 - 33 ___________________________________________________________________________ Descriptive and inferential analysis were performed using R v4.3.2 with RStudio Server “Ocean Storm” (R CORE TEAM, 2013; TEAM RSTUDIO et al, 2015). Data processing and plotting were performed using packages included in tidyverse v2.0.0 meta- package (WICKHAM, H. et al., 2019). Statistical parameters were calculated for each chromosomal region (2L, 2R, 3R, 3L and X), comparing the resulting genotypes with the reference panel for each simulation iteration. The estimated statistical parameters for descriptive and inferential analyzes were: false positive SNP rate, MAF percentage change, individual homozygosity percentage change and genotype concordance rate. Estimates were summarized in quartiles between chromosomal regions for the parameters MAF percent change, homozygosity percent change, and genotype concordance rate. Inferential analysis was performed through statistical tests for all parameters, using median estimates across the chromosomes or sum across chromosomes for false positive SNP rate. A multiple linear regression analysis was performed and marginal means estimate used for group comparison. The predictor variable selection was based on a stepwise regression approach (coefficient estimate ≠ 0, p-value ≤ 0.05). The inclusion of predictor variables was performed starting with the null model (no variable), adding the variables one by one, including interactions. Model comparison was performed using the Akaike Information Criterion (AIC) and F test. Atypical observations (outliers) were removed when identified by the interquartile range rule, given that if x < Q1 − 1.5 × IQR or x > Q3 + 1.5 × IQR, x is removed from the respective data set. Residuals of the models were evaluated for normality by the Kolmogorov-Smirnov test with Lilliefors correction and homoscedasticity by the Levene test. Data were submitted to logarithmic transformation to reduce deviation from normality when necessary. Comparisons were performed by contrasting the estimates of marginal means. Statistically significant differences were considered when the adjusted p- value for multiple comparisons by the Bonferroni method was ≤ 0.05. Genotyping performance was evaluated using the following criteria: lowest rate of false positive SNP, lowest percentage variation in MAF, lowest percentage variation in CAPÍTULO 1 - 34 ___________________________________________________________________________ homozygosity, and highest rate of genotype agreement. Results The genotyping method and sample size parameters showed statistically significant estimates in the linear regression analysis (Table A1) for the false positive SNPs rate, indicating significantly lower false positives in NS5 and NS6 data than REF data (Figure 1). The negative sample size coefficient indicates that the larger the sample size, the lower false positive SNPs rate. A separate model including false positive SNP by MAF range indicated that the higher the frequency of the MAF, the lower the rate of false positive SNPs (Table A2, figure A1 and figure A2). CAPÍTULO 1 - 35 ___________________________________________________________________________ Estimated Marginal Means (CM) Marginal contrasts (CM) Estimated Marginal Means (BF) Marginal contrasts (BF) Figure 1. Contrast analysis for false positive SNP rate. Points represent individual data. Error bars indicate the 95% confidence intervals. Different letters represent significant differences between groups (CLD). Values below the horizontal lines indicate Bonferroni-corrected contrasts p-values. All tested parameters showed statistically significant estimates in the linear regression analysis (Table A4) for MAF percentage change. The allele frequency percent change was significantly lower in the NS5 and NS6 data than in the REF data (Figure 2). As the sequencing depth increases, the MAF percentage change becomes smaller and approaches zero. This is shown by the positive sequencing depth coefficient. Sample size coefficient indicates that there is a slight negative effect on MAF percentage change. However, it is possible to observe in figures A4 and A5 that, the larger the sample size, the greater the correlation of allele frequencies. CAPÍTULO 1 - 36 ___________________________________________________________________________ Estimated Marginal Means (CM) Marginal contrasts (CM) Estimated Marginal Means (BF) Marginal contrasts (BF) Figure 2. Contrast analysis for MAF percentage change. Points represent individual data. Error bars indicate the 95% confidence intervals. Different letters represent significant differences between groups (CLD). Values below the horizontal lines indicate Bonferroni-corrected contrasts p-values. All tested parameters showed statistically significant estimates (Table A4). A significant interaction effect was observed between genotyping method and sequencing depth, indicating that data NS5 and NS6 showed homozygosity percentage variation lower than or equal to REF data when the sequencing depth was greater than approximately three times (Figure 3). Negative coefficients for both sequencing depth and sample size indicate that the homozygosity percentage variation is reduced as these variables are incremented. CAPÍTULO 1 - 37 ___________________________________________________________________________ Estimated Marginal Means (CM) Marginal contrasts (CM) Estimated Marginal Means (BF) Marginal contrasts (BF) Figure 3. Contrast analysis for Homozygosity percentage change. Points represent individual data. Error bars indicate the 95% confidence intervals. Different letters represent significant differences between groups (CLD). Values below the horizontal lines indicate Bonferroni-corrected contrasts p-values. Facets and solid connected lines indicate interaction effect for sequence depth. The genotyping method and sequencing depth parameters showed statistically significant estimates in the linear regression analysis (Table A6) for genotype concordance rate. Significant interaction effect was observed between genotyping method and sequencing depth for heterozygotes and homozygotes for the minor allele. The homozygous genotypes CAPÍTULO 1 - 38 ___________________________________________________________________________ concordance rate was significantly higher in NS5 and NS6 data than REF data (Figure 4). However, REF data showed a lower concordance rate, both for heterozygotes and for homozygotes for the major allele. Linear and quadratic coefficients of sequencing depth indicate that the greater the sequencing depth, the greater the genotype concordance rate. Estimated Marginal Means (CM) Marginal contrasts (CM) Estimated Marginal Means (BF) Marginal contrasts (BF) Figure 4. Contrast analysis for Genotype concordance rate. Points represent individual data. Error bars indicate the 95% confidence intervals. Different letters represent significant differences between groups (CLD). Values below the horizontal lines indicate Bonferroni-corrected contrasts p-values. Facets and solid connected lines indicate interaction effect for sequence depth. CAPÍTULO 1 - 39 ___________________________________________________________________________ Discussion One of the main problems lcWGS genotyping is the increase in SNP false positive rate. The smaller the average sequencing depth, the greater the likelihood of false positive SNP identification (SIMS, D. et al, 2014). LCSeqTools pipeline significantly reduces the false positive rate when compared to unprocessed data, resulting in around four times less false positive at average sequencing depth from one to five times (Table A3). Another important factor for significantly reduce false positives it the dataset sample size. Beside of that, false positives occur more frequently in lower allele frequencies, mainly for MAF ranges between 0.1 to 0.2, causing a small noise in most of cases. Nevertheless, researchers can validate a smaller set of candidate SNPs using more accessible and cost-effective genotyping techniques and still reach a good cost-benefit ratio compared to traditional WGS. The descriptive statistics indicated that an average sequencing depth < 1 of less than one time causes a higher variation in the estimated parameters compared to the observed variation between an average sequencing depth between one to five times. LCSeqTools pipeline does not generate reliable results with extremely-low coverage, so it should not be used for lcWGS data with average DP < 1. The LCSeqTools pipeline performance was very similar to the traditional genotyping protocol at average DP = 10 times, indicating that the ideal scenario for application of data processing is between one to five times of average sequencing depth. The variance in allele frequency estimates caused by low-coverage sequencing was significantly reduced using LCSeqTools pipeline, resulting in an average variance of treated data about 60% lower compared to untreated data (Table A5). It is noteworthy that most estimates show negative variation, indicating an underestimation of the real allele frequency of the SNP (Figures A4 and A5). The sequencing depth and sample size have a direct impact on allele frequency estimates, and the greater the sample size and sequencing depth, the smaller the variation in the allele frequency estimate. Homozygosity inflation is another main drawback of genotyping by low coverage sequencing. CAPÍTULO 1 - 40 ___________________________________________________________________________ The mean sequencing depth is directly linked to homozygosity inflation, and this effect can be seen in figures A7 and A8, given that the smaller the mean sequencing depth, the greater the inflation of the homozygosity estimate. The sample size also has an impact on the homozygosity estimate, where the larger the sample size, the lower the homozygosity inflation. LCSeqTools pipeline showed different results in different scenarios, in which there was a reduction in homozygosity inflation when the sequencing depth was greater than three times and amplification of homozygosity inflation when the sequencing depth was less than three times (Table A7). This result suggests that research designs that depend on high reliability estimates for homozygosity, sample size and sequencing depth should be very well structured, given that the reduction in homozygosity inflation is much more associated with sequencing depth than the sample size. LCSeqTools significantly improved the concordance rate of homozygous genotypes for the smallest allele, which had the greatest impact caused by low coverage sequencing (Table A8). However, divergent results for levels of sequencing depth were observed in the concordance rates of heterozygous and homozygous genotypes for the major allele (Table A9 and Table A10), given that LCSeqTools slightly reduces the concordance rate when the sequencing depth is also reduced. Nevertheless, the reduction caused by data processing keeps the concordance rates sufficiently high, being practically negligible in homozygotes for the major allele. The lcWGS approach offers an excellent cost-benefit ratio as an alternative strategy, particularly for large-scale sequencing projects. It allows for an increase in the total number of sequenced samples, which is inversely proportional to the reduction in sequencing depth, up to the technical limit of the proposed pipeline. This is particularly crucial for studies requiring large sample sizes, such as epidemiological research, where there is a superior statistical advantage in having more samples than certainty in individual genotypes (LOU, R.N. et al, 2021; ALEX BUERKLE, C. & GOMPERT. Z. et al, 2013). CAPÍTULO 1 - 41 ___________________________________________________________________________ Conclusion LCSeqTools pipeline significantly improves the genotyping performance of lcWGS data, providing a robust alternative for low budget projects aiming population genomics studies using high frequency biallelic SNPs. The number of false positive SNPs is significantly reduced using this pipeline, resulting in an efficient method to identify a large and reliable set of SNPs that can be easily validated by more accessible and less expensive genotyping techniques. Availability of data and material All documentation, including scripts and links necessary to replicate the study’s findings or to further extend the research, is available at: • https://github.com/marcusnizalvarez/PhD-Scripts • https://github.com/marcusnizalvarez/LCSeqTools • https://doi.org/10.5281/zenodo.11267120 • https://doi.org/10.5281/zenodo.13755768 References ALEX BUERKLE, C.; GOMPERT, Zachariah. Population genomics based on low coverage sequencing: how low should we go?. Molecular ecology, v. 22, n. 11, p. 3028-3035, 2013. ALVAREZ, Marcus Vinicius Niz et al. Nyssorhynchus darlingi genome-wide studies related to microgeographic dispersion and blood-seeking behavior. Parasites & Vectors. v. 15, n. 1, p. 106. 2022. ALVAREZ, Marcus Vinicius Niz. LCVCFtools v1.0.2-alpha (v1.0.2-alpha). Zenodo. https://doi.org/10.5281/zenodo.5259931. 2021. ANDREWS, Simon et al. FastQC: a quality control tool for high throughput sequence https://doi.org/10.5281/zenodo.13755768 https://doi.org/10.5281/zenodo.11267120 https://github.com/marcusnizalvarez/LCSeqTools https://github.com/marcusnizalvarez/PhD-Scripts CAPÍTULO 1 - 42 ___________________________________________________________________________ data. Babraham Bioinformatics. 2010. ANOPHELES GAMBIAE 1000 GENOMES CONSORTIUM et al. Genetic diversity of the African malaria vector Anopheles gambiae. Nature, v. 552, n. 7683, p. 96, 2017. BOLGER, Anthony M.; LOHSE, Marc; USADEL, Bjoern. Trimmomatic: a flexible trimmer for Illumina sequence data. Bioinformatics, v. 30, n. 15, p. 2114-2120, 2014. BROWNING, Brian L.; BROWNING, Sharon R. Genotype imputation with millions of reference samples. The American Journal of Human Genetics, v. 98, n. 1, p. 116-126, 2016. DAVIES, Robert W. et al. Rapid genotype imputation from sequence with reference panels. Nature Genetics, v. 53, n. 7, p. 1104-1111, 2021. DENG, Tianyu et al. Comparison of Genotype Imputation for SNP Array and Low- Coverage Whole-Genome Sequencing Data. Frontiers in genetics, v. 12, p. 704118-704118, 2021. HUANG, Weichun et al. ART: a next-generation sequencing read simulator. Bioinformatics, v. 28, n. 4, p. 593-594, 2012. HUI, Ruoyun et al. Evaluating genotype imputation pipeline for ultra-low coverage ancient genomes. Scientific reports, v. 10, n. 1, p. 1-8, 2020. LI, Heng; DURBIN, Richard. Fast and accurate short read alignment with Burrows– Wheeler transform. bioinformatics, v. 25, n. 14, p. 1754-1760, 2009. LI, Heng. A statistical framework for SNP calling, mutation discovery, association mapping and population genetical parameter estimation from sequencing data. Bioinformatics, v. 27, n. 21, p. 2987-2993, 2011. LOU, Runyang Nicolas et al. A beginner's guide to low‐coverage whole genome sequencing for population genomics. Molecular Ecology, v. 30, n. 23, p. 5966-5993, 2021. NIELSEN, Rasmus et al. Genotype and SNP calling from next-generation sequencing data. Nature Reviews Genetics, v. 12, n. 6, p. 443-451, 2011. PRESTON, J.; VANZEELAND, A.; PEIFFER, D. A. Innovation at Illumina: The road to the $600 human genome. Nat. Portf, 2021. CAPÍTULO 1 - 43 ___________________________________________________________________________ R CORE TEAM et al. R development core team. R: A language and environment for statistical computing, v. 55, p. 275-286, 2016. RUBINACCI, Simone et al. Efficient phasing and imputation of low-coverage sequencing data using large reference panels. Nature Genetics, v. 53, n. 1, p. 120-126, 2021. SAYERS, Eric W. et al. Database resources of the National Center for Biotechnology Information in 2023. Nucleic acids research, v. 51, n. D1, p. D29, 2023. TEAM, RStudio et al. RStudio: integrated development for R. RStudio, Inc. Boston, MA, USA, v. 42, p. 14, 2015. WETTERSTRAND, Kris A. DNA sequencing costs: data from the NHGRI Genome Sequencing Program (GSP). National Human Genome Research Institute, 2013. WICKHAM, Hadley. et al. Welcome to the Tidyverse. Journal of open source software, v. 4, n. 43, p. 1686, 2019. ZHANG, Zhe et al. The construction of a haplotype reference panel using extremely low coverage whole genome sequences and its application in genome-wide association studies and genomic prediction in Duroc pigs. Genomics, v. 114, n. 1, p. 340-350, 2022. CAPÍTULO 2 - 44 ___________________________________________________________________________ CAPÍTULO 2 – Chromosome Inversion detection using low-coverage sequencing data (lcWGS). Authors: Marcus Vinicius Niz Alvarez, Filipe Trindade Bozoni, Lorena Leles, Diego Peres Alonso and Paulo Eduardo Martins Ribolla ═════════════════════════════════════════════════════ Abstract Introduction: Nyssorhynchus darlingi is the primary malaria vector in the Neotropical region. Previous studies have detected chromosome inversions in salivary glands politene chromosomes of Ny. darlingi samples from Amazon region. Chromosomal inversions have been recognized as important evolutionary mechanisms in diverse organisms. In mosquitoes, structural variants are associated with important ecological adaptation, vectorial capacity and even speciation. A powerful strategy to understand molecular evolution and population dynamics is the use of molecular markers. Although there has been rapid progress in genome sequencing technology, it continues to be an economical challenge for large scale projects. A cost-effective alternative approach is to use low-coverage whole genome sequencing (lcWGS) techniques. In this study, we used genome-wide SNPs to detect chromosome inversions at sequencing level, reliably, with lcWGS data. We detected chromosome inversions in Ny. darlingi samples from Amazon region and compared with Anopheles gambiae and Anopheles albimanus genomes in synteny analysis. Methods: We used 321 Ny. darlingi samples from Brazil and 200 An. gambiae samples from the Republic of Cameroon from publicly available sequencing data. The An. gambiae dataset was used to benchmark the inversion detection pipeline with known inversions. Genotyping by sequencing was performed using the LCSeqTools workflow for the lcWGS dataset with an average sequencing depth of 2x. Genetic distance estimates were calculated with PLINK 1.9 software with the pruned SNPs dataset. PCA was applied and each principal component analyzed individually with a sliding window approach for detecting chromosome inversion footprints. A synteny analysis was performed for Ny. darlingi inversions regions with An. gambiae and An. albimanus genomes. __________________________________________________________________________________________ Conforme a instrução normativa nº1 de 2018 (Resolução UNESP no. 92/2012), este capítulo segue o formato de artigo científico adotado pela revista escolhida, não estando sujeito às normas ABNT aplicáveis aos demais capítulos desta tese. CAPÍTULO 2 - 45 ___________________________________________________________________________ Results: The sliding window analysis of PCA components revealed 10 high confidence candidate regions for chromosome inversions in Ny. darlingi genome and two known inversions for An. gambiae with satisfactory precision of breakpoints. Synteny analysis showed no similar arrangements for Ny. darlingi with An. gambiae and An. albimanus genomes with synteny analysis. Conclusion: We demonstrate that lcWGS is a cost-effective and accurate method for detecting chromosome inversions. We reliably detected chromosome inversions in Ny. darlingi sample from the Brazilian Amazon that do not share similar inversion arrangements in An. gambiae or An. albimanus genomes. Keywords: Mosquito, Bioinformatics, lcWGS, Comparative Genomics, Structural Variants Background The Nyssorhynchus darlingi specie stands out as the primary vector for malaria transmission in the Neotropical region, including the Amazon basin (Nagaki et al., 2021; Rafael et al., 2021; WHO, 2023). Ny. darlingi presents a chromosomal set of 2n = 6, with two pairs of autosomes: the largest (III) is submetacentric and the smallest (II) is metacentric. For sex chromosomes, X is acrocentric and Y is pointed (Tadei et al., 1998). Previous studies have identified chromosomal inversions in Ny. darlingi populations, particularly in the Amazon region, but a comprehensive understanding of their prevalence, distribution, and functional significance is lacking. Chromosome inversions were detected in salivary glands polytene chromosomes from southeastern Brazil and the Amazon region, with inversions located in all chromosomes arms, except Y chromosome (KREUTZER et al., 1972; TADEI & SANTOS, 1982; RAFAEL et al., 2010; CORNEL et al., 2016). Structural rearrangements like inversions have long been recognized as important evolutionary mechanisms in diverse organisms (Cornel et al., 2016). In mosquitoes, chromosome inversions are associated with ecological adaptation, population divergence, and even speciation. Chromosome inversions can suppress recombination, leading to the maintenance of coadapted gene complexes and potentially facilitating the rapid evolution of traits related to vectorial capacity, such as insecticide resistance, host preference, thermal and desiccation stress resistance (Cornel et al., 2016; CORBETT-DETIG et al., 2019; AYALA et al., 2019; CHENG et al., 2018). Although the cost of per-base sequencing has decreased significantly due to advances in whole genome sequencing (WGS) technologies, conducting large-scale sequencing projects CAPÍTULO 2 - 46 ___________________________________________________________________________ can still be financially challenging and may pose significant obstacles in different contexts. An alternative cost-effective strategy is genotyping by sequencing using low-coverage whole genome sequencing (lcWGS) with genotype imputation methods that improves genomic data reliabilty for precise marker selection (Gorjanc et al., 2017; Sims et al., 2014). Genotype inference performance has been validated for both panel-based genotyping and sequencing genotyping modalities, enabling the use of lcWGS to identify variants with lower cost compared to traditional WGS method (Pasaniuc et al., 2012; Rustagi et al., 2017; Zheng et al, 2018; Li et al., 2011). In the present study, we used genome-wide SNPs from lcWGS data to investigate the population of Ny. darlingi collected in Mâncio Lima, Acre state, Brazil. We performed a comprehensive genomic analysis of chromosome inversions in Ny. darlingi and a comparative genomics analysis with Anopheles albimanus and Anopheles gambiae genomes. This study examines related species providing insights into the evolutionary dynamics across different mosquito lineages. Our findings demonstrated that lcWGS enables chromosome inversion detection and can contribute significantly to the broader understanding of mosquito genomic evolution. Methods Sequencing data acquisition For this study, we used publicly available sequencing data obtained from the National Center for Biotechnology Information – NCBI (Sayers et al., 2023). Sequencing data types were selected based on their relevance to ensure comprehensive analysis. Sequencing data from lcWGS data of 321 Ny. darlingi larvae and adults were obtained from Bioproject PRJNA683015. The samples used in this study were collected in the municipality of Mâncio Lima, Acre state of Brazil. Genome reference The Ny. darlingi, An. gambiae and An. albimanus reference genomes used are publicly available at NCBI database with respective accession numbers: GCF_943734745.1, GCF_943734735.2 and GCF_013758885.1. CAPÍTULO 2 - 47 ___________________________________________________________________________ Genotyping by sequencing and variant calling The genotyping by sequencing pipeline was performed using LCSeqTools v0.1.0 (Alvarez, 2024). Through the LCSeqTools, the read alignment was performed with Burrows- Wheeler Aligner v0.7.17 (Li & Durbin 2009) and the variant calling with SamTools v1.10 (Li, 2011) using the default mapping parameters. The LCSeqTools variant filtering parameters applied before genotype imputation were: Minor allele frequency < 0.1, max missing data < 0.5, genotype sequencing depth < 5 and genotype quality < 20. Genotypes were imputed with BEAGLE v4.1 software (Browning & Browning, 2016) using genotype normalized probability field (PL). A final filtering step was applied to the variant dataset with SNPs pruning algorithm from PLINK v1.9 software (Prucel et al., 2007). Data processing and statistical analysis All data manipulation, analysis and plotting was performed using R v4.3.2 and RStudio Server “Ocean Storm” (Team R, 2013; Team Rstudio, 2015), including packages provided by tidyverse v2.0.0 meta-package (Wichkham et al., 2019). Chromosome Inversion Identification Multiple approaches were combined to check the presence of chromosome inversion signals. Principal component analysis (PCA) for each chromosome was performed using PLINK and the sliding windows variance of SNPs weights (eigenvectors) estimates for each principal component were estimated within 10 kbp sliding windows. Based on the absolute PCA SNPs weights, the top 1% SNPs for each component were retained and the Identical by State (IBS) matrix was calculated using the respective SNPs subset. The multidimensional scaling (MDS) technique was applied to the euclidean IBS matrix with cmdscale function from R base (k=1 for genotype clustering and k=2 for pairwise comparisons) and, subsequently, a clustering technique with cmeans function from e1071 package for R (Dimitriadou et al., 2006). Only components that clustered into three well defined groups were retained, later identified as AA, AB and BB inversion genotypes. A genome-wide association test was applied for each candidate chromosome inversion, using the linear model from PLINK, considering AA, AB and BB samples as quantitative 0, 1 and 2 phenotypes, respectively. A Bonferroni post-hoc test was applied for multiple comparisons correction and SNPs were considered statistically significant SNPs CAPÍTULO 2 - 48 ___________________________________________________________________________ when adjusted p-value < 0.05. Alvarez et al (2022) showed that the expected r² value at 12.57 kbp distance is approximately 0.1 in this Ny. darlingi population. A pairwise SNPs linkage disequilibrium estimate analysis was performed for SNPs pairs within ranges of 12 kbp of distance and sliding windows r² median was calculated for each chromosome within sliding windows of 0.5% of the respective chromosome length. The Spearman's correlation coefficient ρ was estimated for r² and the mean SNPs absolute weights using cor.test function from R and were considered statistically significant when p-value < 0.05. All chromosome inversion were submitted to a pairwise Spearman correlation test using the estimated sample genotypes data and the p-values were adjusted for multiple comparison using Benjamini-Hochberg False Discovery Rate method (FDR). Correlations were considered statistically significant when FDR-adjusted p ≤ 0.05. Pipeline validation with known variants for Anopheles gambiae High coverage genome-wide variant data from the first phase of the "The Anopheles gambiae 1000 Genomes Consortium" (Anopheles Gambiae 1000 Genomes Consortium et al., 2017) publicly available at EMBL-EBI were used as reference (ERZ373588 with GCF_000005575.2 genome reference). An in silico lcWGS dataset was generated for chromosome identification pipeline validation, matching results with known An. gambiae structural variants (AgamP4 variant panel publicly available at Ensembl Metazoa). A subset of 200 samples from the Republic of Cameroon group were selected from the full variant panel and lcWGS data were generated with the art_illumina program from ART software package (HUANG et al., 2012) with approximately 2x of sequencing depth per sample. Comparative Genomics analysis An orthology inference was performed for An. gambiae and An. albimanus with Ny. darlingi as reference using BLASTp (Camacho et al., 2009) and the respective proteomes through the accession codes. The OrthoFinder software (Emms & Kelly, 2019) was also used as a second approach for orthologs inference and convergence check with BLASTp results. A genome synteny analysis was performed between the An. gambiae, An. albimanus and Ny. darlingi genomes using MCScanX (Wang et al, 2012). CAPÍTULO 2 - 49 ___________________________________________________________________________ Results Chromosome inversions detection A variant panel containing 4,241,254 SNPs was generated by the LCSeqTools pipeline for Ny. darlingi sample. After the SNPs pruning, PLINK retained 626,409 SNPs, around 3.4 SNPs per Kbp. The PCA components analysis revealed continuous high-shifted regions from SNPs weight sliding windows variance estimates, counting a total of 10 candidate regions: 2, 4 and 4 shifted regions for chromosomes X, 2 and 3, respectively (figure 1). Clustering analysis of each region revealed three well defined genetic clusters with approximately the same 0.5 genetic distance between samples AA at leftmost and AB in the middle, as well as AB and BB at rightmost (figure 2). Genome-wide association test results for each remaining component showed statistically significant SNPs concentrated in converging chromosome regions of its respective PCA shifted variance windows (figure 1). The chromosome inversion coordinate estimates are represented in table 1. The SNPs pairwise linkage disequilibrium analysis showed that, despite the median r² within a distance of 12 to 13 kbp being 0.08, 0.06 and 0.12 for chromosomes 2, 3 and X, respectively, spikes reaching almost double the median r² were found in different regions along the chromosomes (figure 1). Significant correlations were observed between r² and mean SNPs absolute weight from PCA components, as the ρ estimates were 0.62, 0.43 and 0.58 for chromosomes 2, 3 and X, respectively. In silico validation of the pipeline The An. gambiae lcWGS simulated data resulted in a variant panel containing 3,825,409 SNPs and PLINK retained 658,413 SNPs after pruning step, resulting in 2.36 SNPs per Kbp. Two meaningful components were detected with the sliding windows of SNPs weights analysis approach and, subsequently, Genome-wide association test revealed statistically significant SNPs concentrated within two regions, as coordinates estimates were [20466374,_42223401] and [19275187,_26282153] for 2L and 2R (figure 3). Both inversion estimates greatly coincided with well known An. gambiae inversions: 2La [20524057,_42165532] and 2Rb [19023924,_26758676]. Significant correlations were observed between r² and mean SNPs absolute weight from PCA components, as the ρ CAPÍTULO 2 - 50 ___________________________________________________________________________ estimates were 0.88 and 0.49 for 2L and 2R, respectively. Genome synteny analysis Comparative genomics analysis revealed similar arrangements between Ny. darlingi and An. albimanus genomes and a major rearrangement when comparing chromosome 2 and 3 from An. gambiae and Ny. darlingi (figure 4). Synteny analysis for each chromosome inversion region showed that no similarity was found between Ny. darlingi detected inversions and An. gambiae known chromosome inversions (figure 5). No data was available comparing An. albimanus inversions. Discussion First of all, it is important to know that there is a divergence between the Rafael & Tadei (1998) chromosome nomenclature for Ny. darlingi and the GCF_943734745.1 genome assembly, as the chromosome 2 and 3 names are swapped in the genome assembly. The following discussion is based on the genome assembly nomenclature. This study shows that lcWGS provides a powerful cost-reduced method for detecting chromosome inversions in large sample sizes. The in silico experiment reinforces that, besides the higher genotyping error probability of lcWGS, high confidence inversion signal detection is possible because multiple independent SNPs are reduced into a single genotype data with PCA technique, reducing any lcWGS genotyping bias. Nowling et al (2020) performed a similar approach with traditional WGS data for An. gambiae dataset and found a comparable performance for PCA-based analysis on chromosome inversion detection. Although there are various pipelines for detecting chromosome inversion using SNPs or different genetic parameters as they summarized, a simple sliding window analysis is enough to detect inversion footprints if an in-depth eigenvectors analysis is performed. The An. gambiae inversion range estimates from lcWGS data in this study were satisfactorily close to the known ranges and ranges estimated from other dedicated tools (Nowling et al, 2022). Even on genome drafts in which chromosome data is not well assembled, it is possible to use PCA clustering to detect inversion presence because eigenvectors are estimated essentially with a set of independent SNPs, so contiguity isn’t needed for detection. This is an important feature for non-model organisms studies, but the CAPÍTULO 2 - 51 ___________________________________________________________________________ more shattered the assembly is, the more difficult it will be to estimate inversion coordinate range. There is a significant correlation between linkage disequilibrium and absolute PCA eigenvectors, although PCA-based seems to be a more precise method (Cáceres & González, 2015). The observed linkage disequilibrium oscillation towards the inversion breakpoints is a known effect caused by the change in the recombination rate (Seich al Basatena et al, 2013). The genome synteny analysis revealed an interesting rearrangement between arms of chromosomes 2 and 3 for Ny. darlingi and An. albimanus when compared to An. gambiae genome. This is a known phenomenon evidenced by other comparative genomics studies focusing on Anophelinae, which have consistently reported analogous chromosomal rearrangements (Wei et al, 2017; Jiang et al, 2014). The ten chromosome inversions detected in Ny. darlingi genome did not show similar ranges or arrangements for known An. gambiae inversions in the figure 5. Although An. albimanus revealed some interesting equivalent syntenic regions for the chromosome inversions, An. albimanus typically lacks chromosome inversions (Artemov et al, 2016). Unfortunately, during the period of this study, there was no publicly accessible sequencing data for An. albimanus that would allow for a comprehensive analysis of inversion footprints using the methodology employed in this study. Soboleva et al (2024) also found two nested X chromosome inversions in Anopheles atroparvus and Anopheles messeae, although the ranges seem to be slightly different. Conclusion Chromosomal inversions are an important biological marker related to genome evolution, environment adaptation and speciation. Our study showed that lcWGS is particularly promising for large-scale genomic studies, offering a cost-effective alternative strategy to detect chromosome inversions without compromising accuracy. The sliding window analysis of PCA eigenvectors to detect inversions ensures accurate inversion detection and range estimation with a simple standalone approach. Ten chromosome inversions were found using lcWGS data from Ny. darlingi samples collected in one municipality located in the Brazilian Amazon basin. Also, in-silico simulations showed that the pipeline has enough sensitivity for accurate inversion detection. This could be an effective strategy for comprehending the population dynamic on a much larger scale. Availability of data and material CAPÍTULO 2 - 52 ___________________________________________________________________________ All documentation, including scripts and links necessary to replicate the study’s findings or to further extend the research, is available at: • https://github.com/marcusnizalvarez/PhD-Scripts • https://doi.org/10.5281/zenodo.13755768 . References Alvarez M.V.N. LCSeqTools v0.1.0-alpha. 2024. Zenodo. https://doi.org/10.5281/zenodo.11267121 Alvarez, M. V. N., Alonso, D. P., Kadri, S. M., Rufalco-Moutinho, P., Bernardes, I. A. F., de Mello, A. C. F., ... & Ribolla, P. E. (2022). Nyssorhynchus darlingi genome-wide studies related to microgeographic dispersion and blood-seeking behavior. Parasites & Vectors, 15(1), 106. Anopheles gambiae 1000 Genomes Consortium. (2017). Genetic diversity of the African malaria vector Anopheles gambiae. Nature, 552(7683), 96. Artemov, G. N., Peery, A. N., Jiang, X., Tu, Z., Stegniy, V. N., Sharakhova, M. V., & Sharakhov, I. V. (2017). The physical genome mapping of Anopheles albimanus corrected scaffold misassemblies and identified interarm rearrangements in genus Anopheles. G3: Genes, Genomes, Genetics, 7(1), 155-164. Browning, B. L., & Browning, S. R. (2016). Genotype imputation with millions of reference samples. The American Journal of Human Genetics, 98(1), 116-126. Cheng, C., Tan, J. C., Hahn, M. W., & Besansky, N. J. (2018). Systems genetic analysis of inversion polymorphisms in the malaria mosquito Anopheles gambiae. Proceedings of the National Academy of Sciences, 115(30), E7005-E7014. Cáceres, A., & González, J. R. (2015). Following the footprints of polymorphic inversions on SNP data: from detection to association tests. Nucleic acids research, 43(8), e53-e53. Camacho, C., Coulouris, G., Avagyan, V., Ma, N., Papadopoulos, J., Bealer, K., & Madden, T. L. (2009). BLAST+: architecture and applications. BMC bioinformatics, 10, 1- 9. Cornel, A. J., Brisco, K. K., Tadei, W. P., Secundino, N. F., Rafael, M. S., Galardo, A. K., ... & Lanzaro, G. C. (2016). Anopheles darlingi polytene chromosomes: revised maps including newly described inversions and evidence for population structure in Manaus. https://doi.org/10.5281/zenodo.13755768 https://github.com/marcusnizalvarez/PhD-Scripts CAPÍTULO 2 - 53 ___________________________________________________________________________ Memórias do Instituto Oswaldo Cruz, 111, 335-346. Dimitriadou, E., Hornik, K., Leisch, F., Meyer, D., Weingessel, A., & Leisch, M. F. (2006). The e1071 package. Misc Functions of Department of Statistics (e1071), TU Wien, 297-304. Emms, D. M., & Kelly, S. (2019). OrthoFinder: phylogenetic orthology inference for comparative genomics. Genome biology, 20, 1-14. Gorjanc, G., Dumasy, J. F., Gonen, S., Gaynor, R. C., Antolin, R., & Hickey, J. M. (2017). Potential of low‐coverage genotyping‐by‐sequencing and imputation for cost‐effective genomic selection in biparental segregating populations. Crop Science, 57(3), 1404-1420. Huang, W., Li, L., Myers, J. R., & Marth, G. T. (2012). ART: a next-generation sequencing read simulator. Bioinformatics, 28(4), 593-594. Jiang, X., Peery, A., Hall, A. B., Sharma, A., Chen, X. G., Waterhouse, R. M., ... & Tu, Z. (2014). Genome analysis of a major urban malaria vector mosquito, Anopheles stephensi. Genome biology, 15, 1-18. Li, H. (2011). A statistical framework for SNP calling, mutation discovery, association mapping and population genetical parameter estimation from sequencing data. Bioinformatics, 27(21), 2987-2993. Li, Y., Sidore, C., Kang, H. M., Boehnke, M., & Abecasis, G. R. (2011). Low- coverage sequencing: implications for design of complex trait association studies. Genome research, 21(6), 940-951. Li, H., & Durbin, R. (2009). Fast and accurate short read alignment with Burrows– Wheeler transform. bioinformatics, 25(14), 1754-1760. MINISTRY OF HEALTH OF BRAZIL. Epidemiological bulletin SVS 35. 2020. Available at: https://antigo.saude.gov.br/images/pdf/2019/novembro/20/Boletim- epidemiologico-SVS-35.pdf. Nagaki, S. S., Chaves, L. S., López, R. V. M., Bergo, E. S., Laporta, G. Z., Conn, J. E., & Sallum, M. A. M. (2021). Host feeding patterns of Nyssorhynchus darlingi (Diptera: Culicidae) in the Brazilian Amazon. Acta tropica, 213, 105751. Nowling, R. J., Manke, K. R., & Emrich, S. J. (2020). Detecting inversions with PCA in the presence of population structure. PLoS One, 15(10), e0240429. Pasaniuc, B., Rohland, N., McLaren, P. J., Garimella, K., Zaitlen, N., Li, H., ... & Price, A. L. (2012). Extremely low-coverage sequencing and imputation increases power for genome-wide association studies. Nature genetics, 44(6), 631-635. Pimenta, P. F., Orfano, A. S., Bahia, A. C., Duarte, A. P., Ríos-Velásquez, C. M., CAPÍTULO 2 - 54 ___________________________________________________________________________ Melo, F. F., ... & Secundino, N. F. (2015). An overview of malaria transmission from the perspective of Amazon Anopheles vectors. Memórias do Instituto Oswaldo Cruz, 110, 23- 47. Purcell, S., Neale, B., Todd-Brown, K., Thomas, L., Ferreira, M. A., Bender, D., ... & Sham, P. C. (2007). PLINK: a tool set for whole-genome association and population-based linkage analyses. The American journal of human genetics, 81(3), 559-575. R Core Team. (2016). R development core team. R: A language and environment for statistical computing, 55, 275-286. Rafael, M. S., Bridi, L. C., Sharakhov, I. V., Marinotti, O., Sharakhova, M. V., Timoshevskiy, V., ... & Tadei, W. P. (2021). Physical mapping of the Anopheles (Nyssorhynchus) darlingi genomic scaffolds. Insects, 12(2), 164. Rustagi, N., Zhou, A., Watkins, W. S., Gedvilaite, E., Wang, S., Ramesh, N., ... & Xing, J. (2017). Extremely low-coverage whole genome sequencing in South Asians captures population genomics information. BMC genomics, 18, 1-12. Sayers, E. W., Bolton, E. E., Brister, J. R., Canese, K., Chan, J., Comeau, D. C., ... & Sherry, S. T. (2023). Database resources of the National Center for Biotechnology Information in 2023. Nucleic acids research, 51(D1), D29. Seich al Basatena, N. K., Hoggart, C. J., Coin, L. J., & O’Reilly, P. F. (2013). The effect of genomic inversions on estimation of population genetic parameters from SNP data. Genetics, 193(1), 243-253. Sims, D., Sudbery, I., Ilott, N. E., Heger, A., & Ponting, C. P. (2014). Sequencing depth and coverage: key considerations in genomic analyses. Nature Reviews Genetics, 15(2), 121-132. Soboleva, E. S., Kirilenko, K. M., Fedorova, V. S., Kokhanenko, A. A., Artemov, G. N., & Sharakhov, I. V. (2024). Two Nested Inversions in the X Chromosome Differentiate the Dominant Malaria Vectors in Europe, Anopheles atroparvus and Anopheles messeae. Insects, 15(5), 312. Tadei, W. P., & Santos, J. M. M. D. (1982). Biologia de anofelinós amazônicos. VII. Estudo da variação de frequências das inversões cromossômicas de Anopheles darlingi Root (Diptera, Culicidae). Acta Amazonica, 12, 759-785. Tadei, W. P., Thatcher, B. D., Santos, J. M. M., Scarpassa, V. M., Rodrigues, I. B., & Rafael, M. S. (1998). Ecologic observations on anopheline vectors of malaria in the Brazilian Amazon. Journal Of Tropical Medicine And Hygiene. 59(2), 325-335. Team, R. (2015). RStudio: integrated development for R. RStudio. Inc., Boston, MA, CAPÍTULO 2 - 55 ___________________________________________________________________________ 700, 879. Wang, Y., Tang, H., DeBarry, J. D., Tan, X., Li, J., Wang, X., ... & Paterson, A. H. (2012). MCScanX: a toolkit for detection and evolutionary analysis of gene synteny and collinearity. Nucleic acids research, 40(7), e49-e49. Wei, Y., Cheng, B., Zhu, G., Shen, D., Liang, J., Wang, C., ... & Xia, A. (2017). Comparative physical genome mapping of malaria vectors Anopheles sinensis and Anopheles gambiae. Malaria Journal, 16, 1-10. Wickham, H., Averick, M., Bryan, J., Chang, W., McGowan, L. D. A., François, R., ... & Yutani, H. (2019). Welcome to the Tidyverse. Journal of open source software, 4(43), 1686. WORLD HEALTH ORGANIZATION et al. World malaria report 2023. World Health Organization, 2023. Zheng, C., Boer, M. P., & van Eeuwijk, F. A. (2018). Accurate genotype imputation in multiparental populations from low-coverage sequence. Genetics, 210(1), 71-82. Figures and tables Figure 1. Chromosome Inversion Detection Analysis for Ny. darlingi dataset. PCA Weights Var: Sliding window of SNPs weights variance estimate. ρ: Spearman correlation coefficient for r² and mean absolute SNPs weights. *: Correlation test p value < 0.05. Legend stands for Chromosome: Principal Component. Horizontal colored segments represent the inversion coordinate estimates described in table 1. Horizontal scale in Mbp. CAPÍTULO 2 - 56 ___________________________________________________________________________ Table 1. Ny. darlingi chromosome inversion coordinate estimates based on Genome-wide association test. Chromosome Accession Code Component Start End Length 2 NC_064874.1 C1 71998919 84371386 12372467 2 NC_064874.1 C3 32875093 39167094 6292001 2 NC_064874.1 C4 19048046 22354438 3306392 2 NC_064874.1 C5 38207551 45614642 7407091 3 NC_064875.1 C1 5786821 22248013 16461192 3 NC_064875.1 C3 60944934 67169663 6224729 3 NC_064875.1 C4 47375561 54594956 7219395 3 NC_064875.1 C5 21879305 26791351 4912046 X NC_064873.1 C1 169219 12292796 12123577 X NC_064873.1 C2 10405329 12856594 2451265 Component: Principal component from PCA analysis. CAPÍTULO 2 - 57 ___________________________________________________________________________ Figure 2. Inversion Genotyping Analysis. Top biplots display in two dimensions (MDS with k=2) the relative distances between samples based on genetic similarity from the top 1% most representative SNPs for each respective inversion. ²:𝜒 Chi-square test between genotypes from two respective inversions. : Spearman correlation coefficient for genotypes⍴ states of both inversions. Bottom line plots represent in one dimension the density of samples grouping for each genetic cluster for each inversion. MAF: Minor inversion frequency. F: Inversion fixation index. CAPÍTULO 2 - 58 ___________________________________________________________________________ Figure 3. Chromosome Inversion Detection Analysis for An. gambiae dataset. PCA Weights Var: Sliding window of SNPs weights variance estimate. ρ: Spearman correlation coefficient for r² and mean absolute SNPs weights. *: Correlation test p value < 0.05. Legend stands for Chromosome: Principal Component. Horizontal scale in Mbp. CAPÍTULO 2 - 59 ___________________________________________________________________________ Figure 4. Genome Synteny Analysis for Ny. darlingi with An. gambiae and An. albimanus. Black arrow inside each segment represents the original reference orientation. CAPÍTULO 2 - 60 ___________________________________________________________________________ Figure 5. Chromosome Inversion Synteny Analysis. Green color: Ny. darlingi. Blue color: An. gambiae. Red color: An. albimanus. Upper left label indicates the chromosome and the i-th principal component in which chromosome inversion was detected for Ny. darlingi. Known An. gambiae inversions for chromosome 2 are represented with the black line segments outside the circles. Black arrow inside each segment represents the original reference orientation. Only chromosomes with at least one link were retained in each plot. CONSIDERAÇÕES FINAIS - 61 ___________________________________________________________________________ 3. CONSIDERAÇÕES FINAIS O sequenciamento genômico completo de baixa cobertura aliado ao tratamento de dados adequado e imputação de genótipos, como é realizado pelo programa LCSeqTools, produz resultados confiáveis e satisfatórios para projetos em larga escala. Ganhos significativ