SÃO PAULO STATE UNIVERSITY– UNESP Institute of Biosciences, Humanities and Exact Sciences - São José do Rio Preto Campus DANIEL SIQUEIRA DE OLIVEIRA CONTRIBUTION OF TRANSPOSABLE ELEMENTS TO THE HOST SHIFT OF CACTOPHILIC Drosophila SPECIES São José do Rio Preto 2024 DANIEL SIQUEIRA DE OLIVEIRA CONTRIBUTION OF TRANSPOSABLE ELEMENTS TO THE HOST SHIFT OF CACTOPHILIC Drosophila SPECIES Thesis presented under Joint PhD Program between São Paulo State University (UNESP), Institute of Biosciences, Humanities and Exact Sciences of São José do Rio Preto Campus, and Université Claude Bernard Lyon 1 (UCBL), for graduating as PhD in Biosciences (UNESP) and Biomath, Bioinfo, and Evolutionary Biology (UCBL) Concentration Area: Genetics and Evolutionary Biology Supervisor: Prof. Dr Claudia Marcia Aparecida Carareto Supervisor: Prof. Dr Cristina Vieira São José Do Rio Preto – Brazil Lyon - France 2024 O48c Oliveira, Daniel Siqueira de Contribution of transposable elements to the host shift of cactophilic Drosophila species/ Daniel Siqueira de Oliveira. -- São José do Rio Preto, 2024 139 p. : il. Tese (doutorado com dupla titulação) - Universidade Estadual Paulista (Unesp), Instituto de Biociências Letras e Ciências Exatas, São José do Rio Preto e Université de Lyon. Supervisora: Claudia Marcia Aparecida Supervisora: Cristina Vieira 1. Elementos de Transposição. 2. Mudança de hospedeiro. 3. Adaptação 4. Transcritos quiméricos I. Título. Sistema de geração automática de fichas catalográficas da Unesp. Biblioteca do Instituto de Biociências Letras e Ciências Exatas, São José do Rio Preto. Dados fornecidos pelo autor(a). Essa ficha não pode ser modificada. DANIEL SIQUEIRA DE OLIVEIRA CONTRIBUTION OF TRANSPOSABLE ELEMENTS TO THE HOST SHIFT OF CACTOPHILIC Drosophila SPECIES Thesis presented under Joint PhD Program between São Paulo State University (UNESP), Institute of Biosciences, Humanities and Exact Sciences of São José do Rio Preto Campus, and Université Claude Bernard Lyon 1 (UCBL), for graduating as PhD in Biosciences (UNESP) and Biomath, Bioinfo, and Evolutionary Biology (UCBL). Concentration Area: Genetics and Evolutionary Biology Defense date: 13/12/2024 Jury: ______________________________________ Prof. Dr Claudia Marcia Aparecida Carareto UNESP – Institute of Biosciences, Humanities and Exact Sciences - São José do Rio Preto ______________________________________ Prof. Dr Cristina VIEIRA Université Claude Bernard - Lyon 1 ______________________________________ Prof. Dr Josefa GONZÁLEZ Pérez Institut Botànic de Barcelona, CSIC, CMCNB ______________________________________ Prof. Dr Élgion Lúcio da Silva LORETO Santa Maria federal University ______________________________________ Prof. Dr Lilian MADI-RAVAZZI UNESP – Institute of Biosciences, Humanities and Exact Sciences - São José do Rio Preto ______________________________________ Prof. Dr Emmanuel DESOUHANT Université Claude Bernard - Lyon 1 ______________________________________ Prof. Dra Patrícia BELDADE University of Lisbon ______________________________________ Prof. Dr. Gabriel da Luz WALLAU Oswaldo Cruz Foundation THÈSE de DOCTORAT DE L’UNIVERSITÉ DE LYON opérée par l’Université Claude Bernard – Lyon 1 École doctorale 341: Évolution Écosystèmes, Microbiologie, et Modélisation (E2M2) Spécialité de doctorat: Biomath – Bioinfo – Génomique Évolutive À soutenir publiquement à São José do Rio Preto, São Paulo, Brésil le 13/12/2024, par: Daniel Siqueira de Oliveira Contribution des éléments transposables au changement d’hôte des espèces de Drosophila cactophiles Devant le jury compose de: VIEIRA, Cristina, Professeure à l’Université Claude Bernard - Lyon 1 (France): Directrice de thèse CARARETO, Claudia Marcia Aparecida, Professeure à Universidade estadual Paulista (Brésil): Directrice de thèse GONZÁLEZ, Josefa Pérez, Chercheure à Institut Botànic de Barcelona, CSIC, CMCNB. (Espagne): Examinatrice LORETO, Élgion Lúcio da Silva, Professeur à Universidade federal de Santa Maria (Brésil): Examinateur MADI-RAVAZZI, Lilian, Professeure à Universidade estadual Paulista (Brésil): Presidente du jury/Examinatrice DESOUHANT, Emmanuel, Professeur à l’Université Claude Bernard - Lyon 1 (France): Examinateur BELDADE Patrícia, Professeure à Universidade de Lisboa (Portugal): Rapporteure WALLAU Gabriel da Luz - Fundação Oswaldo Cruz (Brésil): Rapporteur AGRADECIMENTOS Agradeço profundamente à minha mãe, Marinês Siqueira de Oliveira, e ao meu pai, João Alfredo Maciel de Oliveira, que sempre acreditaram no meu potencial e nunca mediram esforços para me apoiar na busca pelos meus objetivos. A integridade que testemunhei em cada um de vocês e a educação que me proporcionaram foram fundamentais para a realização desta tese. Sem o exemplo de perseverança, ética e dedicação que recebi em casa, esta trajetória não teria sido possível. Agradeço também aos meus irmãos, Marielle Oliveira e Émerson Oliveira, que sempre estiveram ao meu lado me apoiando em cada etapa do meu desenvolvimento acadêmico. À família, sou imensamente grato por tudo o que me ensinaram e por terem sido meus grandes incentivadores ao longo deste caminho. À minha companheira de todos os momentos, Camila Militz Pereira, minha profunda gratidão por sempre estar ao meu lado, me apoiando incondicionalmente sem hesitar em acreditar em mim. Obrigado pelo carinho, amor e incentivo constantes. Mesmo nos momentos mais desafiadores, obrigado por ser minha força e o meu refúgio. O seu suporte incessante foi imprescindível para nesta jornada. Não há palavras que possam descrever precisamente minha gratidão pelas minhas excelentíssimas orientadoras, professoras Cristina Vieira e Claudia Carareto. Vocês me ensinaram o verdadeiro significado de excelência profissional. Agradeço profundamente pela dedicação do seu tempo e paciência, sempre balanceando uma orientação cuidadosa com estímulo à minha autonomia. A nítida satisfação que ambas possuem em fazer ciência foi o combustível para que eu pudesse desenvolver este trabalho. Agradeço imensamente por todas as oportunidades, ensinamentos, incentivo e confiança desde o começo do desenvolvimento desta tese. Vocês sempre serão uma grande inspiração para mim. Agradeço também aos tantos amigos que tive o prazer de conhecer no ambiente acadêmico. Obrigado Maryanna Simão por ter facilmente me convencido a trabalhar com o modelo de D. mojavensis. Agradeço a amizade, apoio e parceria com os colegas da UNESP, William Nunes, Bianca Manfré e Ana Pires, foi enriquecedor dividir o lab com vocês. Aos amigos da UCBL, Miriam Merenciano, Camille Mayeux e Marius Poulain, obrigado por tornarem minha vida em Lyon muito mais leve. O “best office” do edifício Mendel sempre será lembrado com muito carinho. Aos demais amigos e colegas de Lyon, Anaïs Larue, Chloe Garambois, Tomás Carrasco e Alicia Ramirez, muito obrigado pelo convívio e gestos de apoio e encorajamento. Vocês foram um pilar essencial na minha formação, trazendo leveza e alegria ao longo do caminho. Aos PIs que conheci em Lyon e tive o prazer de dividir o ambiente de trabalho, Rita Rebollo, Marie Fablet e Matthieu Boulesteix, obrigado pela maneira excepcional que vocês possuem em compartilhar o seu valioso conhecimento científico. Além disso, é notável e exemplar como vocês trabalham sinergicamente com estudantes. Aprendi muito sobre soft skills de maneira indireta, apenas observando-os, um aprendizado que certamente levarei para o meu futuro profissional. O presente trabalho foi realizado com apoio da Coordenação de Aperfeiçoamento de Pessoal de Nível Superior – Brasil (CAPES) – Código de Financiamento 001. Agradeço à FAPESP pelo apoio financeiro, concedido por meio do Processo nº 2020/06238-2, Fundação de Amparo à Pesquisa do Estado de São Paulo (FAPESP); ao CNPq, sob o processo 308020/2021- 9; à CAPES sob processo 88887.716810/2022-00, e à Agence Nationale de la Recherce (ANR) sob o processo ANR-14-CE19-0016-01. Agradeço a CAPES e ao Programa Eiffel (Campus France) pela concessão de bolsas de estudo para internacionalização, que foram de grande valia para a realização deste trabalho. ABSTRACT Transposable elements (TEs) are dynamic components of genomes that drive structural variation and serve as sources of genetic diversity, contributing significantly to evolutionary processes. During transcription, TEs can interact with genes incorporating their sequence into gene’s mRNA through transcription initiation/termination and alternative splicing, named as gene-TE chimeric transcripts. In this thesis, we developed ChimeraTE, a novel bioinformatics pipeline designed to detect gene-TE chimeras, using paired-end RNA-seq data through two complementary modes: a reference-guided mode for canonical genome alignment and a reference-free mode that identifies chimeras without the need for a reference genome. Using Drosophila melanogaster strains, we validated ChimeraTE and discovered that approximately 1.12% of genes generate chimeric transcripts, with around 23% of these TEs absent from the reference genome, suggesting recent or polymorphic insertions contributing to transcriptome variability. We used this approach to investigate the role of TEs in the adaptive evolution and host-shifting of cactophilic Drosophila species, focusing on their interactions with genes associated with host location, acceptance, and usage. Since host shift is a crucial evolutionary process that can lead to reproductive isolation and speciation, we conducted genomic and transcriptomic analyses across seven cactophilic Drosophila genomes to explore how TEs contribute to host-shift-related gene regulation and expression. We identified an enrichment of TEs at promoter regions of host shift-associated genes, with Helitrons representing about 60% of these cases, indicating a likely cis-regulatory role of TEs in influencing gene expression. Differential expression analyses in species with different host preferences revealed notable divergence in gene expression profiles in larvae and head tissues, which are critical to host recognition and adaptation. Although the overall impact of TEs on gene expression levels appears minimal, approximately 1.31% of genes, including those linked to host preference, generate chimeric transcripts, pointing out a direct influence of TEs on transcript diversity. The findings provide compelling evidence that TEs play a role in the adaptive evolution of host shift, fostering genetic and transcriptomic changes that may facilitate ecological specialization and reproductive isolation. This thesis underscores the evolutionary significance of TEs in promoting genetic novelty and highlights their potential to drive adaptive divergence in natural populations of cactophilic Drosophila. Key-words: Transposable elements; Host shift; Adaptation; Chimeric transcripts RESUMO Elementos de transposição (TEs) são componentes dinâmicos dos genomas que geram variantes estruturais e servem como fontes de diversidade genética, contribuindo significativamente para processos evolutivos. Durante a transcrição, os TEs podem interagir com genes, incorporando sua sequência no mRNA do gene por meio de iniciação/terminação da transcrição e splicing alternativo, formando os chamados transcritos quiméricos gene-TE. Nesta tese, desenvolvemos o ChimeraTE, uma nova pipeline de bioinformática projetada para detectar quimeras gene-TE, utilizando dados de RNA-seq paired-end por meio de dois modos complementares: um modo guiado por referência para alinhamento genômico canônico e um modo independente de referência, que identifica quimeras sem necessidade de um genoma de referência. Utilizando linhagens de Drosophila melanogaster, validamos o ChimeraTE e descobrimos que aproximadamente 1,12% dos genes geram transcritos quiméricos, com cerca de 23% destes TEs ausentes do genoma de referência, sugerindo que inserções recentes ou polimórficas que contribuem para a variabilidade do transcriptoma. Utilizamos essa abordagem para investigar o papel dos TEs na evolução adaptativa e na troca de hospedeiro de espécies cactofílicas de Drosophila, com foco em suas interações com genes associados à localização, aceitação e uso de hospedeiros. Considerando que a mudança de hospedeiro é um processo evolutivo que pode levar ao isolamento reprodutivo e à especiação, conduzimos análises genômicas e transcriptômicas em sete genomas de espécies de Drosophila cactofílicas para explorar como os TEs contribuem para a regulação e expressão de genes relacionados à mudança de hospedeiro. Identificamos um enriquecimento de TEs em regiões promotoras de genes associados à mudança de hospedeiro, sendo que os Helitrons representam cerca de 60% desses casos, indicando um provável papel cis-regulador dos TEs na influência sobre a expressão gênica. Análises de expressão diferencial entre espécies com diferentes preferências de hospedeiro revelaram uma notável divergência nos perfis de expressão gênica em tecidos de larvas e cabeça, que são essenciais para o reconhecimento e adaptação ao hospedeiro. Embora o impacto geral dos TEs nos níveis de expressão gênica pareça ser mínimo, aproximadamente 1,31% dos genes, incluindo aqueles ligados à preferência de hospedeiro, geram transcritos quiméricos, apontando para uma influência direta dos TEs na diversidade de transcritos. Os resultados fornecem evidências convincentes de que os TEs desempenham um papel na evolução adaptativa da troca de hospedeiro, promovendo mudanças genéticas e transcriptômicas que podem facilitar a especialização ecológica e o isolamento reprodutivo. Esta tese destaca a importância evolutiva dos TEs na promoção de novidades genéticas e enfatiza seu potencial em impulsionar a divergência adaptativa em populações naturais de Drosophila cactofílicas. Palavras-chave: Elementos de Transposição; Mudança de hospedeiro; Adaptação; Transcritos quiméricos. RÉSUMÉ Les éléments transposables (TEs) sont des composants dynamiques des génomes qui génèrent des variantes structurales et servent de sources de diversité génétique, contribuant de manière significative aux processus évolutifs. Lors de la transcription, les TEs peuvent interagir avec des gènes en intégrant leur séquence dans l’ARNm du gène par des mécanismes d’initiation/terminaison de la transcription et d’épissage alternatif, formant ainsi des transcrits chimériques gene-TE. Dans cette thèse, nous avons développé ChimeraTE, une nouvelle pipeline bioinformatique conçue pour détecter les chimères gene-TE, utilisant des données de RNA-seq paired-end à travers deux modes complémentaires : un mode guidé par référence pour l’alignement génomique canonique et un mode indépendant de référence, qui identifie les chimères sans besoin de génome de référence. En utilisant des lignées de Drosophila melanogaster, nous avons validé ChimeraTE et découvert qu’environ 1,12% des gènes produisent des transcrits chimériques, avec environ 23% de ces TEs absents du génome de référence, suggérant des insertions récentes ou polymorphiques contribuant à la variabilité du transcriptome. Nous avons utilisé cette approche pour étudier le rôle des TEs dans l’évolution adaptative et la transition d’hôte des espèces de Drosophila cactophiles, en nous concentrant sur leurs interactions avec les gènes associés à la localisation, l’acceptation et l’utilisation de l’hôte. Étant donné que le changement d’hôte est un processus évolutif crucial pouvant conduire à l’isolement reproductif et à la spéciation, nous avons mené des analyses génomiques et transcriptomiques sur sept génomes d’espèces de Drosophila cactophiles pour explorer comment les TEs contribuent à la régulation et à l’expression des gènes liés au changement d’hôte. Nous avons identifié un enrichissement de TEs dans les régions promotrices des gènes associés au changement d’hôte, les Helitrons représentant environ 60% de ces cas, ce qui indique un rôle cis-régulateur probable des TEs dans l’influence sur l’expression génique. Des analyses d’expression différentielle entre des espèces avec différentes préférences d’hôte ont révélé une divergence notable dans les profils d’expression génique dans les tissus de larves et de tête, essentiels à la reconnaissance et à l’adaptation à l’hôte. Bien que l’impact global des TEs sur les niveaux d’expression génique semble minime, environ 1,31% des gènes, y compris ceux liés à la préférence d’hôte, génèrent des transcrits chimériques, soulignant une influence directe des TEs sur la diversité des transcrits. Les résultats fournissent des preuves convaincantes que les TEs jouent un rôle dans l’évolution adaptative liée au changement d’hôte, en favorisant des modifications génétiques et transcriptomiques qui peuvent faciliter la spécialisation écologique et l’isolement reproductif. Cette thèse souligne l’importance évolutive des TEs dans la promotion de nouvelles variations génétiques et met en avant leur potentiel à stimuler la divergence adaptative dans les populations naturelles de Drosophila cactophiles. Mots clés: Eléments transposable; Changement de hôte ; Adaptation ; Transcripts chimériques. TABLE OF CONTENTS 1 INTRODUCTION .......................... 15 2 OBJECTIVES ........................ 29 3 CHIMERATE: A PIPELINE TO DETECT CHIMERIC TRANSCRIPTS DERIVED FROM GENES AND TRANSPOSABLE ELEMENTS ........................ 30 4 TRANSPOSABLE ELEMENTS AS EVOLUTIONARY DRIVING FORCE TO HOST SHIFT IN CACTOPHILIC DROSOPHILA SPECIES ........................ 60 5 DISCUSSION ...................... 107 6 CONCLUSIONS ....................... 112 REFERENCES ........................ 113 APPENDICES ...................... 128 15 INTRODUCTION Transposable elements as engines of molecular evolution Transposable elements (TEs) are short DNA sequences capable of moving from one genomic site to another. They have been identified in virtually all eukaryotic genomes studied to date. In insects, their abundance ranges from 0.12% in the Antarctic midge B. antartica (KELLEY et al., 2014) to 72% in the weevil Sitophilus oryzae (PARISOT et al., 2021). TEs are classified into two major classes based on their transposition mechanism, and further divided into respective families and subfamilies based on their structure and sequence homology (WICKER et al., 2007). Class I elements, or retrotransposons, transpose via an RNA intermediate and the reverse transcriptase enzyme through a process known as “copy and paste”. These include long terminal repeat (LTR)/endogenous retrovirus (ERV) elements, non- LTR retrotransposons such as long interspersed nuclear elements (LINEs), and non- autonomous elements like short interspersed nuclear elements (SINEs). Class II elements, or DNA transposons, move by a “cut and paste” mechanism facilitated by the transposase enzyme. TEs were first discovered by Barbara McClintock in the 1950s (MCCLINTOCK, 1951), and for many years, they were regarded by the scientific community as “junk DNA” (OHNO, 1972). Their ability to move across the genome led to their classification as “genomic parasitic elements”. The mutations caused by transposition can have severe impacts on host fitness, particularly if they disrupt genes. The insertions that remain in the genome are likely to be neutral, mildly deleterious and rarely become adaptive (MONTGOMERY; CHARLESWORTH; LANGLEY, 1987). Since mobilization is a mutagenic process producing structural variants, several host mechanisms have evolved to prevent it. These mechanisms include chromatin remodeling factors, DNA methylation, and small RNAs (SLOTKIN; MARTIENSSEN, 2007). For example, in Drosophila melanogaster germline, TEs are targeted by piwi-interacting RNAs (piRNAs), which not only mediate the cleavage of TE transcripts but also promote the deposition of repressive chromatin marks at the TE insertion sites, effectively preventing further transcription (FABRY et al., 2021). In Arabidopsis thaliana, most TEs are silenced through cytosine methylation, preventing their activation and spread. However, environmental or genomic stress can sometimes reactivate TEs, creating bursts of transposition that introduce genetic variability (ROQUIS et al., 2021). Over the past two decades, advances in biotechnology, genomic analyses and the extensive sequencing of genomes have accumulated evidence about the role of TEs in genome evolution. Functional genomic studies have demonstrated that TEs not only influence gene 16 transcription but are also essential for maintaining chromatin structure due to their ability to recruit chromatin proteins (HORVÁTH; MERENCIANO; GONZÁLEZ, 2017; REBOLLO; ROMANISH; MAGER, 2012). Consequently, TEs have been implicated in several key evolutionary processes, such as the structural evolution of genomes through chromosomal rearrangements (HUGHES; COFFIN, 2001; KAZAZIAN, 2004); co-option of regulatory sequences that control nearby gene expression (JORDAN et al., 2003); integration into coding sequences at the transcript level (GOTEA; MAKALOWSKI, 2006; LIPATOV et al., 2005; PIRIYAPONGSA et al., 2007; WANG et al., 2007); alterations in the transcriptome driven by environmental changes (HORVÁTH; MERENCIANO; GONZÁLEZ, 2017; MOSCHETTI et al., 2020; RECH et al., 2019); and the recruitment of TE-derived genes by host genomes for essential functions (VOLFF, 2006). The recurrent interaction between genes and transposable elements The specific location of a TE insertion in the genome often determines its evolutionary fate in a population. In response, many TEs have preferential insertional sites to integrate themselves into genomic regions where their propagation is more likely to succeed. It is well- known that TEs are not randomly distributed in the genome, being often found at gene-poor regions such as heterochromatin, pericentromeric, and telomeric regions (BOURQUE et al., 2018). These regions also have low recombination rates, which minimizes the harmful effects that TEs can have by reducing the likelihood of disrupting essential genes (BOURGEOIS; BOISSINOT, 2019). In the arms race against host control, some TEs also show a tendency to insert themselves in euchromatic regions. For instance, the retrotransposons in mold and yeast have evolved to target regions upstream of RNA polymerase III-transcribed genes (CHEUNG; MANHAS; MEASDAY, 2018). This strategy allows them to be transcribed independently without disrupting host gene expression, ensuring their own propagation while minimizing damage to the host genome (SPALLER et al., 2017). The insertion site nearby genes can also persist without a propagation-related cause. In the clonal raider ant, Ooceraea biroi, TEs are enriched nearby odorant receptor (OR) genes, possibly facilitating gene duplications via unequal crossing over (MCKENZIE; KRONAUER, 2018). This expansion of OR duplicates in ants is associated to pheromone detection, and it may be crucial for the evolution of complex social behaviors. Silenced TEs tend to accumulate mutations over time, losing their mobility but potentially keeping functional protein domains/motifs (MILLER et al., 2000). Due to such potential, a TE copy inserted nearby a host gene may cause alterations on gene expression or 17 alternative splicing (CHUNG et al., 2007; HIRSCH; SPRINGER, 2017; REBOLLO; FARIVAR; MAGER, 2012). The evolutionary outcomes derived from successful gene-TE interactions are diverse and well documented in many taxa (SINZELLE; IZSVÁK; IVICS, 2009; VOLFF, 2006). Long-term stable and adaptive interactions are often termed as domestication or exaptation when become essential for the cell fate or functioning (CAPY, 2021). For instance, in the vertebrate immune system, both genes Rag1 and Rag2 have conserved TE-derived protein domains that are crucial to the somatic recombination process known as V(D)J (AGRAWAL; EASTMAN; SCHATZ, 1998). They work together as a transposase, capable of excising a DNA segment containing recombination signals from a donor site and inserting it into a target DNA molecule. In D. melanogaster, the telomeric LINE elements TART, Heta-A, and Tahre remain actively transposing specifically into telomeres, being essential to maintain telomere’s length and protect against replication-induced erosion (DANILEVSKAYA et al., 1994; GEORGE et al., 2006; PARDUE; DEBARYSHE, 2000). Another example is the Syncytin proteins which are originated from the envelope (env) genes of ERVs and have been co-opted for placental development in mammals (LAVIALLE et al., 2013; MI et al., 2000), being essential for nutrient exchange between the mother and the developing embryo. Autonomous TEs encode genes that enable their mobilization, relying on the host cell’s machinery to recognize self cis-regulatory sequences that effectively mimic host promoters (reviewed on Chuong, Elde, and Feschotte 2017). Many studies have demonstrated spread TE- derived promoters and enhancers (CAO et al., 2019; CHUONG et al., 2013; YE et al., 2020), transcription factor binding sites (TFBSs) (BOURQUE et al., 2008; THORNBURG; GOTEA; MAKAŁOWSKI, 2006), and insulators (WANG et al., 2015). A remarkable example is the insecticide resistance in D. melanogaster, which is caused by a co-option of alternative TFBSs from an Accord TE inserted upstream to the Cyp6g1 gene, causing its overexpression and higher efficiency to metabolize insecticides (DABORN et al., 2002). In Caenorhabditis sp., heat shock response is regulated by a specific transcription factor that binds onto heat shock elements, activating the expression of important genes for cellular protection upon stress (VIHERVAARA; DUARTE; LIS, 2018). Notably, a significant number of these elements in several Caenorhabditis species are located within Helitrons, contributing to the diversification and rewiring heat shock response in Caenorhabditis upon stress (GARRIGUES et al., 2019). 18 Gene-TE chimeric transcripts as evidence of TE-related novelties The interactions between genes and TEs can generate mRNA isoforms with both exons and TE-derived sequences, named gene-TE chimeric transcripts (CORONADO-ZAMORA; GONZÁLEZ, 2023; LIPATOV et al., 2005; TREIBER; WADDELL, 2020). Chimeras are produced through molecular mechanisms during transcription or mRNA alternative splicing (reviewed on Ayarpadikannan et al. 2015), and classified into three categories accordingly: 1) TE-initiated transcript: the TE is co-opted as an alternative promoter, providing a novel transcription start site to the gene (LAMPRECHT et al., 2010; MCGINNIS; SHERMOEN; BECKENDORF, 1983); 2) TE-exonized transcript: the TE is located in the intron and retained in the mature mRNA through splicing (ALMEIDA et al., 2008; SELA et al., 2010; SOREK, 2007); 3) TE-terminated transcript: the TE provides an alternative polyadenylation site during RNA maturation (DABORN et al., 2002; FARRÉ; ENGEL; ANGULO, 2016). TE-initiated and TE-terminated transcripts are more likely to be involved in changes of gene expression, whereas the TE-exonized transcripts might alter the protein sequence, having a direct effect on the protein function. Gene-TE chimeras play significant roles in driving evolutionary change by contributing with genetic novelties in populations, potentially leading to adaptive functions in organisms across diverse taxa. In D. melanogaster, the gene CHKov1 produces a TE-exonized transcript with a TE insertion from the Doc family that has high frequency in South American populations (MAGWIRE et al., 2011). This chimera represents a truncated isoform compared to its ancestral sequence, changing the protein function and resulting in resistance to organophosphates and viral infection (MAGWIRE et al., 2011). Another relevant example is the case of the industrial melanism in British peppered moths (Biston betularia), where the black (carbonaria) form became dominant during the Industrial Revolution due to a mutation caused by a TE copy inserted in the first intron of the cortex gene. The TE is exonized in the carbonaria form, inducing higher expression of cortex, which in turn causes the adaptive dark melanization of the moth (HOF et al., 2016). In the oil palm Elaeis guineensis, the hypomethylation state of a LINE copy in the intron of the gene Deficiens generates a TE-exonized transcript with premature termination, generating poorly developed fruits with mantled phenotype (ONG- ABDULLAH et al., 2015). In cancer, TEs become active due to a state of global hypomethylation (EHRLICH, 2009). This activation can lead to the formation of new chimeric transcripts with harmful effects, a process known as onco-exaptation (BABAIAN; MAGER, 2016). A well- 19 characterized example of this phenomenon occurs in Hodgkin Lymphoma, where the oncogene CSF1R has an increase on its transcriptional level caused by an alternative promoter from an ERV copy, acting as a trigger to the development of malignant cells (BABAIAN et al., 2016). In large B-cell lymphoma, the FABP7 gene utilizes an endogenous retrovirus LTR as a promoter, leading to the production of a novel protein that contributes to abnormal cell proliferation (LOCK et al., 2014). Another study has documented over 4,800 chimeric transcripts distributed among 723 cancer-associated genes, mostly from Alu and L1 elements (CLAYTON et al., 2020). More recently, a comprehensive study has identified 1,068 antigens with TE-derived sequences from cancer samples, named ad TE-chimeric antigens (SHAH et al., 2023). Importantly, these specific chimeras produced functional proteins present on the surface of cancer cells, with appeal for therapeutical exploitation (FUDGE, 2023). Transcriptome-wide studies revealing gene-TE chimeric transcripts are scarce in insect species, mainly for the non-model ones. Even in the model D. melanogaster, comprehensive analysis with tissue specific samples were assessed recently (CORONADO-ZAMORA; GONZÁLEZ, 2023; OLIVEIRA et al., 2023; TREIBER; WADDELL, 2020). The authors found similar quantitative results, showing that ~1.5% of the genes producing chimeric transcripts, which can be considered low compared with up to 14% in human and mice (FAULKNER et al., 2009). Importantly, the overall chimeric isoform contribution to gene expression in D. melanogaster was ~27% in head, gut and ovary, although 314 genes are expressed solely with the chimeric isoform and common in wild-type strains (CORONADO- ZAMORA; GONZÁLEZ, 2023). This result demonstrates the extension of TE co-option in D. melanogaster, and the possibility to identify them through the detection of chimeras. Additionally, it has been shown that over 80 TE insertions are adaptive across D. melanogaster global populations (RECH et al., 2019), impacting the expression of genes associated with ecologically significant traits. Considering the extensive presence of TEs in virtually all eukaryotes, it is plausible that similar adaptive patterns could be observed in other species. Since alternative splicing has a stronger impact on phenotypic changes than gene expression modifications (HEINZEN et al., 2008; YU et al., 2008), the extent to which gene-TE chimeras contribute to environmental adaptability remains an open question. This influence likely affects the evolutionary fitness of insect populations. The bioinformatic techniques to identify gene-TE chimeric transcripts has evolved following the advances on sequencing technologies. The first studies demonstrating the presence of TE-derived sequence in mRNA were conducted in human, based on expressed sequence tags (EST) evidence (SOREK; AST; GRAUR, 2002). In D. melanogaster, ESTs were 20 also applied to chimeras’ identification, demonstrating that only ~1% of genes produce chimeric transcripts in this species (LIPATOV et al., 2005). Importantly, these findings based on ESTs were limited to identify TE-derived sequences at the 5’ UTR of genes, so they produced an underestimation of the transcriptome-wide context. Other studies were conducted using Cap Analysis Gene Expression (CAGE) technique, showing a significant proportion of TE-initiated transcripts, ranging from 3% to 14% in human and mice, depending on the tissue (FAULKNER et al., 2009). In the last years, with the popularization of Illumina RNA-seq, a few bionformatic tools to detect gene-TE chimeric transcripts have been published. The pipeline CLIFinder is able to identify gene-TE chimeras derived specifically from long interspersed nuclear elements (LINEs) restricted to the human genome (PINSON et al., 2018). LIONS is not specific to any taxa, but it is designed to identify only TE-initiated transcripts (BABAIAN et al., 2019). FREDDIE is an algorithm able to identify exonization events involving intragenic TEs in mammalian genomes, through transcriptome assembly, quantification, and assessing the coding potential of chimeras (MERCURI et al., 2024). Importantly, all the three methods rely on the reference genome, therefore they only detect chimeras from TE copies present in the reference, losing the relevant information regarding novel and/or polymorphic TE insertions. TEchim was the only pipeline capable of identifying chimeras from polymorphic TEs (TREIBER; WADDELL, 2020), but it was not designed to run automatically with other species than D. melanogaster. Although the proven relevance to study gene-TE chimeric transcripts in the evolutionary biology context, their study is constrained by the pervasive repetitive nature of TE copies and the limited availability of methods for their identification. In this thesis, we developed ChimeraTE, a novel pipeline to identify chimeras derived from TEs regardless their presence on the reference genome, from any species, using Illumina paired-end reads (OLIVEIRA et al., 2023). Host shift as an evolutionary driving force The idea that adaptation to different environments can drive speciation has been proposed since Darwin’s On the Origin of Species (1859). Benjamin Dann Walsh, a 19th- century’ entomologist, explored such idea hypothesizing that speciation in phytophagous insects could be driven by shifting to new host plants (WALSH, 1864). Walsh’s arguments to the process can be summarized in three key points regarding phytophagous insects. Firstly, most of them have preference to feed on or deposit their eggs in one or a few specific host plants. Secondly, if reared on different host plants, they frequently display variations in observable traits, which he referred to as “phytophagic varieties” (WALSH, 1864). Finally, he 21 hypothesized that if a “phytophagic variety” remains on the same host plant for multiple generations, it might become so well-adapted to that plant that it could no longer feed on an alternative host, or possibly even mate with individuals that utilize the other plant. In the last decades, many studies have been conducted with phytophagous insects to understand the relationship between host shift (HS) and reduced gene flow between populations (DRÈS; MALLET, 2002; HEARD, 2012; HERNÁNDEZ-ROLDÁN et al., 2016). The conclusions from many species indicate that reproductive isolation emerges due to a series of adaptations driven by HS, which is the result of divergent selection between distinct ecological environments (FORBES et al., 2017; HOWARD; BERLOCHER, 1998; RUNDLE; NOSIL, 2005; WHITEMAN; PIERCE, 2008). Consequently, natural selection increases the prevalence of alleles that provide fitness advantages during HS. Thus, the molecular effects of HS evolution are strongly associated with ecological divergence, specialization, and, ultimately, ecological speciation (WHITEMAN; PIERCE, 2008). The latter is defined as the speciation process by which reduced gene flow occurs as a consequence of ecologically-based divergent selection (MATSUBAYASHI; OHSHIMA; NOSIL, 2010; SCHLUTER, 2009). Although ecological speciation is a well-established theory, its extent on nature compared to other speciation mechanisms remains under study (HENDRY, 2009). In non- ecological speciation, random chance plays a crucial role, as seen in processes like speciation through genetic drift, founder events, and population bottlenecks (COYNE; ORR, 2004). These events do not necessarily involve environmental differences driving the divergence. Additionally, non-ecological speciation includes models where natural selection is involved, but this selection is either not tied to environmental differences or not divergent across habitats. Examples of these include speciation through sexual selection (LANDE, 1980), where mate choice drives divergence, and sexual conflict (CHAPMAN et al., 2003), where differences between male and female reproductive strategies lead to speciation. Another scenario involves what is known as “mutation-order speciation” (SCHLUTER, 2009), in which populations under similar environmental pressures accumulate different, incompatible genetic changes, resulting in speciation. An important consideration is that ecological speciation can occur in any geographical arrangement, whether the populations are separated (allopatry), partially overlapping (parapatry), or fully overlapping (sympatry), as long as ecologically-based divergent selection is the driving force behind their evolutionary divergence (COYNE; ORR, 2004; RUNDLE; NOSIL, 2005). 22 Cactophilic Drosophila species: A model for host shift studies The Drosophila genus has been an important model organism for evolutionary studies regarding the evolution of HS (MARKOW, 2019). The diversification of Drosophila species and wide-range of plant hosts provides an ideal opportunity to understand the process of host shifting, and its potential in the plasticity that enables organisms to survive upon climatic and environmental changes. In the subgenus Sophophora, species from the melanogaster group such as D. simulans and D. melanogaster are cosmopolitan and have expanded their distribution beyond their ancestral regions through commensal association with humans (LACHAISE; SILVAIN, 2004). D. sechellia is a specialist species in Morinda citrifolia fruits, which are considered toxic hosts for other species (R’KHA; CAPY; DAVID, 1991). Another specialist, D. erecta, has its reproductive cycle in Pandanus species (LACHAISE; SILVAIN, 2004). In the obscura group, D. persimilis and D. pseudoobscura have a geographic distribution mainly in the western part of North America, although a small population of D. pseudoobscura is found in Bogotá, Colombia (DOBZHANSKY et al., 1964). These species are classified as opportunists, a characteristic of generalist species, as both are abundant at medium and high elevations during the summer, while in winter, these populations migrate to lower elevations near desert habitats throughout their geographic range. These regions are not occupied by these species during the summer. Depending upon availability, D. pseudoobscura and D. persimilis use mudflows, domestic fruits, cacti, and agave as feeding and breeding sites (POWELL, 1997). D. willistoni is considered one of the most numerous and widely distributed Drosophila species in the New World (AYALA; POWELL; DOBZHANSKY, 1971), and it uses a wide range of decomposing fruits as hosts. In the subgenus Drosophila, D. virilis is a species distributed in the Holarctic zone of the globe, which reproduces and feeds in mudflows and decomposing bark (THROCKMORTON, 1982). In the Hawaiian group, D. grimshawi is considered a generalist, utilizing decomposing bark from more than seven endemic Hawaiian plant families (MAGNACCA; O’GRADY, 2006). The evolution of cactophilic Drosophila species from the repleta group is closely associated with the transition from the use of fermented fruits in humid forests to fleshy- stemmed plants adapted to arid climates, such as cacti, particularly Opuntia sp. The vast majority of repleta species use fermenting cactus tissues to complete their life cycles in semi- arid or arid environments (RUIZ; HEED; WASSERMAN, 1990). The use of decomposing cacti as hosts in arid areas provides abundant and sustainable resources for Drosophila larvae and adults. Nevertheless, specialization in cacti represents a significant ecological challenge, 23 requiring specific adaptations for cactus-dwelling species. For instance, cacti have lower nitrogen and phosphorus content compared to hosts used by most other Drosophila species (OLIVEIRA et al., 2012). Additionally, cacti produce various toxic substances for most insects, an important adaptation that reduces damage caused by herbivores (CLOUDSLEY- THOMPSON, 1996). Finally, cacti are generally found in areas where temperatures are high, imposing desiccation and thermal stress on organisms associated with them (MATZKIN; MARKOW, 2009). It is believed that the ecological shifts towards using cacti as feeding and breeding sites involved numerous adaptive changes in the physiology, reproductive biology, and behavior of cactus-feeding species (MARKOW; O’GRADY, 2007). Therefore, cactophilic species are an excellent system to investigate the genetic basis of HS and adaptation. A study conducted with 65 species from the repleta group showed that the use of cacti from the Opuntia genus is common throughout the phylogeny, and the transition to higher complexity columnar cacti occurred independently several times during the evolutionary history of the group (OLIVEIRA et al., 2012). The authors pointed out that it is also common the usage of multiple cacti species, Opuntia sp. and columnar. From the phylogenetic context, they demonstrated that the usage of Opuntia sp. is a basal condition, while diversification to chemically more complex columnar cacti represent a derived state. In the phylogeny, they reconstructed three scenarios (specific use of Opuntia, use of columnar cacti, and polymorphism) that characterize the host shifts can be identified within the subgroups of repleta. In the mojavensis cluster, for instance, D. navojoa is the basal species and it uses exclusively Opuntia sp. as its host cactus, while the other two species in the cluster, D. arizonae and its sister species D. mojavensis, are polymorphic, using both Opuntia spp. and columnar cacti (OLIVEIRA et al., 2012). In the buzzatii complex, seven out of the ten species have polymorphic usage of cacti. For instance, D. buzzatii has preference for Opuntia sp., while its sister species D. koepferae uses preferentially columnar cacti (Figure 1) (OLIVEIRA et al., 2012). In contrast, species from the mercatorum and hydei subgroups exclusively use Opuntia ssp. as their host (OLIVEIRA et al., 2012). The pair of species D. mojavensis–D. arizonae, which share the use of both types of cacti, differ in many aspects. D. mojavensis exhibit genetic, behavioral, ecological, and morphological differences that have led to its classification into four subspecies (D. m. baja, D. m. mojavensis, D. m. sonorensis, and D. m. wrigleyi), occurring in different regions in the southwestern United States and western Mexico (Figure 1) (REED; MARKOW, 2004; REED; NYBOER; MARKOW, 2007; ROSS; MARKOW, 2006). Unlike D. arizonae, which is polymorphic in its host cactus preference, using both Opuntia sp. and columnar cacti throughout 24 its geographic range, D. mojavensis is oligophagous, utilizing Opuntia sp. as well as different species of columnar cacti (HOCUTT, 1998; RUIZ; HEED, 1988). D. m. sonorensis primarily uses the “organ pipe cactus” (Stenocereus thurberi), D. m. mojavensis uses the “barrel cactus” (Ferocactus cylindraceus), D. m. wrigleyi uses various species from the Opuntia genus (Opuntia spp.), and D. m. baja preferentially uses the agria cactus (Stenocereus gummosus), despite the presence of all other cacti used by the other subspecies within its distribution range (HOCUTT, 1998; REED; NYBOER; MARKOW, 2006; RUIZ; HEED, 1988). Figure 1: Geographical distribution and cacti preference for D. buzzatii and D. koepferae, in the South America, and the four D. mojavensis subspecies and D. arizonae, distributed in the southwestern United States and western Mexico. These species exhibit incomplete and asymmetric post-zygotic isolation in the hybrid offspring. In crosses from D. mojavensis males and D. arizonae females, the hybrid males are sterile, whereas in reciprocal crosses, the offspring can be fertile depending on the maternal subspecies (REED; MARKOW, 2004; RUIZ; HEED; WASSERMAN, 1990). In line with the Reinforcement Theory (DOBZHANSKY, 1937), it has been reported that D. mojavensis females from sympatric populations exhibit almost complete pre-zygotic reproductive isolation compared to D. arizonae males, while D. mojavensis females living in allopatry with D. arizonae demonstrate lower reproductive isolation (MASSIE; ANN MARKOW, 2006; RUIZ; HEED; WASSERMAN, 1990). Importantly, early stages of reproductive isolation have been observed between D. m. sonorensis and D. m. baja (MARKOW, 1991; PFEILER et al., 2009; ZOUROS; D’ENTREMONT, 1980). The premating isolation is attributed to differences in 25 cuticular hydrocarbons, which play a key role in mate recognition signaling pathways (HOWARD; BLOMQUIST, 2005). Notably, experimentally shifting the cactus host was the primary factor leading to changes in hydrocarbon profiles. Therefore, the reproductive barriers are closely tied to the usage of specific hosts (HAVENS; ETGES, 2013; STENNETT; ETGES, 1997). Factors affecting host shift The evolution of insects’ ability to specialize on particular hosts often requires evolutionary changes in two key aspects: 1) host acceptance behavior: the insect needs to evolve the metabolize and accept the new host as a viable food source; and 2) offspring performance: the offspring must be physiologically capable of surviving and thriving on the new host. In a scenario where oviposition behavior is determined by one genetic locus, and offspring fitness by another, if a new host plant is used for oviposition but is still less suitable for larvae development compared to the ancestral host, then either behavioral avoidance or physiological tolerance to the new host is expected to evolve, but not both simultaneously (CASTILLO‐ CHAVEZ; LEVIN; GOULD, 1988; GOULD, 1984). Therefore, the evolutionary outcome upon HS is dependent of the genetic background in the population. Behavior and physiology must evolve together, as one alone is not enough for the insect to successfully specialize on a new host (RAUSHER, 1983, 1993). Thus, evolving from a generalist to a specialist is easier than the reverse, because evolving either host acceptance behavior or physiological adaptation alone can allow the insect to specialize (VERTACNIK; LINNEN, 2017). The success of the plant-insect relationship is the result of a long and complex evolutionary history marked by the interaction between them. The selection of host plants by phytophagous insects depends on both the plant’s characteristics and availability, as well as physiological and behavioral traits of the insect (RUIZ; HEED, 1988). Suitability and location are essential properties for oviposition to occur. The location of the host plant is associated with both the plant’s abundance, as well as the insect’s sensitivity and behavior. Suitability, in turn, is determined by the plant’s physicochemical characteristics, which differ mainly in nitrogen and water content, components that are strictly related to larval growth rates. Additionally, secondary metabolites may assist the insect in locating the host or present high levels of toxicity (RUIZ; HEED, 1988). The adaptations that enabled cactus-feeding species to overcome the physicochemical challenges posed by columnar cacti include three main stages, which summarize the establishment of the relationship between the insect and its host: location, 26 acceptance, and host usage (MARKOW, 2019). Several morphological and sensory structures are involved in each component, with underlying genetic variability in each. The first step is to locate the host and distinguish it from many others. For doing so, the insect recognizes a specific odor chemical compound to guide itself, identified by various chemoreceptors distributed across its body (including antennae, wings, and the Johnston’s organ). The visual system will also integrate the system to assist in guiding to the host. Olfactory sensilla contain odorant-binding proteins (OBPs) (GALINDO; SMITH, 2001) that bind to different odors and connect them to odorant receptors (ORs), which will send the signal through neurons to the brain (LARTER; SUN; CARLSON, 2016). In Drosophila sp., OBPs correspond to a family of ~50 genes expressed in the sensilla. These are small globular proteins capable of binding odor molecules in the pores of chemosensory sensilla, transporting them to their ligands near the ORs (MARKOW, 2019; SÁNCHEZ-GRACIA; ROZAS, 2008; VIEIRA; SÁNCHEZ- GRACIA; ROZAS, 2007), which are genes expressed in antennae and olfactory sensilla (GOMEZ-DIAZ et al., 2018; HALLEM; HO; CARLSON, 2004). ORs are a family containing more than 60 genes, with the function of receiving external chemical stimuli and transform into electrical pulses (GUO; KIM, 2007; MARKOW, 2019). Once the insect has recognized its host, the second stage begins, involving the evaluation of the substrate. The host will be assessed not only for its nutritional quality, but also for the presence of competitors, predators, pathogens, and parasites. The substrate’s quality is related to two types of receptors: gustatory receptors (GRs), which are primarily expressed in the organs located on the head (proboscis, hypo- and epipharynx, and antennae), as well as in the legs and wings (MONTELL, 2009); and ionotropic receptors (IRs), a gene family expressed in coeloconic sensilla, antennae, and palpal and labial sensilla (GOMEZ-DIAZ et al., 2018; RYTZ; CROSET; BENTON, 2013). GRs compose a family with 60 genes that mediate substrate tasting (MONTELL, 2009) and assess its nutritional quality (DUS et al., 2011; FUJITA; TANIMURA, 2011; STAFFORD et al., 2012). IRs are a group of 55 genes involved in olfaction, taste, oviposition roles in the mouthparts and legs, and are implicated in both oviposition and feeding (CHEN; AMREIN, 2017). In the third and final stage, the insect must be able to utilize the substrate, in terms of reproduction and development, ensuring the survival of the larvae to become viable adults. Successful larval development depends on factors such as nutritional composition and overcoming the presence of toxic substances, when present. Since some plants, especially cacti, produce a variety of toxins to prevent herbivory, the larvae has a detoxification pathway as a viable strategy to overcome toxicity. Several gene families serve this purpose, including those 27 from the cytochrome P450 (CYPs), Glutathione S-Transferases (GSTs), UDP- glycosyltransferases (UGTs), Esterases (MARKOW, 2019), and ABC transporters (WU et al., 2019) families. The cytochrome P450 family has approximately 100 genes, highly divergent among Drosophila species. These genes are expressed in antennae, maxillary palps, labellum, legs, wings, and larvae (CHUNG et al., 2009; WILLINGHAM; KEIL, 2004), and they encode enzymes that catalyze various molecular reactions, mainly hydroxylations, on endogenous and exogenous proteins, many of which are associated with insecticide detoxification (GOOD et al., 2014). The GST gene family consists of around 30 genes, expressed in antennae, maxillary palps, and larvae (RIVERON; BOTO; ALCORTA, 2013; YOUNUS et al., 2014), with many of these also involved in insecticide resistance. They function by attaching the thiol group of glutathione to an electrophilic center in certain compounds, thereby eliminating toxins from cells by directing them to specific transporters or making them more water-soluble (LOW et al., 2007). The UGT (UDP-glycosyltransferase) family has its genes expressed in the antennae (FRAICHARD et al., 2020) and is a family with 30 genes that catalyze the transfer of a glycosyl group from a nucleotide sugar, such as UDP-glucuronic acid, UDP-galactose, UDP-glucose, or UDP-xylose, to a variety of small hydrophobic molecules, resulting in more hydrophilic compounds that are efficiently excreted (LUQUE; O’REILLY, 2002). The esterase family consists of approximately 30 genes (OAKESHOTT et al., 1993), expressed in the adult head, larvae, pupae, ovaries, and testes (CAMPBELL et al., 2003). Finally, the ABC (ATP-binding cassette) gene family is a large class of transmembrane proteins, expressed in the brain and eyes of Drosophila, playing an important role in transporting a range of substrates, including xenobiotics (WU et al., 2019). Increased expression of ABC transporters has been shown to be associated with insecticide resistance in some insects, such as pyrethroid-resistant Aedes aegypti strains (BARIAMI et al., 2012), resistant strains of Cimex lectularius (MAMIDALA et al., 2012), as well as in D. melanogaster (SUN; BUCHON; SCOTT, 2017). The genes associated with host location, acceptance, and host usage encode proteins related with feeding, reproduction, and development, hence their evolution is a key process during HS. (MARKOW, 2019). Rane et al. conducted a study with 14 Drosophila species and proposed 82 cases of positive selection occurring in the gene families of esterases, GSTs, CYPs, and UDPs, through analyses of selection (based on the ratio between nonsynonymous substitutions (Ka) and synonymous substitutions (Ks) (ω = Ka/Ks)) that have accumulated over time in the divergence of coding sequences. 28 In addition to the selective pressure acting on the gene families involved in HS, changes in gene expression must also play an important role in the adaptation of Drosophila sp. to new environments. Although detoxification occurs predominantly in digestive organs, a study on D. melanogaster (CATALÁN; HUTTER; PARSCH, 2012) detected genes differentially expressed in the brains from African and non-African populations. These genes were related to stress response, olfaction, and detoxification. It has also been shown that detoxification genes (CYP and GSTs) and olfaction genes (OBPs) are differentially expressed in larval brains between D. m. wrigleyi and D. m. sonorensis (BENOWITZ et al., 2020), indicating that gene expression in the nervous system may play a role in detoxification and the olfactory sensory system, thus contributing to HS in D. mojavensis. Furthermore, it has been proposed that during larval development in the four subspecies of D. mojavensis, D. navojoa, and D. arizonae, there are behavioral differences related to locomotion as well as pupation site selection (COLEMAN et al., 2018). Considering that the larval stage shows the most conserved gene expression compared to other developmental stages of Drosophila (ARTIERI; SINGH, 2010), and that genes related to HS are expressed in this tissue, as well as in organs present in the adult fly’s head, the transcriptomes of these two tissues are an excellent subject for evolutionary studies regarding HS in Drosophila. 29 OBJECTIVES This thesis had two main objectives, which are divided into two chapters. The first objective was to develop a pipeline to identify gene-TE chimeric transcripts with paired-end Illumina reads. The second objective was to analyze the genome and transcriptome of seven cactophilic Drosophila species to investigate whether TEs could contribute to the evolution of genes associated with host shift, applying the method developed in the first chapter. 30 ChimeraTE: a pipeline to detect chimeric transcripts derived from genes and transposable elements 9764–9784 Nucleic Acids Research, 2023, Vol. 51, No. 18 Published online 24 August 2023 https://doi.org/10.1093/nar/gkad671 ChimeraTE: a pipeline to detect chimeric transcripts derived from genes and transposable elements Daniel S. Oliveira 1 , 2 , Marie Fablet 2 , 3 , Ana ̈ıs Larue 2 , 4 , Agn ̀es Vallier 4 , Claudia M.A. Carareto 1 , Rita Rebollo 4 , * and Cristina Vieira 2 , * 1 S ̃ ao Paulo State University (Unesp), Institute of Biosciences, Humanities and Exact Sciences, S ̃ ao Jos ́e do Rio Preto, SP, Brazil, 2 Laboratoire de Biom ́etrie et Biologie Ev olutiv e, Univ ersit ́e Ly on 1, CNRS , UMR5558, Villeurbanne, Rhone-Alpes, 69100, France, 3 Institut Universitaire de France (IUF), Paris, ̂ Ile-de-FranceF-75231, France and 4 Univ Lyon, INRAE, INSA-Lyon, BF2I, UMR 203, 69621 Villeurbanne, France Receiv ed Nov ember 03, 2022; Revised J uly 25, 2023; Editorial Decision J uly 27, 2023; Accepted August 09, 2023 ABSTRACT Transposable elements (TEs) produce structural vari- ants and are considered an important source of ge- netic diver sity. Notabl y, TE-gene fusion transcripts, i.e. chimeric transcripts, have been associated with adaptation in several species. However, the identi- fication of these chimeras remains hindered due to the lack of detection tools at a transcriptome-wide scale, and to the reliance on a reference genome, even though different individuals / cells / strains have different TE insertions. Therefore, we developed ChimeraTE, a pipeline that uses paired-end RNA-seq reads to identify chimeric transcripts through two different modes. Mode 1 is the reference-guided ap- proach that employs canonical genome alignment, and Mode 2 identifies chimeras derived from fixed or insertionally polymorphic TEs without any refer- ence genome. We have validated both modes us- ing RNA-seq data fr om f our Dr osophila melanogaster wild-type strains. We found ∼1.12% of all genes generating chimeric transcripts, most of them from TE-e xonized sequences. Approximatel y ∼23% of all detected chimeras were absent from the reference g enome , indicating that TEs belonging to chimeric transcripts may be recent, polymorphic insertions. ChimeraTE is the first pipeline able to automati- cally uncover chimeric transcripts without a refer- ence g enome , consisting of two running Modes that can be used as a tool to investigate the contribution of TEs to transcriptome plasticity. GRAPHICAL ABSTRACT INTRODUCTION Transposable elements (TEs) are mobile DNA sequences that comprise a large fraction of eukaryotic genomes, from 15% in Drosophila melanogaster ( 1 ), 45% in humans ( 2 ), to 85% in maize ( 3 ). Many TE copies have lost their ability to transpose as a result of accumulated mutations and recom- bination throughout evolution ( 4 ). Despite their lack of mo- bility, such ancient TE insertions may still harbor functional protein domains, alternati v e splice sites, and cis -acting reg- ulatory sequences, as transcription factor binding sites (TF- BSs) and polyadenylation (PolyA) sites. Ther efor e, TEs ar e a major source of genetic di v ersity, not only due to their mo- bilization, but also because they donate protein domains to genes ( 5–8 ), and regulatory sequences that modify the ex- pression of nearby genes ( 9–13 ). The participation of TE- deri v ed sequences in the host biology is a process called domestica tion or exapta tion ( 14 ). The ancestral role of the TE sequence can be domesticated into an essential host * To whom correspondence should be addressed. Tel: +33 4 72 44 82 16; Email: cristina.vieira@univ-lyon1.fr Correspondence may also be addressed to Rita Rebollo. Tel: +33 4 72 43 62 46; Email: rita.rebollo@inrae.fr C ⃝ The Author(s) 2023. Published by Oxford University Press on behalf of Nucleic Acids Research. This is an Open Access article distributed under the terms of the Creati v e Commons Attribution License (http: // creati v ecommons.org / licenses / by / 4.0 / ), which permits unrestricted reuse, distribution, and reproduction in any medium, provided the original work is properly cited. D ow nloaded from https://academ ic.oup.com /nar/article/51/18/9764/7249921 by guest on 05 Septem ber 2024 31 Nucleic Acids Research, 2023, Vol. 51, No. 18 9765 function, but it can also be modified, and adapted, into a ne w e xapted function that ma y also be essential f or the host species ( 14 ). Chimeric transcripts are RNAs stemming from two sequences of different origins ( 15 ). Hereafter we define chimeric transcripts as mature transcripts that have both gene and TE-deri v ed sequences. These transcripts can be di- vided into three types: ( 1 ) TE-initiated transcripts: chimeric transcripts with a TE transcription start site (TSS) ( 16 , 17 ); ( 2 ) TE-exonized transcripts: TE sequences are incorporated into the transcript either partially or as full-length exons ( 18–20 ); and ( 3 ) TE-terminated transcripts: chimeric tran- scripts with a TE transcription termination site ( 21 , 22 ). TE-initia ted and TE-termina ted transcripts might modu- late gene expression levels either by the presence of TF- BSs , PolyA sites , or chromatin changes; while TE-exonized transcripts may alter the protein sequence of coding genes and have a direct effect on the protein function. Regard- less of the TE position, such e v ents of TE e xaptation and domestication have been associated with many biological roles and are widespread among eukaryotic species ( 14 ). In D. melanogaster , the CHKov1 gene produces a chimeric transcript with a truncated mRNA resulting in resistance to insecticide and viral infection ( 23 ). In humans, the SET- MAR gene produces a chimeric transcript containing a Hs- mar1 copy, involved in non-homologous end-joining DNA repair ( 24 ). In cancer, TEs become acti v e due to a global h ypometh yla tion sta te ( 25 ) and such activation may gen- erate new chimeric transcripts with detrimental outcomes ( 26 ), a process called onco-exaptation ( 9 ). For example, in large B-cell lymphoma, the FABP7 gene has an endoge- nous retrovirus LTR co-opted as a promoter, generating a nov el protein involv ed in abnormal cell proliferation ( 27 ). Recently, such phenomenon has been observed in a larger scale. A pan-cancer study re v ealed 1,068 tumor-specific TE- initiated transcripts from genes coding antigens, showing the high prevalence of TE exaptation in cancer ( 28 ). There- fore, chimeric transcripts have a large impact on host bi- ology, but their study r emains hinder ed by the ubiquitous repetiti v e nature of TE copies, and lack of methods to iden- tify chimeric transcripts. Previous studies with Cap Analysis Gene Expression (CAGE) re v ealed a significant percentage of genes pro- ducing TE-initiated tr anscripts, r anging from 3–14% in humans and mice, depending on the tissue ( 29 ). More specifically, in human pluripotent stem cells, chimeric tran- scripts comprise 26% of coding and 65% of noncoding transcripts ( 30 ). In D. melanogaster , a study using ex- pressed sequence tags (ESTs) showed that the proportion of genes with chimeric transcripts is only ∼1% ( 31 ). Another study used an approach based on cap-enriched mRNA sequencing but with a preliminary transcript elongation step allowing not only to detect TSSs but also to estimate mRNA expression rates (RAMPAGE) ( 32 ). This study de- tected TE-initiated chimeric transcripts in 36 stages of D. melanogaster life cy cle, and observ ed that up to 1.6% of all tr anscripts were chimer as, representing up to ∼1% of all genes ( 33 ). More recently, a tissue-specific study has shown that 264 genes produce chimeric transcripts in the mid- brain of D. melanogaster , corresponding to ∼1.5% of all genes ( 34 ). Se v eral bioinformatic methods have been de v eloped to take advantage of RNA sequencing (RNA-seq) to identify chimeric transcripts, such as CLIFinder ( 35 ) and LIONS ( 36 ). The former is designed to identify chimeric transcripts deri v ed from long interspersed nuclear elements (LINEs) in the human genome, whereas the latter identifies only TE-initiated transcripts. Both methods need a r efer ence genome and they only detect chimeric transcripts deri v ed from TE insertions present in the r efer ence genome. Ther e- fore, it is impossible to identify chimeric transcripts derived from polymorphic TE insertions that may exist in other populations , strains , or individuals. Finally, the latest ad- dition to chimeric transcript detection, TEchim ( 34 ), can detect chimeras with TEs that are polymorphic and absent from the r efer ence genome of D. melanogaster , but it is not a pipeline designed to run automatically with any other genome. In this study, we hav e de v eloped ChimeraTE, a pipeline that uses paired-end RNA-seq reads to identify chimeric transcripts. The pipeline has two Modes: Mode 1 can predict chimeric transcripts through genome alignment, whereas Mode 2 performs chimeric transcript searches without a r efer ence genome, being able to identify chimeras deri v ed from fixed or polymorphic TE insertions. To bench- mark the pipeline, we have used RNA-seq from ovaries of four D. melanogaster wild-type strains, for which we have assembled and annotated genomes. We found that ∼1.12% of genes have chimeric transcripts in the ovarian transcrip- tome, of which ∼88.97% are TE-exonized transcripts. Our results also re v eal that the retrotransposon roo is the most frequent exonized TE family. In addition, with Mode 2, we found 11 polymorphic chimeric transcripts deriving from TE insertions that are absent from the D. melanogaster ref- er ence genome. Ther efor e, this w ork pro vides a new strategy to identify chimeric transcripts with or without the refer- ence genome, in a transcriptome-wide manner. MATERIALS AND METHODS ChimeraTE: the pipeline ChimeraTE was de v eloped to detect chimeric transcripts with paired-end RNA-seq reads. It is de v eloped in python3 v.3.6.15 language, and it is able to fully automate the process in only one command line. The pipeline has two detection Modes: ( 1 ) genome-guided, the r efer ence genome is pro- vided and chimeric transcripts are detected aligning reads against it; and ( 2 ) genome-blind, the r efer ence genome is not provided and chimeric transcripts are predicted for fixed or polymorphic TEs. These Modes have distinct approaches that may be used for different purposes. In Mode 1, chimeric transcripts will be detected considering the genomic loca- tion of TE insertions and exons. Chimeras from this Mode can be classified as TE-initiated transcripts (TE located up- stream of the gene), TE-exonized transcripts (TE within introns or embedded in gene exons), and TE-terminated transcripts (TE located downstream of the gene). In addi- tion, results from Mode 1 can be visualized in a genome bro wser, which allo ws a manual curation of chimeric tran- scripts in the r efer ence genome. Mode 1 does not detect chimeric transcripts deri v ed from TE insertions absent from the provided r efer ence genome. Mode 2 predicts chimeric D ow nloaded from https://academ ic.oup.com /nar/article/51/18/9764/7249921 by guest on 05 Septem ber 2024 32 9766 Nucleic Acids Research, 2023, Vol. 51, No. 18 transcripts considering the alignment of reads against tran- scripts and TE insertions, in addition to a transcriptome assembly (user optional). Hence, Mode 2 detects chimeric transcripts from de novo TE insertions and an assembled genome is not necessary. In this Mode, two alignments are performed: (i) transcript alignment and (ii) TE align- ment. Then, based on both alignments, the pipeline iden- tifies chimeric reads that support chimeric transcripts, re- gardless of the TE genomic location. In Mode 2, since there is no alignment against an annotated genome, it is not pos- sible to classify chimeric transcripts considering the TE po- sition as in Mode 1. Both ChimeraTE Modes use chimeric reads, which are defined by paired-end reads that contain both TE and exon sequences, as evidence of chimeric transcripts. This method has been widely demonstrated by other authors as a poten- tial source of artifactual reads, mainly due to the occur- r ence of mix ed clusters ( index hopping ) on the Illumina’s flow cell that may be too close to each other, as well as jumping PCR ( in vitro crossover artifact), generating read pairs that are connecting two cDNA portions that are not joined in the sample ( 37–40 ). Indeed, it has been shown that up to 1.56% of all reads produced by Illumina multiplexed approaches might generate chimeric reads ( 41 ), including cases that may support chimeric transcripts deri v ed from different genes. These artifactual reads originate more likely from highly expressed genes since ther e ar e mor e molecules on the Illumina’s cell. Conversely, because TE-derived se- quences might comprise a low proportion of the transcrip- tome, artifactual reads from TEs should be produced at a low fr equency. Furthermor e, it is unlikely that artifactual reads from the same gene and TE family among RNA-seq replicates would be produced. Nonetheless, to avoid includ- ing false chimeric reads, both Modes of ChimeraTE only call chimeric transcripts that are detected in at least two RNA-seq replicates (user optional for more). ChimeraTE mode 1: genome-guided approach In ChimeraTE Mode 1, paired-end RNA-seq reads, a whole genome assembly, and its respective gene / TE annotation are used to predict chimeric transcripts (Figure 1 A). We highlight the need for robust gene and TE annotations to submit an analysis to Mode 1, since each TE insertion and exon will be used to identify chimeric reads. Firstly, RNA-seq alignment to the genome is performed with STAR v.2.7.6a with default parameters (Figure 1 B), and transcript expression is assessed with Cufflinks v.2.2.1 ( 42 ) for each RNA-seq replicate. We consider FPKM ≥1 as an expressed gene by default, but it can be changed by the user with the – fpkm parameter. Only concordant reads (both reads from the pair are aligned) with unique alignment ( samtools -q 255 ) are selected and converted to BED format with sam- tools v.1.10 ( 43 ) and bedtools v.2.30.0 ( 44 ), implemented in p ython3 with p ybedtools v.0.9.0 ( 45 ). Split r eads ar e also con- sidered as chimeric reads when one mate of the pair stems from the TE to the exon, or vice-versa (Supplementary Fig- ure 1). The alignment position of these reads are considered independently with bedtools bamtobed -split . Finally, reads aligned into the forward and re v erse strands are separated with samtools ( 43 ). The IDs from reads that have aligned against exons are identified with bedtools intersect (Figure 1 B). Then, reads with at least 50% of their length ( –o ver lap 0.50 ) aligned against r efer ence TE copies have their IDs selected (Fig- ure 1 B), and TE copies without aligned reads are removed from the downstream analysis. These two lists of IDs are intersected between each other and reads that have aligned to both exons and TEs are identified, generating a raw list of chimeric reads (Figure 1 C). TE-initiated transcripts are the first to be searched by ChimeraTE Mode 1. Expressed genes that have TE copies within their 3 kb upstream re- gion (default but adjustable with –window parameter) are identified with bedtools ( 44 ). For each gene, the pipeline will check whether the TE located upstream has chimeric r eads shar ed with the gene exons (Figure 1 D). If there are no chimeric reads, both gene and TE are discarded. The same method is applied to identify TE-terminated transcripts, but with genes harbouring downstream TEs up to 3 kb (user adjustable value) (Figure 1 1D). Finally, TE-exonized tran- scripts are also identified thanks to chimeric reads aligned to exons and TEs located within genes. Based on the TE position towar ds e xons, they are classified as embedded, in- tronic and overlapped. Embedded TEs are located entirely within an exon. In these cases, reads aligned only to the TE sequence are not considered chimeric reads (Supplementary Figure 1). Indeed, the exon can be artificially divided into two portions, and chimeric reads must have a read (or a split read) aligned to the TE, whereas its mate (or the ex- tension from the split read) aligned to at least one exon portion (Supplementary Figure 1). This method avoids au- tonomous TE expression as evidence of chimeric transcript expression. The same is performed to identify TE-exonized transcripts deri v ed from TEs that are partiall y overla pping exons but are also located in introns (Supplementary Fig- ure 1). Finally, TE-exonized transcripts with TEs located in introns are identified with bedtools intersect , and chimeric r eads ar e detected to support chimeric transcripts. These steps are repeated for all RNA-seq replicates pro- vided in the input. Then, the raw results from replicates ar e compar ed, and all chimeric transcripts that were found in ≥2 replicates and have been identified with at least ≥2 chimeric reads on average between r eplicates ar e consid- ered true chimeric transcripts (Figure 1 C). These thresholds may be changed by the user with –cutoff and –replicate pa- rameters. Mode 1 output is a table with a list of predicted chimeric transcripts categorized by TE position, with gene ID, TE family, chimeric read coverage, TE location, gene location, and gene expression (Figure 1 D). ChimeraTE mode 2: a genome-blind approach to uncover chimeric transcripts ChimeraTE Mode 2 is the genome-blind approach of the pipeline. The input data are stranded RNA-seq reads, gene transcripts, and r efer ence TE insertions. TE consensus can be used, but it will decrease the alignment rate, and ther efor e it should be used only when a fasta file with TE insertions is not available (Figure 2 A). The data will be used to perform two alignments with bowtie2 v.2.4.5 ( 46 ), one against all transcripts and another against all TE sequences, both with parameters: - D 20 -R 3 -N 1 -L 20 -i S,1,0.50 (Figure 2 A). To D ow nloaded from https://academ ic.oup.com /nar/article/51/18/9764/7249921 by guest on 05 Septem ber 2024 33 Nucleic Acids Research, 2023, Vol. 51, No. 18 9767 Figure 1. ChimeraTE Mode 1 (genome-guided) workflow. Round white boxes: input data; square boxes: pipeline step; round gray boxes: thresholds that can be modified. ( A ) Input data: fasta file with the genome assembly, gtf files with gene and TE annotations, as well as stranded paired-end reads from RN A-seq (fastq). ( B ) Alignment: RN A-seq alignment to the genome is used to calculate gene expression levels. Genes with FPKM < 1 are removed from downstream analyses. A subsequent list of reads that have aligned against genes or TE insertions is created. ( C ) Chimeric read detection & filtering: both r ead lists ar e then compar ed and r ead pairs that have common r eads between the two lists are named chimeric r eads, i.e. pair ed-end r eads mapping to a gene and a TE copy. The average of these reads between replicates is used as chimeric read coverage for each putative chimeric transcript. All putative chimeras are then processed with three ChimeraTE scripts to categorize them into TE-initiated, TE-exonized, and TE-terminated transcripts. These steps are perf ormed f or all RN A-seq replicates. Finall y, all chimeric transcripts present in at least 2 replicates and with at least 2 chimeric reads on average between replicates are maintained. ( D ) Chimeric transcripts: Three predictions obtained from Mode 1. Blue bo xes: ex ons; red bo xes: TEs; arro whead in between TE and ex on bo xes: transcription sense; blue and red boxes linked by a line: chimeric reads. The ChimeraTE mode 1 output is divided into three predictions: (i) TE-initiated transcript: the TE insertion is located upstream of the gene region; (ii) TE-exonized transcript: the TE insertion is present within e xons (embedded), ov erlapping e xons (ov erlapped), or intr ons (intr onic); (iii) TE-terminated transcript: the TE insertion is located downstream of the gene region. avoid very low-expressed transcripts predicted as chimeras and to ensure reasonable processing time, the SAM align- ment is converted to BAM with samtools v.1.10 ( 43 ), and FPKMs are computed for the reference transcripts pro- vided in the input using eXpress v.1.5.1 ( 47 ). Then, all genes with average FPKM < 1 are removed from the downstream analysis (Figure 2 B). To identify chimeric reads between TEs and gene transcripts, both alignments are converted to BED with bedtools v.2.30.0 ( 44 ). Among all aligned paired- end reads, the pipeline considers as chimeric transcripts the ones that have at least one read aligned to the TE sequence (singleton mapping) and its mate to the gene transcript, or when both reads have aligned (concordant mapping) to the TE and gene transcript. In order to identify these reads, the TE alignment output is used to create a list with all read 1 IDs that have aligned against TEs, and another list with all D ow nloaded from https://academ ic.oup.com /nar/article/51/18/9764/7249921 by guest on 05 Septem ber 2024 34 9768 Nucleic Acids Research, 2023, Vol. 51, No. 18 Figure 2. ChimeraTE Mode 2 (genome-b lind)wor kflow. Round white boxes: input data; square boxes: pipeline step; round gray boxes: thresholds that can be modified. ( A ) Input data: two fasta files containing r efer ence transcripts and TE insertions, as well as stranded pair ed-end r eads from RNA-seq (fastq). ( B ) Alignment and chimeric reads: The alignment against transcripts is performed and their expression is calculated. Transcripts with FPKM < 1 are removed from the downstream analysis. Next, a list of reads aligned against transcripts is created. Through the alignment of reads against TE insertions, a second list with reads stemming from TEs is also cr eated. Then, mapped pair ed-end r eads and singletons are identified, generating the list of chimeric reads, for all replicates. All chimeric transcripts that have an average of chimeric reads > = 2 and are present in > = 2 replicates are maintained as true chimer as. ( C ) Tr anscriptome assembly and chimeric reads: The de novo transcriptome assembly is a non-default option of ChimeraTE Mode 2. It performs a transcriptome assembly and aligns reads against the assembled transcripts. Then, TE insertions in the assembled transcripts are identified with RepeatMasker and the TE reads are recovered. Using the two lists of reads (transcripts and TEs), the chimeric read list is generated and the putati v e assembled chimeric transcripts ar e pr edicted. Next, a blastn is performed between these transcripts and the r efer ence transcripts provided in the input. All transcripts with length > = 80% are selected. The process is repeated for all RNA-seq replicates and chimeric transcripts assembled > = 2 replicates are maintained as true chimeras. ( D ) Chimeric transcripts: if the assembly is activated, ChimeraTE mode 2 provides three outputs: (1) Chimeric reads: predicted only based on the method depicted in B; (2) Assembled transcripts: predicted only based on the transcriptome assembly method depicted in C; and (3) Double evidence: predicted by both methods -B and C-. r ead 2 IDs, r egardless if their mates have also aligned (con- cordant mappings or singleton mappings). The same lists ar e cr eated by using the transcript BED file: ( 1 ) r ead 1 IDs of transcript mapping reads and ( 2 ) mate 2 IDs of all mate 2 r eads, r egardless of mate mapping. All mate 2 IDs that have a TE-aligned read 1 are searched in the list of transcript- aligned mate 2. The same is performed in the opposite direc- tion (TE-aligned read 2, transcript-aligned mate 1). These read pairs will therefore be comprised of two mates from the same pair that were singletons in the alignments, i.e. pairs comprised of one read that has aligned against a TE, and its mate against a gene transcript. The cases in which the chimeric transcript does not have the TE insertion in the r efer ence transcript will be supported only by these single- ton chimeric reads (Figure 2 D). For cases in which the TE insertion is present inside the reference transcript, chimeric reads supporting it may either be singleton or concordant r eads. Ther efor e, chimeric r eads can be concordant r eads in both alignments (TEs and genes), or they may be con- cordant only in the gene transcript alignment and single- ton in the TE alignment. Due to the repetiti v e nature of TEs, short-read alignment methods provide very few unique aligned reads against loci-specific TE copies as most reads align ambiguously between similar TE insertions. There- fore, when a chimeric transcript has been identified involv- ing more than one TE family, the TE family with the high- est coverage of chimeric reads is maintained. Subsequently, ChimeraTE uses two chimeric reads as a threshold for call- ing a chimeric transcript, that can be modified by the user with the –cutoff parameter. Such value does not r epr esent D ow nloaded from https://academ ic.oup.com /nar/article/51/18/9764/7249921 by guest on 05 Septem ber 2024 35 Nucleic Acids Research, 2023, Vol. 51, No. 18 9769 transcript expression nor TE expression, but it represents the coverage supporting the junction between a gene tran- script (CDSs / UTRs) and a TE sequence. Finally, the output tables show the list of genes and the respective TE families detected as chimeras, r efer ence transcript ID and the total coverage of chimeric reads supporting it. The support of chimeric transcripts performed by ChimeraTE Mode 2 is from chimeric reads aligned by an end-to-end approach. Such alignment may reduce alignment sensiti vity, since e xon / TE junctions may be covered by split reads. To mitigate the loss of detection power due to the alignment method with Mode 2, alongside with chimeric read detection using alignments against transcripts and TEs, there is an option to run Mode 2 with a transcriptome assembl y a pproach, w hich can be activated with –assembly parameter (Figure 2 C). This approach will use RNA-seq reads to perform a de novo transcriptome assembly with Trinity v2.9.1 ( 48 ). RepeatMasker v4.1.2 ( 49 ) is used to iden- tify assembled transcripts that may hav e TE-deri v ed se- quences providing –ref TEs , a custom TE library, or pre- defined TE consensus sequences from Dfam v3.3 ( 50 ), ac- cording to the tax onom y level, i.e.: flies , mouse , humans. Then, RNA-seq r eads ar e aligned with bowtie2 v.2.4.5 ( 46 ) against the assembled transcripts. Subsequently, the align- ment is used to identify whether transcripts containing TE- deri v ed sequences have chimeric reads, including split reads. All assembled transcripts with chimeric transcripts are se- lected as candidate chimeric transcripts. Next, these candi- dates are submitted to a homolo gy anal ysis with r efer ence transcripts using blastn v2.11.0 ( 51 ). Finally, all assembled transcripts with masked TEs that have at least 80% of sim- ilarity with r efer ence transcripts across 80% of their length (can be modified with – min length parameter) are consid- ered chimeric transcripts. All these steps are repeated for all RNA-seq replicates provided in the input. Finally, the list of chimeric transcripts obtained from all replicates with the transcriptome assembly approach is compared, and all chimeras that have been identified with at least ≥2 chimeric r eads and wer e found in ≥2 r eplicates, ar e consider ed as true chimeric transcripts. By activating the –assembly option in Mode 2, the output table will provide chimeric transcripts that have been predicted based on different evidences (Fig- ure 2 D): ( 1 ) only based on chimeric reads; ( 2 ) only based on transcript assembly; ( 3 ) based on chimeric reads and tran- script assembly. D. melanogaster wild-type strains: genome assemblies and RNA-seq To assess ChimeraTE’s performance, as well as the ef- ficiency in the identification of chimeric transcripts de- ri v ed from polymorphic TE insertions, we have mined pre- viousl y available RN A-seq data from ovaries of four D. melanogaster wild-type strains ( 52 ), two from Gotheron, France (44 56 ′ 0 ′′ N 04 53 ′ 30 ′′ E), named dmgoth101 and dmgoth63; and two from S ̃ ao Jos ́e do Rio Preto, Brazil (20 41 ′ 04.3 ′′ S 49 21 ′ 26.1 ′′ W), named dmsj23 and dmsj7. Such RNA-seq data was produced with mRNA libraries, which is strongly advised to use with ChimeraTE, since to- tal RNA libraries contain pre-mRNA sequences and might introduce many false positi v es of TE-e xonized transcripts. Sequencing was performed on an Illumina Hiseq (125 bp pair ed-end r eads), with two biological r eplicates. All RNA- seq libraries were trimmed by quality, and adapters were removed with Trimmomatic ( 53 ). Each strain had also its genome previously sequenced by Nanopore long reads and assembled ( 54 ). The high-quality assemblies allowed us to manually check whether chimeric transcripts predicted by both ChimeraTE Modes have the predicted TE insertions inside / near genes, as well as manually curate the presence of chimeric reads. Running ChimeraTE with D. melanogaster data To run ChimeraTE Mode 1 on the available RNA- seq data, we performed gene annotation in the four D. melanogaster genomes with Liftoff v.1.6.3 ( 55 ) us- ing -s 0.5 -s 0.75 -ex c lude par tial parameters and the r6.49 D. melanogaster genome ( dm6 strain) available in Flybase as r efer ence. TE annotation was performed with RepeatMasker v4.1.2 ( 49 ), with parameters: - nolow ; - norna ; -s; and -lib with the TE sequence library for D. melanogaster provided by Bergman’s lab v.10.2 ( https://github.com/bergmanlab/drosophila-transposons/ blob/master/current/D mel transposon sequence set.fa ). Small TE insertions with length < 80 bp were discarded due to the lack of robustness to define them as TEs, following the 80–80–80 rule ( 56 ). In addition, due to the presence of simple sequence repeats (SSRs) in many TEs from D. melanogaster , such as r oo , 412 , tir ant ( 57 ), se v eral insertions from RepeatMasker might be indistinguishable from SSRs if the r epeat r egion comprises a large propor- tion of the TE sequence ( 58 , 59 ). Ther efor e, to incr ease the robustness of the TE annotation, we identified the presence of SSRs with TRF v.4.09 ( 60 ) across all insertions annotated on each strain, with parameters 2 , 5 , 6 , 75 , 20 , 50 , 500 , -m , -h and removed TEs that presented SSRs over > 50% of their sequences. We used ChimeraTE Mode 1 with default parameters, setting –strand rf-stranded , since our paired-end reads have the orientation as re v erse and f orward f or the first and second r eads, r especti v ely. In case the stranded paired-end reads were produced as forward and reserve (first and second r eads, r especti v ely), the parameter would be –strand fwd-stranded . For ChimeraTE Mode 2, we demonstrate its potential in detecting chimeric transcripts deri v ed from TE insertions that are not present in a reference genome, e v en though the transcript sequences and TE copies provided as input stemmed from the dm6 r efer ence genome. Ther efor e, we have used ChimeraTE mode 2 with RNA-seq from the four wild-type strains listed previously, with reference transcripts from D. melanogaster ( dm6 strain) available in Flybase r6.49 ( http://ftp.flybase.net/genomes/Drosophila melano gaster/ current/fasta/dmel- all- transcript- r6.49.fasta.gz ) ( 61 ). The TE annotation for the dm6 genome was assessed with the same protocol used on the four wild-type strains, with RepeatMasker ( 49 ). We have used default parameters, ex cept b y –assembly . Additional species data In order to demonstrate that ChimeraTE can be used with other species than Drosophila spp., we applied it to D ow nloaded from https://academ ic.oup.com /nar/article/51/18/9764/7249921 by guest on 05 Septem ber 2024 36 9770 Nucleic Acids Research, 2023, Vol. 51, No. 18 Arabidopsis thaliana , Homo sapiens and the fish Poecilia reticulata (guppy) RNA-seq datasets. The latter was se- lected as an example of non-model species. The genome and gene annotation of A. thaliana was obtained from NCBI (GCF 000001735.14), version TAIR10.1. The TE annotation was assessed with RepeatMasker v.4.1.2 ( 49 ), with par ameter s -species ar abidopsis -cutoff 225 -nolo w - norna -a -s . We downloaded two replicates of paired- end RNA-seq from A. thaliana leaf, available on NCBI (SRR21230172, SRR21230173). Regarding human data, the genome, gene and TE annotations wer e r etrie v ed from NCBI ( 62 ), version GRCh38.p14. Two paired-end RNA- seq replicates from the human leukemic cell-line K562 were downloaded from NCBI (SRR521457, SRR521461). Fi- nally, P. reticulata’ s genome and gene annotation were also retrie v ed from NCBI (GCF 000633615.1), and TE annota- tion was assessed with RepeatMasker v.4.1.2 , parameters : -species Actinopteri -cutoff 225 -nolow -norna -a -s . Two RNA-seq replicates from ovary follicular tissue were down- loaded from NCBI (SRR17332506, SRR17332508). Both ChimeraTE Mode 1 and Mode 2 were used with default pa- rameters, except by –threads 32 , and –ram 64 . Benchmarking polymorphic chimeric transcripts with nanopore genomes Once chimeric transcripts were identified, we used the high- quality Nanopore assemblies for dmgoth101, dmgoth63, dmsj23 and dmsj7 previously published ( 54 ) to confirm whether genes predicted by Mode 1 as chimeric transcripts have indeed the respective TE insertion located near or within them. To do so, we have used an ad-hoc bash script to create three bed files from genes: 3 kb upstream; 3 kb downstr eam, and gene r egion. Then, we used bedtools inter- sect ( 44 ) to identify genes with TEs located in the three re- gions. For Mode 1 we have randomly sampled 100 chimeric transcripts of each wild-type strain to visualize the align- ments performed by Mode 1 on the IGV genome browser ( 63 ). For Mode 2, all genes not found by bedtools intersect with the predicted TE insertion were visualized in IGV . In both manual curations, we considered false positi v es those cases in which we did not find the TE insertion, or we found the TE insertion, but without chimeric reads. In order to assess the number of chimeric transcripts found by Mode 2 in wild-type strains deri v ed from TE in- sertions absent in the dm6 genome, we also used the ad- hoc bash script to create the bed files with 3 kb ± and the gene regions for dm6 . Then, we used bedtools inter- sect ( 44 ) with TE annotation and the gene regions. By us- ing this method, we generated a list of genes with TEs lo- cated 3 kb upstream, inside genes (introns / exons), and 3 kb downstream for the dm6 genome. Then, the polymorphic chimeric transcripts were identified with the comparison of genes with TEs inside / nearby in dm6 and the list of chimeric transcripts in the wild-type strains. In addition, all chimeric transcripts deri v ed from TEs that were not found in dm6 were manually curated with the IGV genome browser ( 63 ). Validation of chimeric transcripts by RT-PCR D. melanogaster strains were kept at 24 ◦C in standard labo- ratory conditions on cornmeal–sugar–yeast–agar medium. For sampling, 3–7 day old D. melanogaster females were immersed in Phospha te-buf fered saline solution (PBS) and dissected under a ster eomicroscope. Thr ee biological sam- ples were collected in buffer TA (35 mM Tris / HCl, 25 mM KCl, 10 mM MgCl 2 , 250 mM sucrose, pH 7.5), each com- posed of 30 pairs of ovaries. All samples were collected on ice in 1.5 ml RNAse free tubes and stored at −80 ◦C un- til use. Total RNA was extracted using QIAGEN AllPrep DN A / RN A Micro extraction kit (Qiagen 80284) follow- ing the manufacturer’s guideline. cDNA synthesis was per- formed on 1 !g of total RNA or water (RT negati v e control) with BIORAD iScript cDNA systhesis (Biorad 1708891). Primers were designed considering the TE and the exon lo- cation with aligned chimeric reads (Supplementary Table 1) and PCRs were performed with Taq’Ozyme (Ozya001- 1000). Sequence and protein analysis of TE-e x onized elements The sequences of TE-exonized elements were extracted from wild-type genomes with bedtools g etf asta ( 44 ), param- eter -s, using BED files created by ChimeraTE Mode 1. Since the TE reading frame incorporated into the gene tran- script is unknown, we r ecover ed all open r eading frames (ORFs) with EMBOSS v.6.6.0 getorf ( 64 ) in the three cod- ing frames, from both strands. Next, the protein domains in these sequences were assessed with pfam-scan v.1.6 ( 65 ), us- ing Pf am-A.hmm da tabase fr om InterPr o ( 66 ), with default parameters. We applied a filtering step for protein domains with multiple overlapped matches in the same TE insertion, keeping in the downstream analysis only the longest match. We also removed protein domains that had no association with any TE function, following the description from In- terPro. Finall y, to anal yze w hether r oo elements genera ting TE-exonized embedded transcripts provided a specific mo- tif to the chimeric exons, we extracted their sequences with bedtools g etf asta , and aligned with r oo consensus sequence with MUSCLE v.5.1 ( 67 ). Alignment plots were performed with MIToS v.2.11.1 ( 68 ), using the package Plots . Differ ential expr ession analysis The read count for differential gene expression analysis for the four wild-type strains was obtained from Fablet et al. ( 52 ). The data were normalized with DEseq2 v.3.10 ( 69 ), with a designed model ( ∼genotype), assuming that the only variable to identify differ entially expr essed genes between strains must be the genotype. We performed pairwise com- parisons with the four wild-type strains, considering differ- entially expressed genes the ones with adjusted p-value < 0.05. RESULTS AND DISCUSSION ChimeraTE predicts chimeric transcripts deri v ed from genes and TEs using two different strategies. Mode 1 is a genome-guided approach that will predict chimeric tran- scripts from paired-end RNA-seq through chimeric read pair detection. The main advantages of Mode 1 in compari- son to Mode 2 are that the first one can detect split reads be- tween exons and TEs, capture chimeric transcripts with low D ow nloaded from https://academ ic.oup.com /nar/article/51/18/9764/7249921 by guest on 05 Septem ber 2024 37 Nucleic Acids Research, 2023, Vol. 51, No. 18 9771 cov erage / e xpression and classify chimeric transcripts ac- cording to the TE position: TE-initiated, TE-exonized, TE- terminated transcripts. Howe v er, Mode 1 misses chimeric transcripts deri v ed from TE insertions that are absent from a r efer ence genome. Mode 2, w hich is a genome-blind a p- proach, performs two alignments against r efer ence tran- scripts and TE copies and then, similar to Mode 1, predicts chimeric reads between these two alignments. In addition, Mode 2 can optionally perform a de novo transcriptome assemb ly ab le to detect chimeric transcripts and chimeric reads through split read alignment, improving the sensitiv- ity. Despite the similarities between both Modes, the re- sults from Mode 1 and 2 had an overlap of 50.46% (Sup- plementary Data, note 1). We have shown that such differ- ences are likely to be associated with the alignment used by the Modes, and read length (Supplementary Data, note 1, Fig S1.1) ( 70 , 71 ). We also observed that increasing the minimum number of chimeric reads (default = 2) causes a decrease of 48.45% for Mode 1 and 33.48% for Mode 2 in the total number of chimeras found in two RNA-seq replicates (Supplementary Data, note 2). Ther efor e, a sub- stantial part of the results is detected with low coverage of chimeric reads. Finally, we demonstrated that the use of RNA-seq replicates is crucial to reduce the frequency of potential chimeras supported by artifactual chimeric reads (Supplementary Data, note 3) ( 37 , 72–77 ). Setting up the datasets for ChimeraTE Each ChimeraTE Mode r equir es differ ent input datasets. To run Mode 1 (genome-guided), gene and TE annotations, along with a genome fasta file are necessary. We took ad- vantage of available paired-end RNA-seq datasets from the ovaries of four wild-type strains of D. melanogaster ( 52 ), for which high-quality genome assemblies were also available ( 54 ). We performed gene annotation in the new assemblies using D. melanogaster ’s genome ( dm6 ) from Flybase r6.49 as a r efer ence and obtained ∼17,278 genes per genome (Sup- plementary Tab le 2). Regar ding TE annotation, we found ∼10.48% of TE content in the four wild-type strains, sim- ilar to our previous estimates for these strains ( 54 ). Filter- ing out TE insertions smaller than 80 bp, removed ∼5,549 TEs (20.59%) across the four wild-type strains. In addition, to remove potential mapping artifacts, we discarded ∼752 TE copies (2.79%) per strain that harbored SSRs longer than half the T