Submission Details
Title
Application of Machine Learning Algorithms for Identification of Viruses in Dark Matter from Next-Generation Sequencing
Introduction
Metagenomic methods are among the most powerful tools for identifying emerging or poorly characterized viruses. With the advent of next-generation sequencing (NGS) technologies and taxonomic classifiers, it has become feasible to discern the genetic makeup of these viruses and assign the identified sequences to their respective taxa. However, a subset of sequences remains unassociated with any known taxon and is commonly referred to as "dark matter." Dark matter is a significant obstacle to a comprehensive understanding of the metagenome, and its content may reveal novel pathogens capable of infecting humans.
Objective(s)
The primary aim of this study is therefore to characterize the viral content of the unclassified fraction, using sequences obtained from nasopharyngeal swab samples of pediatric patients at Hospital das Clínicas de Ribeirão Preto who tested negative for SARS-CoV-2 in 2021.
Materials and Methods
The samples were sequenced using NGS, and the raw data were subjected to quality control. Next, reads were mapped to the human genome and the mapped reads were removed; the remaining unmapped reads were then taxonomically classified. The entire unclassified fraction (dark matter) was then searched for viral protein families, which were used to train and test supervised machine learning models, namely Naïve Bayes, Random Forest, XGBoost, and LightGBM.
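The abstract does not specify how the protein-family hits were encoded or which implementations were used. As an illustration only, the following minimal sketch assumes each unclassified sequence is represented by the set of viral protein families detected in it (presence/absence) and trains a small Bernoulli Naïve Bayes classifier written in plain Python. The family names, the toy training data, and the helper functions train_bernoulli_nb and predict are hypothetical and are not taken from the study.

```python
import math
from collections import defaultdict


def train_bernoulli_nb(samples, labels, alpha=1.0):
    """Fit a Bernoulli Naive Bayes model on sets of protein-family hits.

    samples: list of sets of family identifiers (hypothetical names).
    labels:  list of virus-group labels, one per sample.
    alpha:   Laplace smoothing constant.
    """
    vocab = sorted(set().union(*samples))
    class_counts = defaultdict(int)
    feature_counts = defaultdict(lambda: defaultdict(int))
    for hits, label in zip(samples, labels):
        class_counts[label] += 1
        for fam in hits:
            feature_counts[label][fam] += 1

    total = len(labels)
    model = {"vocab": vocab, "priors": {}, "probs": {}}
    for label, n in class_counts.items():
        model["priors"][label] = math.log(n / total)
        # Smoothed probability that a family is present given the class.
        model["probs"][label] = {
            fam: (feature_counts[label][fam] + alpha) / (n + 2 * alpha)
            for fam in vocab
        }
    return model


def predict(model, hits):
    """Return the label with the highest log-posterior for a set of hits."""
    best_label, best_score = None, -math.inf
    for label, prior in model["priors"].items():
        score = prior
        for fam in model["vocab"]:
            p = model["probs"][label][fam]
            score += math.log(p if fam in hits else 1.0 - p)
        if score > best_score:
            best_label, best_score = label, score
    return best_label


# Toy, made-up training data: family hits per sequence and its virus group.
train_samples = [
    {"fam_terminase", "fam_portal"},
    {"fam_terminase", "fam_tail"},
    {"fam_portal", "fam_tail"},
    {"fam_vp1", "fam_3dpol"},
    {"fam_vp1", "fam_2a_protease"},
    {"fam_3dpol", "fam_2a_protease"},
]
train_labels = ["Caudovirales"] * 3 + ["Enterovirus"] * 3

model = train_bernoulli_nb(train_samples, train_labels)
print(predict(model, {"fam_terminase"}))
print(predict(model, {"fam_vp1", "fam_3dpol"}))
```

In practice the same presence/absence matrix could be passed to library implementations of Random Forest, XGBoost, or LightGBM; the plain-Python model here is only meant to make the training and testing step concrete.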
Results and Conclusion
Our approach revealed four predominant virus groups within the dark matter (Caudovirales, Enterovirus, Respiratory Syncytial Virus, and Torque Teno Virus), indicating that certain genomic sequences evade taxonomic classification. Moreover, while XGBoost exhibited the best performance, Random Forest yielded the most reliable outcomes.
Keywords
Viral metagenomics; Viral dark matter; Machine Learning; Bioinformatics.
Area
Track 10 | 4. Other human and veterinary viral diseases - Other
Young Researcher Award
4. I do not wish to compete
Authors
Gabriel Montenegro de Campos, Luan Gaspar Clemente, Alex Ranieri Jerônimo Lima, Milton Yutaka Nishiyama Junior, Eneas de Carvalho, Sandra Coccuzo Sampaio, Maria Carolina Elias, Svetoslav Nanev Slavov