Field
Value
Language
dc.contributor.author
de Montigny, Nicolas
datacite.creator.affiliationIdentifier
https://ror.org/002rjbv21
en_US
datacite.creator.affiliation
University of Quebec at Montreal
en_US
datacite.creator.nameIdentifier
https://orcid.org/0000-0002-3708-4055
en_US
dc.contributor.author
Kembel, Steven W.
datacite.creator.affiliationIdentifier
https://ror.org/002rjbv21
en_US
datacite.creator.affiliation
University of Quebec at Montreal
en_US
datacite.creator.nameIdentifier
https://orcid.org/0000-0001-5224-0952
en_US
dc.contributor.author
Diallo, Abdoulaye Baniré
datacite.creator.affiliationIdentifier
https://ror.org/002rjbv21
en_US
datacite.creator.affiliation
University of Quebec at Montreal
en_US
datacite.creator.nameIdentifier
en_US
dc.date.accessioned
2024-12-12T19:14:05Z
dc.date.available
2024-12-12T19:14:05Z
dc.date.issued
2024-12-12
dc.identifier.uri
https://doi.org/10.20383/103.01160
dc.identifier.uri
https://www.frdr-dfdr.ca/repo/dataset/6536e425-a10a-46b1-acae-da529f061915
dc.description
This dataset contains the sequencing data used to train the models of the Caribou pipeline, which we developed for alignment-free bacterial identification and classification in metagenomics sequencing data using machine learning. The datasets were derived from the GTDB v.202 database (https://data.gtdb.ecogenomic.org/releases/release202/202.0/). The training steps used the species representative genomes, whereas the benchmark datasets used non-representative whole genomes. We also simulated sequencing reads so that performance could be evaluated and compared on both whole genomes and sequencing reads. We provide: the trained CNN models and their encoding files; the datasets used for training, validation and testing of the models, randomly sampled from representative genomes; and the datasets used for benchmarking the method against state-of-the-art methods, randomly sampled from non-representative whole genomes and simulated reads.
en_US
dc.publisher
Federated Research Data Repository / dépôt fédéré de données de recherche
dc.rights
Creative Commons Attribution 4.0 International (CC BY 4.0)
en_US
dc.rights.uri
https://creativecommons.org/licenses/by/4.0/
en_US
dc.subject
metagenomics
en_US
dc.subject
alignment-free DNA classification
en_US
dc.subject
classification models
en_US
dc.subject
machine learning
en_US
dc.subject
neural networks
en_US
dc.subject
bacterial genomes
en_US
dc.subject
taxonomic classification
en_US
dc.title
Caribou pipeline for the alignment-free bacterial identification and classification in metagenomics sequencing data using machine learning
en_US
globus.shared_endpoint.name
f163c1b3-9c88-42f6-a7bb-5839ed6c4063
globus.shared_endpoint.path
/1/published/publication_1155/
datacite.publicationYear
2024
datacite.contributor.DataCollector
Nicolas de Montigny
datacite.contributor.Supervisor
Steven W. Kembel
datacite.contributor.Supervisor
Abdoulaye Baniré Diallo
datacite.date.Collected
/2021-04-27
datacite.resourceType
Dataset
en_US
datacite.relatedIdentifier.IsPartOf
https://github.com/bioinfoUQAM/Caribou/
datacite.relatedIdentifier.IsDerivedFrom
https://gtdb.ecogenomic.org/stats/r202#gtdb-species-representatives
datacite.relatedIdentifier.IsDerivedFrom
https://data.gtdb.ecogenomic.org/releases/release202/202.0/
datacite.fundingReference.funderIdentifier
en_US
datacite.fundingReference.funderName
Natural Sciences and Engineering Research Council of Canada (NSERC)
en_US
datacite.fundingReference.awardNumber
en_US
datacite.fundingReference.awardTitle
en_US
frdr.crdc.code
RDF1020104
en_US
frdr.crdc.group_en
Computer and information sciences
en_US
frdr.crdc.class_en
Artificial intelligence (AI)
en_US
frdr.crdc.field_en
Machine learning
en_US
frdr.crdc.group_fr
Informatique et systèmes d'information
fr_CA
frdr.crdc.class_fr
Intelligence artificielle (IA)
fr_CA
frdr.crdc.field_fr
Apprentissage machine
fr_CA
datacite.description.other
These data were used in Nicolas de Montigny's Master's thesis (https://archipel.uqam.ca/18182/).
en_US