Caribou pipeline for the alignment-free bacterial identification and classification in metagenomics sequencing data using machine learning

Description: This dataset contains sequencing data used to train the models of the Caribou pipeline. We developed this pipeline for alignment-free bacterial identification and classification in metagenomics sequencing data using machine learning. The datasets were derived from the GTDB v.202 database and include training steps using the species representatives, as the benchmark datasets used non-representative whole genomes. We also simulated sequencing reads to evaluate and compare performance on whole genomes and sequencing reads. We provide models and encoding files of CNN-trained models; datasets used for training, validation and testing of models, randomly sampled from representative genomes; and datasets used for benchmarking the method against state-of-the-art methods, randomly sampled from non-representative whole genomes and simulated reads.
Notes: This data was used in Nicolas de Montigny's Master's degree thesis.
Authors: de Montigny, Nicolas; University of Quebec at Montreal; ORCID iD 0000-0002-3708-4055
Kembel, Steven W.; University of Quebec at Montreal; ORCID iD 0000-0001-5224-0952
Diallo, Abdoulaye Baniré; University of Quebec at Montreal
Keywords: metagenomics
alignment-free DNA classification
classification models
machine learning
neural networks
bacterial genomes
taxonomic classification
Field of Research: 
Computer and information sciences
Artificial intelligence (AI)
Machine learning
Publication Date: 2024-12-12
Publisher: Federated Research Data Repository / dépôt fédéré de données de recherche
Funder: Natural Sciences and Engineering Research Council of Canada (NSERC)
Related Identifiers: 
This dataset is part/subset of

Access to this dataset is subject to the following terms:
Creative Commons Attribution 4.0 International (CC BY 4.0)
de Montigny, N., Kembel, S. W., Diallo, A. B. (2024). Caribou pipeline for the alignment-free bacterial identification and classification in metagenomics sequencing data using machine learning. Federated Research Data Repository.