LEMMIv2
Documentation and user guide
Last update: Feb 2023, LEMMI v2.1
TIP
See a short tutorial on the LEMMI standalone pipeline here.
Table of contents:
About LEMMI
- What does LEMMI do?
- What do I find on lemmi.ezlab.org?
- What is the standalone pipeline for?
- Why use only containers?
- How is the GTDB taxonomy integrated?
- How does the calibration of a tool happen?
- What are unknown taxa?
- How does LEMMI prevent over-fitting on public data?
How to use LEMMI standalone
- Step 1: Setup of the standalone LEMMI pipeline
- Step 2: Defining benchmarking instances and runs
- Step 3: Exploring the results
- Make your tool compatible with LEMMI: a tutorial with Kraken2
A brief methods section for LEMMIv2
# About LEMMI
# What does LEMMI do?
LEMMI, A Live Evaluation of Computational Methods for Metagenome Investigation, is an online resource and a pipeline dedicated to continuous benchmarking of newly published metagenomics taxonomic classifiers. LEMMI generates scenarios (instances) that represent a given biological problem and evaluates how candidate tools can solve them. Solving here means providing the abundance profile of all taxa that can be identified in the sample.
It uses a container-based approach and exploits the ability of many methods to construct a reference database to compare them under identical conditions. By using publicly available data to generate benchmarking samples, it is reactive and flexible.
LEMMI provides a standalone pipeline to conduct personal benchmarking and/or to help prepare pipelines for the LEMMI online benchmarking. While the LEMMI team surveys and benchmarks tools, we highly encourage developers to take the lead and request an evaluation of their tool after making it compatible. We also encourage developers to contact us with suggestions; we do independent benchmarking but aim at optimizing the tools we evaluate so our platform reflects the best users can get.
The goal of LEMMI is to maintain an up-to-date catalog of method implementations that represent what is truly accessible to users to date. Appearing in LEMMI is an indication that a method implementation can be set up on standard/independent computational environments. We aim at continuously re-evaluating established methods; new software releases, even when no change in the core method is implemented, can affect the outcome of the analyses.
# What do I find on lemmi.ezlab.org?
Several benchmarking scenarios maintained by the LEMMI team. They allow us to evaluate tools of interest in an independent computational environment. The available benchmarking instances are listed on the landing page of the platform. Each of them has a target clade (PROK, EUK, VIR) and a taxonomy to use (NCBI or GTDB) when reporting the results. An instance is a set of in-silico samples; they can also contain a host and contaminants that have to be filtered out. Some samples are used for calibration of the tool (i.e. finding the read-filtering threshold that maximizes the performance), others are negative samples to identify systematic errors, and the rest are used for the actual evaluation.
# As a tool user
When exploring the LEMMI results, you will find how different tools perform on each specific scenario, with metrics such as precision and recall, but also runtime and memory consumption. From the result page, you will be able to follow links to the tool website, git page, or obtain the container that was used to run the benchmarking process.
# As a tool developer
Publishing a method paper is necessary but not sufficient to guarantee your method implementation will remain functional software over time that behaves well in most environments. By benchmarking it on the LEMMI platform, you can demonstrate what the latest version of your software is capable of, on standardized scenarios, and side by side with other available methods, no matter how popular and established or novel and still confidential.
# What is the standalone pipeline for?
The composition of the samples offered on lemmi.ezlab.org may not be the best to represent the type of experiment you are conducting. With the LEMMI standalone pipeline, you can design your own benchmarks and directly assess how existing tools perform on them. It also ensures the reproducibility of the evaluations.
Furthermore, as a tool developer, you will need to run LEMMI on your own in order to prepare your tool to be evaluated on our platform. It will also be informative to run your tool against alternative tools in an automated procedure in which you do not install and configure these tools yourself (there is an obvious risk of optimizing your own tool to the point it introduces a clear bias). If all developers prepare their own container for LEMMI, each tool will have received the same level of attention no matter who runs it.
# Why use only containers?
Wrapping everything in a software container decreases the risk of incompatibilities and conflicts. It ensures that two runs at two different times and in two different environments will be based on the same version of all dependencies. It enables a very simple loading, running, and unloading procedure while isolating the run so memory and runtime can be easily tracked. Conda could be considered in the future, but as there are ways to convert conda distributions into containers quite easily, we stick to that restricted approach for now (with Docker and Singularity for more flexibility).
# How is the GTDB taxonomy integrated?
As many tools use the NCBI dmp files, we convert the GTDB taxonomy into NCBI-like files, and the change becomes invisible to any tool that accepts these files without validating them against NCBI content. See the methods section below.
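As an illustration of the idea (a sketch, not LEMMI's actual conversion code), GTDB lineage strings can be flattened into NCBI-style nodes.dmp/names.dmp records by assigning synthetic taxids; the `gtdb_to_dmp` helper below is a hypothetical name:

```python
# Sketch: turn GTDB lineage strings into NCBI-style nodes.dmp / names.dmp
# records with synthetic taxids. Illustrative only.
RANKS = {"d": "superkingdom", "p": "phylum", "c": "class", "o": "order",
         "f": "family", "g": "genus", "s": "species"}

def gtdb_to_dmp(lineages):
    """Assign synthetic taxids and emit (nodes, names) dmp-style rows."""
    taxid = {"root": 1}
    nodes = [(1, 1, "no rank")]   # (taxid, parent taxid, rank)
    names = [(1, "root")]
    for lineage in lineages:
        parent = 1
        for token in lineage.split(";"):
            prefix, name = token.split("__", 1)
            if token not in taxid:
                taxid[token] = len(taxid) + 1
                nodes.append((taxid[token], parent, RANKS[prefix]))
                names.append((taxid[token], name))
            parent = taxid[token]
    return nodes, names

nodes, names = gtdb_to_dmp(["d__Bacteria;p__Proteobacteria;c__Gammaproteobacteria"])
```

A tool reading these generated dmp files cannot tell them apart from genuine NCBI dumps, as long as it does not cross-check the taxids against NCBI.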
# How does the calibration of a tool happen?
When LEMMI generates an instance, it generates calibration samples and evaluation samples. The evaluated tools will process them indistinguishably. When assessing the predictions, LEMMI determines on the calibration samples which threshold maximizes the F1-score for the tool (this can be seen as training the tool on known samples) and then uses that threshold to compute the metrics on the evaluation samples.
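The calibration logic can be sketched as follows; `f1_at_threshold`, `best_threshold`, and the toy abundance profile are hypothetical names for illustration, not LEMMI's implementation:

```python
# Sketch of the calibration idea: on a calibration sample with known
# composition, find the abundance-filtering threshold that maximizes the
# F1-score, then reuse that threshold on the evaluation samples.
def f1_at_threshold(predictions, truth, threshold):
    kept = {taxon for taxon, abundance in predictions.items() if abundance >= threshold}
    tp = len(kept & truth)
    if tp == 0:
        return 0.0
    precision = tp / len(kept)
    recall = tp / len(truth)
    return 2 * precision * recall / (precision + recall)

def best_threshold(predictions, truth):
    candidates = sorted(set(predictions.values()))
    return max(candidates, key=lambda t: f1_at_threshold(predictions, truth, t))

preds = {"A": 0.5, "B": 0.3, "C": 0.001, "D": 0.0005}  # taxon -> predicted abundance
truth = {"A", "B"}                                      # taxa really in the sample
thr = best_threshold(preds, truth)                      # -> 0.3, which discards C and D
```

Low-abundance false positives (C and D here) are filtered out once the threshold learned on the calibration sample is applied to the evaluation samples.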
# What are unknown taxa?
In the context of LEMMI, unknown taxa are organisms for which no reference genome of the same species is provided to the tool for building a reference. They make the sample more challenging and realistic, to better recognize methods that do not over-classify. When a LEMMI instance includes such taxa, no tool with an embedded reference (e.g. marker genes) should be used.
# How does LEMMI prevent over-fitting on public data?
LEMMI always excludes the genomes that are used to generate reads from the set of genomes that are provided as reference. LEMMI can also use a date to split between training/reference and testing so the most recent ones are placed in the reads. This is useful when including methods with embedded references to ensure the sample is made of genomes published after the references were built.
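A date-based split can be sketched like this (illustrative only; `split_by_date` is not a LEMMI function):

```python
from datetime import date

# Sketch: split genomes by release date so the reference contains only
# genomes published before a cutoff, while reads are drawn from newer ones.
def split_by_date(genomes, cutoff):
    """genomes: mapping accession -> release date."""
    reference = {acc for acc, released in genomes.items() if released < cutoff}
    reads = set(genomes) - reference
    return reference, reads

genomes = {
    "GCA_000001": date(2019, 5, 1),
    "GCA_000002": date(2021, 3, 15),
    "GCA_000003": date(2022, 1, 10),
}
reference, reads = split_by_date(genomes, date(2021, 1, 1))
```

With a cutoff matching the release date of a tool's embedded reference, the sample is guaranteed to contain only genomes the tool could not have seen.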
# How to use LEMMI standalone
# Step 1 | Setup of the standalone LEMMI pipeline
TIP
The demo setting included will generate a small repository and a few samples to run with Kraken2. On a laptop, it takes about 1 hour to complete the demo benchmarking run, depending on the number of CPUs and your download bandwidth.
TIP
See all tools we have prepared so far on https://quay.io/user/ezlab. They can be used directly for benchmarking. If you develop one of these tools and think the wrapper we created could be improved to better reflect the true performance of the method, please contact us.
In this first step, you will put together all necessary software components and fill the main configuration file that defines which data will be available to create LEMMI benchmarking instances. At the end of this step you will be able to run the pipeline for the first time and let it populate its repository (a rather long job running once).
To get help, please open an issue on https://gitlab.com/ezlab/lemmi-v2/-/issues. Please keep it public unless you have good reasons not to, so it will benefit others.
# i) LEMMI code: clone the sources and set ENV variables
See https://gitlab.com/ezlab/lemmi-v2
git clone https://gitlab.com/ezlab/lemmi-v2.git
export LEMMI_ROOT=/your/path/lemmi-v2
export PATH=$LEMMI_ROOT/workflow/scripts:$PATH
# ii) Install dependencies in a mamba environment
conda install -n base -c conda-forge mamba
mamba env update -n lemmi --file $LEMMI_ROOT/workflow/envs/lemmi.yaml
conda activate lemmi
# to deactivate or remove if necessary
conda deactivate
conda remove --name lemmi --all
TIP
It is assumed you can use conda and have it installed.
If you do not wish to use this method, the dependencies to install are:
- snakemake >=6.0.0
- biopython
- pandas >= 1.2.0
- numpy
- ete3
- pyyaml
# iii) Container engines
LEMMI runs candidate tools as containers, as well as a few internal tasks in what is called the "LEMMI master" container. Install Docker or Singularity.
DANGER
With LEMMI, you can choose to automatically run any candidate tool publicly available. You may not have a clear view of how many resources they will require before running them. Please pay attention to how your system deals with containers that exceed available memory. If you choose to use Docker, an option will be introduced below.
No specific version is recommended for Docker; LEMMI was developed and tested with Docker version 20.10.5, build 55c4c88.
For Singularity, version 3+ is necessary. LEMMI was tested on our HPC environment with the following versions: module load GCCcore/8.2.0 Python/3.7.2 Singularity/3.4.0-Go-1.12
# iv) Main config file
cd $LEMMI_ROOT/config && cp config.yaml.default config.yaml
and then edit with your favorite editor.
Here you will define the source of the genomes and taxonomy available in LEMMI. The types of files have to match those provided in the example below. They are either NCBI data or GTDB data.
ncbi_dmp: https://ftp.ncbi.nih.gov/pub/taxonomy/taxdump.tar.gz
gtdb_metadata:
- https://data.ace.uq.edu.au/public/gtdb/data/releases/release202/202.0/bac120_metadata_r202.tar.gz
- https://data.ace.uq.edu.au/public/gtdb/data/releases/release202/202.0/ar122_metadata_r202.tar.gz
prok_clade:
- g__Acetobacter
- c__Methanobacteria
genbank_assembly: https://ftp.ncbi.nlm.nih.gov/genomes/ASSEMBLY_REPORTS/assembly_summary_genbank.txt
euk_clade:
- Amycolatopsis methanolica
vir_clade:
- Herpesvirales
host_genome: https://hgdownload.cse.ucsc.edu/goldenpath/hg19/chromosomes/chr22.fa.gz
host_taxid: 9606
max_genomes_per_organism: 5
lemmi_master: quay.io/ezlab/lemmi_master:v1
group_download_jobs: 8
lemmi_seed: 1
docker_memory: 10g
The keywords `euk_clade`, `prok_clade`, and `vir_clade` allow you to filter the content and retain only organisms within these clades. This can be useful when running tests, to avoid creating a large repository that takes time and disk space, or if you investigate very specific organisms and contaminants. Everything from the reads to the reference database will be restricted to these organisms.
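The clade filter can be pictured as a simple membership test on GTDB lineage tokens (a sketch, not the actual implementation; `in_clades` and the toy lineages are made up):

```python
# Sketch: keep only genomes whose lineage contains one of the configured
# clade tokens (e.g. g__Acetobacter, c__Methanobacteria).
def in_clades(lineage, clades):
    tokens = set(lineage.split(";"))
    return bool(tokens & set(clades))

lineages = {
    "G1": "d__Bacteria;p__Proteobacteria;c__Alphaproteobacteria;o__Acetobacterales;f__Acetobacteraceae;g__Acetobacter",
    "G2": "d__Archaea;p__Methanobacteriota;c__Methanobacteria",
    "G3": "d__Bacteria;p__Firmicutes;c__Bacilli",
}
selected = [g for g, lin in lineages.items()
            if in_clades(lin, ["g__Acetobacter", "c__Methanobacteria"])]
```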
TIP
This is not the place to choose which organisms will go in your samples, that is for later. If you want to use a comprehensive list of genomes as a reference, do not restrict the taxonomic content of the LEMMI repository and use:
prok_clade:
- d__Archaea
- d__Bacteria
vir_clade:
- Viruses
euk_clade:
- Eukaryota
In this file, `prok_clade` follows the naming convention of GTDB and the rest follows the NCBI naming conventions. However, it will still be possible to use the NCBI taxonomy with prokaryotes later in the process when creating benchmarking instances and runs.
Some organisms are represented by dozens of genomes, which may be unnecessary in the context of LEMMI benchmarking and only contribute to using more memory and space. The option `max_genomes_per_organism` allows you to control that aspect.
`group_download_jobs` is only useful when using a workload manager such as Slurm; it submits multiple download tasks as a single job to decrease the time spent waiting in the queue for small jobs.
A seed value (`lemmi_seed`) is used to fix the outcome of pseudo-random computations in Python.
If you use Docker, you can limit the amount of memory given to any loaded container before it is killed by the container engine (`docker_memory`). If you use Singularity, please check their documentation or ask your sysadmin; it may involve the use of cgroups.
TIP
At the moment, this main config file will affect the whole environment, so if you want to set multiple data sources and taxonomic versions, you will have to set up LEMMI in multiple isolated environments (including different values for the $LEMMI_ROOT variable) using different locations on the file system.
# v) The LEMMI master container
Some of the tasks are executed within a container that is not a candidate tool. Its location (a Docker repository, for example `quay.io/ezlab/lemmi_master:v1`) is declared in the config file above. You can optionally build it with a local name such as `lemmi_master` as follows:
cd ${LEMMI_ROOT} && docker build ${LEMMI_ROOT} -t lemmi_master --file ${LEMMI_ROOT}/workflow/lemmi_master.Dockerfile
and use its local name `lemmi_master` in the main config file. It is unlikely that you will need to do that.
# vi) If using Singularity, export all containers as .sif files
For now, Singularity requires all containers (lemmi_master and candidate tools) to be exported as `.sif` files and placed in the `${LEMMI_ROOT}/benchmark/sif/` folder.
The name of the file is what should be used wherever the name of a container needs to be provided. E.g. in the main config file, `lemmi_master: lemmi_master` would point to `${LEMMI_ROOT}/benchmark/sif/lemmi_master.sif`.
To convert a Docker image to a Singularity image, you can use the docker2singularity image as follows:
docker run -v /var/run/docker.sock:/var/run/docker.sock -v $(pwd):/output --privileged -t --rm quay.io/singularity/docker2singularity lemmi_master
You will obtain a `.sif` file that you can rename according to your needs and place in the `${LEMMI_ROOT}/benchmark/sif/` folder.
TIP
Docker2singularity is not a project maintained by the LEMMI team. See the info section of their repository if necessary.
# vii) Commands to run the pipeline
WARNING
Once you have reached this point, the main setup of LEMMI is done. You can now run the pipeline for the first time to let LEMMI populate its repository in `$LEMMI_ROOT/benchmark/repository/`. This should take several hours depending on the size of the reference, the bandwidth, and the number of parallel tasks permitted by your environment.
You can also directly move to the next step of this guide and run it later.
LEMMI is run by calling
lemmi_full --cores 8 # running locally on Docker
lemmi_full --cores 8 --use-singularity # running locally on Singularity
lemmi_full --use-singularity --profile cluster --jobs 8 # running on Singularity on a cluster, see below
# to catch all the logs, we suggest proceeding this way:
# this ensures that all outputs are captured, as some may be lacking in ./snakemake/log/*.log
lemmi_full --cores 8 > out 2>&1 &
tail -f out
# or separate stdout stderr
lemmi_full --cores 8 > out 2> err &
The `lemmi_full` command will run all subpipelines of LEMMI. There are five:
- Set up the repository according to the main config file (Step 1 above)
- Define instances composition (below)
- Create the samples (below)
- Run the tools (below)
- Evaluate the results (below)
TIP
Under the hood, the command will trigger multiple Snakemake pipelines. You can pass all standard Snakemake parameters to the `lemmi_full` command, such as `--dry-run` or `--unlock`
# viii) Job scheduling system
You can run LEMMI on a job scheduling system following the Snakemake documentation. We tested it on Slurm. We designed the pipeline to be compatible with profiles, see https://github.com/snakemake-profiles. We tested it with https://github.com/Snakemake-Profiles/slurm.
You can configure the number of CPUs, memory, queue, and runtime as follows, by editing the config.yaml file defined in the profile you are passing to Snakemake. Each entry corresponds to a Snakemake rule that appears throughout the LEMMI workflow.
__default__:
queue: shared-cpu
time: 0.25
create_vir_list:
queue: shared-cpu
time: 5
download_group:
queue: shared-cpu
time: 0.1
up_ete3:
queue: shared-cpu
mem: 20
time: 1
create_host_reads:
time: 1
mem: 20
# Step 2 | Defining benchmarking instances and runs
In this second step, you will create multiple `.yaml` files to define what needs to be generated and benchmarked.
TIP
Demo files that will trigger small runs are present by default. However, if you already edited the default clade in the main config file to define a comprehensive genome repository, it will increase the resources and time necessary for constructing the reference.
TIP
The focus of this second step is to run candidate tools that are already containerized and include the LEMMI benchmarking script. To learn how to create such a container with your own tool, see the last section of this user guide.
# i) Defining a LEMMI instance
A LEMMI instance is a series of samples with a similar (but not identical) composition. You will define the types of organisms, target clades, contaminant clades, proportion of each, etc.
It can be seen as one scenario you can benchmark your tools with, trying to reproduce the type of samples you are dealing with.
Create a `.yaml` file in the `yaml/instances` folder, for example `$LEMMI_ROOT/benchmark/yaml/instances/2021_9_PROK_GTDB.yaml`
The instance will have the name of the file, 2021_9_PROK_GTDB. The name is not constrained (although the use of `-` is not allowed), but our convention is (YEAR_MONTH)_(TARGET)_(TAXONOMY OF TARGET).
All fields are mandatory:
instance_seed: 1 # to redefine the pseudo-random results in python
negative: # define the number of negative samples by enumerating their suffixes. Follow this 3-digit notation
- '001'
- '002'
calibration: # same for calibration samples
- '001'
- '002'
evaluation: # same for evaluation samples
- '001'
- '002'
host_ratio: 0.7 # define the ratio of each type of organism, in terms of reads.
prok_ratio: 0.2
vir_ratio: 0
euk_ratio: 0.1
targets:
- prok # define what the targets are. Targets are what tools need to report
contaminants: # define what contaminants are. Contaminants are what tools need to discard / ignore in the report
- euk
targets_taxonomy: gtdb # define the taxonomy to use for each type of organism. Be careful to use meaningful combinations: gtdb only if prokaryotes alone, ncbi in any other case.
contaminants_taxonomy: ncbi
sd_lognormal_vir: 1 # to sample the coverage of each organism, a lognormal distribution is used. This allows you to define its standard deviation.
sd_lognormal_euk: 2.8
sd_lognormal_prok: 2
prok_nb: 300 # here you define the number of organisms of each type. If you do not want one type, e.g. viruses, pay attention to set it to 0 everywhere (vir_ratio, vir_nb) and to not mention it as target or contaminant.
euk_nb: 140
vir_nb: 0
unknown_organisms_ratio: 0.2 # to include organisms for which no reference genome is provided. Do not use this if you included tools that do not build the ref but use external (marker gene) references
core_simul: 0 # if 1, simulates evolution with CoreSimul before sampling the reads. For now experimental; we are unsure whether it simulates something relevant to real life, so we do not recommend using it, but it exists.
prok_taxa: # here you can specify organisms to include in the samples. The full list will be filtered before doing the random sampling. Be careful to keep enough genomes to match the number above, or you will get a lower number than specified. Use the NCBI naming, even for prokaryotes.
- Actinomyces israelii
- Actinomyces sp. oral taxon 897
- Actinomyces oris
- ...
- Candidatus Saccharibacteria oral taxon TM7x
- Fretibacterium sp. OH1220_COT-178
euk_taxa:
- Ascomycota
- Mucoromycota
vir_taxa:
tech: 'HS25' # this is the sequencing technology used by the art simulator
read_length: 150 # other params sent to art
total_reads_nb: 20000000
fragment_mean_size: 300
fragment_sd: 25
max_base_quality: 35 # last param sent to art
evaluation_taxlevel: # here define which ranks among the 7 main ones (domain-phylum-class-order-family-genus-species) are used to evaluate the predictions made by the tools
- genus
- species
calibration_fdr: 0.5 # the FDR used to define the tool-specific filtering threshold (when analyzing calibration samples), which is then applied to the evaluation samples. Write best_f1 instead of a number (0-1) if you do not want a specific FDR but rather to maximize the F1-score.
calibration_function: median # mean/median/max/min. The function used to set the threshold when multiple calibration samples are used
no_evaluation:
- my_tool # this can be added after a tool was run to remove it from the final evaluation
When you run the `lemmi_full` script, an instance declaration will trigger a sampling process that creates one `.yaml` file for each requested sample, describing its composition. For example `$LEMMI_ROOT/benchmark/yaml/instances/samples/2021_9_PROK_GTDB-e001.yaml`.
Each of these yaml files will then trigger a subpipeline that generates all the in-silico reads. They will be placed in `$LEMMI_ROOT/benchmark/instances/instance_name/sample_name/`
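To picture how the sd_lognormal_* settings and instance_seed interact during sampling, here is a sketch of seeded lognormal coverage sampling (illustrative; `sample_coverages` is not LEMMI's actual sampler):

```python
import math
import random

# Sketch: draw per-organism coverages from a lognormal distribution whose
# spread is controlled by the sd_lognormal_* setting, seeded for
# reproducibility (instance_seed), then normalize to relative abundances.
def sample_coverages(n_organisms, sd, seed):
    rng = random.Random(seed)
    raw = [math.exp(rng.gauss(0, sd)) for _ in range(n_organisms)]
    total = sum(raw)
    return [c / total for c in raw]   # relative abundances summing to 1

abundances = sample_coverages(n_organisms=300, sd=2.0, seed=1)
```

A larger sd produces a more skewed community (a few dominant organisms, a long tail of rare ones), while the fixed seed guarantees that regenerating the instance yields identical samples.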
WARNING
Once you have reached this point, the declaration of a benchmarking instance is completed. You can now run the pipeline; it will populate the repository if necessary (from Step 1) and create any benchmarking instance that was declared above.
You can also directly move to the next step of this guide and run it later.
# ii) Defining runs
Once LEMMI has generated all the samples for all the instances you wish to create, it will execute the runs. A run defines how to execute one tool on one LEMMI instance. Create a `.yaml` file as follows:
$LEMMI_ROOT/benchmark/yaml/runs/kraken2.2021_9_PROK_GTDB.yaml
container: kraken2
ref_size: all # ref size is best or all
instance_name: 2021_9_PROK_GTDB
tmp: kraken2_2021_9_PROK_GTDB
distribute_cores_to_n_tasks: 2 # if you give 8 cores, it will give 4 cores to 2 tasks in parallel
# Pay attention to the memory your tool is going to use, as two reference-building tasks may start in parallel. In doubt, set this option to 1
The candidate tool container has to be a Docker repository (e.g. `quay.io/ezlab/kraken2_lemmi`), a name corresponding to a local container (`kraken2`), or a sif file (in `$LEMMI_ROOT/benchmark/sif/`).
`ref_size` can be all or best. If your tool can process the reference efficiently with the memory available to run LEMMI, use all. If your tool cannot process all genomes, LEMMI will provide a subset of the reference genomes, one per organism, selected based on completeness criteria. If the tool still cannot process the reference, you should either reconsider whether it is a valuable solution for real analyses or set up a different LEMMI repository with less genomic content in order to benchmark it (see Step 1).
The `tmp` parameter is optional. If not specified, LEMMI will create a new subfolder in `$LEMMI_ROOT/benchmark/tmp/` named with the current timestamp (e.g. `1632498097`) every time it is triggered. This means that any temporary files necessary to the runs will not be available when restarting the pipeline. This ensures previous tmp files do not interfere with new runs, as tools may not use unique identifiers for tmp files.
By specifying the `tmp` option in a run, you force LEMMI to reuse the same tmp folder to run the candidate tool. It can be either a timestamp created by a previous run (e.g. `1632498097`) or a string you specified before the first run (e.g. `kraken2_2021_9_PROK_GTDB`). By doing this, you benefit from Snakemake's ability to restart a crashed process without rerunning the rules that were completed. For example, your tool may crash when processing the target reference after succeeding in processing the contaminant and host references. We recommend using this option to manually set a name for the tmp folder of the run.
TIP
Since the references for the host, contaminants, and targets can be very large files, LEMMI does not consider them as final results that need to be kept. They are placed in the tmp folder. This allows the user to delete them once the tool is evaluated and disk space needs to be recovered for future tasks.
LEMMI is designed in such a way that if the final results in `$LEMMI_ROOT/benchmark/analysis_outputs/` are all present, Snakemake will not be retriggered even if some rule inputs and outputs point to a nonexistent tmp folder. You can always force reruns by deleting the final outputs in `$LEMMI_ROOT/benchmark/analysis_outputs/`.
WARNING
Once you have reached this point, the declaration of a benchmarking run is completed. You can now run the pipeline; it will populate the repository if necessary (from Step 1), create any benchmarking instance that was declared above if necessary, and run any tool on any instance as defined in the last section. Pay attention that all components exist, in particular that container repositories are accessible or that `sif` files have been deposited in the correct folder. Finally, LEMMI will run the last subpipeline, which evaluates the results of any tool that produced results for any existing instance. The final results are json files that can be explored through the LEMMI web interface, see below.
TIP
A second command, `lemmi_analysis` (with the same parameters as `lemmi_full`), will only execute the two last subpipelines: running the tools and evaluating them. It will start faster (it may take seconds to minutes to determine there is nothing to be done in the first subpipelines) once your repository and instances are fully set up and all you want is to run additional tools.
TIP
If you wish to disable an instance or a run that is incomplete, or is causing problems, you can simply change the extension of its `.yaml` file. LEMMI will ignore files that do not end with `.yaml`
# disable instance
mv $LEMMI_ROOT/benchmark/yaml/instances/2021_9_PROK_GTDB.yaml $LEMMI_ROOT/benchmark/yaml/instances/2021_9_PROK_GTDB.yaml.disabled
# disable run
mv $LEMMI_ROOT/benchmark/yaml/runs/kraken2.2021_9_PROK_GTDB.yaml $LEMMI_ROOT/benchmark/yaml/runs/kraken2.2021_9_PROK_GTDB.yaml.disabled
# Step 3 | Exploring the results
Once the LEMMI pipeline has run to the end, all benchmarking results exist in the `$LEMMI_ROOT/benchmark/final_results/` folder.
To explore them, call
lemmi_web_docker
# or
lemmi_web_singularity
This will start a web server running locally on your machine, on port 8080 if available.
Use a web browser to navigate the results at http://127.0.0.1:8080/
TIP
If you are running LEMMI on a Singularity engine, you will likely need a privileged (root, sudo) account to run a web server like this.
You can edit the file `$LEMMI_ROOT/benchmark/final_results/structure.json` to declare and order instances, samples, and widgets (each displaying a metric or a feature) as follows. A list of these widgets will be documented soon.
This is what you can see on our public LEMMI v2 instance at https://lemmi.ezlab.org/2021_9_VIR_NCBI
{
"2022_01_PROK_GTDB": {
"2022_01_PROK_GTDB": [
"description",
"filtering_threshold",
"selector",
"f1",
"precision",
"recall",
"tp",
"fp",
"memory",
"runtime"
],
"2022_01_PROK_GTDB-c001": [
"prc",
"auprc",
"predictions:full"
],
"2022_01_PROK_GTDB-e001": [
"f1",
"precision",
"recall",
"tp",
"fp",
"l2",
"predictions:full"
]
}
}
# Make your tool compatible with LEMMI: a tutorial with Kraken2
WARNING
We will be glad to help. Create an issue or contact us if necessary.
To be compatible with LEMMI, your tool needs to be wrapped in a Docker or Singularity container. This container needs to be able to complete specific tasks in bash scripts that the LEMMI pipeline will trigger, namely `LEMMI_process_ref_host.sh`, `LEMMI_process_ref_contaminants.sh`, `LEMMI_process_ref_targets.sh`, and `LEMMI_analysis.sh`. If you already chose to distribute your tool as a container, it is simple to add these LEMMI-specific scripts so it can be benchmarked directly using your current distribution. You can also create a container specifically for benchmarking.
In this tutorial, we detail the creation of the Kraken2 container that we used to benchmark this tool. The container is available at https://quay.io/repository/ezlab/kraken_212_lemmi?tab=info. Note that we use Docker to create containers and suggest converting them to Singularity using docker2singularity. Building a Singularity container directly would work as well.
The source of the container can be found here: https://gitlab.com/ezlab/lemmi-v2/-/tree/main/resources/source_container_kraken2. We suggest using it as a template to start your own scripts, as it helps follow the steps and find the paths where to read and write all files provided and expected by LEMMI.
# Dockerfile
- Choose a base image to install the tool, or the image that already wraps the tool.
Dockerfile#L1
If not present, you need to install `bc` and `ps`, which are required for the benchmark to run properly. Then install every component that your tool depends on.
Dockerfile#L5
- Finally install your tool. Use a specific tag or commit to ensure reproducibility. Here we install both Kraken2 and Bracken2
Dockerfile#L18
- We are going to use post-processing scripts to convert the read abundance into a more meaningful abundance: the number of organisms (proxied in LEMMI by genome copies, with the assembly size accepted as the genome size). These scripts are written in Python and use pandas. It is up to you to decide whether you need them with your tool or whether you can use a different method to report the abundance of organisms. In the former case, you need to install pandas.
Dockerfile#L53
- Finally, we add all `.sh` and `.py` scripts that will be described below. Note that they should be made executable with `chmod +x filename.ext`
Dockerfile#L56
# LEMMI_process_ref_host.sh
When called, this script will execute the procedure needed by your tool to prepare a reference based on the fasta genome of a host organism, in order to filter it out later. The LEMMI pipeline will call it as follows:
LEMMI_process_ref_host.sh {input.host_genome} {output.ref_host}
The first parameter is the path to the input genome, and the second parameter is the folder where the processed reference has to be written. No specific format is expected; anything can be placed in this folder, which will be accessible to your tool later in the benchmarking process when running the LEMMI_analysis script. The host is processed once no matter how many samples are analysed. It is therefore important to do this step here and not as part of the analysis script, since doing it during the LEMMI_analysis step would dramatically increase the runtime.
WARNING
For now the taxid of the host is not provided, therefore it has to be hard-coded when the container is built.
In the case of Kraken, the taxid has to be introduced in the fasta file. This is done by the script `edit_host.py` before calling `kraken2-build`. Taxonomic files are given by LEMMI during the process. See the complete source file.
If your tool cannot use a host genome as a reference to filter out reads, you can skip this step by including this in `LEMMI_process_ref_host.sh`:
#!/usr/bin/env bash
set -o xtrace
set -e
mkdir -p $2
sleep 30
The same is true for the two following scripts.
# LEMMI_process_ref_contaminants.sh
When called, this script will execute the procedure needed by your tool to prepare a reference based on multiple fasta genomes representing contaminants (i.e. not of interest), in order to filter them out later. The LEMMI pipeline will call it as follows:
LEMMI_process_ref_contaminants.sh {input.contaminants_ref} {output.ref_contaminants} {params.contaminants_taxonomy}
The first parameter is a tsv file with the list of genomes and relevant metadata, as follows:
# LEMMI content selected from NCBI+GTDB data
ncbi_genbank_assembly_accession taxid __lineage gtdb_taxonomy __gtdb_taxid __gtdb_accession __size
GCA_000371885.1 1068978 root;cellular organisms;Bacteria;Terrabacteria group;Actinobacteria;Actinomycetia;Pseudonocardiales;Pseudonocardiaceae;Amycolatopsis;Amycolatopsis methanolica group;Amycolatopsis methanolica;Amycolatopsis methanolica 239 7196860
GCA_002814635.1 10295 root;Viruses;Duplodnaviria;Heunggongvirae;Peploviricota;Herviviricetes;Herpesvirales;Herpesviridae;Alphaherpesvirinae;Simplexvirus;Bovine alphaherpesvirus 2 9388
The second parameter is the output folder where the built reference will be placed, under the same conditions as described above for the host.
The third parameter indicates whether the taxonomy used for contaminants is `ncbi` or `gtdb`, so the tool can process them accordingly. However, LEMMI provides an NCBI-like taxonomy using dmp files for GTDB (see Methods below), so it is just a matter of selecting either the real NCBI taxonomy id or the generated GTDB id in that file.
See the source files (opens new window) to see how Kraken2 processes these references.
# LEMMI_process_ref_targets.sh
When called, this script will execute the procedure needed by your tool to prepare a reference based on multiple fasta genomes representing targets (i.e. of interest), in order to classify them later. The LEMMI pipeline will call it as follows:
LEMMI_process_ref_targets.sh {input.targets_ref} {output.ref_targets} {params.targets_taxonomy}
It is similar to the script for processing contaminants.
This is where we introduce Bracken, as it will be used on target organisms only, not when filtering out host and contaminants.
See the source files (opens new window) to see how Kraken2/Bracken processes these references.
# LEMMI_analysis.sh
When called, this script will execute the procedure needed by your tool to analyse a paired-end short-read sample, reusing the references previously built to 1) filter out host reads, 2) filter out contaminant reads, and 3) classify the target organisms. The expected output is as follows:
#LEMMI_ncbi
organism taxid reads abundance
Bacteria 2 153333 0.03059380281871103
Archaea 2157 147 5.3815508436829394e-05
Bacteroidetes 976 4071 0.000871182017875575
Chlorobi 1090 2 5.664923606140522e-07
Cyanobacteria 1117 279 5.134202471553331e-05
Proteobacteria 1224 24734 0.0042844809433214845
Firmicutes 1239 10731 0.0025405580944023102
Crenarchaeota 28889 2 7.685194508220337e-07
...
Abundance needs to be provided as relative genome/marker copies, not reads.
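The conversion itself is simple: divide each taxon's read count by its assembly size (LEMMI's proxy for genome size) and renormalize. A sketch with made-up numbers, not the actual post-processing scripts linked above:

```python
# Hypothetical read counts and assembly sizes (bp) for two taxa;
# LEMMI accepts the assembly size as a proxy for the genome size.
reads = {"Bacteria_A": 150_000, "Bacteria_B": 50_000}
sizes = {"Bacteria_A": 5_000_000, "Bacteria_B": 2_500_000}

# Genome copies per taxon, then relative abundance over all taxa.
copies = {t: reads[t] / sizes[t] for t in reads}
total = sum(copies.values())
abundance = {t: c / total for t, c in copies.items()}
# Bacteria_A has 0.03 copies vs 0.02 for Bacteria_B, i.e. 0.6 vs 0.4
# in relative genome copies, even though it has 3x the reads.
```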
See the source files (opens new window) to see how we make Kraken2 filter out host and contaminant reads and normalize its output into the expected values.
# Build your image
docker build . -t kraken2
It is now ready to be debugged/used with LEMMI standalone.
# Material and methods
The benchmark is based on publicly available assemblies from NCBI. The results currently available online are based on a repository last updated on January 7th, 2022.
ART-MountRainier-2016-06-05 (opens new window) is used to generate in-silico reads through the use of https://github.com/wanyuac/readSimulator.git (opens new window)
To convert the GTDB taxonomy into NCBI-like files, we use https://github.com/nick-youngblut/gtdb_to_taxdump (opens new window)
The ete3 (opens new window) Python package is used to manipulate the NCBI taxonomy. The dmp files defined in the config are used to populate the database.
LEMMI includes https://github.com/lbobay/CoreSimul.git (opens new window) to simulate evolution on genomes before generating reads. This was not used in the instances currently available.
All containers were run with Singularity version 3.8.1-1.el8 on a dedicated server with 48 CPUs, 500 GB of RAM, and 250 TB of disk.