LEMMI demo tutorial

---

In this short tutorial, you will learn how to set up the standalone version of LEMMI v2. You will generate a simple configuration to create a repository, and a scenario for a fast benchmarking run.

You will first run Kraken (version 2, combined with Bracken) and then update the benchmark with Centrifuge using existing compatible containers. You will then explore the results with the webapp locally on your machine.

To successfully run this tutorial, you will need:

  • The Docker or Singularity engine installed and running
  • A working version of Conda

The estimated run time is about half an hour if you can dedicate 4 CPUs to the tasks and have a decent network connection.
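
You can quickly verify these prerequisites from the command line:

docker --version # or: singularity --version
conda --version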

# LEMMI code: clone the sources and set ENV variables

git clone https://gitlab.com/ezlab/lemmi-v2.git
cd lemmi-v2
git checkout v2.1
export LEMMI_ROOT=/your/path/lemmi-v2
export PATH=${LEMMI_ROOT}/workflow/scripts:$PATH
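
If you use bash and want these variables to persist across sessions, you can optionally append them to your shell profile (a convenience, not required by LEMMI):

echo 'export LEMMI_ROOT=/your/path/lemmi-v2' >> ~/.bashrc
echo 'export PATH=${LEMMI_ROOT}/workflow/scripts:$PATH' >> ~/.bashrc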

# Install dependencies in a mamba environment

conda install -n base -c conda-forge mamba
mamba env update -n lemmi --file ${LEMMI_ROOT}/workflow/envs/lemmi.yaml
conda activate lemmi

# to deactivate or remove if necessary
conda deactivate
conda remove --name lemmi --all

# Container engines

No specific version is required for Docker; LEMMI was developed and tested with Docker version 20.10.5, build 55c4c88.

For Singularity, version 3+ is required. LEMMI was tested on our HPC environment with the following modules loaded: GCCcore/8.2.0 Python/3.7.2 Singularity/3.4.0-Go-1.12

LEMMI is based on Snakemake and is run by calling:

lemmi_{task} --cores 8 # running locally on Docker
lemmi_{task} --cores 8 --use-singularity # running locally on Singularity
lemmi_{task} --use-singularity --profile cluster --jobs 8 # running on Singularity on a cluster using profiles

TIP

You can pass all standard Snakemake parameters, such as --dry-run or --unlock, to the lemmi_{task} command.
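
For example, to preview the jobs without executing anything (a dry run, assuming the lemmi environment is active):

lemmi_full --cores 4 --dry-run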

# Main config file

Create it by doing:

cd ${LEMMI_ROOT}/config && cp config.yaml.default config.yaml

The config file sets all parameters that are global and apply to the whole LEMMI pipeline.

In brief, this file defines what will be used to create the genome repository with its taxonomic information. It defines the diversity that exists in the LEMMI benchmarking world: the pool from which all datasets are created and all available reference genomes are drawn.

ncbi_dmp: https://ftp.ncbi.nih.gov/pub/taxonomy/taxdump_archive/taxdmp_2023-01-01.zip
gtdb_metadata:
    - https://data.ace.uq.edu.au/public/gtdb/data/releases/release202/202.0/bac120_metadata_r202.tar.gz
    - https://data.ace.uq.edu.au/public/gtdb/data/releases/release202/202.0/ar122_metadata_r202.tar.gz
prok_clade: # this limits the available prokaryotes; if left empty, everything in gtdb will be part of the repo
    - s__Acetobacter aceti
    - s__Acetobacter peroxydans
    - s__Acetobacter oryzoeni
prokncbi_clade: # this limits the available prokaryotes if using the prokncbi_clade mode. Not used in this demo
    - Acetobacter aceti
    - Acetobacter peroxydans
    - Acetobacter oryzoeni
genbank_assembly: https://ftp.ncbi.nlm.nih.gov/genomes/ASSEMBLY_REPORTS/assembly_summary_genbank.txt|20230101
# !!! you will see that in the actual demo file, a smaller version of assembly_summary_genbank.txt is present,
# to save time during the demo. If you use LEMMI for real, don't forget to restore this line.
# the date at the end of the line guarantees reproducibility by filtering out more recent entries, as the file itself is not versioned
euk_clade: # this limits the available eukaryotes
    - Suillus placidus
vir_clade: # this limits the available viruses
    - Human alphaherpesvirus 1
host_genome: https://hgdownload.cse.ucsc.edu/goldenpath/hg19/chromosomes/chrM.fa.gz # the mitochondrial genome to have a very fast run
host_taxid: 9606
max_genomes_per_organism: 2 # limit the number of representatives to the minimum: one to go in the reads, one in the ref.
lemmi_master: quay.io/ezlab/lemmi_master:v2.1_cv1
group_download_jobs: 8
lemmi_seed: 1
docker_memory: 10g
no_sum_tools: []
strict_container_version: 1

# If using Singularity, export all containers as .sif files

Singularity requires all containers (lemmi_master and candidate tools) to be exported as .sif files and placed in the ${LEMMI_ROOT}/benchmark/sif/ folder.

The name of the file is the name of the equivalent Docker container without the repository. E.g. in the main config file, lemmi_master: quay.io/ezlab/lemmi_master:v2.1_cv1 would point to ${LEMMI_ROOT}/benchmark/sif/lemmi_master:v2.1_cv1.sif. You can keep the full Docker path in the config as long as the .sif file exists.

To convert a Docker image to a Singularity image, you can use the docker2singularity container as follows:

docker run -v /var/run/docker.sock:/var/run/docker.sock -v $(pwd):/output --privileged -t --rm quay.io/singularity/docker2singularity quay.io/ezlab/lemmi_master:v2.1_cv1

You will obtain a .sif file that you can rename as needed and place in the ${LEMMI_ROOT}/benchmark/sif/ folder.

You can do the same for quay.io/ezlab/kraken_212_lemmi:v2.1_cv1 and quay.io/ezlab/centrifuge_104_lemmi:v2.1_cv1
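
For example, to convert the Kraken container and place it under the expected name (the exact output filename produced by docker2singularity varies, so the wildcard below is an assumption; adjust it to the file you obtain):

docker run -v /var/run/docker.sock:/var/run/docker.sock -v $(pwd):/output --privileged -t --rm quay.io/singularity/docker2singularity quay.io/ezlab/kraken_212_lemmi:v2.1_cv1
mv *kraken_212_lemmi*.sif "${LEMMI_ROOT}/benchmark/sif/kraken_212_lemmi:v2.1_cv1.sif"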

If you don't have Docker available to do the export, you can also download the .sif files here: lemmi_master, kraken_212, centrifuge_104

# Defining a LEMMI instance

A LEMMI instance will generate different datasets from the defined mock microbial community. The one available for the demo is

instance_seed: 1
negative:
    - '001' # this means there is one negative sample
calibration:
    - '001' # one sample for calibration of the filtering threshold
evaluation: 
    - '001' # one sample for the actual benchmarking
host_ratio: 0.01
prok_ratio: 0.79
vir_ratio: 0.1
euk_ratio: 0.1
targets:
    - prok
contaminants:
    - euk
    - vir
use_prokncbi: 0
targets_taxonomy: gtdb # we will run and evaluate according to the GTDB taxonomy
contaminants_taxonomy: ncbi
sd_lognormal_vir: 1
sd_lognormal_euk: 1
sd_lognormal_prok: 3
prok_nb: 2
euk_nb: 1
vir_nb: 1
unknown_organisms_ratio: 0.3
prok_taxa: # 2/3 of the existing organisms are target organisms. One is left as a potential false positive
    - Acetobacter peroxydans
    - Acetobacter oryzoeni
euk_taxa: # It will be randomly selected among what is available in the repo
vir_taxa: # It will be randomly selected among what is available in the repo
core_simul: 0
tech: 'HS25'
read_length: 150
total_reads_nb: 500000 # a low number of reads to make it fast for a demo
fragment_mean_size: 300
fragment_sd: 25
max_base_quality: 35
evaluation_taxlevel:
    - family
    - genus
    - species
calibration_fdr: best_f1
calibration_function: max
no_evaluation: []
rank_to_sort_by: species

# Defining runs

Once LEMMI has generated all the samples for all the instances you wish to create, it will execute the runs. A run defines how to execute one tool for one LEMMI instance. See the existing one.

${LEMMI_ROOT}/benchmark/yaml/runs/

container: quay.io/ezlab/kraken_212_lemmi:v2.1_cv1
ref_size: all # ref size is best or all. 
# best limits the reference to one genome per species, for tools requiring a lot of memory, while all takes all representatives.
# With the limited number of genomes in this demo, it does not matter.
# In a real setting, Kraken2 would process all with 500 GB of RAM; Centrifuge would not and would require best.
instance_name: demo_prok_gtdb
tmp: kraken2_tmp
distribute_cores_to_n_tasks: 1 # e.g. with a value of 2 and 8 cores, it would give 4 cores to 2 tasks in parallel. Don't bother changing this here
pe: 1
params_analysis:

# Let's run it

lemmi_full --cores 4

This will execute all the tasks: set up the repository (download and process genomes and taxonomy), create the demo instance, run Kraken on it, and produce the evaluation.

# Explore the results: files

You can see all files generated for the instance, i.e. fastq and descriptions, in ${LEMMI_ROOT}/benchmark/instances/demo_prok_gtdb/

You can see the predictions made by Kraken in ${LEMMI_ROOT}/benchmark/analysis_outputs/
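
For a quick look at what was produced, you can simply list these folders:

ls ${LEMMI_ROOT}/benchmark/instances/demo_prok_gtdb/
ls ${LEMMI_ROOT}/benchmark/analysis_outputs/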

# Explore the results: web

Once the LEMMI pipeline has run to the end, all benchmarking results exist in the ${LEMMI_ROOT}/benchmark/final_results/ folder.

To explore them in the webapp, call

lemmi_web_docker
# or
lemmi_web_singularity

This will start a web server running locally on your machine, on port 8080 if available.
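
To check from the command line that the server responds (assuming curl is installed):

curl -I http://127.0.0.1:8080/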

Use a web browser to navigate the results on http://127.0.0.1:8080/demo_prok_gtdb

TIP

If you are running LEMMI on a Singularity engine, you are likely to need a privileged (root, sudo) account to run a web server like this.

You will see that Kraken has an F1-score of 1: it has identified all target species without producing a false positive.

# Let's run a second tool, Centrifuge.

cd ${LEMMI_ROOT}/benchmark/yaml/runs/
cp kraken_212.demo.yaml centrifuge.demo.yaml

Edit the new file and replace kraken_212 with centrifuge_104. Pick a different name for the tmp folder as well. A container for centrifuge_104 is available on our quay.io repository; it will be pulled automatically by your Docker engine, or, if you use Singularity, you should have placed it as a .sif file in the appropriate folder as described above.
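
For example, with GNU sed (centrifuge_tmp is an arbitrary choice for the tmp folder name):

sed -i 's/kraken_212/centrifuge_104/g' centrifuge.demo.yaml
sed -i 's/kraken2_tmp/centrifuge_tmp/' centrifuge.demo.yaml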

lemmi_analysis --cores 4 # this time, call only the analysis task. Repository and instance have not changed.

# Let's go back to the results

lemmi_web_singularity

http://127.0.0.1:8080/

It seems that Centrifuge did as well as Kraken.

Let's extend the interface by adding new widgets to display additional metrics.

Replace the file ${LEMMI_ROOT}/benchmark/final_results/structure.json

with the following one:

{
  "demo_prok_gtdb": {
    "demo_prok_gtdb": [
      "f1",
      "precision",
      "recall",
      "tp",
      "fp",
      "l2_rank",
      "filtering_threshold",
      "memory",
      "runtime"
    ],
    "demo_prok_gtdb-e001": [
      "l2",
      "predictions:full"
    ]
  }
}

And create ${LEMMI_ROOT}/benchmark/final_results/demo_prok_gtdb.tools.json

[
    "kraken_212",
    "centrifuge_104"
]
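
You can sanity-check both JSON files before reloading the webapp (assuming Python is available, e.g. from the lemmi environment):

python -m json.tool ${LEMMI_ROOT}/benchmark/final_results/structure.json
python -m json.tool ${LEMMI_ROOT}/benchmark/final_results/demo_prok_gtdb.tools.json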

You can now see extra metrics. The first section is "Average across all evaluation samples". As you may have noted, this little demo has only one evaluation sample in its instance declaration, so this section shows the metrics for that single sample.

The second section is the detail of the sample: you can see the L2 distance to the true abundance and explore the details of the predictions made by the tools, with sorting and filtering. Overall, the scenario is simple: both Centrifuge and Kraken predicted the two correct species, plus a few reads assigned to Acetobacter aceti that were ignored thanks to the filtering threshold defined on the calibration sample. In the end, Centrifuge saves memory but is slower than Kraken.

# Conclusion

With this short demo, you have set up and run the LEMMI standalone app. As a tool user, you can use it to generate instances representing samples that you are interested in and run all available containers (https://quay.io/user/ezlab) on them.

As a tool developer, you can write the "LEMMI layer" for your method, that is, all the scripts necessary to complete the LEMMI tasks, and wrap them in a container. This will allow us to evaluate the tool on the public LEMMI at https://lemmi.ezlab.org, and others to use it with the standalone LEMMI app.

More info is available at https://www.ezlab.org/lemmi-v2-documentation.html, and we are glad to help: https://lemmi.ezlab.org/submission