OrthoDB
user guide
# Terminology
TIP
# Orthologs
Orthologs are genes in different species that evolved from a common ancestral gene by speciation. If one or both of these genes were duplicated after the speciation they are all termed co-orthologs, or just orthologs.
TIP
# Orthologous group / level-of-orthology
When more than two species are considered, there is more than one speciation event, and we refer to all descendants of a particular single gene of the last common ancestor of these species as orthologs, or an orthologous group. Thus our operational definition refers to a specific radiation in the phylogeny of a set of species, termed the level-of-orthology.
TIP
# Ortholog functions
It is a reasonable hypothesis that orthologs retain the functions of their ancestral gene ("by tradition"), though there are examples of gene function gains and losses. A statement of gene orthology, however, refers to the evolutionary relation of the genes, not to their retained or altered functions.
TIP
# Paralogs
Paralogs are genes that evolved by duplication inside a genome. The notions of orthologs and paralogs are not mutually exclusive: paralogs are co-orthologs if they duplicated after the speciation, and are not if they duplicated earlier.
# Standalone OrthoLoger software
OrthoLoger, the standalone OrthoDB pipeline for delineation of orthologs, is freely available here.
# Text Search
OrthoDB can be queried using a gene name, identifier, annotation keywords, etc. We indexed many relevant identifiers of proteins and genes, including UniProtKB, Ensembl, InterPro, KEGG, GenBank, RefSeq, etc.
TIP
To get a gene-centric view with the available annotations and a list of pairwise orthologs, switch from Text to Gene on the left of the search input.
TIP
To query specifically for a numeric NCBI gene ID, switch from Text to NCBI ID on the left of the search input.
TIP
To query for EC numbers - use double quotes, e.g. "3.1.1.-".
# Text query format
- Use double quotation marks to match a phrase, e.g. "Cytochrome P450"
- Take advantage of the autocomplete lookup feature
- Logical operator NOT: use '-' or '!', e.g. kinase !tyrosine
- Logical operator OR: use '|', e.g. protease | peptidase
- Logical operator AND is implicit, e.g. sodium transporter actually means sodium AND transporter (if not quoted)
# Sequence Search
OrthoDB can be queried by homology to a protein sequence: switch from Text to Sequence on the left of the search input and paste the query protein sequence without a header line.
# Advanced options
# Phyloprofile
The result list of Orthologous Groups can be filtered for
- universality, i.e. having member genes in all species of the selected taxonomic node, or a fraction of them, e.g. present in all, in >90% or >80% of the species.
- gene copy-number (duplicability), requiring them to have only single-copy orthologs in all species of the selected taxonomic node, or a fraction of them, e.g. single-copy in all, in >90% or >80% of the species.
You can combine any presence filter with any copy-number filter to refine your results, e.g. present in >90% AND single-copy in >80% of species.
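How the two filters combine can be sketched as fractions computed over per-species gene counts. This is only an illustration: the function names and the input layout (a dictionary from species taxid to copy number) are assumptions, not OrthoDB's implementation, and the thresholds are applied here as "at least", whereas the interface labels them >90% and >80%.

```python
def phyloprofile_fractions(copies_per_species):
    """Return (fraction present, fraction single-copy) over all species.

    copies_per_species maps a species taxid to the number of member genes
    that species has in one orthologous group (hypothetical input layout).
    """
    total = len(copies_per_species)
    if total == 0:
        raise ValueError("no species given")
    present = sum(1 for n in copies_per_species.values() if n >= 1)
    single = sum(1 for n in copies_per_species.values() if n == 1)
    return present / total, single / total


def passes_filters(copies_per_species, universal=None, singlecopy=None):
    """Apply an optional presence threshold AND an optional single-copy threshold."""
    present, single = phyloprofile_fractions(copies_per_species)
    if universal is not None and present < universal:
        return False
    if singlecopy is not None and single < singlecopy:
        return False
    return True


# Made-up example: 10 species, one with a duplication and one with a loss.
counts = {1: 1, 2: 1, 3: 1, 4: 1, 5: 1, 6: 1, 7: 1, 8: 1, 9: 2, 10: 0}
keep = passes_filters(counts, universal=0.9, singlecopy=0.8)
```

In this made-up profile the group is present in 9 of 10 species and single-copy in 8 of 10, so it passes the combined filter but would fail a "present in all" filter.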
# Select species
You can tailor your search by using the expandable species tree to select a radiation point or particular sets of species.
- Expand or collapse any node on the tree by clicking on the filled arrows or node names.
- Select all species at a node by clicking on the unfilled box next to the node name, or
- select specific species by clicking on the unfilled box next to the species name.
You may also add species to the list of selected species to display by typing the species name in the search box and selecting from the autocompleted options.
As you add or remove species from the expandable species tree, the Species to display box above it will automatically update to reflect your selections.
# Search at (level-of-orthology)
OrthoDB Orthologous Groups are hierarchical, being delineated at the major radiations along the species phylogeny. This makes it possible to resolve orthologs at a particular level-of-orthology: considering many distantly-related species delineates fewer, more general (inclusive) orthologous groups containing all the descendants of the ancestral gene, while examining only sets of more closely-related species produces many fine-grained orthologous groups of mostly one-to-one relations.
TIP
The level-of-orthology can be adjusted after the species or clades of interest have been selected (see Select species).
# Species to display
By default, only genes from model species are shown in detail for the returned Orthologous Groups in the Orthologs by organism section of the results.
This can be changed to a custom set of species using Species to display.
# Results
WARNING
Results of an OrthoDB query are shown as a list of relevant Orthologous Groups that are in a condensed view and require clicking on them to expand into a detailed view.
Each detailed record of an Orthologous Group has the following sections:
# Functional descriptions
OrthoDB provides tentative functional annotations of groups of orthologs and mapping to functional categories by summarizing functional gene annotations, extensively collected from other public resources. Annotation of genes is complicated and contains errors. Although in many cases OrthoDB makes such errors in the underlying data apparent, discordant annotations should be considered with caution.
# Evolutionary descriptions
The evolutionary annotations of the orthologs remain a distinguishing feature of OrthoDB.
# Phyletic Profile
is a summary of the ortholog presence (from universal to species-specific) and copy-numbers (single/multi-copy counts).
# Evolutionary Rate
is a measure of whether this Orthologous Group exhibits appreciably higher or lower levels of sequence divergence, derived from quantification of the relative divergence among its member genes. It is computed for each orthologous group as the average of inter-species identities normalized to the average identity of all inter-species best reciprocal hits, computed from pairwise alignments of protein sequences. The relative rate is indicated by the position of the black star along the scale of slow-blue to fast-red rates.
# Gene Architecture
shows median and standard deviation values of protein lengths and exon counts for each orthologous group, effectively describing a 'consensus' gene architecture (for those genes with available data).
# Orthologs by organism
WARNING
This section can be very long. Use the navigation arrows on the left to go to the beginning or the end of the record, or the cross to collapse the detailed view to the condensed view.
The condensed view for each gene includes gene/protein ID, UniProt ID, short description, number of amino acids (AAs), number of exons, and associated InterPro domains.
TIP
For the length (AAs) and exon counts (Exons) listed for each gene, the exclamation mark (!) indicates differences from consensus (left: shorter, right: longer, !: 1 stdev, !!: 2 stdev).
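The marker convention can be sketched in a few lines. This is a guess at the rule rather than OrthoDB's code: it assumes that "1 stdev" and "2 stdev" mean "at least" that many standard deviations away from the group median, and the function name and sample lengths are made up for illustration.

```python
import statistics


def flag(value, median, stdev):
    """Format a value with '!' marks: left if below consensus, right if above.

    Assumption: one mark for a deviation of at least 1 stdev, two marks for
    at least 2 stdev; values within 1 stdev get no mark.
    """
    if stdev <= 0:
        return str(value)
    marks = "!" * min(2, int(abs(value - median) // stdev))
    if value < median:
        return marks + str(value)
    return str(value) + marks


# Made-up protein lengths for one orthologous group.
lengths = [380, 400, 400, 420, 300]
consensus = statistics.median(lengths)
spread = statistics.stdev(lengths)
labels = [flag(n, consensus, spread) for n in lengths]
```

Whether the site uses the sample or the population standard deviation is not stated in this guide; `statistics.stdev` (sample) is used here only as a placeholder.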
# Double-arrow icon
expands the view, when clicked, to show the annotations available for a given gene, with links to source databases.
Available InterPro domain annotations are displayed for each protein member, ordered from the N to the C terminus. Click on the grey magnifying glass icon to query OrthoDB for groups containing proteins with the same domains. To search for specific domain architectures, enter an ordered list of InterPro identifiers separated only by commas into the 'Text Search' field.
TIP
# Get All Fasta / View Fasta
retrieves the corresponding protein sequences in Fasta format. Group ID, gene, organism, and other useful details are contained in the header of each sequence. This information can be saved as a file by right-clicking on the link followed by "save link as...".
TIP
# Get All as Tab Delimited / View Tab Delimited
retrieves the corresponding ortholog information as tab delimited text. This information can be saved as a file by right-clicking on the link followed by "save link as...".
WARNING
Note that retrieving data in Fasta/Tab format is limited to a maximum of 5000 groups.
# Sibling Groups
Related orthologous groups at the same level-of-orthology are defined according to their common InterPro domain annotations. The top 5 groups are listed with their percentage overlap in terms of common InterPro domains, and the complete list of related groups may be retrieved by clicking the 'Show all siblings' link.
# Uploading and analyzing your own sequences
# Register
In order to be able to upload sequences for a custom analysis, you need to register:
- Click on the "Register" link on the top right part of the OrthoDB webpage.
- Enter your login details in the form that appears.
# Data upload
Upload a fasta file with the sequences to be analyzed
After logging in, you can upload your sequences using the "Own data mapping" link (next to "Help").
After clicking on "Own data mapping", click on the "Upload" button and select your fasta-formatted file. Be aware that the file should contain amino acid sequences only.
After uploading is finished you will have to enter a species name in the corresponding field.
The next step is to select where to map your sequences. This level of orthology can be selected either automatically or manually.
If 'Auto' is selected, BUSCO is used to find an appropriate level.
In 'manual' mode, click on the 'Advanced' button next to the search button and select a clade from the tree.
Click on Advanced again to return.
Click on "Run analysis" to add your job to the mapping queue. When the job starts, the status should change from "CREATED" to something else depending on your setup.
When mapping has finished without error, the status will change to "DONE".
If there is some kind of error, the status will be "ERROR" or something more informative.
In particular, when BUSCO has been used to determine where to map, the error may be any of the following:
Message | Description |
---|---|
ERROR | server side error |
ERROR_BUSCO:io | server side error |
ERROR_BUSCO:bad_setup | server side error |
ERROR_BUSCO:no_result | BUSCO failed |
POOR_BUSCO:AT<score> | BUSCO successful but score is too low |
If either of the last two errors is reported, rerun the analysis using a manually selected orthology level. If it still fails (or on any other error), contact OrthoDB support at support[at]orthodb.org.
# Retrieve results
Download the results in a plain text file
Click on the "Download" button to get the mapping results. The name of this file contains all the mapping information:
- node_XXX: where "XXX" is the NCBI taxon ID of the mapping node.
- subnode_AAA_BBB_CCC_DDD_EEE: where AAA, BBB, CCC, DDD, and EEE are the NCBI taxon IDs for the selected species.
- taxid_YYY: where YYY is a temporary taxon ID for your species.
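The three parts of the file name can be pulled apart with regular expressions. A minimal sketch, assuming that all IDs are written as plain digits and that the parts appear somewhere in the name; the exact layout of the name, the separators between the parts, and the example file name below are assumptions, not taken from OrthoDB.

```python
import re


def parse_mapping_filename(name):
    """Extract the node, subnode and taxid parts from a mapping file name."""
    # The lookbehind keeps 'node_' from matching inside 'subnode_'.
    node = re.search(r"(?<![a-z])node_(\d+)", name)
    subnode = re.search(r"subnode_(\d+(?:_\d+)*)", name)
    taxid = re.search(r"taxid_(\d+)", name)
    return {
        "node": int(node.group(1)) if node else None,
        "subnode": [int(t) for t in subnode.group(1).split("_")] if subnode else [],
        "taxid": int(taxid.group(1)) if taxid else None,
    }


# Hypothetical file name, for illustration only.
info = parse_mapping_filename("node_33208_subnode_9606_10090_taxid_1000001.txt")
```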
The mapping file contains 9 fields:
- Ortholog group name
- Gene name
- Ortholog type; for mapped sequences this field is a number >=10 and <20.
- Length of the matching region (in amino acids).
- Start coordinate of the match.
- End coordinate of the match.
- Score of the match.
- Normalized score of the match.
- E-value of the match.
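The nine fields can be read into a typed record as sketched below. The guide only says the results are a plain text file, so the tab delimiter, the absence of a header line, the field names and the sample line are all assumptions made for illustration.

```python
from typing import NamedTuple


class MappingHit(NamedTuple):
    group: str               # ortholog group name
    gene: str                # gene name
    ortholog_type: int       # >=10 and <20 for mapped sequences
    match_length: int        # length of the matching region (amino acids)
    start: int               # start coordinate of the match
    end: int                 # end coordinate of the match
    score: float             # score of the match
    normalized_score: float  # normalized score of the match
    evalue: float            # E-value of the match


def parse_mapping_line(line):
    """Parse one line of the mapping file (assumed to be tab-delimited)."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) != 9:
        raise ValueError(f"expected 9 fields, got {len(fields)}")
    return MappingHit(
        fields[0], fields[1],
        int(fields[2]), int(fields[3]), int(fields[4]), int(fields[5]),
        float(fields[6]), float(fields[7]), float(fields[8]),
    )


def is_mapped(hit):
    """True if the ortholog type marks a mapped sequence."""
    return 10 <= hit.ortholog_type < 20


# Made-up line, not real OrthoDB output.
hit = parse_mapping_line("1627at33208\tgene1\t10\t250\t5\t254\t312.5\t0.87\t1e-90\n")
```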
# Comparative Charts
This OrthoDB online tool generates a comparative overview of the gene content across selected genomes. The total gene counts and the fractions of orthologs among these species show the level of relatedness among the genomes, highlighting the "universal" core of genes and the ones evolving under single-copy constraint [PMID:21148284].
You can select up to 20 species on the right panel to be included in the comparative genomics chart. The colors, patterns, etc. can be customised from the "Configure chart" tab on the right panel. The fractions shown are hyperlinked to the corresponding Ortholog Groups from which the gene counts were made. The tailored chart can then be exported as publication-quality vector graphics.
Explore an example
# Bookmarking
Search results can be saved by simply bookmarking the result page or saving the URL text.
You can also drag & drop the bookmarklet link under Bookmark OrthoDB, at the right side under the search field, to the browser toolbar for an easy OrthoDB search next time with the same settings.
You can later just highlight a keyword somewhere on a web page and click on the saved bookmarklet to search OrthoDB for this keyword.
# API
The OrthoDB data can be programmatically accessed using a URL-based interface. In our implementation this means that the data can be retrieved using the following:
# URL
https://data.orthodb.org/current/CMD?ARG1="value"&ARG2="value"&...
where CMD is a command and all ARGx are arguments to that specific command. Below follows a description of the available commands with arguments.
WARNING
NOTE the request rate is limited to 1 request/second for the following URLs:
/blast
/tab
/fasta
If the rate is too high, some of the requests will fail with a 503 error.
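One way to stay under this limit from a script is to space requests at least one second apart and retry when a 503 comes back. The sketch below uses only the Python standard library; the class and function names are illustrative, and the retry count is an arbitrary choice, not an OrthoDB recommendation.

```python
import time
import urllib.error
import urllib.request


class Throttle:
    """Make successive calls to wait() at least min_interval seconds apart."""

    def __init__(self, min_interval=1.0, clock=time.monotonic, sleep=time.sleep):
        self.min_interval = min_interval
        self._clock = clock
        self._sleep = sleep
        self._last = None

    def wait(self):
        now = self._clock()
        if self._last is not None:
            remaining = self.min_interval - (now - self._last)
            if remaining > 0:
                self._sleep(remaining)
                now = self._clock()
        self._last = now


def fetch(url, throttle, retries=3):
    """Fetch a URL through the throttle, retrying only on HTTP 503."""
    for attempt in range(retries + 1):
        throttle.wait()
        try:
            with urllib.request.urlopen(url) as response:
                return response.read()
        except urllib.error.HTTPError as err:
            if err.code != 503 or attempt == retries:
                raise


# Usage (performs a network request, so it is left commented out):
# throttle = Throttle(1.0)
# data = fetch("https://data.orthodb.org/current/tab?id=32204at9721", throttle)
```

The clock and sleep functions are injectable so the spacing logic can be checked without actually waiting.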
# Data Formats
All data is returned in JSON format, except for /fasta and /tab. JSON data is widely supported by many languages. An overview with many examples can be found here.
The JSON returned is of the generic format:
{
"url" : full url of request
"message": message string if status is error
"status" : "ok" or "error"
"data" : array of data
}
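A client would typically check the status field before using the data. A minimal sketch based only on the four fields listed above; the exception class and function name are illustrative, and the sample responses are made up.

```python
import json


class OrthoDBError(RuntimeError):
    """Raised when a response reports a status other than 'ok'."""


def unwrap(text):
    """Return the 'data' field of a JSON response, or raise on error status."""
    payload = json.loads(text)
    if payload.get("status") != "ok":
        raise OrthoDBError(payload.get("message", "unknown error"))
    return payload.get("data")


# Made-up response body following the generic format.
clusters = unwrap('{"url": "u", "status": "ok", "data": ["1627at33208"]}')
```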
The clusters and genes have OrthoDB specific ids.
Cluster id
Generic form CLIDatCLADE, where
- CLID is a numerical cluster id
- CLADE is the NCBI taxid of the clade
Example: 124at33208
NOTE prior to OrthoDB 10 the cluster ids were of the form:
Generic form FFFVVCCCCII, where
- FFF either EOG (eukaryota) or POG (prokaryota)
- VV OrthoDB version ('09' for both v9 and v9.1)
- CCCC unique identifier for each clade
- II unique cluster identifier within the clade
Example: EOG091G06KN
Gene id
Generic form taxid_version:geneid
taxid is the NCBI taxonomy id, extended with a version
geneid is a unique zero-padded hexadecimal identifier
Example: 10090_0:000d08
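These id formats can be split into their parts with regular expressions. A sketch, assuming the version suffix of a gene id is numeric (as in the example above) and that the legacy clade and cluster parts are alphanumeric; the function and class names are illustrative.

```python
import re
from typing import NamedTuple


class ClusterId(NamedTuple):
    clid: int    # numerical cluster id
    clade: int   # NCBI taxid of the clade


class GeneId(NamedTuple):
    taxid: int    # NCBI taxonomy id
    version: int  # version suffix (assumed numeric)
    geneid: str   # zero-padded hexadecimal identifier, kept as text


_CLUSTER = re.compile(r"(\d+)at(\d+)")
_GENE = re.compile(r"(\d+)_(\d+):([0-9a-fA-F]+)")
_LEGACY = re.compile(r"(EOG|POG)(\d{2})([0-9A-Z]{4})([0-9A-Z]{2})")


def parse_cluster_id(text):
    match = _CLUSTER.fullmatch(text)
    if match is None:
        raise ValueError(f"not a cluster id: {text!r}")
    return ClusterId(int(match.group(1)), int(match.group(2)))


def parse_gene_id(text):
    match = _GENE.fullmatch(text)
    if match is None:
        raise ValueError(f"not a gene id: {text!r}")
    return GeneId(int(match.group(1)), int(match.group(2)), match.group(3))


def parse_legacy_cluster_id(text):
    """Split a pre-OrthoDB-10 id into (FFF, VV, CCCC, II)."""
    match = _LEGACY.fullmatch(text)
    if match is None:
        raise ValueError(f"not a legacy cluster id: {text!r}")
    return match.groups()


cluster = parse_cluster_id("124at33208")
gene = parse_gene_id("10090_0:000d08")
legacy = parse_legacy_cluster_id("EOG091G06KN")
```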
Using the API
Interacting with the API can be done using either any web browser or a command line tool like 'curl'.
Note that currently 'wget' is not supported.
Linux: normally both are installed by default
Windows: curl
Mac: 'curl' is usually installed natively, otherwise look here
Example: download the fasta for a given cluster and save it in the file 'data.fs':
curl 'https://data.orthodb.org/current/fasta?id=32204at9721&species=9721' -L -o data.fs
Here -L makes curl follow redirects and -o specifies the output file.
# API Commands
# /tree
Arguments:
NONE
Returns:
full tree used in OrthoDB
Description:
This retrieves the full tree.
curl 'https://data.orthodb.org/current/tree' -L -o tree.dat
# /search
Arguments:
query
- full query string
level
- NCBI taxon id of the clade
species
- NCBI taxon id of a clade or a CSV list of taxon ids
skip
- number of hits to skip
take
- maximum number of hits (cluster ids) to return - default is 1000
universal
- phyloprofile filter, present in 1.0, 0.9, 0.8 of all species in the clade
singlecopy
- phyloprofile filter, singlecopy in 1.0, 0.9, 0.8 of all species in the clade
Returns:
a list of clusters, the maximum number of clusters is defined by take
Description:
This finds all cluster ids matching a given query.
Note that if no query is given, species is required.
curl 'https://data.orthodb.org/current/search?query=p450&take=2&level=33208&singlecopy=0.8' -L -o search.dat
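The same request URL can be assembled in a script, with the arguments URL-encoded. A sketch using only the argument names listed above; the helper name `search_url` is illustrative, and the function only builds the URL without sending it.

```python
from urllib.parse import urlencode

BASE_URL = "https://data.orthodb.org/current"
SEARCH_OPTIONS = ("level", "skip", "take", "universal", "singlecopy")


def search_url(query=None, species=None, **options):
    """Build a /search URL; species may be a taxid, a string or a list of taxids."""
    if query is None and species is None:
        raise ValueError("species is required when no query is given")
    unknown = set(options) - set(SEARCH_OPTIONS)
    if unknown:
        raise ValueError(f"unknown search arguments: {sorted(unknown)}")
    params = {}
    if query is not None:
        params["query"] = query
    if species is not None:
        if isinstance(species, (list, tuple)):
            species = ",".join(str(taxid) for taxid in species)
        params["species"] = species
    params.update(options)
    return f"{BASE_URL}/search?{urlencode(params)}"


url = search_url("p450", take=2, level=33208, singlecopy=0.8)
```

Quoted phrases and operators such as '|' are percent-encoded by `urlencode`, so a query like "Cytochrome P450" can be passed as an ordinary Python string.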
# /blast
Arguments:
seq
- sequence string, without fasta-header
Returns:
info about the best matching gene as JSON
Description:
This finds the best match with the given sequence.
curl 'https://data.orthodb.org/current/blast?seq=MGDSHEDTSATVPEAVAEEVSLFSTTDIVLF' -L -o blast.dat
# /group
Arguments:
id
- OrthoDB cluster id
Returns:
annotation details on the given cluster id
Description:
Retrieve detailed annotation information on the given cluster.
curl 'https://data.orthodb.org/current/group?id=1627at33208' -L -o group.dat
# /orthologs
Arguments:
id
- OrthoDB cluster id
Returns:
a dictionary keyed by tax id, each entry containing a list of OrthoDB gene ids
Description:
Retrieve all genes in a given cluster, possibly filtered with respect to species.
curl 'https://data.orthodb.org/current/orthologs?id=1627at33208' -L -o orthologs.dat
# /ogdetails
Arguments:
id
- OrthoDB gene id
Returns:
detailed information on the given gene id
Description:
Retrieve further details on a given gene id.
curl 'https://data.orthodb.org/current/ogdetails?id=9606_0:0017fc' -L -o ogdetails.dat
# /siblings
Arguments:
id
- OrthoDB cluster id
take
- maximum number of returned siblings
Returns:
a list of OrthoDB cluster ids
Description:
Retrieve all siblings to the given cluster.
curl 'https://data.orthodb.org/current/siblings?id=1627at33208' -L -o siblings.dat
# /fasta
Arguments:
id
- OrthoDB cluster id
species
- CSV list of NCBI species taxonomy ids OR an NCBI clade id
curl 'https://data.orthodb.org/current/fasta?id=32204at9721' -L -o data.fs
Returns:
sequences in fasta format
If species is a CSV list of taxids, it will return sequences only from those species.
If it is a clade, it will return all sequences from the clade.
If id is not given, it will return all sequences given by the species argument.
Note that this query is limited to a maximum of 5000 clusters. If the limit is exceeded, a page is given with basic instructions on how to retrieve the information.
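Once saved, the returned fasta text can be split into records with a few lines of code. This is a generic FASTA reader, not an OrthoDB-specific one: it returns each header line as a single string, because the exact layout of the header fields is not described here, and the sample records are made up.

```python
def read_fasta(text):
    """Yield (header, sequence) pairs from FASTA-formatted text."""
    header, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A new header closes the previous record, if any.
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


# Made-up records, for illustration only.
sample = ">gene1 example header\nMGDSHEDT\nSATVPE\n>gene2\nMKV\n"
records = list(read_fasta(sample))
```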
# /tab
Arguments:
id
- OrthoDB cluster id
species
- list of NCBI species taxonomy ids
Returns:
tab-separated table of gene annotations
curl 'https://data.orthodb.org/current/tab?id=32204at9721' -L -o data.tsv
# RDF
This SPARQL 1.1 endpoint serves OrthoDB data as RDF. The OrthoDB release 10.1 consists of 2'246'378'105 RDF triples describing evolutionary and functional properties of 40'614'194 genes from 15247 organisms clustered in 8'952'780 orthologous groups on 1004 taxonomic levels.
# Downloads
WARNING
Use the API (Application Programming Interface) to download data if the data set is not too large.
OrthoDB data is also available as flat files for download from here. This is recommended if you intend to process large parts of the data, or if a /fasta or /tab request exceeds the maximum number of clusters (5000).
# FAQ
# How can I ..?
..will come soon..
# Contact
Email: support[at]orthodb.org
Join the OrthoDB-News mailing list (low traffic).
# Funding
- UNIGE
- SIB
- SNSF
# Previous versions
# Cite us
OrthoDB v11: annotation of orthologs in the widest sampling of organismal diversity. D Kuznetsov, F Tegenfeldt, M Manni, M Seppey, M Berkeley, EV Kriventseva, EM Zdobnov. NAR, Nov 2022, doi:10.1093/nar/gkac996. PMID:36350662