OrthoDB

user guide

Go to OrthoDB >>

# Terminology

TIP

# Orthologs

Orthologs are genes in different species that evolved from a common ancestral gene by speciation. If one or both of these genes were duplicated after the speciation they are all termed co-orthologs, or just orthologs.

TIP

# Orthologous group / level-of-orthology

When more than two species are considered, there is more than one speciation event, and we refer to all descendants of a particular single gene of the last common ancestor of these species as orthologs, or an orthologous group. Our operational definition thus refers to a specific phylogenetic radiation for a set of species, termed the level-of-orthology.

TIP

# Ortholog functions

It is a reasonable hypothesis that orthologs retain the functions of their ancestral gene ("by tradition"), though there are examples of gene function gains and losses. A statement of gene orthology, however, refers to the evolutionary relation between genes, not to whether their functions were kept or altered.

TIP

# Paralogs

Paralogs are genes that evolved by duplication within a genome. The notions of orthologs and paralogs are not mutually exclusive: paralogs are co-orthologs if they duplicated after the speciation, and are not if they duplicated earlier.

# Standalone OrthoLoger software

OrthoLoger, the standalone OrthoDB pipeline for delineation of orthologs, is freely available here.

OrthoDB can be queried using a gene name, identifier, annotation keywords, etc. We indexed many relevant identifiers of proteins and genes, including UniProtKB, Ensembl, InterPro, KEGG, GenBank, RefSeq, etc.

TIP

To get a gene-centric view providing the available annotations and a list of pair-wise orthologs - switch from Text to get Gene on the left of the search input.

TIP

To query specifically for a numeric NCBI gene ID, switch from Text to NCBI ID on the left of the search input.

TIP

To query for EC numbers - use double quotes, e.g. "3.1.1.-".

# Text query format

  • Use double quotation marks to match a phrase, e.g. "Cytochrome P450"
  • Take advantage of the autocomplete lookup feature
  • Logical operator NOT: use '-' or '!', e.g. kinase !tyrosine
  • Logical operator OR: use '|', e.g. protease | peptidase
  • Logical operator AND is implicit, e.g. sodium transporter means sodium AND transporter (unless quoted)
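
The operators above can also be composed programmatically before pasting a query into the search input. The following is a minimal sketch; the helper names (`phrase`, `exclude`, `any_of`, `all_of`) are hypothetical and not part of OrthoDB.

```python
def phrase(text: str) -> str:
    """Wrap text in double quotes so it is matched as a phrase."""
    return f'"{text}"'


def exclude(term: str) -> str:
    """Prefix a term with the NOT operator '!'."""
    return f"!{term}"


def any_of(*terms: str) -> str:
    """Join terms with the OR operator '|'."""
    return " | ".join(terms)


def all_of(*terms: str) -> str:
    """AND is implicit: terms are simply separated by spaces."""
    return " ".join(terms)


# The four examples from the list above, rebuilt with the helpers.
queries = [
    phrase("Cytochrome P450"),
    all_of("kinase", exclude("tyrosine")),
    any_of("protease", "peptidase"),
    all_of("sodium", "transporter"),
]

for q in queries:
    print(q)
```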

OrthoDB can be queried by homology to a protein sequence: switch from Text to Sequence on the left of the search input and paste the query protein sequence without a header line.

# Advanced options

# Phyloprofile

The result list of Orthologous Groups can be filtered for

  • universality, i.e. having member genes in all species of the selected taxonomic node, or a fraction of them, e.g. present in all, in >90% or >80% of the species.
  • gene copy-number (duplicability), requiring them to have only single-copy orthologs in all species of the selected taxonomic node, or a fraction of them, e.g. single-copy in all, in >90% or >80% of the species.

You can combine any presence filter with any copy-number filter to refine your results, e.g. present in >90% AND single-copy in >80% of species.

# Select species

You can tailor your search by using the expandable species tree to select a radiation point or particular sets of species.

  • Expand or collapse any node on the tree by clicking on the filled arrows or node names.
  • Select all species at a node by clicking on the unfilled box next to the node name, or
  • select specific species by clicking on the unfilled box next to the species name. You may also add species to the list of selected species to display by typing the species name in the search box and selecting from the autocompleted options. As you add or remove species from the expandable species tree, the Species to display box above it will automatically update to reflect your selections.

# Search at (level-of-orthology)

OrthoDB Orthologous Groups are hierarchical, being delineated at the major radiations along the species phylogeny. This makes it possible to resolve orthologs at a particular level-of-orthology: considering many distantly related species delineates fewer, more general (inclusive) orthologous groups containing all the descendants of the ancestral gene, while examining only sets of more closely related species produces many fine-grained orthologous groups of mostly one-to-one relations.

TIP

The level-of-orthology can be adjusted after the species or clades of interest have been selected (see Select species).

# Species to display

By default, only genes from model species are shown in detail for the returned Orthologous Groups in the Orthologs by organism section of the results. This can be changed to a custom set of species using Species to display.

# Results

WARNING

Results of an OrthoDB query are shown as a list of relevant Orthologous Groups in a condensed view; click on a group to expand it into a detailed view.

Each detailed record of an Orthologous Group has the following sections:

# Functional descriptions

OrthoDB provides tentative functional annotations of groups of orthologs and mapping to functional categories by summarizing functional gene annotations, extensively collected from other public resources. Annotation of genes is complicated and contains errors. Although in many cases OrthoDB makes such errors in the underlying data apparent, discordant annotations should be considered with caution.

# Evolutionary descriptions

The evolutionary annotations of the orthologs remain a distinguishing feature of OrthoDB.

# Phyletic Profile

is a summary of the ortholog presence (from universal to species-specific) and copy-numbers (single/multi-copy counts).

# Evolutionary Rate

is a measure of whether this Orthologous Group exhibits appreciably higher or lower levels of sequence divergence, derived from quantification of the relative divergence among its member genes. It is computed for each orthologous group as the average of inter-species identities normalized to the average identity of all inter-species best reciprocal hits, computed from pairwise alignments of protein sequences. The relative rate is indicated by the position of the black star along the scale from slow (blue) to fast (red) rates.
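
The normalization described above can be illustrated with a short sketch. This is not OrthoDB's implementation: the function name and the identity values are hypothetical, and how the resulting ratio is mapped onto the blue-to-red scale is not specified here.

```python
from statistics import mean


def relative_identity(group_identities, all_brh_identities):
    """Average identity within one group, normalized to the background average.

    Values above 1 mean the group is more conserved than the background
    (slower rate); values below 1 mean it is more divergent (faster rate).
    """
    if not group_identities or not all_brh_identities:
        raise ValueError("both identity lists must be non-empty")
    return mean(group_identities) / mean(all_brh_identities)


# Hypothetical percent identities, for illustration only.
background = [60.0, 70.0, 80.0]  # all inter-species best reciprocal hits
conserved = [90.0, 85.0, 95.0]   # a slowly evolving group
divergent = [40.0, 30.0, 35.0]   # a fast evolving group

print(relative_identity(conserved, background))
print(relative_identity(divergent, background))
```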

# Gene Architecture

shows median and standard deviation values of protein lengths and exon counts for each orthologous group, effectively describing a 'consensus' gene architecture (for those genes with available data).

# Orthologs by organism

WARNING

This section can be very long. Use the navigation arrows on the left to go to the beginning or the end of the record, or the cross to collapse the detailed view to the condensed view.

The condensed view for each gene includes gene/protein ID, UniProt ID, short description, number of amino acids (AAs), number of exons, and associated InterPro domains.

TIP

For the length (AAs) and exon counts (Exons) listed for each gene, exclamation marks indicate differences from the consensus: a mark on the left means shorter, a mark on the right means longer; ! denotes 1 stdev and !! denotes 2 stdev.
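
As an illustration of this flagging convention, the sketch below marks a value against a group's median and standard deviation. The function name is hypothetical, and treating the thresholds as inclusive (at least 1 or 2 stdev) is an assumption; the guide does not state the exact rule.

```python
def flag_deviation(value: float, median: float, stdev: float) -> str:
    """Render a value with '!' marks: left if below consensus, right if above.

    Assumption: '!' means at least 1 stdev away, '!!' at least 2 stdev away.
    """
    if stdev <= 0:
        return str(value)  # no spread available, nothing to flag
    deviation = abs(value - median) / stdev
    marks = "!!" if deviation >= 2 else "!" if deviation >= 1 else ""
    if value < median:
        return f"{marks}{value}"  # shorter than consensus: marks on the left
    return f"{value}{marks}"      # longer than consensus: marks on the right


# Hypothetical group consensus: median length 500 AAs, stdev 100.
for length in (300, 510, 620):
    print(flag_deviation(length, 500, 100))
```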

# Double-arrow icon

when clicked, expands the view to show the annotations available for a given gene, with links to source databases.

Available InterPro domain annotations are displayed for each protein member, ordered from the N- to the C-terminus. Click on the grey magnifying glass icon to query OrthoDB for groups containing proteins with the same domains. To search for specific domain architectures, enter an ordered list of InterPro identifiers separated by commas only into the 'Text Search' field.

TIP

# Get All Fasta / View Fasta

retrieves the corresponding protein sequences in Fasta format. Group ID, gene, organism, and other useful details are contained in the header of each sequence. This information can be saved as a file by right-clicking on the link followed by "save link as...".

TIP

# Get All as Tab Delimited / View Tab Delimited

retrieves the corresponding ortholog information as tab delimited text. This information can be saved as a file by right-clicking on the link followed by "save link as...".

WARNING

Note that retrieving sequences in Fasta/Tab format is limited to a maximum of 5000 groups.

# Sibling Groups

Related orthologous groups at the same level-of-orthology are defined according to their common InterPro domain annotations. The top 5 groups are listed with their percentage overlap in terms of common InterPro domains, and the complete list of related groups may be retrieved by clicking the 'Show all siblings' link.

# Uploading and analyzing your own sequences

# Register

In order to be able to upload sequences for a custom analysis, you need to register:

  • Click on the "Register" link on the top right part of the OrthoDB webpage.
  • Enter your login details in the form that will appear.

# Data upload

Upload a fasta file with the sequences to be analyzed

After logging in, you can upload your sequences using the "Own data mapping" link (next to "Help").

After clicking on "Own data mapping", click on the "Upload" button and select your fasta-formatted file. Be aware that the file should contain amino acid sequences only.

After uploading is finished you will have to enter a species name in the corresponding field.

The next step is to select where to map your sequences. This level of orthology can be selected either automatically or manually. If 'Auto' is selected, BUSCO is used to find an appropriate level.
In 'manual' mode, click on the 'Advanced' button next to the search button and select a clade from the tree. Click on 'Advanced' again to return.

Click on "Run analysis" to add your job to the mapping queue. When the job starts, the status should change from "CREATED" to something else depending on your setup.

When mapping finishes without error, the status changes to "DONE".

If an error occurs, the status will be "ERROR" or something more informative. In particular, when BUSCO has been used to determine where to map, the status may be any of the following:

| Message | Description |
|---|---|
| ERROR | server side error |
| ERROR_BUSCO:io | server side error |
| ERROR_BUSCO:bad_setup | server side error |
| ERROR_BUSCO:no_result | busco failed |
| POOR_BUSCO:AT<score> | busco successful but score is too low |

If either of the last two errors occurs, rerun the analysis using a manually selected orthology level. If it still fails (or any other error occurs), contact OrthoDB support at support[at]orthodb.org.
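
The advice above can be summarized as a small decision function. This is only a sketch: the function name and the returned strings are hypothetical, and the score suffix in the example is made up, since the format of the score is not documented here.

```python
def advise(status: str) -> str:
    """Map a job status string to the follow-up action suggested in the guide."""
    if status == "DONE":
        return "download results"
    # The two BUSCO outcomes that call for a manual orthology level.
    if status == "ERROR_BUSCO:no_result" or status.startswith("POOR_BUSCO:AT"):
        return "rerun with a manually selected orthology level"
    # Any other error is server side.
    if status.startswith("ERROR"):
        return "contact support"
    # CREATED or any intermediate state: the job has not finished yet.
    return "wait"


# "POOR_BUSCO:AT42" uses a hypothetical score value.
for status in ("CREATED", "DONE", "ERROR_BUSCO:io", "POOR_BUSCO:AT42"):
    print(status, "->", advise(status))
```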

# Retrieve results

Download the results in a plain text file

Click on the "Download" button to get the mapping results. The name of this file contains all the mapping information:

  • node_XXX: where "XXX" is the NCBI taxon ID of the mapping node.
  • subnode_AAA_BBB_CCC_DDD_EEE: where AAA, BBB, CCC, DDD, and EEE are the NCBI taxon IDs for the selected species.
  • taxid_YYY: where YYY is a temporary taxon ID for your species.
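
The mapping information can be recovered from the file name with a few regular expressions. The sketch below assumes the three tokens appear literally in the name and that all IDs are numeric; the separators between tokens and the example file name are hypothetical.

```python
import re


def parse_result_name(name: str) -> dict:
    """Extract the node, subnode and taxid tokens from a results file name."""
    # The lookbehind keeps 'node_' from matching inside 'subnode_'.
    node = re.search(r"(?<![a-z])node_(\d+)", name)
    subnode = re.search(r"subnode_(\d+(?:_\d+)*)", name)
    taxid = re.search(r"taxid_(\d+)", name)
    return {
        "node": int(node.group(1)) if node else None,
        "subnode": [int(t) for t in subnode.group(1).split("_")] if subnode else [],
        "taxid": int(taxid.group(1)) if taxid else None,
    }


# Hypothetical file name, for illustration only.
example = "node_33208.subnode_9606_10090_7227.taxid_1000001.txt"
print(parse_result_name(example))
```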

The mapping file contains 9 fields:

  • Ortholog group name
  • Gene name
  • Ortholog type; for mapped sequences this field is a number >=10 and <20.
  • Length of the matching region (in amino acids).
  • Start coordinate of the match.
  • End coordinate of the match.
  • Score of the match.
  • Normalized score of the match.
  • E-value of the match.
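
A line of the mapping file can be read into a typed record as sketched below. The field order follows the list above; the tab delimiter, the numeric types, the class and function names, and the sample line are assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass
class MappingHit:
    group: str
    gene: str
    ortholog_type: int
    match_length: int
    start: int
    end: int
    score: float
    normalized_score: float
    evalue: float

    @property
    def is_mapped(self) -> bool:
        """Mapped sequences have an ortholog type >=10 and <20."""
        return 10 <= self.ortholog_type < 20


def parse_mapping_line(line: str) -> MappingHit:
    """Split one line into its 9 fields (assumed tab-delimited)."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) != 9:
        raise ValueError(f"expected 9 fields, got {len(fields)}")
    group, gene, otype, length, start, end, score, norm, evalue = fields
    return MappingHit(group, gene, int(otype), int(length), int(start),
                      int(end), float(score), float(norm), float(evalue))


# Hypothetical line, for illustration only.
sample = "1627at33208\tmy_gene_1\t10\t250\t12\t261\t480.5\t1.92\t1e-120"
hit = parse_mapping_line(sample)
print(hit.group, hit.gene, hit.is_mapped)
```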

# Comparative Charts

This OrthoDB online tool generates a comparative overview of the gene content across selected genomes. The total gene counts and the fractions of orthologs among these species show the level of relatedness among the genomes, highlighting the "universal" core of genes and the ones evolving under single-copy constraint [PMID:21148284].

You can select up to 20 species on the right panel to be included in the comparative genomics chart. The colors, patterns, etc. can be customised from the "Configure chart" tab on the right panel. The fractions shown are hyperlinked to the corresponding Orthologous Groups from which the gene counts were made. The tailored chart can then be exported as publication-quality vector graphics.

Explore an example

# Bookmarking

Search results can be saved by simply bookmarking the result page or saving the URL text.

You can also drag & drop the bookmarklet link under Bookmark OrthoDB at the right side under the search field to the browser toolbar for easy OrthoDB search next time with the same settings. You can later just highlight a keyword somewhere on a web page and click on the saved bookmarklet to search OrthoDB for this keyword.

# API

The OrthoDB data can be accessed programmatically using a URL-based interface. Data are retrieved with URLs of the following form:

# URL

https://data.orthodb.org/current/CMD?ARG1="value"&ARG2="value"&...

where CMD is a command and all ARGx are arguments to that specific command. Below follows a description of the available commands with arguments.

WARNING

NOTE the request rate is limited to 1 request/second for the following URLs:
/blast
/tab
/fasta
If the rate is too high, some of the requests will fail with a 503 error.
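
One way to stay within this limit from a script is to space requests at least one second apart and retry when a 503 is returned. The following is a minimal sketch using only the Python standard library; the `Throttle` class, the `fetch` function, and the retry count are hypothetical choices, not part of OrthoDB.

```python
import time
import urllib.error
import urllib.request


class Throttle:
    """Ensure successive calls to wait() are at least min_interval seconds apart."""

    def __init__(self, min_interval=1.0, clock=time.monotonic, sleep=time.sleep):
        self.min_interval = min_interval
        self.clock = clock
        self.sleep = sleep
        self._last = None

    def wait(self):
        now = self.clock()
        if self._last is not None:
            remaining = self.min_interval - (now - self._last)
            if remaining > 0:
                self.sleep(remaining)
                now = self.clock()
        self._last = now


def fetch(url, throttle, retries=3, opener=urllib.request.urlopen):
    """Fetch a URL through the throttle, retrying only on HTTP 503."""
    for attempt in range(retries + 1):
        throttle.wait()
        try:
            with opener(url) as response:
                return response.read()
        except urllib.error.HTTPError as err:
            if err.code != 503 or attempt == retries:
                raise
```

A single `Throttle` instance should be shared by all calls to /blast, /tab and /fasta, for example `fetch(url, shared_throttle)`.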

# Data Formats

All data is returned in JSON format, except for /fasta and /tab. JSON is widely supported by many languages. An overview with many examples can be found here.

The JSON returned is of the generic format:

          {
             "url"    : full url of request
             "message": message string if status is error
             "status" : "ok" or "error"
             "data"   : array of data
          }
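
A response in this generic format can be unwrapped as sketched below, returning the data array on success and raising on error. The `unwrap` function, the exception class, and the sample payloads are hypothetical; only the four top-level keys come from the format above.

```python
import json


class OrthoDBError(RuntimeError):
    """Raised when a response reports status 'error'."""


def unwrap(payload: str):
    """Return the 'data' array of a response, or raise with its 'message'."""
    doc = json.loads(payload)
    if doc.get("status") != "ok":
        raise OrthoDBError(doc.get("message", "unknown error"))
    return doc.get("data", [])


# Hypothetical payloads, for illustration only.
ok = '{"url": "https://data.orthodb.org/current/search?query=p450", "status": "ok", "data": ["1627at33208"]}'
failed = '{"status": "error", "message": "bad id"}'

print(unwrap(ok))
```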

The clusters and genes have OrthoDB specific ids.

Cluster id
Generic form CLIDatCLADE
CLID is a numerical cluster id
CLADE NCBI taxid of the clade
Example: 124at33208

NOTE prior to OrthoDB 10 the cluster ids were of the form:
Generic form FFFVVCCCCII, where

  • FFF either EOG (eukaryota) or POG (prokaryota)
  • VV OrthoDB version ('09' for both v9 and v9.1)
  • CCCC unique identifier for each clade
  • II unique cluster identifier within the clade

Example: EOG091G06KN

Gene id
Generic form taxid_version:geneid
taxid is the NCBI taxonomy id, extended with a version
geneid is a unique zero-padded hexadecimal identifier
Example: 10090_0:000d08
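
Both identifier forms can be split into their parts with regular expressions, as in the sketch below. The type and function names are hypothetical, the version part of a gene id is assumed to be numeric, and only the current CLIDatCLADE cluster form is handled (pre-OrthoDB 10 ids are rejected).

```python
import re
from typing import NamedTuple


class ClusterId(NamedTuple):
    clid: int   # numerical cluster id
    clade: int  # NCBI taxid of the clade


class GeneId(NamedTuple):
    taxid: int    # NCBI taxonomy id
    version: int  # version suffix of the taxid (assumed numeric)
    geneid: str   # zero-padded hexadecimal identifier, kept as text


_CLUSTER = re.compile(r"^(\d+)at(\d+)$")
_GENE = re.compile(r"^(\d+)_(\d+):([0-9a-fA-F]+)$")


def parse_cluster_id(text: str) -> ClusterId:
    match = _CLUSTER.match(text)
    if match is None:
        raise ValueError(f"not a CLIDatCLADE cluster id: {text!r}")
    return ClusterId(int(match.group(1)), int(match.group(2)))


def parse_gene_id(text: str) -> GeneId:
    match = _GENE.match(text)
    if match is None:
        raise ValueError(f"not a taxid_version:geneid gene id: {text!r}")
    return GeneId(int(match.group(1)), int(match.group(2)), match.group(3))


print(parse_cluster_id("124at33208"))
print(parse_gene_id("10090_0:000d08"))
```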

Using the API
You can interact with the API using any web browser or a command-line tool such as 'curl'.
Note that currently 'wget' is not supported.

Linux: normally both are installed by default
Windows: curl
Mac: 'curl' is usually installed natively, otherwise look here

Example: download fasta for a given cluster and save it in the file 'data.fs':

curl 'https://data.orthodb.org/current/fasta?id=32204at9721&species=9721' -L -o data.fs

Note the difference in options for specifying output file.

# API Commands

# /tree

  • Arguments:
    NONE

  • Returns:
    full tree used in OrthoDB

  • Description:
    This retrieves the full tree.

Example:

curl 'https://data.orthodb.org/current/tree' -L -o tree.dat

# /search

  • Arguments:
    query - full query string
    level - NCBI taxon id of the clade
    species - NCBI taxon id of a clade or a CSV list of taxon ids
    skip - number of hits to skip
    take - maximum number of hits (cluster ids) to return; default is 1000
    universal - phyloprofile filter: present in a fraction (1.0, 0.9 or 0.8) of all species in the clade
    singlecopy - phyloprofile filter: single-copy in a fraction (1.0, 0.9 or 0.8) of all species in the clade

  • Returns:
    a list of clusters, the maximum number of clusters is defined by take

  • Description:
    This finds all cluster id's matching a given query.
    Note that if no query is given, species is required.

Example:

curl 'https://data.orthodb.org/current/search?query=p450&take=2&level=33208&singlecopy=0.8' -L -o search.dat
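
The same request can be assembled in a script. The sketch below builds a /search URL from the arguments listed above using the Python standard library; the `search_url` function is hypothetical, and restricting `universal` and `singlecopy` to 1.0, 0.9 and 0.8 is an assumption based on the values named in the argument list.

```python
from urllib.parse import urlencode

BASE = "https://data.orthodb.org/current"
ALLOWED_FRACTIONS = (1.0, 0.9, 0.8)  # assumed to be the only accepted values


def search_url(query=None, level=None, species=None, skip=None, take=None,
               universal=None, singlecopy=None):
    """Build a /search URL, omitting arguments that were not given."""
    if query is None and species is None:
        raise ValueError("species is required when no query is given")
    for name, value in (("universal", universal), ("singlecopy", singlecopy)):
        if value is not None and value not in ALLOWED_FRACTIONS:
            raise ValueError(f"{name} must be one of {ALLOWED_FRACTIONS}")
    # Accept either a single taxon id or an iterable of taxon ids.
    if species is not None and not isinstance(species, (str, int)):
        species = ",".join(str(taxid) for taxid in species)
    params = {"query": query, "level": level, "species": species, "skip": skip,
              "take": take, "universal": universal, "singlecopy": singlecopy}
    given = {key: value for key, value in params.items() if value is not None}
    return f"{BASE}/search?" + urlencode(given)


print(search_url(query="p450", take=2, level=33208, singlecopy=0.8))
```

Note that `urlencode` percent-encodes the commas of a species list (as %2C), which is equivalent in a query string.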

# /blast

  • Arguments:
    seq - sequence string, without fasta-header

  • Returns:
    info about the best matching gene as a JSON

  • Description:
    This finds the best match with the given sequence.

Example:

curl 'https://data.orthodb.org/current/blast?seq=MGDSHEDTSATVPEAVAEEVSLFSTTDIVLF' -L -o blast.dat
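
Since seq must not include a fasta header, a sequence taken from a fasta file needs its header line and line breaks removed first. The sketch below shows one way to do this; the function names and the header text are hypothetical, and the input is assumed to hold a single protein record.

```python
from urllib.parse import urlencode


def bare_sequence(fasta_text: str) -> str:
    """Drop header lines (starting with '>') and join the sequence lines."""
    lines = [line.strip() for line in fasta_text.splitlines()]
    return "".join(line for line in lines if line and not line.startswith(">"))


def blast_url(fasta_text: str) -> str:
    """Build a /blast URL for a single-record fasta text."""
    query = urlencode({"seq": bare_sequence(fasta_text)})
    return "https://data.orthodb.org/current/blast?" + query


# The sequence from the example above, wrapped over two lines with a made-up header.
record = ">my_protein hypothetical header\nMGDSHEDTSATVPEA\nVAEEVSLFSTTDIVLF\n"
print(blast_url(record))
```

Remember that /blast is subject to the limit of 1 request/second.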

# /group

  • Arguments:
    id - OrthoDB cluster id

  • Returns:
    annotation details on the given cluster id

  • Description:
    Retrieve detailed annotation information on the given cluster.

Example:

curl 'https://data.orthodb.org/current/group?id=1627at33208' -L -o group.dat

# /orthologs

  • Arguments:
    id - OrthoDB cluster id

  • Returns:
    a dictionary of tax ids, each containing a list of OrthoDB gene ids

  • Description:
    Retrieve all genes in a given cluster, possibly filtered with respect to species.

Example:

curl 'https://data.orthodb.org/current/orthologs?id=1627at33208' -L -o orthologs.dat

# /ogdetails

  • Arguments:
    id - OrthoDB gene id

  • Returns:
    detailed information on the given gene id

  • Description:
    Retrieve further details on a given gene id.

Example:

curl 'https://data.orthodb.org/current/ogdetails?id=9606_0:0017fc' -L -o ogdetails.dat

# /siblings

  • Arguments:
    id - OrthoDB cluster id
    take - maximum number of returned siblings

  • Returns:
    a list of OrthoDB cluster ids

  • Description:
    Retrieve all siblings of the given cluster.

Example:

curl 'https://data.orthodb.org/current/siblings?id=1627at33208' -L -o siblings.dat

# /fasta

  • Arguments:
    id - OrthoDB cluster id
    species - CSV list of NCBI species taxonomy id's OR a NCBI clade id

  • Returns:
    sequences in fasta format
    If species is a CSV list of taxids, only sequences from those species are returned.
    If it is a clade, all sequences from the clade are returned.
    If id is not given, all sequences selected by the species argument are returned.
    Note that this query is limited to a maximum of 5000 clusters. If the limit is exceeded, a page is returned with basic instructions on how to retrieve the information.

Example:

curl 'https://data.orthodb.org/current/fasta?id=32204at9721' -L -o data.fs

# /tab

  • Arguments:
    id - OrthoDB cluster id
    species - list of NCBI species taxonomy id's

  • Returns:
    tab-separated table of gene annotations

Example:

curl 'https://data.orthodb.org/current/tab?id=32204at9721' -L -o data.tsv

# RDF

This SPARQL 1.1 endpoint serves OrthoDB data as RDF. The OrthoDB release 10.1 consists of 2'246'378'105 RDF triples describing evolutionary and functional properties of 40'614'194 genes from 15'247 organisms clustered in 8'952'780 orthologous groups at 1004 taxonomic levels.

# Downloads

WARNING

Use API (Application Programming Interface) to download data if the data set is not too large.

OrthoDB data is also available as flat files for download from here. This is recommended if you intend to process large parts of the data, or if /fasta or /tab exceeds the maximum number of clusters (5000).

# FAQ

# How can I ..?

..will come soon..

# Contact

Email: support[at]orthodb.org
Join the OrthoDB-News mailing list (low traffic).

# Funding

  • UNIGE
  • SIB
  • SNSF

# Previous versions

# Cite us

OrthoDB v11: annotation of orthologs in the widest sampling of organismal diversity. D Kuznetsov, F Tegenfeldt, M Manni, M Seppey, M Berkeley, EV Kriventseva, EM Zdobnov. NAR, Nov 2022, doi:10.1093/nar/gkac996. PMID:36350662
