Functional Annotation of MAGs (or Contigs)

Santiago Castro Dau edited this page Jun 28, 2024 · 29 revisions

Here we will assume that you already have a mags.qza artifact to work with. The run-time estimates are based on the dereplicated (TODO: link to dereplication tutorial) version of the mags.qza artifact from the Generate MAGs from Reads tutorial.

Functionally annotating your MAGs or contigs involves several actions; how many depends on which reference database you want to use. The process can be divided into three stages:

  1. Downloading or building the reference database
  2. Searching your contigs or MAGs for homologs against a reference database
  3. Functionally annotating the search hits against the eggNOG database

1. Downloading or Building the Reference Database

Homologs are found by comparing your sequences against a reference database with a search tool such as Diamond, HMMER, or MMseqs2. Below you will find several options for fetching or building a reference database for these tools. Choose the one that best fits your needs.

Create a Diamond Database

You must choose whether to download the complete Diamond reference database, only a portion of it (i.e. for a given taxon), or whether to create a custom database from user-provided protein sequences.

Download the Full Database (All Taxa) (✅ runs)

Use the fetch-diamond-db action to download and save the full Diamond reference database.

⚠️ At least 18 GB of free storage space is required to run this action. Runtime: 17 minutes

qiime moshpit fetch-diamond-db \
   --o-diamond-db diamond_db.qza
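
Since the download needs a fair amount of disk space, it may be worth checking the free space before starting. The following is only a minimal sketch and not part of moshpit: the helper name `has_free_space` is hypothetical, and it assumes a POSIX `df` and that the artifact is written to the current directory.

```shell
# Hypothetical helper (not part of moshpit): succeeds when the filesystem
# holding DIR has at least MIN_GB gigabytes available.
has_free_space() {
    dir=$1
    min_gb=$2
    # df -Pk reports available space in 1024-byte blocks in column 4 of line 2.
    avail_kb=$(df -Pk "$dir" | awk 'NR == 2 { print $4 }')
    [ "$avail_kb" -ge $((min_gb * 1024 * 1024)) ]
}

# Check for the 18 GB stated above before starting the download.
if has_free_space . 18; then
    echo "Enough free space for fetch-diamond-db"
else
    echo "Less than 18 GB free; make room before running fetch-diamond-db" >&2
fi
```

The same check can be reused for the other download actions in this tutorial by changing the threshold to the value given in the corresponding warning.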

Download Database for a Specific Taxon

First, use the fetch-eggnog-proteins action to download and save the eggNOG protein sequence database.

⚠️ At least 18 GB of free storage space is required to run this action.

qiime moshpit fetch-eggnog-proteins \
   --o-eggnog-proteins eggnog_proteins.qza

Now we can use this database to construct a Diamond database for a specific taxon, using the build-eggnog-diamond-db action. The --p-taxon parameter specifies the taxon ID number for which to build the database (here 2 = Bacteria).

qiime moshpit build-eggnog-diamond-db \
   --i-eggnog-proteins eggnog_proteins.qza \
   --p-taxon 2 \
   --o-diamond-db diamond_db.qza \
   --verbose

Create a Diamond Database from a Custom Protein Database

Optional

If you want the resulting Diamond database to have taxonomy features, first download the NCBI taxonomy database using the fetch-ncbi-taxonomy action.

⚠️ At least 30 GB of free storage space is required to run this action.

qiime moshpit fetch-ncbi-taxonomy \
   --o-taxonomy taxonomy.qza \
   --verbose

If you don't want taxonomy features just skip this step.

This option assumes you have a protein reference database that you would like to use to construct the Diamond database. If you have not already done so, collect all of your sequences in a single FASTA file and import it into a QIIME 2 artifact with the FeatureData[ProteinSequence] semantic type.

qiime tools import \
   --input-path my_proteins.fasta \
   --output-path my_proteins.qza \
   --type "FeatureData[ProteinSequence]"
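
If your protein sequences are spread across several FASTA files, they need to be merged into one file before the import above. A minimal sketch, assuming plain (uncompressed) FASTA files; the helper name `combine_fastas` and the file names in the usage comment are placeholders, not part of QIIME 2 or moshpit.

```shell
# Hypothetical helper: merge several FASTA files into a single file.
# The awk pattern "1" prints every line and always terminates it with a
# newline, so a file that lacks a trailing newline cannot glue its last
# sequence line onto the header of the next file.
combine_fastas() {
    out=$1
    shift
    awk 1 "$@" > "$out"
}

# Example call (file names are placeholders):
# combine_fastas my_proteins.fasta proteins_a.fasta proteins_b.fasta
```

Plain `cat` would also work when every input file ends with a newline; `awk 1` is used here only to guard against files that do not.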

Now, construct a Diamond reference database using the build-custom-diamond-db action.

If you decided to include taxonomy information in your database (i.e. the optional step above) don't forget to include a --i-taxonomy taxonomy.qza line in the command below.

qiime moshpit build-custom-diamond-db \
   --i-seqs my_proteins.qza \
   --o-diamond-db diamond_db.qza \
   --verbose
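
The optional taxonomy input can also be handled in a small wrapper, so the same script works with and without it. This is a sketch only: the function name `build_custom_db`, the `DRY_RUN` switch, and the output name `custom_db.qza` are hypothetical, while the flags are the ones used in this tutorial.

```shell
# Hypothetical wrapper around build-custom-diamond-db. Pass a taxonomy
# artifact as the third argument to add the --i-taxonomy line; leave it
# out to build the database without taxonomy features.
# With DRY_RUN=1 the command is printed instead of executed.
build_custom_db() {
    seqs=$1
    db_out=$2
    taxonomy=${3:-}
    set -- qiime moshpit build-custom-diamond-db --i-seqs "$seqs"
    if [ -n "$taxonomy" ]; then
        set -- "$@" --i-taxonomy "$taxonomy"
    fi
    set -- "$@" --o-diamond-db "$db_out" --verbose
    if [ "${DRY_RUN:-0}" = 1 ]; then
        echo "$*"
    else
        "$@"
    fi
}

# Print both variants without running them; unset DRY_RUN to execute.
DRY_RUN=1
build_custom_db my_proteins.qza custom_db.qza
build_custom_db my_proteins.qza custom_db.qza taxonomy.qza
```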

Create a HMMER Database (runs ✅)

Use the fetch-eggnog-hmmer-db action to construct a HMMER database for a specific taxon. The --p-taxon-id parameter specifies the taxon ID number for which to build the database (here 2 = Bacteria).

⚠️ At least 80 GB of free storage space is required to run this action. Runtime: TODO

qiime moshpit fetch-eggnog-hmmer-db \
   --p-taxon-id 2 \
   --output-dir hmmr_db_2 \
   --verbose

2. Searching for Homologs

Search for homologs by comparing your sequences against a reference database.

Diamond Search (✅ runs)

Approximate runtime: 60 minutes

Search for homologs in your MAGs or contigs by comparing them against a Diamond database. Do this by using the eggnog-diamond-search action.

qiime moshpit eggnog-diamond-search \
   --i-sequences mags.qza \
   --i-diamond-db diamond_db.qza \
   --o-eggnog-hits hits.qza \
   --o-table table.qza \
   --parallel \
   --p-num-partitions 5 \
   --p-num-cpus 7 \
   --verbose

HMMER Search

Search for homologs in your MAGs or contigs by comparing them against a HMMER database. Do this by using the eggnog-hmmer-search action.

qiime moshpit eggnog-hmmer-search \
   --i-fastas hmmr_db_2/fastas.qza \
   --i-idmap hmmr_db_2/idmap.qza \
   --i-pressed-hmm-db hmmr_db_2/pressed_hmm_db.qza \
   --i-sequences mags.qza \
   --o-eggnog-hits hits.qza \
   --o-table table.qza \
   --p-num-cpus 7 \
   --p-num-partitions 4 \
   --parallel \
   --verbose 
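
The search command above reads three artifacts from the HMMER database directory, so a quick check that all of them exist can save a failed run. A minimal sketch; the helper name `check_hmmer_db` is hypothetical, and the file names are simply the ones referenced in the command above.

```shell
# Hypothetical pre-flight check: succeeds only when DIR contains the three
# artifacts that the eggnog-hmmer-search command above reads.
check_hmmer_db() {
    dir=$1
    missing=0
    for name in fastas.qza idmap.qza pressed_hmm_db.qza; do
        if [ ! -f "$dir/$name" ]; then
            echo "missing: $dir/$name" >&2
            missing=1
        fi
    done
    [ "$missing" -eq 0 ]
}

# Example: only start the search when the database directory is complete.
# check_hmmer_db hmmr_db_2 && qiime moshpit eggnog-hmmer-search ...
```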

3. Functional Annotation

Fetch the eggNOG database with the fetch-eggnog-db action.

⚠️ At least 80 GB of storage space is required to run this action.

qiime moshpit fetch-eggnog-db \
   --o-eggnog-db eggnog_db.qza \
   --verbose

Annotate the hits from the previous stage against the eggNOG database with the eggnog-annotate action.

qiime moshpit eggnog-annotate \
   --i-eggnog-hits hits.qza \
   --i-eggnog-db eggnog_db.qza \
   --o-ortholog-annotations annotations.qza \
   --verbose
