
Input and Output Files

Vinh Tran edited this page Jan 24, 2023 · 4 revisions

Data structure

Besides the input file, fDOG needs the following three directories, which hold the data for every taxon included in your study:

  • searchTaxa_dir: Contains a sub-directory with the proteome FASTA file for each taxon. All taxa in this folder will be used for the ortholog search.
  • coreTaxa_dir: Contains sub-directories with the BLAST databases built with makeblastdb from your proteomes. Not every taxon in searchTaxa_dir needs a BLAST database; only taxa that should be included in the core ortholog group compilation must be present in this folder.
  • annotation_dir: Contains feature annotation files for all taxa present in searchTaxa_dir and coreTaxa_dir. These annotation files are optional, but we recommend providing them so that all features of fDOG, including the FAS score calculation, are available.
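The layout described above can be checked before a run. The following is a minimal sketch, assuming only that each taxon appears as a sub-directory of searchTaxa_dir and coreTaxa_dir; the helper names `list_taxa` and `summarize` are hypothetical and not part of fDOG.

```python
from pathlib import Path


def list_taxa(data_dir):
    """Return the sorted names of the taxon sub-directories in data_dir."""
    path = Path(data_dir)
    if not path.is_dir():
        raise FileNotFoundError(f"{data_dir} is not a directory")
    return sorted(p.name for p in path.iterdir() if p.is_dir())


def summarize(search_dir, core_dir):
    """Group taxa by the directories they appear in (hypothetical helper)."""
    search_taxa = set(list_taxa(search_dir))
    core_taxa = set(list_taxa(core_dir))
    return {
        # Taxa that are searched but cannot be used as core taxa
        "search_only": sorted(search_taxa - core_taxa),
        # Taxa available for both the search and the core group compilation
        "search_and_core": sorted(search_taxa & core_taxa),
        # Taxa with a BLAST database but no entry in the search directory
        "core_only": sorted(core_taxa - search_taxa),
    }
```

Calling `summarize("/path/to/searchTaxa_dir", "/path/to/coreTaxa_dir")` would give a quick overview of which taxa are available where; the paths are placeholders.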

fDOG comes with pre-calculated data for 78 QFO species (data set 2019). If you want to work with other taxa, you can add them to fDOG by following these instructions.

NOTE: you can rename searchTaxa_dir, coreTaxa_dir and annotation_dir to anything you like and place them anywhere you want.

NOTE 2: we recommend checking your own data for validity before running fDOG.

While fDOG runs, an additional folder, core_orthologs, is created to store the core ortholog groups. By default, it is created in the current directory. All four folders can be specified manually with the corresponding command parameters --hmmpath, --searchpath, --corepath and --annopath, or read from a YAML file given with the option --pathFile. An example pathConfig.yml file looks like this:

hmmpath: /home/yourname/working_dir/core_orthologs
searchpath: /home/yourname/fdog_data/searchTaxa_dir
corepath: /home/yourname/fdog_data/coreTaxa_dir
annopath: /home/yourname/fdog_data/annotation_dir

If all of these folders are located in the same directory, the pathConfig.yml file only needs a single line:

dataPath: /home/yourname/fdog_data
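To illustrate how the two forms of pathConfig.yml relate to each other, here is a minimal sketch. It does not reproduce fDOG's own parsing. It assumes that the folders below dataPath keep the names searchTaxa_dir, coreTaxa_dir and annotation_dir, and it reads flat `key: value` lines by hand rather than with a YAML library; the function names are hypothetical.

```python
def parse_flat_yaml(text):
    """Parse flat 'key: value' lines; enough for the files shown above."""
    config = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        # Split at the first colon only, so values may contain colons
        key, _, value = line.partition(":")
        config[key.strip()] = value.strip()
    return config


# Assumption: below dataPath the folders keep these names.
DEFAULT_SUBDIRS = {
    "searchpath": "searchTaxa_dir",
    "corepath": "coreTaxa_dir",
    "annopath": "annotation_dir",
}


def resolve_paths(config):
    """Expand a single dataPath entry into the individual data paths."""
    resolved = dict(config)
    base = resolved.pop("dataPath", None)
    if base is not None:
        for key, subdir in DEFAULT_SUBDIRS.items():
            # Entries given explicitly take precedence over derived ones
            resolved.setdefault(key, base.rstrip("/") + "/" + subdir)
    return resolved
```

With the single-line file above, `resolve_paths(parse_flat_yaml(text))` would yield the same searchpath, corepath and annopath values as the four-line example, under the stated naming assumption.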

Input file

The input (or seed sequence) for fDOG is a single FASTA file. For example:

>HUMAN@9606@3|P83876
MSYMLPHLHNGWQVDQAILSEEDRVVVIRFGHDWDPTCMKMDEVLYSIAEKVKNFAVIYL
VDITEVPDFNKMYELYDPCTVMFFFRNKHIMIDLGTGNNNKINWAMEDKQEMVDIIETVY
RGARKGRGLVVSPKDYSTKYRY

The taxon of this seed sequence, called the reference taxon and specified with the option --refspec, must be present in the BLAST database directory (coreTaxa_dir) of fDOG.
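As a small illustration of the seed file, the sketch below reads FASTA text and splits a header shaped like the one in the example above. It assumes the header has the form `name@taxonomy_id@version|sequence_id`; this is inferred from the example only and is not a documented requirement. The function names and dictionary keys are hypothetical.

```python
def read_fasta(text):
    """Parse FASTA text into a list of (header, sequence) tuples."""
    records = []
    header, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A new header closes the previous record, if any
            if header is not None:
                records.append((header, "".join(chunks)))
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        records.append((header, "".join(chunks)))
    return records


def split_seed_header(header):
    """Split 'name@taxid@version|seq_id'; raises ValueError on other shapes."""
    taxon, _, seq_id = header.partition("|")
    name, taxonomy_id, version = taxon.split("@")
    return {
        "taxon": taxon,  # would correspond to the value passed to --refspec
        "name": name,
        "taxonomy_id": taxonomy_id,
        "version": version,
        "seq_id": seq_id,
    }
```

For the example header, `split_seed_header("HUMAN@9606@3|P83876")["taxon"]` gives `HUMAN@9606@3`, which could then be compared against the sub-directory names in coreTaxa_dir.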

Output files

For one seed sequence, the fDOG output consists of the following text files (note: test is the job name you define with the --jobName parameter):

  1. test.extended.fa: a multiple FASTA file containing the seed and its ortholog sequences
  2. test.phyloprofile: an input file for analysing the phylogenetic profile of the query gene using the PhyloProfile tool
  3. test_forward.domains and, optionally, test_reverse.domains: protein domain annotation files for all the sequences present in the orthologous group. The _forward or _reverse suffix indicates the direction of the feature architecture comparison (FAS): _forward means that the query gene is used as seed and its orthologs as targets for the comparison, while _reverse is the other way round. These files can be submitted to PhyloProfile for visualisation.

Phylogenetic profile analysis using PhyloProfile

For a rich visualisation of the information in the fDOG outputs, you can load them into the PhyloProfile tool.

The main input file for PhyloProfile is test.phyloprofile, which contains a list of all orthologous gene names and the taxonomy IDs of their taxa, together with the FAS scores (if available). To analyse further information such as the FASTA sequences or the domain annotations, you can optionally provide test.extended.fa and test_forward.domains (or test_reverse.domains) to PhyloProfile.
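For a quick look at a .phyloprofile file outside of PhyloProfile, a sketch along the following lines may help. It assumes the file is tab-separated with a header row; the default column names `geneID` and `ncbiID` are assumptions and should be adjusted to the header of your own file. Both function names are hypothetical.

```python
import csv
import io


def read_profile(text):
    """Read tab-separated text with a header row into a list of dicts."""
    return list(csv.DictReader(io.StringIO(text), delimiter="\t"))


def taxa_per_gene(rows, gene_col="geneID", taxon_col="ncbiID"):
    """Count the distinct taxa in which each gene has at least one ortholog."""
    taxa = {}
    for row in rows:
        taxa.setdefault(row[gene_col], set()).add(row[taxon_col])
    return {gene: len(ids) for gene, ids in taxa.items()}
```

Reading the file content with `open("test.phyloprofile").read()` and passing it to `read_profile` gives one dictionary per line, keyed by whatever column names the header actually contains.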

You can combine multiple fDOG runs into a single phylogenetic profile input using the fdog.mergeOutput function:

fdog.mergeOutput -i /path/to/fdog/single/output/files/ -o output_name

in which /path/to/fdog/single/output/files/ is a directory containing all single *.phyloprofile, *.domains and *.extended.fa files.

The resulting files output_name.phyloprofile, output_name.extended.fa, output_name_forward.domains and output_name_backward.domains are saved in the current directory.
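Before merging, it can be useful to see which single-run files a directory actually contains. The sketch below only lists files by the suffixes named in the output section; it does not merge anything and is not part of fDOG. The helper name `collect_outputs` is hypothetical.

```python
from pathlib import Path

# Suffixes of the per-run output files described in "Output files"
SUFFIXES = (
    ".phyloprofile",
    ".extended.fa",
    "_forward.domains",
    "_reverse.domains",
)


def collect_outputs(directory):
    """Return the file names in directory, grouped by output suffix."""
    found = {suffix: [] for suffix in SUFFIXES}
    for path in sorted(Path(directory).iterdir()):
        if not path.is_file():
            continue
        for suffix in SUFFIXES:
            if path.name.endswith(suffix):
                found[suffix].append(path.name)
    return found
```

An empty list for `.phyloprofile` in the result would indicate that the directory holds no phylogenetic profiles to combine.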