Brian Haas edited this page Oct 8, 2024 · 34 revisions

CTAT-LR-Fusion : Detect Fusion Transcripts from Long Reads (PacBio Iso-seq or ONT transcriptomes)

CTAT-LR-Fusion is a component of the Trinity Cancer Transcriptome Analysis Toolkit (CTAT) used for detecting fusion transcripts from long-read transcriptome sequencing data, including PacBio Iso-seq and Oxford Nanopore Technologies (ONT) sequenced transcriptomes. If matched Illumina RNA-seq data are available, they can also be used for further exploration and quantification of the fusions initially detected via long reads.

CTAT-LR-Fusion was developed in the Broad Institute's Methods Development Laboratory (MDL) for characterizing long-read transcriptome sequences such as those derived from MAS-seq.

CTAT-LR-Fusion: How it works

CTAT-LR-fusion operates in three main steps:

  1. Fusion candidates are initially identified based on long read alignments using ctat-minimap2, a modified version of minimap2 that focuses on identifying likely chimeric long reads rather than providing high quality alignments for all input reads. The chimeric-read-only search speeds up the initial minimap2 search phase.

  2. The chimeric read alignments are screened based on read and genome alignment positions to define a list of fusion candidates. For each fusion candidate, a model of the ordered and oriented fusion pair (a fusion contig) is constructed, borrowing the approach from our FusionInspector software.

  3. The candidate chimeric reads are realigned to a database of these fusion contigs, and each fusion pair is scored for read support according to read alignment breakpoints (a.k.a. fusion transcript breakpoints). If matched Illumina short reads are available, these are separately aligned to the fusion contigs using FusionInspector, and the results are integrated into the final ctat-LR-fusion report with fusion variant expression estimates from both short and long reads.

Sometimes the short reads provide evidence for alternatively spliced fusion isoforms for which long reads weren't captured, or vice-versa. These cases can be easily identified in the ctat-LR-fusion report.
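The read-support scoring in step 3 can be pictured with a toy tally. This is only an illustrative sketch, not the tool's implementation: the record layout (read name, fusion name, left breakpoint, right breakpoint), the function name, and the example values are all hypothetical.

```python
from collections import defaultdict


def tally_breakpoint_support(alignments):
    """Count distinct reads supporting each (fusion, left_bp, right_bp)."""
    reads_by_breakpoint = defaultdict(set)
    for read_name, fusion_name, left_bp, right_bp in alignments:
        # A set ensures a read is counted once per breakpoint pair.
        reads_by_breakpoint[(fusion_name, left_bp, right_bp)].add(read_name)
    return {key: len(reads) for key, reads in reads_by_breakpoint.items()}


# Hypothetical records: (read_name, fusion_name, left_breakpoint, right_breakpoint)
example = [
    ("read1", "GENEA--GENEB", "chr1:1000:+", "chr2:5000:+"),
    ("read2", "GENEA--GENEB", "chr1:1000:+", "chr2:5000:+"),
    ("read2", "GENEA--GENEB", "chr1:1000:+", "chr2:5000:+"),  # duplicate record
    ("read3", "GENEA--GENEB", "chr1:1200:+", "chr2:5000:+"),  # alternative breakpoint
]
support = tally_breakpoint_support(example)
for key, n_reads in sorted(support.items()):
    print(key, n_reads)
```

Distinct breakpoint pairs for the same gene pair correspond to the alternatively spliced fusion isoforms mentioned above.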

Installing CTAT-LR-Fusion

Obtaining CTAT-LR-Fusion software

Docker and Singularity images are available and recommended.

If you would prefer to install from source code, download the latest 'FULL' release tarball from the CTAT-LR-Fusion release site. Unpack it, and run 'make' in the base installation directory.

You may need additional dependencies. The complete dependency stack is shown in this Dockerfile. If you are only running long reads, the following is probably sufficient:

pip install pandas igv-reports pysam

Obtaining and configuring the CTAT Genome Lib

The CTAT genome lib is the same used for other CTAT tools and can be downloaded from https://data.broadinstitute.org/Trinity/CTAT_RESOURCE_LIB/. The ctat genome lib software compatibility matrix indicates the version of STAR to use if you have companion Illumina short reads.

Configuring the CTAT Genome Lib for CTAT-LR-Fusion

The ctat-LR-fusion software comes with a customized version of minimap2 named ctat-minimap2, and CTAT-LR-Fusion requires a minimap2 index of the reference genome. To build this, initially run ctat-LR-fusion like so:

ctat-LR-fusion -T long_reads.fastq.gz \
               --genome_lib_dir  /path/to/ctat_genome_lib_build_dir \
               --prep_reference --CPU 4

and it will first build the minimap2 genome index before running ctat-LR-fusion to find fusion transcripts.

If you run with --prep_reference_only, it will stop after building the index.

For future runs, drop the --prep_reference argument, as the index only needs to be built once. If you forget, no worries. It'll only build it once anyway.

Running CTAT-LR-Fusion

Once you have the ctat genome lib installed and configured as above, you need the long reads in either a FASTA or FASTQ formatted file. Then, run ctat-LR-fusion like so:

ctat-LR-fusion -T long_reads.fastq.gz \
               --genome_lib_dir  /path/to/ctat_genome_lib_build_dir \
               --CPU 4 \
               --vis

If you have the ctat genome lib dir set as the environment variable CTAT_GENOME_LIB, then you don't need to specify --genome_lib_dir, and only need to specify -T for the long reads.

If you have reads that align to the reference genome with <90% sequence identity, adjust the --min_per_id parameter (default: 90) accordingly.

Including Illumina RNA-seq

If you additionally have Illumina RNA-seq for the sample, you can include that as well like so:

ctat-LR-fusion -T long_reads.fastq.gz \
               --genome_lib_dir  /path/to/ctat_genome_lib_build_dir  \
               --left_fq illumina_reads_1.fq \
               --right_fq illumina_reads_2.fq \
               --CPU 4 \
               --vis

ctat-LR-fusion does not find additional fusions based on short reads. It only examines short read support for those fusion gene pairs initially detected via long read sequences. However, it will identify fusion splicing isoforms that are uniquely supported by Illumina short read data.

See the full usage info (via --help or no parameters) for additional options and configurations.
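If you launch runs from a script, the invocations above can be assembled programmatically. The following is a minimal sketch: the helper name build_ctat_lr_fusion_cmd is hypothetical, and it uses only the flags described on this page (-T, --genome_lib_dir, --left_fq, --right_fq, --min_per_id, --CPU, --vis).

```python
import os
import shlex


def build_ctat_lr_fusion_cmd(long_reads, genome_lib_dir=None, cpu=4, vis=True,
                             left_fq=None, right_fq=None, min_per_id=None):
    """Return the ctat-LR-fusion command line as a list of arguments."""
    cmd = ["ctat-LR-fusion", "-T", long_reads]
    if genome_lib_dir is not None:
        cmd += ["--genome_lib_dir", genome_lib_dir]
    elif "CTAT_GENOME_LIB" not in os.environ:
        # Without the flag, the genome lib must come from the environment.
        raise ValueError("set CTAT_GENOME_LIB or pass genome_lib_dir")
    if left_fq is not None:
        cmd += ["--left_fq", left_fq]
    if right_fq is not None:
        cmd += ["--right_fq", right_fq]
    if min_per_id is not None:
        cmd += ["--min_per_id", str(min_per_id)]
    cmd += ["--CPU", str(cpu)]
    if vis:
        cmd.append("--vis")
    return cmd


cmd = build_ctat_lr_fusion_cmd(
    "long_reads.fastq.gz",
    genome_lib_dir="/path/to/ctat_genome_lib_build_dir",
    left_fq="illumina_reads_1.fq",
    right_fq="illumina_reads_2.fq",
)
print(shlex.join(cmd))
# To actually run it: subprocess.run(cmd, check=True)
```

Returning an argument list rather than a single string avoids shell quoting problems when file paths contain spaces.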

Fusion Outputs

The output files consist of the following:

  • ctat-LR-fusion.fusion_predictions.tsv : the final fusion predictions including names for the evidence reads. See the .abridged version for simpler output lacking the read names.

  • ctat-LR-fusion.fusion_inspector_web.html : the results as an interactive igv-reports page for exploring the evidence supporting each fusion. Requires the --vis command line argument to ctat-LR-fusion.

A preliminary list of fusions before any filtering is performed to generate the final list is provided as file 'ctat-LR-fusion.fusion_predictions.preliminary.tsv'. This is useful for additional exploration and for troubleshooting purposes.
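For downstream scripting, the predictions file can be loaded with Python's standard csv module. This is a small sketch that assumes the file is tab-delimited with a single header line; it makes no assumption about the column names, so inspect the returned header of your own output before relying on specific fields.

```python
import csv


def read_fusion_predictions(path):
    """Return (column_names, rows) from a tab-delimited file with a header."""
    with open(path, newline="") as handle:
        reader = csv.DictReader(handle, delimiter="\t")
        rows = list(reader)
        # Read the header while the file is still open.
        columns = reader.fieldnames
    return columns, rows


# Example usage (the file is produced by a ctat-LR-fusion run):
# columns, rows = read_fusion_predictions("ctat-LR-fusion.fusion_predictions.tsv")
# print(columns)
# print(len(rows), "fusion predictions")
```

The same function works for the preliminary and abridged files, provided they share the tab-delimited layout.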

A screenshot of the interactive fusion html view is shown below:

In the image above, PacBio Iso-seq reads supporting the fusion are shown at the top, and below them are the Illumina junction reads and spanning fragments that also support this fusion. If you only have long reads, the Illumina tiers will simply be empty. The different fusion breakpoints are evidence of alternatively spliced fusion transcripts within a single sample.

Application to single cell transcriptomics

Before running single cell RNA-seq through CTAT-LR-Fusion, the names of the reads should be encoded with cell barcode and UMI information in the following format:

cellbarcode^UMI^read_name
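As an illustration of this naming scheme, the helpers below build and split such read names. They are a hypothetical sketch (the function names are not part of CTAT-LR-Fusion) and assume that neither the cell barcode nor the UMI contains a '^' character.

```python
def encode_read_name(cell_barcode, umi, read_name):
    """Build a read name in the cellbarcode^UMI^read_name format."""
    return f"{cell_barcode}^{umi}^{read_name}"


def decode_read_name(encoded):
    """Split an encoded read name into (cell_barcode, umi, read_name)."""
    # Split on the first two '^' only, so the original read name is kept whole.
    parts = encoded.split("^", 2)
    if len(parts) != 3:
        raise ValueError(f"not in cellbarcode^UMI^read_name format: {encoded!r}")
    return tuple(parts)


encoded = encode_read_name(
    "CGAGCCATCTACTATC", "CTACGGCGGC", "m64020e_210506_132139/1068814535003/ccs"
)
print(encoded)
print(decode_read_name(encoded))
```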

If you have 10x Genomics reads in unaligned BAM (uBAM) format, you can convert them to FASTQ format with the above read name encoding using this script: 10x_ubam_to_fastq.py

The fusion-to-cell mapping information can be derived from the ctat-LR-fusion output file 'ctat-LR-fusion.fusion_predictions.tsv' using this script: cell_to_fusion_mappings.Rscript, generating a report like so:

FusionName                LeftGene    LeftBreakpoint    RightGene     RightBreakpoint  SpliceType       cb                umi         readname
NUTM2A-AS1--RP11-203L2.4  NUTM2A-AS1  chr10:87326630:-  RP11-203L2.4  chr9:68822648:-  ONLY_REF_SPLICE  CGAGCCATCTACTATC  CTACGGCGGC  m64020e_210506_132139/1068814535003/ccs
NUTM2A-AS1--RP11-203L2.4  NUTM2A-AS1  chr10:87326630:-  RP11-203L2.4  chr9:68822648:-  ONLY_REF_SPLICE  CAGCCGACAGGACCCT  GATTGGTCAA  m64020e_210506_132139/1162007282005/ccs
NUTM2A-AS1--RP11-203L2.4  NUTM2A-AS1  chr10:87326630:-  RP11-203L2.4  chr9:68822648:-  ONLY_REF_SPLICE  CTACCCATCCAAATGC  TCTACGGCGG  m64020e_210506_132139/1130810518002/ccs
NUTM2A-AS1--RP11-203L2.4  NUTM2A-AS1  chr10:87326630:-  RP11-203L2.4  chr9:68822648:-  ONLY_REF_SPLICE  CGACTTCTCCAAGCCG  TGTTGTCTAC  m64020e_210506_132139/1109709689001/ccs
NUTM2A-AS1--RP11-203L2.4  NUTM2A-AS1  chr10:87326630:-  RP11-203L2.4  chr9:68822648:-  ONLY_REF_SPLICE  CTCTAATTCTCGTATT  TTGTTTCGTT  m64020e_210506_132139/1016517571004/ccs
NUTM2A-AS1--RP11-203L2.4  NUTM2A-AS1  chr10:87326630:-  RP11-203L2.4  chr9:68822648:-  ONLY_REF_SPLICE  CCCTCCTCAGCTTCGG  TACGACCGCA  m64020e_210506_132139/1116984222006/ccs
NUTM2A-AS1--RP11-203L2.4  NUTM2A-AS1  chr10:87326630:-  RP11-203L2.4  chr9:68822648:-  ONLY_REF_SPLICE  TTGACTTAGGGTATCG  GGTCGGGAGT  m64020e_210506_132139/1114230755010/ccs
NUTM2A-AS1--RP11-203L2.4  NUTM2A-AS1  chr10:87326630:-  RP11-203L2.4  chr9:68822648:-  ONLY_REF_SPLICE  TGGTTAGAGACCCACC  TTTCCTCCGA  m64020e_210506_132139/1009833045005/ccs
NUTM2A-AS1--RP11-203L2.4  NUTM2A-AS1  chr10:87326630:-  RP11-203L2.4  chr9:68822648:-  ONLY_REF_SPLICE  TGGGCGTTCACTGGGC  ACATGTATAC  m64020e_210506_132139/1164366124008/ccs
...
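A report like this can be summarized per fusion, for example by counting the distinct cells in which each fusion was observed. The sketch below assumes the report is tab-delimited and uses the FusionName and cb columns shown above; the function name is hypothetical, and the third example row is invented to show that repeated barcodes are counted once.

```python
import csv
import io
from collections import defaultdict


def cells_per_fusion(handle, delimiter="\t"):
    """Count distinct cell barcodes (cb) observed for each FusionName."""
    cells = defaultdict(set)
    for row in csv.DictReader(handle, delimiter=delimiter):
        cells[row["FusionName"]].add(row["cb"])
    return {fusion: len(barcodes) for fusion, barcodes in cells.items()}


report = io.StringIO(
    "FusionName\tcb\tumi\n"
    "NUTM2A-AS1--RP11-203L2.4\tCGAGCCATCTACTATC\tCTACGGCGGC\n"
    "NUTM2A-AS1--RP11-203L2.4\tCAGCCGACAGGACCCT\tGATTGGTCAA\n"
    # Hypothetical row: same cell as the first row, different UMI.
    "NUTM2A-AS1--RP11-203L2.4\tCGAGCCATCTACTATC\tAAAAAAAAAA\n"
)
counts = cells_per_fusion(report)
print(counts)
```

For a real report, pass an open file handle instead of the in-memory example.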

PacBio Webinar for CTAT-LR-Fusion

Trinity CTAT on Youtube

Questions, comments, etc?

Contact us via our google group: https://groups.google.com/forum/#!forum/trinity_ctat_users