Circall: A fast and accurate methodology for discovery of circular RNAs from paired-end RNA-sequencing data
###########################################################
IMPORTANT NOTE: I tested with 45GB of RAM and 32 CPU cores and recommend a minimum of 45 GB of RAM to obtain stable results for Circall.
- Some changes in C++ binary (rebuild the software).
- Re-organization of the software package => user-friendly version with Docker.
- Speed up file reading in R with data.table.
- Fix bugs in hg38 annotation.
- Default option "-td" changed to FALSE.
- Output format with an additional column: junction_FPM - junction fragment per million .i.e, junction_fragment_count normalized by million fragment unit.
NOTE:
- Tutorial has been changed for Docker running tutorial in this version.
- An old version can be found in the sub-directory Circall_v0.1.0.
Instruction how to build the software from source can be found at: https://www.meb.ki.se/sites/biostatwiki/circall/
First submission
Circall is a novel method for fast and accurate discovery of circular RNAs from paired-end RNA-sequencing data. The method controls false positives by two-dimensional local false discovery method and employs quasi-mapping for fast and accurate alignments. The details of Circall are described in its manuscript. In this page, we present the Circall tool and how to use it.
- Docker
Circall requires:
- a fasta file of transcript sequences and a gtf file of transcript annotation: can be downloaded from public repositories such as Ensembl (ensembl.org)
- a genome file of transcript sequences and a gtf file of transcript annotation: can be downloaded from public repositories such as Ensembl (ensembl.org)
- a RData file of supporting annotation: A description of how to create the RData file for new annotation versions or species is available in the following section.
- a back-splicing-junction (BSJ) reference sqeuences: A description of how to create the BSJ reference sqeuences for new annotation versions or species is available in the following section.
Current Circall version was tested on the human genome (hg18 and hg38). Detail tutorials are in section 3 (hg19), and section 4 (hg38).
The esiest way to install and run Circall is using our pre-buit Docker immage:
# pull circall docker image:
docker pull ndatth/circall:v1.0.1
# test your docker
docker run --rm ndatth/circall:v1.0.1 Circall.sh
Expected output is:
--------------------------------------------------------------------------
_
_____ (_) _____ _____ __ __ __
/ ___/ / / / ___/ / ___/ / \ / / / /
/ /__ / / / / / /__ / __ \ / /__ / /__
\___/ /_/ /_/ \___/ /_/ \_\ /_/__/ /_/__/
---------------------------------------------------------------------------
Checking arguments ....
---------------------------------------------------------------------------
Usage:
./Circall.sh
-genome [genome in fasta format]
-gtfSqlite [genome annotation in Sqlite format]
-txFasta [transcripts (cDNA) in fasta format]
-txIdx [quasi-index of txFasta]
-bsjIdx [quasi-index of BSJ reference fasta file]
-read1 [read1 fastq.gz file]
-read2 [read2 fastq.gz file]
-thread [number of thread]
-tag [tag name]
-td [TRUE/FALSE value, defaut is FALSE]
-c[TRUE/FALSE value, defaut is TRUE]
-o [output folder]
If you want to build Circall from sources: Please prefer to our wiki webpage https://www.meb.ki.se/sites/biostatwiki/circall for detail instruction
wget http://ftp.ensembl.org/pub/release-75/fasta/homo_sapiens/dna/Homo_sapiens.GRCh37.75.dna.primary_assembly.fa.gz
gunzip Homo_sapiens.GRCh37.75.dna.primary_assembly.fa.gz
wget http://ftp.ensembl.org/pub/release-75/fasta/homo_sapiens/cdna/Homo_sapiens.GRCh37.75.cdna.all.fa.gz
gunzip Homo_sapiens.GRCh37.75.cdna.all.fa.gz
wget http://ftp.ensembl.org/pub/release-75/gtf/homo_sapiens/Homo_sapiens.GRCh37.75.gtf.gz
gunzip Homo_sapiens.GRCh37.75.gtf.gz
You can generate the sqlite annotation database was generated and able to download from Homo_sapiens.GRCh37.75.sqlite. This file was generated by the following command:
# assumming you are running unix and $PWD: is the path to directory that is mounted to docker containter as /data:
docker run --rm -v $PWD:/data ndatth/circall:v1.0.1 createSqlite.R \
data/Homo_sapiens.GRCh37.75.gtf \
data/Homo_sapiens.GRCh37.75.sqlite
The BSJ reference database for Homo_sapiens.GRCh37.75 was generated and able to download from Homo_sapiens.GRCh37.75_BSJ_sequences.fa. This file was generated by the following command:
# assumming you are running unix and $PWD: is the path to directory that is mounted to docker containter as /data:
docker run --rm -v $PWD:/data ndatth/circall:v1.0.1 buildBSJdb.R \
gtfSqlite=data/Homo_sapiens.GRCh37.75.sqlite \
genomeFastaFile=data/Homo_sapiens.GRCh37.75.dna.primary_assembly.fa \
bsjDist=250 maxReadLen=150 \
output=data/Homo_sapiens.GRCh37.75_BSJ_sequences.fa
# assumming you are running unix and $PWD: is the path to directory that is mounted to docker containter as /data:
# please uncompress your fa file (with gunzip) before indexing
docker run --rm -v $PWD:/data ndatth/circall:v1.0.1 TxIndexer \
-t data/Homo_sapiens.GRCh37.75.cdna.all.fa \
-o data/IndexTranscriptome
# assumming you are running unix and $PWD: is the path to directory that is mounted to docker containter as /data:
# please uncompress your fa file (with gunzip) before indexing
docker run --rm -v $PWD:/data ndatth/circall:v1.0.1 TxIndexer \
-t data/Homo_sapiens.GRCh37.75_BSJ_sequences.fa \
-o data/IndexBSJ
Now, you are ready to run Circall.
Suppose sample_01_1.fasta and sample_01_2.fasta are the input fastq files. For convenience, we prepared a toy example to test the pipeline, which can be downloaded here:
wget https://github.com/datngu/Circall/releases/download/v0.1.0/sample_01_1.fasta.gz
wget https://github.com/datngu/Circall/releases/download/v0.1.0/sample_01_2.fasta.gz
Circall can be run in one command wrapped in a bash script:
docker run --rm -v $PWD:/data ndatth/circall:v1.0.1 Circall.sh \
-genome data/Homo_sapiens.GRCh37.75.dna.primary_assembly.fa \
-gtfSqlite data/Homo_sapiens.GRCh37.75.sqlite \
-txFasta data/Homo_sapiens.GRCh37.75.cdna.all.fa \
-txIdx data/IndexTranscriptome \
-bsjIdx data/IndexBSJ \
-dep Circall/Data/Circall_depdata_human.RData \
-read1 data/sample_01_1.fasta.gz \
-read2 data/sample_01_2.fasta.gz \
-p 4 \
-tag testing_sample \
-c FALSE \
-td FASLE \
-o data/Testing_HG19_out
NOTED: sample depdata of Hela, HEK293, HS68 (cell-lines) are located in Docker immage, path: Circall/Data/Circall_depdata_human.RData
wget http://ftp.ensembl.org/pub/release-106/gtf/homo_sapiens/Homo_sapiens.GRCh38.106.gtf.gz -O Homo_sapiens.GRCh38.106.gtf.gz
wget http://ftp.ensembl.org/pub/release-106/fasta/homo_sapiens/cdna/Homo_sapiens.GRCh38.cdna.all.fa.gz -O Homo_sapiens.GRCh38.cdna.all.fa.gz
wget http://ftp.ensembl.org/pub/release-106/fasta/homo_sapiens/dna/Homo_sapiens.GRCh38.dna.primary_assembly.fa.gz -O Homo_sapiens.GRCh38.dna.primary_assembly.fa.gz
gunzip Homo_sapiens.GRCh38.106.gtf.gz
gunzip Homo_sapiens.GRCh38.cdna.all.fa.gz
gunzip Homo_sapiens.GRCh38.dna.primary_assembly.fa.gz
You can generate the sqlite annotation database was generated and able to download from Homo_sapiens.GRCh38.106.sqlite. This file was generated by the following command:
# assumming you are running unix and $PWD: is the path to directory that is mounted to docker containter as /data:
docker run --rm -v $PWD:/data ndatth/circall:v1.0.1 createSqlite.R \
data/Homo_sapiens.GRCh38.106.gtf \
data/Homo_sapiens.GRCh38.106.sqlite
The BSJ reference database for Homo_sapiens.GRCh38.106 was generated and able to download from Homo_sapiens.GRCh38.106_BSJ_sequences.fa. This file was generated by the following command:
# assumming you are running unix and $PWD: is the path to directory that is mounted to docker containter as /data:
docker run --rm -v $PWD:/data ndatth/circall:v1.0.1 buildBSJdb.R \
gtfSqlite=data/Homo_sapiens.GRCh38.106.sqlite \
genomeFastaFile=data/Homo_sapiens.GRCh38.dna.primary_assembly.fa \
bsjDist=250 maxReadLen=150 \
output=data/Homo_sapiens.GRCh38.106_BSJ_sequences.fa
# assumming you are running unix and $PWD: is the path to directory that is mounted to docker containter as /data:
# please uncompress your fa file (with gunzip) before indexing
docker run --rm -v $PWD:/data ndatth/circall:v1.0.1 TxIndexer \
-t data/Homo_sapiens.GRCh38.cdna.all.fa \
-o data/IndexTranscriptome
# assumming you are running unix and $PWD: is the path to directory that is mounted to docker containter as /data:
# please uncompress your fa file (with gunzip) before indexing
docker run --rm -v $PWD:/data ndatth/circall:v1.0.1 TxIndexer \
-t data/Homo_sapiens.GRCh38.106_BSJ_sequences.fa \
-o data/IndexBSJ
Now, you are ready to run Circall.
Suppose sample_01_1.fasta and sample_01_2.fasta are the input fastq files. For convenience, we prepared a toy example to test the pipeline, which can be downloaded here:
wget https://github.com/datngu/Circall/releases/download/v0.1.0/sample_01_1.fasta.gz
wget https://github.com/datngu/Circall/releases/download/v0.1.0/sample_01_2.fasta.gz
Circall can be run in one command wrapped in a bash script:
docker run --rm -v $PWD:/data ndatth/circall:v1.0.1 Circall.sh \
-genome data/Homo_sapiens.GRCh38.dna.primary_assembly.fa \
-gtfSqlite data/Homo_sapiens.GRCh38.106.sqlite \
-txFasta data/Homo_sapiens.GRCh38.cdna.all.fa \
-txIdx data/IndexTranscriptome \
-bsjIdx data/IndexBSJ \
-dep Circall/Data/Circall_depdata_human.RData \
-read1 data/sample_01_1.fasta.gz \
-read2 data/sample_01_2.fasta.gz \
-p 4 \
-tag testing_sample \
-c FALSE \
-td FASLE \
-o data/Testing_HG38_out
NOTED: sample depdata of Hela, HEK293, HS68 (cell-lines) are located in Docker immage, path: Circall/Data/Circall_depdata_human.RData
Parameters | Description |
---|---|
gtfSqlite | genome in fasta format |
Paragraph | genome annotation in Sqlite format |
txFasta | transcripts (cDNA) in fasta format |
txIdx | quasi-index of txFasta |
bsjIdx | quasi-index of BSJ reference fasta file |
Parameters | Description |
---|---|
read1 | input read1: should be in gz format |
read2 | input read1: should be in gz format |
Parameters | Description |
---|---|
dep | data contain depleted circRNAs: to specify the null data (depleted circRNA) for the two-dimensional local false discovery rate method. For convenience, we collect the null data from three human cell lines datasets Hela, Hs68, and Hek293 and provided in the tool: Circall_v0.1.0_linux_x86-64/Data/Circall_depdata_human.RData |
p | the number of threads: Default is 4 |
tag | tag name of results: Default is “Sample” |
td | generation of tandem sequences: TRUE/FALSE value, default is FASLE |
c | clean intermediate data: TRUE/FALSE value, default is TRUE |
o | output folder: Default is the current directory |
The main output of Circall is provided in *_Circall_final.txt. In this file, each row indicates one circular RNA, and the information of one circular RNA is presented in 8 columns:
Name | Description |
---|---|
chr | chromosome |
start | start position |
end | end position |
geneID | gene name that the circRNA belongs to |
circID | the ID of circRNA in the format “chr__start__end” |
junction_fragment_count | the number of fragment counts supporting the back-splicing-junction (BSJ) |
junction_FPM | junction fragment per million .i.e, junction_fragment_count normalized by million fragment unit |
median_circlen | the median length of the circular RNA |
fdr | the false discovery rate computed from the two-dimensional local false discovery method |
Circall simulator is a tool integrated in Circall to generate RNA-seq data of both circRNA and tandem RNA. The source codes are provided in R/Circall_simulator.R
of the Circall package. The main function of the simulator is Circall_simulator()
which is able to be run in R console. This function requires the following parameters:
Name | Description |
---|---|
circInfo | a data frame that contains 6 columns which are: Chr, start_EXONSTART, end_EXONEND, GENEID, cCount and FPKM. Chr is chromosome name with formated as 1:22, X, Y, Mt. start_EXONSTART is starting position starting exon of circRNA, end_EXONEND is ending position ending exon of circRNA, GENEID is gene ID contains circRNA (used to get gene model), cCount are number of read pair want to generate for the target circRNA and FPKM are Fragments Per Kilobase of transcript per Million of target circRNAs. This is used to simulate circular RNAs |
tandemInfo | a data frame similar to circInfo to simulate tandem RNAs. tandemInfo=NULL (the default value) to not simulate tandem RNAs |
error_rate | sequencing error rate, the default value is 0.005 |
set.seed | set seed for reproducibility, the default value is 2018 |
gtfSqlite | path to your annotation file, Sqlite formated (generated by GenomicFeatures) |
genomeFastaFile | path to your genome fasta file |
txFastaFile | path to your transcript fasta file (cDNA) |
out_name | prefix output folders, the default value is “Circall_simuation” |
out_dir | the directory contains output, the default value is the current directory |
lib_size | expected library size used when useFPKM=TRUE, the default value is NULL |
useFPKM | boolean value to use FPKM or not, the default value is FALSE. When this useFPKM=TRUE, users need to set value for lib_size, and the simulator will use the abundance in column FPKM of circInfo/tandemInfo for simulation |
For an illustration of using Circall simulator, we provide in this section a toy example. Suppose your current working directory contains the installed Circall and the annotation data. First, we need to load the functions of the simulator into your R console:
source("/Circall_v1.0.0/R/Circall_simulator.R")
Then we create objects circInfo and tandemInfo containing the information of CircRNAs and tandem RNAs
Chr = c(7,7,3,5,17,4,7,1,3,1,17,12,14,10,18,17,5,20,16,17)
start_EXONSTART = c(131113792,99795401,172363413,179296769,36918664,151509200,2188787,51906019,57832924,225239153,76187051,111923075,104490906,101556854,196637,21075331,74981032,60712420,56903641,80730328)
end_EXONEND = c(131128461,99796580,172365904,179315312,36918758,151509336,2270359,51913807,57882659,225528403,76201599,111924628,104493276,101572901,199316,21087123,74998635,60716000,56904648,80772810)
GENEID = c("ENSG00000128585","ENSG00000066923","ENSG00000144959","ENSG00000197226","ENSG00000108294","ENSG00000198589","ENSG00000002822","ENSG00000085832","ENSG00000163681","ENSG00000185842","ENSG00000183077","ENSG00000204842","ENSG00000156414","ENSG00000023839","ENSG00000101557","ENSG00000109016","ENSG00000152359","ENSG00000101182","ENSG00000070915","ENSG00000141556")
set.seed(2021)
cCount = sample(2:2000,20)
FPKM = rep(0,20)
BSJ_info = data.frame(Chr = Chr, start_EXONSTART = start_EXONSTART, end_EXONEND = end_EXONEND, GENEID = GENEID, cCount = cCount, FPKM = FPKM)
circSet=c(1:15)
circInfo = BSJ_info[circSet,]
tandemInfo = BSJ_info[-circSet,]
Finally, we run the simulator:
simulation = Circall_simulator(circInfo = circInfo, tandemInfo = tandemInfo, useFPKM=FALSE, out_name = "Tutorial", gtfSqlite = "Homo_sapiens.GRCh37.75.sqlite", genomeFastaFile = "Homo_sapiens.GRCh37.75.dna.primary_assembly.fa", txFastaFile = "Homo_sapiens.GRCh37.75.cdna.all.fa", out_dir= "./simulation_test")
You can find in the “./simulation_test” that contains the outputs including:
- simulation_setting: setting information of simulation of both circRNAs and tandem RNAs.
- circRNA_data: RNA seq data of CircRNAs
- tandem_data: RNA seq data of tandem RNA
- fasta sequences of tandem RNAs
- fasta sequences of circular RNAs
Circall uses GNU General Public License GPL-3.
Nguyen, Dat Thanh, Quang Thinh Trac, Thi-Hau Nguyen, Ha-Nam Nguyen, Nir Ohad, Yudi Pawitan, and Trung Nghia Vu. 2021. “Circall: Fast and Accurate Methodology for Discovery of Circular RNAs from Paired-End RNA-Sequencing Data.” BMC Bioinformatics 22 (1): 495. https://doi.org/10.1186/s12859-021-04418-8.