This repository was created for the COMP483 mini project.
It provides a Python wrapper to investigate whether the original E. coli K-12 strain from 1922 (Bachmann 1972; PMID: 4568763) has evolved over time.
The wrapper uses the following tools:
- SRA-Toolkit: for downloading files from NCBI and converting them to FASTQ format
- SPAdes: for genome assembly
- Prokka: for rapid prokaryotic genome annotation
- TopHat: for mapping the reads from the E. coli transcriptome project of a K-12 derivative, BW38028
- Cufflinks: for quantifying transcript expression from the E. coli transcriptome project of a K-12 derivative, BW38028
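
Since the wrapper calls these external tools, it can help to confirm they are installed before starting a long run. The following is a minimal sketch, not part of EcoliWrapper.py: the executable names (`prefetch`, `fastq-dump`, `spades.py`, `prokka`, `bowtie2-build`, `tophat2`, `cufflinks`) are assumptions about a typical installation, and `missing_tools` is a hypothetical helper.

```python
import shutil

# Assumed executable names; adjust them to match the local installation.
REQUIRED_TOOLS = [
    "prefetch",       # SRA-Toolkit: download run data from NCBI
    "fastq-dump",     # SRA-Toolkit: convert downloaded data to FASTQ
    "spades.py",      # SPAdes: genome assembly
    "prokka",         # Prokka: genome annotation
    "bowtie2-build",  # Bowtie2: index building (used with TopHat)
    "tophat2",        # TopHat: read mapping
    "cufflinks",      # Cufflinks: expression quantification
]


def missing_tools(tools=REQUIRED_TOOLS):
    """Return the names in `tools` that cannot be found on PATH."""
    return [tool for tool in tools if shutil.which(tool) is None]


if __name__ == "__main__":
    missing = missing_tools()
    if missing:
        print("Missing tools:", ", ".join(missing))
    else:
        print("All listed tools were found on PATH.")
```

If any name is reported as missing, either install the tool or edit the list to match the executable name used on the system.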
Run the wrapper from the command line:
python3 EcoliWrapper.py
The wrapper generates a 'Results' folder under the '$HOME/' directory. The folder contains the following:
- miniproject.log
  - Command used for SPAdes
  - The number of contigs with a length > 1000
  - The length of the assembly
  - Command used for Prokka
  - Discrepancy between the Prokka annotation and RefSeq for E. coli K-12 (NC_000913)
- long.fasta
  - A FASTA file containing all contigs with a length > 1000
- transcriptome_data.fpkm
  - A CSV file with seqname, start, end, strand, and FPKM for each record
  - Based on SRR1411276 quantification
- SRR and NC_000913 data
  - SRR8185310
  - SRR1411276
  - NC_000913
- SPAdes output
- Prokka output
- EcoliK12_index
  - Built via bowtie2 with NC_000913 data
- TopHat output
- Cufflinks output
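
The contig statistics in miniproject.log and the contents of long.fasta both depend on filtering contigs by length. The sketch below shows one way this could be done in plain Python. It is an illustration, not the wrapper's actual code: the function names are hypothetical, and it assumes that the reported assembly length is the total length of the retained contigs.

```python
def read_fasta(lines):
    """Yield (header, sequence) pairs from an iterable of FASTA lines."""
    header, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A new header closes the previous record, if there was one.
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        else:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def filter_long_contigs(records, min_length=1000):
    """Keep only contigs strictly longer than `min_length`."""
    return [(h, s) for h, s in records if len(s) > min_length]


def summarize(contigs):
    """Return (number of contigs, total length in bp)."""
    return len(contigs), sum(len(seq) for _, seq in contigs)


# Synthetic example data, for illustration only.
example = [">contig_1", "A" * 1500, ">contig_2", "C" * 400, ">contig_3", "G" * 1001]
long_contigs = filter_long_contigs(read_fasta(example))
count, total = summarize(long_contigs)
print(count, total)  # 2 2501
```

To apply this to a real assembly, pass an open file handle for the SPAdes contigs file to `read_fasta` instead of the example list.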
All SRR accessions used for this project are hard-coded in the wrapper, so the user does not need to supply any input.
If needed, the user can substitute a different SRR accession in the wrapper to run the corresponding analysis on that dataset.
- SRR8185310: Used for assembly and annotation, as well as comparisons to RefSeq for E. coli K-12 (NC_000913)
- SRR1411276: Used for mapping and quantification, as well as generating transcriptome_data.fpkm
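
As an illustration of how a file like transcriptome_data.fpkm could be produced, the sketch below pulls seqname, start, end, strand, and FPKM from GTF-formatted Cufflinks output and writes them as CSV. It assumes the input is Cufflinks' transcripts.gtf, where each transcript record carries an `FPKM "..."` attribute. The wrapper itself may read a different Cufflinks file, and the function names and the sample line are hypothetical.

```python
import csv
import io
import re

# Matches the FPKM attribute in the ninth GTF column, e.g. FPKM "12.5";
FPKM_PATTERN = re.compile(r'FPKM "([^"]+)"')


def gtf_to_rows(gtf_lines):
    """Extract [seqname, start, end, strand, FPKM] from transcript records."""
    rows = []
    for line in gtf_lines:
        if not line.strip() or line.startswith("#"):
            continue
        fields = line.rstrip("\n").split("\t")
        # GTF has nine tab-separated columns; skip exon and malformed lines.
        if len(fields) < 9 or fields[2] != "transcript":
            continue
        match = FPKM_PATTERN.search(fields[8])
        if match is None:
            continue
        rows.append([fields[0], fields[3], fields[4], fields[6], match.group(1)])
    return rows


def write_csv(rows, handle):
    """Write the extracted rows to an open text handle as CSV."""
    csv.writer(handle).writerows(rows)


# Synthetic GTF line, for illustration only (not real project output).
sample = [
    'NC_000913.3\tCufflinks\ttranscript\t190\t255\t1000\t+\t.\t'
    'gene_id "CUFF.1"; transcript_id "CUFF.1.1"; FPKM "12.5"; frac "1.0";\n'
]
rows = gtf_to_rows(sample)
buffer = io.StringIO()
write_csv(rows, buffer)
print(buffer.getvalue())
```

Replacing `sample` with an open handle to the real Cufflinks output, and `buffer` with an output file opened with `newline=""`, gives one CSV row per transcript.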