Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

dataprep error and empty output files #170

Open
Michael-m6A opened this issue Dec 28, 2022 · 27 comments
Open

dataprep error and empty output files #170

Michael-m6A opened this issue Dec 28, 2022 · 27 comments

Comments

@Michael-m6A
Copy link

Michael-m6A commented Dec 28, 2022

Hi,
I keep getting an error message and empty output files (data.index, data.json, data.log, data.readcount eventalign.index) when running the xpore dataprep step for genomic coordinates.

I have downloaded the latest reference fasta and gtf files for Aedes aegypti from ENSEMBL as recommended.

Please see the header of the gtf file, script, and error messages below.

Head of gtf file

head /scratch/project_mnt/S0081/Aedes_aegypti_lvpagwg.AaegL5.55.gtf
#!genome-build AaegL5
#!genome-version AaegL5
#!genome-build-accession GCA_002204515.1
2 VectorBase gene 97401212 97402380 . + . gene_id "AAEL020088"; gene_source "VectorBase"; gene_biotype "protein_coding";
2 VectorBase transcript 97401212 97402380 . + . gene_id "AAEL020088"; transcript_id "AAEL020088-RB"; gene_source "VectorBase"; gene_biotype "protein_coding"; transcript_source "VectorBase"; transcript_biotype "protein_coding"; tag "Ensembl_canonical";
2 VectorBase exon 97401212 97401577 . + . gene_id "AAEL020088"; transcript_id "AAEL020088-RB"; exon_number "1"; gene_source "VectorBase"; gene_biotype "protein_coding"; transcript_source "VectorBase"; transcript_biotype "protein_coding"; exon_id "AAEL020088-RB-E1"; tag "Ensembl_canonical";
2 VectorBase CDS 97401561 97401577 . + 0 gene_id "AAEL020088"; transcript_id "AAEL020088-RB"; exon_number "1"; gene_source "VectorBase"; gene_biotype "protein_coding"; transcript_source "VectorBase"; transcript_biotype "protein_coding"; protein_id "AAEL020088-PB"; tag "Ensembl_canonical";
2 VectorBase start_codon 97401561 97401563 . + 0 gene_id "AAEL020088"; transcript_id "AAEL020088-RB"; exon_number "1"; gene_source "VectorBase"; gene_biotype "protein_coding"; transcript_source "VectorBase"; transcript_biotype "protein_coding"; tag "Ensembl_canonical";
2 VectorBase exon 97401632 97401925 . + . gene_id "AAEL020088"; transcript_id "AAEL020088-RB"; exon_number "2"; gene_source "VectorBase"; gene_biotype "protein_coding"; transcript_source "VectorBase"; transcript_biotype "protein_coding"; exon_id "AAEL020088-RB-E2"; tag "Ensembl_canonical";
2 VectorBase CDS 97401632 97401925 . + 1 gene_id "AAEL020088"; transcript_id "AAEL020088-RB"; exon_number "2"; gene_source "VectorBase"; gene_biotype "protein_coding"; transcript_source "VectorBase"; transcript_biotype "protein_coding"; protein_id "AAEL020088-PB"; tag "Ensembl_canonical";

Script

#!/bin/bash

#PBS -l walltime=80:00:00
#PBS -N xpore-dataprep-ENSEMBL-55-genomic-gtf-PNXP22239
#PBS -j oe
#PBS -A UQ-SCI-BiolSci
#PBS -l select=1:ncpus=24:mem=80G

-location of scratch project space-
dir=/scratch/project/m6a-ml-2022

-location for eventalign file on scratch project space-
eventalign=${dir}/PNXP22239-WB-1-ENSEMBL-55-eventalign.txt

-location of ENSEMBL release 55 gtf reference file to work on scratch project space-
gtf=${dir}/Aedes_aegypti_lvpagwg.AaegL5.55.gtf

-location of ENSEMBL release 55 cDNA and ncRNA concatenated reference fasta file to work on scratch project space-
fasta=${dir}/Aedes_aegypti_lvpagwg.AaegL5.cdna.ncrna.all.fa

-xpore loading module-

module load xpore/2.1

-xpore command-
xpore dataprep
--eventalign ${eventalign}
--gtf_or_gff ${gtf}
--transcript_fasta ${fasta}
--out_dir ${dir}/dataprep-ENSEMBL-55-genomic-gtf-PNXP22239
--genome

Job log error message

/sw/QFAB/miniconda3/envs/xpore_2.1/lib/python3.9/site-packages/xpore/scripts/dataprep.py:21: PerformanceWarning: indexing past lexsort depth may impact performance.
pos_end += eventalign_result.loc[index]['line_length'].sum()
/sw/QFAB/miniconda3/envs/xpore_2.1/lib/python3.9/site-packages/xpore/scripts/dataprep.py:21: PerformanceWarning: indexing past lexsort depth may impact performance.
pos_end += eventalign_result.loc[index]['line_length'].sum()
/sw/QFAB/miniconda3/envs/xpore_2.1/lib/python3.9/site-packages/xpore/scripts/dataprep.py:21: PerformanceWarning: indexing past lexsort depth may impact performance.
pos_end += eventalign_result.loc[index]['line_length'].sum()
/sw/QFAB/miniconda3/envs/xpore_2.1/lib/python3.9/site-packages/xpore/scripts/dataprep.py:72: SettingWithCopyWarning:
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

…………………………..

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
chunk_split['line_length'] = np.array(lines)
/sw/QFAB/miniconda3/envs/xpore_2.1/lib/python3.9/site-packages/xpore/scripts/dataprep.py:21: PerformanceWarning: indexing past lexsort depth may impact performance.
pos_end += eventalign_result.loc[index]['line_length'].sum()
/sw/QFAB/miniconda3/envs/xpore_2.1/lib/python3.9/site-packages/xpore/scripts/dataprep.py:21: PerformanceWarning: indexing past lexsort depth may impact performance.
pos_end += eventalign_result.loc[index]['line_length'].sum()
/sw/QFAB/miniconda3/envs/xpore_2.1/lib/python3.9/site-packages/xpore/scripts/dataprep.py:21: PerformanceWarning: indexing past lexsort depth may impact performance.
pos_end += eventalign_result.loc[index]['line_length'].sum()
Traceback (most recent call last):
File "/sw/QFAB/miniconda3/envs/xpore_2.1/bin/xpore", line 10, in
sys.exit(main())
File "/sw/QFAB/miniconda3/envs/xpore_2.1/lib/python3.9/site-packages/xpore/scripts/xpore.py", line 67, in main
options.func(options)
File "/sw/QFAB/miniconda3/envs/xpore_2.1/lib/python3.9/site-packages/xpore/scripts/dataprep.py", line 753, in dataprep
parallel_preprocess_gene(eventalign_filepath,fasta_dict,annotation_dict,is_gff,out_dir,n_processes,readcount_min,readcount_max,resume)
File "/sw/QFAB/miniconda3/envs/xpore_2.1/lib/python3.9/site-packages/xpore/scripts/dataprep.py", line 342, in parallel_preprocess_gene
n_reads, tx_ids, t2g_mapping = t2g(gene_id,fasta_dict,annotation_dict,g2t_mapping,df_eventalign_index,readcount_min)
File "/sw/QFAB/miniconda3/envs/xpore_2.1/lib/python3.9/site-packages/xpore/scripts/dataprep.py", line 88, in t2g
tx_seq = fasta_dict[tx][0]

KeyError: 'AAEL010314-RA'
=>> PBS: job killed: walltime 288068 exceeded limit 288000
########################### Job Execution History #############################
JobId:970386.tinmgr2
UserName:
GroupName:qris-uq
JobName:xpore-dataprep-ENSEMBL-55-genomic-gtf-PNXP22239
SessionId:35971
ResourcesRequested:mem=80gb,ncpus=24,place=free,walltime=80:00:00
ResourcesUsed:cpupercent=185,cput=06:15:32,mem=2491616kb,ncpus=24,vmem=9301936kb,walltime=80:01:08
QueueUsed:General
AccountString:UQ-SCI-BiolSci
ExitStatus:271
###############################################################################

I’m not worried about the performance warning, however, the errors in line 10, 67, 753, 342, and 88 which seem to be the source of the error I don’t understand.

Any guidance would be greatly appreciated.

Many thanks,
Michael

@yuukiiwa
Copy link
Collaborator

yuukiiwa commented Jan 2, 2023

Hi Michael (@Michael-m6A),

Do you mind sharing the download links to the gtf and the fasta files, please?
Thanks!

Best wishes,
Yuk Kei

@Michael-m6A
Copy link
Author

Michael-m6A commented Jan 2, 2023

Hi Yuk (@yuukiiwa),
Absolutely. Please see the links to the fasta and gtf files below.

Index of /pub/metazoa/release-55/fasta/aedes_aegypti_lvpagwg/cdna
http://ftp.ensemblgenomes.org/pub/metazoa/release-55/fasta/aedes_aegypti_lvpagwg/cdna/

Index of /pub/metazoa/release-55/fasta/aedes_aegypti_lvpagwg/ncrna
http://ftp.ensemblgenomes.org/pub/metazoa/release-55/fasta/aedes_aegypti_lvpagwg/ncrna/

Please note: I have concatenated the cdna and ncrna fasta file in order to obtain one reference fasta file.

Index of /pub/metazoa/release-55/gtf/aedes_aegypti_lvpagwg
http://ftp.ensemblgenomes.org/pub/metazoa/release-55/gtf/aedes_aegypti_lvpagwg/

Thanks,
Michael

@yuukiiwa
Copy link
Collaborator

yuukiiwa commented Jan 3, 2023

Hi Michael (@Michael-m6A),

Here is a python script that will convert your merged fasta file into a format that xpore can process (I tested it on the fasta and gtf above):

file=open('merged.fa','r')
outfile=open('new.fa','w')
for ln in file:
 if ln.startswith('>'):
  ln=ln.split()[0]
  print(ln)
  outfile.write(ln+'\n')
 else:
  outfile.write(ln)
outfile.close()

Thanks!

Best wishes,
Yuk Kei

@Michael-m6A
Copy link
Author

Michael-m6A commented Jan 4, 2023

Hi Yuk (@yuukiiwa),

Thank you very much for the python script to make my fasta files usable for xpore.

Please see fasta file before and after applying the python script below:

fasta file before

grep ">" /scratch/project_mnt/S0081/Aedes_aegypti_lvpagwg.AaegL5.cdna.ncrna.all.fa | head -10

AAEL012458-RD cdna chromosome:AaegL5:1:123206413:123240627:1 gene:AAEL012458 gene_biotype:protein_coding transcript_biotype:protein_coding
AAEL012458-RC cdna chromosome:AaegL5:1:122993365:123240627:1 gene:AAEL012458 gene_biotype:protein_coding transcript_biotype:protein_coding
AAEL012458-RB cdna chromosome:AaegL5:1:122788218:123240627:1 gene:AAEL012458 gene_biotype:protein_coding transcript_biotype:protein_coding
AAEL006471-RB cdna chromosome:AaegL5:3:216688037:216701429:-1 gene:AAEL006471 gene_biotype:protein_coding transcript_biotype:protein_coding description:alcohol dehydrogenase
AAEL021257-RB cdna chromosome:AaegL5:1:239266392:239268653:1 gene:AAEL021257 gene_biotype:protein_coding transcript_biotype:protein_coding
AAEL021257-RA cdna chromosome:AaegL5:1:239266375:239268441:1 gene:AAEL021257 gene_biotype:protein_coding transcript_biotype:protein_coding
AAEL020088-RB cdna chromosome:AaegL5:2:97401212:97402380:1 gene:AAEL020088 gene_biotype:protein_coding transcript_biotype:protein_coding
AAEL020088-RA cdna chromosome:AaegL5:2:97401216:97402380:1 gene:AAEL020088 gene_biotype:protein_coding transcript_biotype:nontranslating_CDS
AAEL008603-RG cdna chromosome:AaegL5:2:362937967:362942272:1 gene:AAEL008603 gene_biotype:protein_coding transcript_biotype:protein_coding
AAEL008603-RC cdna chromosome:AaegL5:2:362937967:362942272:1 gene:AAEL008603 gene_biotype:protein_coding transcript_biotype:protein_coding

fasta file after applying python script

grep ">" /scratch/project_mnt/S0081/Aedes_aegypti_lvpagwg.AaegL5.cdna.ncrna.all.xpore.modified.fa | head -10

AAEL012458-RD
AAEL012458-RC
AAEL012458-RB
AAEL006471-RB
AAEL021257-RB
AAEL021257-RA
AAEL020088-RB
AAEL020088-RA
AAEL008603-RG
AAEL008603-RC

The fasta file looks good and I have set up a new xpore dataprep genomic coordinate run using the xpore modified fasta and original ENSEMBL gtf file.

I will let you know how the run goes.

Thanks again!

Regards,
Michael

@Michael-m6A
Copy link
Author

Michael-m6A commented Jan 6, 2023

Hi Yuk (@yuukiiwa),

The current xpore dataprep run finished but unfortunately gives a different error in line 419 and it also exceeded 40 hours of walltime. Please see the error message below.

Error message

cat xpore-dataprep-ENSEMBL-55-genomic-gtf-xpore-modified-PNXP22239.o985129
########################### Execution Started #############################
JobId:985129.tinmgr2
UserName:user
GroupName:qris-uq
ExecutionHost:tn329b
###############################################################################
/sw/QFAB/miniconda3/envs/xpore_2.1/lib/python3.9/site-packages/xpore/scripts/dataprep.py:21: PerformanceWarning: indexing past lexsort depth may impact performance.
pos_end += eventalign_result.loc[index]['line_length'].sum()

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
chunk_split['line_length'] = np.array(lines)
/sw/QFAB/miniconda3/envs/xpore_2.1/lib/python3.9/site-packages/xpore/scripts/dataprep.py:21: PerformanceWarning: indexing past lexsort depth may impact performance.
pos_end += eventalign_result.loc[index]['line_length'].sum()
/sw/QFAB/miniconda3/envs/xpore_2.1/lib/python3.9/site-packages/xpore/scripts/dataprep.py:21: PerformanceWarning: indexing past lexsort depth may impact performance.
pos_end += eventalign_result.loc[index]['line_length'].sum()
/sw/QFAB/miniconda3/envs/xpore_2.1/lib/python3.9/site-packages/xpore/scripts/dataprep.py:21: PerformanceWarning: indexing past lexsort depth may impact performance.
pos_end += eventalign_result.loc[index]['line_length'].sum()

Process Consumer-2:
Traceback (most recent call last):
File "/sw/QFAB/miniconda3/envs/xpore_2.1/lib/python3.9/multiprocessing/process.py", line 315, in _bootstrap
self.run()
File "/sw/QFAB/miniconda3/envs/xpore_2.1/lib/python3.9/site-packages/xpore/scripts/helper.py", line 113, in run
result = self.task_function(*next_task_args,self.locks)
*File "/sw/QFAB/miniconda3/envs/xpore_2.1/lib/python3.9/site-packages/xpore/scripts/dataprep.py", line 419, in preprocess_gene
genomic_coordinate = list(itemgetter(zip(tx_ids,tx_positions))(t2g_mapping)) # genomic_coordinates -- np structured array of 'chr','gene_id','genomic_position','kmer'

KeyError: ('AAEL023294-RA', 6259)
=>> PBS: job killed: walltime 144007 exceeded limit 144000

########################### Job Execution History #############################
JobId:985129.tinmgr2
UserName:user
GroupName:qris-uq
JobName:xpore-dataprep-ENSEMBL-55-genomic-gtf-xpore-modified-PNXP22239
SessionId:45949
ResourcesRequested:mem=80gb,ncpus=24,place=free,walltime=40:00:00
ResourcesUsed:cpupercent=185,cput=06:48:32,mem=2565768kb,ncpus=24,vmem=9406104kb,walltime=40:00:07
QueueUsed:General
AccountString:UQ-SCI-BiolSci
ExitStatus:271
###############################################################################

Many thanks again.

Regards,
Michael

@yuukiiwa
Copy link
Collaborator

yuukiiwa commented Jan 6, 2023

Hi Michael (@Michael-m6A),

Do you mind trying to strip off the -XX in AAEL000000-XX (the -RA in AAEL023294-RA) in the first column of the eventalign.txt file using the script below (not tested)?

file=open('eventalign.txt','r')
outfile=open('modified_eventalign.txt','w')
outfile.write(file.readline())
for ln in file:
 ln=ln.split('\t')
 newln=[ln[0].split('-')[0]]+ln[1:]
 outfile.write('\t'.join(newln))
outfile.close()

Thanks!

Best wishes,
Yuk Kei

@Michael-m6A
Copy link
Author

Hi Yuk (@yuukiiwa),

Thanks for your suggestion, but I am a bit reluctant to delete first column of the eventalign.txt -XX in AAEL000000-XX (the -RA in AAEL023294-RA).

The reason being is that the -RA, -RB etc. are the identifier for the different transcripts/isoforms of a gene. My thinking is that if I delete the -XX in AAEL000000-XX of the eventalign.txt file, then I would need to delete the-XX of the fasta and gtf file as well. However, in my view that would mean I am deleting all the transcripts/isoforms identifiers.

Can you think of another possible solution for the error in line 419 of the xpore python script?

Could it help applying the python script you previously provided to clean up the fasta file and do the same with the gtf file? Please see head of gtf file below.

head /scratch/project_mnt/S0081/Aedes_aegypti_lvpagwg.AaegL5.55.gtf
#!genome-build AaegL5
#!genome-version AaegL5
#!genome-build-accession GCA_002204515.1
2 VectorBase gene 97401212 97402380 . + . gene_id "AAEL020088"; gene_source "VectorBase"; gene_biotype "protein_coding";
2 VectorBase transcript 97401212 97402380 . + . gene_id "AAEL020088"; transcript_id "AAEL020088-RB"; gene_source "VectorBase"; gene_biotype "protein_coding"; transcript_source "VectorBase"; transcript_biotype "protein_coding"; tag "Ensembl_canonical";
2 VectorBase exon 97401212 97401577 . + . gene_id "AAEL020088"; transcript_id "AAEL020088-RB"; exon_number "1"; gene_source "VectorBase"; gene_biotype "protein_coding"; transcript_source "VectorBase"; transcript_biotype "protein_coding"; exon_id "AAEL020088-RB-E1"; tag "Ensembl_canonical";
2 VectorBase CDS 97401561 97401577 . + 0 gene_id "AAEL020088"; transcript_id "AAEL020088-RB"; exon_number "1"; gene_source "VectorBase"; gene_biotype "protein_coding"; transcript_source "VectorBase"; transcript_biotype "protein_coding"; protein_id "AAEL020088-PB"; tag "Ensembl_canonical";
2 VectorBase start_codon 97401561 97401563 . + 0 gene_id "AAEL020088"; transcript_id "AAEL020088-RB"; exon_number "1"; gene_source "VectorBase"; gene_biotype "protein_coding"; transcript_source "VectorBase"; transcript_biotype "protein_coding"; tag "Ensembl_canonical";
2 VectorBase exon 97401632 97401925 . + . gene_id "AAEL020088"; transcript_id "AAEL020088-RB"; exon_number "2"; gene_source "VectorBase"; gene_biotype "protein_coding"; transcript_source "VectorBase"; transcript_biotype "protein_coding"; exon_id "AAEL020088-RB-E2"; tag "Ensembl_canonical";
2 VectorBase CDS 97401632 97401925 . + 1 gene_id "AAEL020088"; transcript_id "AAEL020088-RB"; exon_number "2"; gene_source "VectorBase"; gene_biotype "protein_coding"; transcript_source "VectorBase"; transcript_biotype "protein_coding"; protein_id "AAEL020088-PB"; tag "Ensembl_canonical";

Thanks again!
Michael

@Michael-m6A
Copy link
Author

Hi Yuk (@yuukiiwa),

I have been going through all my scripts from the beginning and just realised that I obviously used the unmodified reference fasta file to align the basecalled fastq reads using minimap2, and then Nanopolish to generate the eventalign files.

Would I be correct in saying that I need to repeat the minimap2 and Nanopolish steps using the xpore_modified reference fasta file.

Best wishes,
Michael

@yuukiiwa
Copy link
Collaborator

Hi Michael (@Michael-m6A),

You don't have to rerun your samples from the beginning using the newly modified reference fasta file.

Regarding not stripping the -XX in AAEL000000-XX:
I don't think the transcript isoform is preserved when you are using the --genome flag with xpore as all transcripts of the same gene are being collapsed to the gene level, where the coordinates would be the genome coordinate instead of the transcriptome coordinate. You will not have to delete the -XX in AAEL000000-XX in the fasta and gtf files.

Thanks!

Best wishes,
Yuk Kei

@Michael-m6A
Copy link
Author

Hi Yuk (@yuukiiwa),

I am very much appreciating you taking the time to explain and troubleshoot this fasta and eventalign file compatibility issue with me. Your explanation totally makes sense to me and we’ve applied the python script you provided to delete the (-XX in AAEL000000-XX) in the eventalign file.
Please see the header of the xpore-modified and unmodified eventalign file below.

Job log
########################### Execution Started #############################
JobId:1040343.tinmgr2
UserName: user
GroupName:xxx
ExecutionHost:tn119a
###############################################################################
6092416079 /scratch/project_mnt/S0081/PNXP22239-WB-1-ENSEMBL-55-xpore-modified-eventalign.txt
6092416079 /scratch/project_mnt/S0081/PNXP22239-WB-1-ENSEMBL-55-eventalign.txt

########################### Job Execution History #############################
JobId:1040343.tinmgr2
UserName:user
GroupName:xx
JobName:Python-xpore-modify-eventalign-file
SessionId:7014
ResourcesRequested:mem=80gb,ncpus=2,place=free,walltime=12:00:00
ResourcesUsed:cpupercent=98,cput=03:56:54,mem=26952kb,ncpus=2,vmem=386740kb,walltime=04:08:10
QueueUsed:General
AccountString:xx
ExitStatus:0

head /scratch/project_mnt/S0081/PNXP22239-WB-1-ENSEMBL-55-eventalign.txt
contig position reference_kmer read_index strand event_index event_level_mean event_stdv event_length model_kmer model_mean model_stdv standardized_level start_idx end_idx
AAEL006471-RB 90 ATAAT 6 t 530 84.32 2.712 0.00700 ATAAT 89.36 3.04 -1.46 66606 66627
AAEL006471-RB 90 ATAAT 6 t 531 87.57 1.950 0.01500 ATAAT 89.36 3.04 -0.52 66561 66606
AAEL006471-RB 91 TAATC 6 t 532 108.49 2.862 0.00433 TAATC 106.55 3.11 0.55 66548 66561
AAEL006471-RB 91 TAATC 6 t 533 111.79 0.902 0.00233 TAATC 106.55 3.11 1.48 66541 66548
AAEL006471-RB 91 TAATC 6 t 534 109.40 1.550 0.00233 TAATC 106.55 3.11 0.81 66534 66541
AAEL006471-RB 91 TAATC 6 t 535 109.79 2.267 0.00400 TAATC 106.55 3.11 0.92 66522 66534
AAEL006471-RB 92 AATCA 6 t 536 99.10 2.201 0.00167 AATCA 100.24 5.70 -0.18 66517 66522
AAEL006471-RB 92 AATCA 6 t 537 90.56 2.237 0.00267 AATCA 100.24 5.70 -1.49 66509 66517
AAEL006471-RB 92 AATCA 6 t 538 94.81 3.837 0.00667 AATCA 100.24 5.70 -0.84 66489 66509

head /scratch/project_mnt/S0081/PNXP22239-WB-1-ENSEMBL-55-xpore-modified-eventalign.txt
contig position reference_kmer read_index strand event_index event_level_mean event_stdv event_length model_kmer model_mean model_stdv standardized_level start_idx end_idx
AAEL006471 90 ATAAT 6 t 530 84.32 2.712 0.00700 ATAAT 89.36 3.04 -1.46 66606 66627
AAEL006471 90 ATAAT 6 t 531 87.57 1.950 0.01500 ATAAT 89.36 3.04 -0.52 66561 66606
AAEL006471 91 TAATC 6 t 532 108.49 2.862 0.00433 TAATC 106.55 3.11 0.55 66548 66561
AAEL006471 91 TAATC 6 t 533 111.79 0.902 0.00233 TAATC 106.55 3.11 1.48 66541 66548
AAEL006471 91 TAATC 6 t 534 109.40 1.550 0.00233 TAATC 106.55 3.11 0.81 66534 66541
AAEL006471 91 TAATC 6 t 535 109.79 2.267 0.00400 TAATC 106.55 3.11 0.92 66522 66534
AAEL006471 92 AATCA 6 t 536 99.10 2.201 0.00167 AATCA 100.24 5.70 -0.18 66517 66522
AAEL006471 92 AATCA 6 t 537 90.56 2.237 0.00267 AATCA 100.24 5.70 -1.49 66509 66517
AAEL006471 92 AATCA 6 t 538 94.81 3.837 0.00667 AATCA 100.24 5.70 -0.84 66489 66509

I have also just submitted a new xpore dataprep run using the xpore-modified fasta and eventalign file. I should have an answer by Monday and will update you on the outcome.

Have a good weekend.

Cheers,
Michael

@Michael-m6A
Copy link
Author

Michael-m6A commented Jan 16, 2023

Hi Yuk (@yuukiiwa),

The xpore-dataprep run finished without giving me an error message, only listed the usual performance warnings. Please see job log below.

cat xpore-dataprep-ENSEMBL-55-genomic-gtf-fasta-eventalign-xpore-modified-PNXP22239.o1042590
########################### Execution Started #############################
JobId:1042590.tinmgr2
UserName:xx
GroupName:qris-uq
ExecutionHost:tn311b
###############################################################################
/sw/QFAB/miniconda3/envs/xpore_2.1/lib/python3.9/site-packages/xpore/scripts/dataprep.py:21: PerformanceWarning: indexing past lexsort depth may impact performance.
pos_end += eventalign_result.loc[index]['line_length'].sum()

/sw/QFAB/miniconda3/envs/xpore_2.1/lib/python3.9/site-packages/xpore/scripts/dataprep.py:21: PerformanceWarning: indexing past lexsort depth may impact performance.
pos_end += eventalign_result.loc[index]['line_length'].sum()
/sw/QFAB/miniconda3/envs/xpore_2.1/lib/python3.9/site-packages/xpore/scripts/dataprep.py:72: SettingWithCopyWarning:
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
chunk_split['line_length'] = np.array(lines)
/sw/QFAB/miniconda3/envs/xpore_2.1/lib/python3.9/site-packages/xpore/scripts/dataprep.py:21: PerformanceWarning: indexing past lexsort depth may impact performance.
pos_end += eventalign_result.loc[index]['line_length'].sum()
/sw/QFAB/miniconda3/envs/xpore_2.1/lib/python3.9/site-packages/xpore/scripts/dataprep.py:21: PerformanceWarning: indexing past lexsort depth may impact performance.
pos_end += eventalign_result.loc[index]['line_length'].sum()
/sw/QFAB/miniconda3/envs/xpore_2.1/lib/python3.9/site-packages/xpore/scripts/dataprep.py:21: PerformanceWarning: indexing past lexsort depth may impact performance.
pos_end += eventalign_result.loc[index]['line_length'].sum()
########################### Job Execution History #############################
JobId:1042590.tinmgr2
UserName:xx
GroupName:qris-uq
JobName:xpore-dataprep-ENSEMBL-55-genomic-gtf-fasta-eventalign-xpore-modified-PNXP22239
SessionId:39824
ResourcesRequested:mem=100gb,ncpus=24,place=free,walltime=40:00:00
ResourcesUsed:cpupercent=182,cput=06:21:30,mem=1766540kb,ncpus=24,vmem=8564532kb,walltime=03:55:07
QueueUsed:General
AccountString: xx
ExitStatus:0
###############################################################################

However, I've continued checking the outcome files of which the data.index data.json data.log data.readcount files are empty and only the eventalign.index contains the information of transcript_id,read_index,pos_start,pos_end, see below.

ls /scratch/project_mnt/xx/dataprep-ENSEMBL-55-genomic-gtf-fasta-eventalign-xpore-modified-PNXP22239
data.index data.json data.log data.readcount eventalign.index

head /scratch/project_mnt/xx/dataprep-ENSEMBL-55-genomic-gtf-fasta-eventalign-xpore-modified-PNXP22239/data.index
idx,start,end

more /scratch/project_mnt/xx/dataprep-ENSEMBL-55-genomic-gtf-fasta-eventalign-xpore-modified-PNXP22239/data.index
idx,start,end

tail /scratch/project_mnt/xx/dataprep-ENSEMBL-55-genomic-gtf-fasta-eventalign-xpore-modified-PNXP22239/data.index
idx,start,end

head /scratch/project_mnt/xx/dataprep-ENSEMBL-55-genomic-gtf-fasta-eventalign-xpore-modified-PNXP22239/data.json

tail /scratch/project_mnt/xx/dataprep-ENSEMBL-55-genomic-gtf-fasta-eventalign-xpore-modified-PNXP22239/data.json

head /scratch/project_mnt/xx/dataprep-ENSEMBL-55-genomic-gtf-fasta-eventalign-xpore-modified-PNXP22239/data.log
Total 0 genes.
--- SUCCESSFULLY FINISHED ---

tail /scratch/project_mnt/xx/dataprep-ENSEMBL-55-genomic-gtf-fasta-eventalign-xpore-modified-PNXP22239/data.log
Total 0 genes.
--- SUCCESSFULLY FINISHED ---

head /scratch/project_mnt/xx/dataprep-ENSEMBL-55-genomic-gtf-fasta-eventalign-xpore-modified-PNXP22239/data.readcount
idx,n_reads

tail /scratch/project_mnt/xx/dataprep-ENSEMBL-55-genomic-gtf-fasta-eventalign-xpore-modified-PNXP22239/data.readcount
idx,n_reads

head /scratch/project_mnt/xx/dataprep-ENSEMBL-55-genomic-gtf-fasta-eventalign-xpore-modified-PNXP22239/eventalign.index
transcript_id,read_index,pos_start,pos_end
AAEL006471,6,172,227770
AAEL006471,10,227770,470953
AAEL006471,15,470953,734427
AAEL006471,14,734427,1002147
AAEL012458,0,1002147,1108621
AAEL006471,9,1108621,1386556
AAEL006471,5,1386556,1664122
AAEL006471,7,1664122,1940105
AAEL006471,8,1940105,2211471

tail /scratch/project_mnt/xx/dataprep-ENSEMBL-55-genomic-gtf-fasta-eventalign-xpore-modified-PNXP22239/eventalign.index
AAEL023197,2964358,552753283315,552753335362
AAEL027615,2964371,552753335362,552753396536
AAEL023197,2964355,552753396536,552753454995
AAEL027615,2964370,552753454995,552753516754
AAEL027615,2964372,552753516754,552753636015
AAEL027615,2964374,552753636015,552753788122
AAEL027615,2964373,552753788122,552753936615
AAEL027615,2964367,552753936615,552754139519
AAEL027615,2964369,552754139519,552754367132
AAEL027615,2964375,552754367132,552754580657

It looks like that the xpore dataprep step worked but data.index data.json data.log data.readcount files being empty makes me a bit suspicious.

Can you please tell me what you think and if this is still an issue point me to a possible solution.

Many thanks again!

Cheers,
Michael

@yuukiiwa
Copy link
Collaborator

Hi Michael (@Michael-m6A),

Do you mind using the following GTF to run xpore dataprep?
https://drive.google.com/file/d/1KLikdHKevy7sDWnppo2LmazeehTrKCUy/view?usp=sharing
Thanks!

Best wishes,
Yuk Kei

@Michael-m6A
Copy link
Author

Hi Yuk (@yuukiiwa),

Absolutely, happy to do so. I have submitted a new xpore dataprep run using your modified gtf file. I am currently in a queue on our HPC waiting for my job submission to run. We should have an outcome by tomorrow which I’ll post again.

Thank you!

Cheers,
Michael

@Michael-m6A
Copy link
Author

Hi Yuk (@yuukiiwa),

I am afraid but it looks like using the modified gft file you provided did not solved the issue. The xpore dataprep run using the modified gtf file produced the following error messages (line 10, 67, 753, 342, and 88) and the output files are empty. Please see below.

Job log

cat xpore-dataprep-ENSEMBL-55-genomic-xpore-modified-gtf-file-PNXP22239.o1055583
########################### Execution Started #############################
JobId:1055583.tinmgr2
UserName:xx
GroupName:qris-uq
ExecutionHost:tn425b
###############################################################################
/sw/QFAB/miniconda3/envs/xpore_2.1/lib/python3.9/site-packages/xpore/scripts/dataprep.py:21: PerformanceWarning: indexing past lexsort depth may impact performance.
pos_end += eventalign_result.loc[index]['line_length'].sum()

/sw/QFAB/miniconda3/envs/xpore_2.1/lib/python3.9/site-packages/xpore/scripts/dataprep.py:21: PerformanceWarning: indexing past lexsort depth may impact performance.
pos_end += eventalign_result.loc[index]['line_length'].sum()
/sw/QFAB/miniconda3/envs/xpore_2.1/lib/python3.9/site-packages/xpore/scripts/dataprep.py:72: SettingWithCopyWarning:
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
chunk_split['line_length'] = np.array(lines)
/sw/QFAB/miniconda3/envs/xpore_2.1/lib/python3.9/site-packages/xpore/scripts/dataprep.py:21: PerformanceWarning: indexing past lexsort depth may impact performance.
pos_end += eventalign_result.loc[index]['line_length'].sum()
/sw/QFAB/miniconda3/envs/xpore_2.1/lib/python3.9/site-packages/xpore/scripts/dataprep.py:21: PerformanceWarning: indexing past lexsort depth may impact performance.
pos_end += eventalign_result.loc[index]['line_length'].sum()

/sw/QFAB/miniconda3/envs/xpore_2.1/lib/python3.9/site-packages/xpore/scripts/dataprep.py:21: PerformanceWarning: indexing past lexsort depth may impact performance.
pos_end += eventalign_result.loc[index]['line_length'].sum()
Traceback (most recent call last):
File "/sw/QFAB/miniconda3/envs/xpore_2.1/bin/xpore", line 10, in
sys.exit(main())
File "/sw/QFAB/miniconda3/envs/xpore_2.1/lib/python3.9/site-packages/xpore/scripts/xpore.py", line 67, in main
options.func(options)
File "/sw/QFAB/miniconda3/envs/xpore_2.1/lib/python3.9/site-packages/xpore/scripts/dataprep.py", line 753, in dataprep
parallel_preprocess_gene(eventalign_filepath,fasta_dict,annotation_dict,is_gff,out_dir,n_processes,readcount_min,readcount_max,resume)
File "/sw/QFAB/miniconda3/envs/xpore_2.1/lib/python3.9/site-packages/xpore/scripts/dataprep.py", line 342, in parallel_preprocess_gene
n_reads, tx_ids, t2g_mapping = t2g(gene_id,fasta_dict,annotation_dict,g2t_mapping,df_eventalign_index,readcount_min)
File "/sw/QFAB/miniconda3/envs/xpore_2.1/lib/python3.9/site-packages/xpore/scripts/dataprep.py", line 88, in t2g
tx_seq = fasta_dict[tx][0]
KeyError: 'AAEL014640'
=>> PBS: job killed: walltime 72045 exceeded limit 72000

########################### Job Execution History #############################
JobId:1055583.tinmgr2
UserName:xx
GroupName:qris-uq
JobName:xpore-dataprep-ENSEMBL-55-genomic-xpore-modified-gtf-file-PNXP22239
SessionId:24225
ResourcesRequested:mem=100gb,ncpus=24,place=free,walltime=20:00:00
ResourcesUsed:cpupercent=185,cput=06:15:28,mem=2215896kb,ncpus=24,vmem=9023764kb,walltime=20:00:45
QueueUsed:General
AccountString:UQ-SCI-BiolSci
ExitStatus:271
###############################################################################

Inspected output files

ls /scratch/project_mnt/xx/dataprep-ENSEMBL-55-genomic-xpore-modified-gtf-file-PNXP22239
data.index data.json data.log data.readcount eventalign.index

head /scratch/project_mnt/xx/dataprep-ENSEMBL-55-genomic-xpore-modified-gtf-file-PNXP22239/data.index
idx,start,end

tail /scratch/project_mnt/xx/dataprep-ENSEMBL-55-genomic-xpore-modified-gtf-file-PNXP22239/data.index
idx,start,end

head /scratch/project_mnt/xx/dataprep-ENSEMBL-55-genomic-xpore-modified-gtf-file-PNXP22239/data.json
nothing

head /scratch/project_mnt/xx/dataprep-ENSEMBL-55-genomic-xpore-modified-gtf-file-PNXP22239/data.log
nothing

Usually here we see
Total xx genes.
--- SUCCESSFULLY FINISHED ---

head /scratch/project_mnt/xx/dataprep-ENSEMBL-55-genomic-xpore-modified-gtf-file-PNXP22239/eventalign.index
transcript_id,read_index,pos_start,pos_end
AAEL006471,6,172,227770
AAEL006471,10,227770,470953
AAEL006471,15,470953,734427
AAEL006471,14,734427,1002147
AAEL012458,0,1002147,1108621
AAEL006471,9,1108621,1386556
AAEL006471,5,1386556,1664122
AAEL006471,7,1664122,1940105
AAEL006471,8,1940105,2211471

Thank you again.

Cheers,
Michael

@yuukiiwa
Copy link
Collaborator

Hi Michael (@Michael-m6A),

I have tried running xpore dataprep with the 10-line eventalign.txt you provided above and the following modified fasta and gtf. Here are the top 10 lines of the files:

eventalign.txt

contig	position	reference_kmer	read_index	strand	event_index	event_level_mean	event_stdv	event_length	model_kmer	model_mean	model_stdv	standardized_level	start_idx	end_idx
AAEL006471	90	ATAAT	6	t	530	84.32	2.712	0.00700	ATAAT	89.36	3.04	-1.46	66606	66627
AAEL006471	90	ATAAT	6	t	531	87.57	1.950	0.01500	ATAAT	89.36	3.04	-0.52	66561	66606
AAEL006471	91	TAATC	6	t	532	108.49	2.862	0.00433	TAATC	106.55	3.11	0.5566548	66561
AAEL006471	91	TAATC	6	t	533	111.79	0.902	0.00233	TAATC	106.55	3.11	1.4866541	66548
AAEL006471	91	TAATC	6	t	534	109.40	1.550	0.00233	TAATC	106.55	3.11	0.8166534	66541
AAEL006471	91	TAATC	6	t	535	109.79	2.267	0.00400	TAATC	106.55	3.11	0.9266522	66534
AAEL006471	92	AATCA	6	t	536	99.10	2.201	0.00167	AATCA	100.24	5.70	-0.18	66517	66522
AAEL006471	92	AATCA	6	t	537	90.56	2.237	0.00267	AATCA	100.24	5.70	-1.49	66509	66517
AAEL006471	92	AATCA	6	t	538	94.81	3.837	0.00667	AATCA	100.24	5.70	-0.84	66489	66509

fasta

>AAEL012458
ATTTGAACATTATTGCCGTATAGCAAAATCGTGAAAAAAGTTCAAATTAGTTTTAAAATA
TGTGATGCAATCAGTGCCCGTACAAACACTATGGTGAAATAAAGCTGCGCATTTTACTCC
AACGTTCCTCAAGTGCCAAATATTCTGAACAACTTCTCCATCAATTATGCGAATCCTGTT
CCATAACGAAAACTATGCCAACTACGAAGACTCAGGTGTGGCCACAAGTGCCGATGCCCA
TTCTCTGGTATTCGACGAACCGAACACACCGCCACCTCCCGGCCAAGGAAAATGTCCATC
GACTGGGGGGCCAATTGTGCTGGCAGCTGCCACGGCCATCGGTCCCATCACTGCGACCGG
AAATGGTACCGTCACTGGGTCAATGTCAACCCAGCAACCCTCCTCCGGAACAGGCCCAAT
TCCTTCAACCGGGGTTACTCAAAGGGCGTCGAGCCGTATGTCCAATACCAGCACCAGTAC
GATCAAGCAACATTACTACCCGGAGGGTGGATGGGGCTGGATCGTGATTGTGGTGGGCAT

gtf

2	VectorBase	gene	97401212	97402380	.	+	.	gene_id "AAEL020088"; gene_source "VectorBase"; gene_biotype "protein_coding";
2	VectorBase	transcript	97401212	97402380	.	+	.	gene_id "AAEL020088"; transcript_id "AAEL020088"; gene_source "VectorBase"; gene_biotype "protein_coding"; transcript_source "VectorBase"; transcript_biotype "protein_coding"; tag "Ensembl_canonical";
2	VectorBase	exon	97401212	97401577	.	+	.	gene_id "AAEL020088"; transcript_id "AAEL020088"; exon_number "1"; gene_source "VectorBase"; gene_biotype "protein_coding"; transcript_source "VectorBase"; transcript_biotype "protein_coding"; exon_id "AAEL020088-RB-E1"; tag "Ensembl_canonical";
2	VectorBase	CDS	97401561	97401577	.	+	0	gene_id "AAEL020088"; transcript_id "AAEL020088"; exon_number "1"; gene_source "VectorBase"; gene_biotype "protein_coding"; transcript_source "VectorBase"; transcript_biotype "protein_coding"; protein_id "AAEL020088-PB"; tag "Ensembl_canonical";
2	VectorBase	start_codon	97401561	97401563	.	+	0	gene_id "AAEL020088"; transcript_id "AAEL020088"; exon_number "1"; gene_source "VectorBase"; gene_biotype "protein_coding"; transcript_source "VectorBase"; transcript_biotype "protein_coding"; tag "Ensembl_canonical";
2	VectorBase	exon	97401632	97401925	.	+	.	gene_id "AAEL020088"; transcript_id "AAEL020088"; exon_number "2"; gene_source "VectorBase"; gene_biotype "protein_coding"; transcript_source "VectorBase"; transcript_biotype "protein_coding"; exon_id "AAEL020088-RB-E2"; tag "Ensembl_canonical";
2	VectorBase	CDS	97401632	97401925	.	+	1	gene_id "AAEL020088"; transcript_id "AAEL020088"; exon_number "2"; gene_source "VectorBase"; gene_biotype "protein_coding"; transcript_source "VectorBase"; transcript_biotype "protein_coding"; protein_id "AAEL020088-PB"; tag "Ensembl_canonical";
2	VectorBase	exon	97401984	97402380	.	+	.	gene_id "AAEL020088"; transcript_id "AAEL020088"; exon_number "3"; gene_source "VectorBase"; gene_biotype "protein_coding"; transcript_source "VectorBase"; transcript_biotype "protein_coding"; exon_id "AAEL020088-RB-E3"; tag "Ensembl_canonical";
2	VectorBase	CDS	97401984	97402101	.	+	1	gene_id "AAEL020088"; transcript_id "AAEL020088"; exon_number "3"; gene_source "VectorBase"; gene_biotype "protein_coding"; transcript_source "VectorBase"; transcript_biotype "protein_coding"; protein_id "AAEL020088-PB"; tag "Ensembl_canonical";
2	VectorBase	stop_codon	97402102	97402104	.	+	0	gene_id "AAEL020088"; transcript_id "AAEL020088"; exon_number "3"; gene_source "VectorBase"; gene_biotype "protein_coding"; transcript_source "VectorBase"; transcript_biotype "protein_coding"; tag "Ensembl_canonical";

My data.json output is not empty, which means that in theory your data.json should not be empty:
Screenshot from 2023-01-19 14-08-39

Thanks!

Best wishes,
Yuk Kei

@Michael-m6A
Copy link
Author

Michael-m6A commented Jan 23, 2023

Hi Yuk (@yuukiiwa),

Thanks very much for running xpore dataprep with the 10-line eventalign.txt file.

I have kind of good news with one remaining barrier to overcome (error in line 419), please see below.

Fasta file
Firstly, I have inspected all files (fasta, gtf, and eventalign) used in my previous xpore dataprep run using genomic coordinates and discovered that the original python script previously provided for the fasta file only removed certain header information but did not remove the transcript identification (-RA).

head /scratch/project_mnt/xx/Aedes_aegypti_lvpagwg.AaegL5.cdna.ncrna.all.xpore.modified.fa

AAEL012458-RD
ATTTGAACATTATTGCCGTATAGCAAAATCGTGAAAAAAGTTCAAATTAGTTTTAAAATA
TGTGATGCAATCAGTGCCCGTACAAACACTATGGTGAAATAAAGCTGCGCATTTTACTCC
AACGTTCCTCAAGTGCCAAATATTCTGAACAACTTCTCCATCAATTATGCGAATCCTGTT
CCATAACGAAAACTATGCCAACTACGAAGACTCAGGTGTGGCCACAAGTGCCGATGCCCA
TTCTCTGGTATTCGACGAACCGAACACACCGCCACCTCCCGGCCAAGGAAAATGTCCATC
GACTGGGGGGCCAATTGTGCTGGCAGCTGCCACGGCCATCGGTCCCATCACTGCGACCGG
AAATGGTACCGTCACTGGGTCAATGTCAACCCAGCAACCCTCCTCCGGAACAGGCCCAAT
TCCTTCAACCGGGGTTACTCAAAGGGCGTCGAGCCGTATGTCCAATACCAGCACCAGTAC
GATCAAGCAACATTACTACCCGGAGGGTGGATGGGGCTGGATCGTGATTGTGGTGGGCAT

We modified your python script and removed the -RA, see below.

file=open('/scratch/project_mnt/xx/Aedes_aegypti_lvpagwg.AaegL5.cdna.ncrna.all.fa','r')
outfile=open('/scratch/project_mnt/xx/Aedes_aegypti_lvpagwg.AaegL5.cdna.ncrna.all.xpore.modified.fa','w')
for ln in file:
if ln.startswith('>'):
ln=ln.split()[0]
ln=ln.split('-')[0]
#print(ln)
outfile.write(ln+'\n')
else:
outfile.write(ln)
outfile.close()

xpore fully modified fasta file

head /scratch/project_mnt/xx/Aedes_aegypti_lvpagwg.AaegL5.cdna.ncrna.all.xpore.modified.fa

AAEL012458
ATTTGAACATTATTGCCGTATAGCAAAATCGTGAAAAAAGTTCAAATTAGTTTTAAAATA
TGTGATGCAATCAGTGCCCGTACAAACACTATGGTGAAATAAAGCTGCGCATTTTACTCC
AACGTTCCTCAAGTGCCAAATATTCTGAACAACTTCTCCATCAATTATGCGAATCCTGTT
CCATAACGAAAACTATGCCAACTACGAAGACTCAGGTGTGGCCACAAGTGCCGATGCCCA
TTCTCTGGTATTCGACGAACCGAACACACCGCCACCTCCCGGCCAAGGAAAATGTCCATC
GACTGGGGGGCCAATTGTGCTGGCAGCTGCCACGGCCATCGGTCCCATCACTGCGACCGG
AAATGGTACCGTCACTGGGTCAATGTCAACCCAGCAACCCTCCTCCGGAACAGGCCCAAT
TCCTTCAACCGGGGTTACTCAAAGGGCGTCGAGCCGTATGTCCAATACCAGCACCAGTAC
GATCAAGCAACATTACTACCCGGAGGGTGGATGGGGCTGGATCGTGATTGTGGTGGGCAT

Xpore dataprep run – genomic coordinates using the fully modified fasta, gtf, and eventalign files

Now to the kind of good news, following the xpore-dataprep run using the fully xpore-modified fasta, gtf, and eventalign files all the outcome files contain expected data, see below.

However, the xpore-dataprep run still gets stuck at around 6 hours and in the job, log gave me this error message, also see below.

ls /scratch/project_mnt/xx/dataprep-ENSEMBL-55-genomic-xpore-modified-All-files-PNXP22239
data.index data.json data.log data.readcount eventalign.index

head /scratch/project_mnt/xx/dataprep-ENSEMBL-55-genomic-xpore-modified-All-files-PNXP22239/data.index
idx,start,end
AAEL019853,0,84036
AAEL008063,84036,792134
AAEL001394,792134,3104813
AAEL018094,3104813,3124935
AAEL004980,3124935,4075246
AAEL027344,4075246,4079148
AAEL010082,4079148,4125980
AAEL012245,4125980,4159251
AAEL025417,4159251,4168695

head /scratch/project_mnt/xx/dataprep-ENSEMBL-55-genomic-xpore-modified-All-files-PNXP22239/data.json
{"AAEL019853":{"64326275":{"AGAAA":[87.0]},"64326276":{"CAGAA":[111.1]},"64326277":{"TCAGA":[114.1]},"64326278":{"ATCAG":[80.9]},"64326279":{"TATCA":[93.4]},"64326280":{"GTATC":[84.0]},"64326281":{"AGTAT":[119.0]},"64326282":{"GAGTA":[115.2]},"64326283":{"TGAGT":[119.2]},"64326284":{"TTGAG":[95.2]},"64326285":{"ATTGA":[92.0]},"64326288":{"AAGAT":[122.5]},"64326289":{"GAAGA":[114.1]},"64326290":{"AGAAG":[135.5]},"64326291":{"AAGAA":[120.0]},"64326292":{"AAAGA":[109.1]},"64326293":{"TAAAG":[108.6]},"64326294":{"ATAAA":[90.9]},"64326295":{"AATAA":[98.8]},"64326296":{"GAATA":[115.7]},"64326297":{"AGAAT":[134.7]},"64326298":{"AAGAA":[124.7]},"64326299":{"TAAGA":[119.4]},"64326300":{"ATAAG":[78.9]},"64326301":{"TATAA":[80.0]},"64326302":{"GTATA":[92.0]},"64326303":{"GGTAT":[90.0]},"64326304":{"TGGTA":[121.6]},"64326305":{"ATGGT":[97.0]},"64326306":{"AATGG":[85.9]},"64326307":{"GAATG":[104.4]},"64326308":{"GGAAT":[124.7]},"64326309":{"TGGAA":[112.8]},"64326310":{"TTGGA":[96.9]},"64326311":{"CTTGG":[92.7]},"64326312":{"TCTTG":[80.1]},"64326313":{"TTCTT":[78.2]},"64326314":{"TTTCT":[80.0]},"64326315":{"TTTTC":[76.5]},"64326316":{"GTTTT":[76.9]},"64326317":{"GGTTT":[92.7]},"64326318":{"AGGTT":[124.7]},"64326319":{"AAGGT":[120.7]},"64326320":{"AAAGG":[109.1]},"64326321":{"GAAAG":[100.7]},"64326322":{"TGAAA":[111.4]},"64326323":{"ATGAA":[90.4]},"64326324":{"GATGA":[78.1]},"64326325":{"GGATG":[130.5]},"64326326":{"TGGAT":[128.8]},"64326327":{"ATGGA":[104.3]},"64326328":{"GATGG":[84.7]},"64326329":{"GGATG":[118.2]},"64326330":{"TGGAT":[128.5]},"64326331":{"GTGGA":[100.1]},"64326332":{"GGTGG":[120.0]},"64326333":{"GGGTG":[128.1]},"64326334":{"TGGGT":[115.8]},"64326335":{"CTGGG":[97.7]},"64326336":{"GCTGG":[89.4]},"64326337":{"TGCTG":[99.8]},"64326338":{"TTGCT":[98.5]},"64326339":{"GTTGC":[85.4]},"64326340":{"CGTTG":[93.0]},"64326341":{"GCGTT":[84.0]},"64326342":{"TGCGT":[94.9]},"64326343":{"ATGCG":[87.3]},"64326344":{"AATGC":[76.7]},"64326345":{"TAATG":[93.4]},"64326346":{"ATAAT":[89.3]},"64326347":{"AATAA":[94.4]},"64326348":{"GAATA":[112.9]},"6432

head /scratch/project_mnt/xx/dataprep-ENSEMBL-55-genomic-xpore-modified-All-files-PNXP22239/data.log
AAEL019853: True
AAEL008063: True
AAEL001394: True
AAEL018094: True
AAEL004980: True
AAEL027344: True
AAEL010082: True
AAEL012245: True
AAEL025417: True
AAEL004472: True

head /scratch/project_mnt/xx/dataprep-ENSEMBL-55-genomic-xpore-modified-All-files-PNXP22239/data.readcount
idx,n_reads
AAEL019853,6
AAEL008063,63
AAEL001394,192
AAEL018094,1
AAEL004980,100
AAEL027344,3
AAEL010082,2
AAEL012245,2
AAEL025417,1

head /scratch/project_mnt/xx/dataprep-ENSEMBL-55-genomic-xpore-modified-All-files-PNXP22239/eventalign.index
transcript_id,read_index,pos_start,pos_end
AAEL006471,6,172,227770
AAEL006471,10,227770,470953
AAEL006471,15,470953,734427
AAEL006471,14,734427,1002147
AAEL012458,0,1002147,1108621
AAEL006471,9,1108621,1386556
AAEL006471,5,1386556,1664122
AAEL006471,7,1664122,1940105
AAEL006471,8,1940105,2211471

Running time when xpore dataprep job got stuck

Job id Name User Time Use S Queue


1067207.tinmgr2 xpore-dataprep- xx 06:27:24 R General

Job log
cat xpore-dataprep-ENSEMBL-55-genomic-xpore-modified-All-files-PNXP22239.o1067207

########################### Execution Started #############################
JobId:1067207.tinmgr2
UserName:xx
GroupName:qris-uq
ExecutionHost:tn313c
###############################################################################
/sw/QFAB/miniconda3/envs/xpore_2.1/lib/python3.9/site-packages/xpore/scripts/dataprep.py:21: PerformanceWarning: indexing past lexsort depth may impact performance.
pos_end += eventalign_result.loc[index]['line_length'].sum()
/sw/QFAB/miniconda3/envs/xpore_2.1/lib/python3.9/site-packages/xpore/scripts/dataprep.py:21: PerformanceWarning: indexing past lexsort depth may impact performance.
pos_end += eventalign_result.loc[index]['line_length'].sum()

/sw/QFAB/miniconda3/envs/xpore_2.1/lib/python3.9/site-packages/xpore/scripts/dataprep.py:21: PerformanceWarning: indexing past lexsort depth may impact performance.
pos_end += eventalign_result.loc[index]['line_length'].sum()
/sw/QFAB/miniconda3/envs/xpore_2.1/lib/python3.9/site-packages/xpore/scripts/dataprep.py:72: SettingWithCopyWarning:
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
chunk_split['line_length'] = np.array(lines)
/sw/QFAB/miniconda3/envs/xpore_2.1/lib/python3.9/site-packages/xpore/scripts/dataprep.py:21: PerformanceWarning: indexing past lexsort depth may impact performance.
pos_end += eventalign_result.loc[index]['line_length'].sum()
/sw/QFAB/miniconda3/envs/xpore_2.1/lib/python3.9/site-packages/xpore/scripts/dataprep.py:21: PerformanceWarning: indexing past lexsort depth may impact performance.
pos_end += eventalign_result.loc[index]['line_length'].sum()
/sw/QFAB/miniconda3/envs/xpore_2.1/lib/python3.9/site-packages/xpore/scripts/dataprep.py:21: PerformanceWarning: indexing past lexsort depth may impact performance.
pos_end += eventalign_result.loc[index]['line_length'].sum()
Process Consumer-2:
Traceback (most recent call last):
File "/sw/QFAB/miniconda3/envs/xpore_2.1/lib/python3.9/multiprocessing/process.py", line 315, in _bootstrap
self.run()
File "/sw/QFAB/miniconda3/envs/xpore_2.1/lib/python3.9/site-packages/xpore/scripts/helper.py", line 113, in run
result = self.task_function(*next_task_args,self.locks)
*File "/sw/QFAB/miniconda3/envs/xpore_2.1/lib/python3.9/site-packages/xpore/scripts/dataprep.py", line 419, in preprocess_gene
genomic_coordinate = list(itemgetter(zip(tx_ids,tx_positions))(t2g_mapping)) # genomic_coordinates -- np structured array of 'chr','gene_id','genomic_position','kmer'
KeyError: ('AAEL024786', 2984)
=>> PBS: job killed: walltime 72066 exceeded limit 72000

########################### Job Execution History #############################
JobId:1067207.tinmgr2
UserName:xx
GroupName:qris-uq
JobName:xpore-dataprep-ENSEMBL-55-genomic-xpore-modified-All-files-PNXP22239
SessionId:20878
ResourcesRequested:mem=80gb,ncpus=24,place=free,walltime=20:00:00
ResourcesUsed:cpupercent=184,cput=06:27:24,mem=2496020kb,ncpus=24,vmem=9283068kb,walltime=20:01:06
QueueUsed:General
AccountString:UQ-SCI-BiolSci
ExitStatus:271
###############################################################################

It appears that there is still an issue with the genomic coordinate conversion
line 419, in preprocess_gene genomic_coordinate = list(itemgetter(*zip(tx_ids,tx_positions))(t2g_mapping)) # genomic_coordinates -- np structured array of 'chr','gene_id','genomic_position','kmer' “
which I don’t quite understand as we have removed all the transcript identification data (i.e. -RA).

Given that it specifically listed KeyError: ('AAEL024786', 2984), I have inspected the modified gtf and eventalign.txt files.

grep AAEL024786 /scratch/project_mnt/xx/xpore_modified_Aedes_aegypti_lvpagwg.AaegL5.55.gtf
1 VectorBase gene 86105077 86118053 . - . gene_id "AAEL024786"; gene_source "VectorBase"; gene_biotype "protein_coding";
1 VectorBase transcript 86105077 86118053 . - . gene_id "AAEL024786"; transcript_id "AAEL024786"; gene_source "VectorBase"; gene_biotype "protein_coding"; transcript_source "VectorBase"; transcript_biotype "protein_coding"; tag "Ensembl_canonical";
1 VectorBase exon 86117901 86118053 . - . gene_id "AAEL024786"; transcript_id "AAEL024786"; exon_number "1"; gene_source "VectorBase"; gene_biotype "protein_coding"; transcript_source "VectorBase"; transcript_biotype "protein_coding"; exon_id "AAEL024786-RA-E1"; tag "Ensembl_canonical";
1 VectorBase CDS 86117901 86118051 . - 0 gene_id "AAEL024786"; transcript_id "AAEL024786"; exon_number "1"; gene_source "VectorBase"; gene_biotype "protein_coding"; transcript_source "VectorBase"; transcript_biotype "protein_coding"; protein_id "AAEL024786-PA"; tag "Ensembl_canonical";
1 VectorBase exon 86117589 86117828 . - . gene_id "AAEL024786"; transcript_id "AAEL024786"; exon_number "2"; gene_source "VectorBase"; gene_biotype "protein_coding"; transcript_source "VectorBase"; transcript_biotype "protein_coding"; exon_id "AAEL024786-RA-E2"; tag "Ensembl_canonical";
1 VectorBase CDS 86117589 86117828 . - 2 gene_id "AAEL024786"; transcript_id "AAEL024786"; exon_number "2"; gene_source "VectorBase"; gene_biotype "protein_coding"; transcript_source "VectorBase"; transcript_biotype "protein_coding"; protein_id "AAEL024786-PA"; tag "Ensembl_canonical";
1 VectorBase exon 86117354 86117511 . - . gene_id "AAEL024786"; transcript_id "AAEL024786"; exon_number "3"; gene_source "VectorBase"; gene_biotype "protein_coding"; transcript_source "VectorBase"; transcript_biotype "protein_coding"; exon_id "AAEL024786-RA-E3"; tag "Ensembl_canonical";
1 VectorBase CDS 86117354 86117511 . - 2 gene_id "AAEL024786"; transcript_id "AAEL024786"; exon_number "3"; gene_source "VectorBase"; gene_biotype "protein_coding"; transcript_source "VectorBase"; transcript_biotype "protein_coding"; protein_id "AAEL024786-PA"; tag "Ensembl_canonical";
1 VectorBase exon 86105077 86107509 . - . gene_id "AAEL024786"; transcript_id "AAEL024786"; exon_number "4"; gene_source "VectorBase"; gene_biotype "protein_coding"; transcript_source "VectorBase"; transcript_biotype "protein_coding"; exon_id "AAEL024786-RA-E4"; tag "Ensembl_canonical";
1 VectorBase CDS 86107264 86107509 . - 0 gene_id "AAEL024786"; transcript_id "AAEL024786"; exon_number "4"; gene_source "VectorBase"; gene_biotype "protein_coding"; transcript_source "VectorBase"; transcript_biotype "protein_coding"; protein_id "AAEL024786-PA"; tag "Ensembl_canonical";
1 VectorBase stop_codon 86107261 86107263 . - 0 gene_id "AAEL024786"; transcript_id "AAEL024786"; exon_number "4"; gene_source "VectorBase"; gene_biotype "protein_coding"; transcript_source "VectorBase"; transcript_biotype "protein_coding"; tag "Ensembl_canonical";
1 VectorBase five_prime_utr 86118052 86118053 . - . gene_id "AAEL024786"; transcript_id "AAEL024786"; gene_source "VectorBase"; gene_biotype "protein_coding"; transcript_source "VectorBase"; transcript_biotype "protein_coding"; tag "Ensembl_canonical";
1 VectorBase three_prime_utr 86105077 86107260 . - . gene_id "AAEL024786"; transcript_id "AAEL024786"; gene_source "VectorBase"; gene_biotype "protein_coding"; transcript_source "VectorBase"; transcript_biotype "protein_coding"; tag "Ensembl_canonical";

Contains AAEL024786

Data log

grep AAEL024786 /scratch/project_mnt/xx/dataprep-ENSEMBL-55-genomic-xpore-modified-All-files-PNXP22239/data.log

wc -l /scratch/project_mnt/xx/dataprep-ENSEMBL-55-genomic-xpore-modified-All-files-PNXP22239/data.log
133 /scratch/project_mnt/xx/dataprep-ENSEMBL-55-genomic-xpore-modified-All-files-PNXP22239/data.log

modified-eventalign – search for AAEL024786 (mentioned in error message)

grep AAEL024786 /scratch/project_mnt/xx/PNXP22239-WB-1-ENSEMBL-55-xpore-modified-eventalign.txt

AAEL024786 304 TCGGT 1708964 t 9 96.18 3.014 0.00500 TCGGT 95.29 4.98 0.16 52031 52046
AAEL024786 304 TCGGT 1708964 t 10 98.23 1.941 0.00233 TCGGT 95.29 4.98 0.53 52024 52031
AAEL024786 304 TCGGT 1708964 t 11 94.83 3.748 0.00433 TCGGT 95.29 4.98 -0.08 52011 52024
AAEL024786 305 CGGTC 1708964 t 12 118.01 5.639 0.01100 CGGTC 117.18 3.37 0.22 51978 52011
AAEL024786 305 CGGTC 1708964 t 13 119.51 8.057 0.02167 CGGTC 117.18 3.37 0.62 51913 51978
………….

AAEL024786 3900 GTATA 1709033 t 1308 86.73 1.652 0.00700 GTATA 89.66 2.63 -0.87 2336 2357
AAEL024786 3901 TATAA 1709033 t 1309 83.04 1.997 0.00467 TATAA 92.84 6.28 -1.22 2322 2336
AAEL024786 3902 ATAAT 1709033 t 1310 82.56 2.438 0.00633 ATAAT 89.36 3.04 -1.75 2303 2322
AAEL024786 3902 ATAAT 1709033 t 1311 87.65 1.897 0.00967 ATAAT 89.36 3.04 -0.44 2274 2303
AAEL024786 3902 ATAAT 1709033 t 1312 86.09 2.294 0.02900 ATAAT 89.36 3.04 -0.84 2187 2274
AAEL024786 3902 ATAAT 1709033 t 1313 72.46 2.759 0.00933 NNNNN 0.00 0.00 inf 2159 2187
AAEL024786 3913 ATTAT 1709033 t 1314 63.95 1.483 0.00500 ATTAT 85.44 2.46 -6.83 2144 2159

Contains AAEL024786, however, data.log shows that it only progressed until 133 and then produced the error message.

I am very appreciative of all your hep and time invested in resolving the xpore dataprep genomic coordinate issue, and we’re getting close. Please have a look at the error message and my attempted file investigation, I am hopeful that the error message in line 419 is resolvable.

Thanks again!

Cheers,
Michael

@yuukiiwa
Copy link
Collaborator

Hi Michael (@Michael-m6A),

Sorry for the delayed reply! (It was Chinese New Year out here)

The transcript_id AAEL024786 is there but position 2984 is not found in AAEL024786, as the annotation dictionary shows that the last nucleotide of the transcript is at position 2983:

{'chr': '1', 'g_id': 'AAEL024786', 'strand': '-', 'transcript': (86105077, 86118053),
 'exon': [(86117901, 86118053), (86117589, 86117828), (86117354, 86117511), (86105077, 86107509)], 
 'tx_exon': [(0, 152), (153, 392), (393, 550), (551, 2983)]}

I think this is more of a 0-base vs 1-base problem, which doesn't quite exist in widely used genomes (e.g. human, mouse, and Arabidopsis) from the same reference source.

I have edited the function in the michael branch, and you can try installing and running it to see whether it solves your problem:

git clone https://github.com/GoekeLab/xpore.git
git checkout origin/michael
sudo python3 setup.py install

The annotation for AAEL024786 should include position 2984:

{'chr': '1', 'g_id': 'AAEL024786', 'strand': '-', 'transcript': (86105077, 86118053), 
 'exon': [(86117901, 86118053), (86117589, 86117828), (86117354, 86117511), (86105077, 86107509)], 
 'tx_exon': [(1, 153), (154, 393), (394, 551), (552, 2984)]}

Thanks!

Best wishes,
Yuk Kei

@Michael-m6A
Copy link
Author

Michael-m6A commented Feb 20, 2023

Hi Yuk (@yuukiiwa),

No problem at all! My apologies for not responding to you for the last 3 weeks. I was on leave as we welcomed a new baby girl into our family.

I am incredible thankful to you for providing me with this edited function (michael branch).

I have requested this edited function to be installed on our institutional HPC server, unfortunately, that is not the fastest process.

I will keep you up to date as soon as I have any developments.

Thanks again,
Michael

@yuukiiwa
Copy link
Collaborator

Hi Michael (@Michael-m6A),

Congratulations!! Take your time!

Best wishes,
Yuk Kei

@Michael-m6A
Copy link
Author

Hi Yuk (@yuukiiwa),
Just want to give you a quick update. I am experiencing more delays in being able to test your xPore edited function (michael branch) as our organisation permanently decommissioned the HPC infrastructure/server I was using. The migration to the new HPC environment is underway but none of the tools/modules I need to use are installed yet. I am working on getting everything I required for the m6A analysis to be installed.

I will post here again once I have made actual progress with the analysis using the xPore edited function (michael branch) you kindly provided.

Thanks,
Michael

@Michael-m6A
Copy link
Author

Michael-m6A commented Apr 19, 2023

Hi Yuk (@yuukiiwa),

We could finally install your modified xpore version on the new HPC infrastructure.

I have used fasta, eventalign, and gtf (containing cDNA and ncRNA information) files without (-RA) and submitted the xpore-developer dataprep job.

However, it did produce a similar error message like we got previously, with the difference that it is now the position leading to the error in the gene 'AAEL024786' is 2985 instead of previously 2984, please see below.

Error message modified xpore developer

File "/home/s4303883/.local/lib/python3.9/site-packages/xpore-2.1-py3.9.egg/xpore/scripts/dataprep.py", line 419, in preprocess_gene
genomic_coordinate = list(itemgetter(*zip(tx_ids,tx_positions))(t2g_mapping)) # genomic_coordinates -- np structured array of 'chr','gene_id','genomic_position','kmer'
KeyError: ('AAEL024786', 2985)

Error message standard xpore
*File "/sw/QFAB/miniconda3/envs/xpore_2.1/lib/python3.9/site-packages/xpore/scripts/dataprep.py", line 419, in preprocess_gene
genomic_coordinate = list(itemgetter(zip(tx_ids,tx_positions))(t2g_mapping)) # genomic_coordinates -- np structured array of 'chr','gene_id','genomic_position','kmer'
KeyError: ('AAEL024786', 2984)
=>> PBS: job killed: walltime 72066 exceeded limit 72000

Error message modified xpore developer in detail

-rw-r--r--. 1 xx qris-uq 1433120 Apr 16 04:34 s3535379_job.xpore-dev-dataprep-ENSEMBL-55-genomic-gtf.error
-rw-r--r--. 1 xx qris-uq 0 Apr 14 12:04 s3535379_job.xpore-dev-dataprep-ENSEMBL-55-genomic-gtf.out

cat xx_job.xpore-dev-dataprep-ENSEMBL-55-genomic-gtf.error

[xx@bunya3 ~]$ cat xx_job.xpore-dev-dataprep-ENSEMBL-55-genomic-gtf.error
/home/user/.local/lib/python3.9/site-packages/xpore-2.1-py3.9.egg/xpore/scripts/dataprep.py:21: PerformanceWarning: indexing past lexsort depth may impact performance.
pos_end += eventalign_result.loc[index]['line_length'].sum()
/home/user/.local/lib/python3.9/site-packages/xpore-2.1-py3.9.egg/xpore/scripts/dataprep.py:21: PerformanceWarning: indexing past lexsort depth may impact performance.
pos_end += eventalign_result.loc[index]['line_length'].sum()

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
chunk_split['line_length'] = np.array(lines)
/home/user/.local/lib/python3.9/site-packages/xpore-2.1-py3.9.egg/xpore/scripts/dataprep.py:21: PerformanceWarning: indexing past lexsort depth may impact performance.
pos_end += eventalign_result.loc[index]['line_length'].sum()
/home/user/.local/lib/python3.9/site-packages/xpore-2.1-py3.9.egg/xpore/scripts/dataprep.py:21: PerformanceWarning: indexing past lexsort depth may impact performance.
pos_end += eventalign_result.loc[index]['line_length'].sum()
/home/user/.local/lib/python3.9/site-packages/xpore-2.1-py3.9.egg/xpore/scripts/dataprep.py:21: PerformanceWarning: indexing past lexsort depth may impact performance.
pos_end += eventalign_result.loc[index]['line_length'].sum()
Process Consumer-2:
Traceback (most recent call last):
File "/sw/auto/rocky8.6/epyc3/software/Python/3.9.5-GCCcore-10.3.0-bare/lib/python3.9/multiprocessing/process.py", line 315, in _bootstrap
self.run()
File "/home/user/.local/lib/python3.9/site-packages/xpore-2.1-py3.9.egg/xpore/scripts/helper.py", line 113, in run
result = self.task_function(*next_task_args,self.locks)
File "/home/user/.local/lib/python3.9/site-packages/xpore-2.1-py3.9.egg/xpore/scripts/dataprep.py", line 419, in preprocess_gene
genomic_coordinate = list(itemgetter(*zip(tx_ids,tx_positions))(t2g_mapping)) # genomic_coordinates -- np structured array of 'chr','gene_id','genomic_position','kmer'
KeyError: ('AAEL024786', 2985)
slurmstepd: error: *** JOB 3535379 ON bun031 CANCELLED AT 2023-04-16T04:34:40 DUE TO TIME LIMIT ***

Error message standard xpore in detail

Traceback (most recent call last):
File "/sw/QFAB/miniconda3/envs/xpore_2.1/lib/python3.9/multiprocessing/process.py", line 315, in _bootstrap
self.run()
File "/sw/QFAB/miniconda3/envs/xpore_2.1/lib/python3.9/site-packages/xpore/scripts/helper.py", line 113, in run
result = self.task_function(*next_task_args,self.locks)
File "/sw/QFAB/miniconda3/envs/xpore_2.1/lib/python3.9/site-packages/xpore/scripts/dataprep.py", line 419, in preprocess_gene
genomic_coordinate = list(itemgetter(*zip(tx_ids,tx_positions))(t2g_mapping)) # genomic_coordinates -- np structured array of 'chr','gene_id','genomic_position','kmer'
KeyError: ('AAEL024786', 2984)
=>> PBS: job killed: walltime 72066 exceeded limit 72000
########################### Job Execution History #############################

I have also inspected all output files and all of them contain data outputs, please see below.

Output:modified xpore developer run on 17-04-2023

drwxr-sr-x. 2 user Q5334RW 4096 Apr 14 16:58 dataprep-ENSEMBL-55-genomic-gtf-fasta-eventalign-xpore-dev-modified-PNXP22239

ls /scratch/project_mnt/user/dataprep-ENSEMBL-55-genomic-gtf-fasta-eventalign-xpore-dev-modified-PNXP22239

data.index data.json data.log data.readcount eventalign.index

head /scratch/project_mnt/user/dataprep-ENSEMBL-55-genomic-gtf-fasta-eventalign-xpore-dev-modified-PNXP22239/data.index

idx,start,end
AAEL017263,0,2030676
AAEL012250,2030676,2799959
AAEL021326,2799959,2820651
AAEL017012,2820651,5097858
AAEL014368,5097858,5571925
AAEL002886,5571925,14939749
AAEL009400,14939749,15243925
AAEL001846,15243925,15257916
AAEL019615,15257916,15292102

head /scratch/project_mnt/user/dataprep-ENSEMBL-55-genomic-gtf-fasta-eventalign-xpore-dev-modified-PNXP22239/data.json

{"AAEL017263":{"78078919":{"GGAAA":[119.4,115.5,71.2,116.0,104.5,106.4,68.5,107.8,109.7,104.6,105.7,113.8,114.0,102.8,109.9,108.3,108.4,111.4,90.6,110.3,112.1,111.5,102.0,111.2,110.9,111.5,108.6,108.7,85.0,111.6,109.6,107.9,109.9,107.0,108.1,100.4,112.4,107.3,111.4,86.5,109.9,107.9,111.4,106.1,110.9,110.9,113.9,105.8,110.1,90.9,101.3,86.8,109.4,110.8,85.7,108.1,108.0,110.7,109.1,70.0,117.4,113.6,111.0,106.0,93.5,108.7,110.4,111.5,93.4,88.2,68.2,108.3,111.6,112.1,112.8,109.4,111.3,108.4,109.7,91.1,113.5,116.7,70.6,108.9,111.7,109.4,105.5,91.4,110.1,108.5,100.0,103.7,108.5,121.4,110.5,103.7,91.5,86.5,107.7,109.0,108.9,92.1,103.3,112.3,106.6,94.1,85.5,110.8,108.9,73.7,107.8,110.0,108.8,91.3,110.5,111.7,112.9,109.1,110.8,64.4,108.3,109.0,106.6,109.9,103.3,108.4,107.6,108.0,115.1,113.1,110.6,108.0,106.3,111.9,87.6,105.6,109.4,111.5,107.5,108.6,110.6,110.9,97.4]},"78078920":{"TGGAA":

etc.

TAT":[84.1]},"84837":{"GAATA":[111.4]},"84838":{"AGAAT":[128.7]},"84839":{"CAGAA":[121.1]},"84840":{"GCAGA":[95.8]},"84841":{"CGCAG":[92.5]},"84842":{"CCGCA":[84.0]},"84843":{"GCCGC":[76.5]},"84844":{"TGCCG":[99.9]},"84845":{"ATGCC":[83.0]},"84846":{"CATGC":[74.0]},"84847":{"ACATG":[80.4]},"84848":{"AACAT":[89.0]},"84849":{"CAACA":[91.6]},"84850":{"GCAAC":[83.1]},"84851":{"GGCAA":[110.9]},"84852":{"GGGCA":[105.4]},"84853":{"TGGGC":[113.5]},"84854":{"ATGGG":[97.8]},"84855":{"TATGG":[84.7]},"84856":{"ATATG":[85.9]},"84857":{"AATAT":[92.1]},"84858":{"CAATA":[112.4]},"84859":{"CCAAT":[90.0]},"84860":{"GCCAA":[74.3]},"84861":{"GGCCA":[103.8]},"84862":{"TGGCC":[110.0]},"84863":{"CTGGC":[104.3]},"84864":{"GCTGG":[89.3]}}}

head /scratch/project_mnt/user/dataprep-ENSEMBL-55-genomic-gtf-fasta-eventalign-xpore-dev-modified-PNXP22239/data.log

AAEL017263: True
AAEL012250: True
AAEL021326: True
AAEL017012: False
AAEL014368: True
AAEL002886: True
AAEL009400: False
AAEL001846: True
AAEL019615: True
AAEL024674: True

head /scratch/project_mnt/user/dataprep-ENSEMBL-55-genomic-gtf-fasta-eventalign-xpore-dev-modified-PNXP22239/data.readcount

idx,n_reads
AAEL017263,245
AAEL012250,115
AAEL021326,7
AAEL017012,326
AAEL014368,124
AAEL002886,1001
AAEL009400,267
AAEL001846,1
AAEL019615,3

head /scratch/project_mnt/user/dataprep-ENSEMBL-55-genomic-gtf-fasta-eventalign-xpore-dev-modified-PNXP22239/eventalign.index

transcript_id,read_index,pos_start,pos_end
AAEL006471,6,172,227770
AAEL006471,10,227770,470953
AAEL006471,15,470953,734427
AAEL006471,14,734427,1002147
AAEL012458,0,1002147,1108621
AAEL006471,9,1108621,1386556
AAEL006471,5,1386556,1664122
AAEL006471,7,1664122,1940105
AAEL006471,8,1940105,2211471

Any help will be greatly appreciated again.

Many thanks,
Michael

@Michael-m6A
Copy link
Author

Hi Yuk (@yuukiiwa),

Would it be possible to get your help on this xpore dataprep genomic coordinate error?

I have used the modified version (michael branch) you kindly provided to me in January but still get an error message one position further down the line.

Error message modified xpore developer
File "/home/s4303883/.local/lib/python3.9/site-packages/xpore-2.1-py3.9.egg/xpore/scripts/dataprep.py", line 419, in preprocess_gene
genomic_coordinate = list(itemgetter(*zip(tx_ids,tx_positions))(t2g_mapping)) # genomic_coordinates -- np structured array of 'chr','gene_id','genomic_position','kmer'
KeyError: ('AAEL024786', 2985)

Please see further details and the previous error message in the post from the 19th of April above.

Many thanks in advance,
Michael

@Michael-m6A
Copy link
Author

Michael-m6A commented Jul 24, 2023

Hi Ploy (@ploy-np), Yuk (@yuukiiwa), and Jonathan (@jonathangoeke),

I am writing to all of you in the hope to get your help with the xpore dataprep genomic coordinate error.

Please see the error message in full in my previous post from 19th of April above where I used the modified version (michael branch) you kindly provided.

Error message modified xpore developer
File "/home/s4303883/.local/lib/python3.9/site-packages/xpore-2.1-py3.9.egg/xpore/scripts/dataprep.py", line 419, in preprocess_gene
genomic_coordinate = list(itemgetter(*zip(tx_ids,tx_positions))(t2g_mapping)) # genomic_coordinates -- np structured array of 'chr','gene_id','genomic_position','kmer'
KeyError: ('AAEL024786', 2985)

Thank you in advance,
Michael

@lyj95618
Copy link

lyj95618 commented Oct 2, 2023

Hi Ploy (@ploy-np), Yuk (@yuukiiwa), and Jonathan (@jonathangoeke),

I am writing to all of you in the hope to get your help with the xpore dataprep genomic coordinate error.

Please see the error message in full in my previous post from 19th of April above where I used the modified version (michael branch) you kindly provided.

Error message modified xpore developer File "/home/s4303883/.local/lib/python3.9/site-packages/xpore-2.1-py3.9.egg/xpore/scripts/dataprep.py", line 419, in preprocess_gene genomic_coordinate = list(itemgetter(*zip(tx_ids,tx_positions))(t2g_mapping)) # genomic_coordinates -- np structured array of 'chr','gene_id','genomic_position','kmer' KeyError: ('AAEL024786', 2985)

Thank you in advance, Michael

Hi Michael, I am wondering if you have resolved this issue or have any update on it. I am currently running Xpore and experiencing a very similar issue as you.

Thanks,
Laur

@Michael-m6A
Copy link
Author

Hi Laur (@lyj95618) and Yuk (@yuukiiwa),

Unfortunately, I have not been able to resolve the xPore dataprep converting transcriptome positions to genomic coordinates issue.

However, I did investigate if they are variations between the reference annotation files and the number of exons from two different databases.

I have compared the Aedes aegypti annotation files from ENSEMBL and VectorBase for the gene
“AAEL024786” which produced the error in xPore dataprep for me.

The location coordinates and number of exons for “AAEL024786” are identical in both ENSEMBL and VectorBase database annotation files.

My conclusion so far is that the source of the problem is within the xPore transcriptome to genomic coordinates (t2g_mapping) code, see error message which is the same as your error below.

File "/sw/QFAB/miniconda3/envs/xpore_2.1/lib/python3.9/site-packages/xpore/scripts/dataprep.py", line 419, in preprocess_gene
genomic_coordinate = list(itemgetter(*zip(tx_ids,tx_positions))(t2g_mapping)) # genomic_coordinates -- np structured array of 'chr','gene_id','genomic_position','kmer'
KeyError: ('AAEL024786', 2985)

Please let me know if you find a way to overcome this xPore dataprep genomic coordinates problem.

Many thanks,
Michael

@lyj95618
Copy link

Hi Michael @Michael-m6A,

I have also done some investigations on the errors and found that it was the issue in the transcriptome to genomic coordinates conversion code.

I tried to run Xpore without the --genome flag which meant I didn't want the transcriptome to genomic coordinates conversion and the whole thing finished smoothly without any error message. The only problem with this is the output location is in transcriptomic, not genomic.

I might go to try to download the most recent version of Ensembl cdna and GTF and see if that helps to overcome the genomic coordinate issue.

Thanks!
Laur

@Michael-m6A
Copy link
Author

Hi Laur (@lyj95618),

Thank you very much for sharing!

I also run xPore without the -–genome flag which worked well and gave me the locations within the transcriptome.

However, being able to get the genomic coordinates of the modifications would be really valuable. Hopefully, the @GoekeLab can fix the bug/error in their xPore transcriptome to genomic coordinates conversion code.

Thanks,
Michael

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants