Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Segmentation fault on scan_all function #302

Open
nikolamilosevic86 opened this issue Mar 14, 2023 · 5 comments
Open

Segmentation fault on scan_all function #302

nikolamilosevic86 opened this issue Mar 14, 2023 · 5 comments

Comments

@nikolamilosevic86
Copy link

Describe the bug
When I try to perform Motif.scan_all on HNF1A, I get segmentation fault

To Reproduce
Steps to reproduce the behavior:

  1. Export whole genome sequence
  2. Take MA0046.2 jaspar
  3. Run the following code, where mart_export.txt is gene sequence:
from gimmemotifs.motif import Motif,read_motifs
from gimmemotifs.fasta import Fasta
jaspar_motif_file = "MA0046.2.jaspar"
motifs = read_motifs(jaspar_motif_file, fmt="jaspar")
f = Fasta("mart_export.txt")
m = motifs[0]
la = m.scan_all(f)
print(la)

Expected behavior
Predictions are printed out

Error logs
Process finished with exit code 139 (interrupted by signal 11: SIGSEGV)

@siebrenf
Copy link
Member

Thank you for sharing this! I was able to reproduce the error with test data, but it depended on the input fasta!

from gimmemotifs.motif import Motif, read_motifs
from gimmemotifs.fasta import Fasta


jaspar_motif_file="data/motif_databases/JASPAR2022_vertebrates.pfm"
motifs = read_motifs(jaspar_motif_file)
print(len([m for m in motifs if m.id.startswith("MA0046.2")]))  # prints 1

m = [m for m in motifs if m.id.startswith("MA0046.2")][0]
print(m.id)  # prints 'MA0046.2_MA0046.2.HNF1A'

ff = "test/data/scan/genome/scan_test.fa"  # segfault
ff = "test/data/genomes/hg38sample.fa"  # no segfault

f = Fasta(ff)
m.scan_all(f)

The file with proper output contained one chromosome with zero hits, so I'm not sure what causes this. @simonvh, could you have a look if you have the time? This is happening inside the C function pfmscan.

@nikolamilosevic86
Copy link
Author

nikolamilosevic86 commented Mar 16, 2023

Hello,

I believe, I have identified problem - when a sequence in FASTA file is shorter then motif, I believe that is when segmentation fault happens.

For example if I have something like this in FASTA file (usually incomplete CDS):

ENSG00000146733|ENST00000416592
ATGGTCT

And consensus is for example (https://jaspar.genereg.net/matrix/MA0114.4/):
'nnCAAAGTCCAnn'

Similar happens on the other short sequences.

@siebrenf
Copy link
Member

siebrenf commented Mar 17, 2023

I only got a segfault with a long sequence, but not with a short sequence!
Perhaps I'm missing something though, so I made a PRs that might fix either cases: #303 and #304

If you could test these, or provide me with your input data, that would be wonderful :)

@nikolamilosevic86
Copy link
Author

You can download file I have used here: https://filebin.net/sxmgfi99b5rnzb1y

@siebrenf
Copy link
Member

Thanks!

I used your example code, with your mart file, and this MA0046.2 motif (I guess its the same you had?) but did not get a segfault!

I'm using a beefy PC, so my theory is that I have a bigger CPU cache (stack, I think?). If that is the case, #303 / #305 should fix this issue. Let me know what you think @simonvh :)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants