Incompatibility with certain common PWM files #313

Open
Mitmischer opened this issue Mar 20, 2024 · 0 comments

Describe the bug
gimme scan appears to be incompatible with custom PWM files whose motif header lines contain spaces, i.e. an id followed by a description. Note that JASPAR itself returns files formatted like this (for example https://jaspar.elixir.no/api/v1/matrix/MA0002.1.jaspar), and my motif database has those spaces, too.

To Reproduce
See error logs.

Expected behavior
gimme should support PWM files with spaces in the motif description (or give more helpful error messages).
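
A possible workaround until then is to shorten the header lines before scanning. The sketch below keeps only the first whitespace-delimited token of every line starting with ">". It assumes that this token is a unique motif id, which I have not verified for every file, and whether this is enough for gimme scan is untested. The function name strip_header_descriptions and the matrix counts are made up for illustration; only the header text comes from the error log.

```python
def strip_header_descriptions(text: str) -> str:
    """Keep only the first whitespace-delimited token of each '>' header line."""
    out = []
    for line in text.splitlines():
        if line.startswith(">"):
            tokens = line[1:].split()
            # Leave an empty header untouched rather than guessing an id.
            if tokens:
                line = ">" + tokens[0]
        out.append(line)
    return "\n".join(out) + "\n"


# Toy input: the header is taken from the KeyError in the logs,
# the counts are invented purely for illustration.
toy = (
    ">AC0081:NFIA_NFIC:SMAD AC0081:NFIA/NFIC:SMAD\n"
    "A [ 1 2 ]\n"
    "C [ 3 4 ]\n"
    "G [ 5 6 ]\n"
    "T [ 7 8 ]\n"
)

cleaned = strip_header_descriptions(toy)
print(cleaned.splitlines()[0])
```

Matrix rows are passed through unchanged, so only the headers differ between input and output.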

Error logs
I reproduced the behaviour for two input files:

$ gimme scan in/IFIH1.fasta -g ../reference_dbs/Homo_sapiens.GRCh38.dna_sm.primary_assembly.fa -p ../5/in2/consensus_pwms_stripped.jaspar
# GimmeMotifs version 0.18.0
# Input: in/IFIH1.fasta
# Motifs: ../5/in2/consensus_pwms_stripped.jaspar
# FPR: 0.01 (../reference_dbs/Homo_sapiens.GRCh38.dna_sm.primary_assembly.fa)
# Scoring: logodds score
Scanning:   0%|          | 0/1 [00:00<?, ? sequences/s]
Traceback (most recent call last):
  File "/home/mabe/.conda/envs/memegimme/bin/gimme", line 12, in <module>
    cli(sys.argv[1:])
  File "/home/mabe/.conda/envs/memegimme/lib/python3.10/site-packages/gimmemotifs/cli.py", line 755, in cli
    args.func(args)
  File "/home/mabe/.conda/envs/memegimme/lib/python3.10/site-packages/gimmemotifs/commands/pfmscan.py", line 20, in pfmscan
    scan_to_file(
  File "/home/mabe/.conda/envs/memegimme/lib/python3.10/site-packages/gimmemotifs/scanner/__init__.py", line 402, in scan_to_file
    for line in command_scan(
  File "/home/mabe/.conda/envs/memegimme/lib/python3.10/site-packages/gimmemotifs/scanner/__init__.py", line 287, in command_scan
    for row in it:
  File "/home/mabe/.conda/envs/memegimme/lib/python3.10/site-packages/gimmemotifs/scanner/__init__.py", line 224, in scan_normal
    for i, result in enumerate(result_it):
  File "/home/mabe/.conda/envs/memegimme/lib/python3.10/site-packages/gimmemotifs/scanner/base.py", line 573, in scan
    for result in it:
  File "/home/mabe/.conda/envs/memegimme/lib/python3.10/site-packages/gimmemotifs/scanner/base.py", line 637, in _scan_sequences
    motifs = [(m, thresholds[m.id]) for m in read_motifs(self.motifs)]
  File "/home/mabe/.conda/envs/memegimme/lib/python3.10/site-packages/gimmemotifs/scanner/base.py", line 637, in <listcomp>
    motifs = [(m, thresholds[m.id]) for m in read_motifs(self.motifs)]
KeyError: 'AC0081:NFIA_NFIC:SMAD AC0081:NFIA/NFIC:SMAD'
Scanning:   0%|   
$ gimme scan in/IFIH1.fasta -g ../reference_dbs/Homo_sapiens.GRCh38.dna_sm.primary_assembly.fa -p ../5/in2/consensus_pwms.jaspar
2024-03-20 15:57:51,945 - WARNING - multiple motifs with same id: AC0001:GATA_PROP:GATA AC0001:GATA/PROP:GATA
<....SNIP....>
2024-03-20 15:57:53,062 - WARNING - multiple motifs with same id: AC0637:AHR:bHLH AC0637:AHR:bHLH
# GimmeMotifs version 0.18.0
# Input: in/IFIH1.fasta
# Motifs: ../5/in2/consensus_pwms.jaspar
# FPR: 0.01 (../reference_dbs/Homo_sapiens.GRCh38.dna_sm.primary_assembly.fa)
# Scoring: logodds score
2024-03-20 15:57:53,302 - WARNING - multiple motifs with same id: AC0001:GATA_PROP:GATA AC0001:GATA/PROP:GATA
2024-03-20 15:57:53,304 - WARNING - multiple motifs with same id: AC0002:PROP_ALX:Homeodomain AC0002:PROP/ALX:Homeodomain
2024-03-20 15:57:53,306 - WARNING - multiple motifs with same id: AC0003:HNF1A_HNF1B:Homeodomain AC0003:HNF1A/HNF1B:Homeodomain
2024-03-20 15:57:53,308 - WARNING - multiple motifs with same id: AC0004:ZSCAN:C2H2_ZF AC0004:ZSCAN:C2H2_ZF
2024-03-20 15:57:53,310 - WARNING - multiple motifs with same id: AC0005:POU3F_POU1F:Homeodomain,POU AC0005:POU3F/POU1F:Homeodomain,POU
2024-03-20 15:57:53,311 - WARNING - multiple motifs with same id: AC0006:MEOX:Homeodomain AC0006:MEOX:Homeodomain
2024-03-20 15:57:53,313 - WARNING - multiple motifs with same id: AC0007:BARX_NKX:Homeodomain AC0007:BARX/NKX:Homeodomain
2024-03-20 15:57:53,315 - WARNING - multiple motifs with same id: AC0008:VENTX:Homeodomain AC0008:VENTX:Homeodomain
2024-03-20 15:57:53,316 - WARNING - multiple motifs with same id: AC0009:PAX_VSX:Homeodomain AC0009:PAX/VSX:Homeodomain
<....SNIP....>
2024-03-20 15:57:59,796 - WARNING - multiple motifs with same id: AC0637:AHR:bHLH AC0637:AHR:bHLH
Traceback (most recent call last):
  File "/home/mabe/.conda/envs/memegimme/bin/gimme", line 12, in <module>
    cli(sys.argv[1:])
  File "/home/mabe/.conda/envs/memegimme/lib/python3.10/site-packages/gimmemotifs/cli.py", line 755, in cli
    args.func(args)
  File "/home/mabe/.conda/envs/memegimme/lib/python3.10/site-packages/gimmemotifs/commands/pfmscan.py", line 20, in pfmscan
    scan_to_file(
  File "/home/mabe/.conda/envs/memegimme/lib/python3.10/site-packages/gimmemotifs/scanner/__init__.py", line 402, in scan_to_file
    for line in command_scan(
  File "/home/mabe/.conda/envs/memegimme/lib/python3.10/site-packages/gimmemotifs/scanner/__init__.py", line 287, in command_scan
    for row in it:
  File "/home/mabe/.conda/envs/memegimme/lib/python3.10/site-packages/gimmemotifs/scanner/__init__.py", line 224, in scan_normal
    for i, result in enumerate(result_it):
  File "/home/mabe/.conda/envs/memegimme/lib/python3.10/site-packages/gimmemotifs/scanner/base.py", line 573, in scan
    for result in it:
  File "/home/mabe/.conda/envs/memegimme/lib/python3.10/site-packages/gimmemotifs/scanner/base.py", line 636, in _scan_sequences
    thresholds = self.get_gc_thresholds(seqs, zscore=zscore)
  File "/home/mabe/.conda/envs/memegimme/lib/python3.10/site-packages/gimmemotifs/scanner/base.py", line 605, in get_gc_thresholds
    maxt = pd.Series([m.max_score for m in motifs], index=_threshold.columns)
  File "/home/mabe/.conda/envs/memegimme/lib/python3.10/site-packages/pandas/core/series.py", line 461, in __init__
    com.require_length_match(data, index)
  File "/home/mabe/.conda/envs/memegimme/lib/python3.10/site-packages/pandas/core/common.py", line 571, in require_length_match
    raise ValueError(
ValueError: Length of values (1274) does not match length of index (1273)

The PWM files are attached: consensus_pwms.zip. Without specifying my own matrices, gimme scan works.
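
The 1274 vs. 1273 mismatch in the second run would be consistent with two entries sharing one id and collapsing into a single column. This is a guess from the logs, not something confirmed in the gimmemotifs source. A minimal sketch for checking a motif file for repeated header lines, assuming every motif starts with a ">" line; the helper names and the toy headers are hypothetical:

```python
from collections import Counter


def header_ids(text: str) -> list[str]:
    """Return the text after '>' for every header line, whitespace-trimmed."""
    return [line[1:].strip() for line in text.splitlines() if line.startswith(">")]


def duplicated(ids: list[str]) -> dict[str, int]:
    """Map each id that occurs more than once to its number of occurrences."""
    return {i: n for i, n in Counter(ids).items() if n > 1}


# Toy headers (invented): three entries, two of which share the same header.
toy_headers = ">M1 first\n>M2 second\n>M2 second\n"

ids = header_ids(toy_headers)
dupes = duplicated(ids)
# Number of entries that would be lost if ids were used as unique keys.
lost = len(ids) - len(set(ids))
print(dupes, lost)
```

On a real file, read its contents into a string first and pass that to header_ids; a lost count of 1 would match the difference between 1274 and 1273.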

Installation information (please complete the following information):

  • OS: [Ubuntu 22.04.4 LTS]
  • Installation [conda]
  • Version [0.18.0]

