Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Unable to load annotation for custom #29

Open
LPerlaza opened this issue Dec 12, 2023 · 14 comments
Open

Unable to load annotation for custom #29

LPerlaza opened this issue Dec 12, 2023 · 14 comments

Comments

@LPerlaza
Copy link

Hi, I am running the Scenic+plus pipeline with Zebra fish and I am getting Unable to load annotation for custom warnings when I run run_pycistarget

All databases created specifically for Zebra fish have been generated:

from pycistarget.motif_enrichment_cistarget import *
rankings_db = 'Drerio_GRCz11_ensembl.regions_vs_motifs.rankings.feather'
scores_db = 'Drerio_GRCz11_ensembl.regions_vs_motifs.scores.feather'
motif_annotation = 'motifs-v10-zfish.tbl'

I have custom Annotation

dataset = pbm.Dataset(name='drerio_gene_ensembl',  host='http://www.ensembl.org')
annot = dataset.query(attributes=['chromosome_name', 'transcription_start_site', 'strand', 'external_gene_name', 'transcript_biotype'])
annot.columns=['Chromosome', 'Start', 'Strand', 'Gene', 'Transcript_type']
annot = annot[annot.Transcript_type == 'protein_coding']

and this is how I am running run_pycistarget

from scenicplus.wrappers.run_pycistarget import run_pycistarget
run_pycistarget(region_sets,
                 ctx_db_path = rankings_db,
                 save_path = './SCENIC_results/motifs/',
                 species='custom',
                 custom_annot=annot,
                 dem_db_path = scores_db,
                 run_without_promoters = True,
                 promoter_space = 500,
                 ctx_auc_threshold = 0.005,
                 ctx_nes_threshold = 3.0,
                 ctx_rank_threshold = 0.05,
                 dem_log2fc_thr = 0.5,
                 dem_motif_hit_thr = 3.0,
                 dem_max_bg_regions = 500,
                 path_to_motif_annotations = motif_annotation,
                 annotation_version = 'v10nr_clust',
                 annotation = ['Direct_annot', 'Orthology_annot'],
                 n_cpu = 1,
                  _temp_dir = os.path.join(tmpDir, 'ray_spill'))

warning Unable to load annotation for custom

2023-12-08 13:41:16,364 pycisTarget_wrapper INFO     [./SCENIC_results/motifs/](https://vscode-remote+ssh-002dremote-002bbio1.vscode-resource.vscode-cdn.net/home/lper0012/tasks/margo.montandon/Analysis/SCENIC_results/motifs/) folder already exists.
2023-12-08 13:41:16,405 pycisTarget_wrapper INFO     Loading cisTarget database for Topics
2023-12-08 13:41:16,406 cisTarget    INFO     Reading cisTarget database
2023-12-08 13:42:54,968 pycisTarget_wrapper INFO     Running cisTarget for Topics
2023-12-08 13:42:54,970 cisTarget    INFO     Running cisTarget for Topic1 which has 10925 regions
2023-12-08 13:43:10,108 cisTarget    INFO     Annotating motifs for Topic1
2023-12-08 13:43:10,123 cisTarget    INFO     Unable to load annotation for custom
2023-12-08 13:43:10,219 cisTarget    INFO     Getting cistromes for Topic1

commenting the line: species='custom', generates the following error.

TypeError: run_pycistarget() missing 1 required positional argument: 'species'

@SeppeDeWinter
Copy link
Collaborator

Hi @LPerlaza

When using species = "custom" you should provide a motif-to-TF annotation file using the path_to_motif_annotations parameter.

For the format of this file, please refer to this example (for human): https://resources.aertslab.org/cistarget/motif2tf/motifs-v10nr_clust-nr.hgnc-m0.001-o0.0.tbl.

I hope this helps,

All the best,

Seppe

@LPerlaza
Copy link
Author

Hi Seppe, Thanks for the prompt response. I have generated the motif-to-TF annotation file which I have added in the command as motif_annotation = 'motifs-v10-zfish.tbl' I have followed instructions from: https://github.com/JoGraesslin/Zebrafish_SCENIC

This still doesn't work. Do you have any additional recomendation?

Thanks!
Laura

@SeppeDeWinter
Copy link
Collaborator

Hi @LPerlaza

Can you show the output of

from pycistarget.utils import load_motif_annotations

load_motif_annotations(specie = None, fname = "motifs-v10-zfish.tbl")

All the best,

Seppe

@LPerlaza
Copy link
Author

Hi Seppe, thanks for the tip. Here is what the outp

---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-1-a9b5369841b1> in <module>
      1 from pycistarget.utils import load_motif_annotations
      2 
----> 3 load_motif_annotations(specie = None, fname = "motifs-v10-zfish.tbl")

[~/.local/lib/python3.8/site-packages/pycistarget/utils.py](https://vscode-remote+ssh-002dremote-002bbio1.vscode-resource.vscode-cdn.net/home/lper0012/tasks/margo.montandon/Analysis/~/.local/lib/python3.8/site-packages/pycistarget/utils.py) in load_motif_annotations(specie, version, fname, column_names, motif_similarity_fdr, orthologous_identity_threshold)
    102             name='flybase'
    103         fname = 'https://resources.aertslab.org/cistarget/motif2tf/motifs-'+version+'-nr.'+name+'-m0.001-o0.0.tbl'
--> 104     df = pd.read_csv(fname, sep='\t', usecols=column_names)
    105     df.rename(columns={'#motif_id':"MotifID",
    106                        'gene_name':"TF",

[~/.local/lib/python3.8/site-packages/pandas/util/_decorators.py](https://vscode-remote+ssh-002dremote-002bbio1.vscode-resource.vscode-cdn.net/home/lper0012/tasks/margo.montandon/Analysis/~/.local/lib/python3.8/site-packages/pandas/util/_decorators.py) in wrapper(*args, **kwargs)
    209                 else:
    210                     kwargs[new_arg_name] = new_arg_value
--> 211             return func(*args, **kwargs)
    212 
    213         return cast(F, wrapper)

[~/.local/lib/python3.8/site-packages/pandas/util/_decorators.py](https://vscode-remote+ssh-002dremote-002bbio1.vscode-resource.vscode-cdn.net/home/lper0012/tasks/margo.montandon/Analysis/~/.local/lib/python3.8/site-packages/pandas/util/_decorators.py) in wrapper(*args, **kwargs)
    315                     stacklevel=find_stack_level(inspect.currentframe()),
    316                 )
--> 317             return func(*args, **kwargs)
    318 
    319         return wrapper
[~/.local/lib/python3.8/site-packages/pandas/io/parsers/readers.py](https://vscode-remote+ssh-002dremote-002bbio1.vscode-resource.vscode-cdn.net/home/lper0012/tasks/margo.montandon/Analysis/~/.local/lib/python3.8/site-packages/pandas/io/parsers/readers.py) in read_csv(filepath_or_buffer, sep, delimiter, header, names, index_col, usecols, squeeze, prefix, mangle_dupe_cols, dtype, engine, converters, true_values, false_values, skipinitialspace, skiprows, skipfooter, nrows, na_values, keep_default_na, na_filter, verbose, skip_blank_lines, parse_dates, infer_datetime_format, keep_date_col, date_parser, dayfirst, cache_dates, iterator, chunksize, compression, thousands, decimal, lineterminator, quotechar, quoting, doublequote, escapechar, comment, encoding, encoding_errors, dialect, error_bad_lines, warn_bad_lines, on_bad_lines, delim_whitespace, low_memory, memory_map, float_precision, storage_options)
    948     kwds.update(kwds_defaults)
    949 
--> 950     return _read(filepath_or_buffer, kwds)
    951 
    952 

[~/.local/lib/python3.8/site-packages/pandas/io/parsers/readers.py](https://vscode-remote+ssh-002dremote-002bbio1.vscode-resource.vscode-cdn.net/home/lper0012/tasks/margo.montandon/Analysis/~/.local/lib/python3.8/site-packages/pandas/io/parsers/readers.py) in _read(filepath_or_buffer, kwds)
    603 
    604     # Create the parser.
--> 605     parser = TextFileReader(filepath_or_buffer, **kwds)
    606 
    607     if chunksize or iterator:

[~/.local/lib/python3.8/site-packages/pandas/io/parsers/readers.py](https://vscode-remote+ssh-002dremote-002bbio1.vscode-resource.vscode-cdn.net/home/lper0012/tasks/margo.montandon/Analysis/~/.local/lib/python3.8/site-packages/pandas/io/parsers/readers.py) in __init__(self, f, engine, **kwds)
   1440 
   1441         self.handles: IOHandles | None = None
-> 1442         self._engine = self._make_engine(f, self.engine)
   1443 
   1444     def close(self) -> None:

[~/.local/lib/python3.8/site-packages/pandas/io/parsers/readers.py](https://vscode-remote+ssh-002dremote-002bbio1.vscode-resource.vscode-cdn.net/home/lper0012/tasks/margo.montandon/Analysis/~/.local/lib/python3.8/site-packages/pandas/io/parsers/readers.py) in _make_engine(self, f, engine)
   1745 
   1746         try:
-> 1747             return mapping[engine](f, **self.options)
   1748         except Exception:
   1749             if self.handles is not None:
[~/.local/lib/python3.8/site-packages/pandas/io/parsers/c_parser_wrapper.py](https://vscode-remote+ssh-002dremote-002bbio1.vscode-resource.vscode-cdn.net/home/lper0012/tasks/margo.montandon/Analysis/~/.local/lib/python3.8/site-packages/pandas/io/parsers/c_parser_wrapper.py) in __init__(self, src, **kwds)
    146                 self.orig_names
    147             ):
--> 148                 self._validate_usecols_names(usecols, self.orig_names)
    149 
    150             # error: Cannot determine type of 'names'

[~/.local/lib/python3.8/site-packages/pandas/io/parsers/base_parser.py](https://vscode-remote+ssh-002dremote-002bbio1.vscode-resource.vscode-cdn.net/home/lper0012/tasks/margo.montandon/Analysis/~/.local/lib/python3.8/site-packages/pandas/io/parsers/base_parser.py) in _validate_usecols_names(self, usecols, names)
    916         missing = [c for c in usecols if c not in names]
    917         if len(missing) > 0:
--> 918             raise ValueError(
    919                 f"Usecols do not match columns, columns expected but not found: "
    920                 f"{missing}"

ValueError: Usecols do not match columns, columns expected but not found: ['motif_similarity_qvalue', 'description', 'gene_name', 'orthologous_identity', '#motif_id']

For your reference, the header of the file motifs-v10-zfish.tbl looks like this:

#motif_id,motif_name,motif_description,source_name,source_version,gene_name,motif_similarity_qvalue,similar_motif_id,similar_motif_description,orthologous_identity,orthologous_gene_name,orthologous_species,description,score,Gene stable ID version_x
metacluster_196.3,EcR_usp,EcR/usp,bergman,1.1,hnf4a,0.0,None,None,0.270042,FBgn0003964,D. melanogaster,gene is orthologous to FBgn0003964 in D. melanogaster (identity = 27%) which is directly annotated for motif,,
metacluster_121.1,usp,usp,bergman,1.1,hnf4a,0.0,None,None,0.270042,FBgn0003964,D. melanogaster,gene is orthologous to FBgn0003964 in D. melanogaster (identity = 27%) which is directly annotated for motif,,
cisbp__M00182,M00182,Hnf4a[gene ID: ENSMUSG00000017950 species: Mus musculus TF status: direct TF family: Nuclear receptor DBDs: zf-C4]; Nr6a1[gene ID: ENSMUSG00000063972 species: Mus musculus TF status: inferred TF family: Nuclear receptor DBDs: zf-C4],cisbp,2.00,hnf4a,0.0,None,None,0.957806,ENSMUSG00000017950,M. musculus,gene is orthologous to ENSMUSG00000017950 in M. musculus (identity = 95%) which is directly annotated for motif,,
cisbp__M00243,M00243,WT1[gene ID: ENSG00000184937 species: Homo sapiens TF status: direct TF family: C2H2 ZF DBDs: zf-C2H2],cisbp,2.00,hnf4a,6.46887e-06,tfdimers__MD00149,"M00967_forward_-1_M00192_forward dimer: HNF4, COUP / GR",1.0,None,None,"motif similar to tfdimers__MD00149 ('M00967_forward_-1_M00192_forward dimer: HNF4, COUP / GR'; q-value = 6.47e-06) which is directly annotated",,
cisbp__M00245,M00245,WT1[gene ID: ENSG00000184937 species: Homo sapiens TF status: direct TF family: C2H2 ZF DBDs: zf-C2H2],cisbp,2.00,hnf4a,6.53111e-06,tfdimers__MD00149,"M00967_forward_-1_M00192_forward dimer: HNF4, COUP / GR",1.0,None,None,"motif similar to tfdimers__MD00149 ('M00967_forward_-1_M00192_forward dimer: HNF4, COUP / GR'; q-value = 6.53e-06) which is directly annotated",,
cisbp__M00246,M00246,WT1[gene ID: ENSG00000184937 species: Homo sapiens TF status: direct TF family: C2H2 ZF DBDs: zf-C2H2],cisbp,2.00,hnf4a,6.66711e-06,tfdimers__MD00149,"M00967_forward_-1_M00192_forward dimer: HNF4, COUP / GR",1.0,None,None,"motif similar to tfdimers__MD00149 ('M00967_forward_-1_M00192_forward dimer: HNF4, COUP / GR'; q-value = 6.67e-06) which is directly annotated",,
metacluster_121.1,M00811,NR1H2[gene ID: ENSG00000131408 species: Homo sapiens TF status: inferred TF family: Nuclear receptor DBDs: zf-C4]; NR4A3[gene ID: ENSG00000119508 species: Homo sapiens TF status: inferred TF family: Nuclear receptor DBDs: zf-C4]; Nr1h5[gene ID: ENSMUSG00000048938 species: Mus musculus TF status: inferred TF family: Nuclear receptor DBDs: zf-C4]; Nr2f6[gene ID: ENSMUSG00000002393 species: Mus musculus TF status: direct TF family: Nuclear receptor DBDs: zf-C4]; Nr4a3[gene ID: ENSMUSG00000028341 species: Mus musculus TF status: inferred TF family: Nuclear receptor DBDs: zf-C4]; Nr6a1[gene ID: ENSMUSG00000063972 species: Mus musculus TF status: inferred TF family: Nuclear receptor DBDs: zf-C4]; Ppard[gene ID: ENSMUSG00000002250 species: Mus musculus TF status: inferred TF family: Nuclear receptor DBDs: zf-C4]; Thrb[gene ID: ENSMUSG00000021779 species: Mus musculus TF status: inferred TF family: Nuclear receptor DBDs: zf-C4],cisbp,2.00,hnf4a,1.59694e-06,metacluster_121.1,usp_SANGER_5_FBgn0003964,0.270042,FBgn0003964,D. melanogaster,gene is orthologous to FBgn0003964 in D. melanogaster (identity = 27%) which is annotated for similar motif flyfactorsurvey__usp_SANGER_5_FBgn0003964 ('usp_SANGER_5_FBgn0003964'; q-value = 1.6e-06),,
metacluster_121.1,M02392,NR1H2[gene ID: ENSG00000131408 species: Homo sapiens TF status: inferred TF family: Nuclear receptor DBDs: zf-C4]; Nr1h5[gene ID: ENSMUSG00000048938 species: Mus musculus TF status: inferred TF family: Nuclear receptor DBDs: zf-C4]; Nr6a1[gene ID: ENSMUSG00000063972 species: Mus musculus TF status: inferred TF family: Nuclear receptor DBDs: zf-C4]; Ppard[gene ID: ENSMUSG00000002250 species: Mus musculus TF status: inferred TF family: Nuclear receptor DBDs: zf-C4]; Thrb[gene ID: ENSMUSG00000021779 species: Mus musculus TF status: inferred TF family: Nuclear receptor DBDs: zf-C4],cisbp,2.00,hnf4a,1.43593e-06,metacluster_121.1,usp_SANGER_5_FBgn0003964,0.270042,FBgn0003964,D. melanogaster,gene is orthologous to FBgn0003964 in D. melanogaster (identity = 27%) which is annotated for similar motif flyfactorsurvey__usp_SANGER_5_FBgn0003964 ('usp_SANGER_5_FBgn0003964'; q-value = 1.44e-06),,
metacluster_121.1,M02399,NR1H2[gene ID: ENSG00000131408 species: Homo sapiens TF status: inferred TF family: Nuclear receptor DBDs: zf-C4]; NR4A3[gene ID: ENSG00000119508 species: Homo sapiens TF status: inferred TF family: Nuclear receptor DBDs: zf-C4]; Nr1h5[gene ID: ENSMUSG00000048938 species: Mus musculus TF status: inferred TF family: Nuclear receptor DBDs: zf-C4]; Nr2c1[gene ID: ENSMUSG00000005897 species: Mus musculus TF status: direct TF family: Nuclear receptor DBDs: zf-C4]; Nr4a3[gene ID: ENSMUSG00000028341 species: Mus musculus TF status: inferred TF family: Nuclear receptor DBDs: zf-C4]; Nr6a1[gene ID: ENSMUSG00000063972 species: Mus musculus TF status: inferred TF family: Nuclear receptor DBDs: zf-C4]; Ppard[gene ID: ENSMUSG00000002250 species: Mus musculus TF status: inferred TF family: Nuclear receptor DBDs: zf-C4],cisbp,2.00,hnf4a,1.59694e-06,metacluster_121.1,usp_SANGER_5_FBgn0003964,0.270042,FBgn0003964,D. melanogaster,gene is orthologous to FBgn0003964 in D. melanogaster (identity = 27%) which is annotated for similar motif flyfactorsurvey__usp_SANGER_5_FBgn0003964 ('usp_SANGER_5_FBgn0003964'; q-value = 1.6e-06),,

@LPerlaza
Copy link
Author

LPerlaza commented Dec 19, 2023

Hi Seppe, do you have any advice? I can see that the column names are as expected. Any tip to track back the problem?

@SeppeDeWinter
Copy link
Collaborator

Hi @LPerlaza

I see that your motif annotation table is comma separated, the function expects it to be tab seperated.

All the best,

Seppe

@LPerlaza
Copy link
Author

LPerlaza commented Jan 4, 2024

Hi Seppe, Thanks! That fixed the problem. I managed to run run_pycistarget without warnings but I am now having issues running run_scenicplus. I am trying to run it like:

run_scenicplus(
        scplus_obj = scplus_obj,
        variable = ['GEX_celltype'],
        species = 'drerio',
        assembly = 'GRCz11',
        tf_file = 'Danio_rerio_TF_listgenes.txt',
        save_path = os.path.join(work_dir, 'scenicplus'),
        biomart_host = 'http://www.ensembl.org',
        upstream = [1000, 150000],
        downstream = [1000, 150000],
        calculate_TF_eGRN_correlation = True,
        calculate_DEGs_DARs = True,
        export_to_loom_file = True,
        export_to_UCSC_file = True,
        path_bedToBigBed = 'drerio_bed',
        n_cpu = 12,
        _temp_dir = os.path.join(tmp_dir, 'ray_spill'))

which generated the following error:

2024-01-04 11:48:30,358 SCENIC+_wrapper INFO     [./SCENIC_results/scenicplus](https://vscode-remote+ssh-002dremote-002bbio1.vscode-resource.vscode-cdn.net/home/lper0012/tasks/margo.montandon/Analysis/SCENIC_results/scenicplus) folder already exists.
2024-01-04 11:48:30,359 SCENIC+_wrapper INFO     Getting search space
2024-01-04 11:48:33,317 R2G          INFO     Downloading gene annotation from biomart dataset: drerio_gene_ensembl
---------------------------------------------------------------------------
Exception                                 Traceback (most recent call last)
<ipython-input-18-f3739854d831> in <module>
----> 1 run_scenicplus(
      2         scplus_obj = scplus_obj,
      3         variable = ['GEX_celltype'],
      4         species = 'drerio',
      5         assembly = 'GRCz11',

[~/.local/lib/python3.8/site-packages/scenicplus/wrappers/run_scenicplus.py](https://vscode-remote+ssh-002dremote-002bbio1.vscode-resource.vscode-cdn.net/home/lper0012/tasks/margo.montandon/Analysis/~/.local/lib/python3.8/site-packages/scenicplus/wrappers/run_scenicplus.py) in run_scenicplus(scplus_obj, variable, species, assembly, tf_file, save_path, biomart_host, upstream, downstream, region_ranking, gene_ranking, simplified_eGRN, calculate_TF_eGRN_correlation, calculate_DEGs_DARs, export_to_loom_file, export_to_UCSC_file, tree_structure, path_bedToBigBed, n_cpu, _temp_dir, save_partial, **kwargs)
    132     if 'search_space' not in scplus_obj.uns.keys():
    133         log.info('Getting search space')
--> 134         get_search_space(scplus_obj,
    135                      biomart_host = biomart_host,
    136                      species = species,

[~/.local/lib/python3.8/site-packages/scenicplus/enhancer_to_gene.py](https://vscode-remote+ssh-002dremote-002bbio1.vscode-resource.vscode-cdn.net/home/lper0012/tasks/margo.montandon/Analysis/~/.local/lib/python3.8/site-packages/scenicplus/enhancer_to_gene.py) in get_search_space(SCENICPLUS_obj, species, assembly, pr_annot, pr_chromsizes, predefined_boundaries, use_gene_boundaries, upstream, downstream, extend_tss, remove_promoters, biomart_host, inplace, key_added, implode_entries)
    236             chromsizes = pr.PyRanges(chromsizes)
    237         else:
--> 238             raise Exception(
    239                 'The assembly {} could not be found in http://hgdownload.cse.ucsc.edu/goldenPath/. Check assembly name or consider manually providing chromosome sizes!'.format(assembly))
    240     else:

Exception: The assembly GRCz11 could not be found in http://hgdownload.cse.ucsc.edu/goldenPath/. Check assembly name or consider manually providing chromosome sizes!

When I checked the code I realized that the condition of specifying assembly only includes human, mouse and fly assembly versions. I try to pass the error by adding the assembly version for zebra fish using this code:

ASM_SYNONYMS.update(
    { 'GRCz11':'GRCz11'}
)

but I get this error:

2024-01-04 12:27:57,176 SCENIC+_wrapper INFO     [./SCENIC_results/scenicplus](https://vscode-remote+ssh-002dremote-002bbio1.vscode-resource.vscode-cdn.net/home/lper0012/tasks/margo.montandon/Analysis/SCENIC_results/scenicplus) folder already exists.
2024-01-04 12:27:57,177 SCENIC+_wrapper INFO     Getting search space
2024-01-04 12:28:00,649 R2G          INFO     Downloading gene annotation from biomart dataset: drerio_gene_ensembl
---------------------------------------------------------------------------
Exception                                 Traceback (most recent call last)
<ipython-input-20-903a82104d8b> in <module>
----> 1 run_scenicplus(
      2         scplus_obj = scplus_obj,
      3         variable = ['GEX_celltype'],
      4         species = 'drerio',
      5         assembly = 'GRCz11',

[~/.local/lib/python3.8/site-packages/scenicplus/wrappers/run_scenicplus.py](https://vscode-remote+ssh-002dremote-002bbio1.vscode-resource.vscode-cdn.net/home/lper0012/tasks/margo.montandon/Analysis/~/.local/lib/python3.8/site-packages/scenicplus/wrappers/run_scenicplus.py) in run_scenicplus(scplus_obj, variable, species, assembly, tf_file, save_path, biomart_host, upstream, downstream, region_ranking, gene_ranking, simplified_eGRN, calculate_TF_eGRN_correlation, calculate_DEGs_DARs, export_to_loom_file, export_to_UCSC_file, tree_structure, path_bedToBigBed, n_cpu, _temp_dir, save_partial, **kwargs)
    132     if 'search_space' not in scplus_obj.uns.keys():
    133         log.info('Getting search space')
--> 134         get_search_space(scplus_obj,
    135                      biomart_host = biomart_host,
    136                      species = species,

[~/.local/lib/python3.8/site-packages/scenicplus/enhancer_to_gene.py](https://vscode-remote+ssh-002dremote-002bbio1.vscode-resource.vscode-cdn.net/home/lper0012/tasks/margo.montandon/Analysis/~/.local/lib/python3.8/site-packages/scenicplus/enhancer_to_gene.py) in get_search_space(SCENICPLUS_obj, species, assembly, pr_annot, pr_chromsizes, predefined_boundaries, use_gene_boundaries, upstream, downstream, extend_tss, remove_promoters, biomart_host, inplace, key_added, implode_entries)
    236             chromsizes = pr.PyRanges(chromsizes)
    237         else:
--> 238             raise Exception(
    239                 'The assembly {} could not be found in http://hgdownload.cse.ucsc.edu/goldenPath/. Check assembly name or consider manually providing chromosome sizes!'.format(assembly))
    240     else:

Exception: The assembly GRCz11 could not be found in http://hgdownload.cse.ucsc.edu/goldenPath/. Check assembly name or consider manually providing chromosome sizes!

I can see the options pr_annot and pr_chromsizes for get_search_space that should be wrapped by run_scenicplus
but the assembly is not optional for run_scenicplus and there are not pr_annot and pr_chromsizes options aswell. How can I run run_scenicplus for zebrafish? I do have the annotation and the chromsizes dataframes.

Thanks! and Happy New Year btw!
Laura

@SeppeDeWinter
Copy link
Collaborator

Hi Laura

Happy new year to you as well.

This is indeed a limitation on how SCENIC+ was written in this main branch, different species are better handled in the code of the development branch (https://github.com/aertslab/scenicplus/tree/development).

It is still possible to use the main branch, for this please read these instructions: https://scenicplus.readthedocs.io/en/development/pbmc_multiome_tutorial.html#Notes-on-the-individual-function-calls-in-the-wrapper-function. (under the section: "Notes on the individual function calls in the wrapper function.").

I hope this helps.

All the best,

Seppe

@LPerlaza
Copy link
Author

LPerlaza commented Jan 9, 2024

Hi Seppe, Thanks so much for your suggestions. Looking at the notes is very helpful. I was following the other tutorial and I missed that section. I am running into this error and I can't see why. Do You have any advice of how to troubleshoot it? I fail to see where the problem is.

I am trying to run this:

from scenicplus.enhancer_to_gene import get_search_space
get_search_space(
    scplus_obj,
    upstream = [100, 15000], 
    downstream = [100, 15000],
    pr_annot = pr_annot, #see above
    pr_chromsizes = pr_chromsizes, #see above
    biomart_host = 'http://www.ensembl.org')

get this error:

2024-01-09 16:18:44,336 R2G          INFO     Extending promoter annotation to 10 bp upstream and 10 downstream
---------------------------------------------------------------------------
AssertionError                            Traceback (most recent call last)
<ipython-input-86-5aaf7cea1dc0> in <module>
      2 
      3 from scenicplus.enhancer_to_gene import get_search_space
----> 4 get_search_space(
      5     scplus_obj,
      6     upstream = [100, 15000],

[~/.local/lib/python3.8/site-packages/scenicplus/enhancer_to_gene.py](https://vscode-remote+ssh-002dremote-002bbio1.vscode-resource.vscode-cdn.net/home/lper0012/tasks/margo.montandon/Analysis/~/.local/lib/python3.8/site-packages/scenicplus/enhancer_to_gene.py) in get_search_space(SCENICPLUS_obj, species, assembly, pr_annot, pr_chromsizes, predefined_boundaries, use_gene_boundaries, upstream, downstream, extend_tss, remove_promoters, biomart_host, inplace, key_added, implode_entries)
    262     log.info('Extending promoter annotation to {} bp upstream and {} downstream'.format(
    263         str(extend_tss[0]), str(extend_tss[1])))
--> 264     pr_promoters = extend_pyranges(pr_promoters, extend_tss[0], extend_tss[1])
    265 
    266     if use_gene_boundaries or predefined_boundaries:

[~/.local/lib/python3.8/site-packages/scenicplus/utils.py](https://vscode-remote+ssh-002dremote-002bbio1.vscode-resource.vscode-cdn.net/home/lper0012/tasks/margo.montandon/Analysis/~/.local/lib/python3.8/site-packages/scenicplus/utils.py) in extend_pyranges(pr_obj, upstream, downstream)
     47         negative_pr.Start = (negative_pr.Start - downstream).astype(np.int32)
     48         negative_pr.End = (negative_pr.End + upstream).astype(np.int32)
---> 49     extended_pr = pr.PyRanges(
     50         pd.concat([positive_pr.df, negative_pr.df], axis=0, sort=False))
     51     return extended_pr

[~/.local/lib/python3.8/site-packages/pyranges/pyranges_main.py](https://vscode-remote+ssh-002dremote-002bbio1.vscode-resource.vscode-cdn.net/home/lper0012/tasks/margo.montandon/Analysis/~/.local/lib/python3.8/site-packages/pyranges/pyranges_main.py) in __init__(self, df, chromosomes, starts, ends, strands, int64, copy_df)
    176             df = pd.DataFrame(columns="Chromosome Start End".split())
    177 
--> 178         _init(self, df, chromosomes, starts, ends, strands, copy_df)
    179 
    180     def __array_ufunc__(self, *args, **kwargs):
[~/.local/lib/python3.8/site-packages/pyranges/methods/init.py](https://vscode-remote+ssh-002dremote-002bbio1.vscode-resource.vscode-cdn.net/home/lper0012/tasks/margo.montandon/Analysis/~/.local/lib/python3.8/site-packages/pyranges/methods/init.py) in _init(self, df, chromosomes, starts, ends, strands, copy_df)
    118 
    119     if isinstance(df, pd.DataFrame):
--> 120         assert all(
    121             c in df for c in "Chromosome Start End".split()
    122         ), "The dataframe does not have all the columns Chromosome, Start and End."

AssertionError: The dataframe does not have all the columns Chromosome, Start and End.

Both of pr_annot and pr_chromsizes objects seems to have the correct format and type.

The pr_annot was created like:

import pybiomart as pbm

dataset = pbm.Dataset(name='drerio_gene_ensembl',  host='http://www.ensembl.org')
dataset

annot = dataset.query(attributes=['chromosome_name', 'transcription_start_site', 'strand', 'external_gene_name', 'transcript_biotype'])
annot.columns=['Chromosome', 'Start', 'Strand', 'Gene', 'Transcript_type']
annot = annot[annot.Transcript_type == 'protein_coding']
annot['Transcription_Start_Site']=annot['Start']
annot['End']=annot['Start']
annot=annot[['Chromosome', 'Start','End', 'Strand', 'Gene', 'Transcription_Start_Site','Transcript_type']]
annot=annot.loc[:,['Chromosome', 'Start','End', 'Strand', 'Gene', 'Transcription_Start_Site','Transcript_type']]
filter = annot['Chromosome'].str.contains('ALT')
annot = annot[~filter]
pr_annot = pr.PyRanges(annot)
pr_annot.copy

bound method PyRanges.copy of +--------------+-----------+-----------+--------------+------------+-------+
| Chromosome   | Start     | End       | Strand       | Gene       | +2    |
| (category)   | (int64)   | (int64)   | (category)   | (object)   | ...   |
|--------------+-----------+-----------+--------------+------------+-------|
| 1            | 49266886  | 49266886  | 1            | caly       | ...   |
| 1            | 44949909  | 44949909  | 1            | vps37c     | ...   |
| 1            | 44950170  | 44950170  | 1            | vps37c     | ...   |
| 1            | 9885320   | 9885320   | -1           | ints15     | ...   |
| ...          | ...       | ...       | ...          | ...        | ...   |
| MT           | 11299     | 11299     | 1            | mt-nd4     | ...   |
| MT           | 12897     | 12897     | 1            | mt-nd5     | ...   |
| MT           | 15232     | 15232     | -1           | mt-nd6     | ...   |
| MT           | 15308     | 15308     | 1            | mt-cyb     | ...   |
+--------------+-----------+-----------+--------------+------------+-------+
Unstranded PyRanges object has 42,032 rows and 7 columns from 266 chromosomes.
For printing, the PyRanges was sorted on Chromosome.
Considered unstranded due to these Strand values: '1', '-1'
2 hidden columns: Transcription_Start_Site, Transcript_type>

pr_chromsizes was created like:

import pandas as pd
genomeref = pd.read_pickle(r"./GenomeRef_pyAnnotation.obj") #object from R with from GRanges 
genomeref=genomeref.df.loc[:,['Chromosome', 'Start', 'End']]
pr_chromsizes =  pr.PyRanges(genomeref)

pr_chromsizes looks like:

pr_chromsizes.copy
<bound method PyRanges.copy of +--------------+-----------+-----------+
| Chromosome   | Start     | End       |
| (category)   | (int64)   | (int64)   |
|--------------+-----------+-----------|
| 1            | 0         | 43039672  |
| 2            | 0         | 39555667  |
| 3            | 0         | 60282382  |
| 4            | 0         | 76270779  |
| ...          | ...       | ...       |
| KZ116055.1   | 0         | 1335      |
| KZ116064.1   | 0         | 82859     |
| KZ116065.1   | 0         | 20619     |
| MT           | 0         | 14717     |
+--------------+-----------+-----------+
Unstranded PyRanges object has 267 rows and 3 columns from 266 chromosomes.
For printing, the PyRanges was sorted on Chromosome.>

Thanks!
Laura

@SeppeDeWinter
Copy link
Collaborator

Hi Laura

The issue is caused because the strands should be formatted as "+", "-" instead of 1 and -1.

This is where thing starts going wrong: https://github.com/aertslab/scenicplus/blob/e5ba6fcf42459b6e6b70e27359ddd11289d70cc5/src/scenicplus/utils.py#L40

All the best,

Seppe

@LPerlaza
Copy link
Author

Hi Seppe,

I managed to run the whole pipeline, thanks so much for all your help. I have now a dataframe with the eRegulons which is great! My conundrum now is that some of my genes of interest don't show up on my eRegulons list. I figured that the main problem due to unknown motifs for genes in Zebrafish. I am working with a motif file that I generated by finding the orthologous genes in human and assuming that if the gene is orthologous between human and zebra fish then the motif that connects a gene to the TF is also the same. I have tried including chicken, mouse and fly annotations to check for more orthologies but this is not improving the results. Do you have any advice on how to include more gene-TF interactions. Also any advice on how to explore the complete list of gene-TF interactions vs the gene-TF on the eRegulon results (gene-TF present in the experiment vs gene-TF of importance)

@SeppeDeWinter
Copy link
Collaborator

Hi @LPerlaza

Great to hear the workflow is working.

To note, the motif enrichment step does not determine which target genes are found (it only determines what TFs are found). So the gene-TF interactions are not determined by the motifs you include in your analysis.

The complete list of TF-gene interactions can be found in the TF2G_adj (in the .uns field of your SCENIC+ object).
This comes from the GRNboost step.

All the best,

Seppe

@LPerlaza
Copy link
Author

LPerlaza commented Apr 9, 2024

Hi Seppe, I am working with a bigger dataset (including more cells) now and I had to move my project to a different location (cluster). I was running out of RAM in when running run_scenicplus. I installed scenicplus in this new cluster but my code doesn't work anymore. It seem some updates in this last version of scenicplus have made my current code obsolete. I tried coping the older version from my other server and building in the cluster but this setup.py doesn't seem to be working anymore, is trying to reach github repos that don't exist anymore.
Things I have figure out:

  1. The merge_cistromes (now get_and_merge_cistromes) and get_search_space are located in a different script file, so are imported differently from scenicplus.data_wrangling.cistarget_wrangling import *
  2. They have different parameters. I tried several solutions but I couldn't make it work.

This was working:

from scenicplus.enhancer_to_gene import get_search_space
get_search_space(
    scplus_obj,
    upstream = [100, 15000], 
    downstream = [100, 15000],
    pr_annot = pr_annot, #see above
    pr_chromsizes = pr_chromsizes, #see above
    biomart_host = 'http://www.ensembl.org')

This is not working:


get_search_space(
    scplus_obj,
    scplus_genes=scplus_obj.metadata_genes['features'],
    upstream = [100, 15000], 
    downstream = [100, 15000],
    gene_annotation = pr_annot, #see above
    chromsizes = pr_chromsizes, #see above
)
---------------------------------------------------------------------------
AttributeError                            Traceback (most recent call last)
Cell In[56], line 4
      1 #because I have custom annotation and chromosome sizes I need to run get_search_space outside of the wrapper function (run_scenicplus)
      2 from scenicplus.data_wrangling.gene_search_space import *
----> 4 get_search_space(
      5     scplus_obj,
      6     scplus_genes=scplus_obj.metadata_genes['features'],
      7     upstream = [100, 15000], 
      8     downstream = [100, 15000],
      9     gene_annotation = pr_annot, #see above
     10     chromsizes = pr_chromsizes, #see above
     11 )

File /fs03/df22/perlaza/margo.montandon/scenicplus/src/scenicplus/data_wrangling/gene_search_space.py:292, in get_search_space(scplus_region, scplus_genes, gene_annotation, chromsizes, use_gene_boundaries, upstream, downstream, extend_tss, remove_promoters)
    288     raise ValueError(
    289         f"chromsizes should have the following columns: {', '.join(_chromsizes_required_cols)}")
    291 #convert to pyranges
--> 292 pr_gene_annotation = pr.PyRanges(gene_annotation.query("Gene in @scplus_genes").copy())
    293 pr_chromsizes = pr.PyRanges(chromsizes)
    294 pr_regions = pr.PyRanges(region_names_to_coordinates(scplus_region))

File ~/.conda/envs/SCENIC/lib/python3.11/site-packages/pyranges/pyranges_main.py:265, in __getattr__(self, name)

File ~/.conda/envs/SCENIC/lib/python3.11/site-packages/pyranges/methods/attr.py:65, in _getattr(self, name)
     62 def _getattr(self, name):
     64     if name in self.columns:
---> 65         return pd.concat([df[name] for df in self.values()])
     66     else:
     67         raise AttributeError("PyRanges object has no attribute", name)

AttributeError: ('PyRanges object has no attribute', 'query')

I tried to seearch for the notes you directed me last time but the section is gone. So I am out of ideas to solve this. Could you please indicate me to docs for customized references for the new release please, or tell me how I can build the last version on the new cluster.

Thanks!
Laura

@SeppeDeWinter
Copy link
Collaborator

Hi @LPerlaza

Did you install the development version by any chance? Indeed, these functions have changed a lot.
For tutorial on how to use that version, please see: https://scenicplus.readthedocs.io/en/development/tutorials.html.

You can still download the none-development version of the code if you want, that one is still the default branch on GitHub.

All the best,

Seppe

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants