Missing phenotypes for gene AP1G1 #71

Open
Alx-Kouris opened this issue Jan 17, 2023 · 6 comments
Labels
question Further information is requested

Comments

@Alx-Kouris

Hello, I visited the Monarch website for gene AP1G1 https://monarchinitiative.org/gene/HGNC:555#phenotype and I see only some EFO phenotypes being reported.

I would expect to find related HPO phenotypes as well, based on this page https://hpo.jax.org/app/browse/gene/164.

Is this expected? Or some kind of bug?

@sagehrke
Member

Hi @Alx-Kouris,
Thank you for submitting a question to the Monarch HelpDesk! We sincerely appreciate you taking the time to reach out and are working on an answer to share with you soon.

@kevinschaper
Member

Hi @Alx-Kouris,

Unfortunately, I think what you're seeing is just the age of the data on our production site and in our production graph.

We're working to rebuild our stack, starting from the graph and moving up through the API and website, so I can see that the data is present, but we've got a few months before we'll be showing it.

Pulling the development sqlite database artifact from https://data.monarchinitiative.org/monarch-kg-dev/latest/index.html I can see:

sqlite3 -markdown monarch-kg.db "select subject, predicate, object from edges where subject = 'HGNC:555' and predicate = 'biolink:has_phenotype'"
| subject | predicate | object |
|---|---|---|
| HGNC:555 | biolink:has_phenotype | HP:0001252 |
| HGNC:555 | biolink:has_phenotype | HP:0000286 |
| HGNC:555 | biolink:has_phenotype | HP:0000316 |
| HGNC:555 | biolink:has_phenotype | HP:0030953 |
| HGNC:555 | biolink:has_phenotype | HP:0001274 |
| HGNC:555 | biolink:has_phenotype | HP:0000007 |
| HGNC:555 | biolink:has_phenotype | HP:0000767 |
| HGNC:555 | biolink:has_phenotype | HP:0001263 |
| HGNC:555 | biolink:has_phenotype | HP:0000718 |
| HGNC:555 | biolink:has_phenotype | HP:0001249 |
| HGNC:555 | biolink:has_phenotype | HP:0001250 |
| HGNC:555 | biolink:has_phenotype | HP:0000358 |
| HGNC:555 | biolink:has_phenotype | HP:0001257 |
| HGNC:555 | biolink:has_phenotype | HP:0000369 |
| HGNC:555 | biolink:has_phenotype | HP:0001388 |
| HGNC:555 | biolink:has_phenotype | HP:0000336 |
| HGNC:555 | biolink:has_phenotype | HP:0000218 |
| HGNC:555 | biolink:has_phenotype | HP:0004626 |
| HGNC:555 | biolink:has_phenotype | HP:0000750 |
| HGNC:555 | biolink:has_phenotype | HP:0001250 |
| HGNC:555 | biolink:has_phenotype | HP:0004209 |
| HGNC:555 | biolink:has_phenotype | HP:0000716 |
| HGNC:555 | biolink:has_phenotype | HP:0001252 |
| HGNC:555 | biolink:has_phenotype | HP:0002007 |
| HGNC:555 | biolink:has_phenotype | HP:0001249 |
| HGNC:555 | biolink:has_phenotype | HP:0000262 |
| HGNC:555 | biolink:has_phenotype | HP:0000739 |
| HGNC:555 | biolink:has_phenotype | HP:0001263 |
| HGNC:555 | biolink:has_phenotype | HP:0011220 |
| HGNC:555 | biolink:has_phenotype | HP:0000343 |
| HGNC:555 | biolink:has_phenotype | HP:0000565 |
| HGNC:555 | biolink:has_phenotype | HP:0030820 |
| HGNC:555 | biolink:has_phenotype | HP:0000722 |
| HGNC:555 | biolink:has_phenotype | HP:0002942 |
| HGNC:555 | biolink:has_phenotype | HP:0000750 |
| HGNC:555 | biolink:has_phenotype | HP:0000768 |
| HGNC:555 | biolink:has_phenotype | HP:0012803 |
| HGNC:555 | biolink:has_phenotype | HP:0000718 |
| HGNC:555 | biolink:has_phenotype | HP:0002938 |
| HGNC:555 | biolink:has_phenotype | HP:0000006 |
| HGNC:555 | biolink:has_phenotype | HP:0009381 |
| HGNC:555 | biolink:has_phenotype | HP:0004691 |
| HGNC:555 | biolink:has_phenotype | HP:0001257 |
| HGNC:555 | biolink:has_phenotype | HP:0001763 |
| HGNC:555 | biolink:has_phenotype | HP:0100716 |
| HGNC:555 | biolink:has_phenotype | HP:0000752 |
| HGNC:555 | biolink:has_phenotype | HP:0000646 |
| HGNC:555 | biolink:has_phenotype | HP:0000729 |

Hopefully that at least looks good.
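
Some terms in that result appear more than once (HP:0001252 and HP:0001250, for example), so a pipeline will probably want each phenotype only once per gene. A minimal sketch, assuming the same monarch-kg.db file and edges table as in the command above; the function name gene_phenotypes is made up for illustration and is not part of any Monarch tooling:

```shell
# Hypothetical helper: print each distinct phenotype term linked to a gene.
# Usage: gene_phenotypes <path-to-sqlite-db> <gene CURIE>
gene_phenotypes() {
  db=$1
  gene=$2
  # CURIEs such as HGNC:555 contain no quotes; reject any that do
  # rather than splice unexpected text into the SQL string.
  case $gene in
    *"'"*) echo "invalid gene id: $gene" >&2; return 1 ;;
  esac
  sqlite3 "$db" "select distinct object from edges
                 where subject = '$gene'
                   and predicate = 'biolink:has_phenotype'
                 order by object"
}

# Example:
#   gene_phenotypes monarch-kg.db HGNC:555
```

The only change from the query above is `select distinct` plus an `order by`, so repeated gene-phenotype pairs collapse to one line each.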

Thank you for pointing out the discrepancy and submitting an issue, and hopefully we'll at least have a beta for the new API & site to look at soon.

@sagehrke added the question label (Further information is requested) on Jan 18, 2023
@chapplec

Thank you for the quick response @kevinschaper. However, we (I work with Alex who asked the question here) deal with thousands of genes in a high throughput way. Is there a way for us to download the correct data? We have been downloading from https://data.monarchinitiative.org/latest/tsv/gene_associations/index.html but if those data are not reliable, do we have another option?

  1. Can we download correct gene-disease associations from somewhere or do we need to wait for your fix?
  2. If we cannot get correct data, is there a way for us to identify which genes have this issue so we can at least flag the bad data in our system?

@kevinschaper
Member

Hi @chapplec,

We aren't producing those nice association subset files from the new pipeline yet, but we do plan to.

You can get all associations in tsv format from monarch-kg_edges.tsv within https://data.monarchinitiative.org/monarch-kg-dev/latest/monarch-kg.tar.gz, and then subset on the category field for biolink:GeneToDiseaseAssociation. You may also want to subset on the predicate field.
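
As a rough sketch of that subsetting step without loading anything into a database, the awk function below finds the category column by its header name, so it does not depend on column order, and keeps the header plus the matching rows. It assumes monarch-kg_edges.tsv is tab-delimited with a header row that includes a category column; the function name subset_by_category is hypothetical.

```shell
# Hypothetical helper: keep the header and the rows whose `category`
# column equals the given value. Reads a tab-delimited file with a header.
# Usage: subset_by_category <edges.tsv> <category>
subset_by_category() {
  awk -F'\t' -v want="$2" '
    NR == 1 {
      # Locate the category column by name rather than by position.
      for (i = 1; i <= NF; i++) if ($i == "category") col = i
      if (!col) { print "no category column in header" > "/dev/stderr"; exit 1 }
      print
      next
    }
    $col == want
  ' "$1"
}

# Example (after extracting monarch-kg_edges.tsv from monarch-kg.tar.gz):
#   subset_by_category monarch-kg_edges.tsv biolink:GeneToDiseaseAssociation > gene_disease.tsv
```

The same pattern works for the predicate field by matching on a different header name.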

The new graph is intentionally more limited in gene-to-disease associations (currently only data from OMIM) and has predicates (in biolink, which is equivalent to relation in the OBAN model) that are more accurate and cautious, in particular with respect to claims of causation.

I know that I prefer to use delimited files for pipelines, but I'm going to go back to the sqlite database again for quick subsetting:

Quickly, these are the two predicates we're using. biolink:risk_affected_by is the stronger assertion.

sqlite3 -markdown monarch-kg.db "select distinct predicate from edges where category = 'biolink:GeneToDiseaseAssociation'"
| predicate |
|---|
| biolink:gene_associated_with_condition |
| biolink:risk_affected_by |

You can get them together with

sqlite3 -markdown monarch-kg.db "select subject, predicate, object from edges where category = 'biolink:GeneToDiseaseAssociation' limit 10"
| subject | predicate | object |
|---|---|---|
| HGNC:2593 | biolink:gene_associated_with_condition | MONDO:0008730 |
| HGNC:2593 | biolink:gene_associated_with_condition | MONDO:0008730 |
| HGNC:26404 | biolink:gene_associated_with_condition | MONDO:0014464 |
| HGNC:91 | biolink:gene_associated_with_condition | MONDO:0012392 |
| HGNC:21024 | biolink:gene_associated_with_condition | MONDO:0010117 |
| HGNC:29092 | biolink:gene_associated_with_condition | MONDO:0013039 |
| HGNC:25367 | biolink:gene_associated_with_condition | MONDO:0013627 |
| HGNC:6936 | biolink:gene_associated_with_condition | MONDO:0008861 |
| HGNC:6937 | biolink:gene_associated_with_condition | MONDO:0008862 |
| HGNC:4799 | biolink:gene_associated_with_condition | MONDO:0017715 |

You likely have a way that you'd prefer to subset the tsv files as part of a pipeline, but just to show it quickly, here it is as an sqlite3 one-liner, and I'll attach the file:

sqlite3 monarch-kg.db -cmd ".mode tabs" -cmd ".headers on" "select * from edges where category = 'biolink:GeneToDiseaseAssociation'" > gene_disease.tsv

gene_disease.tsv.gz

Finally, we are still in an awkward position between the old system, which is becoming outdated, and the new one, which is under development and naturally still has bugs to be discovered. Unfortunately, I noticed that within the file I attached, there are 127 rows where there is an HGNC CURIE in the disease column (object, for these associations). I created an issue for this bug, and we'll get it fixed ASAP.

@chapplec

Sorry, @kevinschaper I just saw this! We'll have a look and see if we can work with what you've given us. Thanks for responding!

@kevinschaper
Member

I can give a little bit of an update on the odd G2D associations too: we have MONDO to MONDO associations getting created as gene-to-disease, as well as HGNC to HGNC. That investigation is happening in monarch-initiative/monarch-app#721.

One thing that you can do with those records is look at the original_subject & original_object columns, which show the values one step back, before they go through our ID mapping process. The safest thing to do, though, is probably to exclude them.
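
A sketch of the exclusion approach, under the assumption that a valid gene-to-disease row has an HGNC CURIE as subject and a MONDO CURIE as object (which matches the sample rows earlier in this thread, but is not a documented guarantee). It expects a tab-delimited file with a header containing subject and object columns, such as gene_disease.tsv; the function name and the default prefixes are hypothetical and meant to be adjusted.

```shell
# Hypothetical helper: from a tab-delimited edges file with a header row,
# keep only rows whose subject starts with one prefix and whose object
# starts with another; rejected rows are echoed to stderr for review.
# Usage: keep_expected_prefixes <file.tsv> [subject_prefix] [object_prefix]
keep_expected_prefixes() {
  awk -F'\t' -v sp="${2:-HGNC:}" -v op="${3:-MONDO:}" '
    NR == 1 {
      # Locate the subject and object columns by header name.
      for (i = 1; i <= NF; i++) {
        if ($i == "subject") s = i
        if ($i == "object")  o = i
      }
      if (!s || !o) { print "missing subject/object column" > "/dev/stderr"; exit 1 }
      print
      next
    }
    # A row passes only when both CURIEs begin with the expected prefix.
    index($s, sp) == 1 && index($o, op) == 1 { print; next }
    { print "excluded: " $0 > "/dev/stderr" }
  ' "$1"
}

# Example:
#   keep_expected_prefixes gene_disease.tsv > gene_disease.clean.tsv 2> excluded.log
```

Sending the rejected rows to stderr keeps them available for flagging in a downstream system instead of dropping them silently.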
