

dppkb

Disease Pathophysiology Knowledge Base FOR DEMO PURPOSES

This repo contains a mostly automated demo KB of diseases, their pathophysiology, treatments, etiology, etc., generated using DRAGON-AI/CurateGPT.

The KB is created via a cycle:

  1. A human expert creates one or two seed entries (see the sketch after this list)
  2. New entries are generated from the latent knowledge of the LLM
  3. PubMed is searched for supporting/refuting evidence on a per-assertion basis
  4. An LLM acts as a critic, guided by a human, to continually refine the entries
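
For illustration, a seed entry might look like the following. This is a minimal sketch: the schema is invented on the fly, so every field name here is hypothetical.

name: Asthma
description: A chronic inflammatory disease of the airways characterized by reversible airflow obstruction.
etiology:
  - genetic predisposition
  - environmental allergen exposure
pathophysiology:
  - airway inflammation
  - bronchial hyperresponsiveness
treatments:
  - inhaled corticosteroids
  - short-acting beta agonists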

Website

https://monarch-initiative.github.io/dppkb

Click on "Diseases" to browse the "Knowledge Base". You will see a highly generic rendering of auto-generated disease entries.

What is this?

This is an experiment in using CurateGPT for de novo, human-driven knowledge base curation.

The general workflow is:

  1. A human writes some sample YAML files for a few entries
    • the schema can be invented "on the fly"
  2. Iterate using claude.ai
    • ask it to suggest other fields
    • use the samples as a template to create more entries
  3. Save as a .yaml file
  4. Iterate with curate-gpt
    • the complete command generates a new entry
    • the citeseek command adds supporting/refuting evidence from PubMed
    • the update command enriches specific fields
    • the review command uses the LLM as a critic and suggests changes

Files

Details

Create a CurateGPT index

Run

make index

This should be run periodically; it builds a local ChromaDB index that is used for RAG.

Note: this loads a pre-processed version of the KB with the evidence removed; evidence is hidden when doing RAG to avoid the LLM hallucinating publications.
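
For reference, the make target presumably wraps a curategpt indexing call along these lines (a sketch: the evidence-stripped staging filename is an assumption, while the db path and disease collection are taken from the review commands further down):

curategpt index -p db -c disease tmp/kb-no-evidence.yaml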

Generate a new entity

Run this:

make tmp/complete-Tuberculosis.yaml

This uses RAG/DRAGON-AI to generate a candidate entry. You can then copy it into kb/dppkb.yaml as-is, tweak it manually first, or ask Claude to tweak it.

The idea is that as the KB is incrementally built up with high-quality examples, there will be less need for manual tweaking; RAG alone will be good enough.

Also recall that entries can be further enhanced in later steps.

NOTE: This step does not use PubMed directly. We rely on the fact that the LLM has already ingested and compressed the literature, and can do a pretty good first-pass job of re-exporting it in any format we like. It doesn't have to be perfect; subsequent steps are designed to refine it.
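
As an illustration, the candidate entry for Tuberculosis might come back looking something like this (hypothetical output; the actual fields follow whatever schema the KB has accumulated):

name: Tuberculosis
description: A chronic infectious disease caused by Mycobacterium tuberculosis, most often affecting the lungs.
etiology:
  - infection with Mycobacterium tuberculosis
pathophysiology:
  - inhalation of bacilli and infection of alveolar macrophages
  - granuloma formation containing the infection
treatments:
  - isoniazid
  - rifampin
  - pyrazinamide
  - ethambutol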

Adding evidence

make tmp/with-evidence.yaml

This runs CurateGPT citeseek over all assertions; for any assertion without an evidence tag, it queries PubMed for supporting/refuting evidence.
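
The result is that assertions gain evidence annotations. The shape might be something like the following (a sketch: the exact structure citeseek emits is an assumption, and the PMID is a placeholder, not a real citation):

pathophysiology:
  - mechanism: granuloma formation containing the infection
    evidence:
      - reference: PMID:<pubmed-id>
        supports: true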

Periodic Review

It is recommended to periodically inspect the file in the role of lead curator, and to ask the LLM for reviews.

Either global reviews:

curategpt review --model gpt-4o -p db -c disease "{}" -t patch --primary-key name > tmp/review.patch.yaml

Or focused, e.g. if you want pathophysiology to be fleshed out:

curategpt -vv review --model gpt-4o -p db -c disease "{}" -Z pathophysiology -P name -t patch --primary-key name --rule "include as many mechanisms and molecular steps as you can" > tmp/pathophys-review.yaml

The result is a patch file. This can be manually examined, edited, and applied:

curategpt apply-patch --patch tmp/patch.yaml --primary-key name kb/dppkb.yaml > tmp/patched.kb.yaml

Do a diff, then move the patched file into place (see the example below).
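
For example, assuming the patched file looks good:

diff kb/dppkb.yaml tmp/patched.kb.yaml
mv tmp/patched.kb.yaml kb/dppkb.yaml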

YAML normalization

There are different ways to write the same YAML. Ensure the KB representation is normalized:

make normalize
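
For example, the following two fragments encode identical data but differ textually; normalization ensures only one form ever appears in the KB (which conventions the normalizer picks is an implementation detail):

treatments: ["isoniazid", "rifampin"]

treatments:
  - isoniazid
  - rifampin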

Linking to ontology term IDs

Currently we use labels rather than IDs, as labels are easier both for humans reviewing the YAML and for LLMs.

Grounding is expected to be trivial and highly reliable; a simple mappings field will be added to every entry.
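
A grounded entry might then look like this (a sketch: the mappings field name is taken from the sentence above, and the MONDO ID is illustrative):

name: Tuberculosis
mappings:
  - MONDO:0018076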

End to end automation

TODO

Running the app

make app

This will launch a Streamlit app where you can chat with the KB, visualize clusters, etc.

Clustering

Ask a question:

(screenshot)

See the results clustered:

(screenshot)

Chat

(screenshot)

Results:

(screenshots)