GeoBoost2 (part of the ZoDo project) is a system for extracting the location of infected hosts (LOIH) for a given set of GenBank metadata accessions. This involves API implementations that utilize NCBI API's for accessing and parsing information and delivering it to the researcher interested in investigating a set of sequences. You can also use this tool mine Geographic Locations in scientific articles (available as abstracts on PubMed or PMC-OA) or text files. The online version of the ZoDo system is available here
The implementation accepts the following inputs through API's:
- GenBank Accession identifiers (see option
genbank
) - PubMed IDs/ PubMedCentral IDs (see option
pubmed
) - Plain Texts (see option
text
)
GeoBoost2 relies on a few external packages for use:
- The entire code is based on python. So, firstly install
python 3
(tested with version 3.7). The system also depends on a few python packages that can be installed using the command:
pip install --upgrade -r requirements.txt
- Next, clone and install zoophy-geonames and start the service by following the instructions listed there. We use Geonames for disabmiguation and normalization.
- A file containing word embeddings i.e word vectors that can be loaded using the gensim model. You can download word embeddings trained on PubMed and Wikipedia articles(wikipedia-pubmed-and-PMC-w2v.bin) from http://bio.nlplab.org/ and place the bin file in the
resources
directory.
The GeoBoost2 tool can be run in a standalone mode on three types of inputs. To explore the options, use the -h
option i.e.
python run.py -h
GenBank accessions (in a list) : This option downloads GenBank accessions and linked PubMed/PMC articles and extracts the location of infected hosts. GeoBoost2 can process GenBank accessions in a list loaded from a file.
python run.py genbank examples/ebola_accessions.txt out/ebola/
It produces 2 files as output.
Locations.txt
containing the best LOIHs for a given accessionConfidence.txt
containing multiple possible LOIHs for a given accession along with their respective probabilities To explore additional options, use the-h
option i.e.python run.py genbank -h
Running GeoBoost2 for extracting geographic locations from scientific articles (PubMed-IDs / PMC-IDs)
python run.py pubmed examples/ebola_pubmedids.txt out/ebola/
It produces outputs in a single file.
Locations.txt
containing geographic locations mentioned in the articles To explore additional options, use the-h
option i.e.python run.py pubmed -h
python run.py pubmed examples/ebola_pubmedids.txt out/ebola/
It produces outputs in a single file.
Locations.txt
containing geographic locations mentioned in the article To explore additional options, use the-h
option i.e.python run.py text -h
GeoBoost2 can run as a service and can be accessed through API. To start the server
python server.py
To explore additional options, use the -h
option i.e. python server.py text -h
If you wish to use the API, follow the examples below:
- Type: GET
- Path: /accession?text=<comma separated Genbank Accessions>
- Example: https://zodo.asu.edu/geoboost2/accession?text=GQ457496,KU497555,AB110657
- Type: GET
- Path: /pubmed?text=<comma separated PubMed ids OR comma separated PMC ids with prefix PMC>
- Example: https://zodo.asu.edu/geoboost2/pubmed?text=10325350,10074191
- Example: https://zodo.asu.edu/geoboost2/pubmed?text=PMC84989,PMC104101
- Type: POST (Data in JSON)
- Path: /resolve
- JSON: "The virus was found in Springfield, Illinois..."
- Example: https://zodo.asu.edu/geoboost2/resolve
Tests can be run using:
pytest tests/
The following fuctions are for internal use only. Additional packages are required for performing these operations.
sudo apt-get install libpq-dev
pip install psycopg2
To insert records into ZooPhy database run the following:
python maintain.py insert
To explore additional options, use the -h
option i.e. python maintain.py insert -h
To train the NER model for extracting geographic locations from texts run the following:
python maintain.py train_ner
To explore additional options, use the -h
option i.e. python maintain.py insert -h
If you find this work useful you can cite it using:
@article{maggeISMB2018bidirectional,
title={Bi-directional Recurrent Neural Network Models for Geographic Location Extraction in Biomedical Literature},
author={Magge, Arjun and Weissenbacher, Davy and Sarker, Abeed and Scotch, Matthew and Gonzalez-Hernandez, Graciela},
journal={Pacific Symposium on Biocomputing},
publisher={World Scientific Publishing Company},
year={2018}
}