gcn4epi

Graph Convolutional Networks for Prediction of Enhancer-Promoter Interactions

Environment Setup

Create and activate a fresh virtual environment:

conda create -n gcn_env python=3.7
conda activate gcn_env
pip install --upgrade pip

Install the pcdhit and cd-hit packages:

git clone https://github.com/simomarsili/pcdhit.git
python pcdhit/setup.py install
git clone https://github.com/weizhongli/cdhit.git
cd cdhit
make openmp=no
realpath cd-hit /usr/local/bin | xargs sudo ln -s
cd ..

Install all required packages:

cd gcn4epi
# Replace the example path below with the location of your own pcdhit clone
export PYTHONPATH="/home/darg1/Desktop/samet/pcdhit/"
pip install -r requirements.txt
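
Before moving on, it can help to confirm that the setup steps above took effect. The following is a minimal sketch, not part of the repository: it assumes the binary is reachable as `cd-hit` on `PATH` (as the symlink step suggests) and that the Python package is importable as `pcdhit`. The function name `check_environment` is hypothetical.

```python
import importlib.util
import shutil


def check_environment(executable="cd-hit", module="pcdhit"):
    """Report whether the cd-hit binary and the pcdhit module can be found.

    Both default names are assumptions based on the install steps above.
    """
    return {
        # shutil.which returns the executable's full path, or None if absent.
        executable: shutil.which(executable) is not None,
        # find_spec returns None when a top-level module is not importable.
        module: importlib.util.find_spec(module) is not None,
    }


if __name__ == "__main__":
    for name, found in check_environment().items():
        print(f"{name}: {'found' if found else 'MISSING'}")
```

If `pcdhit` is reported as missing, the `PYTHONPATH` export above is the first thing to check.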

Running Instructions

Run the prepare_data.py, split_data.py, and train_test.py modules, in that order.

Example:

python prepare_data.py --cell_line='GM12878' --cross_cell_line='K562' --k_mer=5 --label_rate=0.2 --label=1 --seed=42 --from_scratch --balanced
python split_data.py --cell_line='GM12878' --cross_cell_line='K562' --k_mer=5 --label_rate=0.2 --seed=42
python train_test.py --cell_line='GM12878' --cross_cell_line='K562' --k_mer=5 --label_rate=0.2 --label=1 --seed=42
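
The same sequence can be driven from Python, which is convenient when looping over several cell lines. This is only a sketch: the script names and flags are copied from the example above, while `build_pipeline`, `run_pipeline`, and the `regenerate` switch are hypothetical helpers that do not exist in the repository.

```python
import subprocess
import sys


def build_pipeline(cell_line, cross_cell_line=None, k_mer=5, label_rate=0.2,
                   label=1, seed=42, regenerate=False):
    """Return the example commands as argument lists, in execution order."""
    common = [f"--cell_line={cell_line}"]
    if cross_cell_line is not None:
        common.append(f"--cross_cell_line={cross_cell_line}")
    common += [f"--k_mer={k_mer}", f"--label_rate={label_rate}", f"--seed={seed}"]

    commands = []
    if regenerate:
        # Only needed when the feature, node, label, and graph files must be rebuilt.
        commands.append([sys.executable, "prepare_data.py", *common,
                         f"--label={label}", "--from_scratch", "--balanced"])
    commands.append([sys.executable, "split_data.py", *common])
    commands.append([sys.executable, "train_test.py", *common, f"--label={label}"])
    return commands


def run_pipeline(commands, dry_run=True):
    """Print each command; execute it only when dry_run is False."""
    for command in commands:
        print(" ".join(command))
        if not dry_run:
            # check=True stops the pipeline as soon as one step fails.
            subprocess.run(command, check=True)


if __name__ == "__main__":
    run_pipeline(build_pipeline("GM12878", cross_cell_line="K562"))
```

The dry run is the default so that the commands can be inspected before anything is executed.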

⚠️ prepare_data.py has already been run for each cell line with the default parameters listed below. You don't have to run it again unless you need to regenerate the feature, node, label, and graph files, which takes 1-2 hours in total for all cell lines. Changing the seed does not require a rerun, but changing any other parameter does.

⚠️ The defaults are --frag_len=200 --k_mer=5 --label_rate=0.2 --seed=42.

⚠️ Omit --cross_cell_line to test on the same cell line.
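
To make the defaults and the --cross_cell_line behaviour concrete, here is an illustrative argparse sketch. It mirrors the flag names and default values noted above but is not the repository's actual parser; `build_parser` and `evaluation_cell_line` are hypothetical names, and treating --cell_line as required is an assumption.

```python
import argparse


def build_parser():
    """Hypothetical parser mirroring the flags and defaults noted above."""
    parser = argparse.ArgumentParser(description="Illustrative gcn4epi-style options")
    parser.add_argument("--cell_line", required=True)        # assumed to be mandatory
    parser.add_argument("--cross_cell_line", default=None)   # unset: same cell line
    parser.add_argument("--frag_len", type=int, default=200)
    parser.add_argument("--k_mer", type=int, default=5)
    parser.add_argument("--label_rate", type=float, default=0.2)
    parser.add_argument("--seed", type=int, default=42)
    return parser


def evaluation_cell_line(args):
    """Cell line used for testing: the cross cell line if given, else the training one."""
    return args.cross_cell_line or args.cell_line


if __name__ == "__main__":
    args = build_parser().parse_args(["--cell_line=GM12878"])
    print(args, "-> test on", evaluation_cell_line(args))
```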

Data Requirements

Download the Human Genome GRCh37 assembly from Human Genome Resources at NCBI and place it under the data/ directory. Example: data/GRCh37_latest_genomic.fna
Run the prepare_data.py and split_data.py modules, as in the example above, to prepare the data files required by the train_test.py module.

File Name	Description
lx_20.index	the indices (IDs) of labeled train instances as list object (for label_rate = 20%)
ux_20.index	the indices (IDs) of unlabeled train instances as list object (for label_rate = 20%)
vx_20.index	the indices (IDs) of validation instances as list object (for label_rate = 20%)
tx_20.index	the indices (IDs) of test instances as list object (for label_rate = 20%)
features_5mer	the feature vectors of all instances as scipy.sparse.csr.csr_matrix object (for k_mer = 5)
nodes	a dict in the format {chromosome_name: ID} as collections.defaultdict object
labels	the one-hot labels of all instances as numpy.ndarray object
graph	a dict in the format {ID: [IDs_of_neighbor_nodes]} as collections.defaultdict object
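
The shapes of these objects can be illustrated with a small, dependency-free sketch. It is an assumption about how k-mer features of this kind are typically computed (overlapping k-mer counts over the alphabet A, C, G, T), not the repository's implementation. Plain lists stand in for the scipy sparse matrix and the numpy array, and `kmer_vector`, `one_hot`, and the toy edges are hypothetical.

```python
from collections import Counter, defaultdict
from itertools import product


def kmer_vector(sequence, k=5):
    """Count each of the 4**k DNA k-mers in `sequence` using overlapping windows."""
    sequence = sequence.upper()
    vocabulary = ["".join(p) for p in product("ACGT", repeat=k)]
    counts = Counter(sequence[i:i + k] for i in range(len(sequence) - k + 1))
    # Windows containing other symbols (e.g. N) are not in the vocabulary
    # and therefore do not contribute to the vector.
    return [counts[kmer] for kmer in vocabulary]


def one_hot(label, num_classes=2):
    """One-hot encode an integer class label."""
    return [1 if i == label else 0 for i in range(num_classes)]


# Toy graph in the {ID: [IDs_of_neighbor_nodes]} format.
# Treating edges as undirected is an assumption made for this illustration.
graph = defaultdict(list)
for a, b in [(0, 1), (1, 2)]:
    graph[a].append(b)
    graph[b].append(a)

if __name__ == "__main__":
    print(len(kmer_vector("ACGTACGTAC")))  # one entry per possible 5-mer
    print(dict(graph))
    print(one_hot(1))
```

With k = 5 each instance gets a vector of 4^5 = 1024 counts, most of them zero for short fragments, which is consistent with storing the real feature matrix in a sparse format.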

References

TargetFinder: https://github.com/shwhalen/targetfinder

Planetoid: https://github.com/kimiyoung/planetoid

GCN: https://github.com/tkipf/gcn
