DeepCBA: a deep learning framework for gene expression prediction in maize based on DNA sequence and chromatin interaction
DeepCBA is a high-precision maize gene expression prediction model that combines a convolutional neural network (CNN), a bidirectional long short-term memory network (BiLSTM), and a self-attention mechanism.
For more details, please refer to the DeepCBA publication: DeepCBA: a deep learning framework for gene expression prediction in maize based on DNA sequence and chromatin interaction.
An online version of DeepCBA is also available. Please visit http://www.deepcba.com for more information.
We built a deep learning model called DeepCBA to predict maize gene expression based on chromatin interactions.
DeepCBA includes three modules. The convolutional neural network (CNN) extracts features from the encoded chromatin sequence and reduces their dimensionality. The bidirectional long short-term memory network (BiLSTM) reads the features in both directions and captures dependencies between them by accessing long-range context; in this study it is used to capture distal interactions among chromatin sequence features. The self-attention mechanism captures the contribution of key features to the model.
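To give a feel for how a self-attention step can expose the contribution of individual features, here is a minimal pure-Python sketch of scaled dot-product self-attention. It is a generic illustration, not DeepCBA's actual layer: it assumes the queries, keys, and values are all the raw feature vectors (no learned projections), and the names `softmax` and `self_attention` are hypothetical.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def self_attention(features):
    """Scaled dot-product self-attention with Q = K = V = features.

    Assumption: no learned projection matrices, unlike a trained layer.
    features: list of equal-length vectors, one per sequence position.
    Returns (outputs, weights); weights[i][j] is how strongly position i
    attends to position j, and each row of weights sums to 1.
    """
    dim = len(features[0])
    scale = math.sqrt(dim)
    weights = []
    for query in features:
        # Similarity of this position to every position, scaled by sqrt(dim).
        scores = [
            sum(q * k for q, k in zip(query, key)) / scale for key in features
        ]
        weights.append(softmax(scores))
    # Each output is the attention-weighted average of all feature vectors.
    outputs = [
        [sum(w * value[d] for w, value in zip(row, features)) for d in range(dim)]
        for row in weights
    ]
    return outputs, weights
```

In a sketch like this, positions with similar feature vectors receive larger weights, which is one way to read attention weights as a rough measure of which features matter for a given position.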
The experimental data used in this study comprise published maize chromatin interaction and expression data from three tissues: shoot, ear, and tassel (Peng et al., 2019; Li et al., 2019; Sun et al., 2020).
The chromatin interaction data are divided into two categories based on the type of element that interacts with genes.
- Promoter proximal region interaction (PPI) data, comprising five datasets: Shoot-2, Shoot-1, Ear-1, Ear-2, and Tassel-1.
- Promoter distal region interaction (PDI) data, comprising three datasets: Shoot-2, Shoot-1, and Ear-1.
The average number of interactions is 38595 across the five PPI datasets (Shoot-2, Shoot-1, Ear-1, Ear-2 and Tassel-1) and 12367 across the three PDI datasets (Shoot-2, Shoot-1 and Ear-1). We defined the DNA sequence of a gene as two regions: from 1 kb upstream to 0.5 kb downstream of the transcription start site (TSS), and from 0.5 kb upstream to 1 kb downstream of the transcription termination site (TTS) (Washburn et al., 2019).
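The gene sequence definition above can be sketched as a small coordinate helper. This is an illustrative sketch, not the project's own preprocessing code: the function name `gene_windows`, the 0-based half-open coordinate convention, the strand handling, and the clamping at position 0 are all assumptions.

```python
def gene_windows(tss, tts, strand="+",
                 tss_up=1000, tss_down=500, tts_up=500, tts_down=1000):
    """Return the TSS and TTS windows as 0-based half-open (start, end) pairs.

    Assumption: on the minus strand, "upstream" lies at higher genomic
    coordinates, so the window offsets are mirrored.
    """
    if strand == "+":
        tss_window = (tss - tss_up, tss + tss_down)
        tts_window = (tts - tts_up, tts + tts_down)
    elif strand == "-":
        tss_window = (tss - tss_down, tss + tss_up)
        tts_window = (tts - tts_down, tts + tts_up)
    else:
        raise ValueError("strand must be '+' or '-'")
    # Clamp so windows near the chromosome start never go negative.
    clamp = lambda window: (max(0, window[0]), window[1])
    return clamp(tss_window), clamp(tts_window)


def total_length(windows):
    """Total number of bases covered by the two windows."""
    return sum(end - start for start, end in windows)
```

For a plus-strand gene with TSS at 10000 and TTS at 15000, `gene_windows(10000, 15000)` returns `((9000, 10500), (14500, 16000))`, which is 3 kb of sequence per gene under these assumptions.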
If you run this project on a GPU, please configure CUDA and cuDNN to match the versions below.
| Component | Version |
|---|---|
| CUDA | 11.0 |
| cuDNN | 8.0 |
This project is based on Python 3.8.13. The required environment is as follows:
| Package | Version |
|---|---|
| numpy | 1.19.5 |
| pandas | 1.2.4 |
| tensorflow | 2.4.0 |
| tf-keras-vis | 0.8.4 |
| biopython | 1.79 |
| tqdm | 4.62.3 |
For more required packages, please refer to the requirements.txt file in this project.
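As a convenience, the pinned versions above can be compared against the active environment with a short script. This is a hypothetical helper, not part of the DeepCBA repository; it uses only the standard-library `importlib.metadata` module (available from Python 3.8), and the names `PINNED`, `installed_version`, and `check_environment` are assumptions.

```python
from importlib import metadata

# Versions copied from the table above.
PINNED = {
    "numpy": "1.19.5",
    "pandas": "1.2.4",
    "tensorflow": "2.4.0",
    "tf-keras-vis": "0.8.4",
    "biopython": "1.79",
    "tqdm": "4.62.3",
}


def installed_version(package):
    """Return the installed version string, or None if the package is absent."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


def check_environment(pinned=PINNED, lookup=installed_version):
    """Return {package: (expected, found)} for missing or mismatched packages."""
    problems = {}
    for package, expected in pinned.items():
        found = lookup(package)
        if found != expected:
            problems[package] = (expected, found)
    return problems


if __name__ == "__main__":
    for package, (expected, found) in check_environment().items():
        print(f"{package}: expected {expected}, found {found or 'not installed'}")
```

Running the script prints one line per package that is missing or differs from the pinned version, and prints nothing when the environment matches.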
If you have any questions, requests, or comments, we kindly invite you to contact us at [email protected], [email protected].