Welcome to EPBD-BERT’s documentation!#
This repository corresponds to the article titled “Advancing Transcription Factor Binding Site Prediction Using DNA Breathing Dynamics and Sequence Transformers via Cross Attention”.
Understanding the impact of genomic variants on transcription factor binding and gene regulation remains a key area of research, with implications for unraveling the complex mechanisms underlying various functional effects. This software framework examines the role of DNA’s biophysical properties, including thermodynamic stability, shape, and flexibility, in transcription factor (TF) binding. In this library, we have developed a multi-modal deep learning model that integrates these properties with DNA sequence data. Trained on in vivo ChIP-Seq (chromatin immunoprecipitation sequencing) data covering 690 TF-DNA binding events in the human genome, our model significantly improves prediction performance for over 660 binding events, with up to a 9.6% increase in AUROC over a baseline model that does not explicitly use DNA biophysical properties. We further extended our analysis to an in vitro high-throughput Systematic Evolution of Ligands by Exponential enrichment (SELEX) dataset, comparing our model with established frameworks. The inclusion of EPBD features consistently improved TF binding predictions across different cell lines in these datasets. Notably, for complex ChIP-Seq datasets, integrating DNABERT2 with a cross-attention mechanism provided greater predictive capability and insight into the mechanisms of disease-related non-coding variants found in genome-wide association studies. This work highlights the importance of DNA biophysical characteristics in TF binding and the effectiveness of multi-modal deep learning models in gene regulation studies.
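As a rough illustration of the cross-attention idea mentioned above, the sketch below lets token vectors from one modality attend over feature vectors from another. It is not the repository's implementation: the function name `cross_attention`, the toy dimensions, the absence of learned projection matrices, and the choice of which modality supplies the queries are all assumptions made for illustration.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attention(queries, keys, values):
    """Scaled dot-product attention of `queries` over (`keys`, `values`).

    queries: list of d-dimensional vectors from one modality.
    keys, values: lists of vectors from the other modality; keys must
    also be d-dimensional.
    Returns one output vector per query plus the attention weights.
    """
    d = len(queries[0])
    outputs, weights = [], []
    for q in queries:
        # Similarity of this query to every key, scaled by sqrt(d).
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d) for k in keys]
        w = softmax(scores)
        # Weighted average of the value vectors.
        out = [sum(wj * v[c] for wj, v in zip(w, values)) for c in range(len(values[0]))]
        outputs.append(out)
        weights.append(w)
    return outputs, weights


# Toy inputs (hypothetical numbers, not real embeddings or EPBD features).
seq_tokens = [[1.0, 0.0], [0.0, 1.0]]
epbd_feats = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
fused, attn = cross_attention(seq_tokens, epbd_feats, epbd_feats)
```

Each row of `attn` sums to one, so every fused vector is a convex combination of the second modality's vectors, weighted by their similarity to the corresponding query.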
Resources#
Installation#
# Install into a conda virtual environment
git clone https://github.com/lanl/EPBD-BERT.git
cd EPBD-BERT
conda create -c conda-forge -p .venvs/epbd_bert_condavenv_test1 python=3.11 -y
conda activate .venvs/epbd_bert_condavenv_test1
python setup.py install
conda install -c conda-forge scikit-learn scipy -y
pip uninstall triton # We did not use triton because of its underlying hardware dependency
# To deactivate and remove the venv
conda deactivate
conda remove -p .venvs/epbd_bert_condavenv_test1 --all -y
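After installation, a quick check along the following lines may help confirm that the environment looks as expected. This is only a sketch: the helper names are hypothetical, `epbd_bert` is assumed to be the importable package name (it appears in the run command later in this document), and `sklearn` is the import name of scikit-learn.

```python
import importlib.util


def is_installed(module_name):
    """Return True if the top-level module can be found by this interpreter."""
    return importlib.util.find_spec(module_name) is not None


def environment_report(modules=("epbd_bert", "sklearn", "scipy", "triton")):
    """Map each module name to whether it is importable."""
    return {name: is_installed(name) for name in modules}


# triton is expected to be reported as missing after the uninstall step.
for name, found in environment_report().items():
    print(f"{name}: {'found' if found else 'missing'}")
```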
Data Preprocessing Steps#
The ‘data_preprocessing’ directory holds all the data generation steps, divided into modules for data generation and bug tracking. We used [bedtools](https://bedtools.readthedocs.io/en/latest/) for genome operations; to install it, follow the [bedtools installation guide](https://bedtools.readthedocs.io/en/latest/content/installation.html). We also provide a minimal script that downloads the pre-compiled bedtools binary into the bedtools directory:
bash setup_bedtools.sh
export PATH=$PATH:$(pwd)/bedtools
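To confirm that the exported PATH actually exposes bedtools before running the preprocessing scripts, a small check such as the one below could be used. The helper names are hypothetical; `bedtools --version` is the standard bedtools version flag.

```python
import shutil
import subprocess


def find_bedtools():
    """Return the full path of the bedtools executable, or None if absent."""
    return shutil.which("bedtools")


def bedtools_version(executable):
    """Run `bedtools --version` and return its output as text."""
    result = subprocess.run(
        [executable, "--version"], capture_output=True, text=True, check=True
    )
    return result.stdout.strip()


path = find_bedtools()
if path is None:
    print("bedtools not found on PATH")
else:
    print(path, bedtools_version(path))
```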
| Step | Scripts |
|---|---|
| Download human genome assembly (GRCh37/hg19) and uniform TFBS | |
| Preprocess TFBS narrowpeak files and human genome | |
| Overlapping computation for label association | |
| Label association | |
| Data preprocessing for DNA breathing dynamics generation and DNABERT2 | |
| Train/validation/test split | |
| Associating numeric values for each label | |
| Further processing on negative regions | |
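The overlap, label-association, and numeric-encoding steps listed above can be pictured roughly as follows. This is a sketch under stated assumptions, not the repository's scripts (which rely on bedtools): the function names, the half-open interval convention, the 50% overlap threshold, and the toy peaks and labels are all hypothetical.

```python
def overlap_length(a_start, a_end, b_start, b_end):
    """Length of the overlap between half-open intervals [a_start, a_end) and [b_start, b_end)."""
    return max(0, min(a_end, b_end) - max(a_start, b_start))


def associate_labels(bin_start, bin_end, peaks, min_fraction=0.5):
    """Return the sorted labels whose peaks cover at least `min_fraction` of the bin.

    peaks: iterable of (start, end, label) on the same chromosome as the bin.
    min_fraction is an assumed threshold, not a value taken from the paper.
    """
    bin_len = bin_end - bin_start
    hits = {
        label
        for start, end, label in peaks
        if overlap_length(bin_start, bin_end, start, end) >= min_fraction * bin_len
    }
    return sorted(hits)


def to_multi_hot(labels, label_to_index):
    """Encode a list of labels as a 0/1 vector using a label -> index mapping."""
    vector = [0] * len(label_to_index)
    for label in labels:
        vector[label_to_index[label]] = 1
    return vector


# Hypothetical peaks on one chromosome: (start, end, label).
peaks = [(100, 260, "TF_A"), (180, 210, "TF_B"), (150, 400, "TF_C")]
label_to_index = {"TF_A": 0, "TF_B": 1, "TF_C": 2}

labels = associate_labels(100, 300, peaks)
encoded = to_multi_hot(labels, label_to_index)
print(labels, encoded)
```

In this toy case the 200 bp bin is covered for 160 bp by the first peak and 150 bp by the third, so those two labels are kept, while the 30 bp overlap with the second peak falls below the assumed threshold.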
Preprocessed dataset loading#
The preprocessed dataset will be made available for download (link to be provided).
| Dataset Module | Usage |
|---|---|
| | Loads sequence only dataset |
| | Loads sequence and EPBD (flat) features |
| | Loads sequence and EPBD (matrix) features |
Note: Other dataset modules are also available. Each module includes example run instructions at the bottom.
Training and testing the developed models#
| Model Module | Usage |
|---|---|
| DNABERT2-finetuned | |
| | Train DNABERT2 using train/validation split |
| | Test finetuned DNABERT2 on test split |
| VanillaEPBD-DNABERT2-coordflip | |
| | Train VanillaEPBD-DNABERT2 using train/validation split |
| | Test VanillaEPBD-DNABERT2 on test split |
| EPBDxDNABERT-2 | |
| | Train EPBDxDNABERT-2 using train/validation split |
| | Test EPBDxDNABERT-2 on test split |
Note: Details of each model, along with the ablation studies, can be found in the paper. To run a train or test module, invoke it with python -m, for example: python -m epbd_bert.dnabert2_classifier.test
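The `python -m` invocation in the note above can also be scripted, for instance when running several modules in sequence. The helpers below are a hypothetical convenience wrapper, not part of the package; the only name taken from this document is the module path `epbd_bert.dnabert2_classifier.test`.

```python
import subprocess
import sys


def module_command(module, extra_args=()):
    """Build the argument list for running `module` with the current interpreter."""
    return [sys.executable, "-m", module, *extra_args]


def run_module(module, extra_args=()):
    """Run the module as a subprocess and return its exit code."""
    return subprocess.run(module_command(module, extra_args)).returncode


# Only build and show the command here; calling run_module(...) would execute it,
# which requires the installed package and the preprocessed data.
cmd = module_command("epbd_bert.dnabert2_classifier.test")
print(" ".join(cmd))
```

Using `sys.executable` keeps the subprocess in the same conda environment as the calling script.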
Acknowledgements#
Los Alamos National Lab (LANL), T-1
Copyright notice#
© 2024. Triad National Security, LLC. All rights reserved. This program was produced under U.S. Government contract 89233218CNA000001 for Los Alamos National Laboratory (LANL), which is operated by Triad National Security, LLC for the U.S. Department of Energy/National Nuclear Security Administration. All rights in the program are reserved by Triad National Security, LLC, and the U.S. Department of Energy/National Nuclear Security Administration. The Government is granted for itself and others acting on its behalf a nonexclusive, paid-up, irrevocable worldwide license in this material to reproduce, prepare derivative works, distribute copies to the public, perform publicly and display publicly, and to permit others to do so. LANL O#4717
License#
This program is Open-Source under the BSD-3 License.
Redistribution and use in source and binary forms, with or without modification, are permitted provided that the following conditions are met:
Redistributions of source code must retain the above copyright notice, this list of conditions and the following disclaimer.
Redistributions in binary form must reproduce the above copyright notice, this list of conditions and the following disclaimer in the documentation and/or other materials provided with the distribution.
Neither the name of the copyright holder nor the names of its contributors may be used to endorse or promote products derived from this software without specific prior written permission.
THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS “AS IS” AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT HOLDER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
How to cite EPBD-BERT?#
@article{kabir2024advancing,
title = {Advancing Transcription Factor Binding Site Prediction Using DNA Breathing Dynamics and Sequence Transformers via Cross Attention},
author = {Kabir, Anowarul and
Bhattarai, Manish and
Rasmussen, Kim {\O} and
Shehu, Amarda and
Bishop, Alan R and
Alexandrov, Boian and
Usheva, Anny},
journal = {bioRxiv},
pages = {2024--01},
year = {2024},
publisher= {Cold Spring Harbor Laboratory},
doi = {10.5281/zenodo.11130474},
url = {https://www.biorxiv.org/content/10.1101/2024.01.16.575935v2}
}