import pandas as pd
import matplotlib.pyplot as plt
import os
import numpy as np
TF Binding prediction project
TF binding prediction model
The goal of this project is to create a neural network model that predicts TF binding strength in a DNA sequence.
To do this, we have extracted 300 base pair-long DNA sequences that have a predicted binding site(s) from a TF, from a couple of chromosomes.
The training data is the following:
The sequences files:
chr#_sequences.txt.gz
store the 300 bp-long DNA sequences. A “window_name” in the formatchr#_start_end
has been assigned to each one.The scores files:
chr#_scores.txt.gz
store a 300 bases long vector for each DNA sequence. Each position in these vectors correspond to a the sequence position. The values for each position represent the “binding score” that was predicted to that site by Homer, which is a widely used tool to discover motif binding sites for a given TF across the genome.
1. Read-in the data
The data files for a couple of chromosomes are stored in the following link. Download them to your local folder.
Let’s explore how the sequence and score data look like:
Define data paths, this should be changed to your personal paths:
= '/Users/sofiasalazar/Library/CloudStorage/Box-Box/imlab-data/Courses/AI-in-Genomics-2025/'
PROJECT = os.path.join(PROJECT, 'data') DATA
Sequence data for chromosome 22
= pd.read_csv(os.path.join(DATA, 'chr22_sequences.txt.gz'), sep="\t", compression='gzip') sequences
sequences.head()
sequence | window_name | |
---|---|---|
0 | GCAAGACTCAGTCTCAAGGAAAAAAAAAAGCTCGAAAAATGTTTGC... | chr22_10510500_10510799 |
1 | AATCAAAAAGAATATTAGAAAACAAGCTGACAAAAAAATAAAAAAA... | chr22_10512900_10513199 |
2 | AGAAAAAGATATAAAGGCATCCAAATTGGAAAGGAAGAAGTAAGTA... | chr22_10514100_10514399 |
3 | CAAATGGATTGAAGACTTAAATGTAAGAACTAAAGCTGTAAAACTA... | chr22_10515300_10515599 |
4 | AAAATAGACCTACCATATGATGCAGCAATCCCACTTGTGGGCATTT... | chr22_10515900_10516199 |
sequences.shape
(23139, 2)
TF binding scores for chromosome 22
Here, each column has 300 values for each sequence, each value is the TF binding score for each position of the sequence. Most positions have ‘0’ as no motif is predicted to bind at those positions. One motif is a couple of bp-long and all of those bp will have the same score since they belong to the same motif.
= pd.read_csv(os.path.join(DATA, 'chr22_scores.txt.gz'), sep="\t", compression='gzip') scores
0]) np.array(scores.iloc[:,
array([0. , 0. , 0. , 0. , 0. , 0. ,
0. , 0. , 0. , 0. , 0. , 0. ,
0. , 0. , 0. , 0. , 0. , 0. ,
0. , 0. , 0. , 0. , 0. , 0. ,
0. , 0. , 0. , 0. , 0. , 0. ,
0. , 0. , 0. , 0. , 0. , 0. ,
0. , 0. , 0. , 9.708916, 9.708916, 9.708916,
9.708916, 9.708916, 9.708916, 9.708916, 9.708916, 9.708916,
9.708916, 0. , 0. , 0. , 0. , 0. ,
0. , 0. , 0. , 0. , 0. , 0. ,
0. , 0. , 0. , 0. , 0. , 0. ,
0. , 0. , 0. , 0. , 0. , 0. ,
0. , 0. , 0. , 0. , 0. , 0. ,
0. , 0. , 0. , 0. , 0. , 0. ,
0. , 0. , 0. , 0. , 0. , 0. ,
0. , 0. , 0. , 0. , 0. , 0. ,
0. , 0. , 0. , 0. , 0. , 0. ,
0. , 0. , 0. , 0. , 0. , 0. ,
0. , 0. , 0. , 0. , 0. , 0. ,
0. , 0. , 0. , 0. , 0. , 0. ,
0. , 0. , 0. , 0. , 0. , 0. ,
0. , 0. , 0. , 0. , 0. , 0. ,
0. , 0. , 0. , 0. , 0. , 0. ,
0. , 0. , 0. , 0. , 0. , 0. ,
0. , 0. , 0. , 0. , 0. , 0. ,
0. , 0. , 0. , 0. , 0. , 0. ,
0. , 0. , 0. , 0. , 0. , 0. ,
7.859208, 7.859208, 7.859208, 7.859208, 7.859208, 7.859208,
7.859208, 7.859208, 7.859208, 7.859208, 0. , 0. ,
0. , 0. , 0. , 0. , 0. , 0. ,
0. , 0. , 0. , 0. , 0. , 0. ,
0. , 0. , 0. , 0. , 0. , 0. ,
0. , 0. , 0. , 0. , 0. , 0. ,
0. , 0. , 0. , 0. , 0. , 0. ,
7.693852, 7.693852, 7.693852, 7.693852, 7.693852, 7.693852,
7.693852, 7.693852, 7.693852, 7.693852, 0. , 0. ,
0. , 0. , 0. , 0. , 0. , 0. ,
0. , 0. , 0. , 0. , 0. , 0. ,
0. , 0. , 0. , 0. , 0. , 0. ,
0. , 0. , 0. , 0. , 0. , 0. ,
0. , 0. , 0. , 0. , 0. , 0. ,
0. , 0. , 0. , 0. , 0. , 0. ,
0. , 0. , 0. , 0. , 0. , 0. ,
0. , 0. , 0. , 0. , 0. , 0. ,
0. , 0. , 0. , 0. , 0. , 0. ,
0. , 0. , 0. , 0. , 0. , 0. ,
0. , 0. , 0. , 0. , 0. , 0. ,
0. , 0. , 0. , 0. , 0. , 0. ,
0. , 0. , 0. , 0. , 0. , 0. ,
0. , 0. , 0. , 0. , 0. , 0. ])
Take a look at how these score vectors look like, the blue sections represent the predicted binding sites for this TF:
= np.arange(300)
x = 0.4
bar_width =(12, 5))
plt.figure(figsize- bar_width, scores.iloc[:, 0], width=bar_width, label="Predicted", alpha=0.7, color='b')
plt.bar(x "Position in sequence window")
plt.xlabel("Homer score")
plt.ylabel(='y', linestyle='--', alpha=0.6)
plt.grid(axis plt.show()
2. Model training
Now, the goal is to use these sequences to train a neural network model that predicts the scores vectors. Overall, the structure of the code should more or less follow these steps:
One-hot-encode the DNA sequences
Split sequences and their corresponding scores into training, test and validation sets
Build dataloaders for the training and test sets using sequences as predictor features and the scores as targets
Define a NN model architecture
Train the model
Test the model on the test sequences
This process will be iterative as you find an optimal set of hyperparameters. Please share your best-performing model and we will test it on a set of held-out-data.
Additional notes
Note how the scores are values greater than 1, you can try binarizing these values so they are between 0 and 1 and compare between models
To assess performance, you can use the following code to compute correlations between predicted scores and the ground truth
from scipy.stats import pearsonr
def plot_comparison(pred, obs):
= pearsonr(pred, obs)
r_value = np.arange(len(pred))
x = 0.4
bar_width =(12, 5))
plt.figure(figsize- bar_width, pred, width=bar_width, label="Predicted", alpha=0.7, color='b')
plt.bar(x + bar_width, obs, width=bar_width, label="Observed", alpha=0.7, color='r')
plt.bar(x "Position sequence window")
plt.xlabel("Value")
plt.ylabel("Comparison of sequence scores")
plt.title(=f"Pearson R: {r_value:.2f}")
plt.legend(title='y', linestyle='--', alpha=0.6)
plt.grid(axis plt.show()