import tensorflow as tf
Predicting alternative polyadenylation with Borzoi
Predicting alternative polyadenylation with Borzoi
Overview
Borzoi is a deep learning model that, similarily to Enformer:
Takes DNA sequences as input (one-hot encoded)
Uses a convolutional neural network architecture and transformer layers
Predicts multiple genomic features simultaneously
However, Borzoi was rained to predict RNA-seq coverage as well, which makes it useful for the following aplication: predicting 3’UTR alternative polyadenylation (APA).
3’UTR APA is a post-transcriptional regulatory mechanism that allows transcripts from a gene to have different lengths due to differential polyA signal usage leading to different lengths of 3’UTR regions. Variants on the 3’UTR sequence that alter the length of the region are refered to 3’aQTLs read more here.
Objective
In this notebook we will learn how to use borzoi to predict RNA-seq coverage around one 3’aQTL by predicting on both the reference and the alternative sequence. We will plot the differences on RNA-seq prediction at the 3’UTR for the whole blood borzoi tracks.
1. Set up a borzoi-adequate environment
The requirements for running borzoi are outlined here. The most important dependencies are python ==3.10 and tensorflow==2.15.x.
I had to install python 3.10 and found that working with a venv
worked as of may 5th 2025.
I followed the instructions to use the GPU of my macbook here
brew install python@3.10
python3.10 -m venv ~/borzoiTools
source ~/borzoiTools/bin/activate
python -m pip install -U pip
python -m pip install ipykernel
python3 -m ipykernel install --user --name=borzoiTools
python3 -m pip install tensorflow==2.15
python3 -m pip install tensorflow-metal
To verify python and tensforflow versions, run the following chunks:
= tf.config.experimental.list_physical_devices('GPU')
gpus gpus
[PhysicalDevice(name='/physical_device:GPU:0', device_type='GPU')]
if gpus:
try:
# Currently, memory growth needs to be the same across GPUs
for gpu in gpus:
print(gpu)
True)
tf.config.experimental.set_memory_growth(gpu, print("TensorFlow is using the GPU.")
except RuntimeError as e:
print(e)
else:
print("TensorFlow is not using the GPU. Check your TensorFlow installation.")
PhysicalDevice(name='/physical_device:GPU:0', device_type='GPU')
TensorFlow is using the GPU.
import sys
print(sys.version)
3.10.17 (main, Apr 8 2025, 12:10:59) [Clang 16.0.0 (clang-1600.0.26.6)]
print(tf.__version__)
2.15.0
2. Install borzoi and dependencies
Borzoi depends on the baskerville software, run the following chunks to install both on your environment:
%cd /Users/sofiasalazar/Desktop/IM_lab/borzoi_folder
/Users/sofiasalazar/Desktop/IM_lab/borzoi_folder
/Users/sofiasalazar/borzoiTools/lib/python3.10/site-packages/IPython/core/magics/osm.py:417: UserWarning: This is now an optional IPython functionality, setting dhist requires you to install the `pickleshare` library.
self.shell.db['dhist'] = compress_dhist(dhist)[-100:]
!git clone https://github.com/calico/baskerville.git
%cd baskerville
%pip install -e .
%cd ..
!git clone https://github.com/calico/borzoi.git
%cd borzoi
%pip install -e .
Then I installed other libraries that we will be using as well
%pip install kipoiseq
%pip install h5py
Here I had to restart the kernel to use baskerville
3. Download models
The borzoi models are each ~1GB (~4GB in total as there are 4 models, each correspond to a cross validation fold from the training process). Further down you’ll see that we can choose to predict using the 4 models or only a couple of them.
%cd /Users/sofiasalazar/Desktop/IM_lab/borzoi_folder/borzoi
! ./download_models.sh
4. Set up borzoi and helper functions
The previous steps are run only once, afterwards, we can set up the borzoi functions as well as other utilities to extract the reference sequence and mutate it.
import os
import h5py
import numpy as np
import pandas as pd
import tensorflow as tf
import baskerville
from baskerville import seqnn
from baskerville import gene as bgene
from baskerville import dna
import json
import kipoiseq
import pyfaidx
= tf.config.experimental.list_physical_devices('GPU')
gpus gpus
[PhysicalDevice(name='/physical_device:GPU:0', device_type='GPU')]
= "/GPU:0" DEVICE
Change the following paths to your personal paths
= "/Users/sofiasalazar/Desktop/IM_lab/borzoi_folder/borzoi"
BORZOI_PATH = os.path.join(BORZOI_PATH, "examples/saved_models" )
models_path = "/Users/sofiasalazar/Library/CloudStorage/Box-Box/imlab-data/Reference-Data/ref_sequences/hg38/Homo_sapiens_assembly38.fasta"
fasta_file = os.path.join(BORZOI_PATH, "examples/params_pred.json")
params_file = os.path.join(BORZOI_PATH, "examples/targets_human.txt") targets_file
The following classes will help us extract the reference genome sequence, mutate it and one hot encode it.
class FastaStringExtractor:
def __init__(self, fasta_file):
self.fasta = pyfaidx.Fasta(fasta_file)
self._chromosome_sizes = {k: len(v) for k, v in self.fasta.items()}
#import pd.Interval as Interval
def extract(self, interval: kipoiseq.Interval, snp_position = None, snp_base=None, **kwargs) -> str:
# Truncate interval if it extends beyond the chromosome lengths.
= self._chromosome_sizes[interval.chrom]
chromosome_length = kipoiseq.Interval(interval.chrom,
trimmed_interval max(interval.start, 0),
min(interval.end, chromosome_length),
)
# pyfaidx wants a 1-based interval
= str(self.fasta.get_seq(trimmed_interval.chrom,
sequence + 1,
trimmed_interval.start
trimmed_interval.stop).seq).upper()
# Apply SNP modification if specified
if snp_position and snp_base:
= snp_position - trimmed_interval.start - 1
pos if 0 <= pos < len(sequence):
print(sequence[pos-1:pos+2])
= sequence[:pos] + snp_base + sequence[pos + 1:]
sequence print(sequence[pos-1:pos+2])
# Fill truncated values with N's.
= 'N' * max(-interval.start, 0)
pad_upstream = 'N' * max(interval.end - chromosome_length, 0)
pad_downstream return pad_upstream + sequence + pad_downstream
def close(self):
return self.fasta.close()
def one_hot_encode(sequence):
return kipoiseq.transforms.functional.one_hot_dna(sequence).astype(np.float32)
Borzoi prediction functions
Here we define how many model folds we want to use for the prediction
= 1 n_reps
= 524288
seq_len = True #Average across reverse-complement prediction
rc
with open(params_file) as params_open :
= json.load(params_open)
params
= params['model']
params_model = params['train']
params_train
'trunk'][-2]['cropping'] = 0
params_model[
= pd.read_csv(targets_file, index_col=0, sep='\t')
targets_df = targets_df.index
target_index
#Create local index of strand_pair (relative to sliced targets)
if rc :
= targets_df.strand_pair
strand_pair
= {ix : i for i, ix in enumerate(target_index.values.tolist())}
target_slice_dict = np.array([
slice_pair if ix in target_slice_dict else ix for ix in strand_pair.values.tolist()
target_slice_dict[ix] ='int32')
], dtype
= []
models for rep_ix in range(n_reps) :
= os.path.join(BORZOI_PATH, f"examples/saved_models/f3c{rep_ix}/train/model0_best.h5")
model_file = seqnn.SeqNN(params_model)
seqnn_model 0)
seqnn_model.restore(model_file,
seqnn_model.build_slice(target_index)if rc :
seqnn_model.strand_pair.append(slice_pair)0])
seqnn_model.build_ensemble(rc, [
models.append(seqnn_model)
def predict_tracks(models, sequence_one_hot):
= []
predicted_tracks for fold_ix in range(len(models)):
= models[fold_ix](sequence_one_hot[None, ...])[:, None, ...].astype(
yh "float16"
)
predicted_tracks.append(yh)
= np.concatenate(predicted_tracks, axis=1)
predicted_tracks
return predicted_tracks
def make_prediction(sequence):
with tf.device(DEVICE):
= predict_tracks(models, sequence)
y_wt return y_wt
2025-05-05 13:09:41.154901: I metal_plugin/src/device/metal_device.cc:1154] Metal device set to: Apple M1 Pro
2025-05-05 13:09:41.154939: I metal_plugin/src/device/metal_device.cc:296] systemMemory: 32.00 GB
2025-05-05 13:09:41.154952: I metal_plugin/src/device/metal_device.cc:313] maxCacheSize: 10.67 GB
2025-05-05 13:09:41.155004: I tensorflow/core/common_runtime/pluggable_device/pluggable_device_factory.cc:306] Could not identify NUMA node of platform GPU ID 0, defaulting to 0. Your kernel may not have been built with NUMA support.
2025-05-05 13:09:41.155040: I tensorflow/core/common_runtime/pluggable_device/pluggable_device_factory.cc:272] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 0 MB memory) -> physical PluggableDevice (device: 0, name: METAL, pci bus id: <undefined>)
5. Predict for a 3’aQTL
In this paper, 3’ alternative polyadenylation QTLs (1’aQTLs) where identified using gtex individuals. They were stored here.
For the purpose of this example, we will focus on the variant rs10954213: chr7_128949373_G_A (hg38), which is one of the whole blood 3’aQTLs that was reported on the paper (Figure 1b)
= FastaStringExtractor(fasta_file) fasta_extractor
= kipoiseq.Interval("chr7", 128949373, 128949374) sequence_interval
= one_hot_encode(fasta_extractor.extract(sequence_interval.resize(seq_len))) reference_one_hot
Here the snp_position
argument of the extract
function refers to the position in the sequence where we want to introduce the mutation. snp_base
is the nucleotide which will replace the reference nucleotide.
Note that in the output, the reference and mutated nucleotides are printed in the middle of the upstream and downstream nucleotides.
= one_hot_encode(fasta_extractor.extract(sequence_interval.resize(seq_len), snp_position=128949373, snp_base='A')) mutated_one_hot
TGA
TAA
= reference_one_hot.astype("float32")
reference_one_hot = mutated_one_hot.astype("float32") mutated_one_hot
= make_prediction(reference_one_hot) pred_reference
pred_reference.shape
(1, 1, 16384, 7611)
We remove the 0 dimension and average across the 1 dimension (folds)
= pred_reference.squeeze(0).mean(axis = 0)
pred_reference pred_reference.shape
(16384, 7611)
Then we keep only the whole blood RNA-seq prediction see track description here
= pred_reference[:, 7531:7534]
pred_reference pred_reference.shape
(16384, 3)
Average across the RNA blood tracks
= pred_reference.mean(axis = 1 ) pred_reference
pred_reference.shape
(16384,)
We do the same for the mutated sequence
= make_prediction(mutated_one_hot)
pred_mutated = pred_mutated.squeeze(0).mean(axis = 0)
pred_mutated = pred_mutated[:, 7531:7534]
pred_mutated = pred_mutated.mean(axis = 1 ) pred_mutated
pred_mutated.shape
(16384,)
import matplotlib.pyplot as plt
import seaborn as sns
def plot_tracks_nested(nested_tracks, interval, highlight_position = None, alpha = 0.4, height=1.5):
"""
nested_tracks: dict of dicts
Outer keys: data sources (e.g., replicates or conditions)
Inner keys: track names to plot (these become the rows)
interval: a kipoiseq interval with .start and .end
"""
= set()
all_inner_keys for subdict in nested_tracks.values():
all_inner_keys.update(subdict.keys())= sorted(all_inner_keys)
all_inner_keys
= len(all_inner_keys)
n_tracks = plt.subplots(n_tracks, 1, figsize=(20, height * n_tracks), sharex=True)
fig, axes
if n_tracks == 1:
= [axes]
axes
for ax, inner_key in zip(axes, all_inner_keys):
for outer_key, subdict in nested_tracks.items():
if inner_key in subdict:
= subdict[inner_key]
y = np.linspace(interval.start, interval.end, num=len(y))
x =alpha, label=outer_key)
ax.fill_between(x, y, alpha=11)
ax.set_title(inner_key, fontsize='upper right', fontsize=8, frameon=False)
ax.legend(loc=ax, top=True, right=True, bottom=True)
sns.despine(axif highlight_position is not None and interval.start <= highlight_position <= interval.end:
=highlight_position, color='black', linestyle='--', linewidth=1)
ax.axvline(x
-1].set_xlabel(str(interval))
axes[
plt.tight_layout() plt.show()
Since we are only interested in the 3’UTR region, we will center our plot on the 64 middle bins (for borzoi, each bin is 32 base pair-long). Remember that the prediction is centered at the 3’aQTL (dashed line on the plot). Indeed the coverage difference between the reference and alternative sequences is evident.
8160:8224, ].shape pred_reference[
(64,)
= {'Reference': {'Whole Blood RNA-seq': pred_reference[8160:8224, ]},
tracks 'Mutated': {'Whole Blood RNA-seq': pred_mutated[8160:8224, ]}}
64*32), highlight_position=sequence_interval.start) plot_tracks_nested(tracks, sequence_interval.resize(