Predicting alternative polyadenylation with Borzoi

notebook

gene46100

Author

Sofia Salazar

Published

May 5, 2025

Predicting alternative polyadenylation with Borzoi

Overview

Borzoi is a deep learning model that, similarily to Enformer:

Takes DNA sequences as input (one-hot encoded)
Uses a convolutional neural network architecture and transformer layers
Predicts multiple genomic features simultaneously

However, Borzoi was rained to predict RNA-seq coverage as well, which makes it useful for the following aplication: predicting 3’UTR alternative polyadenylation (APA).

3’UTR APA is a post-transcriptional regulatory mechanism that allows transcripts from a gene to have different lengths due to differential polyA signal usage leading to different lengths of 3’UTR regions. Variants on the 3’UTR sequence that alter the length of the region are refered to 3’aQTLs read more here.

Objective

In this notebook we will learn how to use borzoi to predict RNA-seq coverage around one 3’aQTL by predicting on both the reference and the alternative sequence. We will plot the differences on RNA-seq prediction at the 3’UTR for the whole blood borzoi tracks.

1. Set up a borzoi-adequate environment

The requirements for running borzoi are outlined here. The most important dependencies are python ==3.10 and tensorflow==2.15.x.

I had to install python 3.10 and found that working with a venv worked as of may 5th 2025.

I followed the instructions to use the GPU of my macbook here

brew install python@3.10
python3.10 -m venv ~/borzoiTools
source ~/borzoiTools/bin/activate
python -m pip install -U pip
python -m pip install ipykernel
python3 -m ipykernel install --user --name=borzoiTools
python3 -m pip install tensorflow==2.15
python3 -m pip install tensorflow-metal

To verify python and tensforflow versions, run the following chunks:

import tensorflow as tf

gpus = tf.config.experimental.list_physical_devices('GPU')
gpus

[PhysicalDevice(name='/physical_device:GPU:0', device_type='GPU')]

if gpus:
    try:
        # Currently, memory growth needs to be the same across GPUs
        for gpu in gpus:
            print(gpu)
            tf.config.experimental.set_memory_growth(gpu, True)
        print("TensorFlow is using the GPU.")
    except RuntimeError as e:
        print(e)
else:
    print("TensorFlow is not using the GPU. Check your TensorFlow installation.")

PhysicalDevice(name='/physical_device:GPU:0', device_type='GPU')
TensorFlow is using the GPU.

import sys
print(sys.version)

3.10.17 (main, Apr  8 2025, 12:10:59) [Clang 16.0.0 (clang-1600.0.26.6)]

print(tf.__version__)

2.15.0

2. Install borzoi and dependencies

Borzoi depends on the baskerville software, run the following chunks to install both on your environment:

%cd /Users/sofiasalazar/Desktop/IM_lab/borzoi_folder

/Users/sofiasalazar/Desktop/IM_lab/borzoi_folder

/Users/sofiasalazar/borzoiTools/lib/python3.10/site-packages/IPython/core/magics/osm.py:417: UserWarning: This is now an optional IPython functionality, setting dhist requires you to install the `pickleshare` library.
  self.shell.db['dhist'] = compress_dhist(dhist)[-100:]

!git clone https://github.com/calico/baskerville.git

%cd baskerville
%pip install -e .

%cd ..
!git clone https://github.com/calico/borzoi.git

%cd borzoi
%pip install -e .

Then I installed other libraries that we will be using as well

%pip install kipoiseq
%pip install h5py

Here I had to restart the kernel to use baskerville

3. Download models

The borzoi models are each ~1GB (~4GB in total as there are 4 models, each correspond to a cross validation fold from the training process). Further down you’ll see that we can choose to predict using the 4 models or only a couple of them.

%cd /Users/sofiasalazar/Desktop/IM_lab/borzoi_folder/borzoi
! ./download_models.sh

4. Set up borzoi and helper functions

The previous steps are run only once, afterwards, we can set up the borzoi functions as well as other utilities to extract the reference sequence and mutate it.

import os
import h5py
import numpy as np
import pandas as pd
import tensorflow as tf
import baskerville
from baskerville import seqnn
from baskerville import gene as bgene
from baskerville import dna
import json
import kipoiseq
import pyfaidx

gpus = tf.config.experimental.list_physical_devices('GPU')
gpus

[PhysicalDevice(name='/physical_device:GPU:0', device_type='GPU')]

DEVICE = "/GPU:0"

Change the following paths to your personal paths

BORZOI_PATH = "/Users/sofiasalazar/Desktop/IM_lab/borzoi_folder/borzoi"
models_path = os.path.join(BORZOI_PATH, "examples/saved_models" )
fasta_file = "/Users/sofiasalazar/Library/CloudStorage/Box-Box/imlab-data/Reference-Data/ref_sequences/hg38/Homo_sapiens_assembly38.fasta"
params_file = os.path.join(BORZOI_PATH, "examples/params_pred.json")
targets_file = os.path.join(BORZOI_PATH, "examples/targets_human.txt")

The following classes will help us extract the reference genome sequence, mutate it and one hot encode it.

class FastaStringExtractor:

    def __init__(self, fasta_file):
        self.fasta = pyfaidx.Fasta(fasta_file)
        self._chromosome_sizes = {k: len(v) for k, v in self.fasta.items()}
    #import pd.Interval as Interval
    def extract(self, interval: kipoiseq.Interval, snp_position = None, snp_base=None, **kwargs) -> str:
        # Truncate interval if it extends beyond the chromosome lengths.
        chromosome_length = self._chromosome_sizes[interval.chrom]
        trimmed_interval = kipoiseq.Interval(interval.chrom,
                                    max(interval.start, 0),
                                    min(interval.end, chromosome_length),
                                    )
        
        # pyfaidx wants a 1-based interval
        sequence = str(self.fasta.get_seq(trimmed_interval.chrom,
                                          trimmed_interval.start + 1,
                                          trimmed_interval.stop).seq).upper()
        
        # Apply SNP modification if specified
        if snp_position and snp_base:
            pos = snp_position - trimmed_interval.start - 1
            if 0 <= pos < len(sequence):
                print(sequence[pos-1:pos+2])
                sequence = sequence[:pos] + snp_base + sequence[pos + 1:]
                print(sequence[pos-1:pos+2])
                
        # Fill truncated values with N's.
        pad_upstream = 'N' * max(-interval.start, 0)
        pad_downstream = 'N' * max(interval.end - chromosome_length, 0)
        return pad_upstream + sequence + pad_downstream

    def close(self):
        return self.fasta.close()
    
def one_hot_encode(sequence):
  return kipoiseq.transforms.functional.one_hot_dna(sequence).astype(np.float32)

Borzoi prediction functions

Here we define how many model folds we want to use for the prediction

n_reps = 1

seq_len = 524288
rc = True         #Average across reverse-complement prediction

with open(params_file) as params_open :
    
    params = json.load(params_open)
    
    params_model = params['model']
    params_train = params['train']

params_model['trunk'][-2]['cropping'] = 0


targets_df = pd.read_csv(targets_file, index_col=0, sep='\t')
target_index = targets_df.index 

#Create local index of strand_pair (relative to sliced targets)
if rc :
    strand_pair = targets_df.strand_pair
    
    target_slice_dict = {ix : i for i, ix in enumerate(target_index.values.tolist())}
    slice_pair = np.array([
        target_slice_dict[ix] if ix in target_slice_dict else ix for ix in strand_pair.values.tolist()
    ], dtype='int32')

models = []
for rep_ix in range(n_reps) :
    model_file = os.path.join(BORZOI_PATH, f"examples/saved_models/f3c{rep_ix}/train/model0_best.h5")
    seqnn_model = seqnn.SeqNN(params_model)
    seqnn_model.restore(model_file, 0)
    seqnn_model.build_slice(target_index)
    if rc :
        seqnn_model.strand_pair.append(slice_pair)
    seqnn_model.build_ensemble(rc, [0])
    
    models.append(seqnn_model)

def predict_tracks(models, sequence_one_hot):

    predicted_tracks = []
    for fold_ix in range(len(models)):

        yh = models[fold_ix](sequence_one_hot[None, ...])[:, None, ...].astype(
            "float16"
        )

        predicted_tracks.append(yh)

    predicted_tracks = np.concatenate(predicted_tracks, axis=1)

    return predicted_tracks

def make_prediction(sequence):
    with tf.device(DEVICE):
        y_wt = predict_tracks(models, sequence)
        return y_wt

2025-05-05 13:09:41.154901: I metal_plugin/src/device/metal_device.cc:1154] Metal device set to: Apple M1 Pro
2025-05-05 13:09:41.154939: I metal_plugin/src/device/metal_device.cc:296] systemMemory: 32.00 GB
2025-05-05 13:09:41.154952: I metal_plugin/src/device/metal_device.cc:313] maxCacheSize: 10.67 GB
2025-05-05 13:09:41.155004: I tensorflow/core/common_runtime/pluggable_device/pluggable_device_factory.cc:306] Could not identify NUMA node of platform GPU ID 0, defaulting to 0. Your kernel may not have been built with NUMA support.
2025-05-05 13:09:41.155040: I tensorflow/core/common_runtime/pluggable_device/pluggable_device_factory.cc:272] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 0 MB memory) -> physical PluggableDevice (device: 0, name: METAL, pci bus id: <undefined>)

5. Predict for a 3’aQTL

In this paper, 3’ alternative polyadenylation QTLs (1’aQTLs) where identified using gtex individuals. They were stored here.

For the purpose of this example, we will focus on the variant rs10954213: chr7_128949373_G_A (hg38), which is one of the whole blood 3’aQTLs that was reported on the paper (Figure 1b)

fasta_extractor = FastaStringExtractor(fasta_file)

sequence_interval = kipoiseq.Interval("chr7", 128949373, 128949374)

reference_one_hot = one_hot_encode(fasta_extractor.extract(sequence_interval.resize(seq_len)))

Here the snp_position argument of the extract function refers to the position in the sequence where we want to introduce the mutation. snp_base is the nucleotide which will replace the reference nucleotide.

Note that in the output, the reference and mutated nucleotides are printed in the middle of the upstream and downstream nucleotides.

mutated_one_hot = one_hot_encode(fasta_extractor.extract(sequence_interval.resize(seq_len), snp_position=128949373, snp_base='A'))

TGA
TAA

reference_one_hot = reference_one_hot.astype("float32")
mutated_one_hot = mutated_one_hot.astype("float32")

pred_reference = make_prediction(reference_one_hot)

pred_reference.shape

(1, 1, 16384, 7611)

We remove the 0 dimension and average across the 1 dimension (folds)

pred_reference = pred_reference.squeeze(0).mean(axis = 0)
pred_reference.shape

(16384, 7611)

Then we keep only the whole blood RNA-seq prediction see track description here

pred_reference = pred_reference[:, 7531:7534]
pred_reference.shape

(16384, 3)

Average across the RNA blood tracks

pred_reference = pred_reference.mean(axis = 1 )

pred_reference.shape

(16384,)

We do the same for the mutated sequence

pred_mutated = make_prediction(mutated_one_hot)
pred_mutated = pred_mutated.squeeze(0).mean(axis = 0)
pred_mutated = pred_mutated[:, 7531:7534]
pred_mutated = pred_mutated.mean(axis = 1 )

pred_mutated.shape

(16384,)

import matplotlib.pyplot as plt
import seaborn as sns

def plot_tracks_nested(nested_tracks, interval, highlight_position = None, alpha = 0.4, height=1.5):
    """
    nested_tracks: dict of dicts
        Outer keys: data sources (e.g., replicates or conditions)
        Inner keys: track names to plot (these become the rows)
    interval: a kipoiseq interval with .start and .end
    """
    all_inner_keys = set()
    for subdict in nested_tracks.values():
        all_inner_keys.update(subdict.keys())
    all_inner_keys = sorted(all_inner_keys)

    n_tracks = len(all_inner_keys)
    fig, axes = plt.subplots(n_tracks, 1, figsize=(20, height * n_tracks), sharex=True)

    if n_tracks == 1:
        axes = [axes]

    for ax, inner_key in zip(axes, all_inner_keys):
        for outer_key, subdict in nested_tracks.items():
            if inner_key in subdict:
                y = subdict[inner_key]
                x = np.linspace(interval.start, interval.end, num=len(y))
                ax.fill_between(x, y, alpha=alpha, label=outer_key)
        ax.set_title(inner_key, fontsize=11)
        ax.legend(loc='upper right', fontsize=8, frameon=False)
        sns.despine(ax=ax, top=True, right=True, bottom=True)
        if highlight_position is not None and interval.start <= highlight_position <= interval.end:
            ax.axvline(x=highlight_position, color='black', linestyle='--', linewidth=1)

    axes[-1].set_xlabel(str(interval))
    plt.tight_layout()
    plt.show()

Since we are only interested in the 3’UTR region, we will center our plot on the 64 middle bins (for borzoi, each bin is 32 base pair-long). Remember that the prediction is centered at the 3’aQTL (dashed line on the plot). Indeed the coverage difference between the reference and alternative sequences is evident.

pred_reference[8160:8224, ].shape

(64,)

tracks = {'Reference': {'Whole Blood RNA-seq': pred_reference[8160:8224, ]},
          'Mutated': {'Whole Blood RNA-seq': pred_mutated[8160:8224, ]}}
plot_tracks_nested(tracks, sequence_interval.resize(64*32), highlight_position=sequence_interval.start)