Using Borzoi for predicting on a personal genome, derived from VCF
Author
Sofia Salazar
Published
May 5, 2025
Borzoi Personal Genome Prediction Tutorial
This notebook demonstrates how to use Borzoi, a deep learning model for predicting genomic regulatory activity, to analyze personal genome sequences. We’ll learn how to:
Set up the environment and install required packages
Download and prepare model files and reference data
Process personal genome data from VCF files
Make predictions using Borzoi
Visualize and interpret the results
1. Environment Setup
First, we need to set up our Python environment. Borzoi requires Python 3.10 or lower due to compatibility requirements.
conda create --name borzoi46100 python=3.10
Show the code
import osGITHUB_DIR ='/Users/haekyungim/Github'# clone baskerville and borzoi if not already clonedbaskerville_path = os.path.join(GITHUB_DIR, 'baskerville')borzoi_path = os.path.join(GITHUB_DIR, 'borzoi')ifnot os.path.exists(baskerville_path):!git clone https://github.com/calico/baskerville.git {GITHUB_DIR}/baskervilleifnot os.path.exists(borzoi_path):!git clone https://github.com/calico/borzoi.git {GITHUB_DIR}/borzoi
After loading baskerville, restart runtime, run code from here
Modifies ref sequence to personal sequence given the vcf
Show the code
def vcf_to_seq_faster(target_interval, individual, vcf_file, fasta_extractor):# target_inverval is a kipoiseq interval, e.g. kipoiseq.Interval("chr22", 18118779, 18145669)# individual is the id of the individual in the vcf file target_fa = fasta_extractor.extract(target_interval.resize(SEQUENCE_LENGTH)) window_coords = target_interval.resize(SEQUENCE_LENGTH)# two haplotypes per individual haplo_1 =list(target_fa[:]) haplo_2 =list(target_fa[:])# Open the VCF file vcf_reader = cyvcf2.VCF(vcf_file)# Specific genomic region CHROM = window_coords.chrom START =max(1,window_coords.start) END =min(window_coords.end, fasta_extractor._chromosome_sizes[CHROM]-1) count1 =0 count2 =0# Iterate through variants in the specified regionfor variant in vcf_reader(CHROM +':'+str(START) +'-'+str(END)):# Access the individual's genotype using index (0-based) of the sample individual_index = vcf_reader.samples.index(individual) genotype = variant.genotypes[individual_index] ALT=variant.ALT[0] REF=variant.REF POS=variant.POSif REF == target_fa[POS - window_coords.start -1]:if genotype[0] ==1: haplo_1[POS - window_coords.start -1] = ALT count1 = count1 +1if genotype[1] ==1: haplo_2[POS - window_coords.start -1] = ALT count2 = count2 +1else:print("ERROR: REF in vcf "+ REF +"!= REF from the build"+ target_fa[POS - window_coords.start -1])print("number of changes haplo1:")print(count1)print("number of changes haplo2:")print(count2)return haplo_1, haplo_2
make_prediction function
Show the code
def make_prediction(gene, interval, haplo, sequence_one_hot, window =131072): # snp_pos, alt_allelewith tf.device('/GPU:0'): search_gene = gene # gene to predict resized_interval = interval.resize(SEQUENCE_LENGTH) start = resized_interval.start end = resized_interval.end center_pos = start + SEQUENCE_LENGTH//2# figure center position chrom = resized_interval.chr# chromosome#Get exon bin range gene_keys = [gene_key for gene_key in transcriptome.genes.keys() if search_gene in gene_key] gene = transcriptome.genes[gene_keys[0]]#Determine output sequence start seq_out_start = start + seqnn_model.model_strides[0]*seqnn_model.target_crops[0] seq_out_len = seqnn_model.model_strides[0]*seqnn_model.target_lengths[0]#Determine output positions of gene exons gene_slice = gene.output_slice(seq_out_start, seq_out_len, seqnn_model.model_strides[0], False)#Make predictions#y_wtreturn predict_tracks(models, sequence_one_hot)
Get indexes of tracks function
Show the code
# Get indexes of trackdef get_track_idx(tracks): track_idx = [] targets_df['local_index'] = np.arange(len(targets_df))for track in tracks: track_idx.append(targets_df.loc[targets_df['description'] == track]['local_index'].tolist())return track_idx
SEQUENCE_LENGTH =524288#n_folds = 4 #To use only one model fold, change to 'n_folds = 1'n_folds =1#To use only one model fold, change to 'n_folds = 1'rc =True#Average across reverse-complement prediction#Read model parameterswithopen(params_file) as params_open : params = json.load(params_open) params_model = params['model'] params_train = params['train']#Read targetstargets_df = pd.read_csv(targets_file, index_col=0, sep='\t')target_index = targets_df.index#Create local index of strand_pair (relative to sliced targets)if rc : strand_pair = targets_df.strand_pair target_slice_dict = {ix : i for i, ix inenumerate(target_index.values.tolist())} slice_pair = np.array([ target_slice_dict[ix] if ix in target_slice_dict else ix for ix in strand_pair.values.tolist() ], dtype='int32')#Initialize model ensemblemodels = []for fold_ix inrange(n_folds) : model_file = models_path +"f"+str(fold_ix) +"/model0_best.h5" seqnn_model = seqnn.SeqNN(params_model) seqnn_model.restore(model_file, 0) seqnn_model.build_slice(target_index)if rc : seqnn_model.strand_pair.append(slice_pair)#seqnn_model.build_ensemble(rc, '0') seqnn_model.build_ensemble(rc, [0]) models.append(seqnn_model)
4. Build sequence
Show the code
# read VCFs and encode haplotypesCHROM ='chr22'vcf_file = CONTENT_DIR +"/data/ALL."+ CHROM +".shapeit2_integrated_SNPs_v2a_27022019.GRCh38.phased.gz"target_interval = kipoiseq.Interval(CHROM, 18118779, 18145669)haplo1, haplo2 = vcf_to_seq_faster(target_interval, 'HG00097', vcf_file, fasta_extractor)haplo0 = fasta_extractor.extract(target_interval.resize(SEQUENCE_LENGTH)) # ref seq# removed extra dimension as input layer is of shape=(None, 524288, 4)haplo1_enc = make_seq_1hot("".join(haplo1))haplo2_enc = make_seq_1hot("".join(haplo2))haplo0_enc = make_seq_1hot("".join(haplo0))
Borzoi is a deep learning model that: - Takes DNA sequences as input (one-hot encoded) - Uses a convolutional neural network architecture and transformer layers - Predicts multiple genomic features simultaneously - Can identify regulatory elements and their activity levels
Key Parameters
SEQUENCE_LENGTH: 524,288 bp (512 kb) - the size of genomic windows processed
n_folds: 4 - number of model replicates for ensemble predictions
rc: True - whether to average predictions from forward and reverse complement sequences
Data Requirements
Reference Genome (hg38):
Used as baseline for sequence comparison
Required for proper variant mapping
VCF File:
Contains personal genome variants
Must be phased for haplotype analysis
Should be properly indexed
Annotation Files:
GTF file for gene annotations
Additional files for specific features (CAGE, DNase, etc.)