Why Deep Learning for Genomics?

GENE 46100 — Unit 0

Haky Im

2025-03-25

Course Roadmap

Unit Topic
0 MLP, CNN, DNA scoring
1 Transformers & Genomic GPT
2 Enformer & Borzoi
3 Applications

Tools we’ll use:

  • Python + PyTorch
  • Google Colab (GPU)
  • Weights & Biases

By the end: You’ll understand how state-of-the-art models predict gene expression from DNA sequence

What Can Deep Learning Do in Genomics?

  • Predict TF binding from DNA sequence alone
  • Score regulatory variants without experiments
  • Predict gene expression from 200kb of sequence (Enformer)
  • Generate protein structures (AlphaFold)
  • Build DNA language models that learn grammar of the genome

These models learn patterns we didn’t know to look for.

The Starting Point: Linear Models

You already know this from statistics/ML:

\[y = X\beta + \epsilon\]

Closed-form solution: \(\hat{\beta} = (X^TX)^{-1}X^Ty\)

Works great when the relationship is linear.

But what if it’s not?

What If the Truth Is Curved?

Consider: \(y = x^3\)

Linear model:

  • Fits best straight line
  • Residuals are huge
  • Fundamentally wrong model

What we need:

  • A model that can learn any shape
  • Without us specifying the functional form
  • From data alone

You’ll see this exact demo in the hands-on notebook

Gradient Descent: Learning by Iterating

Instead of a closed-form solution, we iteratively improve:

  1. Start with random parameters
  2. Compute prediction error (loss)
  3. Compute gradient: which direction reduces loss?
  4. Take a small step in that direction
  5. Repeat

\[\theta_{t+1} = \theta_t - \alpha \nabla L(\theta_t)\]

Why? Because for nonlinear models, there’s no closed-form solution.

Enter the MLP

Multi-Layer Perceptron — the simplest neural network

Try it yourself: playground.hakyimlab.org

MLP: The Math

The formula for one hidden layer:

\[y = W_2 \cdot \sigma(W_1 x + b_1) + b_2\]

The activation function \(\sigma\) (e.g., ReLU) is what makes it nonlinear.

Why the Activation Function Matters

Without activation: stacking linear layers is still linear

\[W_2(W_1 x) = (W_2 W_1) x = W'x\]

With ReLU activation: \(\text{ReLU}(x) = \max(0, x)\)

  • Each neuron becomes a “hinge”
  • Combinations of hinges approximate any curve
  • More neurons → better approximation

Universal Approximation Theorem

An MLP with a single hidden layer can approximate any continuous function to arbitrary accuracy, given enough hidden neurons.

What this means for us:

  • We don’t need to know the functional form
  • The network learns it from data
  • We just need enough neurons and enough data

What this doesn’t mean:

  • It doesn’t say learning is easy
  • It doesn’t say how many neurons you need
  • It doesn’t say it will generalize

PyTorch: Automatic Differentiation

The key idea: you define the forward pass, PyTorch computes the gradients

# Define model
model = MLP(input_dim=1, hidden_dim=64, output_dim=1)

# Training loop
for epoch in range(1000):
    y_hat = model(x)          # forward pass
    loss = loss_fn(y_hat, y)  # compute loss
    loss.backward()           # compute gradients (automatic!)
    optimizer.step()          # update parameters
    optimizer.zero_grad()     # reset gradients

You’ll implement this in the hands-on notebook

The Training Loop — Visually

                    ┌─────────────┐
                    │  Input Data │
                    └──────┬──────┘
                           ▼
                    ┌─────────────┐
              ┌────▶│ Forward Pass│
              │     └──────┬──────┘
              │            ▼
              │     ┌─────────────┐
              │     │ Compute Loss│
              │     └──────┬──────┘
              │            ▼
              │     ┌─────────────┐
              │     │  Backward   │ ← PyTorch does this
              │     │  (Gradients)│
              │     └──────┬──────┘
              │            ▼
              │     ┌─────────────┐
              └─────│Update Params│
                    └─────────────┘

DNA as Data: One-Hot Encoding

How do we feed DNA to a neural network?

Sequence:  A  T  G  C  G  T  A

           A  T  G  C
     1:  [ 1  0  0  0 ]  ← A
     2:  [ 0  1  0  0 ]  ← T
     3:  [ 0  0  1  0 ]  ← G
     4:  [ 0  0  0  1 ]  ← C
     5:  [ 0  0  1  0 ]  ← G
     6:  [ 0  1  0  0 ]  ← T
     7:  [ 1  0  0  0 ]  ← A

Shape: (sequence_length, 4)

Each base → a 4-dimensional unit vector. No ordinal relationship imposed.

From DNA to Prediction

DNA sequence          One-hot matrix       Neural network      Prediction

ATGCGTAACG...  →  ┌─────────────┐   →   ┌──────────┐   →   binding score
                  │ 1 0 0 0     │       │  MLP or  │       expression level
                  │ 0 1 0 0     │       │   CNN    │       variant effect
                  │ 0 0 1 0     │       │          │       ...
                  │ 0 0 0 1     │       └──────────┘
                  │ ...         │
                  └─────────────┘

This is the core pattern of the entire course.

Every model we study takes DNA sequence as input and predicts a biological output.

MLP vs CNN: A Preview

MLP (this week)

  • Flatten DNA → single vector
  • Every input connected to every neuron
  • No concept of “position”
  • Good for learning the basics

CNN (next week)

  • Preserve 2D structure
  • Sliding window = motif scanner
  • Learned filters ≈ PWMs
  • The workhorse of genomic DL

Key question CNN answers: Is there a motif anywhere in this sequence?

Why CNNs Work for DNA

A CNN filter sliding over one-hot DNA is mathematically equivalent to scoring with a Position Weight Matrix (PWM)

Filter:  [weights for A, T, G, C] × filter_length

   = a learned PWM!

But better:

  • Learns the motifs from data (no database needed)
  • Stacks of filters capture combinations of motifs
  • Multiple layers capture hierarchical patterns

Erin Wilson’s DNA scoring notebook makes this beautifully concrete

Unit 0 Plan: Weeks 1–3

Session Topic Notebook
Week 1a Setup, intro (this lecture)
Week 1b Linear → MLP in PyTorch hands-on-introduction_to_deep_learning.ipynb
Week 2 CNN for DNA scoring updated-basic_DNA_tutorial.ipynb
Week 3a TF binding project tf-binding-prediction-starter.ipynb
Week 3b Hyperparameter tuning tf-binding-wandb.ipynb

What’s Coming Next

Unit 1: Transformers & GPT

  • Attention mechanism
  • Karpathy’s nanoGPT
  • Train a DNA language model
  • Fine-tune for promoter prediction

Unit 2: Enformer & Borzoi

  • Predict epigenome from 200kb DNA
  • Variant effect prediction
  • Connection to GWAS/PrediXcan

The arc: MLP → CNN → Transformer → Genomic foundation models

Getting Started

Environment: Google Colab (GPU provided, no local setup needed)

First notebook: hands-on-introduction_to_deep_learning.ipynb

You will:

  1. Fit a linear model with gradient descent
  2. See it fail on nonlinear data
  3. Build an MLP that succeeds
  4. Visualize the learning curve

Key device line for all notebooks:

device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')

Resources

Videos:

Interactive:

Papers (optional):

  • Avsec et al. 2021 — Enformer
  • Linder et al. 2023 — Borzoi

Questions?