GENE 46100 — Unit 0
2025-03-25
| Unit | Topic |
|---|---|
| 0 | MLP, CNN, DNA scoring |
| 1 | Transformers & Genomic GPT |
| 2 | Enformer & Borzoi |
| 3 | Applications |
Tools we’ll use:
By the end: You’ll understand how state-of-the-art models predict gene expression from DNA sequence
These models learn patterns we didn’t know to look for.
You already know this from statistics/ML:
\[y = X\beta + \epsilon\]
Closed-form solution: \(\hat{\beta} = (X^TX)^{-1}X^Ty\)
Works great when the relationship is linear.
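The closed-form estimate \(\hat{\beta} = (X^TX)^{-1}X^Ty\) can be checked directly on synthetic data; a minimal NumPy sketch (the data and coefficient values are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200
X = np.column_stack([np.ones(n), rng.normal(size=n)])  # intercept + one feature
beta_true = np.array([2.0, -1.5])
y = X @ beta_true + 0.1 * rng.normal(size=n)           # y = X beta + noise

# Closed-form OLS: solve (X^T X) beta_hat = X^T y
# (np.linalg.solve is preferred over explicitly inverting X^T X)
beta_hat = np.linalg.solve(X.T @ X, X.T @ y)
print(beta_hat)  # close to [2.0, -1.5]
```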
But what if it’s not?
Consider: \(y = x^3\)
Linear model: can only fit a straight line, so it systematically misses the curvature of \(y = x^3\)
What we need: a model that can learn nonlinear functions of the input
You’ll see this exact demo in the hands-on notebook
Instead of a closed-form solution, we iteratively improve:
\[\theta_{t+1} = \theta_t - \alpha \nabla L(\theta_t)\]
Why? Because for nonlinear models, there’s no closed-form solution.
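The update rule above can be run by hand on a toy loss; a sketch on \(L(\theta) = (\theta - 3)^2\), whose gradient is \(2(\theta - 3)\) (the learning rate and iteration count are arbitrary choices for illustration):

```python
def grad(theta):
    # Gradient of L(theta) = (theta - 3)^2
    return 2.0 * (theta - 3.0)

theta = 0.0   # initial guess
alpha = 0.1   # learning rate
for _ in range(100):
    theta = theta - alpha * grad(theta)  # theta_{t+1} = theta_t - alpha * grad

print(theta)  # converges toward the minimizer, 3.0
```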
Multi-Layer Perceptron — the simplest neural network
The formula for one hidden layer:
\[y = W_2 \cdot \sigma(W_1 x + b_1) + b_2\]
The activation function \(\sigma\) (e.g., ReLU) is what makes it nonlinear.
Without activation: stacking linear layers is still linear
\[W_2(W_1 x) = (W_2 W_1) x = W'x\]
With ReLU activation: \(\text{ReLU}(x) = \max(0, x)\)
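Both claims, that stacked linear layers collapse into one linear map and that ReLU breaks this, can be verified numerically. A small sketch with random weights (shapes are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(0)
W1 = rng.normal(size=(8, 4))
W2 = rng.normal(size=(1, 8))
x = rng.normal(size=(4,))

# Without activation, two layers collapse into one linear map W' = W2 @ W1.
two_layer = W2 @ (W1 @ x)
one_layer = (W2 @ W1) @ x
print(np.allclose(two_layer, one_layer))  # True

# With ReLU in between, the composition is genuinely nonlinear:
# the output generally differs from any single linear map applied to x.
relu = lambda z: np.maximum(0.0, z)
nonlinear = W2 @ relu(W1 @ x)
print(nonlinear)
```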
The universal approximation theorem: an MLP with a single hidden layer can approximate any continuous function (on a bounded domain) to arbitrary accuracy, given enough hidden neurons.
What this means for us: an MLP is, in principle, expressive enough for the sequence-to-output mappings we care about.
What this doesn’t mean: that gradient descent will actually find those weights, or that a practical number of neurons suffices.
The key idea: you define the forward pass, PyTorch computes the gradients
```python
# Define model
model = MLP(input_dim=1, hidden_dim=64, output_dim=1)

# Training loop
for epoch in range(1000):
    y_hat = model(x)           # forward pass
    loss = loss_fn(y_hat, y)   # compute loss
    loss.backward()            # compute gradients (automatic!)
    optimizer.step()           # update parameters
    optimizer.zero_grad()      # reset gradients
```

You’ll implement this in the hands-on notebook
┌─────────────┐
│ Input Data │
└──────┬──────┘
▼
┌─────────────┐
┌────▶│ Forward Pass│
│ └──────┬──────┘
│ ▼
│ ┌─────────────┐
│ │ Compute Loss│
│ └──────┬──────┘
│ ▼
│ ┌─────────────┐
│ │ Backward │ ← PyTorch does this
│ │ (Gradients)│
│ └──────┬──────┘
│ ▼
│ ┌─────────────┐
└─────│Update Params│
└─────────────┘
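Putting the loop and the diagram together: a minimal self-contained sketch that trains on the \(y = x^3\) example from earlier. The `MLP` definition, loss, and optimizer choices here are illustrative; the hands-on notebook's versions may differ.

```python
import torch
import torch.nn as nn

class MLP(nn.Module):
    """One hidden layer: y = W2 · ReLU(W1 x + b1) + b2."""
    def __init__(self, input_dim, hidden_dim, output_dim):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(input_dim, hidden_dim),
            nn.ReLU(),
            nn.Linear(hidden_dim, output_dim),
        )

    def forward(self, x):
        return self.net(x)

torch.manual_seed(0)
x = torch.linspace(-2, 2, 256).unsqueeze(1)  # shape (256, 1)
y = x ** 3                                   # the nonlinear target

model = MLP(input_dim=1, hidden_dim=64, output_dim=1)
loss_fn = nn.MSELoss()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-2)

for epoch in range(1000):
    y_hat = model(x)           # forward pass
    loss = loss_fn(y_hat, y)   # compute loss
    optimizer.zero_grad()      # reset gradients
    loss.backward()            # compute gradients (automatic!)
    optimizer.step()           # update parameters

print(loss.item())  # small after training; the MLP has fit the cubic
```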
How do we feed DNA to a neural network?
Sequence: A T G C G T A
A T G C
1: [ 1 0 0 0 ] ← A
2: [ 0 1 0 0 ] ← T
3: [ 0 0 1 0 ] ← G
4: [ 0 0 0 1 ] ← C
5: [ 0 0 1 0 ] ← G
6: [ 0 1 0 0 ] ← T
7: [ 1 0 0 0 ] ← A
Shape: (sequence_length, 4)
Each base → a 4-dimensional unit vector. No ordinal relationship imposed.
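The encoding above is a few lines of code; a sketch using the slide's A, T, G, C column order (the function name is illustrative):

```python
import numpy as np

def one_hot_dna(seq, alphabet="ATGC"):
    """Encode a DNA string as a (len(seq), 4) one-hot matrix.

    Column order follows the slide: A, T, G, C.
    """
    index = {base: i for i, base in enumerate(alphabet)}
    mat = np.zeros((len(seq), len(alphabet)), dtype=np.float32)
    for pos, base in enumerate(seq):
        mat[pos, index[base]] = 1.0
    return mat

encoded = one_hot_dna("ATGCGTA")
print(encoded.shape)  # (7, 4)
print(encoded[0])     # [1. 0. 0. 0.]  <- A
```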
DNA sequence One-hot matrix Neural network Prediction
ATGCGTAACG... → ┌─────────────┐ → ┌──────────┐ → binding score
│ 1 0 0 0 │ │ MLP or │ expression level
│ 0 1 0 0 │ │ CNN │ variant effect
│ 0 0 1 0 │ │ │ ...
│ 0 0 0 1 │ └──────────┘
│ ... │
└─────────────┘
This is the core pattern of the entire course.
Every model we study takes DNA sequence as input and predicts a biological output.
MLP (this week)
CNN (next week)
Key question CNN answers: Is there a motif anywhere in this sequence?
A CNN filter sliding over one-hot DNA is mathematically equivalent to scoring with a Position Weight Matrix (PWM)
Filter: [weights for A, T, G, C] × filter_length
= a learned PWM!
But better: the filter weights are learned from data rather than specified in advance, and a CNN learns many filters (candidate motifs) at once.
Erin Wilson’s DNA scoring notebook makes this beautifully concrete
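The filter-as-PWM equivalence can be seen directly: sliding a fixed weight matrix over one-hot DNA and summing elementwise products is exactly a PWM scan (and exactly what a Conv1d layer computes). A sketch with a made-up 3-bp "PWM" that favors the motif ATG:

```python
import numpy as np

def one_hot(seq, alphabet="ATGC"):
    idx = {b: i for i, b in enumerate(alphabet)}
    m = np.zeros((len(seq), 4))
    for p, b in enumerate(seq):
        m[p, idx[b]] = 1.0
    return m

# Toy PWM: rows = motif positions, cols = A, T, G, C (values are illustrative).
pwm = np.array([
    [ 1.0, -1.0, -1.0, -1.0],   # position 1 favors A
    [-1.0,  1.0, -1.0, -1.0],   # position 2 favors T
    [-1.0, -1.0,  1.0, -1.0],   # position 3 favors G
])

seq = "CCATGCC"
X = one_hot(seq)

# "Convolution": dot the PWM against every length-3 window of the sequence.
scores = np.array([np.sum(X[i:i + 3] * pwm) for i in range(len(seq) - 2)])
print(scores)                 # highest where the motif sits
print(int(scores.argmax()))   # 2: ATG starts at position 2
```

A learned CNN filter is this same computation, except the `pwm` entries are trainable parameters.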
| Session | Topic | Notebook |
|---|---|---|
| Week 1a | Setup, intro | (this lecture) |
| Week 1b | Linear → MLP in PyTorch | hands-on-introduction_to_deep_learning.ipynb |
| Week 2 | CNN for DNA scoring | updated-basic_DNA_tutorial.ipynb |
| Week 3a | TF binding project | tf-binding-prediction-starter.ipynb |
| Week 3b | Hyperparameter tuning | tf-binding-wandb.ipynb |
Unit 1: Transformers & GPT
Unit 2: Enformer & Borzoi
The arc: MLP → CNN → Transformer → Genomic foundation models
Environment: Google Colab (GPU provided, no local setup needed)
First notebook: hands-on-introduction_to_deep_learning.ipynb
You will:
Videos:
Interactive:
Papers (optional):
GENE 46100 · Deep Learning in Genomics