
GENE 46100 — Unit 00
2025-03-25
| Unit | Question | Topic | Weeks |
|---|---|---|---|
| 0 | What is DL? Can we detect regulatory motifs in DNA? | MLP, CNN, DNA scoring | 1–3 |
| 1 | Can we learn the “language” of DNA? | Transformers & Genomic GPT | 4–5 |
| 2 | Can we predict gene regulation from sequence? | Enformer & Borzoi | 6–7 |
| 3 | Can we model microbial communities? |  | 8–9 |
By the end: You’ll understand how state-of-the-art models predict gene expression from DNA sequence and apply DL to microbiome and other areas.
These models learn patterns we didn’t know to look for.
\[y = X\beta + \epsilon ~~~~~ \text{where }\epsilon \sim N(0, \sigma^2)\]
For linear regression, there’s a closed-form solution:
\[\hat{\beta} = (X^TX)^{-1}X^Ty\]
This is called the normal equation.
Estimated: \(\hat{\beta} \approx\) true \(\beta\)
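A NumPy sketch of the normal equation on simulated data (the data sizes and coefficients here are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
m = 200
X = np.column_stack([np.ones(m), rng.normal(size=m)])  # intercept + 1 feature
beta_true = np.array([2.0, -3.0])
y = X @ beta_true + rng.normal(scale=0.5, size=m)      # y = Xβ + ε

# Normal equation: β̂ = (XᵀX)⁻¹ Xᵀ y  (solve is more stable than inverting)
beta_hat = np.linalg.solve(X.T @ X, X.T @ y)
print(beta_hat)  # close to [2.0, -3.0]
```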

The normal equation works for linear models, but neural networks have no closed-form solution.
We need a general-purpose optimization method: gradient descent.
A gradient is the vector of a function's partial derivatives; in one dimension it is just the derivative, the slope.
\[f'(\beta) = \lim_{\Delta\beta \to 0} \frac{f(\beta + \Delta\beta) - f(\beta)}{\Delta\beta}\]


Algorithm:
\(\beta_{t+1} = \beta_t - \alpha \cdot \nabla L(\beta_t)\)
\(\alpha\) = learning rate (step size)
Too small (\(\alpha = 0.0001\)): tiny steps → very slow
Just right (\(\alpha = 0.003\)): smooth convergence
Too large (\(\alpha = 0.1\)): overshoots → diverges!
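The three regimes are easy to reproduce on a toy quadratic loss (a sketch; the α thresholds here belong to this toy problem, not the playground's):

```python
import numpy as np

def grad_descent(alpha, steps=100):
    """Minimize L(β) = β² (gradient ∇L = 2β) starting from β₀ = 1."""
    beta = 1.0
    for _ in range(steps):
        beta -= alpha * 2 * beta   # β ← β − α·∇L(β)
    return beta

print(grad_descent(0.0001))  # tiny steps: barely moves from 1
print(grad_descent(0.3))     # converges quickly toward the minimum at 0
print(grad_descent(1.5))     # step too large: |β| blows up
```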
In practice, computing gradients on all data is expensive.
Solution: use a random mini-batch at each step. \[\nabla L \approx \frac{1}{|B|} \sum_{i \in B} \nabla L_i\]
Full-batch GD: exact gradient, but expensive per step.
SGD (mini-batch): noisy gradient, but cheap per step.
The noise actually helps escape bad local minima!
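A NumPy sketch of the mini-batch approximation (data sizes and batch size are illustrative):

```python
import numpy as np

rng = np.random.default_rng(1)
m = 10_000
X = rng.normal(size=(m, 3))
y = X @ np.array([1.0, 2.0, 3.0]) + rng.normal(scale=0.1, size=m)
beta = np.zeros(3)

def mse_grad(Xb, yb, beta):
    # ∇L for MSE on a (mini-)batch: (2/|B|) Xᵀ(Xβ − y)
    return 2 * Xb.T @ (Xb @ beta - yb) / len(yb)

full = mse_grad(X, y, beta)          # uses all 10 000 rows
B = rng.choice(m, size=64, replace=False)
mini = mse_grad(X[B], y[B], beta)    # noisy, but ~160× cheaper
print(full, mini)                    # similar direction, not identical
```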
Every ML model needs these three ingredients:
| Component | Linear regression | Neural network |
|---|---|---|
| Model | \(y = X\beta\) | \(y = W_2 \sigma(W_1 x + b_1) + b_2\) |
| Loss | MSE | MSE, cross-entropy, … |
| Optimizer | Normal equation | SGD, Adam, … |
Mean Squared Error: \[L(\beta) = \frac{1}{m}\sum_{i=1}^{m} (\hat{y}^{(i)} - y^{(i)})^2\]
Its gradient (for a linear model):
\[\frac{\partial L}{\partial \beta_j} = \frac{2}{m}\sum_{i} (\hat{y}^{(i)} - y^{(i)}) \cdot x_j^{(i)}\]
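Differentiating the square brings down a factor of 2, so the analytic gradient is \((2/m)\,X^\top(X\beta - y)\). A finite-difference check (a NumPy sketch with arbitrary data) confirms the formula:

```python
import numpy as np

rng = np.random.default_rng(2)
X = rng.normal(size=(50, 2))
y = rng.normal(size=50)
beta = rng.normal(size=2)
m = len(y)

loss = lambda b: np.mean((X @ b - y) ** 2)   # MSE
grad = 2 / m * X.T @ (X @ beta - y)          # analytic ∂L/∂β

# central difference: (L(β + h·eⱼ) − L(β − h·eⱼ)) / 2h
h = 1e-6
num = np.array([(loss(beta + h * e) - loss(beta - h * e)) / (2 * h)
                for e in np.eye(2)])
print(np.max(np.abs(grad - num)))  # tiny: the formulas agree
```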
Settings:

Watch the loss curve drop as gradient descent finds the optimal \(w\) and \(b\).
Try changing the learning rate — see it converge faster or explode.
The playground fits \(y = wx + b\) perfectly on the linear dataset.
But what if the data isn’t linear?

The best possible linear fit to \(y = x^3\) is still terrible.
No amount of training will fix this — the model is wrong, not the optimizer.
Switch to dataset → Cubic (\(y \approx x^3\))
Keep: 1 neuron, Linear activation
Hit ▶ Play and wait…

Loss stays high. The best linear fit is a straight line through a curve.
No matter how long you train — a linear model can’t bend.

Solution: add nonlinearity → build a neural network.

A neural network with:
\[y = W_2 \cdot \sigma(W_1 x + b_1) + b_2\]
Key insight: \(\sigma\) (activation function) introduces the bends that let the model fit curves.
x₁ ──w₁──┐         activation
         ├──▶ Σ ──▶ σ(·) ──▶ output
x₂ ──w₂──┘ +b      function
Without activation (linear):
\(\text{output} = w_1 x_1 + w_2 x_2 + b\)
Just a weighted sum — still a line!
With ReLU activation:
\(\text{output} = \max(0, w_1 x_1 + w_2 x_2 + b)\)
Now it can produce a kink — a bent line!

Change activation → ReLU
Set hidden layer → 2 neurons
Hit ▶ Play
Select Pred vs True to show the scatter plot

Each neuron contributes one ReLU “kink.” Combined, they approximate a curve — but it’s rough.
Loss is lower, but the fit isn’t great yet.
Increase to 4 neurons
Hit ▶ Play
Select Pred vs True to show the scatter plot

More neurons = more “kinks” = smoother approximation of \(x^3\).
Add a second layer: 4 neurons → 2 neurons
Hit ▶ Play

Layer 1 learns basic bends → Layer 2 combines them into a smooth curve.
Stacking layers = composing simple features into complex functions.
LINEAR (1 neuron, no activation) 2 NEURONS + ReLU
┌────────────────────────┐ ┌────────────────────────┐
│ ────────────── │ │ ─────╱ │
│ flat line │ │ ╱────── │
│ can't bend! │ │ two bends │
└────────────────────────┘ └────────────────────────┘
4 NEURONS + ReLU 2 LAYERS (4→2) + ReLU
┌────────────────────────┐ ┌────────────────────────┐
│ ╱‾‾‾╲ │ │ ╱‾‾‾╲ │
│ ─────╱ ╲───── │ │ ─────╱ ╲───── │
│ four bends │ │ smooth curve │
└────────────────────────┘ └────────────────────────┘
A neural network approximates any function by combining simple bent lines.
Exercise: count parameters for this network:
x₁ ──┬──▶ h₁ ──┐
├──▶ h₂ ──┼──▶ ŷ
x₂ ──┴──▶ h₃ ──┘
Layer 1: 2×3 weights + 3 biases = 9
Layer 2: 3×1 weights + 1 bias = 4
Total: 13 parameters
Manually computing gradients for 13 parameters is tedious. For millions? Impossible.
This is why we need PyTorch.
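A quick sanity check of the count above, using PyTorch's `nn.Sequential` (layer sizes taken from the exercise):

```python
import torch.nn as nn

# the 2 → 3 → 1 network from the exercise
model = nn.Sequential(
    nn.Linear(2, 3),   # 2×3 weights + 3 biases = 9
    nn.ReLU(),
    nn.Linear(3, 1),   # 3×1 weights + 1 bias  = 4
)
total = sum(p.numel() for p in model.parameters())
print(total)  # 13
```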
Without activation, stacking layers is pointless:
\[W_2(W_1 x) = (W_2 W_1) x = W'x\]
Any number of linear layers = one linear layer!
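The collapse is easy to verify numerically (shapes chosen arbitrarily):

```python
import numpy as np

rng = np.random.default_rng(3)
W1 = rng.normal(size=(4, 2))   # layer 1: 2 → 4
W2 = rng.normal(size=(1, 4))   # layer 2: 4 → 1
x = rng.normal(size=2)

two_layers = W2 @ (W1 @ x)     # no activation in between
one_layer = (W2 @ W1) @ x      # collapsed: W' = W₂W₁
print(np.allclose(two_layers, one_layer))  # True
```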
Try it yourself: set activation to “Linear” in the playground. Add as many layers as you want; the output is still a straight line.
The activation function breaks linearity, allowing each layer to add new “bends.”
Common activations: ReLU, sigmoid, tanh, GELU.
An MLP with a single hidden layer can approximate any continuous function to arbitrary accuracy, given enough hidden neurons.
In playground terms: with enough neurons, you have enough bends to trace any curve.
What this doesn’t tell you: how many neurons you need, or whether gradient descent will actually find the right weights.
The playground does everything behind the scenes. In PyTorch, you write it:
1. Define the model:
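A minimal sketch of such a model, matching the `MLP(input_dim, hid_dim, output_dim)` signature used in the table (the layer names `fc1`/`fc2` are my own):

```python
import torch.nn as nn
import torch.nn.functional as F

class MLP(nn.Module):
    """One-hidden-layer network: y = W₂ σ(W₁x + b₁) + b₂."""
    def __init__(self, input_dim, hid_dim, output_dim):
        super().__init__()
        self.fc1 = nn.Linear(input_dim, hid_dim)
        self.fc2 = nn.Linear(hid_dim, output_dim)

    def forward(self, x):
        return self.fc2(F.relu(self.fc1(x)))   # σ = ReLU

model = MLP(input_dim=1, hid_dim=4, output_dim=1)
```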
| PyTorch code | Playground equivalent |
|---|---|
| `MLP(input_dim, hid_dim, output_dim)` | Network architecture (boxes) |
| `F.relu(...)` | Activation dropdown |
| `model(x)` | Data flows through network |
| `loss_fn(y_hat, y)` | Loss number updates |
| `loss.backward()` | Gradients computed (invisible in playground) |
| `optimizer.step()` | Weights change, output updates |
| `lr=0.001` | Learning rate slider |
PyTorch gives you full control over what the playground hides.
%%{init: {'theme': 'base', 'themeVariables': {'fontSize': '22px', 'nodePadding': 16}}}%%
flowchart LR
A[Input Data] --> B["Forward Pass<br/>model(x)"]
B --> C["Compute Loss<br/>loss_fn(ŷ, y)"]
C --> D["Backward Pass<br/>loss.backward()"]
D --> E["Update Params<br/>optimizer.step()"]
E --> B
Each iteration of this loop = one “tick” of the playground.
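The loop can be sketched in PyTorch, here fitting the cubic toy data (the dataset and hyperparameters are illustrative, not the playground's exact settings):

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# toy stand-in for the playground's cubic dataset: y = x³
x = torch.linspace(-1, 1, 128).unsqueeze(1)
y = x ** 3

model = nn.Sequential(nn.Linear(1, 4), nn.ReLU(), nn.Linear(4, 1))
loss_fn = nn.MSELoss()
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)

losses = []
for epoch in range(500):
    y_hat = model(x)               # forward pass
    loss = loss_fn(y_hat, y)       # compute loss
    optimizer.zero_grad()          # clear old gradients
    loss.backward()                # backward pass
    optimizer.step()               # update parameters
    losses.append(loss.item())

print(losses[0], losses[-1])       # loss drops over training
```

Plotting `losses` against epoch reproduces the playground's loss curve.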
Plot loss vs epoch to monitor training:

Open playground.hakyimlab.org and try:
How do we feed DNA to a neural network?
Sequence: A T G C G T A
A T G C
1: [ 1 0 0 0 ] ← A
2: [ 0 1 0 0 ] ← T
3: [ 0 0 1 0 ] ← G
4: [ 0 0 0 1 ] ← C
5: [ 0 0 1 0 ] ← G
6: [ 0 1 0 0 ] ← T
7: [ 1 0 0 0 ] ← A
Shape: (sequence_length, 4)
Each base → a 4-dimensional unit vector. No ordinal relationship imposed.
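A minimal NumPy encoder using the slide's A, T, G, C column order:

```python
import numpy as np

BASES = "ATGC"                      # column order from the slide
IDX = {b: i for i, b in enumerate(BASES)}

def one_hot(seq):
    """Return a (len(seq), 4) one-hot matrix, columns ordered A, T, G, C."""
    mat = np.zeros((len(seq), 4))
    mat[np.arange(len(seq)), [IDX[b] for b in seq]] = 1.0
    return mat

print(one_hot("ATGCGTA"))
```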
DNA sequence      One-hot matrix      Neural network    Prediction
ATGCGTAACG...  →  ┌─────────────┐  →  ┌──────────┐  →  binding score
                  │  1 0 0 0    │     │  MLP or  │     expression level
                  │  0 1 0 0    │     │   CNN    │     variant effect
                  │  0 0 1 0    │     │          │     ...
                  │  0 0 0 1    │     └──────────┘
                  │   ...       │
                  └─────────────┘
This is the core pattern of most of the course.
Every model we study takes DNA sequence as input and predicts a biological output.
MLP (this unit)
CNN (next)
A CNN filter sliding over one-hot DNA is mathematically equivalent to scoring with a Position Weight Matrix (PWM) — but learned from data!
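A sketch of that equivalence: sliding a hand-written PWM over a one-hot sequence is exactly the dot product a `Conv1d` filter computes (the PWM and sequence here are toy examples):

```python
import numpy as np

# toy PWM for a length-3 motif, columns ordered A, T, G, C
pwm = np.array([[1.0, 0, 0, 0],    # position 1 prefers A
                [0, 1.0, 0, 0],    # position 2 prefers T
                [0, 0, 1.0, 0]])   # position 3 prefers G

def one_hot(seq, order="ATGC"):
    mat = np.zeros((len(seq), 4))
    for i, b in enumerate(seq):
        mat[i, order.index(b)] = 1.0
    return mat

x = one_hot("CCATGCC")
# slide the PWM along the sequence: same operation a Conv1d filter performs
scores = np.array([np.sum(x[i:i + 3] * pwm) for i in range(len(x) - 2)])
print(scores)  # peaks where the motif "ATG" sits
```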
| Session | Topic | Notebook |
|---|---|---|
| Week 1a | Setup, intro | (this lecture) |
| Week 1b | Linear → MLP in PyTorch | hands-on-introduction_to_deep_learning.ipynb |
| Week 2 | CNN for DNA scoring | updated-basic_DNA_tutorial.ipynb |
| Week 3a | TF binding project | tf-binding-prediction-starter.ipynb |
| Week 3b | Hyperparameter tuning | tf-binding-wandb.ipynb |
Unit 1: Transformers & GPT
Unit 2: Enformer & Borzoi
The arc: MLP → CNN → Transformer → Genomic foundation models
Videos:
Interactive:
Environment: Google Colab (GPU provided, no local setup needed)
First notebook: hands-on-introduction_to_deep_learning.ipynb
GENE 46100 · Deep Learning in Genomics