Measuring information in DNA sequences

Author

Haky Im

Published

April 3, 2025

Understanding Entropy and Information Content in DNA Sequence Analysis

In the analysis of DNA sequences, particularly when studying motifs and patterns, two key concepts from information theory are entropy and information content. These measures help quantify the variability and conservation of nucleotides at specific positions within a set of aligned sequences.

  1. Entropy (H):

Entropy measures the uncertainty or randomness at a given position. For DNA with four possible bases (A, C, G, T) and their respective probabilities at a position being p(A), p(C), p(G), and p(T) (where p(A) + p(C) + p(G) + p(T) = 1), the Shannon entropy (H) in bits is calculated using the following formula:

H = - (p(A) * log2(p(A)) + p(C) * log2(p(C)) + p(G) * log2(p(G)) + p(T) * log2(p(T)))

By convention, any term with probability 0 contributes 0 to the sum (this is why the code below skips zero probabilities).

A higher entropy value indicates greater uncertainty or variability in the nucleotides at that position. A lower entropy value indicates less uncertainty, meaning one or a few nucleotides are more dominant. The maximum possible entropy for DNA (when all four bases are equally likely) is 2 bits. The minimum possible entropy (when only one base is present) is 0 bits.
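
As a quick illustrative example (not tied to any particular motif), consider a position where only A and C are observed, each with probability 0.5:

H = - (0.5 * log2(0.5) + 0.5 * log2(0.5)) = - (0.5 * (-1) + 0.5 * (-1)) = 1 bit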

  2. Information Content (IC):

Information content measures the amount of information gained by knowing the nucleotide at a particular position. It reflects the conservation of that position relative to a background distribution (often assumed to be uniform). The information content (IC) in bits is calculated as:

IC = log2(N) - H

Where:

N is the number of possible nucleotides (4 for DNA).

log2(N) represents the maximum possible information content (2 bits for DNA).

H is the Shannon entropy calculated for that position.

A higher information content value indicates greater conservation and importance of specific nucleotides at that position.

A lower information content value indicates less conservation and more variability.

The maximum possible information content for DNA is 2 bits (when entropy is 0).

The minimum possible information content (when entropy is maximum) is 0 bits.
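
Continuing the illustrative half-A/half-C example above (H = 1 bit), the corresponding information content would be:

IC = log2(4) - 1 = 2 - 1 = 1 bit

That is, the position would carry half of the maximum 2 bits.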

import math

def calculate_entropy(pA, pC, pG, pT):
    """
    Calculates the Shannon entropy for DNA bases given their probabilities.

    Args:
        pA (float): Probability of Adenine (A).
        pC (float): Probability of Cytosine (C).
        pG (float): Probability of Guanine (G).
        pT (float): Probability of Thymine (T).

    Returns:
        float: The Shannon entropy in bits.
    """
    entropy = 0
    if pA > 0:
        entropy -= pA * math.log2(pA)
    if pC > 0:
        entropy -= pC * math.log2(pC)
    if pG > 0:
        entropy -= pG * math.log2(pG)
    if pT > 0:
        entropy -= pT * math.log2(pT)
    return entropy

def calculate_information_content(pA, pC, pG, pT):
    """
    Calculates the information content (in bits) at a given position
    based on the nucleotide probabilities.

    Args:
        pA (float): Probability of Adenine (A).
        pC (float): Probability of Cytosine (C).
        pG (float): Probability of Guanine (G).
        pT (float): Probability of Thymine (T).

    Returns:
        float: The information content in bits.
    """
    num_nucleotides = 4  # A, C, G, T
    max_entropy = math.log2(num_nucleotides)  # Maximum entropy for 4 equally likely bases
    entropy = calculate_entropy(pA, pC, pG, pT)
    information_content = max_entropy - entropy
    return information_content

# Example Usage:
probability_A = 0.3
probability_C = 0.2
probability_G = 0.25
probability_T = 0.25

entropy_value = calculate_entropy(probability_A, probability_C, probability_G, probability_T)
information_content_value = calculate_information_content(probability_A, probability_C, probability_G, probability_T)

print(f"Probabilities: p(A)={probability_A:.2f}, p(C)={probability_C:.2f}, p(G)={probability_G:.2f}, p(T)={probability_T:.2f}")
print(f"Entropy: {entropy_value:.4f} bits")
print(f"Information Content: {information_content_value:.4f} bits")

# Example with uniform distribution:
entropy_uniform = calculate_entropy(0.25, 0.25, 0.25, 0.25)
information_content_uniform = calculate_information_content(0.25, 0.25, 0.25, 0.25)
print(f"\nProbabilities (uniform): p(A)=0.25, p(C)=0.25, p(G)=0.25, p(T)=0.25")
print(f"Entropy (uniform): {entropy_uniform:.4f} bits")
print(f"Information Content (uniform): {information_content_uniform:.4f} bits")

# Example with a fully conserved position:
entropy_conserved = calculate_entropy(1.0, 0.0, 0.0, 0.0)
information_content_conserved = calculate_information_content(1.0, 0.0, 0.0, 0.0)
print(f"\nProbabilities (fully conserved): p(A)=1.00, p(C)=0.00, p(G)=0.00, p(T)=0.00")
print(f"Entropy (fully conserved): {entropy_conserved:.4f} bits")
print(f"Information Content (fully conserved): {information_content_conserved:.4f} bits")

Probabilities: p(A)=0.30, p(C)=0.20, p(G)=0.25, p(T)=0.25
Entropy: 1.9855 bits
Information Content: 0.0145 bits

Probabilities (uniform): p(A)=0.25, p(C)=0.25, p(G)=0.25, p(T)=0.25
Entropy (uniform): 2.0000 bits
Information Content (uniform): 0.0000 bits

Probabilities (fully conserved): p(A)=1.00, p(C)=0.00, p(G)=0.00, p(T)=0.00
Entropy (fully conserved): 0.0000 bits
Information Content (fully conserved): 2.0000 bits
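
The functions above take the probabilities of each base as inputs. As a possible extension (a minimal sketch, not part of the original code), the same calculation can be applied column by column to a set of aligned sequences, which is how per-position information content is typically obtained for motifs. The toy alignment and the names column_information_content and aligned_seqs below are illustrative assumptions.

import math

def column_information_content(column):
    """
    Calculates the Shannon entropy and information content (in bits)
    for a single alignment column (an iterable of bases).
    """
    # Count each base in the column (assumes only A, C, G, T are present).
    counts = {base: 0 for base in "ACGT"}
    for base in column:
        counts[base] += 1
    total = sum(counts.values())

    # Shannon entropy over the observed base frequencies.
    entropy = 0.0
    for count in counts.values():
        p = count / total
        if p > 0:
            entropy -= p * math.log2(p)

    # Information content relative to a uniform background.
    information_content = math.log2(4) - entropy
    return entropy, information_content

# Toy alignment (hypothetical motif instances), one sequence per row.
aligned_seqs = [
    "ACGTA",
    "ACGTT",
    "ACGCA",
    "ATGTA",
]

# zip(*aligned_seqs) yields one column of the alignment at a time.
for i, column in enumerate(zip(*aligned_seqs)):
    h, ic = column_information_content(column)
    print(f"Position {i+1}: bases={''.join(column)}  H={h:.3f} bits  IC={ic:.3f} bits")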

** text and code generated with Gemini 2.0 via a series of prompts **

© HakyImLab and Listed Authors - CC BY 4.0 License