CST / Part IA / Lent / Machine Learning & Real-world Data

Machine Learning and Real-world Data

Cambridge CST · concise comprehensive notes based on the lecture PDFs provided · focus: Naive Bayes, HMMs, network analysis, and sound experimental methodology

Course Map

MLRD is less about fancy models than about simple models used correctly. The course teaches three machine learning settings and repeatedly stresses that evaluation methodology matters as much as the algorithm.

Main Topics
  • Statistical classification: movie-review sentiment with Naive Bayes.
  • Sequence analysis: Hidden Markov Models and decoding with Viterbi.
  • Social networks: degree, diameter, centrality, clique/community structure.
Recurring Methodology
  • Separate training, validation, and test data.
  • Use significance tests, not raw score differences alone.
  • Guard against overtraining/overfitting.
  • Choose metrics that match the task and class balance.

Standard ML Workflow

  1. Define features and classes.
  2. Estimate parameters from labelled training data.
  3. Apply the model to unseen test items.
  4. Measure performance with appropriate metrics.
  5. Check whether observed improvements are statistically significant.
  6. Prevent overtraining by disciplined use of development and test data.
The course objective is not just to implement the model, but to make defensible claims about whether it works.

Classification Basics

In sentiment classification, the observed data are review words and the hidden label is sentiment. Features are automatically observable properties of the input; in this practical, the features are words. Classes are the meaningful labels, typically POS and NEG, later extended to include NEUTRAL.

Feature
Observable attribute of an item, e.g. word occurrence.
Class
Target label to predict, e.g. positive or negative sentiment.
Classifier
Function mapping feature observations to class decisions.
Probabilistic classifier
Returns a distribution over classes, then usually predicts the argmax.
P(c | O) for each class c
ĉ = argmax P(c | O)

Why learn from data rather than hand-written lexicons? Because lexicons depend on human intuition, are static, and require manual labour. A trained classifier can adapt to the actual data distribution.

Naive Bayes

Naive Bayes is a simple probabilistic classifier using Bayes' theorem plus a strong conditional independence assumption between features given the class.

Bayes Decision Rule

P(c | O) = P(c) P(O | c) / P(O)
ĉNB = argmax P(c) P(O | c)

P(O) is constant for a fixed document, so it can be dropped under argmax.

Naive Independence Assumption

O = {w1, ..., wn}
P(O | c) = P(w1, ..., wn | c) ≈ ∏ P(wi | c)
ĉNB = argmax P(c) ∏ P(wi | c)

Parameter Estimation

Estimate probabilities by maximum likelihood from labelled training data.

P̂(wi | c) = count(wi, c) / Σw∈V count(w, c)
P̂(c) = Nc / Nrev

Log-space Computation

Products of many small probabilities underflow numerically, so classify in log-space.

ĉNB = argmax [log P(c) + Σ log P(wi | c)]

Smoothing

Without smoothing, an unseen word in class c gives probability 0 and can wipe out the entire product. Add-one/Laplace smoothing redistributes some probability mass to unseen events.

P̂(wi | c) = (count(wi, c) + 1) / (Σw∈V count(w, c) + |V|)
Smoothing helps because natural language has a long tail: many low-frequency words will not appear in training even though they are real and likely to appear later.
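The training, smoothing, and log-space steps above can be sketched as follows. This is a minimal illustration, not the practical's actual code: `train_nb`, `classify_nb`, and the document format (a list of word-list/label pairs) are assumptions for the example.

```python
import math
from collections import Counter, defaultdict

def train_nb(docs):
    """docs: list of (list_of_words, class_label) pairs."""
    class_counts = Counter()
    word_counts = defaultdict(Counter)  # class -> word -> count
    vocab = set()
    for words, c in docs:
        class_counts[c] += 1
        for w in words:
            word_counts[c][w] += 1
            vocab.add(w)
    return class_counts, word_counts, vocab

def classify_nb(words, class_counts, word_counts, vocab):
    n_docs = sum(class_counts.values())
    best_c, best_score = None, float("-inf")
    for c in class_counts:
        total = sum(word_counts[c].values())
        score = math.log(class_counts[c] / n_docs)  # log prior
        for w in words:
            if w in vocab:  # words unseen in every class are skipped
                # add-one smoothed log likelihood
                score += math.log((word_counts[c][w] + 1) / (total + len(vocab)))
        if score > best_score:
            best_c, best_score = c, score
    return best_c
```

Summing logs instead of multiplying probabilities avoids the underflow problem, and the `+ 1` / `+ |V|` terms implement the smoothed estimate above.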

Zipf's Law and Heaps' Law

These laws explain why sparse lexical data is normal and why smoothing matters.

Zipf's Law

A word's frequency is approximately inversely proportional to its rank. A few words are extremely common; very many words are rare.

fw ≈ k / rw^α
  • Produces a long-tail distribution.
  • Means unseen words are expected, not exceptional.
  • Explains why unsmoothed models fail badly on new text.

Heaps' Law

As corpus size grows, vocabulary size also grows, though with diminishing returns.

V(n) ≈ K n^β
  • New text keeps introducing new types.
  • You do not “finish” observing the vocabulary.
  • So train/test mismatch in word types is structurally unavoidable.
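Heaps-style growth is easy to observe on synthetic data. The setup below is entirely hypothetical: 10,000 word types with Zipfian weights 1/rank, sampled for 50,000 tokens, tracking distinct types seen at checkpoints.

```python
import random

random.seed(0)

# Hypothetical vocabulary: 10,000 word types with Zipfian weights 1/rank
words = [f"w{r}" for r in range(1, 10_001)]
weights = [1 / r for r in range(1, 10_001)]

seen = set()
growth = []  # (tokens read, distinct types seen) checkpoints
for n, w in enumerate(random.choices(words, weights=weights, k=50_000), start=1):
    seen.add(w)
    if n % 10_000 == 0:
        growth.append((n, len(seen)))
```

The vocabulary count keeps rising at every checkpoint, but each successive block of 10,000 tokens contributes fewer new types — the diminishing returns that Heaps' law describes.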
Why This Matters For NB

Zipf says probability mass is heavily concentrated but the tail is huge. Heaps says the tail keeps growing as more text arrives. Together they explain why a classifier must reserve some probability for unseen words.

Significance Testing

Higher accuracy on one test set does not by itself prove that one system is truly better. Variation in the sample may create apparent improvements by chance. MLRD uses a paired, non-parametric sign test.

Null Hypothesis

The two systems are equally good; observed differences come from the same underlying distribution.

Sign Test Setup

  • Positive event: System 1 beats System 2 on an item.
  • Negative event: System 2 beats System 1 on an item.
  • Tie: split as 0.5 positive and 0.5 negative, then round up conservatively.
X ~ Binomial(N, 0.5)
P(X = k) = (N choose k) 0.5^N
P(X ≤ k) = Σi=0..k (N choose i) 0.5^N
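A two-tailed exact sign test is short enough to write out. The helper name `sign_test_p` is invented for this sketch; the tie convention (split ties evenly, rounding up so the result stays conservative) follows the setup above, though conventions vary between texts.

```python
from math import ceil, comb

def sign_test_p(plus, minus, ties=0):
    """Two-tailed exact sign test p-value under the null X ~ Binomial(n, 0.5).
    Ties are split half to each side, rounded up (conservative)."""
    half_ties = ceil(ties / 2)
    n = plus + minus + 2 * half_ties
    k = min(plus, minus) + half_ties
    # P(X <= k), doubled for a two-tailed test, capped at 1
    p = 2 * sum(comb(n, i) for i in range(k + 1)) * 0.5 ** n
    return min(p, 1.0)
```

For example, 8 wins against 2 losses gives p ≈ 0.109, which fails a test at α = 0.05: the apparent improvement could plausibly be chance.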

Interpretation

  • If p ≤ α, reject the null hypothesis.
  • A two-tailed test is more conservative than a one-tailed test.
  • Passing the test supports a claim of better.
  • Failing the test is inconclusive, not proof of equality.

Errors

Type I error
Claiming a difference that is not real. Probability α. Specificity is 1 - α.
Type II error
Missing a real difference. Probability β. Power is 1 - β.
Never say one system is “better” on the basis of raw numbers alone. Report effect size and significance separately.

Overtraining and Cross-validation

Overtraining or overfitting means improving performance on familiar data while making the model worse at generalising to truly unseen data.

How Overtraining Happens

  • Repeatedly tweak the model.
  • Measure on the same small test set each time.
  • Keep the variant that happens to score best there.

This indirectly trains on the test data because the system design becomes adapted to accidental quirks of that set.

N-Fold Cross-validation

  1. Split data into N folds.
  2. For each fold: train on N-1 folds, test on the held-out fold.
  3. Average the fold scores.
var = (1 / n) Σ (xi - μ)^2

Low variance across folds is reassuring; high variance means your result depends strongly on the split.
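The fold-and-average loop can be sketched generically. The round-robin split and the callback name `train_and_eval` are assumptions of this example; the variance formula matches the one above.

```python
def n_fold_scores(items, n, train_and_eval):
    """Round-robin split into n folds; train_and_eval(train, test) -> score.
    Returns the mean score and the variance across folds."""
    folds = [items[i::n] for i in range(n)]  # simple round-robin split
    scores = []
    for i in range(n):
        test = folds[i]
        train = [x for j, fold in enumerate(folds) if j != i for x in fold]
        scores.append(train_and_eval(train, test))
    mean = sum(scores) / n
    variance = sum((s - mean) ** 2 for s in scores) / n
    return mean, variance
```

Reporting the variance alongside the mean is the point: it tells you how split-dependent the result is.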

Important Variants

  • Stratified cross-validation: preserve class proportions in each fold.
  • Leave-one-out / jack-knifing: extreme case with one item per fold.
  • Dependency-sensitive splitting: isolate known dependencies, e.g. genre or source.

Validation vs Test

The clean workflow is:

  1. Use training data to fit parameters.
  2. Use a validation/development set to choose hyperparameters and stop feature fiddling.
  3. Use the test set once, at the end, for the final claim.

Confusion Matrix

Useful for debugging because aggregate accuracy hides where the errors come from.

            Pred POS          Pred NEG
True POS    Correct           False negative
True NEG    False positive    Correct

Uncertainty, Neutrality, and Human Agreement

Binary sentiment classification is artificially clean because the practical initially uses only extreme reviews. Real data includes mid-range, ambiguous, or mixed reviews, so a NEUTRAL class becomes necessary.

Why Ground Truth Gets Hard

  • Writers may express mixed opinions.
  • Readers interpret tone differently.
  • Star ratings are only a proxy for sentiment.

The lecture’s key epistemic claim is that for subjective labels, human agreement is the only empirically available source of truth.

Observed Agreement

With many raters, plain accuracy is inappropriate. Instead compute average pairwise agreement P̄a.

P̄a = (1 / N) Σi (observed agreeing rater pairs on item i / possible rater pairs on item i)

For k raters, the number of possible rater pairs is k(k - 1) / 2.

Chance Agreement

Some agreement would occur even if raters guessed according to class frequencies.

P̄e = Σc p(c)^2

Fleiss' Kappa

κ = (P̄a - P̄e) / (1 - P̄e)
  • Measures agreement above chance.
  • Higher is better.
  • Useful when many raters assign categorical labels.
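Putting P̄a, P̄e, and κ together, a minimal Fleiss' kappa computation might look like this. The input layout (one row per item, one column per category, entries counting raters) and the function name are assumptions of this sketch; every item is assumed to have the same number of raters.

```python
def fleiss_kappa(table):
    """table[i][c] = number of raters assigning item i to category c.
    Assumes every item is rated by the same number of raters k >= 2."""
    N = len(table)           # items
    k = sum(table[0])        # raters per item
    # observed agreement: agreeing rater pairs / possible pairs, per item
    pa = sum(sum(n * (n - 1) for n in row) / (k * (k - 1)) for row in table) / N
    # chance agreement from overall category proportions
    totals = [sum(row[c] for row in table) for c in range(len(table[0]))]
    pe = sum((t / (N * k)) ** 2 for t in totals)
    return (pa - pe) / (1 - pe)
```

With perfect agreement and balanced categories, P̄a = 1 and P̄e = 0.5, so κ = 1, as expected.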

Network Basics

Social networks are modelled here as unweighted, undirected graphs. Nodes are entities; edges are social or structural links.

Distance
Length of the shortest path between two nodes.
Diameter
Maximum shortest-path distance over all node pairs.
Degree
Number of neighbours of a node.
Connected component
Maximal set of nodes connected by paths.
Giant component
A component containing most of the graph’s nodes.

Small-world Property

  • Low average path length.
  • High clustering.
  • Sparse overall connectivity.

Weak Ties and Bridges

Weak ties often hold the giant component together. A bridge is an edge whose removal disconnects its component. A local bridge is an edge whose endpoints have no common neighbour; removing it increases the shortest-path distance between its endpoints to more than two.

Triadic Closure and Clustering

If A knows B and A knows C, then B and C are more likely to know each other. This is triadic closure.

  • Global clustering coefficient: fraction of connected triples that are closed (i.e. form triangles).
  • Node clustering coefficient: probability that two neighbours of a node are themselves neighbours.

Basic Algorithms

For this course, degree is trivial to compute, and because the graphs are unweighted, the diameter can be found by running BFS from every node.
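The BFS-from-every-node approach can be sketched directly. The adjacency representation (dict of node to neighbour set) is an assumption of this example, and the graph is assumed connected.

```python
from collections import deque

def bfs_distances(graph, source):
    """graph: dict node -> set of neighbours (unweighted, undirected).
    Returns shortest-path distances from source."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        u = queue.popleft()
        for v in graph[u]:
            if v not in dist:
                dist[v] = dist[u] + 1
                queue.append(v)
    return dist

def diameter(graph):
    # longest shortest path over all pairs: one BFS per node
    return max(max(bfs_distances(graph, s).values()) for s in graph)
```

Each BFS is O(V + E), so the whole diameter computation is O(V(V + E)) — fine at this course's graph sizes.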

Betweenness Centrality and Clique Intuition

Betweenness centrality identifies gatekeepers: nodes or edges that lie on many shortest paths and therefore connect otherwise denser regions.

Node Betweenness

CB(v) = Σs,t σ(s, t | v) / σ(s, t)
σ(s, t)
Number of shortest paths from s to t.
σ(s, t | v)
How many of those shortest paths pass through v.

A node with high betweenness may not have high degree: it matters where the node sits, not just how many edges it has.

Clique / Community Intuition

If an edge has high edge betweenness, it is plausibly acting as a connector between dense subgraphs. Repeatedly removing such edges tends to isolate communities. This is the idea behind clique/community splitting in the social-network part of the course.

Brandes Algorithm

Naively computing betweenness by enumerating all shortest paths is too slow. Brandes exploits dynamic programming structure.

Core Recurrences

σ(s, t) = Σu∈Pred(t) σ(s, u)
δ(s | v) = Σ(v,w)∈E, d(s,w)=d(s,v)+1 [σ(s,v) / σ(s,w)] (1 + δ(s | w))
CB(v) = Σs δ(s | v)

Algorithm Structure

  1. For each source node s, run BFS.
  2. During BFS, compute distances, predecessor lists, and shortest-path counts σ.
  3. Push visited nodes onto a stack in visitation order.
  4. Pop nodes in reverse order and back-propagate dependencies δ.
  5. Add dependency contributions into CB(v).
In undirected graphs, each pair is effectively counted twice, so Brandes' final scores are halved.
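The five steps above fit in one function. This is a sketch of node betweenness following Brandes' structure, assuming the same dict-of-neighbour-sets graph representation as before; it is not the course's reference implementation.

```python
from collections import deque

def brandes_betweenness(graph):
    """Node betweenness for an unweighted, undirected graph
    given as dict node -> iterable of neighbours."""
    cb = {v: 0.0 for v in graph}
    for s in graph:
        # BFS from s: distances, shortest-path counts sigma, predecessors
        dist = {s: 0}
        sigma = {v: 0 for v in graph}
        sigma[s] = 1
        pred = {v: [] for v in graph}
        order = []  # nodes in visitation order
        queue = deque([s])
        while queue:
            u = queue.popleft()
            order.append(u)
            for w in graph[u]:
                if w not in dist:
                    dist[w] = dist[u] + 1
                    queue.append(w)
                if dist[w] == dist[u] + 1:  # u precedes w on a shortest path
                    sigma[w] += sigma[u]
                    pred[w].append(u)
        # back-propagate dependencies delta in reverse visitation order
        delta = {v: 0.0 for v in graph}
        for w in reversed(order):
            for u in pred[w]:
                delta[u] += (sigma[u] / sigma[w]) * (1 + delta[w])
            if w != s:
                cb[w] += delta[w]
    # each unordered pair is counted from both endpoints, so halve
    return {v: c / 2 for v, c in cb.items()}
```

On the path a–b–c, node b lies on the single shortest path between a and c, so its betweenness is 1 while the endpoints score 0.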

Markov Models

A first-order Markov model assumes the next state depends only on the current state, not the full history.

P(Xt | X1, ..., Xt-1) ≈ P(Xt | Xt-1)

This yields the standard sequence factorisation:

P(X1, ..., XT) ≈ ∏t P(Xt | Xt-1)

Markov Chain

  • Finite discrete states.
  • States are fully observable.
  • Transitions have probabilities.

Markov chains are good when the states themselves are directly visible. They are not enough when the real states are hidden and we only observe side effects.

Hidden Markov Models

An HMM adds an observation layer to a Markov chain. Hidden states evolve according to transition probabilities, and each hidden state emits observable symbols according to an emission distribution.

Assumptions

  1. Markov / limited-horizon assumption: transitions depend only on the previous state.
  2. Output independence assumption: the observation at time t depends only on the current hidden state Xt.
P(Ot | X1...Xt, O1...Ot-1) ≈ P(Ot | Xt)

Formal Components

Se
Set of emitting hidden states {s1, ..., sN}.
s0, sf
Special start and end states.
K
Observation alphabet {k1, ..., kM}.
A
Transition matrix, where aij = P(Xt = sj | Xt-1 = si).
B
Emission matrix, where bi(kj) = P(Ot = kj | Xt = si).

Labelled Learning

Given paired state and observation sequences, estimate the parameters by relative frequency counts.

aij ≈ counttrans(si → sj) / counttrans(si)
bi(kj) ≈ countemission(kj from si) / countemission(si)

Add-one smoothing can also be applied here.
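Relative-frequency estimation with add-one smoothing can be sketched as below. The start/end markers "#" and "!" and the function name are assumptions of this example, chosen to match the s0/sf states above.

```python
from collections import Counter

def estimate_hmm(sequences, states, alphabet):
    """sequences: list of lists of (state, observation) pairs.
    Returns add-one smoothed transition and emission probability dicts.
    '#' and '!' stand in for the special start and end states."""
    trans = Counter()
    emit = Counter()
    for seq in sequences:
        prev = "#"
        for state, obs in seq:
            trans[(prev, state)] += 1
            emit[(state, obs)] += 1
            prev = state
        trans[(prev, "!")] += 1
    targets = list(states) + ["!"]  # possible next states
    a = {(i, j): (trans[(i, j)] + 1)
                 / (sum(trans[(i, k)] for k in targets) + len(targets))
         for i in ["#"] + list(states) for j in targets}
    b = {(i, o): (emit[(i, o)] + 1)
                 / (sum(emit[(i, k)] for k in alphabet) + len(alphabet))
         for i in states for o in alphabet}
    return a, b
```

As with Naive Bayes, the smoothing guarantees that no transition or emission probability is exactly zero, which matters later when Viterbi multiplies them along a path.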

Core HMM Problems

Problem               Meaning
Labelled learning     Learn A and B from paired hidden-state and observation sequences.
Unlabelled learning   Learn parameters from observations only.
Likelihood            Compute P(O | μ) for a model μ = (A, B).
Decoding              Find the most likely hidden state sequence for the observations.
Course Application

The toy example is the dishonest casino with fair and loaded dice. The broader course application is biological sequence analysis, where the hidden states correspond to structural categories and the observations are amino-acid symbols.

Viterbi Decoding

Decoding asks for the single most likely hidden state sequence given the observation sequence and HMM parameters.

X̂ = argmaxX P(X, O | μ)
   = argmaxX ∏t P(Ot | Xt, B) P(Xt | Xt-1, A)

Brute force tries all O(N^T) state sequences, which is infeasible.

Dynamic Programming Idea

For each time and state, keep only the probability of the best path ending there. This works because of the first-order Markov property.

Trellis Variables

δj(t) = max1≤i≤N [δi(t-1) aij bj(Ot)]

δj(t) is the probability of the best path that ends in state sj at time t.

ψj(t) = argmax1≤i≤N [δi(t-1) aij bj(Ot)]

ψj(t) stores the backpointer: which previous state gave that optimum.

Algorithm

  1. Initialise first column from start probabilities and first emissions.
  2. Recurrence fill the trellis left to right using δ and ψ.
  3. Terminate by choosing the best final state.
  4. Backtrace through ψ to recover the best hidden path.

Complexity

For a first-order HMM, Viterbi runs in O(N^2 T), a huge improvement over brute force.

You must keep the best path for every state at every time, not just the globally best state at each time. A locally second-best state can later become part of the globally best sequence.
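The initialise/recurse/terminate/backtrace steps can be sketched in log space, which also handles the underflow issue. The parameter layout (`start`, `trans`, `emit` as probability dicts) is an assumption of this example, and all referenced probabilities are assumed nonzero (e.g. after smoothing).

```python
import math

def viterbi(obs, states, start, trans, emit):
    """Most likely hidden state sequence for an observation list.
    start[s], trans[(s, t)], emit[(s, o)]: nonzero probabilities."""
    # initialise the first trellis column from start probs and first emissions
    delta = [{s: math.log(start[s]) + math.log(emit[(s, obs[0])]) for s in states}]
    psi = [{}]
    for t in range(1, len(obs)):
        delta.append({})
        psi.append({})
        for j in states:
            # best predecessor for state j at time t
            best_i = max(states, key=lambda i: delta[t - 1][i] + math.log(trans[(i, j)]))
            psi[t][j] = best_i  # backpointer
            delta[t][j] = (delta[t - 1][best_i]
                           + math.log(trans[(best_i, j)])
                           + math.log(emit[(j, obs[t])]))
    # terminate: pick the best final state, then backtrace through psi
    last = max(states, key=lambda s: delta[-1][s])
    path = [last]
    for t in range(len(obs) - 1, 0, -1):
        path.append(psi[t][path[-1]])
    return list(reversed(path))
```

On a dishonest-casino toy model where the loaded die heavily favours sixes, a run of sixes decodes to the loaded state, as expected.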

Metrics Cheat Sheet

Accuracy

Accuracy = correct / total

Good when classes are balanced and all errors matter roughly equally.

Precision and Recall

Useful when one class is the real focus, or class balance is skewed.

Precision = TP / (TP + FP)
Recall    = TP / (TP + FN)
F1        = 2PR / (P + R)

In the HMM dice task, the lectures emphasise measuring these for the interesting class L rather than relying only on accuracy.
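Computing these per-class metrics is a few lines. The helper name `prf1` and the sequence-of-labels input are assumptions of this sketch.

```python
def prf1(gold, pred, target):
    """Precision, recall, F1 for one class of interest (e.g. 'L')."""
    tp = sum(g == target and p == target for g, p in zip(gold, pred))
    fp = sum(g != target and p == target for g, p in zip(gold, pred))
    fn = sum(g == target and p != target for g, p in zip(gold, pred))
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1
```

Note that a predictor that never outputs the rare class can still have high accuracy, which is exactly why the per-class view matters.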

When To Prefer What

Metric      Best Use
Accuracy    Balanced tasks with comparable error costs.
Precision   False positives are costly.
Recall      False negatives are costly.
F1          Need a single score balancing precision and recall.
Kappa       Agreement among multiple human annotators.

What To Say In Answers

For classification questions

  • Define features, classes, and the independence assumption explicitly.
  • State why smoothing is needed using Zipf and Heaps, not just “because unseen words exist”.
  • Separate training, validation, and test roles clearly.
  • Never equate higher score with proven superiority unless significance is addressed.

For network questions

  • Distinguish degree from centrality.
  • Explain that betweenness counts shortest-path dependence, not generic “importance”.
  • For diameter on unweighted graphs, mention BFS-based shortest paths.
  • For community splitting, connect edge betweenness to sparse inter-cluster links.

For HMM questions

  • State the two assumptions: first-order transitions and output independence.
  • Distinguish hidden states from observations.
  • Say exactly what A and B represent.
  • For Viterbi, mention trellis, recurrence, backpointers, and O(N^2 T) complexity.
The course rewards methodological precision. A short answer that names the right assumptions, data split discipline, and evaluation logic is usually stronger than a longer vague one.