CST / Part IA / Lent / Machine Learning & Real-world Data

Machine Learning and Real-world Data

Cambridge CST · concise comprehensive notes based on the lecture PDFs provided · focus: Naive Bayes, HMMs, network analysis, and sound experimental methodology

Course Map

MLRD is less about fancy models than about simple models used correctly. The course teaches three machine learning settings and repeatedly stresses that evaluation methodology matters as much as the algorithm.

Main Topics
  • Statistical classification: movie-review sentiment with Naive Bayes.
  • Sequence analysis: Hidden Markov Models and decoding with Viterbi.
  • Social networks: degree, diameter, centrality, clique/community structure.
Recurring Methodology
  • Separate training, validation, and test data.
  • Use significance tests, not raw score differences alone.
  • Guard against overtraining/overfitting.
  • Choose metrics that match the task and class balance.

Standard ML Workflow

  1. Define features and classes.
  2. Estimate parameters from labelled training data.
  3. Apply the model to unseen test items.
  4. Measure performance with appropriate metrics.
  5. Check whether observed improvements are statistically significant.
  6. Prevent overtraining by disciplined use of development and test data.
The course objective is not just to implement the model, but to make defensible claims about whether it works.

Classification Basics

In sentiment classification, the observed data are review words and the hidden label is sentiment. Features are automatically observable properties of the input; in this practical, the features are words. Classes are the meaningful labels, typically POS and NEG, later extended to include NEUTRAL.

Feature
Observable attribute of an item, e.g. word occurrence.
Class
Target label to predict, e.g. positive or negative sentiment.
Classifier
Function mapping feature observations to class decisions.
Probabilistic classifier
Returns a distribution over classes, then usually predicts the argmax.
P(c | O) for each class c
ĉ = argmax P(c | O)

Why learn from data rather than hand-written lexicons? Because lexicons depend on human intuition, are static, and require manual labour. A trained classifier can adapt to the actual data distribution.

Naive Bayes

Naive Bayes is a simple probabilistic classifier using Bayes' theorem plus a strong conditional independence assumption between features given the class.

Bayes Decision Rule

P(c | O) = P(c) P(O | c) / P(O)
ĉNB = argmax P(c) P(O | c)

P(O) is constant for a fixed document, so it can be dropped under argmax.

Naive Independence Assumption

O = {w1, ..., wn}
P(O | c) = P(w1, ..., wn | c) ≈ ∏ P(wi | c)
ĉNB = argmax P(c) ∏ P(wi | c)

Parameter Estimation

Estimate probabilities by maximum likelihood from labelled training data.

P̂(wi | c) = count(wi, c) / Σw∈V count(w, c)
P̂(c) = Nc / Nrev

Log-space Computation

Products of many small probabilities underflow numerically, so classify in log-space.

ĉNB = argmax [log P(c) + Σ log P(wi | c)]

Smoothing

Without smoothing, an unseen word in class c gives probability 0 and can wipe out the entire product. Add-one/Laplace smoothing redistributes some probability mass to unseen events.

P̂(wi | c) = (count(wi, c) + 1) / (Σw∈V count(w, c) + |V|)
Smoothing helps because natural language has a long tail: many low-frequency words will not appear in training even though they are real and likely to appear later.
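The training, smoothing, and log-space steps above can be sketched as follows. This is a minimal illustration, not the practical's actual code: `train_nb`, `classify_nb`, and the document format (a list of word-list/label pairs) are assumptions for the example.

```python
import math
from collections import Counter, defaultdict

def train_nb(docs):
    """docs: list of (list_of_words, class_label) pairs."""
    class_counts = Counter()
    word_counts = defaultdict(Counter)  # class -> word -> count
    vocab = set()
    for words, c in docs:
        class_counts[c] += 1
        for w in words:
            word_counts[c][w] += 1
            vocab.add(w)
    return class_counts, word_counts, vocab

def classify_nb(words, class_counts, word_counts, vocab):
    n_docs = sum(class_counts.values())
    best_c, best_score = None, float("-inf")
    for c in class_counts:
        total = sum(word_counts[c].values())
        score = math.log(class_counts[c] / n_docs)  # log prior
        for w in words:
            if w in vocab:  # words unseen in every class are skipped
                # add-one smoothed log likelihood
                score += math.log((word_counts[c][w] + 1) / (total + len(vocab)))
        if score > best_score:
            best_c, best_score = c, score
    return best_c
```

Summing logs instead of multiplying probabilities avoids the underflow problem, and the `+ 1` / `+ |V|` terms implement the smoothed estimate above.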

Zipf's Law and Heaps' Law

These laws explain why sparse lexical data is normal and why smoothing matters.

Zipf's Law

A word's frequency is approximately inversely proportional to its rank. A few words are extremely common; very many words are rare.

fw ≈ k / rw^α
  • Produces a long-tail distribution.
  • Means unseen words are expected, not exceptional.
  • Explains why unsmoothed models fail badly on new text.

Heaps' Law

As corpus size grows, vocabulary size also grows, though with diminishing returns.

V(n) ≈ K n^β
  • New text keeps introducing new types.
  • You do not “finish” observing the vocabulary.
  • So train/test mismatch in word types is structurally unavoidable.
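Heaps-style growth is easy to observe on synthetic data. The setup below is entirely hypothetical: 10,000 word types with Zipfian weights 1/rank, sampled for 50,000 tokens, tracking distinct types seen at checkpoints.

```python
import random

random.seed(0)

# Hypothetical vocabulary: 10,000 word types with Zipfian weights 1/rank
words = [f"w{r}" for r in range(1, 10_001)]
weights = [1 / r for r in range(1, 10_001)]

seen = set()
growth = []  # (tokens read, distinct types seen) checkpoints
for n, w in enumerate(random.choices(words, weights=weights, k=50_000), start=1):
    seen.add(w)
    if n % 10_000 == 0:
        growth.append((n, len(seen)))
```

The vocabulary count keeps rising at every checkpoint, but each successive block of 10,000 tokens contributes fewer new types — the diminishing returns that Heaps' law describes.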
Why This Matters For NB

Zipf says probability mass is heavily concentrated but the tail is huge. Heaps says the tail keeps growing as more text arrives. Together they explain why a classifier must reserve some probability for unseen words.

Significance Testing

Higher accuracy on one test set does not by itself prove that one system is truly better. Variation in the sample may create apparent improvements by chance. MLRD uses a paired, non-parametric sign test.

Null Hypothesis

The two systems are equally good; observed differences come from the same underlying distribution.

Sign Test Setup

  • Positive event: System 1 beats System 2 on an item.
  • Negative event: System 2 beats System 1 on an item.
  • Tie: split as 0.5 positive and 0.5 negative, then round up conservatively.
X ~ Binomial(N, 0.5)
P(X = k) = (N choose k) 0.5^N
P(X ≤ k) = Σi=0..k (N choose i) 0.5^N
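A two-tailed exact sign test is short enough to write out. The helper name `sign_test_p` is invented for this sketch; the tie convention (split ties evenly, rounding up so the result stays conservative) follows the setup above, though conventions vary between texts.

```python
from math import ceil, comb

def sign_test_p(plus, minus, ties=0):
    """Two-tailed exact sign test p-value under the null X ~ Binomial(n, 0.5).
    Ties are split half to each side, rounded up (conservative)."""
    half_ties = ceil(ties / 2)
    n = plus + minus + 2 * half_ties
    k = min(plus, minus) + half_ties
    # P(X <= k), doubled for a two-tailed test, capped at 1
    p = 2 * sum(comb(n, i) for i in range(k + 1)) * 0.5 ** n
    return min(p, 1.0)
```

For example, 8 wins against 2 losses gives p ≈ 0.109, which fails a test at α = 0.05: the apparent improvement could plausibly be chance.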

Interpretation

  • If p ≤ α, reject the null hypothesis.
  • A two-tailed test is more conservative than a one-tailed test.
  • Passing the test supports a claim of better.
  • Failing the test is inconclusive, not proof of equality.

Errors

Type I error
Claiming a difference that is not real. Probability α. Specificity is 1 - α.
Type II error
Missing a real difference. Probability β. Power is 1 - β.
Never say one system is “better” on the basis of raw numbers alone. Report effect size and significance separately.

Overtraining and Cross-validation

Overtraining or overfitting means improving performance on familiar data while making the model worse at generalising to truly unseen data.

How Overtraining Happens

  • Repeatedly tweak the model.
  • Measure on the same small test set each time.
  • Keep the variant that happens to score best there.

This indirectly trains on the test data because the system design becomes adapted to accidental quirks of that set.

N-Fold Cross-validation

  1. Split data into N folds.
  2. For each fold: train on N-1 folds, test on the held-out fold.
  3. Average the fold scores.
var = (1 / n) Σ (xi - μ)^2

Low variance across folds is reassuring; high variance means your result depends strongly on the split.
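The fold-and-average loop can be sketched generically. The round-robin split and the callback name `train_and_eval` are assumptions of this example; the variance formula matches the one above.

```python
def n_fold_scores(items, n, train_and_eval):
    """Round-robin split into n folds; train_and_eval(train, test) -> score.
    Returns the mean score and the variance across folds."""
    folds = [items[i::n] for i in range(n)]  # simple round-robin split
    scores = []
    for i in range(n):
        test = folds[i]
        train = [x for j, fold in enumerate(folds) if j != i for x in fold]
        scores.append(train_and_eval(train, test))
    mean = sum(scores) / n
    variance = sum((s - mean) ** 2 for s in scores) / n
    return mean, variance
```

Reporting the variance alongside the mean is the point: it tells you how split-dependent the result is.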

Important Variants

  • Stratified cross-validation: preserve class proportions in each fold.
  • Leave-one-out / jack-knifing: extreme case with one item per fold.
  • Dependency-sensitive splitting: isolate known dependencies, e.g. genre or source.

Validation vs Test

The clean workflow is:

  1. Use training data to fit parameters.
  2. Use a validation/development set to choose hyperparameters and stop feature fiddling.
  3. Use the test set once, at the end, for the final claim.

Confusion Matrix

Useful for debugging because aggregate accuracy hides where the errors come from.

            Pred POS          Pred NEG
True POS    Correct           False negative
True NEG    False positive    Correct

Uncertainty, Neutrality, and Human Agreement

Binary sentiment classification is artificially clean because the practical initially uses only extreme reviews. Real data includes mid-range, ambiguous, or mixed reviews, so a NEUTRAL class becomes necessary.

Why Ground Truth Gets Hard

  • Writers may express mixed opinions.
  • Readers interpret tone differently.
  • Star ratings are only a proxy for sentiment.

The lecture’s key epistemic claim is that for subjective labels, human agreement is the only empirically available source of truth.

Observed Agreement

With many raters, plain accuracy is inappropriate. Instead compute average pairwise agreement P̄a.

P̄a = (1 / N) Σi (observed agreeing rater pairs on item i / possible rater pairs on item i)

For k raters, the number of possible rater pairs is k(k - 1) / 2.

Chance Agreement

Some agreement would occur even if raters guessed according to class frequencies.

P̄e = Σc p(c)^2

Fleiss' Kappa

κ = (P̄a - P̄e) / (1 - P̄e)
  • Measures agreement above chance.
  • Higher is better.
  • Useful when many raters assign categorical labels.
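Putting P̄a, P̄e, and κ together, a minimal Fleiss' kappa computation might look like this. The input layout (one row per item, one column per category, entries counting raters) and the function name are assumptions of this sketch; every item is assumed to have the same number of raters.

```python
def fleiss_kappa(table):
    """table[i][c] = number of raters assigning item i to category c.
    Assumes every item is rated by the same number of raters k >= 2."""
    N = len(table)           # items
    k = sum(table[0])        # raters per item
    # observed agreement: agreeing rater pairs / possible pairs, per item
    pa = sum(sum(n * (n - 1) for n in row) / (k * (k - 1)) for row in table) / N
    # chance agreement from overall category proportions
    totals = [sum(row[c] for row in table) for c in range(len(table[0]))]
    pe = sum((t / (N * k)) ** 2 for t in totals)
    return (pa - pe) / (1 - pe)
```

With perfect agreement and balanced categories, P̄a = 1 and P̄e = 0.5, so κ = 1, as expected.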

Network Basics

Social networks are modelled here as unweighted, undirected graphs. Nodes are entities; edges are social or structural links.

Distance
Length of the shortest path between two nodes.
Diameter
Maximum shortest-path distance over all node pairs.
Degree
Number of neighbours of a node.
Connected component
Maximal set of nodes connected by paths.
Giant component
A component containing most of the graph’s nodes.

Small-world Property

  • Low average path length.
  • High clustering.
  • Sparse overall connectivity.

Weak Ties and Bridges

Weak ties often hold the giant component together. A bridge is an edge whose removal disconnects its component. A local bridge is an edge whose endpoints have no common neighbour; removing it increases the shortest-path distance between its endpoints to more than two.

Triadic Closure and Clustering

If A knows B and A knows C, then B and C are more likely to know each other. This is triadic closure.

  • Global clustering coefficient: fraction of connected triples that are closed (i.e. form triangles).
  • Node clustering coefficient: probability that two neighbours of a node are themselves neighbours.

Basic Algorithms

For this course, degree is trivial to compute, and because the graphs are unweighted, the diameter can be found by running BFS from every node.
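The BFS-from-every-node approach can be sketched directly. The adjacency representation (dict of node to neighbour set) is an assumption of this example, and the graph is assumed connected.

```python
from collections import deque

def bfs_distances(graph, source):
    """graph: dict node -> set of neighbours (unweighted, undirected).
    Returns shortest-path distances from source."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        u = queue.popleft()
        for v in graph[u]:
            if v not in dist:
                dist[v] = dist[u] + 1
                queue.append(v)
    return dist

def diameter(graph):
    # longest shortest path over all pairs: one BFS per node
    return max(max(bfs_distances(graph, s).values()) for s in graph)
```

Each BFS is O(V + E), so the whole diameter computation is O(V(V + E)) — fine at this course's graph sizes.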

Betweenness Centrality and Clique Intuition

Betweenness centrality identifies gatekeepers: nodes or edges that lie on many shortest paths and therefore connect otherwise denser regions.

Node Betweenness

CB(v) = Σs,t σ(s, t | v) / σ(s, t)
σ(s, t)
Number of shortest paths from s to t.
σ(s, t | v)
How many of those shortest paths pass through v.

A node with high betweenness may not have high degree: it matters where the node sits, not just how many edges it has.

Clique / Community Intuition

If an edge has high edge betweenness, it is plausibly acting as a connector between dense subgraphs. Repeatedly removing such edges tends to isolate communities. This is the idea behind clique/community splitting in the social-network part of the course.

Brandes Algorithm

Naively computing betweenness by enumerating all shortest paths is too slow. Brandes exploits dynamic programming structure.

Core Recurrences

σ(s, t) = Σu∈Pred(t) σ(s, u)
δ(s | v) = Σ(v,w)∈E, d(s,w)=d(s,v)+1 [σ(s,v) / σ(s,w)] (1 + δ(s | w))
CB(v) = Σs δ(s | v)

Algorithm Structure

  1. For each source node s, run BFS.
  2. During BFS, compute distances, predecessor lists, and shortest-path counts σ.
  3. Push visited nodes onto a stack in visitation order.
  4. Pop nodes in reverse order and back-propagate dependencies δ.
  5. Add dependency contributions into CB(v).
In undirected graphs, each pair is effectively counted twice, so Brandes' final scores are halved.
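The five steps above fit in one function. This is a sketch of node betweenness following Brandes' structure, assuming the same dict-of-neighbour-sets graph representation as before; it is not the course's reference implementation.

```python
from collections import deque

def brandes_betweenness(graph):
    """Node betweenness for an unweighted, undirected graph
    given as dict node -> iterable of neighbours."""
    cb = {v: 0.0 for v in graph}
    for s in graph:
        # BFS from s: distances, shortest-path counts sigma, predecessors
        dist = {s: 0}
        sigma = {v: 0 for v in graph}
        sigma[s] = 1
        pred = {v: [] for v in graph}
        order = []  # nodes in visitation order
        queue = deque([s])
        while queue:
            u = queue.popleft()
            order.append(u)
            for w in graph[u]:
                if w not in dist:
                    dist[w] = dist[u] + 1
                    queue.append(w)
                if dist[w] == dist[u] + 1:  # u precedes w on a shortest path
                    sigma[w] += sigma[u]
                    pred[w].append(u)
        # back-propagate dependencies delta in reverse visitation order
        delta = {v: 0.0 for v in graph}
        for w in reversed(order):
            for u in pred[w]:
                delta[u] += (sigma[u] / sigma[w]) * (1 + delta[w])
            if w != s:
                cb[w] += delta[w]
    # each unordered pair is counted from both endpoints, so halve
    return {v: c / 2 for v, c in cb.items()}
```

On the path a–b–c, node b lies on the single shortest path between a and c, so its betweenness is 1 while the endpoints score 0.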

Markov Models

A first-order Markov model assumes the next state depends only on the current state, not the full history.

P(Xt | X1, ..., Xt-1) ≈ P(Xt | Xt-1)

This yields the standard sequence factorisation:

P(X1, ..., XT) ≈ ∏t P(Xt | Xt-1)

Markov Chain

  • Finite discrete states.
  • States are fully observable.
  • Transitions have probabilities.

Markov chains are good when the states themselves are directly visible. They are not enough when the real states are hidden and we only observe side effects.

Hidden Markov Models

An HMM adds an observation layer to a Markov chain. Hidden states evolve according to transition probabilities, and each hidden state emits observable symbols according to an emission distribution.

Assumptions

  1. Markov / limited-horizon assumption: transitions depend only on the previous state.
  2. Output independence assumption: the observation at time t depends only on the current hidden state Xt.
P(Ot | X1...Xt, O1...Ot-1) ≈ P(Ot | Xt)

Formal Components

Se
Set of emitting hidden states {s1, ..., sN}.
s0, sf
Special start and end states.
K
Observation alphabet {k1, ..., kM}.
A
Transition matrix, where aij = P(Xt = sj | Xt-1 = si).
B
Emission matrix, where bi(kj) = P(Ot = kj | Xt = si).

Labelled Learning

Given paired state and observation sequences, estimate the parameters by relative frequency counts.

aij ≈ counttrans(si → sj) / counttrans(si)
bi(kj) ≈ countemission(kj from si) / countemission(si)

Add-one smoothing can also be applied here.
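Relative-frequency estimation with add-one smoothing can be sketched as below. The start/end markers "#" and "!" and the function name are assumptions of this example, chosen to match the s0/sf states above.

```python
from collections import Counter

def estimate_hmm(sequences, states, alphabet):
    """sequences: list of lists of (state, observation) pairs.
    Returns add-one smoothed transition and emission probability dicts.
    '#' and '!' stand in for the special start and end states."""
    trans = Counter()
    emit = Counter()
    for seq in sequences:
        prev = "#"
        for state, obs in seq:
            trans[(prev, state)] += 1
            emit[(state, obs)] += 1
            prev = state
        trans[(prev, "!")] += 1
    targets = list(states) + ["!"]  # possible next states
    a = {(i, j): (trans[(i, j)] + 1)
                 / (sum(trans[(i, k)] for k in targets) + len(targets))
         for i in ["#"] + list(states) for j in targets}
    b = {(i, o): (emit[(i, o)] + 1)
                 / (sum(emit[(i, k)] for k in alphabet) + len(alphabet))
         for i in states for o in alphabet}
    return a, b
```

As with Naive Bayes, the smoothing guarantees that no transition or emission probability is exactly zero, which matters later when Viterbi multiplies them along a path.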

Core HMM Problems

Problem               Meaning
Labelled learning     Learn A and B from paired hidden-state and observation sequences.
Unlabelled learning   Learn parameters from observations only.
Likelihood            Compute P(O | μ) for a model μ = (A, B).
Decoding              Find the most likely hidden state sequence for the observations.
Course Application

The toy example is the dishonest casino with fair and loaded dice. The broader course application is biological sequence analysis, where the hidden states correspond to structural categories and the observations are amino-acid symbols.

Viterbi Decoding

Decoding asks for the single most likely hidden state sequence given the observation sequence and HMM parameters.

X̂ = argmaxX P(X, O | μ)
   = argmaxX ∏t P(Ot | Xt, B) P(Xt | Xt-1, A)

Brute force tries all O(N^T) state sequences, which is infeasible.

Dynamic Programming Idea

For each time and state, keep only the probability of the best path ending there. This works because of the first-order Markov property.

Trellis Variables

δj(t) = max1≤i≤N [δi(t-1) aij bj(Ot)]

δj(t) is the probability of the best path that ends in state sj at time t.

ψj(t) = argmax1≤i≤N [δi(t-1) aij bj(Ot)]

ψj(t) stores the backpointer: which previous state gave that optimum.

Algorithm

  1. Initialise first column from start probabilities and first emissions.
  2. Recurrence fill the trellis left to right using δ and ψ.
  3. Terminate by choosing the best final state.
  4. Backtrace through ψ to recover the best hidden path.

Complexity

For a first-order HMM, Viterbi runs in O(N^2 T), a huge improvement over brute force.

You must keep the best path for every state at every time, not just the globally best state at each time. A locally second-best state can later become part of the globally best sequence.
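The initialise/recurse/terminate/backtrace steps can be sketched in log space, which also handles the underflow issue. The parameter layout (`start`, `trans`, `emit` as probability dicts) is an assumption of this example, and all referenced probabilities are assumed nonzero (e.g. after smoothing).

```python
import math

def viterbi(obs, states, start, trans, emit):
    """Most likely hidden state sequence for an observation list.
    start[s], trans[(s, t)], emit[(s, o)]: nonzero probabilities."""
    # initialise the first trellis column from start probs and first emissions
    delta = [{s: math.log(start[s]) + math.log(emit[(s, obs[0])]) for s in states}]
    psi = [{}]
    for t in range(1, len(obs)):
        delta.append({})
        psi.append({})
        for j in states:
            # best predecessor for state j at time t
            best_i = max(states, key=lambda i: delta[t - 1][i] + math.log(trans[(i, j)]))
            psi[t][j] = best_i  # backpointer
            delta[t][j] = (delta[t - 1][best_i]
                           + math.log(trans[(best_i, j)])
                           + math.log(emit[(j, obs[t])]))
    # terminate: pick the best final state, then backtrace through psi
    last = max(states, key=lambda s: delta[-1][s])
    path = [last]
    for t in range(len(obs) - 1, 0, -1):
        path.append(psi[t][path[-1]])
    return list(reversed(path))
```

On a dishonest-casino toy model where the loaded die heavily favours sixes, a run of sixes decodes to the loaded state, as expected.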

Metrics Cheat Sheet

Accuracy

Accuracy = correct / total

Good when classes are balanced and all errors matter roughly equally.

Precision and Recall

Useful when one class is the real focus, or class balance is skewed.

Precision = TP / (TP + FP)
Recall    = TP / (TP + FN)
F1        = 2PR / (P + R)

In the HMM dice task, the lectures emphasise measuring these for the interesting class L rather than relying only on accuracy.
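Computing these per-class metrics is a few lines. The helper name `prf1` and the sequence-of-labels input are assumptions of this sketch.

```python
def prf1(gold, pred, target):
    """Precision, recall, F1 for one class of interest (e.g. 'L')."""
    tp = sum(g == target and p == target for g, p in zip(gold, pred))
    fp = sum(g != target and p == target for g, p in zip(gold, pred))
    fn = sum(g == target and p != target for g, p in zip(gold, pred))
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1
```

Note that a predictor that never outputs the rare class can still have high accuracy, which is exactly why the per-class view matters.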

When To Prefer What

Metric      Best Use
Accuracy    Balanced tasks with comparable error costs.
Precision   False positives are costly.
Recall      False negatives are costly.
F1          Need a single score balancing precision and recall.
Kappa       Agreement among multiple human annotators.

What To Say In Answers

For classification questions

  • Define features, classes, and the independence assumption explicitly.
  • State why smoothing is needed using Zipf and Heaps, not just “because unseen words exist”.
  • Separate training, validation, and test roles clearly.
  • Never equate higher score with proven superiority unless significance is addressed.

For network questions

  • Distinguish degree from centrality.
  • Explain that betweenness counts shortest-path dependence, not generic “importance”.
  • For diameter on unweighted graphs, mention BFS-based shortest paths.
  • For community splitting, connect edge betweenness to sparse inter-cluster links.

For HMM questions

  • State the two assumptions: first-order transitions and output independence.
  • Distinguish hidden states from observations.
  • Say exactly what A and B represent.
  • For Viterbi, mention trellis, recurrence, backpointers, and O(N^2 T) complexity.
The course rewards methodological precision. A short answer that names the right assumptions, data split discipline, and evaluation logic is usually stronger than a longer vague one.