Machine Learning and Real-world Data
Course Map
MLRD is less about fancy models than about simple models used correctly. The course teaches three machine learning settings and repeatedly stresses that evaluation methodology matters as much as the algorithm.
- Statistical classification: movie-review sentiment with Naive Bayes.
- Sequence analysis: Hidden Markov Models and decoding with Viterbi.
- Social networks: degree, diameter, centrality, clique/community structure.
Evaluation themes recur across all three:
- Separate training, validation, and test data.
- Use significance tests, not raw score differences alone.
- Guard against overtraining/overfitting.
- Choose metrics that match the task and class balance.
Standard ML Workflow
- Define features and classes.
- Estimate parameters from labelled training data.
- Apply the model to unseen test items.
- Measure performance with appropriate metrics.
- Check whether observed improvements are statistically significant.
- Prevent overtraining by disciplined use of development and test data.
Classification Basics
In sentiment classification, the observed data are review words and the hidden label is sentiment. Features are automatically observable properties of the input; in this practical, the features are words. Classes are the meaningful labels, typically POS and NEG, later extended to include NEUTRAL.
- Feature
- Observable attribute of an item, e.g. word occurrence.
- Class
- Target label to predict, e.g. positive or negative sentiment.
- Classifier
- Function mapping feature observations to class decisions.
- Probabilistic classifier
- Returns a distribution over classes, then usually predicts the argmax.
P(c | O) for each class c
ĉ = argmax P(c | O)
Why learn from data rather than hand-written lexicons? Because lexicons depend on human intuition, are static, and require manual labour. A trained classifier can adapt to the actual data distribution.
Naive Bayes
Naive Bayes is a simple probabilistic classifier using Bayes' theorem plus a strong conditional independence assumption between features given the class.
Bayes Decision Rule
P(c | O) = P(c) P(O | c) / P(O)
ĉNB = argmax P(c) P(O | c)
P(O) is constant for a fixed document, so it can be dropped under argmax.
Naive Independence Assumption
O = {w1, ..., wn}
P(O | c) = P(w1, ..., wn | c) ≈ ∏ P(wi | c)
ĉNB = argmax P(c) ∏ P(wi | c)
Parameter Estimation
Estimate probabilities by maximum likelihood from labelled training data.
P̂(wi | c) = count(wi, c) / Σw∈V count(w, c)
P̂(c) = Nc / Nrev
Log-space Computation
Products of many small probabilities underflow numerically, so classify in log-space.
ĉNB = argmax [log P(c) + Σ log P(wi | c)]
Smoothing
Without smoothing, an unseen word in class c gives probability 0 and can wipe out the entire product. Add-one/Laplace smoothing redistributes some probability mass to unseen events.
P̂(wi | c) = (count(wi, c) + 1) / (Σw∈V count(w, c) + |V|)
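The estimation, smoothing, and log-space steps fit together as below; a minimal sketch, assuming a `(label, word_list)` training format — the names `train` and `classify` are illustrative, not the practical's actual API.

```python
import math
from collections import Counter

def train(docs):
    """docs: list of (class_label, word_list) pairs.
    Returns log priors and add-one-smoothed log likelihoods."""
    class_counts = Counter()
    word_counts = {}                      # count(w, c) per class
    vocab = set()
    for c, words in docs:
        class_counts[c] += 1
        wc = word_counts.setdefault(c, Counter())
        for w in words:
            wc[w] += 1
            vocab.add(w)
    n_docs = sum(class_counts.values())
    log_prior = {c: math.log(n / n_docs) for c, n in class_counts.items()}
    log_lik = {}
    for c, wc in word_counts.items():
        denom = sum(wc.values()) + len(vocab)   # add-one smoothing: + |V|
        log_lik[c] = {w: math.log((wc[w] + 1) / denom) for w in vocab}
    return log_prior, log_lik, vocab

def classify(words, log_prior, log_lik, vocab):
    """argmax over c of [log P(c) + sum of log P(wi | c)]."""
    return max(log_prior,
               key=lambda c: log_prior[c]
               + sum(log_lik[c][w] for w in words if w in vocab))
```

One design choice here: words never seen in training (absent from V entirely) are skipped rather than smoothed, since smoothing only redistributes mass within the known vocabulary.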
Zipf's Law and Heaps' Law
These laws explain why sparse lexical data is normal and why smoothing matters.
Zipf's Law
A word's frequency is approximately inversely proportional to its rank. A few words are extremely common; very many words are rare.
fw ≈ k / rw^α
- Produces a long-tail distribution.
- Means unseen words are expected, not exceptional.
- Explains why unsmoothed models fail badly on new text.
Heaps' Law
As corpus size grows, vocabulary size also grows, though with diminishing returns.
V(n) ≈ K n^β
- New text keeps introducing new types.
- You do not “finish” observing the vocabulary.
- So train/test mismatch in word types is structurally unavoidable.
Zipf says probability mass is heavily concentrated but the tail is huge. Heaps says the tail keeps growing as more text arrives. Together they explain why a classifier must reserve some probability for unseen words.
Significance Testing
Higher accuracy on one test set does not by itself prove that one system is truly better. Variation in the sample may create apparent improvements by chance. MLRD uses a paired, non-parametric sign test.
Null Hypothesis
The two systems are equally good; observed differences come from the same underlying distribution.
Sign Test Setup
- Positive event: System 1 beats System 2 on an item.
- Negative event: System 2 beats System 1 on an item.
- Tie: split as 0.5 positive and 0.5 negative, then round up conservatively.
X ~ Binomial(N, 0.5)
P(X = k) = (N choose k) 0.5^N
P(X ≤ k) = Σi=0..k (N choose i) 0.5^N
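The cumulative binomial above turns directly into a p-value; a sketch, assuming ties are split half to each side with conservative (`ceil`) rounding, which is one reading of the tie rule above.

```python
from math import ceil, comb

def sign_test_p(plus, minus, ties=0):
    """Two-tailed sign test p-value under the null X ~ Binomial(N, 0.5).
    Each tie contributes half an event to both sides, rounded up."""
    plus += ceil(ties / 2)
    minus += ceil(ties / 2)
    n = plus + minus
    k = min(plus, minus)
    # P(X <= k) for the smaller count, doubled for the two-tailed test
    p = 2 * sum(comb(n, i) * 0.5 ** n for i in range(k + 1))
    return min(p, 1.0)
```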
Interpretation
- If p ≤ α, reject the null hypothesis.
- A two-tailed test is more conservative than a one-tailed test.
- Passing the test supports a claim that one system is better.
- Failing the test is inconclusive, not proof of equality.
Errors
- Type I error
- Claiming a difference that is not real. Probability α. Specificity is 1 - α.
- Type II error
- Missing a real difference. Probability β. Power is 1 - β.
Overtraining and Cross-validation
Overtraining or overfitting means improving performance on familiar data while making the model worse at generalising to truly unseen data.
How Overtraining Happens
- Repeatedly tweak the model.
- Measure on the same small test set each time.
- Keep the variant that happens to score best there.
This indirectly trains on the test data because the system design becomes adapted to accidental quirks of that set.
N-Fold Cross-validation
- Split the data into N folds.
- For each fold: train on the other N - 1 folds, test on the held-out fold.
- Average the fold scores.
var = (1 / n) Σ (xi - μ)^2
Low variance across folds is reassuring; high variance means your result depends strongly on the split.
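The fold loop and the variance check can be sketched together; `train_fn` and `eval_fn` are placeholder callables, and the round-robin split is one simple choice (a stratified split would additionally preserve class proportions).

```python
def cross_validate(items, n_folds, train_fn, eval_fn):
    """Round-robin split into n_folds; train on N-1 folds, test on the held-out one."""
    folds = [items[i::n_folds] for i in range(n_folds)]
    scores = []
    for i in range(n_folds):
        held_out = folds[i]
        training = [x for j, fold in enumerate(folds) if j != i for x in fold]
        scores.append(eval_fn(train_fn(training), held_out))
    mean = sum(scores) / n_folds
    var = sum((s - mean) ** 2 for s in scores) / n_folds   # (1/n) sum (xi - mu)^2
    return scores, mean, var
```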
Important Variants
- Stratified cross-validation: preserve class proportions in each fold.
- Leave-one-out / jack-knifing: extreme case with one item per fold.
- Dependency-sensitive splitting: isolate known dependencies, e.g. genre or source.
Validation vs Test
The clean workflow is:
- Use training data to fit parameters.
- Use a validation/development set to choose hyperparameters and stop feature fiddling.
- Use the test set once, at the end, for the final claim.
Confusion Matrix
Useful for debugging because aggregate accuracy hides where the errors come from.
| | Pred POS | Pred NEG |
|---|---|---|
| True POS | Correct | False negative |
| True NEG | False positive | Correct |
Uncertainty, Neutrality, and Human Agreement
Binary sentiment classification is artificially clean because the practical initially uses only extreme reviews. Real data includes mid-range, ambiguous, or mixed reviews, so a NEUTRAL class becomes necessary.
Why Ground Truth Gets Hard
- Writers may express mixed opinions.
- Readers interpret tone differently.
- Star ratings are only a proxy for sentiment.
The lecture’s key epistemic claim is that for subjective labels, human agreement is the only empirically available source of truth.
Observed Agreement
With many raters, plain accuracy is inappropriate. Instead compute average pairwise agreement P̄a.
P̄a = (1 / N) Σi (observed agreeing rater pairs on item i / possible rater pairs on item i)
For k raters, the number of possible rater pairs is k(k - 1) / 2.
Chance Agreement
Some agreement would occur even if raters guessed according to class frequencies.
P̄e = Σc p(c)^2
Fleiss' Kappa
κ = (P̄a - P̄e) / (1 - P̄e)
- Measures agreement above chance.
- Higher is better.
- Useful when many raters assign categorical labels.
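The agreement quantities combine as follows; a sketch assuming the votes are already tabulated in a hypothetical `table[i][c]` layout (item by category).

```python
def fleiss_kappa(table):
    """table[i][c]: number of the k raters who put item i in category c.
    Every item must have the same rater total k."""
    n = len(table)
    k = sum(table[0])
    # observed agreement: agreeing rater pairs over possible pairs k(k-1)/2, averaged
    p_a = sum(sum(c * (c - 1) for c in row) / (k * (k - 1)) for row in table) / n
    # chance agreement from overall category proportions
    totals = [sum(row[j] for row in table) for j in range(len(table[0]))]
    p_e = sum((t / (n * k)) ** 2 for t in totals)
    return (p_a - p_e) / (1 - p_e)
```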
Network Basics
Social networks are modelled here as unweighted, undirected graphs. Nodes are entities; edges are social or structural links.
- Distance
- Length of the shortest path between two nodes.
- Diameter
- Maximum shortest-path distance over all node pairs.
- Degree
- Number of neighbours of a node.
- Connected component
- Maximal set of nodes connected by paths.
- Giant component
- A component containing most of the graph’s nodes.
Small-world Property
- Low average path length.
- High clustering.
- Sparse overall connectivity.
Weak Ties and Bridges
Weak ties often hold the giant component together. A bridge is an edge whose removal disconnects the graph. A local bridge is an edge whose endpoints have no common neighbour; removing it increases the shortest-path distance between them.
Triadic Closure and Clustering
If A knows B and A knows C, then B and C are more likely to know each other. This is triadic closure.
- Global clustering coefficient: closed triads / all triads.
- Node clustering coefficient: probability that two neighbours of a node are themselves neighbours.
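The node clustering coefficient falls straight out of the adjacency structure; a sketch assuming the graph is a dict mapping each node to a set of neighbours.

```python
from itertools import combinations

def node_clustering(graph, v):
    """Probability that two neighbours of v are themselves neighbours."""
    pairs = list(combinations(graph[v], 2))
    if not pairs:
        return 0.0          # degree < 2: coefficient is undefined; return 0 here
    closed = sum(1 for a, b in pairs if b in graph[a])
    return closed / len(pairs)
```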
Basic Algorithms
For this course, degree is trivial to compute, and because the graphs are unweighted, the diameter can be found by running BFS from every node.
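BFS from every node gives all shortest-path distances on an unweighted graph; a sketch assuming a connected graph stored as a neighbour dict.

```python
from collections import deque

def bfs_distances(graph, source):
    """Shortest-path distances from source on an unweighted graph."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        u = queue.popleft()
        for v in graph[u]:
            if v not in dist:
                dist[v] = dist[u] + 1
                queue.append(v)
    return dist

def diameter(graph):
    """Maximum shortest-path distance over all pairs: one BFS per node."""
    return max(max(bfs_distances(graph, s).values()) for s in graph)
```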
Betweenness Centrality and Clique Intuition
Betweenness centrality identifies gatekeepers: nodes or edges that lie on many shortest paths and therefore connect otherwise denser regions.
Node Betweenness
CB(v) = Σs,t σ(s, t | v) / σ(s, t)
- σ(s, t)
- Number of shortest paths from s to t.
- σ(s, t | v)
- How many of those shortest paths pass through v.
A node with high betweenness may not have high degree: it matters where the node sits, not just how many edges it has.
Clique / Community Intuition
If an edge has high edge betweenness, it is plausibly acting as a connector between dense subgraphs. Repeatedly removing such edges tends to isolate communities. This is the idea behind clique/community splitting in the social-network part of the course.
Brandes Algorithm
Naively computing betweenness by enumerating all shortest paths is too slow. Brandes exploits dynamic programming structure.
Core Recurrences
σ(s, t) = Σu∈Pred(t) σ(s, u)
δ(s | v) = Σ(v,w)∈E, d(s,w)=d(s,v)+1 [σ(s,v) / σ(s,w)] (1 + δ(s | w))
CB(v) = Σs δ(s | v)
Algorithm Structure
- For each source node s, run BFS.
- During BFS, compute distances, predecessor lists, and shortest-path counts σ.
- Push visited nodes onto a stack in visitation order.
- Pop nodes in reverse order and back-propagate dependencies δ.
- Add the dependency contributions into CB(v).
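The steps above can be sketched as one function. Note this counts each ordered (s, t) pair, so on an undirected graph each pair contributes twice; halving at the end is a convention the course may or may not adopt.

```python
from collections import deque

def brandes(graph):
    """Node betweenness on an unweighted graph (ordered source-target pairs counted)."""
    cb = {v: 0.0 for v in graph}
    for s in graph:
        # BFS phase: distances, predecessor lists, shortest-path counts sigma
        dist = {v: -1 for v in graph}
        sigma = {v: 0 for v in graph}
        pred = {v: [] for v in graph}
        dist[s], sigma[s] = 0, 1
        stack, queue = [], deque([s])
        while queue:
            u = queue.popleft()
            stack.append(u)                       # records visitation order
            for w in graph[u]:
                if dist[w] < 0:
                    dist[w] = dist[u] + 1
                    queue.append(w)
                if dist[w] == dist[u] + 1:        # u lies on a shortest path to w
                    sigma[w] += sigma[u]
                    pred[w].append(u)
        # accumulation phase: pop in reverse order, back-propagate dependencies delta
        delta = {v: 0.0 for v in graph}
        while stack:
            w = stack.pop()
            for u in pred[w]:
                delta[u] += (sigma[u] / sigma[w]) * (1 + delta[w])
            if w != s:
                cb[w] += delta[w]
    return cb
```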
Markov Models
A first-order Markov model assumes the next state depends only on the current state, not the full history.
P(Xt | X1, ..., Xt-1) ≈ P(Xt | Xt-1)
This yields the standard sequence factorisation:
P(X1, ..., XT) ≈ ∏t P(Xt | Xt-1)
Markov Chain
- Finite discrete states.
- States are fully observable.
- Transitions have probabilities.
Markov chains are good when the states themselves are directly visible. They are not enough when the real states are hidden and we only observe side effects.
Hidden Markov Models
An HMM adds an observation layer to a Markov chain. Hidden states evolve according to transition probabilities, and each hidden state emits observable symbols according to an emission distribution.
Assumptions
- Markov / limited-horizon assumption: transitions depend only on the previous state.
- Output independence assumption: the observation at time t depends only on the current hidden state Xt.
P(Ot | X1...XT, O1...OT) ≈ P(Ot | Xt)
Formal Components
- Se
- Set of emitting hidden states {s1, ..., sN}.
- s0, sf
- Special start and end states.
- K
- Observation alphabet {k1, ..., kM}.
- A
- Transition matrix, where aij = P(Xt = sj | Xt-1 = si).
- B
- Emission matrix, where bi(kj) = P(Ot = kj | Xt = si).
Labelled Learning
Given paired state and observation sequences, estimate the parameters by relative frequency counts.
aij ≈ counttrans(si → sj) / counttrans(si)
bi(kj) ≈ countemission(kj from si) / countemission(si)
Add-one smoothing can also be applied here.
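The relative-frequency estimates can be read off labelled sequences by counting; an unsmoothed sketch, with `<s>`/`</s>` standing in for the start and end states s0 and sf.

```python
from collections import Counter

def estimate_hmm(sequences):
    """sequences: list of [(state, observation), ...]. Returns A and B keyed by pairs."""
    trans, emit = Counter(), Counter()
    from_state, emit_state = Counter(), Counter()
    for seq in sequences:
        states = ["<s>"] + [s for s, _ in seq] + ["</s>"]
        for a, b in zip(states, states[1:]):      # transition counts, incl. start/end
            trans[(a, b)] += 1
            from_state[a] += 1
        for s, o in seq:                          # emission counts
            emit[(s, o)] += 1
            emit_state[s] += 1
    A = {(i, j): c / from_state[i] for (i, j), c in trans.items()}
    B = {(i, k): c / emit_state[i] for (i, k), c in emit.items()}
    return A, B
```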
Core HMM Problems
| Problem | Meaning |
|---|---|
| Labelled learning | Learn A and B from paired hidden-state and observation sequences. |
| Unlabelled learning | Learn parameters from observations only. |
| Likelihood | Compute P(O | μ) for a model μ = (A, B). |
| Decoding | Find the most likely hidden state sequence for the observations. |
The toy example is the dishonest casino with fair and loaded dice. The broader course application is biological sequence analysis, where the hidden states correspond to structural categories and the observations are amino-acid symbols.
Viterbi Decoding
Decoding asks for the single most likely hidden state sequence given the observation sequence and HMM parameters.
X̂ = argmaxX P(X, O | μ)
= argmaxX ∏t P(Ot | Xt, B) P(Xt | Xt-1, A)
Brute force enumerates all N^T state sequences, which is infeasible.
Dynamic Programming Idea
For each time and state, keep only the probability of the best path ending there. This works because of the first-order Markov property.
Trellis Variables
δj(t) = max1≤i≤N [δi(t-1) aij bj(Ot)]
δj(t) is the probability of the best path that ends in state sj at time t.
ψj(t) = argmax1≤i≤N [δi(t-1) aij bj(Ot)]
ψj(t) stores the backpointer: which previous state gave that optimum.
Algorithm
- Initialise the first column from start probabilities and first emissions.
- Fill the trellis left to right using the δ and ψ recurrences.
- Terminate by choosing the best final state.
- Backtrace through ψ to recover the best hidden path.
Complexity
For a first-order HMM, Viterbi runs in O(N^2 T), a huge improvement over brute force.
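The recurrence, termination, and backtrace can be sketched in log space (matching the underflow advice from the Naive Bayes section); the probability-table layout is an assumption for illustration, not the practical's data structures.

```python
import math

def viterbi(obs, states, start_p, trans_p, emit_p):
    """Most likely hidden state sequence, computed in log space.
    start_p[s], trans_p[s][s2], emit_p[s][o] are plain probabilities."""
    def lg(p):
        return math.log(p) if p > 0 else float("-inf")
    # initialise the first trellis column from start probabilities and first emissions
    delta = [{s: lg(start_p[s]) + lg(emit_p[s][obs[0]]) for s in states}]
    psi = [{}]
    for t in range(1, len(obs)):
        delta.append({})
        psi.append({})
        for j in states:
            # best predecessor state for j at time t (the psi backpointer)
            best = max(states, key=lambda i: delta[t - 1][i] + lg(trans_p[i][j]))
            psi[t][j] = best
            delta[t][j] = delta[t - 1][best] + lg(trans_p[best][j]) + lg(emit_p[j][obs[t]])
    # terminate at the best final state, then backtrace through psi
    last = max(states, key=lambda s: delta[-1][s])
    path = [last]
    for t in range(len(obs) - 1, 0, -1):
        path.append(psi[t][path[-1]])
    return list(reversed(path))
```

Each time step does N work for each of N states, giving the O(N^2 T) cost stated above.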
Metrics Cheat Sheet
Accuracy
Accuracy = correct / total
Good when classes are balanced and all errors matter roughly equally.
Precision and Recall
Useful when one class is the real focus, or class balance is skewed.
Precision = TP / (TP + FP)
Recall = TP / (TP + FN)
F1 = 2PR / (P + R)
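These three metrics can be computed from per-item gold and predicted labels for a single class of interest; a minimal sketch with a hypothetical `prf1` helper.

```python
def prf1(gold, pred, target):
    """Precision, recall, F1 for the target class (e.g. the loaded state L)."""
    tp = sum(1 for g, p in zip(gold, pred) if g == target and p == target)
    fp = sum(1 for g, p in zip(gold, pred) if g != target and p == target)
    fn = sum(1 for g, p in zip(gold, pred) if g == target and p != target)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1
```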
In the HMM dice task, the lectures emphasise measuring these for the interesting class L (the loaded die) rather than relying only on accuracy.
When To Prefer What
| Metric | Best Use |
|---|---|
| Accuracy | Balanced tasks with comparable error costs. |
| Precision | False positives are costly. |
| Recall | False negatives are costly. |
| F1 | Need a single score balancing precision and recall. |
| Kappa | Agreement among multiple human annotators. |
What To Say In Answers
For classification questions
- Define features, classes, and the independence assumption explicitly.
- State why smoothing is needed using Zipf and Heaps, not just “because unseen words exist”.
- Separate training, validation, and test roles clearly.
- Never equate higher score with proven superiority unless significance is addressed.
For network questions
- Distinguish degree from centrality.
- Explain that betweenness counts shortest-path dependence, not generic “importance”.
- For diameter on unweighted graphs, mention BFS-based shortest paths.
- For community splitting, connect edge betweenness to sparse inter-cluster links.
For HMM questions
- State the two assumptions: first-order transitions and output independence.
- Distinguish hidden states from observations.
- Say exactly what A and B represent.
- For Viterbi, mention the trellis, the recurrence, backpointers, and O(N^2 T) complexity.