Operating Systems
System Organisation
A computer system has four layers: hardware (CPU, memory, I/O devices, buses) → operating system (controls and coordinates use of hardware) → application programs (use resources to solve users' problems) → users. The OS stays out of the way unless it's actively needed.
Hardware Resources
- CPU
- Executes programs using memory, devices, and bus. Operates on data from input devices or memory.
- Memory
- A large byte-addressed array for programs and data.
- Devices
- For input/output.
- Bus
- Transfers information between the above; CPUs and devices contend for memory cycles and bus access.
At the bottom the CPU just reads values from main memory into registers, performs operations, and writes results back.
Fetch-Execute Cycle
The CPU repeatedly fetches and decodes the next instruction, generating control signals and operand information. Inside the Execution Unit (EU), control signals select a Functional Unit and operation:
- ALU (Arithmetic Logic Unit): reads one/two registers, performs op, writes result back.
- BU (Branch Unit): tests condition, maybe adds a value to PC.
- MAU (Memory Access Unit): generates an address via the addressing mode and uses the bus to read/write.
Buses and Bus Hierarchy
Buses are shared communication wires — low cost and versatile, but a potential bottleneck. A bus typically has:
- address lines — determine how many devices/locations can be addressed;
- data lines — how many bits transfer at once;
- control lines — identify the target device and the selected operation.
Operation is initiator-responder: initiator puts address on bus and asserts read; responder reads address, retrieves data, puts data on bus; initiator reads it.
Different buses have different characteristics (width, length, device limit). Modern systems are a hierarchy: fastest processor bus (CPU↔cache), memory bus to main memory, PCI buses to devices, plus legacy (ISA, EISA). Bridges forward transactions between buses — e.g., to reach an ISA device, the CPU generates a physical address routed through memory bridge → PCI bridge → ISA bridge → device.
Booting & Interrupts
Booting
When power is applied, a bootstrap program (bootloader) runs:
- Traditionally in ROM as BIOS; now the more complex UEFI.
- Initialises all system parts (memory, device controllers).
- Finds, loads, and executes the kernel, possibly in stages.
The kernel then enables processes to be created, devices accessed, file system mounted. Finally system processes start — on Unix, init is the first.
Interrupts
I/O devices and CPU execute concurrently. Each device controller is responsible for one device type and has a local buffer; the CPU moves data between main memory and that buffer. The OS is interrupt-driven: the controller raises an interrupt to tell the CPU an operation is complete.
Interrupts decouple CPU requests from device responses — a disk read might take 2 ms (≈ 5 × 10⁶ clock cycles), which the CPU must not simply waste waiting.
Interrupt Handling
- Hardware transfers control to the Interrupt Service Routine (ISR) via the interrupt vector: a table of ISR addresses indexed by interrupt number.
- Hardware saves the address of the interrupted instruction; execution resumes via a special instruction like rti.
Interrupts are deferred to instruction boundaries (cleaner semantics). The ISR must not trash registers and must know where to resume, so the CPU normally saves most/all registers on entry and restores on return.
Storage Hierarchy
Basic units: a bit holds 0 or 1; a byte is 8 bits — the smallest unit most computers can address; a word is the architecture's native data unit (e.g. 64-bit machine → 8-byte word).
Course convention: 1 kB = 1024 bytes, MB = 1024², etc. (Strict IEC would be KiB/MiB/GiB for powers of 1024 and kB/MB for powers of 1000; usage is inconsistent across memory vs disk.)
Storage is organised as a hierarchy trading off speed, cost, and volatility:
- Registers / caches
- Tiny, very fast, volatile.
- Main memory
- Large, random access, volatile — directly accessible by CPU.
- Secondary storage
- Non-volatile. HDs (rigid metal/glass platters, magnetic coating, tracks subdivided into sectors) or SSDs (faster than HDs, no moving parts).
Each device has a device driver providing a uniform interface between its controller and the kernel.
Key Concepts
Layering & Multiplexing
- Layering manages complexity. Arrange components in a stack; a layer X may only depend on layer X−1 and provide service to X+1. Classical example: OSI / Internet stacks.
- Multiplexing: one resource serves multiple consumers simultaneously — combining signals over a shared medium.
Performance Metrics
- Latency
- How long something takes (e.g. read took 3 ms).
- Bandwidth / throughput
- Rate at which something occurs (e.g. 2 Gb/s).
- Jitter
- Statistical variation in latency (e.g. scheduling periodic with 50 μs jitter).
Always ask: is the absolute or relative value what matters? Is the full distribution interesting?
Caching vs Buffering
Used when components operate at different speeds — an impedance mismatch.
- Caching
- A small amount of higher-performance storage masks the impact of slower storage. Relies on temporal locality (finite space) and spatial locality (non-zero access cost). Examples: CPU registers, L1/L2/L3 caches, main memory.
- Buffering
- Memory inserted between two components to absorb small, variable bandwidth imbalances. Example: a disk's on-board memory between controller and OS. Useless if the long-term bandwidth mismatch exceeds the buffer size.
Bottlenecks & the 80/20 Rule
The bottleneck is the most-constrained resource. Tuning focuses on eliminating bottlenecks — often creating new ones. A perfectly balanced system has all resources simultaneously bottlenecked (impossible in practice). Measurement is a prerequisite.
80/20 rule: 80% of time is spent in 20% of the code. Optimising the rare case is pointless.
What is an Operating System?
An OS is just a program that (efficiently) provides:
- Control over execution of all other programs,
- Multiplexing of resources between them,
- Abstraction over low-level complexity,
- Extensibility for evolution.
The word "operating system" is fuzzy — nobody really agrees what's in it beyond the kernel. This course focuses on the kernel.
Resource Management Responsibilities
- CPU
- Multiplexes running threads; manages lifecycle, synchronisation, communication.
- Memory
- Tracks ownership of code/data memory; allocates and frees.
- Storage
- Abstracts different media; manages files, directories, free space.
- I/O Subsystem
- Abstracts device peculiarities; manages drivers, buffering, caching, spooling.
Protecting the CPU
The OS must retain control — no application may "hog" the CPU. Implementation uses a countdown timer: load to initial value (e.g. 0xFFFF); every tick decrements; on zero, raise an interrupt. Guarantees the OS runs periodically provided:
- only the OS can load the timer, and
- the timer interrupt cannot be masked.
Also enables time-sharing.
Protecting Memory
Define a base and limit for each program. Hardware checks every memory reference:
- out-of-range → exception, vectored into the OS;
- only OS can update base/limit;
- memory protection can be disabled in kernel mode (bad idea).
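The check itself is simple enough to sketch in a few lines (illustrative Python with invented names; on real hardware this test happens on every memory reference, and failure raises an exception into the OS):

```python
# Base/limit protection check: an address is legal iff it lies inside
# the process's range [base, base + limit). Purely illustrative.
def check_access(addr, base, limit):
    """Return True if addr is legal; a real CPU would trap on False."""
    return base <= addr < base + limit
```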
Protecting I/O
Early attempts made I/O instructions privileged — applications couldn't mask interrupts or control devices. But some devices are memory-mapped, so applications could rewrite the interrupt vector. Therefore I/O protection relies on memory protection.
OS Evolution
Understanding how we got here is the best route into why current OSes look the way they do.
- Open shop (1947–1955)
- One machine, one CPU, one user, one program; programmer is operator; machine code only. EDSAC.
- Batch systems
- Tape drives collate and run a batch of programs, increasing efficiency. Spooling (Simultaneous Peripheral Operation On Line) allows I/O to overlap with computation.
- Multiprogramming
- One machine, one CPU, one running program but many loaded programs. Job scheduling picks which jobs to load; another scheduler picks which resident job to run.
- Timesharing
- Switching jobs so frequently that users have the illusion of simultaneous execution. Requires CPU scheduling — selecting which ready job to run. Enables interactive computing.
Example of Single-Tasking: MS-DOS
Command interpreter reads input; program is loaded over much of the command interpreter; IP jumps to program start. When program terminates, a stub reloads the command interpreter. An exit error code is available.
Dual-Mode Operation
To prevent malicious or buggy code from doing bad things, hardware provides at least two modes of operation via a mode bit:
- User mode — executing on behalf of the user (applications).
- Kernel mode — executing on behalf of the OS.
Some instructions are designated privileged and may only be executed in kernel mode. Modern CPUs support more than two modes — e.g. an additional VMM (Virtual Machine Manager) mode for guest VMs, or x86 rings 0–3 where inner rings have strictly more privilege. Disjoint/overlapping permissions are complex, so most CPUs use nested (strictly ordered) rings.
Kernels & System Calls
Protection prevents applications from doing I/O directly — the kernel must do it for them. We therefore need an unprivileged instruction to transition from user to kernel mode: a trap (software interrupt), which behaves similarly to a hardware interrupt.
OS services are exposed to user code as system calls.
System Call Parameters
Three main ways to pass parameters:
1. Load into registers.
2. Place on the stack; kernel pops them off.
3. Place in a block of memory; put the block's address in a register.
Methods 2 and 3 are usually preferred (registers are limited in number and size).
Microkernels
OS interfaces must be extremely stable — making them hard to extend, and even harder to remove from. The microkernel alternative moves OS services into (possibly privileged) servers, improving modularity and extensibility. Services are invoked via message passing, which replaces trapping — so must be extremely efficient.
Most common OSes blur the line: Linux has kernel modules and some user-space servers; Windows NT 3.5 began as a microkernel but NT 4.0 moved services back into the kernel for performance.
Virtualisation & Containers
The trend is to encapsulate applications — make the system look like it's supporting a single application. Important for microservice architectures and cross-system portability.
Virtualisation via Hypervisor
- Type 1
- Runs directly on host hardware, possibly using hardware extensions (Intel VT-x). Examples: Xen, VMware ESX, Hyper-V.
- Type 2
- Runs above a full host OS kernel.
Both can support cross-architecture execution via emulation (slow) or interpretation (for non-native binaries).
Virtual Machines vs Containers
| | Virtual Machines | Containers |
|---|---|---|
| Encapsulation | Whole system including OS | Single process tree on shared kernel |
| Boot | Boots guest OS over hypervisor | Exposes kernel features to each container |
| Examples | Xen, VMware ESX, Hyper-V | Linux Containers, FreeBSD Jails, Solaris Zones |
Use cases: multiple OSes on desktops/laptops, cross-OS app development, QA testing, datacenter compute.
Security — Principle of Least Privilege, ACLs, Capabilities
Security is defence against internal and external attacks (DoS, worms, viruses, identity theft, theft of service).
User IDs, Group IDs, Privilege Escalation
Systems first distinguish users. Each has a user ID (name + number) associated with all their files and processes. A group ID lets sets of users share access controls. Privilege escalation allows a user to switch to an effective ID with more rights.
Principle of Least Privilege
Objects (hardware devices, files, programs, semaphores) should be given just enough privilege to perform their tasks. Permissions can be:
- Static — set for the life of the system/process.
- Dynamic — changed by the process as needed via domain switching or privilege escalation.
Compartmentalisation is a derivative concept: per-component protection using specific permissions.
Covert Channels
Leak information through side-effects:
- Hardware — wiretapping, electromagnetic emissions from devices.
- Software — page fault statistics, input-dependent timing. (The lowest layer of OCaml's TLS library had to be written in C to prevent the garbage collector from becoming a covert channel.)
Domains and the Access Matrix
An access-right is ⟨object-name, rights-set⟩ where rights-set ⊆ valid operations on that object. A domain is a set of access-rights. In UNIX, a domain is a user ID.
The access matrix has domains (subjects, principals) as rows and objects as columns; cell (i, j) lists the operations domain i may invoke on object j. (Operations can include "add/delete entries in the matrix" — policies may be mutable.) The matrix separates policy from mechanism.
The matrix is conceptually a table of ⟨domain, object, rights-set⟩ triples. It's too large to store densely but is typically sparse — two standard representations:
Access Control Lists (ACLs)
Store a list of ⟨domain, rights-set⟩ per object — each column of the matrix. Common in filesystems — the ACL is checked on the naming path (e.g., on file open). If ACLs live on disk, checks happen in software → use only on low-duty paths, or cache results.
Capabilities
Store a list of ⟨object, rights-set⟩ per domain — each row of the matrix. A capability is a "secure pointer" — possession means the operation is allowed. Hardware capabilities (e.g. CHERI) have special instructions to restrict and pass them; software capabilities are protected by encryption (useful for distributed systems).
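Both representations can be derived from the same sparse set of triples. A small sketch (domain and object names are invented for illustration):

```python
# Sparse access matrix as ⟨domain, object⟩ → rights-set triples.
matrix = {
    ("alice", "file1"):   {"read", "write"},
    ("bob",   "file1"):   {"read"},
    ("alice", "printer"): {"print"},
}

def acl(obj):
    """Column of the matrix: the ACL stored with one object."""
    return {d: r for (d, o), r in matrix.items() if o == obj}

def capabilities(dom):
    """Row of the matrix: the capability list held by one domain."""
    return {o: r for (d, o), r in matrix.items() if d == dom}
```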
Authentication
User to system. Password / passphrase / key, hashed with a salt (easy to compute, hard to invert). Multi-factor authentication adds extra components. Failed attempts are logged.
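A minimal sketch of salted hashing and verification (SHA-256 stands in for the deliberately slow key-derivation function, such as bcrypt or argon2, that a real system would use):

```python
import hashlib
import hmac
import os

# Salted password storage: easy to compute, hard to invert. The salt
# defeats precomputed-hash (rainbow table) attacks.
def hash_password(password, salt=None):
    """Return (salt, digest); store both, never the password."""
    salt = salt if salt is not None else os.urandom(16)
    digest = hashlib.sha256(salt + password.encode()).hexdigest()
    return salt, digest

def verify(password, salt, digest):
    """Recompute the digest and compare in constant time."""
    _, candidate = hash_password(password, salt)
    return hmac.compare_digest(candidate, digest)
```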
System to user. Avoid user talking to the wrong program. Historically: make the login character identical to the terminal attention. More recently: Windows NT's Ctrl-Alt-Del to log in — nothing else can trap it. (When your bank phones, how do you know it's them?)
Processes: Program vs Process
The computer is there to execute programs, not the OS!
The process is the unit of protection and resource allocation. Multiple processes can be created from a single program.
Each process's virtual processor has:
- Text — program code.
- Data — global variables.
- Heap — memory allocated at runtime.
- One or more threads of execution.
Each thread has a program counter (current instruction) and a stack (temporaries, parameters, return addresses).
Process Control Block & Threads
Process Control Block (PCB)
Data structure representing a process:
- Process ID — unique identifier.
- Current state — running, waiting, etc.
- CPU scheduling info — priorities, queue pointers.
- Memory-management info — memory allocated.
- Accounting info — CPU used, clock time, time limits.
Plus saved CPU state for when the process isn't running, open-file list, I/O state, etc.
Threads
A process contains one or more threads sharing one address space. Each thread has a Thread Control Block (TCB) with saved context (registers including SP), scheduler info, and PC. A scheduler picks which thread runs. Switching threads incurs a context switch; switching between processes also switches the process state (address space).
Context Switching
Switching between processes involves:
- Saving the context of the currently-executing process (if any).
- Restoring the context of the process being resumed.
Process Lifecycle
Process States
- New
- Process is being created.
- Ready
- Ready to run, waiting for the CPU.
- Running
- Instructions are being executed.
- Waiting (Blocked)
- Stopped executing; waiting for an event.
- Terminated (Exit)
- Has finished executing.
Process Creation
Most systems are hierarchical: parent processes create children, forming a process tree (on Linux rooted at init, pid 1).
Three design axes per OS:
- Resource sharing: parent and children share all / some / no resources?
- Child memory init: duplicate of parent (then modify) / load a fresh program?
- Execution: concurrent with parent / parent waits for child?
pid = fork() clones a child (duplicate of parent), returning 0 to the child and the child's PID to the parent. The child usually then calls execve to replace its memory with a new program. The parent may wait() until the child exit()s and collect its status.
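The fork/wait pattern can be sketched with Python's POSIX bindings (Unix only; os._exit stands in for the execve-then-exit the child would normally perform):

```python
import os

# fork()/exit()/wait() in miniature: the child exits with a status
# that the parent collects.
def fork_and_collect():
    pid = os.fork()
    if pid == 0:                      # child: fork() returned 0
        os._exit(42)                  # child's exit status
    _, status = os.waitpid(pid, 0)    # parent: wait for that child
    return os.WEXITSTATUS(status)     # extract the collected status
```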
Alternative: Windows NT/2K/XP CreateProcess specifies the program name directly.
Process Termination
A process terminates when:
- It executes its last statement and calls exit.
- Its parent terminates it via abort/kill — resource overrun, task no longer needed, or cascading termination (parent exiting forces children to exit).
- It performs an illegal operation (unauthorised memory access, privileged instruction).
If the parent does not wait for a terminated child, the child becomes a zombie. If the parent terminates first without waiting, the child becomes an orphan.
Inter-Process Communication (IPC)
All IPC needs a protocol: agreed syntax, semantics, synchronisation. Between hosts is Networking (Part IB); between threads is Concurrent & Distributed Systems. IPC's basic requirement is shared access to memory on the same host.
Two Fundamental Models
- Shared memory
- Communicating processes establish a region of memory both can access. Requires relaxing the usual memory-protection restriction between processes.
- Message passing
- Processes send messages via the kernel. Design choices:
- Direct vs indirect — name each other or a shared mailbox.
- Blocking vs non-blocking — sync or async send/receive.
- Buffer size — zero, finite, or unbounded for non-blocking.
Signals
Simplest form of message passing: asynchronous notifications between processes. kill sends a signal to a process; sigaction examines or changes a handler disposition (terminate, ignore, ...); pause suspends a process until a signal is caught. Each signal maps to an integer (architecture-dependent).
| Signal | # | Meaning |
|---|---|---|
| SIGHUP | 1 | Hangup detected / death of controlling process |
| SIGINT | 2 | Terminal interrupt (Ctrl-C) |
| SIGILL | 4 | Illegal instruction |
| SIGKILL | 9 | Terminate process (cannot be caught or ignored) |
| SIGSEGV | 11 | Segmentation fault — invalid memory reference |
| SIGTERM | 15 | Politely terminate process |
| SIGUSR1/2 | — | User-defined signals |
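The kill/sigaction pair can be sketched with Python's signal module (POSIX only; the handler here simply records which signal arrived):

```python
import os
import signal

# Install a handler (changing the disposition of SIGUSR1), then
# deliver the signal to our own process with kill().
caught = []

def handler(signum, frame):
    caught.append(signum)

signal.signal(signal.SIGUSR1, handler)   # sigaction analogue
os.kill(os.getpid(), signal.SIGUSR1)     # kill(2): send the signal
```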
Pipes
Simple unidirectional byte-stream IPC through a kernel buffer. pipe() returns a pair of file descriptors (fd[0], fd[1]) — read and write ends. fork() creates a child; parent and child communicate via read/write on the fd pair. Named pipes (FIFOs) extend beyond parent/child relationships by appearing as files in the filesystem.
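A minimal sketch of the pipe/fork pattern using Python's POSIX bindings (Unix only; the child writes one message and exits):

```python
import os

# pipe() + fork(): the child writes into the pipe's write end,
# the parent reads from the read end.
def pipe_demo(message):
    r, w = os.pipe()          # fd pair: read end, write end
    pid = os.fork()
    if pid == 0:              # child
        os.close(r)
        os.write(w, message)
        os._exit(0)
    os.close(w)               # parent
    data = os.read(r, 1024)
    os.close(r)
    os.waitpid(pid, 0)        # reap the child
    return data
```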
Shared Memory Segments
Obtain a shared region via shmget, attach it with shmat, read/write via pointers, then shmdt to detach and shmctl to destroy. No built-in synchronisation — callers must protect concurrent access.
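The same attach/detach lifecycle can be sketched with Python's multiprocessing.shared_memory, a POSIX-style stand-in for the System V calls:

```python
from multiprocessing import shared_memory

# Create (~shmget), attach a second handle by name (~shmat), read back,
# detach (~shmdt), destroy (~shmctl IPC_RMID). No synchronisation here:
# concurrent writers would need their own locking.
def shm_roundtrip(payload):
    seg = shared_memory.SharedMemory(create=True, size=len(payload))
    try:
        seg.buf[: len(payload)] = payload                   # write via buffer
        other = shared_memory.SharedMemory(name=seg.name)   # second attach
        data = bytes(other.buf[: len(payload)])             # read it back
        other.close()                                       # detach
    finally:
        seg.close()
        seg.unlink()                                        # destroy segment
    return data
```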
CPU-I/O Burst Cycle & Queues
Process execution interleaves CPU execution with waiting for I/O. CPU burst distribution is often (hyper-)exponential:
- I/O-bound processes — many short CPU bursts.
- CPU-bound processes — fewer, longer CPU bursts.
Maximising CPU utilisation therefore requires multiprogramming — something useful to do while waiting for I/O.
Queues
- Job Queue
- Batch processes awaiting admission to the system.
- Ready Queue
- Processes in main memory, ready and waiting to execute.
- Wait Queue(s)
- Processes waiting for specific events, e.g. I/O devices or other processes.
Scheduler Types & Idling
Short-Term vs Long-Term
- Short-term / CPU scheduler
- Selects which process runs next and allocates the CPU. Often the only scheduler. Invoked frequently (milliseconds) so must be fast.
- Long-term / Job scheduler
- Controls the degree of multiprogramming — selects which processes enter the ready queue. Invoked infrequently (seconds to minutes) so may be slower. Strives for a good mix of CPU- and I/O-bound processes.
Idling
What if there's nothing to run?
- Busy wait in the scheduler — short response time but ugly and inefficient.
- Halt CPU until interrupted — saves energy but increases latency.
- Invent an idle process — uniform structure, can do housekeeping, but consumes resources and may slow interrupt response.
Dispatcher
After the scheduler has chosen a process, the dispatcher gives it control by:
- Switching context,
- Switching to user mode,
- Executing the process from the selected location.
Dispatch latency is the time for this stop-start.
Two questions drive scheduler design:
- When to make a scheduling decision?
- How to order the ready queue — which process next?
Pre-emptive vs Non-preemptive
Scheduling decisions can happen when:
1. A running process blocks (running → waiting).
2. A running process terminates (running → terminated).
3. A timer expires (running → ready).
4. A waiting process unblocks (waiting → ready).
Non-preemptive — scheduler only invoked under 1 and 2; the running process decides when to yield.
Pre-emptive — scheduler can also be invoked under 3 and 4; OS forces entry.
| Pre-emptive | Non-preemptive | |
|---|---|---|
| Hardware | Requires regular timer interrupts | No timers needed |
| Denial-of-service | OS can force yield → immune | Malicious process can refuse to yield |
| Implementation | Complex: timers, concurrency | Simple: process holds CPU as long as desired |
Almost all modern schedulers are pre-emptive.
Scheduling Criteria
Many metrics; no single "best" scheduler.
- Turnaround time
- Time from submission to completion. Minimise across all states.
- Waiting time
- Time in the ready queue. What the scheduler directly controls. Penalises I/O-heavy processes.
- Response time
- Time to start responding. Critical for interactive systems; may penalise long sessions under load.
- CPU utilisation
- Maximise CPU busy time. Penalises I/O-heavy processes.
- Throughput
- Rate at which processes complete. Penalises long-running processes (short ones jump the queue).
Multiple-Processor Scheduling
Assume homogeneous processors within a multiprocessor.
- Asymmetric multiprocessing
- Only one processor accesses system data structures; no sharing.
- Symmetric multiprocessing (SMP)
- Most common. Each processor self-schedules. Ready queue may be shared or per-processor.
Processor Affinity
- Soft — preference; scheduler tries to keep the process on the same CPU.
- Hard — constraint; process pinned. Extensions: processor sets.
NUMA
Non-Uniform Memory Access: different CPUs have faster or slower access to different parts of memory (e.g. combined CPU+memory boards). Memory placement therefore interacts with affinity; migrating a process to a different CPU is far more expensive if its memory stays behind on the old node.
Load Balancing
SMP requires keeping all CPUs loaded evenly.
- Push migration — periodic task checks each CPU's load and pushes tasks off overloaded CPUs.
- Pull migration — idle CPUs pull waiting tasks from busy ones.
Other Trends
- Multicore — multiple CPU cores on one chip.
- Hyperthreading — multiple threads per core; one can progress while another stalls on a memory read.
- Virtualisation — hypervisor and guests all schedule against each other.
First-Come First-Served (FCFS)
The schedule depends purely on arrival order. Simplest possible algorithm; not robust.
Example
Three processes arriving in order P₁, P₂, P₃ with burst times 24, 3, 3:
| Order | Gantt chart | Waiting times (P₁, P₂, P₃) | Avg |
|---|---|---|---|
| P₁, P₂, P₃ | P₁(0–24), P₂(24–27), P₃(27–30) | 0, 24, 27 | 17 |
| P₂, P₃, P₁ | P₂(0–3), P₃(3–6), P₁(6–30) | 6, 0, 3 | 3 |
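The waiting times above can be checked with a few lines (a sketch; all processes are assumed to arrive at time 0):

```python
# FCFS: each process waits for the total burst time of everything
# scheduled ahead of it.
def fcfs_waits(bursts):
    waits, t = [], 0
    for b in bursts:
        waits.append(t)   # process starts once predecessors finish
        t += b
    return waits
```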
Shortest Job First (SJF)
Schedule the process with the shortest next CPU burst.
Example
P₁(6), P₂(8), P₃(7), P₄(3) → order: P₄, P₁, P₃, P₂. Waiting times: 3, 16, 9, 0 → average 7.
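A sketch of the same computation for SJF (non-preemptive, all arrivals at time 0):

```python
# SJF with everything available at time 0: sort by burst length,
# then run FCFS over that order.
def sjf_waits(bursts):
    order = sorted(range(len(bursts)), key=lambda i: bursts[i])
    waits, t = [0] * len(bursts), 0
    for i in order:
        waits[i] = t
        t += bursts[i]
    return waits
```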
Shortest Remaining Time First (SRTF) & Burst Prediction
Pre-emptive version of SJF: a newly arriving process with a burst shorter than the remaining time of the current process pre-empts it. Distinguish arrival time from burst length.
Example
| Process | Arrival | Burst |
|---|---|---|
| P₁ | 0 | 8 |
| P₂ | 1 | 4 |
| P₃ | 2 | 9 |
| P₄ | 3 | 5 |
Average waiting time = ((10−1) + (1−1) + (17−2) + (5−3)) / 4 = 26/4 = 6½.
Is SRTF Optimal?
No — for two reasons:
- Context switches aren't free. If short-burst processes keep arriving, the OS thrashes and does no useful work.
- More fundamentally: how can we know the length of a future burst?
Predicting Burst Lengths — Exponential Averaging
Assume the next burst isn't too different from the previous. Let tₙ be the measured length of the nth CPU burst and τₙ its predicted length; with weight 0 ≤ α ≤ 1:
τₙ₊₁ = α·tₙ + (1 − α)·τₙ
Expanding: τₙ₊₁ = α·tₙ + α(1−α)·tₙ₋₁ + α(1−α)²·tₙ₋₂ + … + (1−α)ⁿ⁺¹·τ₀.
Each earlier term has less weight — "exponential decay" into the past.
- α ≈ 1 → τₙ₊₁ ≈ tₙ: past history discarded.
- α ≈ 0 → τₙ₊₁ ≈ τₙ: recent history discarded.
Works well if variance is small and isn't changing too fast relative to the time slot. Must also consider system load — otherwise (counter-intuitively) priorities rise with load.
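The recurrence is a one-line loop; a sketch (the burst history and initial guess τ₀ are inputs):

```python
# Exponential averaging: fold each observed burst into the prediction,
# tau = alpha * t + (1 - alpha) * tau, oldest burst first.
def predict(bursts, tau0, alpha):
    tau = tau0
    for t in bursts:
        tau = alpha * t + (1 - alpha) * tau
    return tau
```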
Round Robin
Pre-emptive scheme for time-sharing systems. Each process gets a quantum (time-slice), typically 10–100 ms. When the quantum elapses, the process is pre-empted and appended to the ready queue. A timer interrupt every quantum triggers the next scheduling decision.
Tuning the Quantum
| q too large | Degenerates into FCFS |
|---|---|
| q too small | Context-switch overhead dominates |
Typical values: q = 10–100 ms, context switch < 10 μs.
Properties
For n processes ready with quantum q:
- Fair — each gets 1/n CPU in chunks of at most q.
- Live — no process waits more than (n−1)q.
Compared to SRTF: higher average turnaround but better average response time.
Example
P₁(24), P₂(3), P₃(3) with q = 4 gives: P₁ | P₂ | P₃ | P₁ | P₁ | P₁ | P₁ | P₁ (across times 0–30).
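A small simulation reproduces this schedule (a sketch; all processes are assumed ready at time 0):

```python
from collections import deque

# Round robin: run each ready process for at most one quantum q,
# re-queueing it if unfinished. Returns (process, start, end) slices.
def round_robin(bursts, q):
    queue = deque((i, b) for i, b in enumerate(bursts))
    t, slices = 0, []
    while queue:
        i, left = queue.popleft()
        run = min(q, left)
        slices.append((i, t, t + run))
        t += run
        if left > run:                    # quantum expired: back of queue
            queue.append((i, left - run))
    return slices
```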
Priority Scheduling
Associate an integer priority with each process; schedule the highest priority (typically lowest number).
Example
| Process | Priority | Burst |
|---|---|---|
| P₁ | 3 | 10 |
| P₂ | 1 | 1 |
| P₃ | 4 | 2 |
| P₄ | 5 | 1 |
| P₅ | 2 | 5 |
Order: P₂, P₅, P₁, P₃, P₄. Waiting: 6, 0, 16, 18, 1. Average = 41/5 = 8.2.
Observation: SJF is priority scheduling with priority = inverse of predicted burst length.
Starvation and Aging
The biggest problem with static priority. Solve with dynamic priorities:
- Aging — priority increases from a static base as time passes without scheduling.
Computed Priority — UNIX Scheduler
Priorities 0–127; user processes have base = 50; round-robin within each priority queue, quantum 100 ms; priority recomputed every 4 ticks (≈ 40 ms) if running.
Kernel-mode processes have fixed priority (non-preemptive) determined by reason for waiting (e.g. disk I/O < terminal input).
User-mode processes have dynamically-computed pre-emptive priority. Per tick (10 ms): if a higher-priority process exists, switch. Per quantum: if same-priority process exists, switch.
The priority Pⱼ of process j is computed from:
- baseⱼ — base priority (50 for user processes),
- niceⱼ — user-controlled, −20 to +20, default 0,
- loadⱼ — sampled (1-minute) average run-queue length,
- CPUⱼ — counter incremented each tick the process is observed running.
CPUⱼ(i) = [2·loadⱼ / (2·loadⱼ + 1)] · CPUⱼ(i−1)
Pⱼ(i) = baseⱼ + CPUⱼ(i)/4 + 2·niceⱼ
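One recomputation step is easy to sketch (a direct transcription of the two formulas; argument names mirror the symbols above):

```python
# One step of the computed-priority recurrence: decay the CPU counter
# by the load-dependent factor, then recompute the priority.
def decay_cpu(cpu, load):
    return (2 * load) / (2 * load + 1) * cpu

def priority(base, cpu, nice):
    return base + cpu / 4 + 2 * nice
```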
Multilevel & Multilevel-Feedback Queues
Multilevel Queues
Partition the Ready queue into multiple queues (e.g. foreground/interactive vs background/batch). Each process is permanently assigned a queue. Each queue runs its own algorithm — foreground Round Robin, background FCFS, say. Scheduling between queues is usually fixed priority (serve higher first) or time-slice (80% to foreground, 20% to background).
Multilevel Feedback Queues
Processes can move between queues — implementing aging, for example. Defined by:
- Number of queues,
- Scheduling algorithm for each,
- Method to upgrade a process,
- Method to demote a process,
- Method to determine entry queue for new service.
Example
- Q₀ — RR, quantum 8 ms.
- Q₁ — RR, quantum 16 ms.
- Q₂ — FCFS.
A new job enters Q₀ (FCFS-served within the queue). It gets 8 ms. If unfinished, it moves to Q₁ for 16 more ms. Still unfinished? Demote to Q₂. This gives short jobs fast response and long jobs eventual progress without overwhelming the system.
Memory Management & Hardware Protection
Many programs live in memory simultaneously. The CPU directly accesses only registers and main memory:
- Register access: single cycle.
- Memory access: many cycles (masked by multi-level cache L1/L2/L3).
We need to protect memory accesses to prevent malicious or buggy user programs from corrupting other programs — including the kernel.
Base and Limit Registers
Define the logical address space per process:
- Base — smallest legal address (e.g. 300040).
- Limit — size of the range (e.g. 120900).
- Legal addresses: [300040, 420940).
CPU checks every user-mode access against these. Out-of-range → exception, trap to OS.
Address Binding
Programs on disk are brought into memory to run — but where in memory? Program code refers to memory locations that may not match wherever the process lands. Address binding resolves this, and can happen at three points:
- Compile time
- If memory location is known a priori, compiler generates absolute code. Requires recompilation if base changes.
- Load time
- If location not known at compile time, generate relocatable code — fixed at load.
- Execution time
- Binding delayed to runtime. Required if the process can be moved during execution. Needs hardware support.
Logical vs Physical Addresses
- Logical (virtual) address
- Generated by the CPU.
- Physical address
- Seen by the memory unit.
Identical under compile-time and load-time binding; differ under execution-time binding. The set of all logical addresses is the logical address space; similarly for physical.
Memory Management Unit (MMU)
Hardware that maps logical → physical addresses at runtime. Simplest scheme: replace the base register with a relocation register. Every user-generated address has the relocation-register value added when sent to memory.
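A sketch of the translation (illustrative values; on real hardware the limit check and the addition happen on every access):

```python
# Relocation-register MMU: check the logical address against the limit,
# then add the relocation value on the way to memory.
def translate(logical, relocation, limit):
    if not 0 <= logical < limit:
        raise MemoryError("addressing error: trap to OS")
    return logical + relocation
```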
Linking & Loading
Linking
Combines object code modules into a runnable program.
- Static linking — all libraries and program code combined into the binary.
- Dynamic linking — postponed to execution time. Useful for system/shared libraries; may need version tracking.
Dynamic linking uses stubs: small pieces of code that locate the appropriate in-memory routine, replace themselves with the routine address, and then call through. The OS checks whether the routine is already in the process's address space and adds it if not.
Loading
Dynamic loading avoids loading routines until they're called. Better memory usage for infrequently-used code. Requires relocatable compilation.
Memory Allocation, Partitions, Swapping
Main memory supports both kernel and user processes — a limited resource that must be allocated efficiently. Early systems used contiguous allocation: each process in one chunk.
Partitioning
- Multiple fixed-sized partitions — limits multiprogramming.
- Variable partitioning — preferred; partitions sized to processes.
Main memory is usually split: resident kernel in low memory (alongside interrupt vectors); user processes in high memory, each contiguous. Relocation registers protect user processes from each other and the kernel code/data from being modified.
Swapping
When total requested memory exceeds physical memory, processes are temporarily swapped out to storage.
- Transfer time ∝ amount of memory swapped — very expensive.
- Swapping default off; enabled only when allocation exceeds a threshold.
- Pending I/O to/from process memory constrains swapping.
- The system keeps a ready queue of ready-to-run processes whose memory images are on disk.
Must swapped-out processes be swapped back to the same physical addresses? Depends on address binding method.
Dynamic Allocation — Free Hole Strategies
Holes — blocks of available memory — are scattered through memory. Three strategies for finding a hole of size n:
| Strategy | Approach | Pros / Cons |
|---|---|---|
| First-fit | Allocate the first hole that's big enough. | Fast, good utilisation. |
| Best-fit | Smallest hole big enough. | Needs full list search (or size-ordered). Smallest leftover. |
| Worst-fit | Largest hole. | Full-list search. Largest leftover. |
First-fit and best-fit both beat worst-fit for speed and utilisation.
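The table above can be sketched as a toy allocator over a list of hole sizes (the hole values below are made up for illustration; a real allocator keeps a linked list of holes with base addresses):

```python
def allocate(holes, n, strategy):
    """Return the index of the chosen hole for a request of size n, or None.

    holes: free-hole sizes in address order.
    """
    fits = [i for i, size in enumerate(holes) if size >= n]
    if not fits:
        return None
    if strategy == "first":
        return fits[0]                              # first adequate hole
    if strategy == "best":
        return min(fits, key=lambda i: holes[i])    # smallest adequate hole
    if strategy == "worst":
        return max(fits, key=lambda i: holes[i])    # largest hole
    raise ValueError(strategy)

holes = [100, 500, 200, 300, 600]   # KB
# For a 212 KB request: first-fit picks index 1 (500), best-fit index 3 (300),
# worst-fit index 4 (600).
```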
Fragmentation & Compaction
Fragmentation makes memory unused and unusable.
- External fragmentation — free memory exists to satisfy a request but not contiguously. Can eventually block process loading.
- Internal fragmentation — allocated memory is slightly larger than requested; the excess inside a partition goes unused.
Compaction
Reduce external fragmentation by shuffling memory so free space is one large block. Only possible if relocation is dynamic and done at execution time.
I/O problem: pin jobs in memory during I/O, or do I/O only into OS buffers. And backing store has the same fragmentation problem.
Segmentation
Memory-management scheme that supports the user's view of memory: a program as a collection of segments — logical units like the program, a procedure, an object, an array.
An address is a pair ⟨segment-number, offset⟩. The segment table maps to physical addresses via entries with:
- base — starting physical address;
- limit — segment length.
Plus hardware registers:
- STBR (Segment-Table Base Register) — points to the segment table in memory.
- STLR (Segment-Table Length Register) — number of segments used; segment number s is legal if s < STLR.
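A minimal sketch of the translation and limit checks the hardware performs; the segment-table contents are invented for illustration:

```python
def translate(seg_table, stlr, s, d):
    """Map logical address <s, d> to physical via (base, limit) entries."""
    if s >= stlr:
        raise MemoryError("segment number out of range")   # hardware traps
    base, limit = seg_table[s]
    if d >= limit:
        raise MemoryError("offset beyond segment limit")   # hardware traps
    return base + d

segs = [(1400, 1000), (6300, 400), (4300, 400)]   # hypothetical (base, limit)
# <2, 53> maps to 4300 + 53 = 4353; <1, 500> traps because 500 >= 400.
```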
Protection and Sharing
Each segment-table entry has:
- Validation bit — legal/illegal segment.
- Read/Write/Execute privileges.
Code sharing occurs at segment granularity. Segments vary in length → dynamic storage allocation problem.
Sharing segments is subtle. Consider jumps within shared code — a branch specifies ⟨segment-number, offset⟩ with segment-number being "this one". All programs sharing the segment must refer to it by the same number. Finding a common number scales badly. Possible fixes:
- Specify branches as PC-relative or relative to a register holding the current segment number.
- Read-only segments with no pointers may be shared with different numbers.
- Use System Segment Numbers (SSNs): each segment gets a unique SSN; each process's segment table maps from Process Segment Numbers to SSNs.
Non-contiguous Allocation: Pages and Frames
The goal: let a process's physical address space be non-contiguous so physical memory can be allocated as available. Avoids external fragmentation and the varying-size-chunk problem. Still has internal fragmentation.
Paging
- Divide physical memory into frames: fixed-size (power-of-two) blocks, typically 512 B – 1 GB.
- Divide logical memory into pages of the same size.
- A page table maps pages → frames.
Running an N-page program requires: finding N free frames, creating page-table entries, loading the program.
Address Translation
Each logical address is split:
┌────────────────┬─────────────┐
│ page number p │ offset d │
└────────────────┴─────────────┘
(m − n) bits n bits
For a logical address space of size 2ᵐ and page size 2ⁿ: the page number (m − n bits) indexes the page table to find the frame; the offset (n bits) is appended to the frame base.
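The split is just a shift and a mask. A sketch, assuming page size 2ⁿ:

```python
def split_address(addr, n):
    """Split a logical address into (page number, offset) for page size 2**n."""
    return addr >> n, addr & ((1 << n) - 1)

def to_physical(frame, d, n):
    """Recombine: frame base plus offset."""
    return (frame << n) | d

# With 4 KB pages (n = 12), 0x12345 splits into page 0x12, offset 0x345.
```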
Pros and Cons
- ✓ No external fragmentation.
- ✗ Still has internal fragmentation — on average ½ frame per process. Smaller frames waste less, but each PTE costs memory → table grows.
- ✗ Process view and physical memory differ — OS must track free frames and remap the page table on every context switch.
Example: page size 2048 bytes, process size 72,766 bytes → the process needs 35 full pages plus 1086 bytes (36 frames). Internal fragmentation = 2048 − 1086 = 962 bytes.
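The arithmetic above is a single divmod:

```python
page = 2048
size = 72_766
full, last = divmod(size, page)       # 35 full pages, 1086 bytes left over
frames = full + (1 if last else 0)    # 36 frames allocated in total
waste = (page - last) % page          # 962 bytes of internal fragmentation
```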
Page Tables
Hardware support needed for performance. The page table translates page number → frame number; the offset within a page equals the offset within a frame.
- PTBR (Page-Table Base Register) — points to the page table in main memory.
- PTLR (Page-Table Length Register) — size of the page table.
Translation Lookaside Buffer (TLB)
A hardware cache of recent translations, using associative memory. Typically 64–1024 entries.
Operation
- Translation in TLB (hit) → use it.
- TLB miss → slow two-memory-access lookup; add the entry to the TLB for next time.
- Replacement policy usually LRU; can sometimes pin entries for permanent fast access.
Performance — Effective Access Time (EAT)
Let TLB search = 20 ns, memory access = 100 ns, hit ratio h.
- h = 80%: EAT = 0.8 × 120 + 0.2 × 220 = 140 ns.
- h = 98%: EAT = 0.98 × 120 + 0.02 × 220 = 122 ns — only 13% improvement (Intel 80486 had 32 entries and ≈ 98%).
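These figures follow from a small formula; this sketch assumes, as the example does, that a miss costs one extra memory access for the page-table lookup:

```python
def eat(tlb_ns, mem_ns, hit_ratio):
    """Effective access time: a hit needs one memory access after the TLB
    search; a miss needs two (page-table entry, then the data itself)."""
    return hit_ratio * (tlb_ns + mem_ns) + (1 - hit_ratio) * (tlb_ns + 2 * mem_ns)

# eat(20, 100, 0.80) -> 140 ns; eat(20, 100, 0.98) -> 122 ns
```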
Protection & Sharing
Per-Page Protection
Each PTE has protection bits:
- Accessible in kernel mode only, or user mode.
- Read / Write / Execute allowed.
- Valid / Invalid.
Hardware checks bits during translation. Violations trap to the OS. Granularity is per page, not per byte.
Sharing
Read-only (re-entrant) code can be shared across processes — one physical copy, many virtual mappings. Similar to multiple threads sharing a process space. Read-write sharing works too (useful for IPC).
Multi-level Page Tables
Page tables can be huge. For a 32-bit logical address with 4 KB pages: 2³²/2¹² = 2²⁰ entries. At 4 B each → 4 MB per process for the page table — we don't want to allocate that contiguously.
Solution: split into levels and page out all but the outermost level.
Two-Level Paging
┌──────────┬──────────┬──────────┐
│ p₁ │ p₂ │ d │
│ 10 bits │ 10 bits │ 12 bits │
└──────────┴──────────┴──────────┘
- PTBR points to L1 page table.
- p₁ indexes into L1 → address of relevant page of the L2 table.
- p₂ indexes into L2 → frame address.
- d indexes the frame → target byte.
This is a forward-mapped page table.
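The 10/10/12 split above, as shifts and masks:

```python
def split_two_level(addr):
    """Decompose a 32-bit logical address into (p1, p2, d)."""
    p1 = (addr >> 22) & 0x3FF   # index into the outer (L1) table
    p2 = (addr >> 12) & 0x3FF   # index into the selected L2 page
    d = addr & 0xFFF            # offset within the frame
    return p1, p2, d
```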
Larger Address Spaces
For 64-bit address spaces a simple hierarchy is impractical — either one layer is too large or there are too many levels. Alternatives (non-examinable):
- Hashed page tables — hash the page number; follow the chain.
- Inverted page tables — one entry per frame; hash-table limits search. Trades size for lookup latency.
Real-World Examples (Non-Examinable)
- Intel IA-32
- Hybrid segmentation + paging. Segments ≤ 4 GB, ≤ 16,384 per process split into two partitions — private (LDT) and shared (GDT). Logical address ⟨selector, offset⟩ → linear address via segment unit → physical via paging unit. Two-level paging for 4 KB pages. PAE extension gives 36-bit addresses (64 GB physical) via 3-level scheme.
- Intel x86-64
- 64-bit but only 48 bits actually implemented. Page sizes 4 KB, 2 MB, 1 GB. Four levels of paging. PAE variant: 48-bit virtual, 52-bit physical.
- ARM (32-bit)
- 4 KB / 16 KB pages + 1 MB / 16 MB "sections" (one-level paging for sections, two-level for pages). Two-level TLB: outer micro-TLBs (split data/instruction, ASID-aware) + inner main TLB. Hardware page-table walk on miss.
Virtual Memory
Virtual addressing lets us introduce virtual memory. Page translations can be valid, invalid, or a new designation: non-resident — the page is on a non-volatile backing store. Processes access non-resident memory as if it were real. Implemented via demand paging / demand segmentation.
Virtual memory separates program logical memory from physical memory — logical space can be much larger than physical.
Benefits
- Portability
- Programs work regardless of physical memory size; can be larger than physical memory; can start executing before fully loaded.
- Convenience
- Less of a program needs to be in memory at once → more efficient multiprogramming, less swapping I/O, easy large sparse data structures.
- Efficiency
- No need to waste memory on unused code/data (e.g. error handling, rare routines).
Virtual Address Space Layout
Usually starts at 0 and is contiguous. Stack starts at the maximum address and grows down; heap starts above the data and grows up. Unused space between stack and heap is the hole. No physical memory is needed until heap or stack grows onto a new page.
System libraries are shared via mapping into virtual address spaces. Pages can be shared across fork (see COW below), speeding process creation.
Page Faults
When a process references an invalid page, a trap to the OS — a page fault:
- OS examines another table to decide whether the reference is valid.
- If valid but not resident: find a free frame, swap the page in, mark the PTE valid.
- If invalid: abort.
- Restart the faulting instruction.
Instruction Restart
Simple case: add two numbers in memory, store result. If the store faults, restart from instruction fetch. Locality of reference makes multiple faults per instruction unlikely.
Tricky case: an instruction that touches several locations, e.g. a memory-block move where source and destination overlap and either straddles a page boundary. The source may already be partially modified — you can't restart from scratch.
Use microcode: stride across the block first, touching every page so all are resident. Then the copy itself cannot fault.
Double fault — if the page-fault handler itself faults, there is no recovery: just give up (panic).
Locality of Reference
Over short intervals, a process's references group into a few regions: current procedure, sub-procedures, data access, stack variables. This is what makes virtual memory practical.
Demand Paging
Rather than loading the whole process at load time, bring pages in as needed. Reduces I/O, memory, and response time. Supports more running processes.
Pure demand paging starts every page invalid — fault brings each in on first reference.
Hardware Requirements
- Page table with valid/invalid bits.
- Secondary memory (swap device with swap space).
- Ability to restart instructions.
The lazy swapper/pager never swaps a page in until needed.
Worst-Case Demand Paging Performance
Twelve steps when every request misses, queues behind others, etc. — basically: trap → save state → identify fault → disk queue → seek + transfer → reallocate CPU → interrupt on completion → save other state → identify interrupt → update tables → wait for CPU → restore state + restart. Each adds overhead.
Effective Access Time (EAT)
Let memory access = 200 ns, page-fault service = 8 ms, fault rate p.
EAT = (1 − p) × 200 + p × 8,000,000 = 200 + 7,999,800 p ns.
- p = 1/1000 → EAT ≈ 8.2 μs — a 40× slowdown.
- For < 10% degradation: need p < 0.0000025 — less than one fault per 400,000 accesses.
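The EAT formula above, directly:

```python
MEM_NS, FAULT_NS = 200, 8_000_000

def eat_ns(p):
    """Effective access time for page-fault rate p."""
    return (1 - p) * MEM_NS + p * FAULT_NS

# eat_ns(0.001) is about 8199.8 ns -- roughly 40x slower than 200 ns.
# For < 10% degradation, solve 200 + 7_999_800 * p < 220: p < 2.5e-6.
```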
Copy-on-Write & Other Optimisations
- Swap space I/O faster than filesystem I/O. Allocate swap in larger chunks — less management. Copy the process image to swap at load time, then page in/out of swap.
- Demand page from the binary on disk — discard unmodified frames when freeing.
- Copy-on-Write (COW). Parent and child initially share pages. A write copies the page lazily. Efficient process creation — only modified pages cost memory.
- Zero-fill-on-demand pool. Pre-zeroed pages ready for fast page-in — faults don't have to free + zero a frame.
- vfork — child borrows the parent's address space without copying it (the parent is suspended until the child calls exec or exits). Efficient if the child just calls exec.
Page Replacement
Paging in requires a free frame — but physical memory is limited. Must either discard unused pages or swap out an entire process. Page-fault handler:
- Locate the desired page on disk.
- Select a free frame or pick a victim to evict:
- Write the victim back to disk.
- Mark it invalid in its process's page tables.
- Read the desired page into the freed frame.
- Restart the faulting process.
No free frames ≈ double the page-fault service time (evict, then load).
Reference String for Examples
7, 0, 1, 2, 0, 3, 0, 4, 2, 3, 0, 3, 0, 3, 2, 1, 2, 0, 1, 7, 0, 1 (3 frames)
First-In First-Out (FIFO)
Evict the oldest page (simple queue). 15 page faults on the example above.
Optimal (OPT)
Evict the page that will not be used for the longest time in the future. 9 faults on the example — provably the best. Requires knowing the future → useful only as a benchmark.
Least Recently Used (LRU)
Approximates OPT — assume recent past predicts future. Evict the page unused for the longest time. 12 faults on the example.
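The three policies are easy to simulate. This sketch reproduces the fault counts quoted above (FIFO 15, OPT 9, LRU 12) on the example string with 3 frames:

```python
def fifo(refs, n):
    mem, faults = [], 0
    for p in refs:
        if p not in mem:
            faults += 1
            if len(mem) == n:
                mem.pop(0)              # evict the oldest arrival
            mem.append(p)
    return faults

def lru(refs, n):
    mem, faults = [], 0                 # ordered: least recently used first
    for p in refs:
        if p in mem:
            mem.remove(p)               # refresh recency on a hit
        else:
            faults += 1
            if len(mem) == n:
                mem.pop(0)              # evict the least recently used
        mem.append(p)
    return faults

def opt(refs, n):
    mem, faults = [], 0
    for i, p in enumerate(refs):
        if p in mem:
            continue
        faults += 1
        if len(mem) == n:
            # evict the page whose next use is furthest away (or never)
            nxt = {q: refs.index(q, i + 1) if q in refs[i + 1:] else len(refs)
                   for q in mem}
            mem.remove(max(nxt, key=nxt.get))
        mem.append(p)
    return faults

refs = [7, 0, 1, 2, 0, 3, 0, 4, 2, 3, 0, 3, 0, 3, 2, 1, 2, 0, 1, 7, 0, 1]
```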
LRU Implementation
- Counter implementation — each PTE holds a clock value, updated on reference. Evict the page with the smallest counter. Requires a table search plus a memory write per access.
- Stack implementation — maintain a doubly-linked stack of page numbers; on reference, move the page to the top (up to six pointer changes). The tail always points at the replacement victim.
Hardware support is essential — pure LRU is too expensive in software.
LRU Approximations — NRU, Clock, LFU, MFU
Reference Bit
Each PTE has a reference bit, initially 0. Set to 1 when the page is touched. Periodically cleared.
Not Recently Used (NRU)
Periodically (every 20 ms) clear reference bits. Pick victims by reference and dirty bits.
| Referenced? | Dirty? | Comment |
|---|---|---|
| no | no | Best type of page to evict |
| no | yes | Next best (needs writeback) |
| yes | no | Probably code in use |
| yes | yes | Bad choice of victim |
Better: use an 8-bit value, shifting in the reference bit from the left. Keeps history for 8 sweeps.
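The 8-bit history can be sketched as a periodic sweep over per-page counters (the page names below are arbitrary):

```python
def sweep(counters, ref_bits):
    """One periodic pass: shift each 8-bit counter right and OR the page's
    reference bit into the top bit, then clear the bit."""
    for page in counters:
        counters[page] = (counters[page] >> 1) | (ref_bits[page] << 7)
        ref_bits[page] = 0

# Evict the page with the smallest counter: the one untouched for longest.
```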
Second-Chance (Clock) Algorithm
Store pages in a circular FIFO queue with a "current" pointer. On eviction:
- If current's reference bit is 0 → evict.
- Else clear the reference bit (a "second chance") and advance.
Guaranteed to terminate after at most one full cycle. Degenerates into FIFO if all pages are referenced.
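A sketch of victim selection, with the reference bits held in a plain list indexed by frame:

```python
def clock_select(ref_bits, hand):
    """Advance the hand, clearing bits (second chances), until a 0 is found.
    Mutates ref_bits; returns (victim_index, new_hand)."""
    n = len(ref_bits)
    while ref_bits[hand]:
        ref_bits[hand] = 0          # second chance given
        hand = (hand + 1) % n
    return hand, (hand + 1) % n

# With bits [1, 1, 0, 1] and the hand at 0, frames 0 and 1 get second
# chances and frame 2 is evicted.
```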
Emulating Hardware Support
If no hardware reference/dirty bits: mark page as "no access" to clear the reference bit. Any reference traps → update PTE + resume. Permissions check tells you whether the page was referenced.
Counting Algorithms
- LFU (Least Frequently Used) — evict the page with the smallest reference count. Ignores time → a page used heavily during init can stick around. Needs periodic count decay.
- MFU (Most Frequently Used) — evict the page with the highest count, on the theory that a low count means the page was recently brought in and deserves to stay.
Neither is common — expensive and a poor OPT approximation. Similarly, MRU (Most Recently Used) is rarely useful.
Page Buffering
- Keep a minimum-sized pool of free frames — always read new page into a free frame before picking the victim.
- Keep a list of modified pages — write them out when the backing store is idle.
- Keep free-frame contents intact and note what's in them — if referenced again, no re-load needed.
Frame Allocation
Policy question: how to distribute frames across processes after reserving a fraction for the OS?
- Fairness
- m/n frames each (remainder to a free pool), or proportional to process size.
- Minimise system-wide fault rate
- Allocate all memory to a few processes.
- Maximise multiprogramming
- Allocate minimum memory to many processes.
Global vs Local Replacement
- Global replacement — any page (even another process's) can be the victim. More common — greater overall throughput, but per-process performance can depend entirely on what other processes do. Allocation policy is implicit in page-in admission.
- Local replacement — each process selects only from its own frames. More consistent per-process performance, but memory can be underutilised.
Thrashing & the Working Set
A process without "enough" pages faults often — and its replaced frames tend to come back soon.
- Handling page faults → low CPU utilisation.
- OS responds by increasing multiprogramming.
- More processes → more memory pressure.
- System collapses.
Thrashing occurs when the size of the locality exceeds total memory. Limit with local or priority page replacement.
Working Set
The working set = pages required at the same time for a process to make progress. Varies between processes and during execution. Assume process shifts between phases, with spatial locality within each phase.
Let Δ = a fixed window of page references (e.g. 10,000 instructions).
- WSSi = working-set size of process i = total distinct pages referenced in the most recent Δ-window.
- Δ too small: misses the locality.
- Δ too large: spans multiple localities (entire program).
Practical implementation: approximate with an interval timer and reference bits. A page is in the working set iff its reference bit is set during the window.
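Computing WSS over a sliding window is straightforward; the page trace below is made up:

```python
def wss(refs, t, delta):
    """Working-set size at time t: distinct pages in the last delta references."""
    return len(set(refs[max(0, t - delta + 1): t + 1]))

trace = [1, 2, 1, 5, 7, 7, 7, 7, 5, 1]
# wss(trace, 4, 5) -> 4 (pages {1, 2, 5, 7}); wss(trace, 9, 5) -> 3
```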
Pre-paging: bring working-set pages in when restarting a process — avoids the storm of page faults that would otherwise follow.
The I/O Subsystem
Computation relies on I/O. The huge variety of devices differs in data rate, control complexity, transfer unit/direction, data representation, error handling. I/O management is a major OS component — new devices come along frequently, I/O performance is critical, and we want a uniform API.
Commonalities
- A device has at least one port (connection point).
- Devices interconnect via a bus — daisy-chained or shared.
- Devices have integrated or separate controllers (host adapters) containing processor, microcode, private memory.
Devices typically have registers — data-in, data-out, status, control — addressed either via:
- Direct I/O instructions (usually privileged), or
- Memory-mapped I/O — device registers mapped into processor address space (especially for large devices like graphics cards).
Polling vs Interrupt-Driven I/O
Polled Mode
Per byte:
- Host repeatedly reads device-busy until clear.
- Host sets read/write bit in command register; puts data in data register.
- Host sets command-ready in status register.
- Device sees command-ready, sets device-busy.
- Device performs the operation.
- Device clears command-ready and any error bit; clears device-busy.
Interrupts
More efficient when devices are accessed infrequently. Device triggers the interrupt-request line, which the CPU checks after each instruction (aligning interrupts with instruction boundaries). The interrupt vector dispatches to the correct handler — unless masked. Priorities may be applied; some interrupts are non-maskable.
Interrupt Handling — Top / Bottom Halves
Split the implementation:
- Top half — interrupt handler
- Processor-dependent. Saves more registers, establishes a language environment, demultiplexes the interrupt, invokes the correct ISR.
- Bottom half — Interrupt Service Routine (ISR)
- Device-specific (not processor-specific).
- For programmed I/O: transfer data, clear interrupt.
- For DMA: acknowledge transfer, request pending transfers, signal waiting processes, return or enter scheduler.
Subtle question: who is scheduling whom? Example pathology: network livelock — heavy interrupt load prevents any useful work.
Direct Memory Access (DMA)
For high-speed I/O devices transmitting at close to memory speeds, interrupts become too expensive per byte. Better: let devices read and write processor memory directly.
DMA Commands
- Source address + increment / decrement / no-change.
- Sink address + increment / decrement / no-change.
- Transfer size.
One interrupt per block, not per byte. DMA channels may be provided by a dedicated controller or by devices themselves (e.g. disk controller passes disk address, memory address, size, direction).
Requires that the device be able to act as a bus master — so we need bus arbitration (not just the CPU driving the bus). Involves cycle stealing: taking bus cycles away from the CPU.
Scatter/Gather DMA chains multiple requests (e.g. disk reads into a set of buffers).
Device Characteristics
- Block devices (disks, CDs)
- Commands: read, write, seek. Access: raw blocks / via filesystem ("cooked") / memory-mapped.
- Character devices (keyboards, mice, serial)
- Commands: get, put. Libraries layer on line editing, etc.
- Network devices
- Vary enough to get their own interface (Berkeley sockets on Unix and NT).
- Miscellaneous
- Current time, elapsed time, timers, clocks. Unix ioctl covers the odd aspects.
Blocking, Non-Blocking, Asynchronous I/O
From the programmer's perspective, I/O calls exhibit one of three behaviours:
- Blocking — process suspended until I/O completes. Easy to use — but insufficient for some needs.
- Non-blocking — call returns all available data immediately (possibly 0 bytes). Usually paired with select for readiness-checking. Relies on multi-threading.
- Asynchronous — process continues while I/O executes; the I/O subsystem explicitly signals completion. Most flexible and potentially most efficient — but most complex to use correctly.
I/O Structure Implementation
- Synchronous — control returns only on completion; wait idles the CPU until the next interrupt, or loop polling for completion; at most one I/O request outstanding at a time.
- Asynchronous — control returns immediately; a device-status table tracks each device (type, address, state). OS updates the table on interrupt.
Kernel Data Structures, Buffering, Other Issues
The OS kernel maintains state for open file tables, network connections, character device states — many complex, performance-critical structures tracking buffers, memory allocation, dirty blocks.
Vectored I/O
One syscall, multiple I/O operations. Unix readv/writev accept a vector of buffers to read into (readv) or write from (writev). This scatter-gather approach reduces context-switch and syscall overhead compared to multiple individual calls. Some versions provide atomicity — useful to avoid concurrent-data issues during I/O.
Buffering Strategies
- Single buffering — OS assigns one system buffer per user request.
- Double buffering — process consumes from one buffer while the system fills the next.
- Circular buffering — most useful for bursty I/O.
Details often dictated by device type: character devices buffer by line; network is bursty; block devices are the major users of I/O buffer memory.
- Process demand > data rate → process waits.
- Data rate > system capability → buffers fill, data spills.
Other Issues
- Caching
- Fast memory holding a copy of data for reads and writes — critical to I/O performance.
- Scheduling
- Order I/O requests in per-device queues; some OSes attempt fairness.
- Spooling
- Queue output for a single-user device (e.g. printer).
- Device reservation
- Syscalls to acquire/release exclusive device access (use with care).
- Error handling
- Return error number/code; log to system error log (transient failure, disk full, device unavailable).
- Protection
- All I/O instructions are privileged; memory-mapped I/O regions are memory-protected; I/O performed only via syscalls.
Mass Storage
Hard Disks (HDs)
- Stack of platters (historically 0.85″ to 14″; now mostly 3.5″, 2.5″, 1.8″).
- Typical capacity: 30 GB – 3 TB and growing.
- Transfer rate: theoretical 6 Gb/s, effective ~1 Gb/s.
- Seek time: 3–12 ms (≈ 9 ms common).
- Rotation: 7200 or 15,000 RPM.
Access latency = average seek time + average rotational latency
Average I/O time = access latency + (transfer amount / transfer rate) + controller overhead
Worked example: 4 kB block, 7200 RPM, 5 ms seek, 1 Gb/s transfer, 0.1 ms controller overhead:
- Rotational latency = half a rotation on average = ½ × 60/7200 s = 4.17 ms.
- Transfer = 4096 × 8 bits ÷ 1 Gb/s = 0.031 ms.
- Total = 5 + 4.17 + 0.031 + 0.1 = 9.301 ms.
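The worked example as arithmetic (taking Gb = 2³⁰ bits, as the 0.031 ms figure implies):

```python
seek_ms = 5.0
rot_ms = 0.5 * 60_000 / 7200            # average half-rotation: ~4.17 ms
xfer_ms = 4096 * 8 / 2**30 * 1000       # 4 kB at 1 Gb/s: ~0.031 ms
ctrl_ms = 0.1
total_ms = seek_ms + rot_ms + xfer_ms + ctrl_ms   # ~9.30 ms
```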
Solid State Disks (SSDs)
Non-volatile memory used like a hard drive.
| Pros | Cons |
|---|---|
| No moving parts — no seek or rotational latency | Reads/writes wear out cells |
| Much faster | More expensive per MB |
| Can be more reliable than HDs | Lower capacity |
Disk Scheduling
The disk controller receives a sequence of read/write requests to schedule — analogous to CPU scheduling but with very different mechanisms and constraints.
Example queue (cylinders): 98, 183, 37, 122, 14, 124, 65, 67 (head initially at 53).
First-Come First-Served (FCFS)
Intrinsically fair but inefficient. Services requests in arrival order — head moves back and forth wildly.
Shortest Seek-Time First (SSTF)
Service the nearest queued request. On the example: only 236 cylinders of movement — about 1/3 of FCFS. Analogous to SJF:
- Big improvement but allows starvation.
- Not optimal — from 53, moving to 37 then 14 before 65 gives only 208 cylinders.
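Both figures can be checked in a few lines; sstf_order greedily picks the nearest pending cylinder:

```python
def seek_distance(start, order):
    """Total head movement, in cylinders, servicing `order` from `start`."""
    total, pos = 0, start
    for cyl in order:
        total += abs(cyl - pos)
        pos = cyl
    return total

def sstf_order(start, pending):
    """Repeatedly pick the request closest to the current head position."""
    order, pos, pending = [], start, list(pending)
    while pending:
        nxt = min(pending, key=lambda c: abs(c - pos))
        pending.remove(nxt)
        order.append(nxt)
        pos = nxt
    return order

queue = [98, 183, 37, 122, 14, 124, 65, 67]   # head starts at 53
# FCFS services in arrival order: 640 cylinders. SSTF: 236.
# The hand-picked order 37, 14, 65, 67, 98, 122, 124, 183 gives only 208.
```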
SCAN (Elevator)
Start at one end, move to the other, service everything on the way. When the head reverses, those in the vicinity have just been served — those furthest away have waited longest.
Circular-SCAN (C-SCAN)
After reaching the end, jump back to the start without servicing. Cylinders treated as a circular list. Produces more uniform wait times.
Disk Management & On-Disk Structures
Formatting
- Low-level (physical) formatting
- Divides disk into sectors. Each sector holds header info + data (usually 512 bytes) + ECC.
- Logical formatting
- Creates the filesystem. Partitions the disk into cylinder groups treated as logical disks; groups blocks into clusters for efficiency.
Disk I/O in blocks; file I/O in clusters. Some applications (databases) prefer raw block access.
Booting from Disk
- BIOS (or similar) is "firm-coded" to read the first block of the first disk.
- First block contains the bootloader, which is executed.
- Bootloader reads the partition table, then more bootloader code, then the kernel.
- Sometimes needs chain-loading to parse complex filesystems.
- Bad-block handling via sector sparing: spare good blocks logically substitute for bad ones.
On-Disk Structures
A disk: boot block + one or more partitions. A partition is a contiguous range of N fixed-size blocks of size k, containing a filesystem.
Ordered by size: |data blocks| ≫ |inode table| ≫ |superblock|.
The superblock contains metadata — total/free blocks, start of free-block/free-inode lists. On-disk: a chain of tables whose head lives in the superblock — so superblock and inode-table are vulnerable to head crashes and must be replicated in practice.
Modern Linux supports a wide range of filesystems (ext4, XFS, ZFS, btrfs, ...).
Files, Metadata, and Operations
Files
Basic abstraction for non-volatile storage. Typically a single contiguous logical address space. Types:
- Data — numeric, character, binary.
- Program — source, object, executable.
- Documents.
Internal structure varies: none (byte/word sequence), simple records (lines, fixed/variable-length), or complex (formatted document, relocatable object).
File System Architecture
- Directory service — maps names to file identifiers and metadata; handles access and existence control.
- Storage service — stores data on disk (including directories).
Each partition is formatted with a filesystem — logically a directory plus files. The directory maps human names (e.g. hello.java) to a System File ID (SFID), typically an integer.
File Metadata
The SFID → File Control Block (FCB) mapping is filesystem-specific. Directory entries typically hold:
- Type — file or directory;
- Location — pointer to file location on device;
- Size;
- Protection — who can read/write/execute;
- Time, date, user identification — for security and usage tracking.
The OS also tracks open files in an open-file table:
- File pointer / cursor — last r/w location per process.
- File-open count — remove from table when the last process closes.
- On-disk location — cached data access info.
- Access rights — per-process access mode.
File and Directory Operations
The file is an ADT over (possibly structured) bytes.
- Create — allocate blocks.
- Open/Close — handle to the file with cursor.
- Delete — return blocks to the free list.
- Stat — return file metadata (existence + status).
- Write — data at cursor.
- Read — data at cursor into provided memory.
- Truncate — clip length at cursor.
Access Patterns
- Random access — seek moves the cursor without I/O.
- Sequential access — only rewind moves the cursor back to the start.
Opening a file: the in-memory directory structure (previously read from disk) resolves the file name to a file control block.
Reading a file: per-process open-file table → file descriptor → system-wide open-file table → FCB → data blocks on disk.
Directory Structure & Mounting
Directories provide: grouping, naming (same name for different files, multiple names for one file), efficiency.
Evolution
- Single-level
- Simplest but naming and grouping problems.
- Two-level (FAT)
- Same names for different users via paths; efficient search but no grouping.
- Tree-structured
- Naming convenience, efficient search, grouping. Introduces the current working directory (CWD). Absolute or relative paths — the latter resolved against CWD.
- Acyclic-graph (DAG)
- Share subdirectories and files; a file can have multiple absolute names (aliasing).
DAG Issues
- When to delete. Use back-references or reference counting (Unix hard links).
- Who owns the storage? For accounting and permissions.
- Cycle avoidance. Typically forbid links to subdirectories.
Hard vs Soft Links (Unix)
- Hard link — an instance of a file in a directory. Reference-counted in the inode; the file is removed when the count reaches zero. Directories cannot have multiple hard links (cycle avoidance).
- Soft (symbolic) link — a normal file whose contents are a filename, interpreted by the filesystem. Can cross filesystems; can dangle.
File-System Mounting
A filesystem must be mounted at a mount point before use. Another partition's unmounted filesystem is overlaid at that directory in the host namespace.
Consistency, Efficiency, Buffer Caches
Consistency Issues
Arise even without multiple threads.
Example: deleting a file (via unlink, invoked from shell as rm):
- Check user has sufficient permissions on the file (write access).
- Check user has sufficient permissions on the directory (write access).
- Remove the directory entry.
- Decrement reference count on the inode.
- If count is zero, free data blocks and inode.
If the system crashes between steps, fsck walks the filesystem: unreferenced blocks are marked free; doubly-referenced blocks have reference counts updated.
Efficiency and Performance
Depends on:
- Disk allocation and directory algorithms.
- Metadata types in directory entries (creation vs modified vs access time).
- Pre-allocation vs on-demand allocation of metadata structures.
Performance measures:
- Keep data and metadata close together.
- Create a buffer cache of often-used blocks.
- Synchronous writes (sometimes requested by apps / needed by OS) bypass the cache.
- Asynchronous writes more common — can be buffered, faster.
Buffer Cache Architectures
- Dual caches — a page cache holds pages (using VM techniques and addresses); a separate buffer (disk) cache holds disk blocks. Memory-mapped I/O goes through the page cache; routine file I/O through the buffer cache.
- Unified cache — a single page cache serves both memory-mapped I/O and regular disk I/O. Modern Linux.
UNIX / Linux History & Philosophy
UNIX (1969)
Developed in 1969 by Thompson & Ritchie at Bell Labs as a reaction to Multics's bloat. Focus on ease of use (interactive shell). In 1973, rewritten from assembly to (portable) C — a remarkable choice for performance-critical code.
Key milestones:
- 1976 — V6: source code released → features easily absorbed into other OSes.
- 1978: two main families — System V (AT&T) and BSD (Berkeley).
- POSIX standard attempted re-unification.
- 1983 — 4.2BSD: DARPA-funded TCP/IP stack.
- Virtual memory, networking added over time.
Linux (1991)
Linus Torvalds released v0.01 in May 1991 — small self-contained kernel, open source, 80386 only, Minix filesystem only.
- 1.0 (March 1994): TCP/IP + BSD socket interface, device drivers for IP on Ethernet, enhanced FS, SCSI support.
- 1.2 (March 1995): final PC-only kernel.
- Odd-numbered kernels are development; even are production.
Key UNIX Features
- Separation of kernel from user space — editors, compilers, etc. are just applications.
- Processes are the units of scheduling and protection; the shell is just another process.
- All I/O looks like file operations — "everything is a file".
Linux Design Principles
- Multiuser, multitasking with a full UNIX-compatible toolset.
- Traditional UNIX file-system semantics; standard UNIX networking.
- POSIX compliant (at least two distributions).
- Main goals: speed, efficiency, standardisation — constant tension with security.
- Supports Pthreads and a subset of POSIX real-time process control.
- Programming interface has SVR4 semantics, not BSD.
Linux Structure & Kernel Modules
Three Main Pieces
- Kernel
- Maintains OS abstractions. Executes in kernel mode with full resource access. All code and data share a single address space.
- System libraries
- Standard functions applications use to interact with the kernel. Implement much OS functionality that doesn't need privileges.
- System utilities
- User-mode programs performing specialised management tasks.
Kernel Modules
Sections of kernel code compiled, loaded, and unloaded independently. Implement device drivers, file systems, networking protocols. The module interface is stable — third parties can distribute non-GPL components.
Benefits:
- A Linux system can boot with a minimal kernel, drivers loaded on demand.
- Dynamic loading requires conflict resolution — kernel manages modules contending for hardware; resource reservation via kernel before granting access.
Process Model — Identity, Environment, Context
UNIX separates process creation from running a new program:
- fork — creates a new process (a clone of the parent).
- exec — runs a new program (replaces the memory image).
A Linux process's properties split into three groups:
Identity
- Process ID (PID) — unique identifier.
- Credentials — user ID + one or more group IDs.
- Personality — for emulating system-call semantics of other systems.
- Namespace — specific view of the FS hierarchy (typically shared, can be unique — basis for containers).
Environment
Inherited from parent as two null-terminated vectors:
- Argument vector — command-line arguments.
- Environment vector — NAME=VALUE pairs.
Flexible way to pass information between user-mode components; per-process customisation.
Context
The (constantly changing) state of a running program:
- Scheduling context — needed to suspend/restart; also resource accounting.
- File table — array of pointers into kernel file structures; I/O syscalls use indexes (the file descriptors).
- File-system context — current root and default directories for new file searches.
- Signal-handler table — per-signal handler routine.
- Virtual-memory context — full contents of the process's private address space.
Processes vs Threads
Internally the same: a thread is just a new process that shares its parent's address space. Both called tasks by Linux; clone (rather than fork) creates a task sharing parent structures. clone gives fine control over exactly what is shared: file system, memory space, signal handlers, open files.
Task Lifecycle (Linux)
Five states:
- Running / Runnable (R)
- On CPU or in run queue.
- Interruptible Sleep (S)
- Waiting; can be woken by signal.
- Uninterruptible Sleep (D)
- Waiting; cannot be woken by signal (typically in-kernel I/O).
- Stopped (T)
- SIGSTOP received; SIGCONT resumes.
- Zombie (Z)
- Terminated; waiting for parent to reap.
CFS — The Completely Fair Scheduler
Traditional UNIX scheduling used fixed time slices and priorities with boost/penalty — 100 ms quantum, round-robin within priority levels, priority from base + run-queue length + nice. Worked OK for early time-sharing but didn't scale or give good interactive response for current systems.
CFS (since Linux 2.6.23)
No more time slices. Out of N tasks, each "should" get 1/N of the CPU. Adjust by nice value (−20 high, +19 low). A task j runs for time ∝ wj / Σ wi.
Target Latency
The interval in which every runnable task should run at least once.
- Target latency 10 ms, two equal-priority tasks → each runs 5 ms.
- Ten tasks → 1 ms each.
- 1000 tasks? Context-switch overhead would swamp useful work — so a minimum granularity caps the shortest slice a task can get.
vruntime & Red-Black Tree
CFS maintains a per-task virtual run time variable vruntime:
- Scheduler picks the task with the lowest vruntime.
- Default case: vruntime == actual run time.
- Lower priority → vruntime advances faster than real time, so the task appears to have run "more" and is picked less often.
Implemented as a red-black tree with the left-most bottom-most entry (lowest vruntime) cached — O(1) pick, O(log n) insert/delete.
Kernel Synchronisation
Kernel-mode execution enters in two ways:
- Process requests an OS service — explicit syscall or implicit (e.g. page fault).
- Device driver delivers a hardware interrupt.
Kernel critical sections must run without interruption by another critical section:
- Pre-2.6: kernel code non-preemptible; timer interrupts only set need_resched.
- Post-2.6: spin locks, or enable/disable preemption.
Top-Half / Bottom-Half ISRs
Long critical sections shouldn't need long interrupt-disable. Split ISRs:
- Top half — normal ISR, runs with recursive interrupts disabled.
- Bottom half — runs with all interrupts enabled, by a mini-scheduler that ensures bottom halves never self-interrupt.
Plus a mechanism to disable selected bottom halves while in foreground kernel code.
Linux IPC
- Signals
- Process-to-process; limited number; carry no data.
- Wait queues
- Inside the kernel. Process places itself on the queue for an event, informs scheduler it's ineligible. All waiters woken when the event completes.
- Pipes
- Just another inode in the VFS; each has a reader/writer pair of wait queues.
- Shared memory
- Fast but no built-in synchronisation. Persistent, like a small independent address space.
Linux Physical & Virtual Memory
Physical Memory
Linux splits memory into zones based on hardware characteristics: DMA, DMA32, NORMAL, HIGHMEM. Architecture-specific:
- x86_32: some devices only address the first 16 MB → that's the DMA zone. HIGHMEM is memory not mapped into kernel space; rest is NORMAL.
- x86_64: small 16 MB DMA zone for legacy; rest is ZONE_NORMAL.
- Other constraints: some devices only address the first 4 GB even with 64-bit addresses.
Page Allocation — Buddy Heap
The page allocator allocates and frees all physical pages, including ranges of physically-contiguous pages. Uses a buddy-heap algorithm:
- Each allocatable region paired with an adjacent partner.
- Two allocated partners freed together merge into a larger region.
- No small free region for a small request? Subdivide a larger one in two.
Slab Allocation
Kernel allocation happens either statically (drivers reserve contiguous memory at boot) or dynamically via the page allocator. Linux uses a slab allocator for small, frequently-allocated kernel objects — pre-allocated pools for specific types (e.g. task_struct).
Page Cache
The page cache manages physical memory as the kernel's main file cache, main I/O mechanism for block devices. Stores entire pages of file contents for both local and network file I/O.
Virtual Memory
The VM manager maintains two views of each process's address space:
- Logical view — layout as a set of non-overlapping page-aligned regions (VMAs).
- Physical view — the process's hardware page tables.
VM Regions
Characterised by:
- Backing store — where the pages come from (usually a file; demand-zero memory for anonymous).
- Reaction to writes — page sharing or copy-on-write.
The paging system uses a page-out policy to move pages to/from backing store via the paging mechanism.
VM Creation
Two reasons to create a new address space:
- exec — Existing process gets a new, empty virtual-address space. Program loader populates it with VM regions.
- fork — New process gets a complete copy of parent's address space. Kernel copies parent's VMA descriptors, creates new page tables for the child, then copies parent's PTEs incrementing reference counts — parent and child initially share physical pages (COW).
The kernel reserves an architecture-dependent static area containing PTEs for every physical page (eases logical↔physical translation in kernel) plus an unreserved region.
Running a Program (ELF)
The kernel has a function table for program loading supporting multiple binary formats — most commonly ELF.
- ELF programs have a header plus several page-aligned sections.
- Pages initially mapped into virtual memory, then faulted in.
- Non-statically-linked programs have unresolved symbols → dynamic linker stubs map the link library into memory.
- Shared libraries are typically compiled to position-independent code (PIC) so they can be loaded anywhere.
File Systems & the VFS
To the user, Linux presents a hierarchical directory tree with UNIX semantics. Devices are special files; proc computes its contents on demand using the inode number to identify operations.
Virtual File System (VFS)
An abstraction layer hiding filesystem details. Four core objects:
- inode
- Represents an individual file.
- file
- An open file.
- superblock
- An entire file system.
- dentry
- Individual directory entry.
Manipulated via a set of operations. Example file operations:
int (*open) (struct inode *, struct file *);
ssize_t (*read) (struct file *, char __user *, size_t, loff_t *);
ssize_t (*write) (struct file *, const char __user *, size_t, loff_t *);
int (*mmap) (struct file *, struct vm_area_struct *);
inode-Based Implementation
UNIX filesystems use inodes (index nodes) as FCBs. An inode uses a combined scheme: it contains pointers to blocks, pointers to pointers to blocks, and so on — direct blocks, single indirect, double indirect, triple indirect.
Alternative: linked schemes — an index block points to blocks and ends with either null or a pointer to the next index block.
Directories and Links
A directory is just a file (itself pointed to by an inode) mapping filenames → inodes.
- Hard link — an instance of a file in a directory. Reference counted; the file is removed when the count hits zero. Hard links to directories are disallowed (cycle avoidance).
- Soft/symbolic link — a normal file whose contents is a filename, interpreted by the filesystem.
In-Memory Tables
Three-level structure:
- Per-process open-file table indexed by file descriptor.
- System-wide open-file table — multiple processes may operate on the same file (even while it's being deleted).
- In-memory inode table.
Access Control
Every object uses the same mechanism: unique numeric identifiers.
- UID — single user.
- GID — rights shared by a group of users.
Processes have one UID but can have multiple GIDs. Access check order:
- Process UID matches object UID → user/owner rights.
- Else process GID matches object GID → group rights.
- Else → world rights.
- Root UID gets automatic rights to everything.
Each inode holds a protection mask: three bits for each of owner/group/world:
- Files: read / write / execute.
- Directories: read entry / write entry / traverse.
setuid and setgid
Normally processes inherit permissions of the invoking user. setuid/setgid bits allow a user to "become" someone else for the duration of a program.
Example: a sit-exam application owned by the examiner with permissions 0711 plus setuid, and a test-scores file also owned by the examiner with permissions 0600. Students can run the program (which temporarily gains examiner privileges to read/write the scores), but cannot access the scores file directly.
Rights can also be passed by forwarding fds down a local network socket — a print server might receive a descriptor for a file to print without needing rights to read the user's other files.
Linux I/O
Device-oriented file-system access uses two caches:
- Page cache — caches data, unified with virtual memory.
- Buffer cache — caches metadata separately, indexed by physical disk block.
Three Device Classes
- Block devices
- Random access to independent fixed-size blocks. The block buffer cache is a pool of buffers for active I/O and a cache for completed I/O. The request manager handles reads/writes using Completely Fair Queueing (CFQ).
- Character devices
- No random access; the driver passes requests on directly. Main exception: terminal devices, where a line discipline interprets device data. E.g., tty discipline glues stdin/stdout onto terminal data/output streams.
- Network devices
- Complex structure: socket interface, protocol drivers, device drivers, plus firewall management / filtering / marking.
Buffer Cache
Maintains copies of some disk parts in memory. Reads:
- Locate relevant blocks from inode.
- Check buffer cache.
- If miss: read disk → buffer cache.
- Return data from buffer cache.
Writes are the same except the final step updates the cached version. Typically prevents the majority (~85%) of implied disk transfers — but risks losing data in a crash before flush.
Periodically (every 30 s) dirty buffers are flushed to disk.
Start of Day, The Shell, Standard I/O
UNIX Start of Day
The kernel (/vmunix) is loaded from disk and executed, mounts the root filesystem. The bootloader must be able to read from disk. The first process (PID 1) — traditionally /etc/init — is hand-crafted.
init reads /etc/inittab. For each entry:
- Opens the terminal special file (e.g. /dev/tty0), duplicates the fd twice, forks an /etc/tty process.
- The tty process initialises the terminal, prints login:, waits for input.
- On receiving input: execve /bin/login.
/bin/login then:
- Prints password:, waits for input.
- Hashes the input, checks against /etc/passwd.
- If match: sets UID and GID, execves the user's shell.
When the shell exits, the parent init resurrects the /etc/tty process and the cycle restarts.
Shell Operation
Just another process — doesn't need to understand commands, just files. CWD avoids the need for fully-qualified pathnames. Command-line parsing can be complex:
- Wildcard expansion (globbing).
- Tilde processing (~).
- Trailing & puts the forked process into the background.
Standard I/O
Every process has three file descriptors on creation:
- stdin (0)
- Source of input.
- stdout (1)
- Output destination.
- stderr (2)
- Diagnostics destination.
Inherited from parent but redirectable:
# Redirect stdout to file
ls > listing.txt
# Redirect both stdout and stderr
ls >& listing.txt
# Redirect stdin from file
sh < commands.sh
Pipes chain processes:
# Without pipe — temp file
ls > temp.txt; wc < temp.txt > results
# With pipe — cleaner, no temp file
ls | wc > results
Unix command lines can become very complex with multiple filters chained. Redirection introduces some buffering subtleties (line-buffered vs block-buffered output).