Rust Fundamentals
Chapter: Basic System Concepts

How a Program Runs: CPU & Instructions

Ishtmeet Singh · March 3, 2025 · 49 min read
#cpu #instructions #computer-architecture #systems-programming #mental-model

Every program you've ever written, and every program you ever will write, has a secret life. Whether you write elegant Python scripts, build sprawling enterprise applications in Java, or design interactive web pages with JavaScript, the code you type is just the beginning of a fascinating journey. Your carefully named variables, your clever loops, and your complex data structures all resolve into something far simpler and yet profoundly powerful. They become electrical signals, patterns of high and low voltage flashing through a labyrinth of silicon, a process happening billions of times every second.

This journey ends at the heart of your computer: a small, intricate chip called the Central Processing Unit, or CPU. This is the universal machine, the final destination for every if statement, every function call, and every single character of text you've ever processed. The code may start in different languages, with different syntaxes and philosophies, but it all ends up speaking the same fundamental tongue-the language of the processor.

This raises a profound question: How does a lifeless piece of silicon, a mere sliver of purified sand, "understand" and execute our logical instructions? How does it transform a command like print("Hello, World!") into pixels on a screen? It feels like magic, but it isn't. It's one of the most brilliant and foundational concepts in computer science.

Our goal in this chapter is to uncover this magic. We will peel back the layers of abstraction that you, as a programmer, work with every day. We won't be writing a single line of Python, Java, or Rust. Instead, we're going to build a mental model of the machine itself. By the time we're done, you'll have a clear picture of how your code-any code-is ultimately a set of simple commands for a very fast, very obedient, but fundamentally simple-minded servant: the CPU. Understanding this foundation will make every other concept, from memory management to compilation, fall into place. Let's begin.


What Is a CPU? The Brain?

At its core, a Central Processing Unit (CPU) is the component that executes all computational instructions in the computer. But calling it the "brain" is misleading. A human brain can reason, feel, and understand context. A CPU can do none of these things. It has no intuition, no creativity, and no awareness of the "meaning" behind the data it processes.

What the CPU actually does is much simpler: it's an instruction execution unit that processes commands one after another with perfect precision and incredible speed.

The CPU operates by reading from its instruction list - a sequence of extremely simple, explicit commands stored in memory. Its sole function is to fetch the first instruction, execute exactly what it specifies, then advance to the next instruction, repeating this process billions of times per second until the power is turned off.

So what fundamental operations can the CPU actually perform? First up is simple math - your basic add, subtract, multiply, and divide. That's straightforward enough. Then there's moving data around, which is really just transferring values between different storage locations. You know, taking something from main memory and putting it into the CPU's internal registers, or vice versa.

The CPU can also make comparisons between two numbers - checking if one is larger, smaller, or equal to the other. But here's where it gets interesting: based on those comparisons, the CPU can actually change which instruction it executes next. Instead of just plowing through instructions in order, it can jump to a completely different part of its instruction list. This jumping ability? That's what gives us the power to write if statements and loops.

And that's pretty much it. The CPU doesn't know it's calculating a user's bank balance, rendering a video game, or spell-checking a document. It just sees instructions like "take the number from location A, add it to the number from location B, and put the result in location C."

The magic of modern computing comes from two key facts:

  1. It is extremely fast. The CPU can perform these simple operations billions of times every single second. A modern 4-5 Gigahertz (GHz) processor can perform 4-5 billion of these fetch-and-execute cycles per second.
  2. It works at enormous scale. Complex software isn't born from complex instructions. It emerges from combining millions or billions of these ridiculously simple instructions. Your beautiful graphical user interface? It's just a mountain of tiny instructions telling the CPU to move and color individual pixels. Your favorite video game? Billions of simple math operations calculating physics, lighting, and character positions.

The programmer's job-and the job of the tools like compilers that we'll discuss later-is to break down a complex human goal ("show me my friend's photos") into a very, very long list of simple instructions that the CPU can execute.


The Simplest Possible Computer

To truly understand how complexity arises from simplicity, let's build a hypothetical computer from the ground up. Forget about your laptop or smartphone for a moment. Imagine a machine far more primitive.

Let's call our machine the GEN-Z-1000.

At its heart, the GEN-Z-1000 has a CPU that understands only one instruction: ADD a, b. This takes two numbers, a and b, adds them together, and displays the result.

What can we do with this machine? We can add 5 and 3. We can add 10 and 20. It's a basic calculator, and not a very useful one. It can't even save its result. It just shows it, and then it's gone.

Adding Memory

Let's make an upgrade. We'll give our CPU two new instructions for interacting with memory. Memory is organized as a series of numbered storage locations, where each location can hold a single number.

  1. STORE address, value - Takes a value and writes it to the memory location at a specific address.

  2. LOAD register, address - Reads the value from the memory location at address and places it into a temporary holding spot inside the CPU itself.

These temporary holding spots are crucial. They serve as the CPU's internal working storage, and they have a special name: registers. For now, let's say our GEN-Z-1000 has a few registers named R1, R2, and R3.

Now, we can update our ADD instruction to use these registers:

ADD R3, R1, R2 - Takes the number in register R1, adds it to the number in register R2, and puts the result into register R3.

With this upgrade, we can write a simple "program," a sequence of instructions:

# Our first program: Calculate 5 + 10
1: LOAD R1, [address_200]   # Load the number from memory address 200 into R1
2: LOAD R2, [address_201]   # Load the number from memory address 201 into R2
3: ADD R3, R1, R2           # Add the values in R1 and R2, store result in R3
4: STORE [address_202], R3  # Store the result from R3 into memory address 202

Let's assume memory location 200 contains the number 5 and location 201 contains the number 10. After the CPU executes these four instructions, location 202 will hold the number 15. We've performed a calculation and saved the result! This is a monumental leap. We now have a machine that can not only compute but also remember.
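Since this is a Rust book, here's a minimal sketch of the GEN-Z-1000 in Rust. The Instr enum, the register count, and the run function are all our own invention for illustration, not a real ISA; they just mirror the four-instruction program above.

```rust
// A minimal sketch of the hypothetical GEN-Z-1000, in Rust.
#[derive(Clone, Copy)]
enum Instr {
    Load { reg: usize, addr: usize },        // LOAD Rn, [addr]
    Add { dest: usize, a: usize, b: usize }, // ADD Rd, Ra, Rb
    Store { addr: usize, reg: usize },       // STORE [addr], Rn
}

// Execute each instruction in order, using a small bank of registers.
fn run(program: &[Instr], memory: &mut [i64]) {
    let mut regs = [0i64; 4]; // R0..R3
    for instr in program {
        match *instr {
            Instr::Load { reg, addr } => regs[reg] = memory[addr],
            Instr::Add { dest, a, b } => regs[dest] = regs[a] + regs[b],
            Instr::Store { addr, reg } => memory[addr] = regs[reg],
        }
    }
}

fn main() {
    let mut memory = vec![0i64; 256];
    memory[200] = 5;
    memory[201] = 10;

    // Our first program: calculate 5 + 10 and save the result.
    let program = [
        Instr::Load { reg: 1, addr: 200 },
        Instr::Add { dest: 3, a: 1, b: 2 },
        Instr::Load { reg: 2, addr: 201 },
        Instr::Store { addr: 202, reg: 3 },
    ];
    // Reorder so the ADD sees both operands: LOAD, LOAD, ADD, STORE.
    let program = [program[0], program[2], program[1], program[3]];

    run(&program, &mut memory);
    println!("memory[202] = {}", memory[202]); // prints 15
}
```

Running this leaves 15 at memory location 202, exactly as the walkthrough describes.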

Adding Logic

Our machine is still just a calculator. It executes instructions sequentially, one after the other, without fail. It has no way to make decisions. The most powerful tool a programmer has is the if statement-the ability to choose a path based on a condition. How can we give this power to our GEN-Z-1000?

We need two more instructions:

  1. COMPARE R1, R2 - Compares the values in two registers. It doesn't produce a result in another register. Instead, it sets a special, invisible flag inside the CPU. Let's call this the Equality Flag. If the numbers are equal, the flag is set to "true." If they aren't, it's set to "false."

  2. JUMP_IF_EQUAL address - This is the game-changer. It tells the CPU: "Check the Equality Flag. If it's 'true', then your next instruction is not the one immediately below. Instead, jump to the instruction located at the specified address." If the flag is "false," it does nothing, and the CPU just proceeds to the next instruction as usual.

Now, we can write code that makes decisions. Let's write a very basic program (a one-liner in our favorite language) that checks whether a number is equal to 10. If it is, it puts a 1 in a different memory location; if not, a 0.

# Program: Check if the value at address 200 is 10.
# If it is, put a 1 at address 300. Otherwise, put a 0.

# Note: each line starts with a number - that's the instruction's address

# Let's assume address 200 contains the number we want to check.
# And let's say address 201 contains the number 10.

1: LOAD R1, [address_200]   # Load our number into R1
2: LOAD R2, [address_201]   # Load the constant 10 into R2
3: COMPARE R1, R2           # Compare them. Sets the Equality Flag.

# The crucial decision point
4: JUMP_IF_EQUAL 100        # If they are equal, jump to instruction at address 100

# This code only runs if the numbers were NOT equal:
5: LOAD R3, [value_zero]    # R3 is now 0
6: STORE [address_300], R3  # Store 0 in the result location
7: JUMP 102                 # Unconditionally jump past the "equal" case

# This code only runs if we jumped from line 4:
100: LOAD R3, [value_one]   # R3 is now 1
101: STORE [address_300], R3 # Store 1 in the result location

102: HALT                   # End of program

Look at what we've just done! We've created an if/else statement. We provided two separate paths of execution, and the CPU chose which path to take based on the data it was processing. We also added an unconditional JUMP to make sure only one of the two paths was executed.

With LOAD, STORE, ADD, COMPARE, and JUMP, our GEN-Z-1000 is now surprisingly powerful. We can create loops by jumping backward in the instruction list. We can create functions by jumping to a set of instructions and then jumping back. In fact, a machine with these basic capabilities-reading/writing memory, performing arithmetic, and conditional jumping-is considered Turing complete. This is a formal way of saying that it can theoretically compute anything that any other computer can compute, from calculating pi to rendering a Hollywood movie. It might take a ridiculously long time and an enormous number of instructions, but it is possible.
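To make the jump instructions concrete, here's a Rust sketch of the upgraded GEN-Z-1000 with an explicit instruction pointer. The names and encoding are our own invention; the point is that a backward jump is all it takes to build a loop. This toy program sums 1 through 5.

```rust
// A sketch of the GEN-Z-1000 with COMPARE and jumps, in Rust.
#[derive(Clone, Copy)]
enum Instr {
    LoadConst { reg: usize, value: i64 },    // load an immediate value
    Add { dest: usize, a: usize, b: usize },
    Compare { a: usize, b: usize },          // sets the Equality Flag
    JumpIfEqual { target: usize },
    Jump { target: usize },
    Halt,
}

// Run until HALT; return the final register values.
fn run(program: &[Instr]) -> [i64; 4] {
    let mut regs = [0i64; 4];
    let mut equal_flag = false;
    let mut ip = 0; // instruction pointer
    loop {
        match program[ip] {
            Instr::LoadConst { reg, value } => regs[reg] = value,
            Instr::Add { dest, a, b } => regs[dest] = regs[a] + regs[b],
            Instr::Compare { a, b } => equal_flag = regs[a] == regs[b],
            Instr::JumpIfEqual { target } => {
                if equal_flag { ip = target; continue; }
            }
            Instr::Jump { target } => { ip = target; continue; }
            Instr::Halt => return regs,
        }
        ip += 1; // default: advance to the next instruction
    }
}

fn main() {
    // Sum 1 + 2 + ... + 5 using only a backward jump:
    let program = [
        Instr::LoadConst { reg: 1, value: 0 }, // 0: R1 = running total
        Instr::LoadConst { reg: 2, value: 0 }, // 1: R2 = counter
        Instr::LoadConst { reg: 3, value: 1 }, // 2: R3 = constant 1
        Instr::LoadConst { reg: 0, value: 5 }, // 3: R0 = limit
        Instr::Compare { a: 2, b: 0 },         // 4: counter == limit?
        Instr::JumpIfEqual { target: 9 },      // 5: yes -> halt
        Instr::Add { dest: 2, a: 2, b: 3 },    // 6: counter += 1
        Instr::Add { dest: 1, a: 1, b: 2 },    // 7: total += counter
        Instr::Jump { target: 4 },             // 8: the backward jump!
        Instr::Halt,                           // 9: done
    ];
    let regs = run(&program);
    println!("sum = {}", regs[1]); // prints 15
}
```

Instruction 8 is the whole trick: jumping backward to instruction 4 re-runs the comparison, and the loop only escapes when the flag finally becomes true.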

Every complex program you've ever used is built upon this simple foundation.


Inside the CPU: A Closer Look at Registers

We've been using "registers" in our examples, describing them as the CPU's internal working storage. Let's formalize this concept. Registers are the most important part of the CPU for a programmer to understand.

Registers are small, extremely fast storage locations built directly into the silicon of the CPU chip itself. They are not part of the main memory (RAM). Here's what actually happens: the main memory is vast and holds all the data your program needs. But accessing main memory takes significant time - often hundreds of clock cycles. The registers, by contrast, sit right inside the CPU core. They have very limited space (typically just 16-32 of them), but any data in them is accessible in a single clock cycle.

Registers are where the CPU does its actual work.

When the CPU needs to perform an operation, like adding two numbers, it must first load those numbers from the slower main memory into registers. Once the data is in registers, the CPU can perform operations on it at maximum speed - often completing arithmetic operations in just one clock cycle.

Why So Few? The Speed vs. Space Trade-off

A typical modern CPU might have only 16 or 32 general-purpose registers, while having billions of bytes of main memory. Why so few?

It's a physical and electrical trade-off. Making registers is expensive in terms of silicon real estate. More importantly, the more registers you have, the more complex the wiring between them and the CPU's arithmetic unit becomes. This complexity introduces tiny electrical delays. To keep the CPU operating at billions of cycles per second, the distance data has to travel must be infinitesimally small. Keeping the register count low ensures they remain in this hyper-fast inner circle of the CPU.

Types of Registers

While we've mostly discussed registers for holding data, they actually come in a few different flavors, each with its own job.

First, you've got your General-Purpose Registers (GPRs) - these are the workhorses, like the R1, R2, and R3 we've been using in our examples. They temporarily store data that the program is actively manipulating. Here's something interesting: compilers work incredibly hard to keep your most frequently used variables in these registers. Why? Because every trip to main memory is painfully slow compared to register access.

Then there are Special-Purpose Registers, which aren't for storing your data at all. Instead, they manage the actual process of computation itself. The star of the show here is the Instruction Pointer (IP), sometimes called the Program Counter (PC).

The Instruction Pointer is a special register that holds the memory address of the next instruction to be executed. When the CPU finishes an instruction, it looks at the IP to know where to go next. Normally, it just increments the IP to point to the next instruction in sequence. But when it executes a JUMP instruction, the CPU's action is to change the value in the IP register, causing the flow of execution to "jump" to a new location in the program.

Other special registers exist, like the Stack Pointer and Frame Pointer, which are crucial for managing function calls. We'll touch on these in the next subchapter when we discuss memory in more detail. For now, just know that the CPU uses these special registers to keep its place and stay organized.


The Instruction Set: A CPU's Vocabulary

Every family of CPUs has its own unique "language" of instructions it understands. This language is called its Instruction Set Architecture (ISA). An instruction written for an Intel x86-64 CPU is just gibberish to an ARM CPU found in a smartphone, and vice-versa.

While the specifics differ, the types of instructions they offer are remarkably similar and fall into a few key categories. Let's formalize the hypothetical instructions we've been using.

1. Data Movement Instructions

These instructions are the logistics of computation. They don't change data; they just move it around between registers and main memory.

First, there's LOAD register_destination, [memory_address]. What this does is go to the specified memory address, grab whatever value is stored there, and place it into the destination register. So LOAD R1, [2048] copies the value from memory location 2048 into register R1. Pretty straightforward.

Then you have STORE [memory_address], register_source, which does the opposite. It takes the value currently sitting in a register and copies it into a memory location. The register keeps its value - we're just making a copy. When you write STORE [4096], R2, you're copying whatever's in R2 into memory location 4096.

There's also MOVE register_destination, register_source for when you want to copy values directly between registers. This is much faster than going through memory - no need for that slow memory bus. MOVE R1, R5 just copies R5's value into R1. Quick and simple.

2. Arithmetic and Logic Instructions

These are the instructions that actually perform calculations and manipulate data. They're often called ALU instructions because they are handled by the Arithmetic Logic Unit within the CPU.

The basic arithmetic operations follow a consistent pattern. Take ADD R_dest, R_src1, R_src2 - it adds the values in R_src1 and R_src2, then stores the result in R_dest. So ADD R3, R1, R2 calculates R1 + R2 and puts that sum into R3.

SUB R_dest, R_src1, R_src2 works the same way but with subtraction. It takes R_src1, subtracts R_src2 from it, and stores the result in R_dest. You can even overwrite one of your source registers - SUB R1, R1, R2 calculates R1 - R2 and then overwrites R1 with the result.

Multiplication and division - MUL and DIV - follow this exact same three-register pattern. The consistency here is intentional; it makes the CPU's decode logic simpler.

These instructions often include logical operations too, like AND, OR, XOR, and NOT, which operate on the individual bits of the data. These are fundamental for many low-level programming tasks.

3. Control Flow Instructions

These instructions are the most powerful. They give our programs the ability to make decisions, create loops, and call functions. They directly manipulate the Instruction Pointer.

Let's start with COMPARE R1, R2. This compares the values in two registers, but here's the interesting part - it doesn't store the result anywhere you can see. Instead, it sets internal CPU flags like the Equality Flag, Greater-Than Flag, and so on. These flags are like hidden notes the CPU keeps for itself.

Then there's JMP address - the unconditional jump. This immediately changes the Instruction Pointer to point to a new address. Write JMP 500 and boom, the next instruction executed will be whatever's at memory address 500. No questions asked.

But the real power comes from conditional jumps like JE address (Jump if Equal). This checks that Equality Flag that was set by a previous COMPARE. If the flag says "true," it changes the Instruction Pointer to jump to the new address. Otherwise? It just continues to the next instruction like nothing happened. So JE 1000 means: if the last comparison found equal values, jump to instruction 1000.

You've also got variations like JNE (Jump if Not Equal), JG (Jump if Greater), and JL (Jump if Less). Each one checks different flags that COMPARE set up, giving you all the building blocks for complex control flow.

This is the key insight: Every if statement, every for loop, every while loop, and every function call in your high-level code is ultimately implemented using a combination of COMPARE and JUMP instructions. A loop is just a jump that goes backward to a previous instruction. An if statement is a jump that skips over a block of code. It's that simple, and that profound.
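Here's one way to see that insight in Rust. Both functions below compute the same sum; the second one uses a bare loop with an explicit break, and the comments map each step to the hypothetical instructions from this chapter. (The mapping is illustrative; a real compiler produces equivalent but not identical code.)

```rust
// The familiar high-level form.
fn sum_while(n: i64) -> i64 {
    let mut total = 0;
    let mut i = 0;
    while i < n {
        i += 1;
        total += i;
    }
    total
}

// The same loop, hand-lowered to mimic compare-and-jump structure.
fn sum_jumps(n: i64) -> i64 {
    let mut total = 0; // LOAD R1, #0
    let mut i = 0;     // LOAD R2, #0
    loop {             // loop_top:
        if !(i < n) {  //   COMPARE R2, R0
            break;     //   jump past the body if i >= n
        }
        i += 1;        //   ADD R2, R2, R3
        total += i;    //   ADD R1, R1, R2
    }                  //   JMP loop_top  (the backward jump)
    total              // loop_end:
}

fn main() {
    assert_eq!(sum_while(5), sum_jumps(5));
    println!("both compute {}", sum_while(5)); // prints 15
}
```

The `while` is pure convenience: strip away the syntax and what remains is a comparison, a forward jump out, and a backward jump in.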

How CPUs Handle Complex Logic

You might wonder: "But wait, I write complex conditions all the time. What about ternary operators? What about multiple conditions with && and ||? How does the CPU know what type of data it's comparing?" Let's understand these patterns.

How Does the CPU Know What Data Type It's Working With?

Here's a profound truth: The CPU doesn't know, and it doesn't care.

When you write if (age > 18) in your code, you're thinking about age as a meaningful concept. The CPU sees this after compilation as something like:

LOAD R1, [0x4000]  # Load 4 bytes from memory address 0x4000
LOAD R2, [0x4004]  # Load the constant 18
CMP R1, R2         # Compare these bit patterns
JG adult_section   # Jump if first bit pattern represents a bigger number

The CPU has no idea that memory address 0x4000 contains someone's age. Actually, let me show you just how clueless the CPU really is about your data. Those same bits sitting in memory could represent an integer like age = 25. Or they could be a character - 'Z' is just ASCII 90 after all. Maybe they're part of a floating-point number, or a pointer to another memory location. They could be four separate byte values, or even the color values of a pixel.

So how does this interpretation actually work? Well, it's entirely determined by three things. First, the compiler generates the right instructions based on your variable types - it knows age is an integer, so it uses integer instructions. Second, the instruction itself matters - CMP treats those bits as integers, while FCMP would treat the exact same bits as floating-point numbers. And third, your program's logic is what ultimately decides what those bits mean. The CPU just follows orders.

This is why type safety in languages is so important. If you accidentally treat a pointer as an integer or a float as an integer, the CPU will happily comply, leading to nonsensical results or crashes.
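You can watch this type-blindness directly in Rust, which exposes safe bit-reinterpretation functions in the standard library. The same 32-bit pattern below reads as an integer, a float, and four raw bytes, depending purely on which instruction (here, which function) interprets it.

```rust
fn main() {
    let bits: u32 = 0x40490FDB; // just 32 bits, no inherent meaning

    let as_int = bits as i32;            // interpreted as an integer
    let as_float = f32::from_bits(bits); // same bits as an IEEE 754 float
    let as_bytes = bits.to_le_bytes();   // same bits as four bytes

    println!("as integer: {}", as_int);     // 1078530011
    println!("as float:   {}", as_float);   // 3.1415927
    println!("as bytes:   {:?}", as_bytes); // [219, 15, 73, 64]
}
```

One bit pattern, three "values". The bits never changed; only the interpretation did, which is exactly the situation the CPU is in all the time.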

The Ternary Operator

The ternary operator (condition ? true_value : false_value) feels special in high-level languages, but it's just syntactic sugar for branches:

// High-level code
let discount = age > 65 ? 0.2 : 0.0;

Becomes something like:

LOAD R1, [age_address]      # Load age
LOAD R2, #65                # Load constant 65
CMP R1, R2                  # Compare them
JG senior_discount          # Jump if age > 65

# Young person path
LOAD R3, #0.0               # Load 0.0 (no discount)
JMP store_result            # Skip the senior path

senior_discount:
LOAD R3, #0.2               # Load 0.2 (20% discount)

store_result:
STORE [discount_address], R3 # Store the result

The ternary operator doesn't exist at the CPU level-it's just a more compact way to write an if-else that returns a value. The CPU still has to evaluate the condition and jump to one of two paths.
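Rust makes this especially visible because it has no ternary operator at all: its if-else is already an expression. The two functions below are a sketch of the same discount logic written both ways; both lower to the same compare-and-jump shape sketched above (the age threshold and values are just the example's).

```rust
// If-else as an expression: Rust's equivalent of the ternary.
fn discount_expr(age: u32) -> f64 {
    if age > 65 { 0.2 } else { 0.0 } // CMP, JG, two paths, one result
}

// The same logic as an if-else statement with an explicit variable.
fn discount_stmt(age: u32) -> f64 {
    let d;
    if age > 65 {
        d = 0.2; // the "senior_discount" path
    } else {
        d = 0.0; // the "young person" path
    }
    d // both paths converge here, like "store_result"
}

fn main() {
    assert_eq!(discount_expr(70), discount_stmt(70)); // both 0.2
    assert_eq!(discount_expr(30), discount_stmt(30)); // both 0.0
}
```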

Multiple Conditions

When you write complex conditions with multiple parts, things get interesting:

if (user != null && user.age > 18 && user.hasPermission) {
  // Allow access
}

The CPU can't evaluate all three conditions simultaneously. It must check them one by one, and modern languages use short-circuit evaluation to optimize this:

# Check user != null
LOAD R1, [user_address]
LOAD R2, #0                 # null is typically 0
CMP R1, R2
JE skip_all                 # If user IS null, skip everything!

# Check user.age > 18 (only if user wasn't null)
LOAD R3, [user_address + age_offset]  # Load user.age
LOAD R4, #18
CMP R3, R4
JLE skip_all                # If age <= 18, skip everything

# Check user.hasPermission (only if previous conditions passed)
LOAD R5, [user_address + permission_offset]
CMP R5, #0                  # Assuming 0 = false
JE skip_all                 # If no permission, skip

# All conditions passed, execute the body
[... code for allowed access ...]

skip_all:
# Continue with rest of program

Notice how each failed condition immediately jumps to skip_all. This is short-circuit evaluation in action-if user is null, we never even attempt to check user.age, which would cause a crash.
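You can verify the short-circuit behavior in Rust. In this sketch, check_age stands in for the "dereference the user" step that would crash on a null user; because the left side of && is false, it is never called, just like the JE skip_all jump above.

```rust
// Stands in for user.age > 18 on a null user: calling it would crash.
fn check_age() -> bool {
    panic!("dereferenced a null user");
}

fn main() {
    let user_is_valid = false; // the "user == null" case

    // && evaluates left to right and stops at the first false,
    // so check_age() is never called and nothing panics.
    let allowed = user_is_valid && check_age();

    assert!(!allowed);
    println!("no crash: short-circuit skipped check_age()");
}
```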

For OR operations (||), it's the opposite:

if (isAdmin || isOwner || hasSpecialPermission) {
  // Allow access
}

Becomes:

# Check isAdmin
LOAD R1, [isAdmin_address]
CMP R1, #1
JE allow_access            # If admin, immediately allow!

# Check isOwner (only if not admin)
LOAD R2, [isOwner_address]
CMP R2, #1
JE allow_access            # If owner, allow!

# Check hasSpecialPermission (only if not admin or owner)
LOAD R3, [hasSpecialPermission_address]
CMP R3, #1
JE allow_access            # If special permission, allow!

# None of the conditions were true
JMP skip_access

allow_access:
[... code for allowed access ...]

skip_access:
# Continue with program

With OR, we jump to the success path as soon as any condition is true.

Complex Conditions: Mixing && and ||

When you mix AND and OR operators, the CPU must respect precedence and grouping:

if ((isWeekend || isHoliday) && hasTicket && venueOpen) {
  // Can attend event
}

The compiler typically evaluates left to right, respecting parentheses:

# First evaluate (isWeekend || isHoliday) into a temporary result
LOAD R1, [isWeekend_address]
CMP R1, #1
JE day_ok                  # Weekend is good enough

LOAD R2, [isHoliday_address]
CMP R2, #1
JNE skip_event            # Neither weekend nor holiday, skip everything

day_ok:
# Now check hasTicket
LOAD R3, [hasTicket_address]
CMP R3, #0
JE skip_event             # No ticket, skip

# Finally check venueOpen
LOAD R4, [venueOpen_address]
CMP R4, #0
JE skip_event             # Venue closed, skip

# All conditions satisfied
[... attend event code ...]

skip_event:
# Continue

Each high-level logical operator becomes multiple compare-and-jump sequences. The compiler optimizes these patterns, sometimes reordering conditions (if they have no side effects) to check the most likely-to-fail conditions first, reducing unnecessary work.

This is why the order of conditions can matter for performance:

// Slow - always loads expensive data
if (loadExpensiveData() && simpleFlag) {
}

// Fast - might skip expensive load
if (simpleFlag && loadExpensiveData()) {
}

Understanding this translation from high-level logic to CPU instructions helps explain why certain coding patterns are more efficient than others and why branch prediction is so crucial for modern CPU performance.

The Fetch-Decode-Execute Cycle

We've talked about the CPU executing a list of instructions. But how does this happen mechanically? The process is a continuous, never-ending loop called the Fetch-Decode-Execute Cycle. It's the fundamental rhythm of the computer, and it has been ticking away since the moment you turned your device on.

This cycle is managed by a component within the CPU called the Control Unit. It coordinates all the different parts of the CPU to work in precise synchronization.

Here's how it works, step by step:

First comes the FETCH phase. The Control Unit needs to know where the next instruction is, so it checks the Instruction Pointer (IP) register. The IP holds the memory address of the next instruction to execute. The Control Unit takes this address, sends it to the memory system, and says "give me whatever instruction is stored here." Back comes the instruction - just a series of binary digits like 10110010 01100011 - which gets pulled into the CPU.

Next is DECODE. The fetched instruction, still in its raw binary form, gets placed into a special Instruction Register. At this point, we have the data but don't know what it means yet. The Control Unit's decoder circuitry kicks in and analyzes this binary pattern. Is it a LOAD? An ADD? A JUMP? The decoder figures this out and also identifies the operands - which registers or memory addresses are involved. Take ADD R3, R1, R2 - the decoder recognizes both the "add" operation and that it needs to work with registers R3, R1, and R2.

Then we EXECUTE. The Control Unit now knows what needs to happen, so it sends signals to the appropriate part of the CPU. If it's an ADD instruction, the values from the source registers (R1, R2) get routed to the Arithmetic Logic Unit (ALU), which does the actual addition and sends the result to the destination register (R3). For a LOAD instruction, the Control Unit would coordinate with the memory system to fetch the required data. And for a JUMP? The Control Unit directly modifies the Instruction Pointer register, changing where we'll fetch the next instruction from.

Finally, there's the UPDATE phase. The instruction is done, and the cycle needs to continue. By default, the Control Unit increments the Instruction Pointer to point to the next instruction in memory - unless we just executed a JUMP, in which case the IP already got updated to the jump target. Either way, the CPU is ready to fetch the next instruction, and the whole cycle starts over.

Fetch, Decode, Execute. Fetch, Decode, Execute. Billions of times per second. This is computation.

A Walkthrough Example

Let's trace our simple 5 + 10 program through the Fetch-Decode-Execute cycle.

Initial State:

  • Memory[1000] contains the number 5.

  • Memory[1001] contains the number 10.

  • The Instruction Pointer (IP) starts at address 200.

Program Instructions in Memory:

  • Address 200: LOAD R1, [1000]

  • Address 201: LOAD R2, [1001]

  • Address 202: ADD R3, R1, R2

  • Address 203: STORE [1002], R3


Cycle 1

  • FETCH: The CPU looks at the IP (200). It fetches the instruction LOAD R1, [1000].

  • DECODE: The decoder identifies this as a LOAD operation, with destination R1 and source address 1000.

  • EXECUTE: The CPU requests the data from memory address 1000. The value 5 is returned and placed into register R1.

  • UPDATE: The IP is incremented to 201.

  • State: IP=201, R1=5, R2=uninitialized, R3=uninitialized


Cycle 2

  • FETCH: The CPU looks at the IP (201). It fetches LOAD R2, [1001].

  • DECODE: This is a LOAD into R2 from address 1001.

  • EXECUTE: The value 10 is fetched from memory and placed into register R2.

  • UPDATE: The IP is incremented to 202.

  • State: IP=202, R1=5, R2=10, R3=uninitialized


Cycle 3

  • FETCH: The CPU looks at the IP (202). It fetches ADD R3, R1, R2.

  • DECODE: This is an ADD operation using R1 and R2 as sources and R3 as the destination.

  • EXECUTE: The values from R1 (5) and R2 (10) are sent to the ALU. The ALU computes the sum, 15. This result is placed into register R3.

  • UPDATE: The IP is incremented to 203.

  • State: IP=203, R1=5, R2=10, R3=15


Cycle 4

  • FETCH: The CPU looks at the IP (203). It fetches STORE [1002], R3.

  • DECODE: This is a STORE operation, with source R3 and destination address 1002.

  • EXECUTE: The value from R3 (15) is sent to the memory system with the command to write it to address 1002.

  • UPDATE: The IP is incremented to 204.

  • State: IP=204, R1=5, R2=10, R3=15. Memory[1002] now holds 15.


The program is complete. The cycle continues, fetching whatever instruction is at address 204, but our intended work is done. This detailed, mechanical process is all that is happening inside the CPU, whether it's running a simple addition or a complex video game.
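The four cycles above can be run as code. This Rust sketch stores the program at addresses 200-203 and the data at 1000-1002, matching the trace; the four phases are marked in comments. As everywhere in this chapter, the instruction encoding is our own toy design.

```rust
#[derive(Clone, Copy)]
enum Instr {
    Load { reg: usize, addr: usize },
    Add { dest: usize, a: usize, b: usize },
    Store { addr: usize, reg: usize },
    Halt,
}

// Run the Fetch-Decode-Execute cycle until HALT; return the registers.
fn execute(code: &[Instr], memory: &mut [i64], start: usize) -> [i64; 4] {
    let mut regs = [0i64; 4];
    let mut ip = start; // the Instruction Pointer
    loop {
        let instr = code[ip];      // FETCH: read the instruction at IP
        match instr {              // DECODE: identify operation and operands
            Instr::Load { reg, addr } => regs[reg] = memory[addr],
            Instr::Add { dest, a, b } => regs[dest] = regs[a] + regs[b],
            Instr::Store { addr, reg } => memory[addr] = regs[reg],
            Instr::Halt => return regs,
        }                          // EXECUTE happened inside the match arm
        ip += 1;                   // UPDATE: point IP at the next instruction
    }
}

fn main() {
    let mut memory = vec![0i64; 1024];
    memory[1000] = 5;
    memory[1001] = 10;

    // Instruction memory, indexed by address; unset slots halt the CPU.
    let mut code = vec![Instr::Halt; 256];
    code[200] = Instr::Load { reg: 1, addr: 1000 };
    code[201] = Instr::Load { reg: 2, addr: 1001 };
    code[202] = Instr::Add { dest: 3, a: 1, b: 2 };
    code[203] = Instr::Store { addr: 1002, reg: 3 };

    let regs = execute(&code, &mut memory, 200);
    println!("R1={} R2={} R3={}", regs[1], regs[2], regs[3]);
    println!("memory[1002] = {}", memory[1002]); // prints 15
}
```

After the run, R1=5, R2=10, R3=15, and memory location 1002 holds 15, matching the final state of Cycle 4.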


From High-Level Code to Machine Instructions

At this point, you might be thinking, "This is interesting, but I write result = x + y in Python, not a sequence of LOAD and ADD instructions." That is the entire point of programming languages! They provide a human-readable abstraction over the tedious, low-level reality of the machine.

The process of converting your high-level code into the CPU's native instructions is called compilation (or, in the case of languages like Python and JavaScript, interpretation, usually combined with just-in-time compilation). We will dedicate a whole subchapter to this process later, but for now, let's establish the conceptual link.

Consider this single, simple line of high-level code:

var result = x + y;

Let's assume x, y, and result are variables stored in memory. For the CPU to execute this line, it must be translated into a sequence of machine instructions, just like our example.

High-Level Concept:
var result = x + y;

Machine Instruction Sequence:
LOAD R1, [address_of_x]
LOAD R2, [address_of_y]
ADD R3, R1, R2
STORE [address_of_result], R3

This one-to-many relationship is fundamental. One line of your code can expand into two, four, ten, or even hundreds of machine instructions. A seemingly simple operation like print("Hello") is incredibly complex at the CPU level.

Think about what actually has to happen here. First, you need to loop through each character - 'H', 'e', 'l', 'l', 'o'. For each one, the CPU has to look up the font data, which tells it how to draw the actual shape of that letter. Then it needs to calculate the correct position on the screen where this character should appear. Finally, it writes the pixel color data into a special area of memory called the video buffer, which the graphics hardware reads to display on your screen.

Each of these steps involves dozens of its own low-level instructions. The tools (compilers, interpreters, operating systems) do the hard work of this translation for you. But understanding that this translation happens is key to becoming a more effective programmer. It explains why some operations are "expensive" (take more time) than others-because they expand into a larger number of machine instructions.


Why Modern CPUs Are So Fast

The Fetch-Decode-Execute model we've described is a simplified, classical view. If modern CPUs operated exactly like this, they would be much slower. Real-world processors use several brilliant engineering tricks to speed things up dramatically.

Clock Speed

The "speed" of a processor, measured in Gigahertz (GHz), refers to its clock speed-but what exactly is this clock, and why is it so crucial?

The CPU's timing starts with a tiny crystal oscillator, usually made of quartz (on modern systems it physically sits on the motherboard, and the CPU multiplies its frequency internally), that vibrates at an incredibly precise frequency. These vibrations generate electrical pulses-the fundamental timing mechanism of the CPU. Each pulse is called a clock cycle or clock tick. The clock signal propagates through the entire chip, synchronizing every operation by ensuring all components advance their work at exactly the same moment.

When we say a processor runs at 5 GHz (5 gigahertz), we mean this clock is generating 5 billion pulses every second. To put that in perspective, one clock cycle at 5 GHz lasts just 0.2 nanoseconds. In that impossibly brief moment, light itself travels only about 6 centimeters. Think about that - light, the fastest thing in the universe, barely makes it across your hand in the time it takes for one clock tick. In fact, at these frequencies a clock cycle is comparable to the time an electrical signal needs just to cross the chip itself.

But here's the crucial part: not every instruction completes in one clock cycle. Simple operations like adding two numbers already in registers might take 1 cycle. Loading data from memory might take 4-400 cycles depending on where it's stored. Complex operations like division could take 10-40 cycles.

The clock doesn't directly "execute" instructions-it coordinates the timing. Here's what actually happens: with each tick, all components advance to their next operation. Tick, advance. Tick, advance. Like a perfectly synchronized dance where everyone moves to the next position at exactly the same moment.

Without this precise timing, different parts of the CPU would get out of sync. The ALU might try to add numbers before they've been fetched from registers. The decoder might try to decode an instruction that hasn't fully arrived. It would be chaos.

This is why overclocking (forcing the clock to run faster) is both possible and dangerous. You're literally making the crystal vibrate faster, forcing all operations to speed up. But if you push too hard, operations can't complete in time before the next tick, causing errors and crashes. The timing requirements become impossible to meet - circuits need a minimum time to stabilize their electrical signals, and pushing beyond that causes computation failures.

Pipelining

Remember our Fetch-Decode-Execute cycle? In a naive CPU design, we'd complete all stages for instruction #1, then start instruction #2, then #3, and so on. This is painfully inefficient-it means most of the CPU's components sit idle while waiting for other parts to finish. Only one stage is active at any given moment, wasting the computational potential of all the other stages.

Pipelining revolutionized CPU design by applying assembly-line thinking to instruction processing. Instead of waiting for an instruction to complete its entire journey, we overlap the stages of multiple instructions.

Let's break down what's really happening in a pipelined CPU. Imagine we have five stages (modern CPUs actually have varied pipeline depths, but let's keep it simple for now).

First is IF (Instruction Fetch), where we read the instruction from memory. Then comes ID (Instruction Decode), where we figure out what that instruction actually means. Next is EX (Execute), where we perform the actual operation - the adding, subtracting, or whatever the instruction calls for. The MEM (Memory Access) stage handles any loading or storing of data that needs to happen. Finally, WB (Write Back) saves any results back to the registers.

Without pipelining, executing 5 instructions would take 25 clock cycles (5 stages × 5 instructions). But watch what happens with pipelining:

Clock | IF   | ID   | EX   | MEM  | WB   | Note
------+------+------+------+------+------+--------------------------
  1   | Ins1 | -    | -    | -    | -    |
  2   | Ins2 | Ins1 | -    | -    | -    |
  3   | Ins3 | Ins2 | Ins1 | -    | -    |
  4   | Ins4 | Ins3 | Ins2 | Ins1 | -    |
  5   | Ins5 | Ins4 | Ins3 | Ins2 | Ins1 | ← Pipeline full!
  6   | Ins6 | Ins5 | Ins4 | Ins3 | Ins2 | ← Completing 1 per cycle
  7   | Ins7 | Ins6 | Ins5 | Ins4 | Ins3 |
  8   | Ins8 | Ins7 | Ins6 | Ins5 | Ins4 |
  9   | -    | Ins8 | Ins7 | Ins6 | Ins5 |

After the initial "fill-up" period, we're completing one instruction every clock cycle! Those 5 instructions now take only 9 cycles total instead of 25. With hundreds of instructions, the speedup approaches 5x (the number of pipeline stages).

But here's where it gets tricky-pipeline hazards can ruin this beautiful choreography:

Data Hazards: What if instruction 2 needs the result from instruction 1?

ADD R1, R2, R3    # R1 = R2 + R3
MUL R4, R1, R5    # R4 = R1 × R5 (needs R1 from previous instruction!)

The MUL instruction enters the decode stage while ADD is still executing. R1 doesn't have its new value yet! The CPU must detect this and insert a "bubble" (a wait cycle) or use clever forwarding circuits to pass the result directly from one stage to another.

Control Hazards: Conditional jumps are pipeline killers.

CMP R1, R2        # Compare R1 and R2
JE somewhere      # Jump if equal
ADD R3, R4, R5    # Should we execute this or not?

By the time we know whether to take the jump (after the Execute stage), we've already fetched and started decoding the ADD instruction. If we jump, we've wasted work and must "flush" the pipeline-throwing away partially processed instructions.

Modern CPUs use branch prediction to guess which way a jump will go based on past behavior. If your loop usually runs 1000 times, the CPU will bet on "don't exit the loop" 999 times and be right. But when it guesses wrong, the penalty is severe - modern CPUs might waste 15-20 cycles flushing and refilling the pipeline.

This is why seemingly innocent code can have dramatic performance implications:

# This code is slow due to unpredictable branches
for item in huge_array:
    if random() > 0.5:  # CPU can't predict this!
        process(item)

# This is much faster - predictable pattern
for item in sorted_array:  # Sorted = predictable branches
    if item > threshold:
        process(item)

The sorted array creates a predictable pattern (all small values first, then all large values), letting the branch predictor work its magic. The random check is unpredictable, causing constant pipeline flushes. We'll learn about branch prediction in a later sub-chapter.
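Here is a self-contained sketch of the classic experiment. In a compiled language like C or Rust, the sorted pass is famously several times faster; in CPython the interpreter overhead largely hides the effect, so this version only demonstrates that the work itself is identical either way:

```python
import random

random.seed(42)
data = [random.randint(0, 255) for _ in range(100_000)]
threshold = 128

def count_over(values, threshold):
    # The branch below is what the branch predictor bets on.
    # Unsorted input: ~50/50 taken/not-taken, nearly unpredictable.
    # Sorted input: all "not taken", then all "taken" - trivially
    # predictable after the first few iterations.
    count = 0
    for v in values:
        if v > threshold:
            count += 1
    return count

unsorted_count = count_over(data, threshold)
sorted_count = count_over(sorted(data), threshold)
assert unsorted_count == sorted_count  # same result, very different branch patterns
```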

Pipelining transforms the CPU from a sequential processor into a parallel instruction factory, but it requires incredible engineering to handle the hazards and dependencies that naturally arise. Every modern performance optimization-from compiler reordering to CPU speculation-exists to keep this pipeline flowing smoothly.

Multiple Cores

For a long time, the primary way to make CPUs faster was to increase the clock speed. But we've hit physical limits due to heat and power consumption. The solution was to put multiple independent CPUs-called cores-onto a single chip.

A dual-core CPU is literally two processors working side-by-side. A quad-core CPU has four. Each core has its own Fetch-Decode-Execute cycle and can run its own stream of instructions. This allows a computer to do truly parallel work, like running your web browser on one core while your music player runs on another.

Caching

We established that accessing main memory (RAM) is slow compared to accessing registers. To bridge this gap, CPUs use a cache. A cache is a small amount of much faster (and more expensive) memory that sits between the CPU and the main RAM.

When the CPU needs a piece of data from memory, it first checks the cache. If the data is there (a "cache hit"), it gets it almost instantaneously. If it's not there (a "cache miss"), the CPU has to make the slow trip to RAM. When it fetches the data from RAM, it also loads it, and some of the surrounding data, into the cache, assuming it might be needed again soon. This is a topic we'll explore in depth in the next subchapter on memory.


The Abstraction Tower

As programmers, we are fortunate to not have to think about electrical signals, transistors, or even machine code. We work at the top of a tall "Abstraction Tower," where each layer uses the one below it to create a more powerful and convenient environment.

+---------------------------------------------+
| Your Code (Python, JavaScript, Rust, etc.)  |  <-- You are here
+---------------------------------------------+

+------------------------------------------+
|      Standard Library / Runtime          |
|  (e.g., Python's list object, Node.js)   |
+------------------------------------------+

+------------------------------------------+
|       System Calls / OS Interface        |
|      (e.g., open file, get network)      |
+------------------------------------------+

+------------------------------------------+
|              Assembly Language           |
|  (Human-readable machine instructions)   |
+------------------------------------------+

+------------------------------------------+
|          Machine Code (Binary)           |
| (The 1s and 0s the CPU actually reads)   |
+------------------------------------------+

+------------------------------------------+
|       Microarchitecture (CPU Logic)      |
|  (Fetch-Decode-Execute, Pipelining)      |
+------------------------------------------+

+------------------------------------------+
|           Electrical Signals             |
|          (Voltages changing)             |
+------------------------------------------+

+------------------------------------------+
|          Transistors Switching           |
|    (The fundamental physical reality)    |
+------------------------------------------+

Each layer provides a service to the one above it while hiding the complexity of the one below it. Your Rust code doesn't need to know how to open a file; it just asks the Operating System (OS). The OS doesn't need to know the specific machine code for your CPU; it relies on drivers and the compiler. This tower of abstraction is what makes modern software development possible. My goal in these early chapters is to give you a clear view all the way down to the bottom.


Real CPUs

The hypothetical Instruction Set Architecture (ISA) we've been using is designed to be simple and clear. In the real world, two ISAs dominate the landscape: x86-64 and ARM.

x86-64 (Intel/AMD)

This is the architecture you'll find in most desktop PCs, laptops, and servers. It's a direct descendant of the Intel 8086 processor from 1978.

The philosophy here is CISC - Complex Instruction Set Computer. The CISC approach is to create specialized, powerful instructions that can accomplish complex tasks in a single step. You might have one instruction that copies an entire block of memory from one location to another. Pretty powerful stuff.

The x86-64 instruction set is vast and irregular, with thousands of instructions. This is the legacy of decades of added features while maintaining backward compatibility - they never threw anything away, just kept adding more.

ARM

This architecture dominates the mobile world-it's in virtually every smartphone and tablet. Apple's M1, M2, M3 and M4 chips have also brought ARM to laptops and desktops with great success.

ARM follows the RISC philosophy - Reduced Instruction Set Computer. This is the opposite of CISC. Instead of complex, do-everything instructions, RISC provides a small, highly optimized set of simple instructions. Complex tasks? You build them by combining these simple pieces.

ARM instructions are more regular and predictable than x86. This simplicity makes it easier to build fast, power-efficient processors - which is why your phone can run all day on a tiny battery. The "load/store" model we've been using throughout this chapter, where arithmetic can only be done on registers? That's classic RISC design.

The key takeaway is this: while the specific instructions and register names are different, the fundamental principles are the same. Both architectures use registers, interact with memory, and perform a Fetch-Decode-Execute cycle. A program compiled for x86-64 will not run on an ARM chip, but the high-level logic of the program can be expressed on either.


The Illusion of Simultaneity

Look at your computer right now. You might have a web browser open, a music app playing, a clock ticking in the corner, and your operating system is managing network connections in the background. It feels like dozens of things are happening at the exact same moment.

This is a carefully crafted illusion.

A single CPU core, as we've established, can only execute one instruction at a time. It is a serial, sequential processor. The illusion of simultaneity is created by the Operating System (OS).

The OS scheduler rapidly switches between programs to create this illusion. It lets one program (say, your web browser) run its instructions for a tiny fraction of a second-a timeslice, perhaps just a few milliseconds. Then, it forcibly interrupts the browser. It carefully saves the browser's entire state (the values in all its registers, including the Instruction Pointer) to memory.

Next, it loads the saved state of another program (your music player). It restores its register values, sets the Instruction Pointer to where the music player left off, and lets it run for a few milliseconds. Then it swaps again, perhaps to the OS's own background tasks, then back to the browser.

This process, called a context switch, happens hundreds of times per second. It happens so fast that to a human observer, everything appears to be running smoothly and simultaneously. But at the silicon level, the CPU core is only ever doing one thing at any given microsecond.
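The bookkeeping involved can be sketched in a few lines of Python, with dictionaries standing in for register files (the names are illustrative, not a real OS API):

```python
# Each "process" is just its saved register state.
browser = {"IP": 0x4000, "R1": 10, "R2": 0}
music   = {"IP": 0x9000, "R1": 3,  "R2": 7}

cpu_registers = {}  # the one physical register file

def context_switch(state_out, state_in):
    # 1. Save the running program's registers to memory.
    state_out.update(cpu_registers)
    # 2. Restore the next program's registers, including its
    #    Instruction Pointer, so it resumes exactly where it stopped.
    cpu_registers.clear()
    cpu_registers.update(state_in)

cpu_registers.update(browser)   # the browser is running
cpu_registers["R1"] += 1        # it does some work
context_switch(browser, music)  # timeslice over: swap to the music player

print(cpu_registers["IP"] == 0x9000)  # True: now executing the music player
print(browser["R1"])                  # 11: the browser's progress was preserved
```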

This is why having multiple cores is so significant. With four cores, your computer can run four instruction streams in true physical parallelism, dramatically reducing the need for the rapid context switching that creates the illusion on a single core.


Limitations and Specializations

For all their power, it's important to be honest about what CPUs are and what they are not. They are not intelligent. They are powerful calculators bound by physical laws.

First off, CPUs don't understand data types. To a CPU, a number is just a pattern of bits. It has no idea if that pattern represents a color, a character of text ('A' is just the number 65 in ASCII encoding), or a bank balance. That meaning is imposed entirely by the program.

They're also limited by physics. The speed of electricity and the generation of heat place hard caps on how fast a single core can run. We can't simply make the clock tick faster indefinitely - at some point, the chip would literally melt.

And critically, CPUs rely completely on memory. The CPU can't hold an entire program or its data internally. It's constantly in dialogue with memory, fetching instructions and data, storing results. As we'll see in the next sub-chapter, this memory interaction is actually a major performance bottleneck.

Because general-purpose CPUs can't be the best at everything, modern processors have added specialized instructions to accelerate common, demanding tasks.

SIMD

SIMD (Single Instruction, Multiple Data) represents one of the most elegant solutions to a fundamental problem: how do we process large amounts of data faster without making the CPU clock faster?

The Problem SIMD Solves

Let's say you're editing a photo with 10 million pixels. You want to increase the brightness by 20%. In traditional (scalar) processing, you'd do something like:

for each pixel in image:
    LOAD pixel_value
    ADD brightness_adjustment
    STORE pixel_value

That's 30 million instructions (3 per pixel × 10 million pixels). Even at 5 billion instructions per second, that's 6 milliseconds just for this simple operation. And modern image processing involves dozens of such operations.

The SIMD Solution

SIMD introduces special vector registers that are much wider than normal registers. While a regular register might hold one 32-bit number, a SIMD register can hold multiple values.

SSE registers are 128 bits wide, so they can hold 4 floats or 16 bytes all at once. AVX registers double that to 256 bits - that's 8 floats or 32 bytes. And if you're working with AVX-512, you get massive 512-bit registers that can hold 16 floats or 64 bytes. That's a lot of data in a single register!

Now that same brightness operation looks like:

for each group of 8 pixels:
    LOAD 8 pixels into vector_register_1     # One instruction loads 8 values!
    ADD vector_register_2 (contains 8x brightness value)  # Adds all 8 at once
    STORE 8 pixels from result                # Stores all 8 at once

We've just reduced those 30 million instructions to about 3.75 million (1.25 million groups × 3 instructions)-an 8x improvement!

Real-World SIMD Example

Let's make this concrete. Say you have an array of temperatures in Celsius that you want to convert to Fahrenheit:

Traditional scalar approach

// Process one temperature at a time
for (int i = 0; i < 1000; i++) {
    fahrenheit[i] = celsius[i] * 9.0/5.0 + 32;
}

SIMD approach (conceptually)

// Process 4 temperatures simultaneously
for (int i = 0; i < 1000; i += 4) {
    // Load 4 Celsius values at once
    vector_load(celsius[i] to celsius[i+3]) → SIMD_REG1

    // Multiply all 4 by 9/5 simultaneously
    SIMD_MUL SIMD_REG1, [9/5, 9/5, 9/5, 9/5] → SIMD_REG2

    // Add 32 to all 4 simultaneously
    SIMD_ADD SIMD_REG2, [32, 32, 32, 32] → SIMD_REG3

    // Store all 4 Fahrenheit values at once
    vector_store(SIMD_REG3 → fahrenheit[i] to fahrenheit[i+3])
}

The CPU executes the same instruction on multiple data points in parallel-hence "Single Instruction, Multiple Data."

Where SIMD Shines

So where does SIMD really shine? First, you need the same operation applying to many data points - think image filters, audio processing, or physics simulations where you're doing the same calculation over and over. Second, your data needs to be contiguous in memory - arrays, matrices, buffers all work great. Third, there can't be branching within the operation - every element has to get the same treatment. And fourth, the operations themselves need to be simple - add, multiply, compare, min/max. When all these conditions align, SIMD is incredibly powerful.

You'll find SIMD everywhere in modern computing. Graphics operations use it constantly - transforming vertices, pixel shading, applying image filters. Audio processing leverages SIMD for applying effects, mixing channels, and FFT operations. Machine learning? It's all matrix multiplications and activation functions, perfect for SIMD. Cryptography uses it for block cipher operations and hash functions. Scientific computing relies on it for linear algebra and signal processing. And games? Physics simulations and collision detection would be impossibly slow without SIMD.

SIMD's Limitations

But SIMD isn't magic-it has real constraints you need to work around. Data often needs to be aligned to specific boundaries - 16, 32, or 64 bytes. It's all or nothing too - if you need to process 1001 items in groups of 4, that last lonely item needs special handling. Branching absolutely kills performance since different operations for different elements require complex masking or separate passes. And don't forget, you're still limited by memory bandwidth - it doesn't matter how fast your SIMD units are if you can't feed them data quickly enough.

Consider this problematic case:

# SIMD unfriendly - different operation per element
for i in range(1000):
    if data[i] > threshold:
        data[i] = data[i] * 2  # Some elements get doubled
    else:
        data[i] = data[i] + 1  # Others get incremented

SIMD can handle this with "masking" (selectively applying operations), but it's less efficient than uniform operations.
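The masking idea can be sketched in scalar Python: compute both outcomes for every element, then blend them with a 0/1 mask, so every element takes the identical sequence of operations. This is a hypothetical illustration of the concept (and the mask computation here is itself still a Python branch); real SIMD uses dedicated mask registers or blend instructions:

```python
def branchless_update(data, threshold):
    result = []
    for x in data:
        mask = 1 if x > threshold else 0  # per-element mask (0 or 1)
        doubled = x * 2                   # compute BOTH outcomes...
        incremented = x + 1
        # ...then select with arithmetic instead of a branch.
        result.append(mask * doubled + (1 - mask) * incremented)
    return result

print(branchless_update([1, 5, 10, 2], threshold=4))  # [2, 10, 20, 3]
```

The cost is visible: every element pays for both the doubling and the incrementing, which is why masked SIMD is less efficient than uniform operations.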

Modern SIMD in Practice

In practice, you rarely write SIMD instructions directly. Modern compilers can automatically vectorize simple loops, detecting patterns where SIMD would help. If you need more control, there are intrinsics - C/C++ functions that map directly to SIMD instructions. Most of the time though, you're using SIMD through libraries - NumPy, OpenCV, PyTorch, game engines, and modern ML frameworks all use it heavily under the hood. And if you really need massive parallelism? GPU computing takes the SIMD concept to its extreme with thousands of parallel units.

The key insight is that SIMD bridges the gap between the "one instruction at a time" CPU model we've discussed and the massively parallel processing that modern applications demand. It's a crucial optimization that makes real-time graphics, video streaming, and many other compute-intensive tasks possible on general-purpose CPUs.

Atomic Operations

When multiple cores or threads try to modify the same piece of memory, chaos can ensue. Imagine two cores trying to increment a counter at the same time. Both might LOAD the value 5, both ADD 1 to get 6, and both STORE 6. The counter should be 7, but it's 6. This is a race condition.

To solve this, CPUs provide atomic operations. These are special instructions that are guaranteed to execute all at once, without interruption. An instruction like COMPARE_AND_SWAP will read a memory location, compare it to an expected value, and write a new value in a single, indivisible step. Other cores are locked out of that memory location for the brief moment the atomic instruction is executing, ensuring data integrity.
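The semantics of compare-and-swap are easy to state in code. Here is a sketch in Python, using a lock to stand in for the hardware's indivisibility guarantee (on a real CPU this is a single instruction, with no lock involved):

```python
import threading

class AtomicCell:
    def __init__(self, value):
        self._value = value
        self._lock = threading.Lock()  # stands in for hardware atomicity

    def compare_and_swap(self, expected, new):
        """Atomically: if the cell holds `expected`, store `new`.
        Returns True on success, False if another writer got there first."""
        with self._lock:
            if self._value == expected:
                self._value = new
                return True
            return False

cell = AtomicCell(5)
print(cell.compare_and_swap(5, 6))  # True  - value was 5, now it is 6
print(cell.compare_and_swap(5, 7))  # False - someone already changed it
```

A lock-free increment is then a retry loop: read the current value, attempt to swap in value + 1, and start over if the swap fails because another core won the race.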

Setting the Stage for What's Next

We have built a foundational model of the CPU: a simple but incredibly fast machine that executes a list of instructions by constantly cycling through a Fetch-Decode-Execute loop. We've seen how it uses registers as a scratchpad and how instructions for moving data, performing arithmetic, and controlling program flow are the building blocks of all software.

This understanding is powerful, but it also raises new questions.

  • We've talked a lot about "memory addresses" and "mailboxes." But what is memory? How is it organized? Why is it so much slower than the CPU, and how do we manage that difference? This brings us to the crucial topic of our next subchapter: the memory hierarchy.

  • We've also seen the massive gap between a line of Python and a sequence of machine code. How does that translation actually happen? What is a compiler, and what are the steps it takes to turn human logic into the CPU's native language? We will explore this in detail after we understand memory.

You now possess the essential mental model of the engine that drives every piece of software. Every variable you declare needs a place to live, either in a register or in a memory location. Every function you call requires the CPU to JUMP to a new instruction address and save the old one to come back to. Every loop you write is, at its heart, just a COMPARE instruction followed by a conditional JUMP that goes backward.

The arbitrary rules of programming languages will begin to feel like natural consequences of how the machine actually works. You are no longer just a writer of code; you are beginning to understand the machine itself.
