Building a Lexer from Scratch
Feel free to follow along with the programming language of your choice.
Forget frameworks, forget libraries for a moment. We're going right to the core of how programming languages actually work. We're about to build one of the most fundamental pieces of software imaginable.
Tokenizer Fundamentals
If you've already seen my previous post about Compilers, you may skip a couple of sections and head directly to the implementation challenges. But I suggest you stay and read from top to bottom. A recap won't harm.
If not, I suggest you read that post first. It will clear up a lot of concepts that we're going to pass over quickly in this chapter.
A compiler is a translator. Its entire job is to take the code you write in one language (say, JavaScript) and translate it into a different language (like Python, or even the raw binary instructions your processor understands).
We're going to build a specific kind of compiler called a transpiler, which translates from one high-level language (JavaScript) to another (Python). But before any compiler can do its fancy translation work, it has to learn how to read.
When you read a sentence, your brain doesn't see it as an unbroken stream of characters. You instantly group letters like T-h-i-n-k
into the word "Think," see the space as a divider, and move on to the next word. You're performing pattern-matching - identifying words, punctuation, and the spaces that separate them.
This process, in the world of compilers, is called lexical analysis, or more commonly, tokenization. The part of the compiler that does this is called a tokenizer (or a lexer). Its job is to take the raw source code - which is just one big long string to the computer - and chop it up into a list of meaningful "words." We call these words tokens.
You can think of these tokens as the very same "words" your favorite LLM uses to count exactly how much to charge you. Yes, even the commas.
For example, if our compiler looks at this line of JavaScript -
let score = 100;
It doesn't see 16 individual characters. A well-built tokenizer sees five distinct, meaningful pieces:

- A keyword with the value `let`.
- An identifier (which is just a technical name for a variable or function name) with the value `score`.
- An operator with the value `=`.
- A number with the value `100`.
- A piece of punctuation with the value `;`.
In this chapter, we (actually, you) are going to build the eyes of our compiler. We're going to write a tokenizer from scratch. We'll start with an empty file and, step by step, write the code that can read a string of JavaScript and produce a clean list of tokens.
The principles we're about to put into practice are the very same ones used inside the V8 engine that powers Chrome and Node.js, the Roslyn compiler for C#, and countless other professional-grade tools.
The entire journey of our compiler will follow a clear path:
- We take the raw JS string and produce a list of tokens, which is called Tokenization. (That's this chapter!)
- We take that list of tokens and build a structure called an Abstract Syntax Tree (AST), which represents the code's grammatical relationships. This phase is called Parsing.
- An optional but incredibly powerful step where we can modify that AST to make optimizations or changes. This phase has a name too - Transformation.
- We "walk" over the final AST and generate an equivalent string of Python code. This last phase is Code Generation.
Compiler Architecture and Project Setup
Before we dive into the code, we need to think like architects. You wouldn't start building a house by nailing random boards together, right? You'd draw up a blueprint.
A clean project structure and a clear plan for our data serve as the foundation that will keep our project from becoming a tangled mess later on.
The Compiler Family Tree
You're going to hear a few terms thrown around, and it's best we clear them up right now so you know exactly where we stand.
Compiler is a program that translates an entire codebase from a high-level language (like C++) into a lower-level language (like machine code) before you run it. It produces a standalone file (like an .exe
on Windows) that you can execute. The translation happens all at once, up front.
Interpreter is a program that reads source code and executes it line by line, on the fly. It doesn't produce a separate compiled file. When you run a Python script, the Python interpreter is reading, translating, and executing your code simultaneously.
Transpiler (Source-to-Source Compiler) is a specific type of compiler that translates from one high-level language to another high-level language. This is exactly what we are building. We are transpiling JavaScript code into Python code. Famous examples you've probably used are Babel (which transpiles modern JS to older JS) and TypeScript (which transpiles TypeScript to JS).
Our project is technically a transpiler, but we'll often use the general term "compiler" because the internal stages - the pipeline we're building - are identical.
The Compilation Pipeline
Every serious compiler on the planet follows a similar high-level architecture. All our data will flow through these stages in the following order -
- Lexical Analysis, or Tokenization. The raw source code string goes in. A flat list of tokens comes out. `let x = 10;` becomes `[LET, IDENTIFIER(x), EQUALS, NUMBER(10), SEMICOLON]`.
- Syntactic Analysis, or Parsing. The list of tokens goes in. The parser checks if the tokens form a valid "sentence" according to the language's grammar rules and builds a tree-like data structure called an Abstract Syntax Tree (AST) that represents the code.
- Transformation. The AST goes in, and a modified AST comes out. This is where you might perform optimizations or tweaks to prepare the code for its new language.
- Code Generation. The final AST goes in. The code generator walks through the tree and builds the final output string of code in the target language.
This modular design separates the problem beautifully. The part of our compiler that understands JavaScript doesn't need to know anything about how to write Python.
Section 1.1
Your Project Structure
Create a top-level folder, maybe `js-to-py-compiler`. I'm calling mine `compyler`.
Inside that, create a `src` directory for our source code and a `tests` directory. We are going to rely on tests heavily to make sure every piece of our machine works perfectly.
Challenge: Building the Scanner's Foundation
You've got the theory, you've seen the blueprints, and now it's time to actually build something. And I mean really build it - not just copy-paste from a tutorial, but write code that you understand down to its bones.
What You're Actually Building Right Now
We're not going to try to build the entire tokenizer in one shot. Instead, we're going to be surgical about this. We're building just the scanning machinery first - the part that can walk through a string character by character, keeping track of exactly where it is at all times.
Your Task: Create the Scanner Skeleton
Start with the `Token` class. This is going to be dead simple for now, and that's exactly what we want. Create a class with these four properties:

- `type`: For now, just two possibilities: `"CHAR"` for any character we find, and `"EOF"` for end-of-file.
- `value`: The actual character we found (or `null` for `EOF`).
- `line`: Which line number we found it on (humans count from 1, remember?).
- `col`: Which column position on that line (also starting from 1).
Make this class immutable if your language supports it. Once a token is created, its data should never change. This prevents a whole category of weird bugs where some code accidentally modifies a token after it's been created.
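Here's one possible shape for that class in JavaScript - a minimal sketch that leans on `Object.freeze()` for the immutability we just talked about:

```javascript
// A minimal Token: four read-only properties and nothing else.
class Token {
  constructor(type, value, line, col) {
    this.type = type;    // "CHAR" or "EOF" for now
    this.value = value;  // the character, or null for EOF
    this.line = line;    // 1-based line number
    this.col = col;      // 1-based column number
    Object.freeze(this); // make the token immutable
  }
}
```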
Now for the main event - the `Tokenizer` class. You need:

The Constructor. Takes one parameter - the source code as a string. Store it in an instance variable because you'll be referring to it constantly.

State Tracking Variables

- `currentIndex` - Where are we in the string? Starts at 0 because that's how string indices work.
- `line` - What line are we on? Starts at 1 because that's how humans count lines.
- `col` - What column are we at on this line? Also starts at 1.
The Helper Methods
First, `peek()`. This method is absolutely critical: it lets you look at the current character without committing to consuming it. You're going to use this constantly to make decisions about what to do next. It should -

- Return the character at `currentIndex`
- If we're past the end of the string, return `'\0'` (the null character) or some other sentinel value
- Most importantly: DO NOT modify `currentIndex`, `line`, or `col`

Next, `advance()`. This is peek's partner. It does four things:

- Gets the character at the current position (just like peek)
- Increments `currentIndex` by 1
- Increments `col` by 1
- Returns the character it just consumed

One more helper - `isAtEnd()`. Super simple but you'll use it everywhere:

- Returns true if `currentIndex >= length of input string`
- That's it. That's the whole method.
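Put together, the scanner skeleton could look roughly like this - a sketch, not the one true answer; the `'\0'` sentinel and the property names are just the conventions from above:

```javascript
class Tokenizer {
  constructor(source) {
    this.source = source;  // the raw code we're scanning
    this.currentIndex = 0; // machine-friendly position in the string
    this.line = 1;         // human-friendly line number
    this.col = 1;          // human-friendly column number
  }

  // Look at the current character without consuming it.
  peek() {
    if (this.isAtEnd()) return "\0";
    return this.source[this.currentIndex];
  }

  // Consume the current character and move forward one position.
  advance() {
    const ch = this.peek();
    this.currentIndex++;
    this.col++;
    return ch;
  }

  // Have we run off the end of the input?
  isAtEnd() {
    return this.currentIndex >= this.source.length;
  }
}
```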
The Main Event: tokenize()
This is where everything comes together. Here's exactly what it needs to do, step by step.
- Create an empty array to collect tokens.
- Start a loop that continues until `isAtEnd()` returns true.
- Inside the loop, grab the current character using `peek()`.
- Now here's where we handle different characters:
  - If it's a space or tab: just call `advance()` to skip it. Don't create a token.
  - If it's a newline (`'\n'`): this is special! Call `advance()` to consume it, then increment `line` by 1 and reset `col` back to 1. Don't create a token. You might need to think deeply on how to achieve this.
  - For any other character: create a new Token with type `"CHAR"`, the character as the value, and the current line and col. Add it to your token array. Then call `advance()` to move forward.
- After the loop ends, create one final token with type `"EOF"`, value `null`, and the current line/col position. Add it to the array.
- Return the array of tokens. (One possible shape of this loop is sketched just below.)
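If you get stuck, here's one way the loop could come together - a sketch that assumes the `Token` class and helper methods sketched earlier:

```javascript
// Inside the Tokenizer class from before.
tokenize() {
  const tokens = [];

  while (!this.isAtEnd()) {
    const ch = this.peek();

    if (ch === " " || ch === "\t") {
      this.advance();                 // skip it, no token
    } else if (ch === "\n") {
      this.advance();                 // consume the newline...
      this.line++;                    // ...move to the next line...
      this.col = 1;                   // ...and reset the column
    } else {
      tokens.push(new Token("CHAR", ch, this.line, this.col));
      this.advance();
    }
  }

  tokens.push(new Token("EOF", null, this.line, this.col));
  return tokens;
}
```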
Test It Out
Once you've written all this, test it with something simple:
const tokenizer = new Tokenizer("hi \nyo");
const tokens = tokenizer.tokenize();
console.log(tokens);
You should see:
[
{ type: "CHAR", value: "h", line: 1, col: 1 },
{ type: "CHAR", value: "i", line: 1, col: 2 },
{ type: "CHAR", value: "y", line: 2, col: 1 },
{ type: "CHAR", value: "o", line: 2, col: 2 },
{ type: "EOF", value: null, line: 2, col: 3 },
];
Notice how the space disappeared? How the newline caused the line number to increment and the column to reset? That's your position tracking working perfectly.
Common Mistakes (So You Can Avoid Them)
The number one bug? Forgetting to call advance()
somewhere, which causes an infinite loop. Your program just sits there, forever looking at the same character, never moving forward. If your test hangs, that's probably why.
Another classic: forgetting to reset col
to 1 when you hit a newline. Then your column numbers on line 2 and beyond are completely wrong, and your error messages become lies.
Oh, and here's a subtle one: make sure you create the EOF token after the main loop ends, using whatever the current position is at that point. Don't try to calculate where it should be - just use the actual current values of line
and col
.
What You've Actually Accomplished
This might seem basic - we're just walking through a string and keeping track of where we are. But this is the foundation that everything else builds on. That position tracking you just implemented? It's what makes the difference between "Error somewhere in your code"
and "Error on line 42, column 15"
. Those helper methods you wrote? You're going to use them dozens of times in the next sections.
Most importantly, you now have a working, testable piece of code. You can feed it any string and watch it walk through character by character, maintaining perfect position awareness. That's the beating heart of a compiler.
Before You Move On
Make sure your code actually works. Don't just read through this and nod along - write the code, run it, debug it when it inevitably doesn't work the first time (mine never does). Even if you're reading on mobile - just open a new tab, go to any online IDE/compiler, and build along.
Try different inputs. What happens with an empty string? A string that's just spaces? Multiple newlines in a row?
Once you've got this working smoothly, you're ready for Section 1.2, where we'll teach this scanner to recognize its first real pattern: numbers. And not just simple integers - we're going full scientific notation.
But first, get this working. Really working. The rest of the tokenizer depends on this foundation being rock solid. Take your time, test thoroughly, and when you see that EOF
token appearing at exactly the right position, you've just built the eyes of your compiler.
API Design Decisions
- Why track line, column, and index? The `line` and `column` are for humans. They exist solely to produce beautiful, readable error messages. The `currentIndex` is for the machine. It's the fastest, most efficient way to manage our position and slice substrings from the source code. Having both gives us the best of both worlds: human-friendly errors and machine-friendly performance.
- When you create a `Token` object, make it immutable. Its properties should be set once in its constructor and never change. This is a simple defensive move that prevents a whole category of bizarre bugs where some later part of the compiler accidentally messes with a token's data. It guarantees data integrity.
- What happens if our tokenizer bumps into a character it doesn't recognize, like `~`? For now, our strategy will be simple and effective: create an `ILLEGAL` token containing the weird character and just keep going. This is much more user-friendly than crashing on the first error. It allows us to report all the syntax errors in a file in a single pass.
We've got our blueprints. We've set up our workshop. Now, let's teach this thing its first real word.
Section 1.2
We've got our basic machinery set up, the part that walks through a file character by character. Now, we need to teach it to recognize its first real pattern. And we're going to start with something that seems dead simple on the surface but has a surprising amount of depth: numbers.
Sure, you see 42
and think, "Easy." But what about 3.14159
? Or the physicist's favorite, 6.022e23
? Or even just .5
? All of these are valid numbers, but their shapes are completely different. Our tokenizer needs to be smart enough to consume all of these forms correctly, and just as importantly, it needs to know when to stop. It has to correctly identify 1.2
as a number in the expression 1.2.3
and not get confused.
Tackling numbers is the perfect way to learn the single most important mental model for this kind of work: the state machine. Once you internalize this concept, you can write a scanner for practically any pattern imaginable.
Thinking in States
A state machine is just a formal way of describing a process that can be in one of several distinct states. Based on the next piece of input it receives, the process can either stay in its current state or transition to a new one.
When you start reading a number like 123.45e67
, your brain is working like this:
Initial State - You see a '1'
. You're now in the "reading the whole number part" state.
Stay in State - You see a '2'
, then a '3'
. These are both digits, so you stay in the "reading the whole number part" state, accumulating the digits.
Transition - Then you see a '.'
. This is a valid transition. You now move into the "reading the fractional part" state.
Stay in State - You see a '4'
, then a '5'
. Digits are expected here, so you stay in the "reading the fractional part" state.
Transition - Then you see an 'e'
. Another valid transition. You are now in the "reading the exponent part" state.
Stay in State - You see a '6'
, then a '7'
. You stay in this final state, consuming the exponent's digits.
End - You see a semicolon or a space. This character can't possibly be part of the number. The process ends. You've successfully identified the complete number.
That's all a state machine is. We're just going to write code that makes these states and transitions explicit. No magic, just a series of clear, simple questions and actions.
Building the Number Scanner
We're going to add a new private method to our Tokenizer
class. We'll call it scanNumber()
. The job of our main tokenize()
loop will be to look at the current character, and if it's a digit, it will hand over control to this new specialist method.
Step 1: Just Plain Integers
First things first, let's handle the simplest case: whole numbers like 123
.
Inside your new scanNumber()
method, the first thing you must do is record the startIndex
from your Tokenizer
's state. We need to remember where the number started so we can slice it out of the source string later.
Next, you'll write a simple loop. The condition is straightforward: while
the character I'm peek()
ing at is a digit, I'm going to stay in this loop. And what do we do inside the loop? Just one thing: advance()
. That's it. We consume the digit and move our currentIndex
forward.
When that loop finally ends, it's because peek()
is looking at something that isn't a digit (or we're at the end of the file). At this point, we know we've consumed a complete integer. The text of our number is the slice of the original source code from our startIndex
to the current currentIndex
. Now you have everything you need: the type is NUMBER
, the value is the string you just sliced, and you know the starting line and column. You create a new Token
and you're done. Integer support, check.
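As a rough sketch (with a hypothetical `isDigit()` helper), the integer-only version might look like this:

```javascript
// Inside the Tokenizer class. A small helper predicate:
isDigit(ch) {
  return ch >= "0" && ch <= "9";
}

// Step 1: whole numbers only.
scanNumber() {
  const startIndex = this.currentIndex;
  const startLine = this.line;
  const startCol = this.col;

  // Stay in the "reading digits" state as long as we see digits.
  while (this.isDigit(this.peek())) {
    this.advance();
  }

  const text = this.source.slice(startIndex, this.currentIndex);
  return new Token("NUMBER", text, startLine, startCol);
}
```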
Step 2: Adding Decimal Support
Now for floating-point numbers like 42.5
or .5
. This represents our first real state transition.
Right after your integer-scanning loop from Step 1 finishes, we'll add a check. peek()
at the current character. Is it a .
? But wait, there's a catch. A single .
at the end of a number (like in 123.
) could be a property accessor in JavaScript. So we need to be more specific: is the current character a .
and is the character after that a digit? You'll need to look ahead two characters for this, so you might need a peekNext()
helper method, or you can just peek at currentIndex + 1
.
If both conditions are true, we've officially transitioned into the "reading the fractional part" state. First, advance()
to consume the .
; then, you start a second loop, identical in logic to the first one, that just consumes all the digits that follow.
When this second loop finishes, the string you slice from your original startIndex
will now correctly contain the entire floating-point number. Beautiful.
What Could Go Wrong?
Let's talk about that 1.2.3
example. With the logic we just designed, our scanNumber()
method will greedily consume 1.2
. It will then stop, because the next character it sees is another .
, which cannot be part of the fractional component it was just parsing. It will emit a NUMBER
token with the value "1.2"
. The main tokenize()
loop will then resume control, see the second .
, and probably treat it as a DOT
operator or maybe an ILLEGAL
token. This is exactly the correct behavior. The tokenizer's only job is to find the longest possible valid token from its current position. It doesn't judge whether the sequence of tokens makes sense. That's the parser's job, and we'll get to that later. Keeping these concerns separate is the secret to clean compiler architecture.
Step 3: Conquering Scientific Notation
This is the final boss for our number scanner: handling scientific notation like 1e10
or 3.14E-5
. This is our third and final state.
After you've handled the integer and the potential fractional parts, peek()
at the current character. Is it an 'e'
or an 'E'
?
If it is, you're entering the exponent state. The sequence is very precise:
- You see the `'e'` or `'E'`. `advance()` past it.
- Now `peek()` again, or use the `peekNext()` method. The character that follows the `e` could optionally be a sign, either `'+'` or `'-'`. If you see one, it's part of the number, so `advance()` past it.
- Finally, there must be at least one digit here. An `e` followed by nothing is not a valid number. You'll start one last loop to consume all the digits of the exponent.
Here's the crucial edge case: if you see an 'e'
, but it's not followed by any digits (or a sign and then digits), then that 'e'
isn't part of the number. Your logic should stop, emit the number you've seen so far, and let the 'e'
be treated as the beginning of the next token (which will most likely be an identifier, like the variable name e
).
And that's it. You've built a robust, three-state number parser.
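Putting all three states together, the method could look roughly like this - a sketch that reuses the `isDigit()` helper from before, adds the `peekNext()` helper mentioned earlier, and checks ahead before committing to the `e` so the edge case above falls out naturally:

```javascript
// Look one character past the current one, without consuming anything.
peekNext() {
  if (this.currentIndex + 1 >= this.source.length) return "\0";
  return this.source[this.currentIndex + 1];
}

scanNumber() {
  const startIndex = this.currentIndex;
  const startLine = this.line;
  const startCol = this.col;

  // State 1: the whole-number part.
  while (this.isDigit(this.peek())) this.advance();

  // State 2: the fractional part - only if the dot is followed by a digit.
  if (this.peek() === "." && this.isDigit(this.peekNext())) {
    this.advance(); // consume the '.'
    while (this.isDigit(this.peek())) this.advance();
  }

  // State 3: the exponent - only commit to the 'e' if digits actually
  // follow it (optionally separated by a '+' or '-' sign).
  if (this.peek() === "e" || this.peek() === "E") {
    let lookahead = this.currentIndex + 1;
    if (this.source[lookahead] === "+" || this.source[lookahead] === "-") {
      lookahead++;
    }
    if (this.isDigit(this.source[lookahead] ?? "\0")) {
      this.advance(); // consume 'e' or 'E'
      if (this.peek() === "+" || this.peek() === "-") this.advance();
      while (this.isDigit(this.peek())) this.advance();
    }
    // Otherwise leave the 'e' alone - it starts the next token.
  }

  const text = this.source.slice(startIndex, this.currentIndex);
  return new Token("NUMBER", text, startLine, startCol);
}
```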
A Quick Word on API Design
- Should tokens store the numeric value or the string? This is a classic question with a clear answer for our purposes. You might be tempted to convert `"3.14"` into the floating-point number `3.14` right here in the tokenizer. I strongly advise against it. Always store the string representation of the value. Why? It's simpler, it completely sidesteps any weird floating-point precision issues down the line, and it perfectly preserves the user's original code. It remembers the difference between `1` and `1.0`, which can be important. We'll stick with storing the raw string value. It's just more robust.
- What about `NaN` or `Infinity`? Not our problem! At least, not the tokenizer's problem. To our tokenizer, `Infinity` is a sequence of letters. It looks exactly like an identifier (a variable name), and that's what it should be tokenized as. It's the job of a later phase, like a semantic analyzer, to know that this particular identifier has a special numeric meaning. The tokenizer's job is to keep it simple and focus only on recognizing the basic shape of things.
A final pro-tip: you might think, "I could do all of this with a single, complex regular expression." You could, but for a high-performance compiler, that's almost never the right call. The hand-written, character-by-character state machine you are building right now is dramatically faster and gives you far more fine-grained control and better error-reporting capabilities than a generic regex engine ever could. You're building it the professional way.
You just conquered what is arguably the most complex single token type in most programming languages. Everything else from here on out will feel like a walk in the park.
Section 1.3
We've taught our machine how to read numbers, which was surprisingly complex. Now, let's teach it how to read words. This is where our code really starts to take shape, because at its heart, code is all about naming things.
We're constantly giving names to things. We name our variables, like score
or userName
. We name our functions, like calculateTotal
or fetchUserData
. In compiler-speak, these are all called Identifiers. They are the unique names you invent.
But there's another class of words - the special, sacred ones that are already part of the language itself. Words like let
, if
, for
, and return
. These are the Keywords. They form the very grammar of JavaScript, the structural beams upon which we build our logic. You can't name a variable if
for the same reason you can't just decide the word "and" means "banana" in English - it would cause chaos.
Our tokenizer needs a dead-simple, rock-solid way to tell the difference between a name you just made up and a word that has a special, reserved meaning.
Separation of Concerns
How do we tackle this? The most brilliant solutions in software are often the simplest, and this is a perfect example. A rookie might try to build two different scanners: one that specifically looks for the keyword "if", another that looks for "for", and then a third, general-purpose one for any other word. That way lies madness and a tangled mess of code.
We're not going to do that. We're going to use a much more elegant strategy.
- First, we'll write a single, simple scanner that recognizes anything that just looks like a word, according to the language's rules. It won't care what the word is, only its shape.
- Once we've scanned the entire word and have it as a string (say, `"myVar"` or `"let"`), we do a final, simple check.
- We'll look up that string in a predefined dictionary of all known keywords.
- If we find a match in our dictionary, we know we have a `KEYWORD` token (like `KEYWORD_LET`).
- If there's no match, then by definition, it must be a regular, user-defined `IDENTIFIER` token.
This is beautiful because it cleanly separates the problem of recognizing the shape of a word from the problem of understanding its special meaning. One simple scanner, one simple lookup. That's it.
A quick note on the rules for a "word's shape": JavaScript is incredibly permissive and allows a huge range of Unicode characters in identifiers so you can name variables in different languages. For our project, that's a level of complexity we just don't need right now. We'll stick to the classic and most common rule that covers 99.9% of code you'll ever see: an identifier must start with a letter (a-z, A-Z), an underscore (_
), or a dollar sign ($
). After that first character, it can be followed by any number of letters, numbers, underscores, or dollar signs.
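In code, that shape rule boils down to two tiny predicates. Here's a sketch - the helper names are just suggestions, and `isDigit()` is assumed from the number scanner:

```javascript
// Can this character begin an identifier? (letter, underscore, or dollar sign)
isAlpha(ch) {
  return (ch >= "a" && ch <= "z") ||
         (ch >= "A" && ch <= "Z") ||
         ch === "_" || ch === "$";
}

// Can this character continue an identifier? (same set, plus digits)
isAlphaNumeric(ch) {
  return this.isAlpha(ch) || this.isDigit(ch);
}
```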
Giving our Compiler a Vocabulary
Time to write the code. We'll build a new specialist method, scanIdentifierOrKeyword()
. Back in your main tokenize()
loop, you'll add a check. After you check for digits, you'll add another one: if the current character can start an identifier (i.e., it's a letter, _
, or $
), you'll call this new method.
1. Create Your Keyword Dictionary
Before we can scan, we need that dictionary. This is our language's official vocabulary list. The perfect data structure for this is a Map
in JavaScript (or a dict
in Python, a HashMap
in Java, you get the idea). It's built for exactly this kind of high-speed lookup.
Go ahead and create this map somewhere accessible in your Tokenizer
. The keys will be the keyword strings ("if"
, "else"
, "return"
), and the values will be their corresponding token types (TokenType.IF
, TokenType.ELSE
, TokenType.RETURN
). You don't need to add all 60+ JavaScript keywords right now. Just start with a dozen of the most common ones to get the system working. I suggest you add the following -

`const`, `let`, `var`, `if`, `else`, `for`, `continue`, `break`, `while`, `switch`, `case`, `default`, `function`, `return`
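As a sketch, that dictionary could be a plain `Map` from keyword string to token-type string - the exact type names are up to you:

```javascript
// The starter vocabulary: keyword string -> token type.
const KEYWORDS = new Map([
  ["const", "CONST"],
  ["let", "LET"],
  ["var", "VAR"],
  ["if", "IF"],
  ["else", "ELSE"],
  ["for", "FOR"],
  ["continue", "CONTINUE"],
  ["break", "BREAK"],
  ["while", "WHILE"],
  ["switch", "SWITCH"],
  ["case", "CASE"],
  ["default", "DEFAULT"],
  ["function", "FUNCTION"],
  ["return", "RETURN"],
]);
```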
2. Implement the Scanning Logic
The muscle memory you built writing scanNumber()
is going to pay off right here. The logic inside your new scanIdentifierOrKeyword()
method is going to feel very familiar.
- First, as always, record your `startIndex`.
- Next, start a loop. The condition will be: while the character I'm `peek()`ing at is a valid character for the rest of an identifier (letters, numbers, `_`, `$`). You'll absolutely want to write a tiny helper function, like `isAlphaNumeric()`, to keep this condition clean and readable.
- Inside the loop, what do we do? The same thing we always do: `advance()`. Consume the character and move on.
- When the loop ends, slice the source string from `startIndex` to the current `currentIndex`. You now have the complete word.
- Now for the punchline. Take that word and look it up as a key in your keyword map.
- If it exists, you've found a keyword! Create a new `Token` using the specific token type you pulled from the map.
- If it doesn't exist in the map, it's just a regular old identifier. Create a generic `IDENTIFIER` token.
- Return the token you created. Mission accomplished.
There's a good chance this addition will break some of your earlier logic. That's totally okay. If you don't find any issues - awesome job. It's common for new features to break existing code, so don't be concerned if that happens here. This is precisely the kind of scenario where Test-Driven Development (TDD) excels, as it helps you find and isolate these bugs as you work.
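For reference, here's roughly what the finished method might look like - a sketch assuming the `isAlphaNumeric()` helper and the `KEYWORDS` map sketched above:

```javascript
scanIdentifierOrKeyword() {
  const startIndex = this.currentIndex;
  const startLine = this.line;
  const startCol = this.col;

  // Consume the rest of the word: letters, digits, '_', '$'.
  while (this.isAlphaNumeric(this.peek())) {
    this.advance();
  }

  const word = this.source.slice(startIndex, this.currentIndex);

  // Keyword if it's in the vocabulary, plain identifier otherwise.
  const type = KEYWORDS.has(word) ? KEYWORDS.get(word) : "IDENTIFIER";
  return new Token(type, word, startLine, startCol);
}
```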
Design Decisions
- Why a Map? You could, in theory, write a giant `switch` statement or a horrible chain of `if/else if` statements. A `Map` is the professional's choice. It's cleaner, infinitely easier to read, and you can add new keywords later without touching your core logic. Plus, with its O(1) average lookup time, it's screamingly fast.
- Case Sensitivity Comes for Free. JavaScript is a case-sensitive language. The keyword is `let`, but `Let` is just a valid variable name. How do we handle this? The beautiful part of our map-based approach is that we don't have to. The lookup `keywords.get("Let")` will simply fail because our map only contains the lowercase `"let"`. The logic correctly classifies `Let` as an identifier with zero extra effort from us. It's always a great feeling when a clean design choice solves problems for you automatically.
- A Glimpse into the Deep End - Contextual Keywords. This is a more advanced topic, but it's interesting. In some languages, a word might only be a keyword in a certain situation. A great example comes from Python, where `match` is a "soft keyword." It's only treated as a keyword when it begins a pattern matching block. Outside of that context, `match` is a perfectly valid variable name. So, how would you handle that? The simple and totally acceptable answer for our compiler is to just treat `match` as a keyword everywhere. The more complex answer is to have the tokenizer always emit an `IDENTIFIER` for `match` and let the parser (the next stage) be smart enough to check the context. We're going to keep it simple and treat all our keywords as reserved everywhere. It's the right call for this project.
You've just given your compiler a vocabulary. It can now tell the difference between the names you invent and the special words that give the language its fundamental structure. Next up, we'll handle the punctuation that ties it all together.
Section 1.4
If you've followed along this far in one go, I recommend taking a small break. Get up and walk for 1-2 minutes before continuing.
You just handled numbers and identifiers, which contain some of the trickiest logic in the entire tokenizer. We're past the hard part. What's left is the grammatical glue that holds everything together: the operators and punctuation.
I'm talking about characters like +
, -
, =
, and the structural markers like (
, )
, {
, and }
. These are the symbols that form the skeleton of our code. Recognizing them is often the most straightforward part of tokenization, but doing it cleanly and correctly is absolutely essential for the parser we'll build next.
Our Tokenizer is a Specialist, Not a Philosopher
From our tokenizer's point of view, there is absolutely no difference between an "operator" like +
and a piece of "punctuation" like (
. Its job isn't to understand meaning; its job is to recognize shapes.
When it sees the +
character, it doesn't ask, "Is this for adding numbers or for concatenating strings?" That's a deep, philosophical question for another part of our compiler. Our tokenizer's job is simple: see the character +
, create a token with a very specific type like PLUS
, and move on. All the heavy lifting of figuring out what that +
actually means in context belongs to the parser.
For now, we're only going to handle single-character tokens. A multi-character operator like ==
or !=
is a challenge for another day. Right now, our tokenizer will see ==
and simply produce two separate =
tokens, one after the other. And you know what? That is perfectly fine for this stage. We'll add the logic to recognize two-character tokens later. One step at a time.
Implementation: Switch Statement
This will be the most direct and satisfying code you'll write in this whole chapter. When your problem is "I have a single, unique character, and I need to map it to a specific outcome," the switch
statement is the perfect tool. It's practically designed for this exact task.
Here's what you're going to do. Back in your main tokenize()
loop, this logic should come after your checks for numbers and identifiers. This is the final section that handles everything else.
- You'll write a `switch` statement that operates on the current character you get from `peek()`.
- Create a `case` for every single-character token you want to support. I'm talking `+`, `-`, `*`, `/`, `=`, `(`, `)`, `{`, `}`, `[`, `]`, `;`, `,`, `.`, and so on. Add a bunch of them.
- Inside each `case`, the job is dead simple. You'll create a new `Token` with the correct type (e.g., `TokenType.PLUS`) and value (e.g., `"+"`), add it to your list of tokens, and then - and this is the part people always forget - you absolutely must call `advance()` to move past the character you just processed. If you forget, you get an infinite loop, and nobody wants that. (If you already have a `this.advance()` call somewhere at the end of the loop, that will spare you this headache.)
- Finally, you must include a `default` case in your switch. This is your safety net. This is what makes your tokenizer robust. If it encounters a character it doesn't recognize - like `~` or `^` - it won't crash. Instead, it will hit the `default` case, where you will create and add an `ILLEGAL` token. This is professional-grade design. It allows our compiler to report all the syntax errors in a file in a single pass instead of giving up on the first one. (A condensed sketch of this switch follows below.)
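Here's a condensed sketch of that switch - just a handful of cases plus the safety net. Note that this version consumes the character up front with `advance()` instead of inside each case; extend it with the rest of the punctuation:

```javascript
// Inside the main tokenize() loop, after the digit and identifier checks.
// `tokens` is the array the loop is collecting into.
const startLine = this.line;
const startCol = this.col;
const c = this.advance(); // consume the character we're about to classify

switch (c) {
  case "+": tokens.push(new Token("PLUS", c, startLine, startCol)); break;
  case "-": tokens.push(new Token("MINUS", c, startLine, startCol)); break;
  case "*": tokens.push(new Token("STAR", c, startLine, startCol)); break;
  case "/": tokens.push(new Token("SLASH", c, startLine, startCol)); break;
  case "=": tokens.push(new Token("EQUALS", c, startLine, startCol)); break;
  case "(": tokens.push(new Token("LPAREN", c, startLine, startCol)); break;
  case ")": tokens.push(new Token("RPAREN", c, startLine, startCol)); break;
  case "{": tokens.push(new Token("LBRACE", c, startLine, startCol)); break;
  case "}": tokens.push(new Token("RBRACE", c, startLine, startCol)); break;
  case ";": tokens.push(new Token("SEMICOLON", c, startLine, startCol)); break;
  case ",": tokens.push(new Token("COMMA", c, startLine, startCol)); break;
  default:  tokens.push(new Token("ILLEGAL", c, startLine, startCol)); break;
}
```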
Why Not Just Use a Map or Object?
Look, I see where you're going: why not create a lookup table like {'+': TokenType.PLUS, '-': TokenType.MINUS, ...}
and just check if the character exists? Here's the thing: that approach can work, and in some languages it's actually pretty clean. In Python or JavaScript, a dictionary/object lookup is fast and readable.
In C++ or Rust, you could use a hash map or even a fixed-size array indexed by character code for blazing speed. So why am I steering you toward a switch statement instead? Two reasons.
First, the switch makes your tokenizer's logic crystal clear - each case explicitly shows what happens for that character, making debugging easy.
Second, and more importantly, you're about to need lookahead for multi-character operators like ==
, <=
, or //
comments. When you hit a =
, you'll need to peek at the next character to decide if it's ASSIGN
or EQUAL
. That conditional logic fits naturally inside a switch case. With a lookup table, you'd need separate handling for these cases anyway, so you end up with a hybrid approach that's messier than just using a switch from the start.
If you want to use a map for simple single-character tokens and handle everything else separately, that's valid - just know you'll be maintaining two different systems.
Design Decisions
Pay close attention to this part. These decisions seem small now, but they will have a massive impact on how enjoyable (or painful) it is to write the next stage of our compiler.
- Why recognize operators here? We do it in the tokenizer because it's a "context-free" job. To know that `+` is a plus sign, you don't need to know anything else about the code around it. This perfectly follows our core design principle: separation of concerns. The tokenizer is the expert on character-level shapes; the parser will be the expert on grammar and structure. We let each part do what it's best at.
- Token Type Granularity. I cannot stress this enough. It is incredibly tempting to be lazy here and create one generic `OPERATOR` token type. Resist this temptation with every fiber of your being. Create specific, granular types: `PLUS`, `MINUS`, `SLASH`, `STAR`, `LPAREN` (for Left Parenthesis), `RPAREN`, and so on.
Why am I so passionate about this? Because I'm trying to save you from a world of pain later. If you use granular types, your future parser's code will be clean and readable, asking simple boolean questions like, "Is the next token a LPAREN
?" If you cheap out and use a generic OPERATOR
type, your parser will become a convoluted mess, forced to ask ugly questions like, "Is the next token an OPERATOR
and is its string value equal to '('
?". One of these paths leads to clean, maintainable code. The other leads to spaghetti. Choose wisely.
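If you want one central home for those granular names, a simple frozen object of string constants does the job - a sketch, and an enum or plain constants work just as well in other languages:

```javascript
// One specific name per token shape - no generic "OPERATOR" bucket.
const TokenType = Object.freeze({
  NUMBER: "NUMBER",
  IDENTIFIER: "IDENTIFIER",
  PLUS: "PLUS",
  MINUS: "MINUS",
  STAR: "STAR",
  SLASH: "SLASH",
  EQUALS: "EQUALS",
  LPAREN: "LPAREN",
  RPAREN: "RPAREN",
  LBRACE: "LBRACE",
  RBRACE: "RBRACE",
  SEMICOLON: "SEMICOLON",
  COMMA: "COMMA",
  ILLEGAL: "ILLEGAL",
  EOF: "EOF",
});
```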
Our tokenizer is getting seriously smart now. It understands numbers, it has a vocabulary of keywords and identifiers, and now it recognizes all the punctuation that holds it together. But it's still blind to one last thing: the invisible characters that give our code its layout and are the absolute key to great error reporting.
Section 1.5
You may have already built this part, i.e. whitespace and newline tracking, back in Section 1.1. Feel free to skim this section if you know your implementation is solid.
Whitespace - spaces, tabs, newlines - is completely ignored by the JavaScript engine when it runs your code. But for us, the compiler writers, it is critically important. Correctly handling whitespace, especially newlines, is important for tracking our position accurately.
Concepts: The Invisible Characters
The main thing to know is that whitespace isn't just spaces. We have to account for tabs (\t
) and, most importantly, the different ways operating systems represent a new line. Unix-style systems (like macOS and Linux) use a single character, the Line Feed (\n
). Windows, traditionally, uses two characters: a Carriage Return followed by a Line Feed (\r\n
). Our tokenizer needs to handle all of these gracefully.
Our goal is simple but has two parts:
- We need to skip over all whitespace characters. They should not become tokens.
- While doing so, we must use the newline characters to accurately update our `line` and `col` properties.
Implementation Requirements
The perfect place to handle this is at the very top of the main loop in your tokenize()
method. Before you even think about checking for digits, letters, or operators, you should have a dedicated loop that just consumes any and all whitespace.
Here's how to build it:
- Use the main `while` loop inside your `tokenize()` method, or create a `while` loop that continues as long as the current character is a whitespace character (you can make a little helper `isWhitespace()` for this that checks for space, `\t`, `\n`, and `\r`).
- Inside the loop, check if the character you are about to consume is a newline (`\n`). If it is, you must do two things: increment `line` and, crucially, reset `col` back to 1.
- After the check, simply call `advance()` to move past the whitespace character.
This simple logic elegantly handles the Windows \r\n
situation without any extra code. The loop will see the \r
, treat it as whitespace, and advance()
past it. On its next iteration, it will see the \n
, update the line and column counters correctly, and then advance()
past that. It just works.
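As a sketch, that whitespace pass could live in its own little method, called at the top of the main loop, using the `isWhitespace()` helper suggested above:

```javascript
// Hypothetical helper: is this a character we should silently skip?
isWhitespace(ch) {
  return ch === " " || ch === "\t" || ch === "\n" || ch === "\r";
}

// Call this.skipWhitespace() at the top of each tokenize() loop iteration.
skipWhitespace() {
  while (this.isWhitespace(this.peek())) {
    if (this.peek() === "\n") {
      this.advance(); // consume the newline...
      this.line++;    // ...bump the line counter...
      this.col = 1;   // ...and reset the column
    } else {
      this.advance(); // spaces, tabs, and '\r' are simply skipped
    }
  }
}
```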
After this whitespace-consuming loop finishes, you have a powerful guarantee: the very next character you peek()
at is guaranteed to be the start of a meaningful token.
Design Decisions
- Zero vs. One-Based Indexing? You've noticed we're using 1-based indexing for lines and columns. This is a deliberate, user-focused decision. Every text editor, IDE, and developer tool on the planet shows line 1, column 1 as the very beginning of a file. While we programmers are comfortable with 0-based arrays, our compiler's error messages should speak a human language.
- Source Map Generation? In web development, you're often running code in the browser that has been transpiled from TypeScript or a newer version of JavaScript. Source maps are the magic that lets you debug that code by showing you your original source, not the transpiled mess. Those source maps are complex files that essentially contain a giant list of mappings: "the code at this position in the generated file corresponds to that position in the original file." The hyper-accurate line and column information you are tracking right now is the absolute, non-negotiable prerequisite for generating a source map. You're building a professional-grade feature without even knowing it.
You've Built a Lexer!
Stop for a moment and appreciate what you've just done. You have built the foundational component of every compiler, interpreter, linter, and syntax highlighter in existence. You've taught a program how to read code. It understands numbers in their various forms, it knows the difference between a reserved keyword and a variable name you made up, it recognizes a whole suite of operators, and it intelligently handles the invisible structure of whitespace to keep track of its location. You have built a complete, robust lexical analyzer.
The stream of tokens you can now produce is the fuel for the next grand stage: Parsing. In the next chapter, we're going to take this flat list of "words" and teach our compiler how to understand "sentences" by building a parser that transforms them into a beautiful, hierarchical Abstract Syntax Tree.
Challenges for the Eager
Feeling confident? Here are a few ways to make your tokenizer even more powerful. Don't worry, we'll build everything in the upcoming chapter.
-
Extend your whitespace-skipping loop to recognize and ignore single-line (
// ...
) and multi-line (/* ... */
) comments. -
Implement a
scanString()
method. It should be triggered by a"
or'
and consume everything until it finds the matching closing quote. Don't forget to handle escaped quotes like\"
inside the string! -
This is a fun one. Modify your operator logic to handle two-character tokens like
==
,!=
,+=
, and=>
. (Hint: this is where yourpeek()
method becomes your absolute best friend. If you see an=
, you'll need topeek()
ahead to see if the next character is also an=
).