Lexical Analysis Specification
This document defines the lexical analysis phase of the Phaser compiler, detailing token types, lexical rules, and implementation requirements.
This is a compiler implementation document. For language design and user-facing features, see the docs directory.
Overview
The lexer transforms source text into a stream of tokens, handling:
- Token recognition and classification
- Source position tracking for error reporting
- Comment processing and whitespace handling
- String and numeric literal parsing
- Keyword identification
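As a concrete illustration, the first test case at the end of this document, let x = 42;, comes out of the lexer roughly as the following token kinds (spans and trivia omitted; the names are the enum variants defined in the next section):
let x = 42;
// => Keyword(Let), Identifier("x"), Operator(Assign),
//    IntegerLiteral(Untyped(42)), Delimiter(Semicolon), Eof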
Token Types
Core Token Categories
#[derive(Debug, Clone, PartialEq)]
pub enum TokenType {
// Literals
IntegerLiteral(IntegerValue),
FloatLiteral(f64),
StringLiteral(String),
CharLiteral(char),
BooleanLiteral(bool),
// Identifiers and Keywords
Identifier(String),
Keyword(Keyword),
// Operators
Operator(Operator),
// Delimiters
Delimiter(Delimiter),
// Special
Newline,
Eof,
// Comments (preserved for documentation tools)
LineComment(String),
BlockComment(String),
}
#[derive(Debug, Clone, PartialEq)]
pub enum IntegerValue {
Int8(i8), Int16(i16), Int32(i32), Int64(i64), Int128(i128),
UInt8(u8), UInt16(u16), UInt32(u32), UInt64(u64), UInt128(u128),
Isize(isize), Usize(usize),
Untyped(u128), // For literals without explicit type
}
Keywords
#[derive(Debug, Clone, PartialEq)]
pub enum Keyword {
// Control Flow
If, Else, Match, While, For, Loop, Break, Continue, Return,
// Declarations
Fn, Let, Mutable, Const, Static, Type, Struct, Enum, Trait, Impl,
// Modules and Visibility
Module, Use, Public, Private, Extern, Self_, Super, Final,
// Types
Int8, Int16, Int32, Int64, Int128, Isize,
UInt8, UInt16, UInt32, UInt64, UInt128, Usize,
F32, F64, Bool, Char, Str,
// Memory and Safety
Unsafe, Ref, Move,
// Async/Await
// Async, Await,
// Meta-programming
Meta,
// Literals
True, False,
// Special
Where, As, In, Virtual,
}
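Keyword identification typically happens after an identifier has been scanned: the lexeme is looked up in a table and reclassified if it matches. The sketch below shows the shape of that table for a handful of entries; the surface spellings are assumed to be the lower-case variant names, and mapping true/false to BooleanLiteral rather than Keyword tokens is one possible choice (both representations exist above).
// Sketch only: reclassifies a scanned identifier lexeme. Only a subset of the
// table is shown; the real lexer would cover every Keyword variant.
fn classify_identifier(ident: String) -> TokenType {
    match ident.as_str() {
        "if" => TokenType::Keyword(Keyword::If),
        "else" => TokenType::Keyword(Keyword::Else),
        "fn" => TokenType::Keyword(Keyword::Fn),
        "let" => TokenType::Keyword(Keyword::Let),
        "return" => TokenType::Keyword(Keyword::Return),
        "struct" => TokenType::Keyword(Keyword::Struct),
        "true" => TokenType::BooleanLiteral(true),
        "false" => TokenType::BooleanLiteral(false),
        // ... remaining keywords follow the same pattern
        _ => TokenType::Identifier(ident),
    }
}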
Operators
#[derive(Debug, Clone, PartialEq)]
pub enum Operator {
// Arithmetic
Plus, Minus, Star, Slash, Percent,
// Comparison
Equal, NotEqual, Less, Greater, LessEqual, GreaterEqual,
// Logical
And, Or, Not,
// Bitwise
BitAnd, BitOr, BitXor, LeftShift, RightShift,
// Assignment
Assign,
PlusAssign, MinusAssign, StarAssign, SlashAssign, PercentAssign,
BitAndAssign, BitOrAssign, BitXorAssign, LeftShiftAssign, RightShiftAssign,
// Special
Arrow, // ->
FatArrow, // =>
DoubleColon, // ::
Dot, // .
DotDot, // ..
DotDotDot, // ...
Question, // ?
At, // @
}
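Several operators share a prefix (- vs -> vs -=, . vs .. vs ...), so the scanner needs one or two characters of lookahead and must take the longest match. A sketch of that decision, using a hypothetical Cursor helper over the source text:
// Sketch only: longest-match ("maximal munch") scanning for operators that
// share a prefix. Cursor is an assumed helper, not part of the specification.
struct Cursor<'a> {
    chars: std::str::Chars<'a>,
}

impl<'a> Cursor<'a> {
    fn new(src: &'a str) -> Self { Cursor { chars: src.chars() } }
    fn peek(&self, n: usize) -> Option<char> { self.chars.clone().nth(n) }
    fn bump(&mut self) -> Option<char> { self.chars.next() }

    // Called after consuming '-': decides between Minus, Arrow (->) and MinusAssign (-=).
    fn scan_minus(&mut self) -> Operator {
        match self.peek(0) {
            Some('>') => { self.bump(); Operator::Arrow }
            Some('=') => { self.bump(); Operator::MinusAssign }
            _ => Operator::Minus,
        }
    }

    // Called after consuming '.': decides between Dot, DotDot (..) and DotDotDot (...).
    fn scan_dot(&mut self) -> Operator {
        match (self.peek(0), self.peek(1)) {
            (Some('.'), Some('.')) => { self.bump(); self.bump(); Operator::DotDotDot }
            (Some('.'), _) => { self.bump(); Operator::DotDot }
            _ => Operator::Dot,
        }
    }
}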
Delimiters
#[derive(Debug, Clone, PartialEq)]
pub enum Delimiter {
LeftParen, RightParen, // ( )
LeftBracket, RightBracket, // [ ]
LeftBrace, RightBrace, // { }
Comma, Semicolon, Colon, // , ; :
}
Token Structure
#[derive(Debug, Clone, PartialEq)]
pub struct Token {
pub token_type: TokenType,
pub span: Span,
pub leading_trivia: Vec<Trivia>,
pub trailing_trivia: Vec<Trivia>,
}
#[derive(Debug, Clone, PartialEq)]
pub struct Span {
pub start: Position,
pub end: Position,
pub source_id: SourceId,
}
#[derive(Debug, Clone, PartialEq)]
pub struct Position {
pub line: u32, // 1-based
pub column: u32, // 1-based
pub offset: u32, // 0-based byte offset
}
#[derive(Debug, Clone, PartialEq)]
pub enum Trivia {
Whitespace(String),
LineComment(String),
BlockComment(String),
}
Lexical Rules
Identifiers
- Start with letter or underscore
- Followed by letters, digits, or underscores
- Case-sensitive
- Cannot be keywords (use raw identifiers, e.g. @keyword, for edge cases)
- TODO: extend to support Unicode identifiers and emoji
identifier := [a-zA-Z_][a-zA-Z0-9_]*
raw_identifier := r#[a-zA-Z_][a-zA-Z0-9_]*
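A minimal identifier scanner for these rules might look like the sketch below (ASCII-only, matching the current grammar). The resulting lexeme would then go through the keyword table shown under Keywords to decide between Identifier and Keyword tokens.
// Sketch only: scans an identifier (or keyword) lexeme at the start of `src`,
// returning the lexeme and its byte length, or None if `src` does not start
// with an identifier.
fn scan_identifier(src: &str) -> Option<(&str, usize)> {
    let mut chars = src.char_indices();
    match chars.next() {
        Some((_, c)) if c.is_ascii_alphabetic() || c == '_' => {}
        _ => return None,
    }
    let end = chars
        .find(|&(_, c)| !(c.is_ascii_alphanumeric() || c == '_'))
        .map(|(i, _)| i)
        .unwrap_or(src.len());
    Some((&src[..end], end))
}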
Integer Literals
decimal := [0-9][0-9_]*
hexadecimal := 0[xX][0-9a-fA-F][0-9a-fA-F_]*
binary := 0[bB][01][01_]*
octal := 0[oO][0-7][0-7_]*
integer_suffix := (i8|i16|i32|i64|i128|isize|u8|u16|u32|u64|u128|usize)?
integer_literal := (decimal|hexadecimal|binary|octal) integer_suffix
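A sketch of the conversion from an already-scanned integer lexeme to an IntegerValue, covering base prefixes, underscore separators, suffix detection, and overflow reporting; Position and the LexError variants are defined later in this document, and only one suffixed case is spelled out.
// Sketch only: `lexeme` is the full literal text, `position` its start.
fn parse_integer(lexeme: &str, position: Position) -> Result<IntegerValue, LexError> {
    // Split off a type suffix, if present (longest suffixes first).
    let suffixes = ["i128", "i64", "i32", "i16", "i8", "isize",
                    "u128", "u64", "u32", "u16", "u8", "usize"];
    let suffix = suffixes.iter().find(|s| lexeme.ends_with(**s)).copied();
    let body = &lexeme[..lexeme.len() - suffix.map_or(0, str::len)];

    // The base prefix selects the radix; underscores carry no value.
    let (digits, radix) = match body.get(..2) {
        Some("0x") | Some("0X") => (&body[2..], 16),
        Some("0b") | Some("0B") => (&body[2..], 2),
        Some("0o") | Some("0O") => (&body[2..], 8),
        _ => (body, 10),
    };
    let cleaned: String = digits.chars().filter(|&c| c != '_').collect();

    let value = u128::from_str_radix(&cleaned, radix).map_err(|_| {
        LexError::InvalidNumericLiteral { literal: lexeme.to_string(), position: position.clone() }
    })?;

    Ok(match suffix {
        Some("i32") => IntegerValue::Int32(i32::try_from(value).map_err(|_| {
            LexError::IntegerOverflow { literal: lexeme.to_string(), position }
        })?),
        // ... each remaining suffix maps to its IntegerValue variant the same way
        _ => IntegerValue::Untyped(value),
    })
}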
Float Literals
decimal_float := [0-9][0-9_]* \. [0-9][0-9_]* exponent?
| [0-9][0-9_]* exponent
exponent := [eE][+-]?[0-9][0-9_]*
float_suffix := (f32|f64)?
float_literal := decimal_float float_suffix
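Two details the implementation has to handle: a '.' only begins a fractional part when a digit follows (otherwise it belongs to the Dot / DotDot / DotDotDot operators, so that 1..10 lexes as two integers around a range operator), and underscores plus the optional suffix must be stripped before conversion. A sketch of both pieces:
// Sketch only: lookahead check for whether '.' starts a fraction.
fn dot_starts_fraction(peek0: Option<char>, peek1: Option<char>) -> bool {
    peek0 == Some('.') && peek1.map_or(false, |c| c.is_ascii_digit())
}

// Sketch only: converts a scanned float lexeme into the f64 payload of
// TokenType::FloatLiteral. If the f32/f64 suffix matters later, it would have
// to be recorded separately, since the token stores only an f64 value.
fn parse_float(lexeme: &str, position: Position) -> Result<f64, LexError> {
    let body = lexeme
        .strip_suffix("f32")
        .or_else(|| lexeme.strip_suffix("f64"))
        .unwrap_or(lexeme);
    let cleaned: String = body.chars().filter(|&c| c != '_').collect();
    cleaned.parse::<f64>().map_err(|_| LexError::InvalidNumericLiteral {
        literal: lexeme.to_string(),
        position,
    })
}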
String Literals
string_literal := " string_content* "
string_content := escape_sequence | [^"\\]
escape_sequence := \\ ( n | t | r | \\ | " | ' | 0 | x[0-9a-fA-F]{2} | u{[0-9a-fA-F]{1,6}} )
Supported escape sequences:
- \n - newline
- \t - tab
- \r - carriage return
- \\ - backslash
- \" - double quote
- \' - single quote
- \0 - null character
- \x## - ASCII character (hex)
- \u{######} - Unicode character (hex)
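A sketch of escape decoding against this table. `chars` is positioned just after the backslash, `position` is the backslash's position, and failures map onto the LexError variants defined under Error Handling below; the \u{...} delimiter and length checks are simplified.
// Sketch only.
fn decode_escape(chars: &mut std::str::Chars<'_>, position: Position) -> Result<char, LexError> {
    let invalid = |sequence: String, position: Position| LexError::InvalidEscapeSequence {
        sequence,
        position,
    };
    match chars.next() {
        Some('n') => Ok('\n'),
        Some('t') => Ok('\t'),
        Some('r') => Ok('\r'),
        Some('\\') => Ok('\\'),
        Some('"') => Ok('"'),
        Some('\'') => Ok('\''),
        Some('0') => Ok('\0'),
        Some('x') => {
            // \x##: exactly two hex digits, limited to the ASCII range.
            let hex: String = chars.take(2).collect();
            u8::from_str_radix(&hex, 16)
                .ok()
                .filter(|b| b.is_ascii())
                .map(|b| b as char)
                .ok_or_else(|| invalid(format!("\\x{}", hex), position))
        }
        Some('u') => {
            // \u{...}: one to six hex digits naming a Unicode scalar value.
            let hex: String = chars
                .skip_while(|&c| c == '{')
                .take_while(|&c| c != '}')
                .collect();
            u32::from_str_radix(&hex, 16)
                .ok()
                .and_then(char::from_u32)
                .ok_or(LexError::InvalidUnicodeEscape {
                    escape: format!("\\u{{{}}}", hex),
                    position,
                })
        }
        other => Err(invalid(other.map(String::from).unwrap_or_default(), position)),
    }
}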
Character Literals
char_literal := ' char_content '
char_content := escape_sequence | [^'\\]
Comments
line_comment := // [^\n]* \n?
block_comment := /* ( block_comment | [^*] | \*[^/] )* */
Block comments can be nested.
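Because block comments nest, the scanner must track a depth counter rather than stopping at the first */. A sketch, assuming a Peekable character iterator positioned just after the opening /*:
// Sketch only: `start` is the position of the opening "/*", used for error reporting.
fn skip_block_comment(
    chars: &mut std::iter::Peekable<std::str::Chars<'_>>,
    start: Position,
) -> Result<(), LexError> {
    let mut depth = 1usize;
    while let Some(c) = chars.next() {
        match c {
            '/' if chars.peek() == Some(&'*') => {
                chars.next();
                depth += 1; // a nested comment opens
            }
            '*' if chars.peek() == Some(&'/') => {
                chars.next();
                depth -= 1; // a comment closes
                if depth == 0 {
                    return Ok(());
                }
            }
            _ => {}
        }
    }
    Err(LexError::UnterminatedBlockComment { start })
}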
Lexer Implementation Requirements
Error Handling
The lexer must produce PhaserResult<Token> and handle the following error conditions:
#[derive(Debug, Clone, PartialEq)]
pub enum LexError {
UnexpectedCharacter { char: char, position: Position },
UnterminatedString { start: Position },
UnterminatedBlockComment { start: Position },
InvalidEscapeSequence { sequence: String, position: Position },
InvalidNumericLiteral { literal: String, position: Position },
InvalidUnicodeEscape { escape: String, position: Position },
IntegerOverflow { literal: String, position: Position },
}
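PhaserResult is defined elsewhere in the compiler; for the sketches in this document it can be read as a Result whose error side carries (or can be built from) LexError. The alias and the example error site below are assumptions, not the real definitions.
// Assumption: a plain Result alias standing in for the compiler's real error plumbing.
pub type PhaserResult<T> = Result<T, LexError>;

// Example error site: end of input reached while still inside a string
// literal that opened at `start`.
fn unterminated_string(start: Position) -> PhaserResult<Token> {
    Err(LexError::UnterminatedString { start })
}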
Position Tracking
- Maintain accurate line/column information
- Handle different line ending styles (LF, CRLF, CR)
- Track byte offsets for efficient source mapping
- Support Unicode characters correctly
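A sketch of the per-character position update that covers these points; the CRLF pair counts as a single line break by deferring the bump to the LF, and columns advance one per character, which is a simplification for wide or combining Unicode characters.
// Sketch only: `next` is the character just consumed, `peek` the one after it.
fn advance_position(pos: &mut Position, next: char, peek: Option<char>) {
    pos.offset += next.len_utf8() as u32; // 0-based byte offset, not a char count
    match next {
        '\n' => {
            pos.line += 1;
            pos.column = 1;
        }
        '\r' if peek == Some('\n') => {
            // CRLF: the following '\n' performs the line bump.
        }
        '\r' => {
            // Bare CR also ends a line.
            pos.line += 1;
            pos.column = 1;
        }
        _ => pos.column += 1,
    }
}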
Performance Considerations
- Use efficient string scanning techniques
- Minimize allocations during tokenization
- Consider using string interning for identifiers
- Implement lookahead efficiently for multi-character operators
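The token types above store owned Strings; interning identifier text is one way to cut those allocations, at the cost of changing Identifier(String) to carry a small symbol id instead (which the enum above does not currently specify, so treat this as an option rather than the design).
use std::collections::HashMap;

// Sketch only: a minimal interner mapping strings to dense u32 symbols.
#[derive(Default)]
pub struct Interner {
    map: HashMap<String, u32>,
    strings: Vec<String>,
}

impl Interner {
    pub fn intern(&mut self, s: &str) -> u32 {
        if let Some(&id) = self.map.get(s) {
            return id;
        }
        let id = self.strings.len() as u32;
        self.strings.push(s.to_string());
        self.map.insert(s.to_string(), id);
        id
    }

    pub fn resolve(&self, id: u32) -> &str {
        &self.strings[id as usize]
    }
}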
Trivia Handling
- Preserve whitespace and comments as trivia
- Attach trivia to appropriate tokens
- Support documentation comment extraction
- Handle mixed whitespace/comment sequences
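The enums above allow comments to appear both as tokens and as trivia. One possible policy, sketched below, is for a later pass to fold comment tokens into the leading trivia of the next significant token; documentation tooling would then read the leading trivia of item-starting tokens.
// Sketch only: buffers comment tokens until a significant token arrives and
// attaches them as its leading trivia. Whitespace trivia is assumed to be
// attached by the scanner itself.
fn next_significant(raw: &mut impl Iterator<Item = Token>) -> Option<Token> {
    let mut leading = Vec::new();
    for mut tok in raw {
        match tok.token_type {
            TokenType::LineComment(ref text) => leading.push(Trivia::LineComment(text.clone())),
            TokenType::BlockComment(ref text) => leading.push(Trivia::BlockComment(text.clone())),
            _ => {
                tok.leading_trivia.extend(leading);
                return Some(tok);
            }
        }
    }
    None
}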
Integration with Parser
The lexer provides a token stream interface:
pub trait TokenStream {
fn next_token(&mut self) -> PhaserResult<Token>;
fn peek_token(&mut self) -> PhaserResult<&Token>;
fn current_position(&self) -> Position;
fn is_at_end(&self) -> bool;
}
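A sketch of how the parser side might drain the stream, assuming next_token returns the Eof token once the input is exhausted and using the PhaserResult reading from the Error Handling section:
// Sketch only.
fn collect_tokens(lexer: &mut dyn TokenStream) -> PhaserResult<Vec<Token>> {
    let mut tokens = Vec::new();
    loop {
        let token = lexer.next_token()?;
        let done = token.token_type == TokenType::Eof;
        tokens.push(token);
        if done {
            return Ok(tokens);
        }
    }
}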
Testing Requirements
Unit Tests Required
- All token types recognition
- All escape sequences
- Numeric literal parsing (all bases and suffixes)
- Error conditions and recovery
- Position tracking accuracy
- Unicode handling
- Nested comment parsing
- Trivia attachment correctness
Test Cases
// Basic tokens
let x = 42;
fn main() -> i32 { return 0; }
// Numeric literals
0x1A2B_u32
0b1010_1010_i8
123.456_f64
1e-10_f32
// String literals
"Hello, world!"
"Line 1\nLine 2"
"Unicode: \u{1F680}"
// Comments
// Line comment
/* Block comment */
/* Nested /* comment */ */
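These inputs translate into unit tests along the following lines; tokenize is a hypothetical top-level entry point returning every token (including a trailing Eof) or the first error.
// Sketch only.
#[test]
fn lexes_simple_let_binding() {
    let tokens = tokenize("let x = 42;").expect("lexing should succeed");
    let kinds: Vec<&TokenType> = tokens.iter().map(|t| &t.token_type).collect();
    assert!(matches!(kinds[0], TokenType::Keyword(Keyword::Let)));
    assert!(matches!(kinds[1], TokenType::Identifier(name) if name == "x"));
    assert!(matches!(kinds[2], TokenType::Operator(Operator::Assign)));
    assert!(matches!(kinds[3], TokenType::IntegerLiteral(IntegerValue::Untyped(42))));
    assert!(matches!(kinds[4], TokenType::Delimiter(Delimiter::Semicolon)));
    assert!(matches!(kinds[5], TokenType::Eof));
}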
Future Extensions
- Raw string literals: r"no escapes here"
- Byte string literals: b"bytes"
- Format string literals: f"Hello {name}"
- Custom numeric literal suffixes for user types