Lexical Analysis Specification

This document defines the lexical analysis phase of the Phaser compiler, detailing token types, lexical rules, and implementation requirements.

This is a compiler implementation document. For language design and user-facing features, see the docs directory.

Overview

The lexer transforms source text into a stream of tokens, handling:

Token recognition and classification
Source position tracking for error reporting
Comment processing and whitespace handling
String and numeric literal parsing
Keyword identification

Token Types

Core Token Categories

#[derive(Debug, Clone, PartialEq)]
pub enum TokenType {
    // Literals
    IntegerLiteral(IntegerValue),
    FloatLiteral(f64),
    StringLiteral(String),
    CharLiteral(char),
    BooleanLiteral(bool),
  
    // Identifiers and Keywords
    Identifier(String),
    Keyword(Keyword),
  
    // Operators
    Operator(Operator),
  
    // Delimiters
    Delimiter(Delimiter),
  
    // Special
    Newline,
    Eof,
  
    // Comments (preserved for documentation tools)
    LineComment(String),
    BlockComment(String),
}
 
#[derive(Debug, Clone, PartialEq)]
pub enum IntegerValue {
    Int8(i8), Int16(i16), Int32(i32), Int64(i64), Int128(i128),
    UInt8(u8), UInt16(u16), UInt32(u32), UInt64(u64), UInt128(u128),
    Isize(isize), Usize(usize),
    Untyped(u128), // For literals without explicit type
}

Keywords

#[derive(Debug, Clone, PartialEq)]
pub enum Keyword {
    // Control Flow
    If, Else, Match, While, For, Loop, Break, Continue, Return,
  
    // Declarations
    Fn, Let, Mutable, Const, Static, Type, Struct, Enum, Trait, Impl,
  
    // Modules and Visibility
    Module, Use, Public, Private, Extern, Self_, Super, Final,
  
    // Types
    Int8, Int16, Int32, Int64, Int128, Isize,
    UInt8, UInt16, UInt32, UInt64, UInt128, Usize,
    F32, F64, Bool, Char, Str,
  
    // Memory and Safety
    Unsafe, Ref, Move,
  
    // Async/Await
    // Async, Await,
  
    // Meta-programming
    Meta,
  
    // Literals
    True, False,
  
    // Special
    Where, As, In, Virtual,
}

Operators

#[derive(Debug, Clone, PartialEq)]
pub enum Operator {
    // Arithmetic
    Plus, Minus, Star, Slash, Percent,
  
    // Comparison
    Equal, NotEqual, Less, Greater, LessEqual, GreaterEqual,
  
    // Logical
    And, Or, Not,
  
    // Bitwise
    BitAnd, BitOr, BitXor, LeftShift, RightShift,
  
    // Assignment
    Assign,
    PlusAssign, MinusAssign, StarAssign, SlashAssign, PercentAssign,
    BitAndAssign, BitOrAssign, BitXorAssign, LeftShiftAssign, RightShiftAssign,
  
    // Special
    Arrow,        // ->
    FatArrow,     // =>
    DoubleColon,  // ::
    Dot,          // .
    DotDot,       // ..
    DotDotDot,    // ...
    Question,     // ?
    At,           // @
}

Delimiters

#[derive(Debug, Clone, PartialEq)]
pub enum Delimiter {
    LeftParen, RightParen,       // ( )
    LeftBracket, RightBracket,   // [ ]
    LeftBrace, RightBrace,       // { }
    Comma, Semicolon, Colon,     // , ; :
}

Token Structure

#[derive(Debug, Clone, PartialEq)]
pub struct Token {
    pub token_type: TokenType,
    pub span: Span,
    pub leading_trivia: Vec<Trivia>,
    pub trailing_trivia: Vec<Trivia>,
}
 
#[derive(Debug, Clone, PartialEq)]
pub struct Span {
    pub start: Position,
    pub end: Position,
    pub source_id: SourceId,
}
 
#[derive(Debug, Clone, PartialEq)]
pub struct Position {
    pub line: u32,    // 1-based
    pub column: u32,  // 1-based
    pub offset: u32,  // 0-based byte offset
}
 
#[derive(Debug, Clone, PartialEq)]
pub enum Trivia {
    Whitespace(String),
    LineComment(String),
    BlockComment(String),
}

Lexical Rules

Identifiers

Start with letter or underscore
Followed by letters, digits, or underscores
Case-sensitive
Cannot be keywords (use the raw identifier form r#keyword for edge cases)
TODO: extend to support Unicode and emoji

identifier := [a-zA-Z_][a-zA-Z0-9_]*
raw_identifier := r#[a-zA-Z_][a-zA-Z0-9_]*
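The two rules above can be sketched as a small scanner that classifies a word as a keyword, an identifier, or a raw identifier. This is only an illustration: the `Word` enum, the `scan_word` function, and the short `KEYWORDS` table are hypothetical stand-ins, not the compiler's actual types, and the keyword spellings are assumed.

```rust
/// Hypothetical, trimmed-down result type (the spec uses `TokenType`).
#[derive(Debug, Clone, PartialEq)]
pub enum Word {
    Keyword(&'static str),
    Identifier(String),
    RawIdentifier(String),
}

/// A small subset of keyword spellings, assumed for illustration only.
const KEYWORDS: &[&str] = &["if", "else", "fn", "let", "return", "while"];

fn is_ident_start(c: char) -> bool {
    c.is_ascii_alphabetic() || c == '_'
}

fn is_ident_continue(c: char) -> bool {
    c.is_ascii_alphanumeric() || c == '_'
}

/// Scans one keyword, identifier, or raw identifier at the start of `src`.
/// Returns the classified word and the number of bytes consumed.
pub fn scan_word(src: &str) -> Option<(Word, usize)> {
    // Raw identifier: `r#` followed by an ordinary identifier.
    if let Some(rest) = src.strip_prefix("r#") {
        if rest.chars().next().map_or(false, is_ident_start) {
            let len = rest.chars().take_while(|&c| is_ident_continue(c)).count();
            return Some((Word::RawIdentifier(rest[..len].to_string()), len + 2));
        }
    }
    if !src.chars().next().map_or(false, is_ident_start) {
        return None;
    }
    // Only ASCII characters are accepted, so the char count equals the byte length.
    let len = src.chars().take_while(|&c| is_ident_continue(c)).count();
    let text = &src[..len];
    match KEYWORDS.iter().find(|&&k| k == text) {
        Some(&k) => Some((Word::Keyword(k), len)),
        None => Some((Word::Identifier(text.to_string()), len)),
    }
}
```

The keyword check runs only after the whole word has been consumed, so `letter` is scanned as an identifier rather than as `let` followed by `ter`.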

Integer Literals

decimal := [0-9][0-9_]*
hexadecimal := 0[xX][0-9a-fA-F][0-9a-fA-F_]*
binary := 0[bB][01][01_]*
octal := 0[oO][0-7][0-7_]*

integer_suffix := (i8|i16|i32|i64|i128|isize|u8|u16|u32|u64|u128|usize)?
integer_literal := (decimal|hexadecimal|binary|octal) integer_suffix
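As a rough sketch of how the integer grammar could be turned into a value, the hypothetical `parse_integer` helper below peels off the suffix, detects the base prefix, strips digit separators, and parses into a `u128` (matching the `Untyped(u128)` variant). Checking the value against the suffix type's range, which would produce `IntegerOverflow`, is deliberately left out.

```rust
/// Parses an integer literal into its value and optional type suffix.
/// Hypothetical helper; errors are plain strings rather than `LexError`.
pub fn parse_integer(lit: &str) -> Result<(u128, Option<&str>), String> {
    const SUFFIXES: [&str; 12] = [
        "i8", "i16", "i32", "i64", "i128", "isize",
        "u8", "u16", "u32", "u64", "u128", "usize",
    ];
    // Peel off the type suffix, if any.
    let (body, suffix) = match SUFFIXES.iter().find(|s| lit.ends_with(**s)) {
        Some(s) => (&lit[..lit.len() - s.len()], Some(*s)),
        None => (lit, None),
    };
    // Detect the base prefix.
    let (radix, digits) = match body.get(..2) {
        Some("0x") | Some("0X") => (16, &body[2..]),
        Some("0b") | Some("0B") => (2, &body[2..]),
        Some("0o") | Some("0O") => (8, &body[2..]),
        _ => (10, body),
    };
    // The grammar requires a real digit first; `_` may only follow one.
    if !digits.chars().next().map_or(false, |c| c.is_digit(radix)) {
        return Err(format!("invalid numeric literal: {}", lit));
    }
    let cleaned: String = digits.chars().filter(|&c| c != '_').collect();
    u128::from_str_radix(&cleaned, radix)
        .map(|value| (value, suffix))
        .map_err(|e| format!("invalid numeric literal {}: {}", lit, e))
}
```

Because every suffix starts with `i` or `u`, neither of which is a hexadecimal digit, stripping the suffix from the end cannot eat part of a hex digit run.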

Float Literals

decimal_float := [0-9][0-9_]* \. [0-9][0-9_]* exponent?
                | [0-9][0-9_]* exponent

exponent := [eE][+-]?[0-9][0-9_]*
float_suffix := (f32|f64)?
float_literal := decimal_float float_suffix
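The float grammar can be checked with a few small recognizers, one per production. The sketch below is an assumption about how this might be structured (the `digits`, `exponent`, and `parse_float` names are hypothetical); it validates the shape first and only then hands the cleaned text to the standard `f64` parser, because that parser alone would also accept forms the grammar rejects, such as `1.` or `.5`.

```rust
/// Consumes `[0-9][0-9_]*` from the front of `s` and returns the remainder.
fn digits(s: &str) -> Option<&str> {
    if !s.starts_with(|c: char| c.is_ascii_digit()) {
        return None;
    }
    Some(s.trim_start_matches(|c: char| c.is_ascii_digit() || c == '_'))
}

/// Consumes `[eE][+-]?[0-9][0-9_]*` from the front of `s` and returns the remainder.
fn exponent(s: &str) -> Option<&str> {
    let s = s.strip_prefix(|c: char| c == 'e' || c == 'E')?;
    let s = s.strip_prefix(|c: char| c == '+' || c == '-').unwrap_or(s);
    digits(s)
}

/// Parses a float literal into its value and optional suffix.
/// Returns `None` when the text does not match the float grammar.
pub fn parse_float(lit: &str) -> Option<(f64, Option<&str>)> {
    let after_int = digits(lit)?;
    let rest = match after_int.strip_prefix('.') {
        // Fractional form: digits '.' digits exponent?
        Some(frac) => {
            let after_frac = digits(frac)?;
            exponent(after_frac).unwrap_or(after_frac)
        }
        // Exponent-only form: digits exponent
        None => exponent(after_int)?,
    };
    let suffix = match rest {
        "" => None,
        "f32" | "f64" => Some(rest),
        _ => return None,
    };
    // Drop digit separators before handing the text to the standard parser.
    let numeric: String = lit[..lit.len() - rest.len()]
        .chars()
        .filter(|&c| c != '_')
        .collect();
    numeric.parse::<f64>().ok().map(|value| (value, suffix))
}
```

A plain `42` is rejected here, which lets a caller fall back to the integer rules when the float rules do not match.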

String Literals

string_literal := " string_content* "
string_content := escape_sequence | [^"\\]

escape_sequence := \\ ( n | t | r | \\ | " | ' | 0 | x[0-9a-fA-F]{2} | u{[0-9a-fA-F]{1,6}} )

Supported escape sequences:

\n - newline
\t - tab
\r - carriage return
\\ - backslash
\" - double quote
\' - single quote
\0 - null character
\x## - ASCII character (exactly two hex digits)
\u{######} - Unicode character (one to six hex digits)
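The escape list above could be decoded along the following lines. This is a minimal sketch, not the lexer's real routine: `unescape` is a hypothetical name, errors are plain strings instead of `InvalidEscapeSequence` or `InvalidUnicodeEscape`, and limiting `\x##` to values up to 0x7F is an assumption drawn from the "ASCII character" wording.

```rust
/// Decodes escape sequences in the body of a string literal
/// (the text between the quotes).
pub fn unescape(body: &str) -> Result<String, String> {
    let mut out = String::with_capacity(body.len());
    let mut chars = body.chars();
    while let Some(c) = chars.next() {
        if c != '\\' {
            out.push(c);
            continue;
        }
        let decoded = match chars.next() {
            Some('n') => '\n',
            Some('t') => '\t',
            Some('r') => '\r',
            Some('\\') => '\\',
            Some('"') => '"',
            Some('\'') => '\'',
            Some('0') => '\0',
            Some('x') => {
                // Exactly two hex digits; assumed to be limited to ASCII.
                let hex: String = chars.by_ref().take(2).collect();
                let valid = hex.len() == 2 && hex.chars().all(|h| h.is_ascii_hexdigit());
                match u8::from_str_radix(&hex, 16) {
                    Ok(byte) if valid && byte <= 0x7F => byte as char,
                    _ => return Err(format!("invalid escape \\x{}", hex)),
                }
            }
            Some('u') => {
                if chars.next() != Some('{') {
                    return Err("expected '{' after \\u".to_string());
                }
                // Collect one to six hex digits up to the closing brace.
                let mut hex = String::new();
                loop {
                    match chars.next() {
                        Some('}') => break,
                        Some(h) if h.is_ascii_hexdigit() && hex.len() < 6 => hex.push(h),
                        _ => return Err(format!("invalid unicode escape \\u{{{}", hex)),
                    }
                }
                // Rejects empty digits, surrogates, and values above U+10FFFF.
                u32::from_str_radix(&hex, 16)
                    .ok()
                    .and_then(char::from_u32)
                    .ok_or_else(|| format!("invalid unicode escape \\u{{{}}}", hex))?
            }
            Some(other) => return Err(format!("invalid escape \\{}", other)),
            None => return Err("dangling backslash".to_string()),
        };
        out.push(decoded);
    }
    Ok(out)
}
```

Going through `char::from_u32` means a syntactically valid escape such as `\u{110000}` is still rejected, since it does not name a Unicode scalar value.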

Character Literals

char_literal := ' char_content '
char_content := escape_sequence | [^'\\]

Comments

line_comment := // [^\n]* \n?
block_comment := /* (block_comment | [^*] | \*[^/])* */

Block comments can be nested.
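Nesting means a regular expression alone cannot find the end of a block comment; a depth counter is the usual approach. The hypothetical `block_comment_len` helper below sketches that idea and returns `None` where a real lexer would presumably report `UnterminatedBlockComment`.

```rust
/// Returns the byte length of the block comment at the start of `src`.
/// Returns `None` if `src` does not start with `/*` or the comment never closes.
pub fn block_comment_len(src: &str) -> Option<usize> {
    if !src.starts_with("/*") {
        return None;
    }
    // Byte-wise scanning is safe: `/` and `*` are ASCII and never occur
    // inside a multi-byte UTF-8 sequence.
    let bytes = src.as_bytes();
    let mut depth = 0usize;
    let mut i = 0;
    while i < bytes.len() {
        if bytes[i..].starts_with(b"/*") {
            depth += 1; // opening a (possibly nested) comment
            i += 2;
        } else if bytes[i..].starts_with(b"*/") {
            depth -= 1; // depth is at least 1 here, so this cannot underflow
            i += 2;
            if depth == 0 {
                return Some(i);
            }
        } else {
            i += 1;
        }
    }
    None
}
```

For `/* Nested /* comment */ */` the counter goes 1, 2, 1, 0, so the whole text is consumed as a single comment.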

Lexer Implementation Requirements

Error Handling

The lexer must return PhaserResult<Token> and report the following error conditions:

#[derive(Debug, Clone, PartialEq)]
pub enum LexError {
    UnexpectedCharacter { char: char, position: Position },
    UnterminatedString { start: Position },
    UnterminatedBlockComment { start: Position },
    InvalidEscapeSequence { sequence: String, position: Position },
    InvalidNumericLiteral { literal: String, position: Position },
    InvalidUnicodeEscape { escape: String, position: Position },
    IntegerOverflow { literal: String, position: Position },
}

Position Tracking

Maintain accurate line/column information
Handle different line ending styles (LF, CRLF, CR)
Track byte offsets for efficient source mapping
Support Unicode characters correctly
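One way to satisfy these four requirements together is a single `advance` step that updates line, column, and byte offset as text is consumed. The sketch below makes two assumptions that the list above does not settle: columns count Unicode scalar values (not bytes or grapheme clusters), and the `Position` struct additionally derives `Copy`. `START` and `advance` are hypothetical names.

```rust
/// Same fields as the spec's `Position`, with `Copy` added for convenience.
#[derive(Debug, Clone, Copy, PartialEq)]
pub struct Position {
    pub line: u32,   // 1-based
    pub column: u32, // 1-based, counted in Unicode scalar values (assumption)
    pub offset: u32, // 0-based byte offset
}

/// Position of the first character in a source file.
pub const START: Position = Position { line: 1, column: 1, offset: 0 };

/// Returns the position reached after consuming all of `text` from `pos`.
pub fn advance(mut pos: Position, text: &str) -> Position {
    let mut chars = text.chars().peekable();
    while let Some(c) = chars.next() {
        pos.offset += c.len_utf8() as u32;
        match c {
            '\r' => {
                // Treat CRLF as one line break; a lone CR also ends the line.
                if chars.peek() == Some(&'\n') {
                    chars.next();
                    pos.offset += 1;
                }
                pos.line += 1;
                pos.column = 1;
            }
            '\n' => {
                pos.line += 1;
                pos.column = 1;
            }
            _ => pos.column += 1,
        }
    }
    pos
}
```

Keeping the byte offset separate from the column lets error messages show human-friendly columns while source slicing still uses exact byte indices.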

Performance Considerations

Use efficient string scanning techniques
Minimize allocations during tokenization
Consider using string interning for identifiers
Implement lookahead efficiently for multi-character operators
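For the multi-character operator point, a common technique is "maximal munch": try the longest spellings first so that `>>=` is never split into `>>` and `=`. The sketch below illustrates this with a table ordered longest-first. The function name is hypothetical, and the operator spellings are assumed, since the `Operator` enum only spells out a few of them in comments.

```rust
/// Returns the longest operator spelling that `src` starts with, if any.
pub fn scan_operator(src: &str) -> Option<&'static str> {
    // Ordered longest-first, so the first hit is the maximal munch.
    const OPERATORS: &[&str] = &[
        "<<=", ">>=", "...",
        "->", "=>", "::", "..", "==", "!=", "<=", ">=", "&&", "||",
        "<<", ">>", "+=", "-=", "*=", "/=", "%=", "&=", "|=", "^=",
        "+", "-", "*", "/", "%", "=", "<", ">", "!", "&", "|", "^",
        ".", "?", "@",
    ];
    OPERATORS.iter().copied().find(|op| src.starts_with(*op))
}
```

A linear scan over a short constant table is usually adequate; a hand-written `match` on the first byte would avoid the repeated prefix comparisons if profiling shows it matters. Comment openers such as `//` and `/*` need to be checked before this function runs, otherwise they would be read as a `/` operator.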

Trivia Handling

Preserve whitespace and comments as trivia
Attach trivia to appropriate tokens
Support documentation comment extraction
Handle mixed whitespace/comment sequences
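The list above does not say which token a given piece of trivia belongs to. One convention used by several syntax-tree libraries, assumed here purely for illustration, is that trivia up to and including the first line break trails the previous token, and everything after it leads the next token. The `split_trivia` function is a hypothetical sketch of that rule.

```rust
/// Mirrors the spec's `Trivia` enum.
#[derive(Debug, Clone, PartialEq)]
pub enum Trivia {
    Whitespace(String),
    LineComment(String),
    BlockComment(String),
}

/// Splits the trivia found between two tokens into
/// (trailing trivia of the previous token, leading trivia of the next token).
pub fn split_trivia(between: Vec<Trivia>) -> (Vec<Trivia>, Vec<Trivia>) {
    let mut trailing = Vec::new();
    let mut pieces = between.into_iter();
    for piece in pieces.by_ref() {
        let ends_line = match &piece {
            Trivia::Whitespace(ws) => ws.contains('\n'),
            Trivia::LineComment(_) => true, // a line comment runs to the end of its line
            Trivia::BlockComment(_) => false,
        };
        trailing.push(piece);
        if ends_line {
            break;
        }
    }
    // Whatever remains belongs to the following token.
    (trailing, pieces.collect())
}
```

Under this rule a comment written on the same line as a statement stays with that statement, while a comment on its own line above a declaration becomes leading trivia of the declaration, which is convenient for documentation comment extraction.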

Integration with Parser

The lexer provides a token stream interface:

pub trait TokenStream {
    fn next_token(&mut self) -> PhaserResult<Token>;
    fn peek_token(&mut self) -> PhaserResult<&Token>;
    fn current_position(&self) -> Position;
    fn is_at_end(&self) -> bool;
}
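To show how a parser might drive this interface, here is a minimal sketch that implements the trait over a pre-lexed vector. Everything except the trait itself is a hypothetical stand-in: `PhaserResult` is aliased to a plain `Result`, `Token` and `TokenType` are cut down to the minimum, and `VecTokenStream`, `token_at`, and `drain` are illustrative names.

```rust
// Stand-in definitions so the sketch compiles on its own.
pub type PhaserResult<T> = Result<T, String>;

#[derive(Debug, Clone, Copy, PartialEq)]
pub struct Position { pub line: u32, pub column: u32, pub offset: u32 }

#[derive(Debug, Clone, PartialEq)]
pub enum TokenType { Identifier(String), Eof }

#[derive(Debug, Clone, PartialEq)]
pub struct Token { pub token_type: TokenType, pub position: Position }

pub trait TokenStream {
    fn next_token(&mut self) -> PhaserResult<Token>;
    fn peek_token(&mut self) -> PhaserResult<&Token>;
    fn current_position(&self) -> Position;
    fn is_at_end(&self) -> bool;
}

/// Replays a pre-lexed token list that must end with `Eof`.
pub struct VecTokenStream { tokens: Vec<Token>, index: usize }

impl VecTokenStream {
    pub fn new(tokens: Vec<Token>) -> Self {
        assert!(
            matches!(tokens.last(), Some(t) if t.token_type == TokenType::Eof),
            "token list must end with Eof"
        );
        Self { tokens, index: 0 }
    }
}

impl TokenStream for VecTokenStream {
    fn next_token(&mut self) -> PhaserResult<Token> {
        let token = self.tokens[self.index].clone();
        // Stay parked on Eof so repeated calls keep returning it.
        if self.index + 1 < self.tokens.len() {
            self.index += 1;
        }
        Ok(token)
    }
    fn peek_token(&mut self) -> PhaserResult<&Token> {
        Ok(&self.tokens[self.index])
    }
    fn current_position(&self) -> Position {
        self.tokens[self.index].position
    }
    fn is_at_end(&self) -> bool {
        self.tokens[self.index].token_type == TokenType::Eof
    }
}

/// Builds a token on line 1 at the given byte offset.
pub fn token_at(token_type: TokenType, offset: u32) -> Token {
    Token { token_type, position: Position { line: 1, column: offset + 1, offset } }
}

/// Reads token types up to and including `Eof`, as a parser loop might.
pub fn drain(stream: &mut impl TokenStream) -> PhaserResult<Vec<TokenType>> {
    let mut seen = Vec::new();
    loop {
        let token = stream.next_token()?;
        let done = token.token_type == TokenType::Eof;
        seen.push(token.token_type);
        if done {
            return Ok(seen);
        }
    }
}
```

A production lexer would more likely tokenize lazily and buffer a single token for `peek_token`, which is presumably why that method takes `&mut self`; the vector-backed version just keeps the example short.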

Testing Requirements

Unit Tests Required

All token types recognition
All escape sequences
Numeric literal parsing (all bases and suffixes)
Error conditions and recovery
Position tracking accuracy
Unicode handling
Nested comment parsing
Trivia attachment correctness

Test Cases

// Basic tokens
let x = 42;
fn main() -> i32 { return 0; }
 
// Numeric literals
0x1A2B_u32
0b1010_1010_u8
123.456_f64
1e-10_f32
 
// String literals
"Hello, world!"
"Line 1\nLine 2"
"Unicode: \u{1F680}"
 
// Comments
// Line comment
/* Block comment */
/* Nested /* comment */ */

Future Extensions

Raw string literals: r"no escapes here"
Byte string literals: b"bytes"
Format string literals: f"Hello {name}"
Custom numeric literal suffixes for user types
