Eucalypt Architecture

This document provides a comprehensive overview of Eucalypt's design and implementation architecture.

Overview

Eucalypt is a functional programming language and tool for generating, templating, rendering, and processing structured data formats like YAML, JSON, and TOML. Written in Rust (~44,000 lines), it features a classic multi-phase compiler design with an STG (Spineless Tagless G-machine) runtime for lazy evaluation.

System Architecture

High-Level Pipeline

Source Code (*.eu files)
        │
        ▼
┌───────────────────────┐
│    Parsing Phase      │  src/syntax/
│  Lexer → Parser → AST │
└───────────────────────┘
        │
        ▼
┌───────────────────────┐
│     Core Phase        │  src/core/
│  Desugar → Cook →     │
│  Transform → Verify   │
└───────────────────────┘
        │
        ▼
┌───────────────────────┐
│   Evaluation Phase    │  src/eval/
│  STG Compile → VM →   │
│  Memory Management    │
└───────────────────────┘
        │
        ▼
┌───────────────────────┐
│    Export Phase       │  src/export/
│  JSON/YAML/TOML/etc   │
└───────────────────────┘

Module Structure

eucalypt/
├── src/
│   ├── bin/eu.rs           # CLI entry point
│   ├── lib.rs              # Library root
│   ├── common/             # Shared utilities
│   ├── syntax/             # Parsing and AST
│   │   └── rowan/          # Rowan-based incremental parser
│   ├── core/               # Core expression representation
│   │   ├── desugar/        # AST to core transformation
│   │   ├── cook/           # Operator fixity resolution
│   │   ├── transform/      # Expression transformations
│   │   ├── simplify/       # Optimisation passes
│   │   ├── inline/         # Inlining passes
│   │   ├── verify/         # Validation
│   │   └── analyse/        # Program analysis
│   ├── eval/               # Evaluation engine
│   │   ├── stg/            # STG syntax and compiler
│   │   ├── machine/        # Virtual machine
│   │   └── memory/         # Heap and garbage collection
│   ├── driver/             # CLI orchestration
│   ├── export/             # Output format generation
│   └── import/             # Input format parsing
├── lib/                    # Standard library (eucalypt source)
│   ├── prelude.eu          # Core prelude
│   ├── test.eu             # Test framework
│   └── markup.eu           # Markup utilities
└── docs/                   # Documentation

Parsing Pipeline

The parsing pipeline transforms source text into a structured AST using Rowan, an incremental parsing library that preserves full source fidelity including whitespace and comments.

Lexer

Implementation: src/syntax/rowan/lex.rs, src/syntax/rowan/string_lex.rs

The lexer (Lexer<C>) tokenises source text into a stream of SyntaxKind tokens:

pub struct Lexer<C: Iterator<Item = char>> {
    chars: Peekable<C>,           // Character stream with lookahead
    location: ByteIndex,           // Source position tracking
    last_token: Option<SyntaxKind>, // Context for disambiguation
    whitespace_since_last_token: bool,
    token_buffer: VecDeque<(SyntaxKind, Span)>,
}

Key features: - Unicode-aware identifier and operator recognition - Context-sensitive tokenisation (distinguishes OPEN_PAREN from OPEN_PAREN_APPLY) - String pattern lexing with interpolation support ("Hello {name}") - Preserves trivia (whitespace, comments) for full-fidelity AST

Token categories: - Delimiters: { } ( ) [ ] : , \`` - Identifiers:foo,'quoted name',+,&&- Literals: Numbers, strings, symbols (:keyword) - Annotations: Whitespace, comments (#`)

Parser

Implementation: src/syntax/rowan/parse.rs

The parser uses an event-driven recursive descent approach:

pub struct Parser<'text> {
    tokens: Vec<(SyntaxKind, &'text str)>,
    next_token: usize,
    sink_stack: Vec<Box<dyn EventSink>>,
    errors: Vec<ParseError>,
}

Parse events:

enum ParseEvent {
    StartNode(SyntaxKind),  // Begin syntax node
    Finish,                  // Complete node
    Token(SyntaxKind),      // Include token
}

Key parsing methods: - parse_unit() - Top-level file (no braces required) - parse_expression() / parse_soup() - Expression sequences - parse_block_expression() - { ... } blocks with declarations - parse_string_pattern() - Interpolated strings

The parser maintains error recovery for LSP support, collecting errors while continuing to parse.

AST

Implementation: src/syntax/rowan/ast.rs

The AST uses a two-layer design: 1. SyntaxNode (Rowan) - Rich, source-preserving tree 2. AST Nodes - Typed wrappers via macros

macro_rules! ast_node {
    ($ast:ident, $kind:ident) => {
        pub struct $ast(SyntaxNode);
        impl AstNode for $ast { ... }
    }
}

Key AST types:

Type	Purpose
`Unit`	Top-level file structure
`Block`	Enclosed `{ ... }` block
`Declaration`	Property/function declaration
`DeclHead`	Declaration name (before `:`)
`DeclBody`	Declaration value (after `:`)
`Soup`	Unordered expression sequence
`List`	List expression `[a, b, c]`
`Name`	Identifier reference
`Literal`	Literal value
`StringPattern`	Interpolated string

Core Expression Representation

The core representation is an intermediate language that facilitates powerful transformations while maintaining source information for error reporting.

Expression Type

Implementation: src/core/expr.rs

The Expr<T> enum (where T: BoundTerm<String>) represents all expression forms:

pub enum Expr<T> {
    // Variables
    Var(Smid, Var<String>),           // Free or bound variable

    // Primitives
    Literal(Smid, Primitive),          // Number, string, symbol, bool, null

    // Binding forms
    Let(Smid, LetScope<T>, LetType),   // Recursive let binding
    Lam(Smid, bool, LamScope<T>),      // Lambda abstraction

    // Application
    App(Smid, T, Vec<T>),              // Function application

    // Data structures
    List(Smid, Vec<T>),                // List literal
    Block(Smid, BlockMap<T>),          // Object/record literal

    // Operators (pre-cooking)
    Operator(Smid, Fixity, Precedence, T),
    Soup(Smid, Vec<T>),                // Unresolved operator soup

    // Anaphora (implicit parameters)
    BlockAnaphor(Smid, ...),
    ExprAnaphor(Smid, ...),

    // Access
    Lookup(Smid, T, String, Option<T>), // Property access

    // Metadata
    Meta(Smid, T, T),                   // Expression with metadata

    // Intrinsics
    Intrinsic(Smid, String),           // Built-in function reference

    // Error nodes
    ErrUnresolved, ErrRedeclaration, ...
}

Primitive types:

pub enum Primitive {
    Str(String),
    Sym(String),
    Num(Number),
    Bool(bool),
    Null,
}

Standard wrapper: RcExpr provides reference-counted immutable expressions with substitution and transformation methods.

Transformation Pipeline

The core pipeline transforms expressions through several phases:

AST → Desugar → Cook → Simplify → Inline → Verify → STG

Desugaring

Implementation: src/core/desugar/

Transforms parsed AST into core expressions: - Converts block declarations into recursive let bindings - Extracts targets and documentation metadata - Handles imports and cross-file references - Processes both native AST and embedded data (JSON/YAML)

Cooking

Implementation: src/core/cook/

Resolves operator precedence and anaphora: 1. Fixity distribution - Propagate operator precedence info 2. Anaphor filling - Infer missing implicit parameters ((+ 10) → (_ + 10)) 3. Shunting yard - Apply precedence climbing to linearise operator soup 4. Anaphor processing - Wrap lambda abstractions around anaphoric expressions

Example transformation:

(+ 10)        →  (λ _ . (_ + 10))
a + b * c     →  (+ a (* b c))      // with standard precedence

Simplification and Inlining

Implementation: src/core/simplify/, src/core/inline/

Compression - Remove eliminated bindings
Pruning - Dead code elimination
Inlining - Inline marked expressions

Verification

Implementation: src/core/verify/

Validates transformed expressions before STG compilation: - Binding verification - Content validation

STG Compilation and Evaluation Model

Eucalypt uses a Spineless Tagless G-machine (STG) as its evaluation model, providing lazy evaluation with memoisation.

STG Syntax

Implementation: src/eval/stg/syntax.rs

The STG syntax represents executable code:

pub enum StgSyn {
    Atom { evaluand: Ref },              // Value or reference
    Case { scrutinee, branches, fallback }, // Pattern matching (evaluation point)
    Cons { tag: Tag, args: Vec<Ref> },   // Data constructor
    App { callable: Ref, args: Vec<Ref> }, // Function application
    Bif { intrinsic: u8, args: Vec<Ref> }, // Built-in intrinsic
    Let { bindings: Vec<LambdaForm>, body }, // Non-recursive let
    LetRec { bindings: Vec<LambdaForm>, body }, // Recursive let
    Ann { smid: Smid, body },            // Source annotation
    Meta { meta: Ref, body: Ref },       // Metadata wrapper
    DeMeta { scrutinee, handler, or_else }, // Metadata destructure
    BlackHole,                            // Uninitialized marker
}

Reference types:

pub enum Reference<T> {
    L(usize),  // Local environment index
    G(usize),  // Global environment index
    V(T),      // Direct value (Native)
}

Lambda forms control laziness: - Lambda - Function with explicit arity - Thunk - Lazy expression (evaluated and updated in-place) - Value - Already in WHNF (no update needed)

STG Compiler

Implementation: src/eval/stg/compiler.rs

The compiler transforms core expressions to STG syntax:

impl Compiler {
    fn compile_body(&mut self, expr: &RcExpr) -> ProtoSyntax;
    fn compile_binding(&mut self, expr: &RcExpr) -> ProtoBinding;
    fn compile_lambda(&mut self, expr: &RcExpr) -> ProtoSyntax;
    fn compile_application(&mut self, f: &RcExpr, args: &[RcExpr]) -> ProtoSyntax;
}

Key decisions: - Thunk creation: Expressions not in WHNF and used more than once become thunks - WHNF detection: Constructors, native values, and metadata wrappers are WHNF - Deferred compilation: ProtoSyntax allows deferring binding construction until environment size is known

Virtual Machine

Implementation: src/eval/machine/vm.rs

The STG machine is a state machine executing closures:

pub struct MachineState {
    root_env: SynEnvPtr,           // Empty root environment
    closure: SynClosure,           // Current (code, environment) pair
    globals: SynEnvPtr,            // Global bindings
    stack: Vec<Continuation>,      // Continuation stack
    terminated: bool,
    annotation: Smid,              // Current source location
}

Execution loop:

fn run(&mut self) {
    while !self.terminated {
        if self.gc_check_needed() {
            self.collect_garbage();
        }
        self.step();
    }
}

Instruction dispatch (handle_instruction):

Code Form	Action
`Atom`	Resolve reference; push Update continuation if thunk
`Case`	Push Branch continuation, evaluate scrutinee
`Cons`	Return data constructor
`App`	Push ApplyTo continuation, evaluate callable
`Bif`	Execute intrinsic directly
`Let`	Allocate environment frame, continue in body
`LetRec`	Allocate frame with backfilled recursive references

Continuations

Implementation: src/eval/machine/cont.rs

Four continuation types manage control flow:

Branch - Pattern matching branches for CASE
Update - Deferred thunk update (memoisation)
ApplyTo - Pending arguments for function application
DeMeta - Metadata destructuring handler

Lazy Evaluation

Laziness is achieved through thunks and updates:

Thunk creation (compile time): Non-WHNF expressions become LambdaForm::Thunk
Thunk evaluation (runtime): When a thunk is entered, push Update continuation
Memoisation: After evaluation, update the environment slot with the result

// When entering a local reference
if closure.update() {
    stack.push(Continuation::Update { environment, index });
}

// After evaluation completes
Continuation::Update { environment, index } => {
    self.update(environment, index);  // Replace thunk with result
}

Memory Management and Garbage Collection

Eucalypt uses an Immix-inspired memory layout with mark-and-sweep collection.

Memory Layout

Implementation: src/eval/memory/heap.rs, src/eval/memory/bump.rs

Block (32KB)
├── Line 0 (128B) ┐
├── Line 1 (128B) │ 256 lines per block
├── ...           │
└── Line 255      ┘

Size classes: - Small (< 128 bytes) - Single line - Medium (128B - 32KB) - Multiple lines within block - Large (> 32KB) - Dedicated Large Object Block

Heap state:

pub struct HeapState {
    head: Option<BumpBlock>,           // Active small allocation
    overflow: Option<BumpBlock>,       // Active medium allocation
    recycled: VecDeque<BumpBlock>,     // Blocks with reusable holes
    rest: VecDeque<BumpBlock>,         // Used blocks pending collection
    lobs: Vec<LargeObjectBlock>,       // Large objects
}

Object Headers

Implementation: src/eval/memory/header.rs

Every object has a 16-byte header:

pub struct AllocHeader {
    bits: HeaderBits,                  // Mark bit + forwarded flag
    alloc_length: u32,                 // Object size
    forwarded_to: Option<NonNull<()>>, // For potential evacuation
}

Garbage Collection

Implementation: src/eval/memory/collect.rs

Mark phase: 1. Reset line maps across all blocks 2. Breadth-first root scanning from machine state 3. Transitive closure following object references 4. Mark lines containing live objects

Sweep phase: 1. Scan line maps in each block 2. Identify holes (2+ consecutive free lines) 3. Move recyclable blocks to recycled list

Collection triggering: - When --heap-limit-mib is set and limit exceeded - Check performed every 500 VM execution steps - Emergency collection on allocation failure

See gc-implementation.md for detailed analysis.

The Prelude and Standard Library

Intrinsic Functions

Implementation: src/eval/intrinsics.rs, src/eval/stg/

Built-in functions are implemented in Rust and indexed by position:

Categories: - Control flow: __IF, __PANIC, __TRUE, __FALSE, __NULL - Lists: __CONS, __HEAD, __TAIL, __NIL, __REVERSE - Blocks: __MERGE, __DEEPMERGE, __ELEMENTS, __BLOCK, __LOOKUP - Arithmetic: __ADD, __SUB, __MUL, __DIV, __MOD, comparisons - Strings: __STR, __SPLIT, __JOIN, __MATCH, __FMT - Metadata: __META, __WITHMETA - Time: __ZDT, __ZDT.PARSE, __ZDT.FORMAT - I/O: __io.ENV, __io.EPOCHTIME - Emission: __RENDER, __EMIT* family

Each intrinsic implements the StgIntrinsic trait with direct access to machine state.

Prelude

Implementation: lib/prelude.eu

The prelude (~29KB) is written entirely in eucalypt, wrapping intrinsics with ergonomic functions:

List functions: - take, drop, nth (!!), fold/foldr, map, filter - append (++), concat, reverse, zip, group-by, qsort

Block functions: - merge-all, keys, values, map-kv, map-keys, map-values - lookup-path, alter-value, update-value

Combinators: - identity, const, compose (∘, ;), flip, curry, uncurry

String functions: - str.split, str.join, str.match, str.fmt, str.letters

Loading: - Prelude is embedded in the binary at compile time - Loaded by default unless --no-prelude / -Q flag is used

Driver and CLI Architecture

Entry Point

Implementation: src/bin/eu.rs, src/driver/options.rs

The CLI uses clap v4 with derive macros:

#[derive(Parser)]
struct EucalyptCli {
    #[command(subcommand)]
    command: Option<Commands>,
    files: Vec<String>,
    // Global options...
}

enum Commands {
    Run(RunArgs),
    Test(TestArgs),
    Dump(DumpArgs),
    Fmt(FmtArgs),
    Explain(ExplainArgs),
    ListTargets(ListTargetsArgs),
    Version,
}

Modes of Operation

Run (default):

eu file.eu                    # Implicit run
eu -e "expression" file.eu    # With evaluand
eu -x json file.eu            # JSON output
eu -t target file.eu          # Select target

Test:

eu test tests/              # Run all tests in directory
eu test -t specific file.eu  # Run specific test target

Format:

eu fmt file.eu              # Print formatted to stdout
eu fmt --write file.eu      # Modify in place
eu fmt --check file.eu      # Check formatting (exit 1 if needs format)

Dump:

eu dump ast file.eu          # Dump AST
eu dump desugared file.eu    # Dump after desugaring
eu dump stg file.eu          # Dump STG syntax
eu dump runtime              # Dump intrinsic definitions

Output Formats

Implementation: src/export/

Emitters for each format implement the Emitter trait: - YAML (yaml.rs) - Uses yaml_rust, supports tags from metadata - JSON (json.rs) - Uses serde_json - TOML (toml.rs) - Structured output - Text (text.rs) - Plain text - EDN (edn.rs) - Clojure-like format - HTML (html.rs) - Markup with serialisation

Input Handling

Inputs are merged in order: 1. Prologue: Prelude, config files, build metadata, IO block 2. Explicit: Files and options (-c/--collect-as) 3. Epilogue: CLI evaluand (-e)

Names from earlier inputs are available to later inputs.

Key Design Decisions and Trade-offs

Why STG?

The STG machine provides a well-defined reference point for lazy functional language implementation: - Clear semantics for lazy evaluation with memoisation - Established compilation strategies - Potential for future optimisations

Trade-off: More complex than direct interpretation but provides a solid foundation.

Why Rowan for Parsing?

Rowan provides: - Incremental parsing for IDE support - Full source fidelity (preserves whitespace and comments) - Error recovery for partial parsing

Trade-off: More complex API than traditional parser generators.

Core as Intermediate Language

The core representation enables powerful transformations: - User-definable operator precedence resolved by syntax transformation - Block semantics (binding vs structuring) separated cleanly - Source information preserved through Smid for error reporting

Trade-off: Additional compilation phase, but enables experimentation.

Block Duality

Eucalypt blocks serve two roles: 1. Name binding - Like let expressions 2. Data structuring - Like objects/records

The core phase separates these into distinct Let and Block expressions.

Trade-off: Elegant surface syntax at cost of semantic complexity.

Immix-Inspired GC

The memory layout provides: - Efficient bump allocation - Cache-friendly organisation - Effective hole reuse

Current limitation: No evacuation/compaction (mark-sweep only).