Eucalypt Architecture
This document provides a comprehensive overview of Eucalypt's design and implementation architecture.
Overview
Eucalypt is a functional programming language and tool for generating, templating, rendering, and processing structured data formats like YAML, JSON, and TOML. Written in Rust (~44,000 lines), it features a classic multi-phase compiler design with an STG (Spineless Tagless G-machine) runtime for lazy evaluation.
System Architecture
High-Level Pipeline
Source Code (*.eu files)
│
▼
┌───────────────────────┐
│ Parsing Phase │ src/syntax/
│ Lexer → Parser → AST │
└───────────────────────┘
│
▼
┌───────────────────────┐
│ Core Phase │ src/core/
│ Desugar → Cook → │
│ Transform → Verify │
└───────────────────────┘
│
▼
┌───────────────────────┐
│ Evaluation Phase │ src/eval/
│ STG Compile → VM → │
│ Memory Management │
└───────────────────────┘
│
▼
┌───────────────────────┐
│ Export Phase │ src/export/
│ JSON/YAML/TOML/etc │
└───────────────────────┘
Module Structure
eucalypt/
├── src/
│ ├── bin/eu.rs # CLI entry point
│ ├── lib.rs # Library root
│ ├── common/ # Shared utilities
│ ├── syntax/ # Parsing and AST
│ │ └── rowan/ # Rowan-based incremental parser
│ ├── core/ # Core expression representation
│ │ ├── desugar/ # AST to core transformation
│ │ ├── cook/ # Operator fixity resolution
│ │ ├── transform/ # Expression transformations
│ │ ├── simplify/ # Optimisation passes
│ │ ├── inline/ # Inlining passes
│ │ ├── verify/ # Validation
│ │ └── analyse/ # Program analysis
│ ├── eval/ # Evaluation engine
│ │ ├── stg/ # STG syntax and compiler
│ │ ├── machine/ # Virtual machine
│ │ └── memory/ # Heap and garbage collection
│ ├── driver/ # CLI orchestration
│ ├── export/ # Output format generation
│ └── import/ # Input format parsing
├── lib/ # Standard library (eucalypt source)
│ ├── prelude.eu # Core prelude
│ ├── test.eu # Test framework
│ └── markup.eu # Markup utilities
└── docs/ # Documentation
Parsing Pipeline
The parsing pipeline transforms source text into a structured AST using Rowan, an incremental parsing library that preserves full source fidelity including whitespace and comments.
Lexer
Implementation: src/syntax/rowan/lex.rs, src/syntax/rowan/string_lex.rs
The lexer (Lexer<C>) tokenises source text into a stream of SyntaxKind tokens:
pub struct Lexer<C: Iterator<Item = char>> {
chars: Peekable<C>, // Character stream with lookahead
location: ByteIndex, // Source position tracking
last_token: Option<SyntaxKind>, // Context for disambiguation
whitespace_since_last_token: bool,
token_buffer: VecDeque<(SyntaxKind, Span)>,
}
Key features:
- Unicode-aware identifier and operator recognition
- Context-sensitive tokenisation (distinguishes OPEN_PAREN from OPEN_PAREN_APPLY)
- String pattern lexing with interpolation support ("Hello {name}")
- Preserves trivia (whitespace, comments) for full-fidelity AST
Token categories:
- Delimiters: { } ( ) [ ] : , \``
- Identifiers:foo,'quoted name',+,&&- Literals: Numbers, strings, symbols (:keyword)
- Annotations: Whitespace, comments (#`)
Parser
Implementation: src/syntax/rowan/parse.rs
The parser uses an event-driven recursive descent approach:
pub struct Parser<'text> {
tokens: Vec<(SyntaxKind, &'text str)>,
next_token: usize,
sink_stack: Vec<Box<dyn EventSink>>,
errors: Vec<ParseError>,
}
Parse events:
enum ParseEvent {
StartNode(SyntaxKind), // Begin syntax node
Finish, // Complete node
Token(SyntaxKind), // Include token
}
Key parsing methods:
- parse_unit() - Top-level file (no braces required)
- parse_expression() / parse_soup() - Expression sequences
- parse_block_expression() - { ... } blocks with declarations
- parse_string_pattern() - Interpolated strings
The parser maintains error recovery for LSP support, collecting errors while continuing to parse.
AST
Implementation: src/syntax/rowan/ast.rs
The AST uses a two-layer design: 1. SyntaxNode (Rowan) - Rich, source-preserving tree 2. AST Nodes - Typed wrappers via macros
macro_rules! ast_node {
($ast:ident, $kind:ident) => {
pub struct $ast(SyntaxNode);
impl AstNode for $ast { ... }
}
}
Key AST types:
| Type | Purpose |
|---|---|
Unit |
Top-level file structure |
Block |
Enclosed { ... } block |
Declaration |
Property/function declaration |
DeclHead |
Declaration name (before :) |
DeclBody |
Declaration value (after :) |
Soup |
Unordered expression sequence |
List |
List expression [a, b, c] |
Name |
Identifier reference |
Literal |
Literal value |
StringPattern |
Interpolated string |
Core Expression Representation
The core representation is an intermediate language that facilitates powerful transformations while maintaining source information for error reporting.
Expression Type
Implementation: src/core/expr.rs
The Expr<T> enum (where T: BoundTerm<String>) represents all expression forms:
pub enum Expr<T> {
// Variables
Var(Smid, Var<String>), // Free or bound variable
// Primitives
Literal(Smid, Primitive), // Number, string, symbol, bool, null
// Binding forms
Let(Smid, LetScope<T>, LetType), // Recursive let binding
Lam(Smid, bool, LamScope<T>), // Lambda abstraction
// Application
App(Smid, T, Vec<T>), // Function application
// Data structures
List(Smid, Vec<T>), // List literal
Block(Smid, BlockMap<T>), // Object/record literal
// Operators (pre-cooking)
Operator(Smid, Fixity, Precedence, T),
Soup(Smid, Vec<T>), // Unresolved operator soup
// Anaphora (implicit parameters)
BlockAnaphor(Smid, ...),
ExprAnaphor(Smid, ...),
// Access
Lookup(Smid, T, String, Option<T>), // Property access
// Metadata
Meta(Smid, T, T), // Expression with metadata
// Intrinsics
Intrinsic(Smid, String), // Built-in function reference
// Error nodes
ErrUnresolved, ErrRedeclaration, ...
}
Primitive types:
pub enum Primitive {
Str(String),
Sym(String),
Num(Number),
Bool(bool),
Null,
}
Standard wrapper: RcExpr provides reference-counted immutable expressions with substitution and transformation methods.
Transformation Pipeline
The core pipeline transforms expressions through several phases:
AST → Desugar → Cook → Simplify → Inline → Verify → STG
Desugaring
Implementation: src/core/desugar/
Transforms parsed AST into core expressions: - Converts block declarations into recursive let bindings - Extracts targets and documentation metadata - Handles imports and cross-file references - Processes both native AST and embedded data (JSON/YAML)
Cooking
Implementation: src/core/cook/
Resolves operator precedence and anaphora:
1. Fixity distribution - Propagate operator precedence info
2. Anaphor filling - Infer missing implicit parameters ((+ 10) → (_ + 10))
3. Shunting yard - Apply precedence climbing to linearise operator soup
4. Anaphor processing - Wrap lambda abstractions around anaphoric expressions
Example transformation:
(+ 10) → (λ _ . (_ + 10))
a + b * c → (+ a (* b c)) // with standard precedence
Simplification and Inlining
Implementation: src/core/simplify/, src/core/inline/
- Compression - Remove eliminated bindings
- Pruning - Dead code elimination
- Inlining - Inline marked expressions
Verification
Implementation: src/core/verify/
Validates transformed expressions before STG compilation: - Binding verification - Content validation
STG Compilation and Evaluation Model
Eucalypt uses a Spineless Tagless G-machine (STG) as its evaluation model, providing lazy evaluation with memoisation.
STG Syntax
Implementation: src/eval/stg/syntax.rs
The STG syntax represents executable code:
pub enum StgSyn {
Atom { evaluand: Ref }, // Value or reference
Case { scrutinee, branches, fallback }, // Pattern matching (evaluation point)
Cons { tag: Tag, args: Vec<Ref> }, // Data constructor
App { callable: Ref, args: Vec<Ref> }, // Function application
Bif { intrinsic: u8, args: Vec<Ref> }, // Built-in intrinsic
Let { bindings: Vec<LambdaForm>, body }, // Non-recursive let
LetRec { bindings: Vec<LambdaForm>, body }, // Recursive let
Ann { smid: Smid, body }, // Source annotation
Meta { meta: Ref, body: Ref }, // Metadata wrapper
DeMeta { scrutinee, handler, or_else }, // Metadata destructure
BlackHole, // Uninitialized marker
}
Reference types:
pub enum Reference<T> {
L(usize), // Local environment index
G(usize), // Global environment index
V(T), // Direct value (Native)
}
Lambda forms control laziness:
- Lambda - Function with explicit arity
- Thunk - Lazy expression (evaluated and updated in-place)
- Value - Already in WHNF (no update needed)
STG Compiler
Implementation: src/eval/stg/compiler.rs
The compiler transforms core expressions to STG syntax:
impl Compiler {
fn compile_body(&mut self, expr: &RcExpr) -> ProtoSyntax;
fn compile_binding(&mut self, expr: &RcExpr) -> ProtoBinding;
fn compile_lambda(&mut self, expr: &RcExpr) -> ProtoSyntax;
fn compile_application(&mut self, f: &RcExpr, args: &[RcExpr]) -> ProtoSyntax;
}
Key decisions:
- Thunk creation: Expressions not in WHNF and used more than once become thunks
- WHNF detection: Constructors, native values, and metadata wrappers are WHNF
- Deferred compilation: ProtoSyntax allows deferring binding construction until environment size is known
Virtual Machine
Implementation: src/eval/machine/vm.rs
The STG machine is a state machine executing closures:
pub struct MachineState {
root_env: SynEnvPtr, // Empty root environment
closure: SynClosure, // Current (code, environment) pair
globals: SynEnvPtr, // Global bindings
stack: Vec<Continuation>, // Continuation stack
terminated: bool,
annotation: Smid, // Current source location
}
Execution loop:
fn run(&mut self) {
while !self.terminated {
if self.gc_check_needed() {
self.collect_garbage();
}
self.step();
}
}
Instruction dispatch (handle_instruction):
| Code Form | Action |
|---|---|
Atom |
Resolve reference; push Update continuation if thunk |
Case |
Push Branch continuation, evaluate scrutinee |
Cons |
Return data constructor |
App |
Push ApplyTo continuation, evaluate callable |
Bif |
Execute intrinsic directly |
Let |
Allocate environment frame, continue in body |
LetRec |
Allocate frame with backfilled recursive references |
Continuations
Implementation: src/eval/machine/cont.rs
Four continuation types manage control flow:
- Branch - Pattern matching branches for CASE
- Update - Deferred thunk update (memoisation)
- ApplyTo - Pending arguments for function application
- DeMeta - Metadata destructuring handler
Lazy Evaluation
Laziness is achieved through thunks and updates:
- Thunk creation (compile time): Non-WHNF expressions become
LambdaForm::Thunk - Thunk evaluation (runtime): When a thunk is entered, push Update continuation
- Memoisation: After evaluation, update the environment slot with the result
// When entering a local reference
if closure.update() {
stack.push(Continuation::Update { environment, index });
}
// After evaluation completes
Continuation::Update { environment, index } => {
self.update(environment, index); // Replace thunk with result
}
Memory Management and Garbage Collection
Eucalypt uses an Immix-inspired memory layout with mark-and-sweep collection.
Memory Layout
Implementation: src/eval/memory/heap.rs, src/eval/memory/bump.rs
Block (32KB)
├── Line 0 (128B) ┐
├── Line 1 (128B) │ 256 lines per block
├── ... │
└── Line 255 ┘
Size classes: - Small (< 128 bytes) - Single line - Medium (128B - 32KB) - Multiple lines within block - Large (> 32KB) - Dedicated Large Object Block
Heap state:
pub struct HeapState {
head: Option<BumpBlock>, // Active small allocation
overflow: Option<BumpBlock>, // Active medium allocation
recycled: VecDeque<BumpBlock>, // Blocks with reusable holes
rest: VecDeque<BumpBlock>, // Used blocks pending collection
lobs: Vec<LargeObjectBlock>, // Large objects
}
Object Headers
Implementation: src/eval/memory/header.rs
Every object has a 16-byte header:
pub struct AllocHeader {
bits: HeaderBits, // Mark bit + forwarded flag
alloc_length: u32, // Object size
forwarded_to: Option<NonNull<()>>, // For potential evacuation
}
Garbage Collection
Implementation: src/eval/memory/collect.rs
Mark phase: 1. Reset line maps across all blocks 2. Breadth-first root scanning from machine state 3. Transitive closure following object references 4. Mark lines containing live objects
Sweep phase: 1. Scan line maps in each block 2. Identify holes (2+ consecutive free lines) 3. Move recyclable blocks to recycled list
Collection triggering:
- When --heap-limit-mib is set and limit exceeded
- Check performed every 500 VM execution steps
- Emergency collection on allocation failure
See gc-implementation.md for detailed analysis.
The Prelude and Standard Library
Intrinsic Functions
Implementation: src/eval/intrinsics.rs, src/eval/stg/
Built-in functions are implemented in Rust and indexed by position:
Categories:
- Control flow: __IF, __PANIC, __TRUE, __FALSE, __NULL
- Lists: __CONS, __HEAD, __TAIL, __NIL, __REVERSE
- Blocks: __MERGE, __DEEPMERGE, __ELEMENTS, __BLOCK, __LOOKUP
- Arithmetic: __ADD, __SUB, __MUL, __DIV, __MOD, comparisons
- Strings: __STR, __SPLIT, __JOIN, __MATCH, __FMT
- Metadata: __META, __WITHMETA
- Time: __ZDT, __ZDT.PARSE, __ZDT.FORMAT
- I/O: __io.ENV, __io.EPOCHTIME
- Emission: __RENDER, __EMIT* family
Each intrinsic implements the StgIntrinsic trait with direct access to machine state.
Prelude
Implementation: lib/prelude.eu
The prelude (~29KB) is written entirely in eucalypt, wrapping intrinsics with ergonomic functions:
List functions:
- take, drop, nth (!!), fold/foldr, map, filter
- append (++), concat, reverse, zip, group-by, qsort
Block functions:
- merge-all, keys, values, map-kv, map-keys, map-values
- lookup-path, alter-value, update-value
Combinators:
- identity, const, compose (∘, ;), flip, curry, uncurry
String functions:
- str.split, str.join, str.match, str.fmt, str.letters
Loading:
- Prelude is embedded in the binary at compile time
- Loaded by default unless --no-prelude / -Q flag is used
Driver and CLI Architecture
Entry Point
Implementation: src/bin/eu.rs, src/driver/options.rs
The CLI uses clap v4 with derive macros:
#[derive(Parser)]
struct EucalyptCli {
#[command(subcommand)]
command: Option<Commands>,
files: Vec<String>,
// Global options...
}
enum Commands {
Run(RunArgs),
Test(TestArgs),
Dump(DumpArgs),
Fmt(FmtArgs),
Explain(ExplainArgs),
ListTargets(ListTargetsArgs),
Version,
}
Modes of Operation
Run (default):
eu file.eu # Implicit run
eu -e "expression" file.eu # With evaluand
eu -x json file.eu # JSON output
eu -t target file.eu # Select target
Test:
eu test tests/ # Run all tests in directory
eu test -t specific file.eu # Run specific test target
Format:
eu fmt file.eu # Print formatted to stdout
eu fmt --write file.eu # Modify in place
eu fmt --check file.eu # Check formatting (exit 1 if needs format)
Dump:
eu dump ast file.eu # Dump AST
eu dump desugared file.eu # Dump after desugaring
eu dump stg file.eu # Dump STG syntax
eu dump runtime # Dump intrinsic definitions
Output Formats
Implementation: src/export/
Emitters for each format implement the Emitter trait:
- YAML (yaml.rs) - Uses yaml_rust, supports tags from metadata
- JSON (json.rs) - Uses serde_json
- TOML (toml.rs) - Structured output
- Text (text.rs) - Plain text
- EDN (edn.rs) - Clojure-like format
- HTML (html.rs) - Markup with serialisation
Input Handling
Inputs are merged in order:
1. Prologue: Prelude, config files, build metadata, IO block
2. Explicit: Files and options (-c/--collect-as)
3. Epilogue: CLI evaluand (-e)
Names from earlier inputs are available to later inputs.
Key Design Decisions and Trade-offs
Why STG?
The STG machine provides a well-defined reference point for lazy functional language implementation: - Clear semantics for lazy evaluation with memoisation - Established compilation strategies - Potential for future optimisations
Trade-off: More complex than direct interpretation but provides a solid foundation.
Why Rowan for Parsing?
Rowan provides: - Incremental parsing for IDE support - Full source fidelity (preserves whitespace and comments) - Error recovery for partial parsing
Trade-off: More complex API than traditional parser generators.
Core as Intermediate Language
The core representation enables powerful transformations:
- User-definable operator precedence resolved by syntax transformation
- Block semantics (binding vs structuring) separated cleanly
- Source information preserved through Smid for error reporting
Trade-off: Additional compilation phase, but enables experimentation.
Block Duality
Eucalypt blocks serve two roles:
1. Name binding - Like let expressions
2. Data structuring - Like objects/records
The core phase separates these into distinct Let and Block expressions.
Trade-off: Elegant surface syntax at cost of semantic complexity.
Immix-Inspired GC
The memory layout provides: - Efficient bump allocation - Cache-friendly organisation - Effective hole reuse
Current limitation: No evacuation/compaction (mark-sweep only).
See gc-implementation.md for detailed analysis.
Embedded Prelude
The prelude is: - Written in eucalypt itself - Compiled into the binary - Loaded unless explicitly disabled
Benefit: Dogfooding, consistent semantics Trade-off: Slower startup if prelude is large
Performance Characteristics
Compilation
- Parsing: O(n) with incremental support
- Desugaring: O(n) pass over AST
- Cooking: O(n log n) for operator precedence resolution
- STG compilation: O(n) with binding analysis
Execution
- Bump allocation: O(1) for most objects
- Thunk updates: O(1) memoisation
- GC: O(live objects) for marking, O(total blocks) for sweeping
Memory
- 16-byte header overhead per object
- Block-level allocation granularity
- Large objects get dedicated blocks
Code Organisation Summary
| Directory | Purpose | Key Files |
|---|---|---|
src/syntax/rowan/ |
Incremental parsing | lex.rs, parse.rs, ast.rs |
src/core/ |
Expression representation | expr.rs |
src/core/desugar/ |
AST to core | desugarer.rs |
src/core/cook/ |
Operator resolution | shunt.rs, fixity.rs |
src/eval/stg/ |
STG syntax and compiler | syntax.rs, compiler.rs |
src/eval/machine/ |
Virtual machine | vm.rs, cont.rs, env.rs |
src/eval/memory/ |
Heap and GC | heap.rs, collect.rs |
src/driver/ |
CLI orchestration | options.rs, eval.rs |
src/export/ |
Output formats | yaml.rs, json.rs |
lib/ |
Standard library | prelude.eu |
Further Reading
implementation.md- Brief implementation overviewgc-implementation.md- Detailed GC documentationcommand-line.md- CLI usage guidesyntax.md- Language syntax referenceoperators-and-identifiers.md- Operator definitionsanaphora-and-lambdas.md- Implicit parameter handling