GC Benchmarking Workflow
This document describes the procedure for measuring GC performance when making changes to the garbage collector or memory management code.
Prerequisites
Build the release binary (benchmarks should always run against optimised builds):
cargo build --release
Tools
gc-bench.sh — End-to-end benchmarking
The scripts/gc-bench.sh script runs three GC stress benchmark programs
multiple times, collects statistics via --statistics-file, computes
medians, and compares against a saved baseline.
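The median step can be sketched as below. This is a minimal illustration, not code from gc-bench.sh: the `median` helper and the sample values are hypothetical. It shows why a median is preferred over a mean here, since a single noisy run does not shift the result.

```shell
#!/usr/bin/env bash
# Hypothetical helper (an assumption, not taken from gc-bench.sh):
# print the median of the numeric arguments.
median() {
  [ "$#" -gt 0 ] || return 1
  printf '%s\n' "$@" | LC_ALL=C sort -n | awk '
    { v[NR] = $1 }
    END {
      if (NR % 2) print v[(NR + 1) / 2]               # odd count: middle value
      else        print (v[NR / 2] + v[NR / 2 + 1]) / 2  # even count: mean of the middle two
    }'
}

# Five made-up mark-time samples in seconds; 0.512 is a noisy outlier.
median 0.308 0.295 0.301 0.512 0.299
```

With five runs, the outlier sorts to the end and is ignored, which is the behaviour you want for wall-clock timings.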
Benchmark programs (in harness/test/bench/):
| File | Target | What it tests |
|---|---|---|
| 007_short_lived.eu | bench-short-lived | High allocation churn: many temporary lists created and immediately discarded |
| 008_long_lived_graph.eu | bench-long-lived-graph | Long-lived persistent structure surviving alongside temporary garbage |
| 009_fragmentation.eu | bench-fragmentation | Interleaved retained/discarded allocations producing heap fragmentation |
Metrics collected per benchmark:
- machine_ticks: VM instruction count (deterministic)
- machine_allocs: heap allocation count (deterministic)
- collections_count: number of GC collections triggered
- total_mark_time_secs: wall-clock time in mark phase
- total_sweep_time_secs: wall-clock time in sweep phase
- peak_heap_blocks: maximum heap blocks in use at any point
Criterion benchmarks
cargo bench runs Criterion microbenchmarks defined in benches/gc.rs.
These measure low-level GC operations:
- Allocate then collect — allocation + full collection cycle
- Collect with survivors — collection with varying survival rates (0% to 100%)
- Allocate into recycled — allocation into previously-collected blocks
Step-by-step procedure
1. Establish baseline
Before making any GC changes, on a clean working tree:
scripts/gc-bench.sh baseline --runs 5
This saves median statistics to gc-bench-baseline.json (git-ignored).
2. Make your change
Implement the GC modification, then rebuild:
cargo build --release
3. Run tests
Verify correctness is not broken:
cargo test --lib
cargo test --test harness_test
4. Compare against baseline
scripts/gc-bench.sh compare --runs 5
This runs the same benchmarks and reports percentage change for each metric. Regressions above the threshold (default 5%) are flagged.
Example output:
--- bench-short-lived ---
machine_ticks baseline=167037166 current=167037166 change=0.0%
machine_allocs baseline=11178642 current=11178642 change=0.0%
total_mark_time_secs baseline=0.308 current=0.295 change=-4.2%
total_sweep_time_secs baseline=0.073 current=0.068 change=-6.8%
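The percentage figures in the example output can be reproduced with a short sketch. The `pct_change` and `is_regression` helpers are hypothetical and are not taken from gc-bench.sh. The sketch also assumes that only increases count as regressions; the real script may treat changes differently.

```shell
#!/usr/bin/env bash
# Hypothetical helpers (assumptions, not the script's actual internals).

# Percentage change from baseline ($1) to current ($2), one decimal place.
pct_change() {
  awk -v b="$1" -v c="$2" 'BEGIN {
    if (b == 0) { print "n/a"; exit }
    printf "%.1f\n", (c - b) / b * 100
  }'
}

# Exit status 0 when current exceeds baseline by more than the
# threshold percentage ($3, default 5). Assumes increases are regressions.
is_regression() {
  awk -v b="$1" -v c="$2" -v t="${3:-5}" 'BEGIN {
    exit !(b != 0 && (c - b) / b * 100 > t)
  }'
}

# Values from the example output above.
pct_change 0.308 0.295
pct_change 0.073 0.068

# A made-up slowdown that crosses the default 5% threshold.
if is_regression 0.308 0.330; then
  echo "regression"
fi
```

Under this assumption, the -6.8% sweep change in the example exceeds 5% in magnitude but is an improvement, so it would not be flagged.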
5. Investigate with Criterion if needed
If gc-bench.sh shows unexpected results, use Criterion for detailed analysis:
# Run all GC benchmarks
cargo bench -- gc
# Run a specific benchmark group
cargo bench -- gc_alloc_then_collect
# Compare against saved Criterion baseline
cargo bench -- gc --save-baseline before
# ... make change, rebuild ...
cargo bench -- gc --baseline before
6. Include results in commit message
When committing GC changes, include a summary of benchmark results in the commit message body. For example:
feat(eu-xxx): implement lazy sweeping
gc-bench.sh results (5 runs, median):
bench-short-lived: mark -12%, sweep -45%, ticks 0%
bench-long-lived-graph: mark -8%, sweep -38%, ticks 0%
bench-fragmentation: mark -5%, sweep -42%, ticks 0%
Options reference
gc-bench.sh
Usage: scripts/gc-bench.sh {baseline|compare} [OPTIONS]
Options:
--runs N Number of runs per benchmark (default: 5)
--heap-limit MIB Heap limit in MiB (default: no limit)
--threshold PCT Regression threshold percentage (default: 5)
Environment:
EU_BIN Path to eu binary (default: target/release/eu)
Interpreting results
- Deterministic metrics (ticks, allocs) should show 0% change unless the GC change affects the allocation path or triggers collections at different points.
- Timing metrics (mark, sweep) naturally vary by a few percent between runs. Use --runs 5 or higher for reliable medians.
- collections_count changes indicate the GC is triggering at different points, which may be expected for threshold or heuristic changes.
- peak_heap_blocks changes indicate different peak memory usage.
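The guidance above can be restated as a small triage function. This is a sketch only: `triage` and its messages are hypothetical and are not part of gc-bench.sh. The metric names are the ones listed earlier in this document.

```shell
#!/usr/bin/env bash
# Hypothetical triage helper (an assumption, not part of gc-bench.sh).
# Usage: triage METRIC BASELINE CURRENT
triage() {
  case "$1" in
    machine_ticks|machine_allocs)
      # Deterministic: any difference deserves an explanation.
      if [ "$2" = "$3" ]; then
        echo "ok: deterministic metric unchanged"
      else
        echo "check: deterministic metric changed"
      fi ;;
    total_mark_time_secs|total_sweep_time_secs)
      # Wall-clock: compare medians and expect some noise.
      echo "timing: expect a few percent of noise" ;;
    collections_count)
      if [ "$2" = "$3" ]; then echo "ok: unchanged"
      else echo "note: GC triggering at different points"; fi ;;
    peak_heap_blocks)
      if [ "$2" = "$3" ]; then echo "ok: unchanged"
      else echo "note: peak memory usage differs"; fi ;;
    *)
      echo "unknown metric: $1" ;;
  esac
}

triage machine_ticks 167037166 167037166
triage machine_allocs 11178642 11178700
triage total_mark_time_secs 0.308 0.295
```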
Notes
- Always benchmark release builds — debug builds are 10-50x slower.
- The --heap-limit-mib flag forces a smaller heap, which can trigger more frequent collections. However, there is a known GC assertion issue under very small heap limits, so use this flag with caution.
- The baseline file (gc-bench-baseline.json) is git-ignored and local to each worktree.