Performance Guide

Audience: Performance Engineer, Library Integrator Purpose: Explain what makes Orbit fast, how to measure performance, and what optimizations are planned.

The Meta-Engine Architecture

Orbit analyzes each pattern at compile time and assigns an EngineHint. MetaEngine reads that hint at match time and dispatches to the appropriate engine.

Current engine routing

EngineHint	Engine used
`DFA_SAFE`	LazyDfaEngine
`ONE_PASS_SAFE`	OnePassDfaEngine
`PIKEVM_ONLY`	PikeVmEngine
`NEEDS_BACKTRACKER`	BoundedBacktrackEngine
`GRAMMAR_RULE`	PikeVmEngine

DFA_SAFE and ONE_PASS_SAFE patterns run in O(n) time. MetaEngine.getEngine routes them to LazyDfaEngine and OnePassDfaEngine respectively.

Time complexity

Engine	Time complexity	Notes
LazyDfaEngine	O(n) amortised	1,024-state cache; falls back to PikeVM on saturation
OnePassDfaEngine	O(n) strict	No forking, no second pass for captures
PikeVmEngine	O(n × \|NFA\|)	Universal fallback; handles all instruction types
BoundedBacktrackEngine	O(budget)	Default budget: 1,000,000 operations

Prefilter

MetaEngine runs the prefilter before invoking any engine. If the prefilter finds no candidate position, MetaEngine returns no-match without calling an engine.

Literal count	Prefilter	Algorithm
0	`NoopPrefilter`	Skipped at match time
1–10	`VectorLiteralPrefilter`	SIMD via `jdk.incubator.vector` (`ShortVector.SPECIES_PREFERRED`): broadcasts the literal’s first character across lanes and calls `regionMatches` on hits; falls back to `String.indexOf` for inputs shorter than 32 chars
11–500	`AhoCorasickPrefilter` (NFA)	Multi-pattern automaton
> 500	`AhoCorasickPrefilter` (DFA)	Determinised Aho-Corasick

NoopPrefilter is used when CASE_INSENSITIVE is active or when the pattern’s minimum match length is 0 — cases where literal prefix scanning is not sound.

SIMD scope: prefilter only

VectorLiteralPrefilter benefits from SIMD because its task is strictly data-parallel: scan a buffer for a fixed byte pattern, applying the same operation to N characters per cycle. That regularity is what SIMD requires.

The PikeVM cannot benefit. It runs a Thompson NFA simulation: for each input character, up to |NFA| threads may be active, each carrying its own program counter and capture registers. Every step involves data-dependent opcode dispatch, per-thread state updates, and irregular memory access — structurally incompatible with lockstep SIMD computation.

The LazyDFA inner loop is a flat table walk (O(1) per character). The JIT’s own auto-vectorisation handles this well; explicit Vector API use would likely prevent the JIT from applying its own optimisations.

SIMD investment in Orbit belongs at the prefilter boundary, not inside the engine cores.

Inspecting the assigned hint

import com.orbit.api.Pattern;
import com.orbit.util.EngineHint;

Pattern p = Pattern.compile("[a-z]+@[a-z]+\\.[a-z]{2,4}");
EngineHint hint = p.engineHint(); // DFA_SAFE

Pattern q = Pattern.compile("(\\w+)\\1");
EngineHint hint2 = q.engineHint(); // NEEDS_BACKTRACKER (backreference)

Compile-time optimizations

The HIR analysis phase performs five passes before compilation:

Build HIR tree
Extract literal prefix/suffix → LiteralSet
One-pass safety check
Output acyclicity and bounded-length check
Engine classification + prefilter selection

Five HIR rewrite rules run as dead-code elimination and simplification passes before the compiler produces Prog. Min/max match length and anchor positions (startAnchored, endAnchored) are computed during analysis and stored in Metadata. MetaEngine uses anchor positions to short-circuit matching when the pattern is start-anchored and the candidate position is not position 0.

ProgOptimiser: EpsilonJump chain folding

ProgOptimiser.foldEpsilonChains runs once inside Pattern.buildCompileResult, after the compiler produces its Prog and after Metadata is assembled. It rewrites the instruction array in place of the original before the final Prog is constructed. No engine code changes.

What it does. The Thompson compiler inserts EpsilonJump instructions as connective tissue between segments: after a quantifier body, around group boundaries, and between alternation arms. Each EpsilonJump does nothing but advance pc to its next field, but it still pays the full BBE dispatch cost: a bounds check, a getInstruction call, and a megamorphic switch arm. ProgOptimiser eliminates these from the execution path by replacing every PC field in every other instruction with the first non-EpsilonJump address reachable from that field. The EpsilonJump nodes remain in the array at their original indices — the array length is unchanged — but no reachable instruction points to them after the pass.

The resolve function. For a given starting PC, resolve follows the EpsilonJump chain until it reaches either a non-epsilon instruction or a cycle. Cycle detection uses a per-call boolean[] visited array; a cycle head is returned as its own target (the engine budget exhausts on such degenerate programs before any infinite loop occurs). resolve runs in O(n) time per call where n is the instruction count; the full fold pass is O(n²) worst case, which is acceptable for patterns compiled once and cached.

Sub-program recursion. Lookahead, LookaheadNeg, LookbehindPos, LookbehindNeg, and ConditionalBranchInstr each carry an embedded Prog body — an independently compiled sub-program for the assertion. ProgOptimiser recurses into each non-null body via foldProg, applying the same pass. This means lookahead and lookbehind sub-programs benefit from the same reduction even though they are executed by runSubProg, not the outer BBE rec() loop.

Patterns that benefit most.

Pattern structure	Why it benefits
Quantifiers (`a+`, `(ab)+`, `\w{3,8}`)	Quantifier bodies compile to loop structures connected by `EpsilonJump` chains
Alternation (`a\\|b\\|c\\|d`)	Each arm is separated from the next by `EpsilonJump` links
Nested groups	Group open/close boundaries emit `EpsilonJump` connectors
Lookahead / lookbehind bodies	Sub-program chains are folded recursively

Patterns without these structures — bare literals, single character classes — produce few or no EpsilonJump instructions and see no measurable change.

Expected gain. The SPECIFICATION.md acceptance criterion requires at least two of the four BacktrackerBenchmark scenarios to improve by ≥ 5% against the Session 3 baseline, with no scenario regressing by more than 5%. The backref_short and backref_sentence patterns (backreference with quantifier) are the primary targets; the spec projects 15–30% fewer dispatch iterations for such patterns.

Idempotence. Running foldEpsilonChains twice on its own output produces the same result. Pattern.compile caches its output in a ConcurrentHashMap keyed by (regex, flags), so the pass executes at most once per unique pattern regardless.

JMH Benchmarks

The orbit-benchmarks module provides a JMH harness. Build once, then run any combination of benchmark class and scenario:

mvn package -pl orbit-benchmarks -am

Benchmark classes

`PerformanceBenchmark` — latency, short inputs (~100–200 chars)

Measures average nanoseconds per find() / findAll() call. Compares Orbit, JDK, and re2j. Matchers are created in @Setup; only execution is timed.

Scenario	Pattern	Engine hint	Engine
`literal`	`hello`	`DFA_SAFE`	LazyDfaEngine
`phone`	`\d{3}-\d{3}-\d{4}`	`ONE_PASS_SAFE`	OnePassDfaEngine
`email`	`[a-zA-Z0-9._%+\-]+@…`	`DFA_SAFE`	LazyDfaEngine
`alternation`	`error\\|warn\\|fatal\\|…`	`DFA_SAFE`	LazyDfaEngine + AhoCorasick prefilter
`ip_address`	`(\d{1,3}\.){3}\d{1,3}`	`PIKEVM_ONLY`	PikeVmEngine

`ThroughputBenchmark` — throughput, large corpus (~1 MB)

Measures ops/sec scanning the full corpus with findAll(). Compares Orbit, JDK, and re2j. Matchers are created in @Setup; corpus generation is excluded from measurement.

Scenario	Pattern	Engine hint	Engine
`digits`	`\d+`	`ONE_PASS_SAFE`	OnePassDfaEngine
`quoted`	`"[^"]*"`	`DFA_SAFE`	LazyDfaEngine + literal prefilter
`ip_address`	`(\d{1,3}\.){3}\d{1,3}`	`PIKEVM_ONLY`	PikeVmEngine
`log_request`	`(GET\\|POST\\|…) \S+ HTTP/…`	`PIKEVM_ONLY`	PikeVmEngine
`dna_motif`	`ACGT+AC`	`DFA_SAFE`	LazyDfaEngine

`MatchingBenchmark` — Orbit only, various operations

Covers matches(), find(), findAll(), replaceAll(), split(). Note: this class creates a new Matcher inside each @Benchmark method, so numbers include Matcher allocation overhead — do not use for engine execution comparisons.

`BacktrackerBenchmark` — BoundedBacktrackEngine vs PikeVmEngine, backtracker patterns

Directly compares BoundedBacktrackEngine and PikeVmEngine on patterns that require backtracking semantics. JDK java.util.regex is included as a reference baseline. Engines are instantiated once in @Setup; only execute() is timed.

Scenario	Pattern	Input	Expected winner
`backref_short`	`(\w+)\s+\1`	`"hello hello"` (11 chars)	BoundedBacktrackEngine
`backref_sentence`	`(\w+)\s+\1`	Sentence with `"the the"` near end	BoundedBacktrackEngine
`backref_long`	`(\w+)\s+\1`	~10 KB access-log corpus + seeded match	PikeVM expected to tie or win
`possessive`	`\w++\s++\w++`	`"hello world"`	BoundedBacktrackEngine; JDK skipped (unsupported syntax)

Session 3 baseline (2026-03-26, JMH AverageTime ns/op, single fork, 5 warmup + 10 × 1 s):

Scenario	BBE	PikeVM	JDK
`backref_short`	316.5 ± 15	1492.8 ± 183	43.8 ± 2.6
`backref_sentence`	16987.7 ± 1461	17551.4 ± 2235	794.9 ± 92
`backref_long`	423132 ± 39960	428237 ± 39173	33675 ± 3013
`possessive`	492.8 ± 48	1556.1 ± 110	N/A†

†JDK possessive executes bh.consume(0); jdkPattern is null for that scenario.

backref_short and backref_sentence exercise the BBE sweet spot: short inputs where DFS avoids PikeVM’s per-step thread management overhead. backref_long reveals the crossover point where corpus size dominates. possessive is the only scenario that exercises the PossessiveSplit / AtomicCommit instruction path — neither PerformanceBenchmark nor ThroughputBenchmark cover it.

`EngineBenchmark` — engine dispatch overhead

Measures MetaEngine.getEngine(), engine.execute(), and the full pipeline. All scenarios hardcode engineHint = DFA_SAFE — PikeVM is never exercised here. Useful only for measuring LazyDfaEngine dispatch and MetaEngine overhead.

Running benchmarks

Standard run (all scenarios in a class)

java --enable-preview --add-modules jdk.incubator.vector \
  -jar orbit-benchmarks/target/orbit-benchmarks.jar \
  ThroughputBenchmark

Single scenario

java --enable-preview --add-modules jdk.incubator.vector \
  -jar orbit-benchmarks/target/orbit-benchmarks.jar \
  ThroughputBenchmark -p scenario=ip_address

With async-profiler (flamegraph output)

java --enable-preview \
  -jar orbit-benchmarks/target/orbit-benchmarks.jar \
  PerformanceBenchmark \
  -prof async:output=flamegraph \
  -p scenario=literal \
  -wi 1 -w 1m \
  -i 1 -r 1s \
  -f 1 \
  2>&1

-w 1m gives the JIT a full minute to stabilise before measurement begins. Flamegraph HTML files are written to the working directory alongside summary-cpu.txt.

Session 1 gate — PikeVM baseline

Session 1 optimises PikeVmEngine. The relevant scenarios are those with PIKEVM_ONLY classification. Capture the following before starting:

JAR=orbit-benchmarks/target/orbit-benchmarks.jar

# Latency baseline (short input)
java --enable-preview \
  -jar $JAR PerformanceBenchmark \
  -prof async:output=flamegraph \
  -p scenario=ip_address \
  -wi 1 -w 1m -i 10 -r 10s -f 1 2>&1

# Throughput baseline (large corpus — ip_address and log_request both exercise PikeVM)
java --enable-preview \
  -jar $JAR ThroughputBenchmark \
  -prof async:output=flamegraph \
  -p scenario=ip_address \
  -wi 1 -w 1m -i 10 -r 10s -f 1 2>&1

java --enable-preview \
  -jar $JAR ThroughputBenchmark \
  -prof async:output=flamegraph \
  -p scenario=log_request \
  -wi 1 -w 1m -i 10 -r 10s -f 1 2>&1

Commit the flamegraph HTML and summary-cpu.txt output before applying any changes.

Session 3 gate — BoundedBacktrackEngine baseline

Status: complete. Baseline captured 2026-03-26. Optimisations OPT-1, OPT-2, and OPT-3 applied. See benchmark-backtracker-baseline.txt in the project root for the full JMH output (JMH 1.37, JDK 21.0.10, single fork, 5 warmup + 10 measurement iterations × 1 s).

The baseline commands used were:

JAR=orbit-benchmarks/target/orbit-benchmarks.jar

# Baseline — all four backtracker scenarios
java --enable-preview \
  -jar $JAR BacktrackerBenchmark \
  -prof async:output=flamegraph \
  2>&1 | tee benchmark-backtracker-baseline.txt

# Single scenario with flamegraph (repeat for each scenario of interest)
java --enable-preview \
  -jar $JAR BacktrackerBenchmark \
  -prof async:output=flamegraph \
  -p scenario=backref_short \
  2>&1 | tee benchmark-backtracker-backref_short.txt

Session 3 results

All figures are JMH AverageTime (ns/op).

Scenario	BBE baseline	PikeVM baseline	JDK baseline	BBE target	Minimum improvement
`backref_short`	316.5 ± 15	1492.8 ± 183	43.8 ± 2.6	≤ 220	1.4×
`backref_sentence`	16987.7 ± 1461	17551.4 ± 2235	794.9 ± 92	≤ 11000	1.5×
`backref_long`	423132 ± 39960	428237 ± 39173	33675 ± 3013	≤ 340000	1.25×
`possessive`	492.8 ± 48	1556.1 ± 110	N/A†	≤ 540	no regression

†The JDK possessive figure (~0.36 ns/op) is not a real measurement: jdkPattern is null for that scenario and the benchmark body executes bh.consume(0).

Three optimisations were applied:

OPT-3: ArrayDeque pre-sized to splitCount + 1 (count of Split and PossessiveSplit instructions), eliminating resize copies for normal patterns.
OPT-1: Copy-on-write flag (capturesDirty) eliminates the capture-array clone at every backtrack restore. The clone is deferred until the first SaveCapture write after a restore. Reduces capture-array allocations by approximately 50% on backtrack-heavy paths.
OPT-2: initialCaptures int[], backtrack stack, and balance stacks are hoisted outside the start-position loop and reset (Arrays.fill / clear()) at the top of each iteration, eliminating O(n) per-start-position allocation for backref_sentence (~43 positions) and backref_long (>10,000 positions).

Planned optimizations

Thread pooling and capture array reuse in PikeVM
ByteMatcher for byte-array inputs
Lookahead/lookbehind sub-engine allocation: runSubProg allocates a fresh BoundedBacktrackEngine per assertion evaluation; fixing this requires a context-passing design and is deferred to a later session.
Capture-array pooling across execute() calls in BoundedBacktrackEngine: OPT-2 eliminated per-start-position allocation within a single execute() call; pooling the arrays between calls would reduce pressure further for callers that invoke execute() in a tight loop.

Common anti-patterns

Creating a new Pattern per request for the same regex string — compile once at startup and reuse.
Sharing a Matcher across threads — Matcher is not thread-safe; create one per thread from a shared Pattern.
Using NEEDS_BACKTRACKER patterns on adversarial user input without a timeout strategy — MatchTimeoutException is thrown when the budget is exceeded, but callers must catch it.