Performance Guide
Audience: Performance Engineer, Library Integrator
Purpose: Explain what makes Orbit fast, how to measure performance, and what optimizations are planned.
The Meta-Engine Architecture
Orbit analyzes each pattern at compile time and assigns an EngineHint. MetaEngine reads that hint at match time and dispatches to the appropriate engine.
Current engine routing
| EngineHint | Engine used |
|---|---|
| DFA_SAFE | LazyDfaEngine |
| ONE_PASS_SAFE | OnePassDfaEngine |
| PIKEVM_ONLY | PikeVmEngine |
| NEEDS_BACKTRACKER | BoundedBacktrackEngine |
| GRAMMAR_RULE | PikeVmEngine |
DFA_SAFE and ONE_PASS_SAFE patterns run in O(n) time. MetaEngine.getEngine routes them to LazyDfaEngine and OnePassDfaEngine respectively.
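The routing is a straight dispatch on the hint. A minimal sketch — the hint names come from the table, while the returned strings stand in for the real engine classes, whose construction is not shown:

```java
// Illustrative only: mirrors the routing table, not Orbit's actual MetaEngine code.
enum EngineHint { DFA_SAFE, ONE_PASS_SAFE, PIKEVM_ONLY, NEEDS_BACKTRACKER, GRAMMAR_RULE }

final class MetaEngineSketch {
    // Maps each hint to the engine named in the routing table above.
    static String engineFor(EngineHint hint) {
        return switch (hint) {
            case DFA_SAFE                  -> "LazyDfaEngine";
            case ONE_PASS_SAFE             -> "OnePassDfaEngine";
            case PIKEVM_ONLY, GRAMMAR_RULE -> "PikeVmEngine";
            case NEEDS_BACKTRACKER         -> "BoundedBacktrackEngine";
        };
    }
}
```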
Time complexity
| Engine | Time complexity | Notes |
|---|---|---|
| LazyDfaEngine | O(n) amortised | 1,024-state cache; falls back to PikeVM on saturation |
| OnePassDfaEngine | O(n) strict | No forking, no second pass for captures |
| PikeVmEngine | O(n × \|NFA\|) | Universal fallback; handles all instruction types |
| BoundedBacktrackEngine | O(budget) | Default budget: 1,000,000 operations |
Prefilter
MetaEngine runs the prefilter before invoking any engine. If the prefilter finds no candidate position, MetaEngine returns no-match without calling an engine.
| Literal count | Prefilter | Algorithm |
|---|---|---|
| 0 | NoopPrefilter | Skipped at match time |
| 1–10 | VectorLiteralPrefilter | SIMD via jdk.incubator.vector (ShortVector.SPECIES_PREFERRED): broadcasts the literal’s first character across lanes and calls regionMatches on hits; falls back to String.indexOf for inputs shorter than 32 chars |
| 11–500 | AhoCorasickPrefilter (NFA) | Multi-pattern automaton |
| > 500 | AhoCorasickPrefilter (DFA) | Determinised Aho-Corasick |
NoopPrefilter is used when CASE_INSENSITIVE is active or when the pattern’s minimum match length is 0 — cases where literal prefix scanning is not sound.
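Taken together, the table and the soundness rule amount to a small selection function. A hedged sketch — the method, its parameters, and the returned names are illustrative, not Orbit's actual API:

```java
// Illustrative selection logic mirroring the prefilter table and the
// NoopPrefilter soundness conditions described above.
final class PrefilterSelect {
    static String choose(int literalCount, boolean caseInsensitive, int minMatchLen) {
        // Literal prefix scanning is not sound under CASE_INSENSITIVE or
        // when the pattern can match the empty string.
        if (caseInsensitive || minMatchLen == 0 || literalCount == 0) return "NoopPrefilter";
        if (literalCount <= 10)  return "VectorLiteralPrefilter";      // SIMD single-literal scan
        if (literalCount <= 500) return "AhoCorasickPrefilter(NFA)";   // multi-pattern automaton
        return "AhoCorasickPrefilter(DFA)";                            // determinised Aho-Corasick
    }
}
```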
SIMD scope: prefilter only
VectorLiteralPrefilter benefits from SIMD because its task is strictly data-parallel: scan a buffer for a fixed byte pattern, applying the same operation to N characters per cycle. That regularity is what SIMD requires.
The PikeVM cannot benefit. It runs a Thompson NFA simulation: for each input character, up to |NFA| threads may be active, each carrying its own program counter and capture registers. Every step involves data-dependent opcode dispatch, per-thread state updates, and irregular memory access — structurally incompatible with lockstep SIMD computation.
The LazyDFA inner loop is a flat table walk (O(1) per character). The JIT’s own auto-vectorisation handles this well; explicit Vector API use would likely prevent the JIT from applying its own optimisations.
SIMD investment in Orbit belongs at the prefilter boundary, not inside the engine cores.
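For reference, the scalar fallback path noted in the prefilter table (inputs shorter than 32 chars) can be sketched as a first-character scan confirmed by `regionMatches`; `firstCandidate` is an illustrative name, not the real class's method:

```java
// Scalar sketch of the VectorLiteralPrefilter fallback path: return the
// first candidate position of `literal` in `input`, or -1 for no-match.
final class ScalarScan {
    static int firstCandidate(String input, String literal) {
        if (literal.isEmpty()) return 0;
        char first = literal.charAt(0);
        int last = input.length() - literal.length();
        for (int i = 0; i <= last; i++) {
            // Cheap first-character test, then confirm the full literal.
            if (input.charAt(i) == first
                    && input.regionMatches(i, literal, 0, literal.length())) {
                return i;
            }
        }
        return -1;
    }
}
```

The SIMD version does the same thing per the table above — broadcast the first character across vector lanes, then confirm hits with `regionMatches` — but tests many positions per step.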
Inspecting the assigned hint
```java
import com.orbit.api.Pattern;
import com.orbit.util.EngineHint;

Pattern p = Pattern.compile("[a-z]+@[a-z]+\\.[a-z]{2,4}");
EngineHint hint = p.engineHint(); // DFA_SAFE

Pattern q = Pattern.compile("(\\w+)\\1");
EngineHint hint2 = q.engineHint(); // NEEDS_BACKTRACKER (backreference)
```
Compile-time optimizations
The HIR analysis phase performs five passes before compilation:
- Build HIR tree
- Extract literal prefix/suffix → LiteralSet
- One-pass safety check
- Output acyclicity and bounded-length check
- Engine classification + prefilter selection
Five HIR rewrite rules run as dead-code elimination and simplification passes before the compiler produces Prog. Min/max match length and anchor positions (startAnchored, endAnchored) are computed during analysis and stored in Metadata. MetaEngine uses anchor positions to short-circuit matching when the pattern is start-anchored and the candidate position is not position 0.
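The anchor and length short-circuits can be sketched as a single viability check; `viable` and its parameters are illustrative stand-ins for the `startAnchored` and min-length fields held in Metadata:

```java
// Illustrative sketch of MetaEngine's candidate-position short-circuits.
final class AnchorCheck {
    static boolean viable(boolean startAnchored, int pos, int minMatchLen, int inputLen) {
        if (startAnchored && pos != 0) return false;  // anchored patterns match only at 0
        return inputLen - pos >= minMatchLen;         // not enough input left to match
    }
}
```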
ProgOptimiser: EpsilonJump chain folding
ProgOptimiser.foldEpsilonChains runs once inside Pattern.buildCompileResult, after the compiler produces its Prog and after Metadata is assembled. It rewrites the instruction array in place before the final Prog is constructed. No engine code changes.
What it does. The Thompson compiler inserts EpsilonJump instructions as connective tissue between segments: after a quantifier body, around group boundaries, and between alternation arms. Each EpsilonJump does nothing but advance pc to its next field, but it still pays the full BBE dispatch cost: a bounds check, a getInstruction call, and a megamorphic switch arm. ProgOptimiser eliminates these from the execution path by replacing every PC field in every other instruction with the first non-EpsilonJump address reachable from that field. The EpsilonJump nodes remain in the array at their original indices — the array length is unchanged — but no reachable instruction points to them after the pass.
The resolve function. For a given starting PC, resolve follows the EpsilonJump chain until it reaches either a non-epsilon instruction or a cycle. Cycle detection uses a per-call boolean[] visited array; a cycle head is returned as its own target (the engine budget exhausts on such degenerate programs before any infinite loop occurs). resolve runs in O(n) time per call where n is the instruction count; the full fold pass is O(n²) worst case, which is acceptable for patterns compiled once and cached.
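A sketch of resolve under the description above — the `int[] next` and `boolean[] isEpsilon` arrays are stand-ins for the real Prog's instruction accessors:

```java
// Illustrative sketch of resolve: follow an EpsilonJump chain until it
// reaches a non-epsilon instruction or detects a cycle.
final class EpsilonFold {
    static int resolve(int start, int[] next, boolean[] isEpsilon) {
        boolean[] visited = new boolean[next.length];  // per-call cycle detection
        int pc = start;
        while (isEpsilon[pc]) {
            if (visited[pc]) return pc;  // cycle head is returned as its own target
            visited[pc] = true;
            pc = next[pc];               // hop to the EpsilonJump's `next` field
        }
        return pc;                       // first non-EpsilonJump address
    }
}
```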
Sub-program recursion. Lookahead, LookaheadNeg, LookbehindPos, LookbehindNeg, and ConditionalBranchInstr each carry an embedded Prog body — an independently compiled sub-program for the assertion. ProgOptimiser recurses into each non-null body via foldProg, applying the same pass. This means lookahead and lookbehind sub-programs benefit from the same reduction even though they are executed by runSubProg, not the outer BBE rec() loop.
Patterns that benefit most.
| Pattern structure | Why it benefits |
|---|---|
| Quantifiers (a+, (ab)+, \w{3,8}) | Quantifier bodies compile to loop structures connected by EpsilonJump chains |
| Alternation (a\|b\|c\|d) | Each arm is separated from the next by EpsilonJump links |
| Nested groups | Group open/close boundaries emit EpsilonJump connectors |
| Lookahead / lookbehind bodies | Sub-program chains are folded recursively |
Patterns without these structures — bare literals, single character classes — produce few or no EpsilonJump instructions and see no measurable change.
Expected gain. The SPECIFICATION.md acceptance criterion requires at least two of the four BacktrackerBenchmark scenarios to improve by ≥ 5% against the Session 3 baseline, with no scenario regressing by more than 5%. The backref_short and backref_sentence patterns (backreference with quantifier) are the primary targets; the spec projects 15–30% fewer dispatch iterations for such patterns.
Idempotence. Running foldEpsilonChains twice on its own output produces the same result. Pattern.compile caches its output in a ConcurrentHashMap keyed by (regex, flags), so the pass executes at most once per unique pattern regardless.
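The compile-once caching can be sketched with `computeIfAbsent`; the key record and the `expensiveCompile` stub are illustrative, not Orbit's internals:

```java
import java.util.concurrent.ConcurrentHashMap;

// Illustrative sketch of Pattern.compile's cache keyed by (regex, flags).
final class CompileCache {
    record Key(String regex, int flags) {}

    private static final ConcurrentHashMap<Key, Object> CACHE = new ConcurrentHashMap<>();

    // Stand-in for the full parse/analyse/optimise/compile pipeline.
    private static Object expensiveCompile(String regex, int flags) { return new Object(); }

    // The same (regex, flags) pair always returns the cached result, so
    // every compile-time pass runs at most once per unique pattern.
    static Object compile(String regex, int flags) {
        return CACHE.computeIfAbsent(new Key(regex, flags),
                k -> expensiveCompile(k.regex(), k.flags()));
    }
}
```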
JMH Benchmarks
The orbit-benchmarks module provides a JMH harness. Build once, then run any combination of benchmark class and scenario:
```shell
mvn package -pl orbit-benchmarks -am
```
Benchmark classes
PerformanceBenchmark — latency, short inputs (~100–200 chars)
Measures average nanoseconds per find() / findAll() call. Compares Orbit, JDK, and re2j. Matchers are created in @Setup; only execution is timed.
| Scenario | Pattern | Engine hint | Engine |
|---|---|---|---|
| literal | hello | DFA_SAFE | LazyDfaEngine |
| phone | \d{3}-\d{3}-\d{4} | ONE_PASS_SAFE | OnePassDfaEngine |
| email | [a-zA-Z0-9._%+\-]+@… | DFA_SAFE | LazyDfaEngine |
| alternation | error\|warn\|fatal\|… | DFA_SAFE | LazyDfaEngine + AhoCorasick prefilter |
| ip_address | (\d{1,3}\.){3}\d{1,3} | PIKEVM_ONLY | PikeVmEngine |
ThroughputBenchmark — throughput, large corpus (~1 MB)
Measures ops/sec scanning the full corpus with findAll(). Compares Orbit, JDK, and re2j. Matchers are created in @Setup; corpus generation is excluded from measurement.
| Scenario | Pattern | Engine hint | Engine |
|---|---|---|---|
| digits | \d+ | ONE_PASS_SAFE | OnePassDfaEngine |
| quoted | "[^"]*" | DFA_SAFE | LazyDfaEngine + literal prefilter |
| ip_address | (\d{1,3}\.){3}\d{1,3} | PIKEVM_ONLY | PikeVmEngine |
| log_request | (GET\|POST\|…) \S+ HTTP/… | PIKEVM_ONLY | PikeVmEngine |
| dna_motif | ACGT+AC | DFA_SAFE | LazyDfaEngine |
MatchingBenchmark — Orbit only, various operations
Covers matches(), find(), findAll(), replaceAll(), split(). Note: this class creates a new Matcher inside each @Benchmark method, so numbers include Matcher allocation overhead — do not use for engine execution comparisons.
BacktrackerBenchmark — BoundedBacktrackEngine vs PikeVmEngine, backtracker patterns
Directly compares BoundedBacktrackEngine and PikeVmEngine on patterns that require backtracking semantics. JDK java.util.regex is included as a reference baseline. Engines are instantiated once in @Setup; only execute() is timed.
| Scenario | Pattern | Input | Expected winner |
|---|---|---|---|
| backref_short | (\w+)\s+\1 | "hello hello" (11 chars) | BoundedBacktrackEngine |
| backref_sentence | (\w+)\s+\1 | Sentence with "the the" near end | BoundedBacktrackEngine |
| backref_long | (\w+)\s+\1 | ~10 KB access-log corpus + seeded match | PikeVM expected to tie or win |
| possessive | \w++\s++\w++ | "hello world" | BoundedBacktrackEngine; JDK skipped (unsupported syntax) |
Session 3 baseline (2026-03-26, JMH AverageTime ns/op, single fork, 5 warmup + 10 × 1 s):
| Scenario | BBE | PikeVM | JDK |
|---|---|---|---|
| backref_short | 316.5 ± 15 | 1492.8 ± 183 | 43.8 ± 2.6 |
| backref_sentence | 16987.7 ± 1461 | 17551.4 ± 2235 | 794.9 ± 92 |
| backref_long | 423132 ± 39960 | 428237 ± 39173 | 33675 ± 3013 |
| possessive | 492.8 ± 48 | 1556.1 ± 110 | N/A† |
†JDK possessive executes bh.consume(0); jdkPattern is null for that scenario.
backref_short and backref_sentence exercise the BBE sweet spot: short inputs where DFS avoids PikeVM’s per-step thread management overhead. backref_long reveals the crossover point where corpus size dominates. possessive is the only scenario that exercises the PossessiveSplit / AtomicCommit instruction path — neither PerformanceBenchmark nor ThroughputBenchmark cover it.
EngineBenchmark — engine dispatch overhead
Measures MetaEngine.getEngine(), engine.execute(), and the full pipeline. All scenarios hardcode engineHint = DFA_SAFE — PikeVM is never exercised here. Useful only for measuring LazyDfaEngine dispatch and MetaEngine overhead.
Running benchmarks
Standard run (all scenarios in a class)
```shell
java --enable-preview --add-modules jdk.incubator.vector \
  -jar orbit-benchmarks/target/orbit-benchmarks.jar \
  ThroughputBenchmark
```
Single scenario
```shell
java --enable-preview --add-modules jdk.incubator.vector \
  -jar orbit-benchmarks/target/orbit-benchmarks.jar \
  ThroughputBenchmark -p scenario=ip_address
```
With async-profiler (flamegraph output)
```shell
java --enable-preview \
  -jar orbit-benchmarks/target/orbit-benchmarks.jar \
  PerformanceBenchmark \
  -prof async:output=flamegraph \
  -p scenario=literal \
  -wi 1 -w 1m \
  -i 1 -r 1s \
  -f 1 \
  2>&1
```
`-w 1m` gives the JIT a full minute to stabilise before measurement begins. Flamegraph HTML files are written to the working directory alongside `summary-cpu.txt`.
Session 1 gate — PikeVM baseline
Session 1 optimises PikeVmEngine. The relevant scenarios are those with PIKEVM_ONLY classification. Capture the following before starting:
```shell
JAR=orbit-benchmarks/target/orbit-benchmarks.jar

# Latency baseline (short input)
java --enable-preview \
  -jar $JAR PerformanceBenchmark \
  -prof async:output=flamegraph \
  -p scenario=ip_address \
  -wi 1 -w 1m -i 10 -r 10s -f 1 2>&1

# Throughput baseline (large corpus — ip_address and log_request both exercise PikeVM)
java --enable-preview \
  -jar $JAR ThroughputBenchmark \
  -prof async:output=flamegraph \
  -p scenario=ip_address \
  -wi 1 -w 1m -i 10 -r 10s -f 1 2>&1

java --enable-preview \
  -jar $JAR ThroughputBenchmark \
  -prof async:output=flamegraph \
  -p scenario=log_request \
  -wi 1 -w 1m -i 10 -r 10s -f 1 2>&1
```
Commit the flamegraph HTML and summary-cpu.txt output before applying any changes.
Session 3 gate — BoundedBacktrackEngine baseline
Status: complete. Baseline captured 2026-03-26. Optimisations OPT-1, OPT-2, and OPT-3 applied. See benchmark-backtracker-baseline.txt in the project root for the full JMH output (JMH 1.37, JDK 21.0.10, single fork, 5 warmup + 10 measurement iterations × 1 s).
The baseline commands used were:
```shell
JAR=orbit-benchmarks/target/orbit-benchmarks.jar

# Baseline — all four backtracker scenarios
java --enable-preview \
  -jar $JAR BacktrackerBenchmark \
  -prof async:output=flamegraph \
  2>&1 | tee benchmark-backtracker-baseline.txt

# Single scenario with flamegraph (repeat for each scenario of interest)
java --enable-preview \
  -jar $JAR BacktrackerBenchmark \
  -prof async:output=flamegraph \
  -p scenario=backref_short \
  2>&1 | tee benchmark-backtracker-backref_short.txt
```
Session 3 results
All figures are JMH AverageTime (ns/op).
| Scenario | BBE baseline | PikeVM baseline | JDK baseline | BBE target | Minimum improvement |
|---|---|---|---|---|---|
| backref_short | 316.5 ± 15 | 1492.8 ± 183 | 43.8 ± 2.6 | ≤ 220 | 1.4× |
| backref_sentence | 16987.7 ± 1461 | 17551.4 ± 2235 | 794.9 ± 92 | ≤ 11000 | 1.5× |
| backref_long | 423132 ± 39960 | 428237 ± 39173 | 33675 ± 3013 | ≤ 340000 | 1.25× |
| possessive | 492.8 ± 48 | 1556.1 ± 110 | N/A† | ≤ 540 | no regression |
†The JDK possessive figure (~0.36 ns/op) is not a real measurement: jdkPattern is null for that scenario and the benchmark body executes bh.consume(0).
Three optimisations were applied:
- OPT-3: `ArrayDeque` pre-sized to `splitCount + 1` (count of `Split` and `PossessiveSplit` instructions), eliminating resize copies for normal patterns.
- OPT-1: Copy-on-write flag (`capturesDirty`) eliminates the capture-array clone at every backtrack restore. The clone is deferred until the first `SaveCapture` write after a restore. Reduces capture-array allocations by approximately 50% on backtrack-heavy paths.
- OPT-2: The `initialCaptures` `int[]`, backtrack stack, and balance stacks are hoisted outside the start-position loop and reset (`Arrays.fill`/`clear()`) at the top of each iteration, eliminating O(n) per-start-position allocation for `backref_sentence` (~43 positions) and `backref_long` (>10,000 positions).
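The copy-on-write idea behind OPT-1 can be sketched in a few lines; the class, field, and method names here are illustrative, not the engine's actual internals:

```java
// Illustrative sketch of copy-on-write captures: save/restore share the
// array instead of cloning; the clone is deferred until the next write.
final class Captures {
    private int[] slots;
    private boolean dirty;  // true = `slots` may be aliased by a saved snapshot

    Captures(int n) { slots = new int[n]; }

    int[] save() { dirty = true; return slots; }                   // share, don't clone
    void restore(int[] snapshot) { slots = snapshot; dirty = true; }

    void set(int i, int v) {
        if (dirty) { slots = slots.clone(); dirty = false; }       // first write pays the copy
        slots[i] = v;
    }
    int get(int i) { return slots[i]; }
}
```

Backtrack restores that are never followed by a capture write (the common case on failing branches) now cost nothing beyond a pointer assignment.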
Planned optimizations
- Thread pooling and capture array reuse in PikeVM
- `ByteMatcher` for byte-array inputs
- Lookahead/lookbehind sub-engine allocation: `runSubProg` allocates a fresh `BoundedBacktrackEngine` per assertion evaluation; fixing this requires a context-passing design and is deferred to a later session.
- Capture-array pooling across `execute()` calls in `BoundedBacktrackEngine`: OPT-2 eliminated per-start-position allocation within a single `execute()` call; pooling the arrays between calls would reduce pressure further for callers that invoke `execute()` in a tight loop.
Common anti-patterns
- Creating a new `Pattern` per request for the same regex string — compile once at startup and reuse.
- Sharing a `Matcher` across threads — `Matcher` is not thread-safe; create one per thread from a shared `Pattern`.
- Using `NEEDS_BACKTRACKER` patterns on adversarial user input without a timeout strategy — `MatchTimeoutException` is thrown when the budget is exceeded, but callers must catch it.
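The first two points, sketched with the JDK's `java.util.regex` so the example is self-contained (the same compile-once, matcher-per-use discipline applies to Orbit's `Pattern` and `Matcher`):

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class LogScanner {
    // Compile once at class-load time; the Pattern is immutable and shareable.
    private static final Pattern WORD = Pattern.compile("\\w+");

    // Matchers are not thread-safe: create one per call (or per thread)
    // from the shared Pattern, never share one across threads.
    static int countWords(String line) {
        Matcher m = WORD.matcher(line);
        int n = 0;
        while (m.find()) n++;
        return n;
    }
}
```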