Orbit Semantic Layers
This document is the authoritative reference for which behaviors Orbit provides in each mode, where it diverges from JDK or RE2, and which extensions are always-on. Every implementation decision that touches matching semantics — including the RegExCompatTest Batch 3 fixes — must be evaluated against this document before writing code.
Feature set hierarchy
RE2 is stricter than JDK, not an extension of it. The configurable layers, ordered from most to least restrictive:
RE2_COMPAT (smallest) ⊂ Orbit default ⊂ +UNICODE ⊂ +PERL_NEWLINES
| Mode | Activation | What changes |
|---|---|---|
| RE2_COMPAT | PatternFlag.RE2_COMPAT | Restricts to RE2 subset: no backrefs, no lookaround, no possessives, O(n) guarantee |
| Orbit default | no flags | JDK 21 behavioral parity. Also accepts syntax JDK rejects as PSE (see below). |
| UNICODE | PatternFlag.UNICODE | \w/\d/\s/\b and POSIX classes use Unicode properties instead of ASCII |
| PERL_NEWLINES | PatternFlag.PERL_NEWLINES | Dot and $ exclude \n only; \r stays a terminator for anchors |
Orbit’s default mode is a superset of JDK’s parser. It accepts syntax that JDK rejects with PatternSyntaxException (PSE): atomic groups (?>...), balancing groups (?<name-name2>...), conditional subpatterns (?(cond)yes|no), and Python-style named groups (?P<name>...). This is not a separate opt-in layer; Orbit’s parser is simply broader. No existing java.util.regex program uses these constructs (they would have thrown PSE), so there is no compatibility risk. For patterns that JDK accepts, matching semantics are identical to JDK.
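The compatibility argument above can be checked against java.util.regex itself. The sketch below uses only JDK classes (no Orbit API) and probes a few of the constructs named above; the class name JdkRejectsDemo and the group names are hypothetical and chosen only for illustration.

```java
import java.util.regex.Pattern;
import java.util.regex.PatternSyntaxException;

final class JdkRejectsDemo {
    /** Returns true if java.util.regex refuses to compile the given pattern. */
    static boolean jdkRejects(String regex) {
        try {
            Pattern.compile(regex);
            return false;
        } catch (PatternSyntaxException e) {
            return true;
        }
    }

    public static void main(String[] args) {
        String[] patterns = {
            "(?P<word>\\w+)",    // Python-style named group
            "(?<open-close>x)",  // balancing group
            "(a)?(?(1)b|c)",     // conditional subpattern
            "(?<word>\\w+)"      // ordinary JDK named group, for contrast
        };
        for (String p : patterns) {
            String outcome = jdkRejects(p) ? "PatternSyntaxException" : "compiles";
            System.out.println(p + " -> " + outcome);
        }
    }
}
```

Because the first three patterns cannot compile under the JDK, no working JDK program can depend on how they match, which is the property the extension layer relies on.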
UNICODE and PERL_NEWLINES are opt-in relaxations toward Perl behavior, independent and combinable.
The transducer API (com.orbit.api.Transducer) is architecturally separate from all layers and is unaffected by PatternFlag.
JDK default vs RE2: key behavioral differences
RE2 is designed for untrusted input with a guaranteed O(n) time bound. It rejects or ignores anything that requires backtracking. JDK is Perl-like: feature-complete but ReDoS-vulnerable without external budget enforcement (Orbit adds the BBE budget on top).
Feature support
| Feature | JDK / Orbit default | RE2_COMPAT |
|---|---|---|
| Backreferences (\1, \k<name>) | Supported | PatternSyntaxException |
| Lookahead ((?=...), (?!...)) | Supported | PatternSyntaxException |
| Lookbehind ((?<=...), (?<!...)) | Fixed- and variable-length | PatternSyntaxException |
| Atomic groups (?>...) | Supported (BBE) | PatternSyntaxException |
| Possessive quantifiers x*+ | Supported (BBE) | PatternSyntaxException |
| Balancing groups (?<n-m>...) | Supported (BBE) | PatternSyntaxException |
| Conditional subpatterns | Supported (BBE) | PatternSyntaxException |
| Character class intersection [a&&[b]] | Supported | PatternSyntaxException |
| COMMENTS flag | Supported | Not supported |
| LITERAL flag | Supported | Not supported |
| UNICODE_CASE flag | Supported | Not supported |
| UNIX_LINES flag | Supported | Not supported |
| CANON_EQ flag | Supported | Not supported |
| Region/anchoring API (region(), useAnchoringBounds()) | Supported | Not supported |
| Engine guarantee | Best-effort; BBE used for above features | Strictly O(n); DFA/PikeVM only |
Behavioral differences for shared features
| Behavior | JDK / Orbit default | RE2_COMPAT |
|---|---|---|
| Dot . exclusions | \n, \r, \u0085, \u2028, \u2029 | \n only |
| $ matches before | \n, \r, \u0085, \u2028, \u2029 (and before \r of \r\n) | \n only |
| \r\n as unit | Yes: indivisible; $ before \r, ^ not between \r and \n | No: each character independent |
| \b word characters | [a-zA-Z0-9_] + Unicode NON_SPACING_MARK | ASCII only: [a-zA-Z0-9_] |
| Alternation priority | Leftmost-first (NFA); matches() backtracks to find full-length alternative | Leftmost-longest (DFA) |
| Named capture syntax | (?<name>...) | (?P<name>...) or (?<name>...) depending on port |
| Case-insensitive backrefs | (a)\1 matches "aA" | No backrefs |
| Special Unicode folds | UNICODE_CASE+CASE_INSENSITIVE: dotless-i, long-S, Kelvin, Angstrom | Not applicable (flag unsupported) |
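Several rows in the JDK / Orbit default column describe stock java.util.regex behavior, so they can be observed without Orbit. The following sketch exercises three of them: leftmost-first alternation with matches() backtracking, \r\n treated as one unit by ^ in MULTILINE mode, and case-insensitive backreferences. The class and helper names are hypothetical, and the RE2_COMPAT column is not exercised here.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class JdkDefaultSemanticsDemo {
    /** Text of the first find() hit, or null when nothing matches. */
    static String firstMatch(String regex, int flags, String input) {
        Matcher m = Pattern.compile(regex, flags).matcher(input);
        return m.find() ? m.group() : null;
    }

    /** Start offsets of every find() hit, in order. */
    static List<Integer> matchStarts(String regex, int flags, String input) {
        List<Integer> starts = new ArrayList<>();
        Matcher m = Pattern.compile(regex, flags).matcher(input);
        while (m.find()) {
            starts.add(m.start());
        }
        return starts;
    }

    /** True when the whole input matches (Matcher.matches semantics). */
    static boolean fullMatch(String regex, int flags, String input) {
        return Pattern.compile(regex, flags).matcher(input).matches();
    }

    public static void main(String[] args) {
        // Leftmost-first: find() stops at the first alternative that succeeds.
        System.out.println(firstMatch("a|ab", 0, "ab"));
        // matches() keeps backtracking until an alternative covers the whole input.
        System.out.println(fullMatch("a|ab", 0, "ab"));
        // CR LF is one terminator: ^ matches at offsets 0 and 3, never at 2.
        System.out.println(matchStarts("^", Pattern.MULTILINE, "a\r\nb"));
        // Case-insensitive backreference.
        System.out.println(fullMatch("(a)\\1", Pattern.CASE_INSENSITIVE, "aA"));
    }
}
```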
JDK default vs Perl: additional context
The PerlRegexCompatTest suite runs Perl’s re_tests. Many Perl features are not JDK features and are not Orbit targets:
| Feature | Perl | JDK / Orbit |
|---|---|---|
| \K keep-out | Yes | JDK: no; Orbit: yes (§6.7.3 F2) |
| Branch reset (?\|...) | Yes | JDK: no; Orbit: yes (§6.7.3 F1) |
| Lookahead inside character classes | Yes | No |
| \b is Unicode-aware by default | Yes | No (ASCII; needs UNICODE_CHARACTER_CLASS) |
| POSIX classes with Unicode semantics | Yes | Limited |
Orbit does not have a Perl compat mode. PerlRegexCompatTest skips are expected.
Orbit extensions (JDK+ default layer)
These are available in Orbit’s default mode. They are not available under RE2_COMPAT because they all require BoundedBacktrackEngine. Existing java.util.regex programs are unaffected because JDK rejects these constructs with PatternSyntaxException, so no valid JDK program uses them.
| Feature | Syntax | Why safe to extend |
|---|---|---|
| Atomic groups | (?>...) | JDK throws PSE; no compat conflict |
| Balancing groups | (?<name-name2>...) | .NET only; JDK throws PSE |
| Conditional subpatterns | (?(cond)yes\|no) | JDK throws PSE |
| Python-style named groups | (?P<name>...) | JDK throws PSE; alias for (?<name>...) |
Note: possessive quantifiers (x*+, x++, etc.) are supported by JDK 17+ and are therefore part of the JDK default layer, not an extension.
RE2_COMPAT — implemented (§6.10)
PatternFlag.RE2_COMPAT is fully wired as of §6.10. At compile time, Pattern.compile throws PatternSyntaxException for any non-RE2 construct — never at match time.
Rejected at compile time:
- Backreferences (\1–\99, \k<name>)
- Lookahead ((?=...), (?!...))
- Lookbehind ((?<=...), (?<!...))
- Atomic groups (?>...)
- Possessive quantifiers (*+, ++, ?+, {n,m}+)
- Balancing groups, conditional subpatterns
- Character class intersection ([a&&[b]])
- Incompatible flags: COMMENTS, LITERAL, UNICODE_CASE, UNIX_LINES, CANON_EQ, PERL_NEWLINES
Behavioral changes under RE2_COMPAT:
- Dot . excludes only \n (not \r, \u0085, \u2028, \u2029)
- ^/$ use \n-only terminator set; \r is not treated as a line terminator
- Engine forced to PikeVmEngine (NFA semantics); patterns can never be routed to BoundedBacktrackEngine
See ROADMAP.md §6.10 for full implementation notes.
Closing the Perl gap (PerlRegexCompatTest)
664 of 1974 Perl re_tests cases are currently skipped. The skips fall into four buckets: the first three are independent and each has its own fix strategy, and the fourth is intentional divergence that will not be fixed.
Bucket 1 — Feature gaps (no flag needed)
These constructs are absent from Orbit’s parser/compiler but are compatible with JDK-default or JDK-extension semantics. Implementing them does not require a new mode flag.
| Feature | Skip count (approx) | Effort | Status |
|---|---|---|---|
| \x{NNN} braced hex escape | ~10 | Easy | Done (§6.7.1) |
| \c control-char escape | ~5 | Easy | Done (§6.7.1) |
| \h/\H horizontal whitespace, \v/\V vertical whitespace | ~20 | Easy | Done (§6.7.2) |
| \o{...} braced octal escape | ~5 | Easy | Done (§6.7.1) |
| {,N} leading-comma quantifier (= {0,N}) | ~5 | Easy | Done (§6.7.1) |
| \R any Unicode line break | ~15 | Medium | Done (§6.7.2 / Batch B1) |
| \g relative/named backreference | ~10 | Medium | Done (§6.7.2) |
| (?\|...) branch reset group | ~30 | Medium | Done (§6.7.3 F1) |
| \K keep assertion (reset match start) | ~20 | Medium | Done (§6.7.3 F2) |
| Named conditional (?(<name>)...) | ~5 | Medium | Done (§6.7.2) |
| \N{NAME} Unicode named char escape | ~few | Medium | Done (§6.7.5) |
| \X extended grapheme cluster | ~10 | Hard | Not scheduled |
| Recursion (?R), (?N), (?&name) | ~30 | Hard | Not planned |
| POSIX collating [[.x.]], [[=x=]] | ~5 | Hard | Not planned |
Bucket 2 — UNICODE flag (Unicode-aware character classes) — Implemented (§6.8)
PatternFlag.UNICODE is fully wired. It changes the following behaviors:
| Behavior | JDK default | UNICODE |
|---|---|---|
| \w | [a-zA-Z0-9_] | Unicode letter/digit/connector punctuation |
| \d | [0-9] | Unicode decimal digit (\p{Nd}) |
| \s | ASCII whitespace | \p{Z} union ASCII whitespace |
| \b word boundary | ASCII + NON_SPACING_MARK | Character.isLetterOrDigit() or '_' |
| [:alpha:] POSIX | ASCII only | \p{L} Unicode letters |
| [:digit:] POSIX | [0-9] | \p{Nd} |
| [:lower:] POSIX | [a-z] | \p{Ll} |
| [:upper:] POSIX | [A-Z] | \p{Lu} |
| [:space:] POSIX | ASCII whitespace | Unicode whitespace |
| \p{Word} POSIX | Letter/digit/_ | Also includes CONNECTOR_PUNCTUATION |
| Case folding | ASCII + JDK special | Full Unicode case folding (dotless-i, long-s, Kelvin, Ångström) |
UNICODE implies UNICODE_CASE. It does not change dot or anchor behavior.
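For readers who want to see the ASCII-versus-Unicode split in practice, the JDK has a comparable switch, Pattern.UNICODE_CHARACTER_CLASS. The sketch below uses it as a stand-in; that Orbit’s PatternFlag.UNICODE agrees with it on these particular inputs is an assumption, not something this document states. The class name UnicodeClassDemo is hypothetical.

```java
import java.util.regex.Pattern;

final class UnicodeClassDemo {
    /** True when the whole input matches the regex under the given JDK flags. */
    static boolean matches(String regex, int flags, String input) {
        return Pattern.compile(regex, flags).matcher(input).matches();
    }

    public static void main(String[] args) {
        int ucc = Pattern.UNICODE_CHARACTER_CLASS; // JDK flag used as a stand-in
        String eAcute = "\u00E9";      // LATIN SMALL LETTER E WITH ACUTE
        String arabicThree = "\u0663"; // ARABIC-INDIC DIGIT THREE
        String emSpace = "\u2003";     // EM SPACE

        // Each line prints the default (ASCII) result, then the Unicode result.
        System.out.println(matches("\\w", 0, eAcute) + " / " + matches("\\w", ucc, eAcute));
        System.out.println(matches("\\d", 0, arabicThree) + " / " + matches("\\d", ucc, arabicThree));
        System.out.println(matches("\\s", 0, emSpace) + " / " + matches("\\s", ucc, emSpace));
    }
}
```

Dot and anchors are deliberately left out of this sketch, since the flag does not affect them.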
Bucket 3 — PERL_NEWLINES flag (dot/anchor newline set) — Implemented (§6.9)
PatternFlag.PERL_NEWLINES is fully wired. It implements Perl’s actual newline semantics, distinct from UNIX_LINES:
| Behavior | JDK default | UNIX_LINES | PERL_NEWLINES |
|---|---|---|---|
| . excludes | \n \r \u0085 \u2028 \u2029 | \n only | \n only |
| $ matches before | same 5 chars | \n only | \n only |
| ^ in MULTILINE | any of 5 | \n only | \n only |
| \r as line terminator for $ | yes | no | yes |
| \r\n as unit | yes | no | yes |
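The first two columns of this table are plain java.util.regex behavior and can be reproduced with the JDK alone. The sketch below compares the default terminator set with UNIX_LINES for dot and $; it does not exercise PERL_NEWLINES, which is an Orbit flag. The class and method names are hypothetical.

```java
import java.util.regex.Pattern;

final class LineTerminatorDemo {
    /** True when dot matches the given one-character string under the flags. */
    static boolean dotMatches(String ch, int flags) {
        return Pattern.compile(".", flags).matcher(ch).matches();
    }

    /** True when the pattern a$ is found somewhere in the input. */
    static boolean dollarAfterA(String input, int flags) {
        return Pattern.compile("a$", flags).matcher(input).find();
    }

    public static void main(String[] args) {
        String[] terminators = {"\n", "\r", "\u0085", "\u2028", "\u2029"};
        for (String t : terminators) {
            // Print the code point, then whether dot matches it in each mode.
            System.out.printf("U+%04X default=%b unixLines=%b%n",
                    (int) t.charAt(0),
                    dotMatches(t, 0),
                    dotMatches(t, Pattern.UNIX_LINES));
        }
        // In default mode $ matches before a trailing CR LF pair.
        System.out.println(dollarAfterA("a\r\n", 0));
        // Under UNIX_LINES the CR is an ordinary character, so a$ fails here.
        System.out.println(dollarAfterA("a\r\n", Pattern.UNIX_LINES));
        // A bare LF is a terminator in both modes.
        System.out.println(dollarAfterA("a\n", Pattern.UNIX_LINES));
    }
}
```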
Bucket 4 — Won’t fix (intentional Perl divergence)
| Behavior | Reason |
|---|---|
| Group reset to null vs Perl undef | Language-level difference; Java has no undef |
| ZWNJ / ZWJ as \w | Orbit follows JDK intentionally |
| Octal \400–\777 → Unicode U+0100–U+01FF | Perl-specific; not planned |
| Self-referential backreferences (a\1?) | Not planned |
| (?a), (?l), (?u), (?d) Perl inline flags | Not planned |
How this constrains Batch 3 (RegExCompatTest) fixes
All Batch 3 fixes target JDK default semantics. Their relationship to future RE2_COMPAT implementation:
| Group | Fix | RE2_COMPAT impact |
|---|---|---|
| A — Case-insensitive backrefs | regionMatches(ignoreCase=true) in BBE | Backrefs are PSE in RE2; fix lives only inside BBE, which RE2 never reaches |
| B — matches() full-match backtrack | Accept only succeeds when pos == requiredEnd | RE2’s leftmost-longest naturally produces full-match semantics; gate fix behind !re2Compat |
| C — LITERAL + DOTALL | Fix parser LITERAL guard | LITERAL is PSE in RE2; no impact |
| D — Dot excludes \u2028/\u2029 | Fix char class ranges in Parser.java | RE2 overrides to \n-only; make dot construction flag-aware so RE2 path uses a single exclusion |
| E — \b + NON_SPACING_MARK | Add Unicode category to isWordChar | RE2 uses ASCII-only; make isWordChar flag-aware |
| F — \p{javaXxx} | Add 10 properties to UnicodeProperties | javaXxx is JDK-specific; RE2 would throw PSE at parse time |
| G — Special Unicode folds | Extend case-fold table | UNICODE_CASE is PSE in RE2; no impact |
| H — Variable-length lookbehind | min/maxBodyLength in LookbehindPos/LookbehindNeg | All lookbehind is PSE in RE2; guard in parser |
For D and E, implement the JDK behavior using the flags already available in CompileResult / PatternFlag set rather than hardcoding. This makes the future RE2 branch a clean if/else rather than a structural change.
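One way the flag-aware shape for Groups D and E could look is sketched below. This is not Orbit’s code: the Flag enum is a hypothetical stand-in for PatternFlag, the method signatures are invented for illustration, and DOTALL handling is omitted. The rules encoded are the ones stated in the tables of this document.

```java
import java.util.EnumSet;
import java.util.Set;

final class FlagAwareClasses {
    /** Hypothetical stand-in for Orbit's PatternFlag; only the names come from this document. */
    enum Flag { RE2_COMPAT, UNICODE, UNIX_LINES, PERL_NEWLINES }

    /** Group D: true when dot must not match the code point (DOTALL not modelled). */
    static boolean dotExcludes(int cp, Set<Flag> flags) {
        if (cp == '\n') {
            return true; // LF is excluded in every mode
        }
        boolean newlineOnly = flags.contains(Flag.RE2_COMPAT)
                || flags.contains(Flag.UNIX_LINES)
                || flags.contains(Flag.PERL_NEWLINES);
        if (newlineOnly) {
            return false; // these modes exclude LF only
        }
        // JDK default: CR, NEL, LINE SEPARATOR, PARAGRAPH SEPARATOR
        return cp == '\r' || cp == 0x85 || cp == 0x2028 || cp == 0x2029;
    }

    /** Group E: word-character test used by the word-boundary assertion. */
    static boolean isWordChar(int cp, Set<Flag> flags) {
        boolean ascii = (cp >= 'a' && cp <= 'z') || (cp >= 'A' && cp <= 'Z')
                || (cp >= '0' && cp <= '9') || cp == '_';
        if (flags.contains(Flag.RE2_COMPAT)) {
            return ascii; // RE2: ASCII only
        }
        if (flags.contains(Flag.UNICODE)) {
            return cp == '_' || Character.isLetterOrDigit(cp);
        }
        // JDK default: ASCII word characters plus NON_SPACING_MARK
        return ascii || Character.getType(cp) == Character.NON_SPACING_MARK;
    }

    public static void main(String[] args) {
        Set<Flag> none = EnumSet.noneOf(Flag.class);
        Set<Flag> re2 = EnumSet.of(Flag.RE2_COMPAT);
        System.out.println(dotExcludes('\r', none) + " " + dotExcludes('\r', re2));
        System.out.println(isWordChar(0x0301, none) + " " + isWordChar(0x0301, re2));
    }
}
```

Keeping each decision in a single predicate that reads the flag set means an additional mode only adds a branch here, rather than a parallel code path in the parser.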
Behavior that must not regress
Run this before and after every Batch 3 change:
mvn test -pl orbit-core -Dtest="DotNetCompatTest,BasicRegexCompatTest,RegExCompatTest,TransducerTest"
| Feature | Test class | Why it could break |
|---|---|---|
| Transducer API (applyUp/Down/tokenize/invert/compose) | TransducerTest | Architecturally separate but shares Prog/CompileResult |
| Atomic groups (?>...) | DotNetCompatTest | BBE changes from Group B could affect atomic commit |
| Balancing groups | DotNetCompatTest | Same |
| Conditional subpatterns | DotNetCompatTest | Same |
| Possessive quantifiers | DotNetCompatTest, BasicRegexCompatTest | BBE stack changes |
| ProgOptimiser EpsilonJump folding | all suites | Performance pass; must not alter match semantics |
| ReDoS budget (MatchTimeoutException) | DotNetCompatTest | Budget check must survive any BBE changes |
| \G anchor | RegExCompatTest | lastMatchEnd wiring must survive matches() changes (Group B) |
Document maintenance
Update this file when:
- A new flag is added to PatternFlag or TransducerFlag
- A behavior intentionally diverges from JDK
- RE2_COMPAT implementation begins (move “intended” to “implemented”)
- A new Orbit extension is added
Do not add Perl-specific behaviors to the JDK default layer.