
/// heregex, even dynamic #93

Merged
edemaine merged 9 commits into main from heregex
May 26, 2026

Conversation

Collaborator

@edemaine edemaine commented May 22, 2026

Fixes #87.

  • Adds Civet-style ///.../// heregex-style regex literals with whitespace and // comments ignored.
  • Supports ${...} dynamic interpolation in heregexes. These can change every time the regex attempts to match!
  • Adds ${const ...} interpolation for regex parts that should be captured once, instead of updated dynamically.
  • Optimizes generated dynamic regex parsers by caching source/parser state on $RDi.
  • Emits all-const heregexes as once-computed regex parsers.
  • Sample grammar coverage in samples/heregex.hera.
  • /.../ now compiles to JS literal /.../.
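The whitespace and comment handling described in the first bullet can be illustrated with a small sketch. This is not Hera's implementation (which is expressed as grammar rules in source/hera.hera); `stripHeregex` is a hypothetical helper, and it assumes that every backslash escape is kept verbatim and that `//` always starts a comment running to the end of the line.

```typescript
// Hypothetical helper, for illustration only: reduce a heregex body to
// ordinary regex source by dropping whitespace and // line comments.
function stripHeregex(body: string): string {
  let out = "";
  let i = 0;
  while (i < body.length) {
    const ch = body.charAt(i);
    if (ch === "\\" && i + 1 < body.length) {
      // Keep escape sequences intact, so an escaped space survives.
      out += ch + body.charAt(i + 1);
      i += 2;
    } else if (ch === "/" && body.charAt(i + 1) === "/") {
      // Line comment: skip everything up to (not including) the newline.
      while (i < body.length && body.charAt(i) !== "\n") i++;
    } else if (/\s/.test(ch)) {
      // Unescaped whitespace is ignored.
      i++;
    } else {
      out += ch;
      i++;
    }
  }
  return out;
}

const body = `
  \\d+   // integer part
  \\.    // decimal point
  \\d*   // optional fraction
`;
const source = stripHeregex(body); // \d+\.\d*
```

The resulting string can be handed to `new RegExp(source)` as usual.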

codecov Bot commented May 22, 2026

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 100.00%. Comparing base (df27194) to head (3b44e7e).
⚠️ Report is 10 commits behind head on main.

Additional details and impacted files
@@            Coverage Diff             @@
##              main       #93    +/-   ##
==========================================
  Coverage   100.00%   100.00%            
==========================================
  Files            9         9            
  Lines         1622      1783   +161     
  Branches       261       300    +39     
==========================================
+ Hits          1622      1783   +161     

Contributor

greptile-apps Bot commented May 22, 2026

Greptile Summary

This PR adds Civet-style ///.../// heregex literals to Hera, supporting whitespace-insensitive regex bodies with // line comments, ${...} dynamic interpolation, and ${const ...} one-time-captured interpolation. The implementation is well-structured: all-const heregexes compile to a self-overwriting lazy initialiser, mixed dynamic regexes cache their source string and parser object on $RD${i} to avoid redundant RegExp construction, and deduplication of identical dynamic patterns is handled at compile time via JSON serialisation comparison.

  • source/hera.hera: adds HeregexBody / HeregexPart / HeregexSubstitutionContent grammar rules; HeregexSubstitutionContent matches balanced {}, double-quoted, and single-quoted strings so that ${expr} interpolations survive nested braces and string literals.
  • source/compiler.civet: introduces defineDynamicRe, isIdentifierExpression, regExpSourceToExpression, and the "RD" codegen path; updates compileRuleBodyInline to extend regex-style handler params ($0…$9) to "RD" rules via rule[0].startsWith("R").
  • source/util.civet / test/: decompile, EBNF conversion, and tests for the new "RD" AST node type.
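The compile-time deduplication mentioned in the summary (comparing JSON serialisations of the part lists) can be sketched roughly as follows. The `RegExpPart` shape, the `internDynamicRe` name, and the table layout are assumptions made for illustration; the real defineDynamicRe in source/compiler.civet may differ.

```typescript
// Hypothetical part type: a literal string, or an interpolated expression.
type RegExpPart = string | { expr: string; isConst?: boolean };

// Serialised part lists seen so far; the array index doubles as the id.
const seenDynamicRes: string[] = [];

// Return the id for a list of parts, reusing the id of an identical
// earlier list so that only one parser is emitted per distinct pattern.
function internDynamicRe(parts: RegExpPart[]): number {
  const key = JSON.stringify(parts);
  let id = seenDynamicRes.indexOf(key);
  if (id < 0) {
    id = seenDynamicRes.length;
    seenDynamicRes.push(key);
  }
  return id;
}

const first = internDynamicRe(["a", { expr: "x" }]);
const again = internDynamicRe(["a", { expr: "x" }]); // same id as `first`
const other = internDynamicRe(["b", { expr: "x" }]); // a new id
```

Comparing serialised strings keeps the check structural, so two separately parsed but identical heregexes share one id.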

Confidence Score: 5/5

Safe to merge; all new codegen paths are well-tested and the dynamic regex caching logic is correct.

The all-const self-overwriting closure, the mixed-path ??= String() memoisation, and the regExpSourceToExpression index alignment between constInits and the template literal are all correct. Tests cover static, dynamic, const-only, mixed, typed, unicode-identifier, and deduplication paths. The observations raised are decompile fidelity edge cases and a documentation gap, none of which affect runtime parser correctness.
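For readers unfamiliar with the self-overwriting pattern named above, here is a minimal sketch of the idea. It is not the code Hera generates; `matchOnce`, `prefix`, and the counter are hypothetical stand-ins for a generated parser and a `${const ...}` expression.

```typescript
type Matcher = (input: string) => RegExpExecArray | null;

// Counts how often the "const" expression is evaluated (illustration only).
let evaluations = 0;
const prefix = (): string => {
  evaluations++;
  return "ab";
};

// On the first call, build the regex from the const expression, then
// replace this function with one that reuses the compiled regex.
let matchOnce: Matcher = (input) => {
  const re = new RegExp(`${prefix()}\\d+`);
  matchOnce = (s) => re.exec(s);
  return matchOnce(input);
};

const m1 = matchOnce("ab12");
const m2 = matchOnce("ab7x");
```

After the first call, later calls skip both the expression and the RegExp construction.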

source/util.civet — the "RD" decompile path silently drops escaped spaces; worth a follow-up if decompile round-trip fidelity matters for tooling built on top of Hera.

Important Files Changed

| Filename | Overview |
| --- | --- |
| source/compiler.civet | Core codegen for dynamic regexes; all-const self-overwriting pattern, mixed cache with ??= String() fix, and regExpSourceToExpression look correct. Minor: escaped spaces in static "RD" parts are output verbatim in toS and lost on reparse. |
| source/hera.hera | Adds HeregexBody/HeregexPart/HeregexSubstitutionContent grammar rules with balanced-brace support; heregexEscapes map handles \n, \r, and escaped space correctly. |
| source/util.civet | Adds "RD" case to toS and ruleToEBNF; string parts output verbatim which is semantically correct for round-trip but strips escaped-space information in decompile. |
| test/main.civet | Good coverage: static, dynamic, const-only, mixed-const, deduplication, typed, and unicode identifier paths are all tested. |
| test/util.civet | Adds decompile round-trip tests for "RD" nodes; tests verify compact form output and EBNF representation correctly. |
| samples/heregex.hera | Clear sample demonstrating const vs dynamic substitution and mutation of parser state across rules. |
| source/hera-types.civet | Adds RegExpExpression and RegExpPart types, plus Literal union including ["RD", RegExpPart[]]; types are precise and well-structured. |

Flowchart

%%{init: {'theme': 'neutral'}}%%
flowchart TD
    A["HeregexBody parsed\n(RegExpPart[])"] --> B{All parts\nare strings?}
    B -- Yes --> C["Return ['R', body.join('')]\n(static regex, compiled once)"]
    B -- No --> D["Return ['RD', parts]\n(dynamic regex)"]

    D --> E["compiler: defineDynamicRe(parts)\n→ $RDi id (deduped by JSON.stringify)"]
    E --> F{All parts\nare const/string?}

    F -- Yes --> G["All-const path\nlet $RDi = self-overwriting closure\nEvaluates once, replaces itself with $R(...)"]
    F -- No --> H["Mixed path\nconst $RDi = closure\n$RDi.constN ??= String(expr)\n$RDi.source cached, rebuilt on change"]

    G --> I["$EXPECT($RDi, errorFn)"]
    H --> I

    I --> J{defaultHandler?}
    J -- Yes --> K["$R$0($EXPECT($RDi, …))\nextracts match[0]"]
    J -- No --> L["$EXPECT($RDi, …)\nfull RegExpMatchArray\nto handler $0…$9"]
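The mixed path in the flowchart (const parts captured once with `??=`, source string cached and the regex rebuilt only when it changes) might look roughly like this. All names here (`state`, `digits`, `dynamicRe`, `compiles`) are hypothetical, and the sketch assumes interpolated values are inserted as raw regex source without escaping.

```typescript
// Hypothetical parser state that a grammar handler could mutate.
const state = { delimiter: "," };

// Stand-in for a ${const ...} expression; counts its evaluations.
let constEvaluations = 0;
const digits = (): string => {
  constEvaluations++;
  return "\\d+";
};

interface DynamicRe {
  (input: string): RegExpExecArray | null;
  const0?: string;  // const part, captured on first use
  source?: string;  // last source string built
  re?: RegExp;      // regex compiled from that source
  compiles: number; // number of RegExp constructions (illustration only)
}

const dynamicRe: DynamicRe = Object.assign(
  (input: string): RegExpExecArray | null => {
    dynamicRe.const0 ??= String(digits());
    // The dynamic part is re-read on every match attempt.
    const source = `${dynamicRe.const0}${state.delimiter}`;
    if (source !== dynamicRe.source) {
      dynamicRe.source = source;
      dynamicRe.re = new RegExp(source);
      dynamicRe.compiles++;
    }
    return dynamicRe.re!.exec(input);
  },
  { compiles: 0 }
);

const a = dynamicRe("12,");
const b = dynamicRe("3,4");
const compilesBefore = dynamicRe.compiles; // still one construction
state.delimiter = ";";
const c = dynamicRe("5;"); // source changed, so the regex is rebuilt
```

Caching on the source string means a changed interpolation costs one string comparison plus one RegExp construction, and an unchanged one costs only the comparison.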

Reviews (2): Last reviewed commit: "Forbid empty heregex"

Comment thread source/hera.hera Outdated
Comment on lines +189 to +191
HeregexSubstitutionCharacter
[^}\\]+
EscapeSequence
Contributor

P2 Silent truncation on nested {} in interpolation expressions

HeregexSubstitutionCharacter is [^}\\]+, so any unescaped } inside an interpolation expression terminates parsing of the expression. A user writing /// ${getPattern({key: val})} /// would silently have the expression truncated to getPattern({key: val, then the remainder would be parsed as heregex body content — potentially producing incorrect regex without an error. This is a sharp edge worth documenting explicitly (or enforcing with a compile-time error), since the failure is silent and the result may still be a valid regex with wrong semantics.

Collaborator Author

I added basic brace matching, skipping over single and double quoted strings. It's probably not perfect (unbalanced braces in comments or template strings would fail) but it's simple, language-agnostic, and hopefully enough.
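As a rough illustration of that kind of brace matching, the following sketch finds the `}` that closes a `${...}` substitution while skipping single- and double-quoted strings. Hera does this with grammar rules rather than a hand-written loop; `findSubstitutionEnd` is a hypothetical helper and shares the stated limitation that comments and template strings are not understood.

```typescript
// Hypothetical helper. `open` is the index of the "{" in "${".
// Returns the index of the matching "}", or -1 if there is none.
function findSubstitutionEnd(src: string, open: number): number {
  let depth = 0;
  let i = open;
  while (i < src.length) {
    const ch = src.charAt(i);
    if (ch === '"' || ch === "'") {
      // Skip a quoted string, stepping over backslash escapes.
      i++;
      while (i < src.length && src.charAt(i) !== ch) {
        if (src.charAt(i) === "\\") i++;
        i++;
      }
    } else if (ch === "{") {
      depth++;
    } else if (ch === "}") {
      depth--;
      if (depth === 0) return i;
    }
    i++;
  }
  return -1;
}

const nested = "${getPattern({key: val})} rest";
const nestedEnd = findSubstitutionEnd(nested, 1);
const nestedExpr = nested.slice(2, nestedEnd); // getPattern({key: val})

const quoted = '${f("}")} rest';
const quotedEnd = findSubstitutionEnd(quoted, 1);
```

With this approach the `getPattern({key: val})` example from the review comment is captured whole instead of being cut at the first `}`.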

Comment thread source/compiler.civet Outdated
Comment thread source/hera.hera Outdated
Comment on lines +176 to +177
HeregexBody
!TripleSlash HeregexPart* -> $2
Contributor

P2 Empty heregex /// /// (only whitespace) works but ////// silently falls through

The !TripleSlash negative lookahead at the top of HeregexBody means the body is required to begin with something other than ///. Without any whitespace the body immediately sees ///, !TripleSlash fails, and HeregexBody does not match, causing the rule to fall through to "/" !Space $RegExpCharacter* "/" instead of being recognised as an empty heregex. In practice /// /// (with a space) does produce ["R", ""] correctly, so the restriction only surfaces for the degenerate ////// form. Still, the !TripleSlash guard appears redundant: HeregexPart* naturally stops when it reaches /// because no alternative can consume it, so the guard could be removed without changing semantics for all non-degenerate cases.

Collaborator Author

This matches how Civet does things. To be honest, I'm not sure why it's like this, and not just a + on the Parts...

Collaborator

Could probably improve it in Civet since it currently fails to parse.

Collaborator Author

@edemaine edemaine May 23, 2026

I realized why Civet forbids //////: it would transpile to // which is a comment, not a regex. So I've reproduced that behavior here. Empty regexes aren't useful in Hera (I don't think...). Hmm, but // does work in Hera (always matches)... Should I forbid both or allow both? Perhaps // should be reserved for comments?

That said, Civet has some broken edge cases: ///// opens heregex and then is a comment, which can build //. Also /// /// compiles to // which is bad (invalid JS). Hera doesn't have this issue because it uses new RegExp.

Collaborator

CoffeeScript compiles to /(?:)/ which we could match.

Collaborator Author

I enabled ////// given that // works; both produce empty regular expressions.

I also changed Hera to use JS regexp literals, including /(?:)/ for //. This makes for cleaner output IMO, though the size reduction is minimal: Civet's main.mjs goes from 1289242 to 1286867 bytes, a reduction of 2375.
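The empty-regex case discussed in this thread can be checked directly in JavaScript: an empty pattern has the source `(?:)`, which is why a literal form avoids emitting `//` (a comment). The sketch below is one way to obtain a literal from a source string, not necessarily how Hera emits its literals; `toRegExpLiteral` is a hypothetical name.

```typescript
// Hypothetical helper: render regex source as a JS regex literal by
// letting the engine produce the literal form itself.
function toRegExpLiteral(source: string): string {
  return new RegExp(source).toString();
}

const emptyLiteral = toRegExpLiteral("");      // "/(?:)/" rather than "//"
const digitsLiteral = toRegExpLiteral("\\d+"); // "/\d+/"

// An empty regex always matches, with an empty match at the start.
const emptyMatch = new RegExp("").exec("anything");
```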

@edemaine
Collaborator Author

@greptileai revise your review

@edemaine edemaine merged commit 317f07d into main May 26, 2026
2 checks passed
@edemaine edemaine deleted the heregex branch May 26, 2026 20:58