feat(latex): add sourceOffsets to parser-generated Error nodes #312

Open
zojize wants to merge 1 commit into cortex-js:main from zojize:codex/latex-error-source-offsets

zojize commented Jun 26, 2026

closes #311

AI usage disclosure: majority of implementation, human reviewed

Summary

The LaTeX parser reports parse errors as MathJSON Error nodes that say what went wrong but not where in the input. This PR populates the existing MathJsonAttributes.sourceOffsets field on parser-generated Error nodes with the character range of the offending LaTeX, so consumers can map a parse error back to its source span — e.g. to highlight the exact invalid token inline in a MathfieldElement.

// LatexSyntax().parse('\\foo')
{
  "fn": ["Error", { "str": "unexpected-command" }, ["LatexString", { "str": "\\foo" }]],
  "sourceOffsets": [0, 4]
}

Live demo: https://zojize.github.io/compute-engine/ — type an invalid expression and the offending token is colored; empty \sqrt/\frac slots get a red placeholder box. (Closes #.)
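
For consumers, the simplest use of the range is to slice the input string. The following is a minimal sketch, assuming `sourceOffsets` is a `[start, end]` pair as described in this PR; `markSpan` and the `[[`/`]]` markers are hypothetical names for illustration, not part of the library.

```typescript
// Hypothetical consumer-side helper (not part of the library).
// Wraps the span identified by `sourceOffsets` in visible markers.
type SourceOffsets = [start: number, end: number];

function markSpan(
  latex: string,
  [start, end]: SourceOffsets,
  open = '[[',
  close = ']]'
): string {
  // Offsets are zero-based and end-exclusive, so they map directly
  // onto String.prototype.slice().
  return (
    latex.slice(0, start) + open + latex.slice(start, end) + close + latex.slice(end)
  );
}

// The unknown command from the example above: the whole input is marked.
const marked = markSpan('\\foo', [0, 4]);

// A zero-width range marks an insertion point rather than a token.
// The position 6 here is chosen for illustration only.
const markedEmpty = markSpan('\\sqrt{}', [6, 6]);
```

Because the range is end-exclusive, a zero-width range needs no special case: both slices meet at the same index and the markers land at the insertion point.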

What's exposed

  • sourceOffsets on parser Error nodes — zero-based, end-exclusive character offsets into the serialized LaTeX (tokensToString). For input that round-trips through the tokenizer unchanged (no comments, Unicode normalization, or macro expansion — i.e. editor-generated LaTeX), they equal offsets into the original input string.
  • Missing-token errors (empty \sqrt{}, empty \frac{}{}) use a zero-width range at the parser position.
  • Parser.sourceOffsets(startToken, endToken?) — public helper converting a token range to that character range, so custom dictionary entries can attach offsets to errors they raise. Parser.error() uses it internally.
  • Boxed path — offsets propagate through boxing/serialization, so they survive ce.parse(latex).toMathJson() (selectable via the metadata option, included with metadata: 'all').
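
Since offsets survive `ce.parse(latex).toMathJson()`, a consumer can walk the resulting MathJSON and gather every positioned error. This is a sketch under the assumption that positioned errors appear in the object form `{ fn: ["Error", ...], sourceOffsets: [start, end] }` shown above; `collectErrorRanges` and `ErrorRange` are hypothetical names, and the input tree is hand-written rather than produced by the parser.

```typescript
// Plain JSON value type, used here as a stand-in for MathJSON expressions.
type Json = null | boolean | number | string | Json[] | { [key: string]: Json };

interface ErrorRange {
  code: string;
  start: number;
  end: number;
}

// Extracts the error code from the first operand, e.g. { str: "unexpected-command" }.
function errorCode(node: Json | undefined): string {
  if (node !== null && typeof node === 'object' && !Array.isArray(node)) {
    const str = node['str'];
    if (typeof str === 'string') return str;
  }
  return 'unknown';
}

// Hypothetical helper: depth-first walk collecting positioned Error nodes.
function collectErrorRanges(expr: Json, out: ErrorRange[] = []): ErrorRange[] {
  if (Array.isArray(expr)) {
    for (const item of expr) collectErrorRanges(item, out);
    return out;
  }
  if (expr === null || typeof expr !== 'object') return out;

  const fn = expr['fn'];
  const offsets = expr['sourceOffsets'];
  if (Array.isArray(fn)) {
    if (fn[0] === 'Error' && Array.isArray(offsets)) {
      const start = offsets[0];
      const end = offsets[1];
      if (typeof start === 'number' && typeof end === 'number') {
        out.push({ code: errorCode(fn[1]), start, end });
      }
    }
    // Operands may themselves contain nested errors.
    collectErrorRanges(fn, out);
  }
  return out;
}

// Hand-written tree for illustration; not actual parser output.
const tree: Json = [
  'Add',
  1,
  {
    fn: ['Error', { str: 'unexpected-command' }, ['LatexString', { str: '\\foo' }]],
    sourceOffsets: [2, 6],
  },
];
const ranges = collectErrorRanges(tree);
```

Errors in the bare array form carry no range, so this walk skips them by design.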

Implementation

Deliberately minimal — tokenizer.ts is not touched. Offsets are derived from the existing tokensToString via prefix length, so no per-token source map is threaded through the tokenizer.
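
The prefix-length idea can be illustrated with a toy model. This is only a sketch: the `tokensToString` below simply concatenates tokens, which is an assumption made for illustration (the real serializer may behave differently), and the default of `endToken = startToken` is likewise a choice of this sketch rather than a statement about `Parser.sourceOffsets`.

```typescript
// Toy token model: each token is its own LaTeX source text.
type Token = string;

// Stand-in for the library's tokensToString. Assumption: plain concatenation.
function tokensToString(tokens: Token[]): string {
  return tokens.join('');
}

// Character range covered by tokens[startToken .. endToken), computed from
// the length of the serialized prefix. No per-token source map is needed.
function sourceOffsets(
  tokens: Token[],
  startToken: number,
  endToken: number = startToken
): [number, number] {
  const start = tokensToString(tokens.slice(0, startToken)).length;
  const end = tokensToString(tokens.slice(0, endToken)).length;
  return [start, end];
}

const tokens: Token[] = ['1', '+', '\\foo'];

// Range of the third token: the prefix "1+" has length 2 and the
// prefix "1+\foo" has length 6.
const fooRange = sourceOffsets(tokens, 2, 3);

// Omitting endToken yields a zero-width range at the token boundary.
const zeroWidth = sourceOffsets(tokens, 2);
```

The trade-off is that each lookup re-serializes a prefix, which is acceptable when offsets are only computed on the error path.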

| Area | Change |
| --- | --- |
| `latex-syntax/parse.ts` | `Parser.sourceOffsets()` (= `latex(0, token).length`); `error()` returns `{ fn, sourceOffsets }` |
| `latex-syntax/dictionary/definitions-arithmetic.ts` | empty `\sqrt{}` / `\frac{}{}` operands emit a positioned `missing` error instead of the position-less `MISSING` sentinel |
| `latex-syntax/types.ts`, `math-json/types.ts` | `Parser.sourceOffsets` signature + documented `sourceOffsets` semantics |
| `boxed-expression` (`box`, `serialize`, `boxed-function`, `abstract-boxed-expression`) + `math-json/utils.ts`, `types-expression.ts`, `types-kernel-serialization.ts` | propagate `sourceOffsets` through `ce.parse().toMathJson()` |

~150 lines of source + test-snapshot updates.

⚠️ Behavior change (please review)

Parser Error expressions now use the object form { fn: ["Error", …], sourceOffsets } instead of the bare array ["Error", …] whenever a range is available. Both are valid MathJSON, but a consumer matching Array.isArray(expr) && expr[0] === "Error" must also handle expr.fn?.[0] === "Error". This is the bulk of the test-snapshot churn in this PR. If you'd prefer this gated behind a parse option rather than on by default, happy to adjust.
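
A consumer that needs to accept both forms could normalize them in one place. A minimal sketch, assuming only the two shapes described above; `errorOperands` and `isParseError` are hypothetical helper names, not library exports.

```typescript
// Hypothetical helper: returns the ["Error", ...] array for either form,
// or null when `expr` is not an Error expression.
function errorOperands(expr: unknown): unknown[] | null {
  // Bare array form: ["Error", ...]
  if (Array.isArray(expr)) {
    return expr[0] === 'Error' ? expr : null;
  }
  // Object form: { fn: ["Error", ...], sourceOffsets?: [start, end] }
  if (typeof expr === 'object' && expr !== null && 'fn' in expr) {
    const fn = (expr as { fn: unknown }).fn;
    if (Array.isArray(fn) && fn[0] === 'Error') return fn;
  }
  return null;
}

function isParseError(expr: unknown): boolean {
  return errorOperands(expr) !== null;
}

// Both shapes are recognized.
const bare = ['Error', { str: 'unexpected-command' }];
const positioned = { fn: ['Error', { str: 'unexpected-command' }], sourceOffsets: [0, 4] };
const bothDetected = isParseError(bare) && isParseError(positioned);
```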

Limitation (documented on the field)

Offsets are measured on tokensToString(tokenize(input)), so they diverge from offsets into the original string only when that round-trip is lossy: comments, Unicode normalization, macro expansion, or multi-codepoint ZWJ emoji. In those cases the range stays valid and monotonic but no longer lines up character-for-character with the original. Editor-generated LaTeX never hits those paths, so it's exact in practice. A heavier implementation could thread true source spans through the tokenizer if exact ranges for arbitrary pasted input are ever needed.
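
To make the drift concrete, here is a toy illustration. It assumes, purely for the sake of the example, that the lossy step is dropping `%` comments; `stripComments` is a hypothetical stand-in and not the library's tokenizer (it does not handle an escaped `\\` before `%`, for instance).

```typescript
// Toy stand-in for a lossy round-trip: drops LaTeX comments, i.e. an
// unescaped % through the end of the line. Illustration only.
function stripComments(latex: string): string {
  return latex.replace(/(?<!\\)%[^\n]*\n?/g, '');
}

// Finds the [start, end) range of `needle` in `haystack`.
function rangeOf(haystack: string, needle: string): [number, number] {
  const start = haystack.indexOf(needle);
  return [start, start + needle.length];
}

const original = 'a%c\n+\\foo';
const serialized = stripComments(original);

// Where \foo sits after the lossy step versus in the original input.
const inSerialized = rangeOf(serialized, '\\foo');
const inOriginal = rangeOf(original, '\\foo');

// A consumer could treat offsets as exact only when nothing was lost.
const isExact = serialized === original;
```

In this toy case the serialized range starts at 2 while the token starts at 5 in the original, so a highlight applied to the original string would land on the wrong characters.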

Testing

  • latex-syntax/* and incomplete-expressions suites pass; snapshots updated for the object-form errors. New assertions in errors.test.ts cover unknown command (\foo → [0, 4]), nested error through ce.parse().toMathJson(), unexpected delimiter, and zero-width missing operands.
  • Full suite: 10,496 passing. The only failures are two pre-existing, environment-dependent flakes unrelated to this change — arithmetic.test.ts (last-digit machine-float drift) and compile-performance.test.ts (a ~1MB memory-threshold assertion).
  • tsc --noEmit clean.

Notes

  • src/api.md is not regenerated here — running typedoc/concat-md locally reformats the whole file (toolchain/version drift, ~3k lines of noise). The API documentation lives in the source doc-comments included in this PR; happy to regenerate api.md if you'd like it in the same PR or prefer to do it at release time.
  • I can add a CHANGELOG.md entry under ## [Unreleased] → ### New Features if that's the expected flow — let me know.

Attach a `sourceOffsets: [start, end]` character range to the Error
expressions produced by the LaTeX parser, identifying where in the input
each parse error occurred. Consumers can map a parse error back to the
exact span of source LaTeX (e.g. to highlight an offending token in a
Mathfield).

Offsets are zero-based, end-exclusive character offsets into the
serialized LaTeX (`tokensToString`). For input that round-trips through
the tokenizer unchanged (no comments, Unicode normalization, or macro
expansion -- e.g. editor-generated LaTeX), they match the original input
string. The tokenizer is not modified: offsets are derived from the
existing `tokensToString` via prefix length.

- parse.ts: `Parser.sourceOffsets(startToken, endToken)`; `error()` returns
  `{ fn, sourceOffsets }`. Missing-token errors use a zero-width range.
- definitions-arithmetic.ts: empty `\sqrt{}` and `\frac{}{}` operands emit a
  positioned `missing` error instead of the position-less MISSING sentinel.
- types: `Parser.sourceOffsets` signature + documented `sourceOffsets`
  semantics on `MathJsonAttributes`.
- boxed-expression: propagate `sourceOffsets` through
  `ce.parse().toMathJson()`.

Parser `Error` expressions now use the object form
`{ fn: ["Error", ...], sourceOffsets }` instead of the bare array when a
range is available; both are valid MathJSON. Test snapshots updated.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

Development

Successfully merging this pull request may close these issues.

[FEATURE] Source offsets on LaTeX parser Error nodes

1 participant