feat(latex): add sourceOffsets to parser-generated Error nodes by zojize · Pull Request #312 · cortex-js/compute-engine

zojize · 2026-06-26T16:55:36Z

closes #311

AI usage disclosure: majority of implementation, human reviewed

Summary

The LaTeX parser reports parse errors as MathJSON Error nodes that say what went wrong but not where in the input. This PR populates the existing MathJsonAttributes.sourceOffsets field on parser-generated Error nodes with the character range of the offending LaTeX, so consumers can map a parse error back to its source span — e.g. to highlight the exact invalid token inline in a MathfieldElement.

// LatexSyntax().parse('\\foo')
{
  "fn": ["Error", { "str": "unexpected-command" }, ["LatexString", { "str": "\\foo" }]],
  "sourceOffsets": [0, 4]
}

Live demo: https://zojize.github.io/compute-engine/ — type an invalid expression and the offending token is colored; empty \sqrt/\frac slots get a red placeholder box. (Closes #.)

What's exposed

- sourceOffsets on parser Error nodes: zero-based, end-exclusive character offsets into the serialized LaTeX (tokensToString). For input that round-trips through the tokenizer unchanged (no comments, Unicode normalization, or macro expansion, i.e. editor-generated LaTeX), they equal offsets into the original input string.
- Missing-token errors (empty \sqrt{}, empty \frac{}{}) use a zero-width range at the parser position.
- Parser.sourceOffsets(startToken, endToken?): public helper converting a token range to that character range, so custom dictionary entries can attach offsets to errors they raise. Parser.error() uses it internally.
- Boxed path: offsets propagate through boxing/serialization, so they survive ce.parse(latex).toMathJson() (selectable via the metadata option, included with metadata: 'all').
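As a rough illustration of how a consumer might use these ranges, the sketch below walks a MathJSON tree and collects the span of every Error node that carries sourceOffsets. The `Expr` type, `ErrorSpan`, and `collectErrorSpans` are hypothetical names defined locally for this example; they are not part of the library, and the tree shape is assumed from the sample output in the Summary.

```typescript
// Minimal local MathJSON shapes for this sketch (assumption: not the library's types).
type Expr =
  | number
  | string
  | Expr[]
  | { fn: Expr[]; sourceOffsets?: [number, number] }
  | { str: string }
  | { num: string }
  | { sym: string };

interface ErrorSpan {
  code: string; // e.g. "unexpected-command"
  start: number; // zero-based, inclusive
  end: number; // end-exclusive
}

// Read a string out of either a bare string or a { str } object.
function stringValue(e: Expr | undefined): string | undefined {
  if (typeof e === 'string') return e;
  if (typeof e === 'object' && !Array.isArray(e) && 'str' in e) return e.str;
  return undefined;
}

// Depth-first walk that records every Error node with a source range.
function collectErrorSpans(expr: Expr, out: ErrorSpan[] = []): ErrorSpan[] {
  if (typeof expr !== 'object') return out;

  let ops: Expr[];
  let offsets: [number, number] | undefined;
  if (Array.isArray(expr)) {
    ops = expr; // bare array form carries no offsets
  } else if ('fn' in expr) {
    ops = expr.fn; // object form may carry sourceOffsets
    offsets = expr.sourceOffsets;
  } else {
    return out; // { str }, { num }, { sym } leaves
  }

  if (ops[0] === 'Error' && offsets) {
    out.push({
      code: stringValue(ops[1]) ?? 'unknown',
      start: offsets[0],
      end: offsets[1],
    });
  }
  for (const op of ops.slice(1)) collectErrorSpans(op, out);
  return out;
}

// Usage with the sample node from the Summary.
const latex = '\\foo';
const parsed: Expr = {
  fn: ['Error', { str: 'unexpected-command' }, ['LatexString', { str: '\\foo' }]],
  sourceOffsets: [0, 4],
};
const spans = collectErrorSpans(parsed);
for (const s of spans) {
  console.log(s.code, JSON.stringify(latex.slice(s.start, s.end)));
}
```

A zero-width span (start equal to end) marks an insertion point rather than a token, so a highlighter would likely want to render it as a caret or placeholder instead of a colored range.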

Implementation

Deliberately minimal — tokenizer.ts is not touched. Offsets are derived from the existing tokensToString via prefix length, so no per-token source map is threaded through the tokenizer.
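The prefix-length idea can be sketched as follows. This is an illustration under stated assumptions, not the code in parse.ts: `joinTokens` is a stand-in that simply concatenates tokens (the real tokensToString may behave differently), `offsetsFromTokenRange` is a hypothetical name, and treating endToken as exclusive with a zero-width default is an assumption.

```typescript
// Stand-in serializer (assumption): plain concatenation of token strings.
function joinTokens(tokens: string[]): string {
  return tokens.join('');
}

// Character range of tokens[startToken..endToken) in the serialized string.
// Each bound is the length of the serialized prefix that precedes it.
function offsetsFromTokenRange(
  tokens: string[],
  startToken: number,
  endToken: number = startToken // default: zero-width range at startToken
): [number, number] {
  const start = joinTokens(tokens.slice(0, startToken)).length;
  const end = joinTokens(tokens.slice(0, endToken)).length;
  return [start, end];
}

// The unknown command is the third token of "1+\foo".
const range = offsetsFromTokenRange(['1', '+', '\\foo'], 2, 3);
console.log(range); // [2, 6]

// A missing operand inside "\sqrt{}" gets a zero-width range after the "{".
const missing = offsetsFromTokenRange(['\\sqrt', '{', '}'], 2);
console.log(missing); // [6, 6]
```

The appeal of this approach is that it needs no per-token bookkeeping; the cost is that each lookup re-serializes a prefix, which only matters on error paths.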

| Area | Change |
| --- | --- |
| `latex-syntax/parse.ts` | `Parser.sourceOffsets()` (`= latex(0, token).length`); `error()` returns `{ fn, sourceOffsets }` |
| `latex-syntax/dictionary/definitions-arithmetic.ts` | empty `\sqrt{}` / `\frac{}{}` operands emit a positioned `missing` error instead of the position-less `MISSING` sentinel |
| `latex-syntax/types.ts`, `math-json/types.ts` | `Parser.sourceOffsets` signature + documented `sourceOffsets` semantics |
| boxed-expression (`box`, `serialize`, `boxed-function`, `abstract-boxed-expression`) + `math-json/utils.ts`, `types-expression.ts`, `types-kernel-serialization.ts` | propagate `sourceOffsets` through `ce.parse().toMathJson()` |

~150 lines of source + test-snapshot updates.

⚠️ Behavior change (please review)

Parser Error expressions now use the object form { fn: ["Error", …], sourceOffsets } instead of the bare array ["Error", …] whenever a range is available. Both are valid MathJSON, but a consumer matching Array.isArray(expr) && expr[0] === "Error" must also handle expr.fn?.[0] === "Error". This is the bulk of the test-snapshot churn in this PR. If you'd prefer this gated behind a parse option rather than on by default, happy to adjust.
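For consumers that currently match only the bare array form, a small helper along these lines would cover both shapes. `errorOperands` and `isErrorExpression` are hypothetical names for this sketch, not library exports.

```typescript
// Returns the ["Error", ...] operand list for either MathJSON form,
// or undefined when the expression is not an Error node.
function errorOperands(expr: unknown): unknown[] | undefined {
  // Bare array form: ["Error", ...]
  if (Array.isArray(expr)) return expr[0] === 'Error' ? expr : undefined;

  // Object form: { fn: ["Error", ...], sourceOffsets?: [start, end] }
  if (typeof expr === 'object' && expr !== null && 'fn' in expr) {
    const fn = (expr as { fn: unknown }).fn;
    if (Array.isArray(fn) && fn[0] === 'Error') return fn;
  }
  return undefined;
}

function isErrorExpression(expr: unknown): boolean {
  return errorOperands(expr) !== undefined;
}

// Both forms are recognized; ordinary expressions are not.
const bare = ['Error', { str: 'unexpected-command' }];
const positioned = { fn: ['Error', { str: 'unexpected-command' }], sourceOffsets: [0, 4] };
console.log(isErrorExpression(bare), isErrorExpression(positioned), isErrorExpression(['Add', 1, 2]));
```

Routing every check through one helper keeps call sites unchanged if the error shape is later gated behind a parse option.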

Limitation (documented on the field)

Offsets are measured on tokensToString(tokenize(input)), so they drift from the original string only when that round-trip is lossy: comments, Unicode normalization, or multi-codepoint ZWJ emoji. In those cases the range stays valid and monotonic but no longer lines up with character offsets in the original input. Editor-generated LaTeX never hits those paths, so it's exact in practice. A heavier implementation could thread true source spans through the tokenizer if exact ranges for arbitrary pasted input are ever needed.
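A consumer that accepts arbitrary pasted input could guard against this drift before highlighting. The sketch below is one possible check, assuming the error node carries the offending LaTeX as a LatexString operand (as in the sample in the Summary); `offsetsMatchInput` is a hypothetical helper, not part of this PR.

```typescript
// True when the range is in bounds and the original input contains the
// expected text at exactly that range.
function offsetsMatchInput(
  input: string,
  offsets: [number, number],
  expected: string
): boolean {
  const [start, end] = offsets;
  if (start < 0 || end < start || end > input.length) return false;
  return input.slice(start, end) === expected;
}

// Exact case: the offsets point at the unknown command in the original input.
const exact = offsetsMatchInput('\\foo', [0, 4], '\\foo');

// Drifted case: if the offsets were computed on a serialized form that
// dropped a leading comment, the same range no longer lines up.
const drifted = offsetsMatchInput('% note\n\\foo', [0, 4], '\\foo');

console.log(exact, drifted);
```

When the check fails, falling back to marking the whole expression as invalid is probably safer than highlighting a misaligned range.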

Testing

- latex-syntax/* and incomplete-expressions suites pass; snapshots updated for the object-form errors. New assertions in errors.test.ts cover unknown command (\foo → [0,4]), nested error through ce.parse().toMathJson(), unexpected delimiter, and zero-width missing operands.
- Full suite: 10,496 passing. The only failures are two pre-existing, environment-dependent flakes unrelated to this change: arithmetic.test.ts (last-digit machine-float drift) and compile-performance.test.ts (a ~1MB memory-threshold assertion).
- tsc --noEmit clean.

Notes

- src/api.md is not regenerated here: running typedoc/concat-md locally reformats the whole file (toolchain/version drift, ~3k lines of noise). The API documentation lives in the source doc-comments included in this PR; happy to regenerate api.md if you'd like it in the same PR or prefer to do it at release time.
- I can add a CHANGELOG.md entry under ## [Unreleased] → ### New Features if that's the expected flow. Let me know.

Attach a `sourceOffsets: [start, end]` character range to the Error expressions produced by the LaTeX parser, identifying where in the input each parse error occurred. Consumers can map a parse error back to the exact span of source LaTeX (e.g. to highlight an offending token in a Mathfield).

Offsets are zero-based, end-exclusive character offsets into the serialized LaTeX (`tokensToString`). For input that round-trips through the tokenizer unchanged (no comments, Unicode normalization, or macro expansion -- e.g. editor-generated LaTeX), they match the original input string. The tokenizer is not modified: offsets are derived from the existing `tokensToString` via prefix length.

- parse.ts: `Parser.sourceOffsets(startToken, endToken)`; `error()` returns `{ fn, sourceOffsets }`. Missing-token errors use a zero-width range.
- definitions-arithmetic.ts: empty `\sqrt{}` and `\frac{}{}` operands emit a positioned `missing` error instead of the position-less MISSING sentinel.
- types: `Parser.sourceOffsets` signature + documented `sourceOffsets` semantics on `MathJsonAttributes`.
- boxed-expression: propagate `sourceOffsets` through `ce.parse().toMathJson()`.

Parser `Error` expressions now use the object form `{ fn: ["Error", ...], sourceOffsets }` instead of the bare array when a range is available; both are valid MathJSON. Test snapshots updated.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

zojize wants to merge 1 commit into cortex-js:main from zojize:codex/latex-error-source-offsets
