feat(latex): add sourceOffsets to parser-generated Error nodes #312
Open
zojize wants to merge 1 commit into
Attach a `sourceOffsets: [start, end]` character range to the Error
expressions produced by the LaTeX parser, identifying where in the input
each parse error occurred. Consumers can map a parse error back to the
exact span of source LaTeX (e.g. to highlight an offending token in a
Mathfield).
Offsets are zero-based, end-exclusive character offsets into the
serialized LaTeX (`tokensToString`). For input that round-trips through
the tokenizer unchanged (no comments, Unicode normalization, or macro
expansion -- e.g. editor-generated LaTeX), they match the original input
string. The tokenizer is not modified: offsets are derived from the
existing `tokensToString` via prefix length.
- parse.ts: `Parser.sourceOffsets(startToken, endToken)`; `error()` returns
`{ fn, sourceOffsets }`. Missing-token errors use a zero-width range.
- definitions-arithmetic.ts: empty `\sqrt{}` and `\frac{}{}` operands emit a
positioned `missing` error instead of the position-less MISSING sentinel.
- types: `Parser.sourceOffsets` signature + documented `sourceOffsets`
semantics on `MathJsonAttributes`.
- boxed-expression: propagate `sourceOffsets` through
`ce.parse().toMathJson()`.
Parser `Error` expressions now use the object form
`{ fn: ["Error", ...], sourceOffsets }` instead of the bare array when a
range is available; both are valid MathJSON. Test snapshots updated.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
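As a consumer-side illustration of the two `Error` shapes described above, here is a minimal sketch that walks a MathJSON value, accepts both the bare-array and the object form, and slices the offending LaTeX out of the input. The `Expr` type, the `collectErrors` helper, and the hand-written `parsed` value are hypothetical and only stand in for real parser output; the sketch also ignores other MathJSON object forms such as `{ num }` or `{ sym }`.

```typescript
// Simplified MathJSON shape for this sketch only; the library's types are richer.
type Expr =
  | number
  | string
  | Expr[]
  | { fn: Expr[]; sourceOffsets?: [number, number] };

interface ErrorSpan {
  error: Expr[]; // the ["Error", ...] array itself
  range: [number, number] | undefined; // undefined when no offsets were attached
}

// Hypothetical helper: collect every Error node, whichever form it uses.
function collectErrors(expr: Expr, out: ErrorSpan[] = []): ErrorSpan[] {
  if (typeof expr === 'number' || typeof expr === 'string') return out;
  // Bare array form: ["Error", ...]; object form: { fn: ["Error", ...], sourceOffsets }
  const fn = Array.isArray(expr) ? expr : expr.fn;
  const range = Array.isArray(expr) ? undefined : expr.sourceOffsets;
  if (fn[0] === 'Error') out.push({ error: fn, range });
  // Recurse into the arguments so nested errors are found too.
  for (const arg of fn.slice(1)) collectErrors(arg, out);
  return out;
}

// Hand-written example value, assumed for illustration (not captured parser output).
const latex = '1+\\foo';
const parsed: Expr = [
  'Add',
  1,
  {
    fn: ['Error', "'unexpected-command'", ['LatexString', "'\\foo'"]],
    sourceOffsets: [2, 6],
  },
];

const spans = collectErrors(parsed);
// End-exclusive offsets map directly onto String.prototype.slice.
const highlighted = spans.map((s) =>
  s.range ? latex.slice(s.range[0], s.range[1]) : null
);
```

Because the range is end-exclusive, it can be passed to `slice` unchanged; a zero-width range yields an empty string, which a consumer could render as a caret position rather than a highlight.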
closes #311
AI usage disclosure: majority of implementation, human reviewed
Summary
The LaTeX parser reports parse errors as MathJSON `Error` nodes that say what went wrong but not where in the input. This PR populates the existing `MathJsonAttributes.sourceOffsets` field on parser-generated `Error` nodes with the character range of the offending LaTeX, so consumers can map a parse error back to its source span, e.g. to highlight the exact invalid token inline in a `MathfieldElement`.
What's exposed
- `sourceOffsets` on parser `Error` nodes: zero-based, end-exclusive character offsets into the serialized LaTeX (`tokensToString`). For input that round-trips through the tokenizer unchanged (no comments, Unicode normalization, or macro expansion, e.g. editor-generated LaTeX), they equal offsets into the original input string.
- Missing-operand errors (empty `\sqrt{}`, empty `\frac{}{}`) use a zero-width range at the parser position.
- `Parser.sourceOffsets(startToken, endToken?)`: public helper converting a token range to that character range, so custom dictionary entries can attach offsets to errors they raise. `Parser.error()` uses it internally.
- Offsets propagate through `ce.parse(latex).toMathJson()` (selectable via the `metadata` option, included with `metadata: 'all'`).
Implementation
Deliberately minimal: `tokenizer.ts` is not touched. Offsets are derived from the existing `tokensToString` via prefix length, so no per-token source map is threaded through the tokenizer.
- `latex-syntax/parse.ts`: `Parser.sourceOffsets()` (= `latex(0, token).length`); `error()` returns `{ fn, sourceOffsets }`.
- `latex-syntax/dictionary/definitions-arithmetic.ts`: empty `\sqrt{}` / `\frac{}{}` operands emit a positioned `missing` error instead of the position-less `MISSING` sentinel.
- `latex-syntax/types.ts`, `math-json/types.ts`: `Parser.sourceOffsets` signature + documented `sourceOffsets` semantics.
- boxed-expression (`box`, `serialize`, `boxed-function`, `abstract-boxed-expression`) + `math-json/utils.ts`, `types-expression.ts`, `types-kernel-serialization.ts`: propagate `sourceOffsets` through `ce.parse().toMathJson()`.
~150 lines of source + test-snapshot updates.
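The prefix-length idea can be sketched in a few lines. This is an illustration under stated assumptions, not the code in `parse.ts`: `tokensToString` here is a toy that simply concatenates tokens (the real serializer may insert separators), `prefixOffsets` is a hypothetical name, and treating an omitted `endToken` as a zero-width range is an assumption of the sketch.

```typescript
// Toy stand-in for the library's tokensToString; plain concatenation is assumed.
function tokensToString(tokens: readonly string[]): string {
  return tokens.join('');
}

// Hypothetical helper: character range [start, end) covered by
// tokens[startToken .. endToken). Each bound is the length of the
// serialized prefix that precedes it, so no per-token source map is needed.
function prefixOffsets(
  tokens: readonly string[],
  startToken: number,
  endToken: number = startToken // assumption: omitted end means zero-width
): [number, number] {
  const start = tokensToString(tokens.slice(0, startToken)).length;
  const end = tokensToString(tokens.slice(0, endToken)).length;
  return [start, end];
}

const tokens = ['\\foo', '+', '1']; // serializes to "\foo+1" (6 characters)

// Range of the first token, "\foo": [0, 4]
const unknownCommand = prefixOffsets(tokens, 0, 1);

// Zero-width range at the end of input, as for a missing operand: [6, 6]
const missingAtEnd = prefixOffsets(tokens, tokens.length);
```

The trade-off is the one stated above: the offsets index the serialized string, so they match the original input only when serialization reproduces it exactly.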
Parser `Error` expressions now use the object form `{ fn: ["Error", …], sourceOffsets }` instead of the bare array `["Error", …]` whenever a range is available. Both are valid MathJSON, but a consumer matching `Array.isArray(expr) && expr[0] === "Error"` must also handle `expr.fn?.[0] === "Error"`. This is the bulk of the test-snapshot churn in this PR. If you'd prefer this gated behind a parse option rather than on by default, happy to adjust.
Limitation (documented on the field)
Offsets are measured on `tokensToString(tokenize(input))`, so they drift from the original string only when that round-trip is lossy: comments, Unicode normalization, or multi-codepoint ZWJ emoji. There the range stays valid and monotonic but isn't byte-aligned to the original. Editor-generated LaTeX never hits those paths, so it's exact in practice. A heavier implementation could thread true source spans through the tokenizer if byte-exact ranges for arbitrary pasted input are ever needed.
Testing
- `latex-syntax/*` and `incomplete-expressions` suites pass; snapshots updated for the object-form errors. New assertions in `errors.test.ts` cover unknown command (`\foo` → `[0,4]`), nested error through `ce.parse().toMathJson()`, unexpected delimiter, and zero-width missing operands.
- `arithmetic.test.ts` (last-digit machine-float drift) and `compile-performance.test.ts` (a ~1MB memory-threshold assertion).
- `tsc --noEmit` clean.
Notes
- `src/api.md` is not regenerated here: running `typedoc`/`concat-md` locally reformats the whole file (toolchain/version drift, ~3k lines of noise). The API documentation lives in the source doc-comments included in this PR; happy to regenerate `api.md` in this PR if you'd like, or leave it for release time.
- `CHANGELOG.md` entry under `## [Unreleased] → ### New Features` if that's the expected flow; let me know.