Skip to content

[Major Rewrite] NumPy nditer port, NpyExpr DSL with 3-tier custom-op API, C/F/A/K memory layout support, stride-native matmul#611

Draft
Nucs wants to merge 462 commits into
masterfrom
nditer
Draft

[Major Rewrite] NumPy nditer port, NpyExpr DSL with 3-tier custom-op API, C/F/A/K memory layout support, stride-native matmul#611
Nucs wants to merge 462 commits into
masterfrom
nditer

Conversation

@Nucs

@Nucs Nucs commented Apr 22, 2026

Copy link
Copy Markdown
Member

Complete changelog of the nditer branch — everything in this PR since #612 merged.

455 commits · 806 files · +234,348 / −19,179 (vs master, after #612)


TL;DR

  • NpyIter — full port of NumPy 2.4.2's nditer (~12.5K lines): all iteration orders (C/F/A/K), all indexing modes, buffered casting, buffered-reduce double-loop, masking, memory-overlap protection (COPY_IF_OVERLAP), windowed buffering (DELAY_BUFALLOC), unlimited operands and dimensions. 566+ byte-for-byte NumPy parity scenarios.
  • NpyExpr DSL + three-tier custom-op API — write your own ufuncs: raw IL (Tier 3A), element-wise scalar/SIMD (Tier 3B), or composable expression trees with operator overloads (Tier 3C). Exposed as the public np.evaluate, which runs fused expressions 3.2–6.1× faster than NumPy (which can't fuse), with per-node NumPy result_type typing and fused reductions.
  • out= / where= / dtype= ufunc kwargs across the elementwise API — the kwargs on every NumPy ufunc, spanning the binary, unary-math, comparison, predicate, and bitwise families with exact NumPy broadcast/cast/error-text semantics. Plus np.bitwise_and/or/xor and np.positive at the np.* surface.
  • NumPy-parity benchmark: geomean 1.00× at 10M elements across ~409 ops (166 faster / 171 close / 36 slower) — measured by a new official BenchmarkDotNet-vs-NumPy suite committed with the report.
  • 36 new np.* APIssort, pad (11 modes), tile, median/percentile/quantile (all 13 interpolation methods) + their nan* variants, average, ptp, take/put/place, extract/compress, diagonal/trace, argwhere/flatnonzero, unravel_index/ravel_multi_index/indices, delete/insert/append, diff/ediff1d, asfortranarray/ascontiguousarray, np.multithreading.
  • C/F/A/K order support wired through the whole APIShape understands F-contiguity, OrderResolver resolves NumPy order modes, ~68 layout bugs fixed across 9 fix groups.
  • Stride-native matmul/dot — BLIS-style GEBP GEMM absorbs arbitrary strides for all dtypes (kills a ~100× penalty on transposed inputs); fused 1-D dot is 3.5–9× faster with zero GC; opt-in multithreaded dot ~2× faster than NumPy's default on 1M vectors.
  • Sorting, casts & Complex finishednp.sort/np.argsort on a radix line-kernel (closes a Missing Function); a SIMD strided-cast campaign that killed the cast cliffs (15×8×15 astype matrix: 716 → ~391 lagging cells, 852 → 1,177 winning cells vs NumPy); np.zeros via calloc/demand-zero (O(1), was ~1000× slower); the six Complex transcendentals (sinharctan); and bit-exact pairwise summation for sum/mean.
  • Deterministic memory management — atomic reference counting + IDisposable on NDArray, plus a tcache-style buffer pool (1 B – 64 MiB window).
  • Differential fuzzing infrastructure — 37,445 bit-exact NumPy-comparison cases across 24 corpus tiers, a seeded random fuzzer with shrinker, a CI FuzzMatrix gate, and a nightly soak workflow.
  • Legacy iterator stack deleted outrightMultiIterator, the Regen-generated cast templates, and NDIterator itself (interface + class + AsIterator extensions) are all gone; every code path now iterates through NpyIter / NpyFlatIterator / GetAtIndex.
  • Test suite: 9,990 passed / 0 failed on net8.0 + net10.0 (+2,600 new test methods), plus the 37,445-case fuzz corpus replayed by the FuzzMatrix gate.

1. NpyIter — full NumPy nditer port

From-scratch C# port of NumPy 2.4.2's iterator machinery under src/NumSharp.Core/Backends/Iterators/ (~12,557 lines), promoted to public API with NDArray overloads.

Capability Detail
Iteration orders C, F, A, K — incl. NEGPERM negative-stride handling, axis reordering + coalescing to full 1-D collapse
Indexing modes MULTI_INDEX, C_INDEX, F_INDEX, RANGE (parallel chunking), GotoIndex / GotoMultiIndex / GotoIterIndex
Buffering Buffered casting with all 5 casting rules, windowed buffered iteration, DELAY_BUFALLOC, buffered-reduce double-loop (incl. bufferSize < coreSize)
Reductions op_axes with -1 reduction axes, REDUCE_OK, IsFirstVisit, REUSE_REDUCE_LOOPS slab accumulation
Overlap safety COPY_IF_OVERLAP via a port of NumPy's mem_overlap solver (NpyMemOverlap.cs) — overlapping in/out operands no longer silently corrupt
Masking WRITEMASKED + ARRAYMASK executed — the buffered window flush writes back only mask-nonzero elements; VIRTUAL operands (null op slots) construct with NumPy 2.x semantics
Operands / dims Unlimited operands (NumPy caps at NPY_MAXARGS=64) and unlimited dimensions (NumPy caps at NPY_MAXDIMS=64) via dynamic allocation
APIs Copy, GetIterView, RemoveAxis, RemoveMultiIndex, ResetBasePointers, IterRange, DebugPrint, fixed/axis stride queries, GetValue<T>/SetValue<T>, …
Casting parity NpyIterCasting.CanCast matches NumPy's safe/same_kind lattice exactly

Validated by a dedicated battletest harness: 566 scenarios replayed against NumPy 2.4.2 byte-for-byte, a permanent variation-probe harness, and tools/iterator_parity. Dozens of parity bugs found and fixed against NumPy ground truth: negative-stride flipping, NO_BROADCAST enforcement, F_INDEX coalescing, buffered-reduction stride inversion, K-order on broadcast inputs, EXLOOP iternext, buffered-cast Advance, ranged Reset() desync, buffer free-list corruption, the size-1 stride-0 invariant (a (1,4) view with nonzero stride corrupted RemoveMultiIndex), op_axes out-of-bounds reads on stretched size-1 axes, write-broadcast validation, PARALLEL_SAFE wiring, and unit-axis absorption — each reproduced against NumPy first, then fixed by adopting NumPy's constructor structure.

Execution at NumPy speed

NpyIter isn't just correct — it is now the production execution engine: DefaultEngine's binary, unary, and comparison ops (same- and mixed-dtype) route through the NpyIter Tier-3B shell, and it measures at-or-faster than NumPy on every probed aspect (Release, i9-13900K, NumPy 2.4.2):

Aspect (float32) NumSharp NumPy Ratio
contig sqrt 10M 2.98 ms 3.24 ms 0.92×
contig add 10M 3.91 ms 4.09 ms 0.96×
strided add 1M 319 µs 416 µs 0.77×
strided sqrt 1M 206 µs 374 µs 0.55×
strided sum 1M 109 µs 205 µs 0.53×
fused a*b+c 10M 4.77 ms 13.38 ms 0.36×
fused (a-b)/(a+b) 10M 4.12 ms 22.33 ms 0.18×

Key mechanisms: an O(1) trivial-loop bypass that skips iterator construction for contiguous operands, identity-broadcast fast paths, AVX2 hardware-gather (vgatherdps) strided SIMD in the Tier-3B shell (NumPy uses scalar loops for strided binary/reduce — its floors are beatable), and strided-reduction kernels (2-D strided sqrt 1.36× faster than NumPy, strided sum 2.2× faster).

2. NpyExpr DSL + three-tier custom-op API

User-extensible kernel layer on top of NpyIter — the public answer to "how do I write my own ufunc":

  • Tier 3A — ExecuteRawIL: emit raw IL against the NumPy ufunc signature void(void** dataptrs, long* strides, long count, void* aux).
  • Tier 3B — ExecuteElementWise: provide scalar + vector IL; the shell supplies a 4×-unrolled SIMD loop, remainder vector, scalar tail, and strided fallback.
  • Tier 3C — ExecuteExpression: compose NpyExpr trees with C# operators ((a - b) / (a + b)), 50+ node types (arithmetic, trig, exp/log, rounding, predicates, comparisons, Min/Max/Clamp/Where), plus Call() to splice any delegate/MethodInfo into a fused kernel. Compiled once, cached by structural key, ~5 ns dispatch.

This is what powers the fusion wins — one pass, no temporaries — and it is exposed publicly as np.evaluate(expr[, operands][, out]):

  • Per-node NumPy result_type typing — every node resolves to its NumPy 2.4.2 dtype, so mixed trees wrap correctly: (i4*i4)+f8 wraps the multiply in int32 (→ 1410065408) before promoting. Strong-strong NEP50 (incl. int/float tier crossing), weak python-scalar literals (i4+2 → i4, i4/2 → f8) with NumPy's exact OverflowError, and special resolvers (true_divide, arctan2, negative-integer-literal powerValueError, bool add=OR/multiply=AND).
  • Fused reductionsNpyExpr.Sum/Prod/Min/Max/Mean compile a one-pass inner loop; sum(a*b) reads a and b once and never materializes the product. NumPy reduction dtypes (int→i64, uint→u64, mean→f64).
  • out= joins via the ufunc rules (same_kind validation, reference identity, overlap-safe aliasing through COPY_IF_OVERLAP); an EXTERNAL_LOOP guard prevents the silent count==1 slow path.
  • Measured (Release, 4M f64, NumPy 2.4.2): a*b+c 3.2×, (a-b)/(a+b) 6.1×, sum(a*b) 3.6×, sum f32 2.9×, i4*2+f8 3.5× faster. Permanent gate in benchmark/fusion/evaluate_bench.{cs,py}.

3. Legacy iterator stack retired

  • MultiIterator deleted; all callers migrated to NpyIter.Copy / multi-operand execution.
  • The Regen template NDIterator.template.cs + 16 generated NDIterator.Cast.* partials deleted (−3,870 LOC in one commit).
  • NDIterator (interface + NDIterator<T> + AsIterator extensions) deleted entirely[Obsolete] tombstones that threw at runtime after the migration and were referenced by nothing live. Production iteration runs through NpyIter/NpyIterRef (kernels), GetAtIndex (flat reads), and NpyFlatIterator (np.broadcast(...).iters).
  • ~400 per-dtype NPTypeCode switch sites replaced by a generic NpFunc dispatch utility.

4. C/F/A/K memory-layout support

  • Shape now tracks F-contiguity with NumPy-convention contiguity computation; new OrderResolver resolves C/F/A/K for every API with an order parameter.
  • Order support wired through: copy, array, asarray, asanyarray, *_like, astype, flatten, ravel, reshape, eye, concatenate, cumsum, argsort, tile, plus post-hoc F-contig preservation across the IL-kernel dispatchers.
  • New: np.asfortranarray, np.ascontiguousarray.
  • np.where selects C/F output layout the way NumPy does; ravel('F') of an F-contig source returns a view (was a 3,000× copy).
  • ~68 layout bugs fixed across 9 TDD fix groups, backed by ~3,300 lines of new order tests (Sections 41–51: reductions/keepdims, matmul/dot/outer/convolve, broadcasting-from-F, manipulation, file I/O fortran_order, Decimal scalar path, fancy-write isolation, …).

5. New & completed np.* APIs

New functions (36):

Area APIs
Fused / ufunc np.evaluate (fused expressions — see §2), np.bitwise_and, np.bitwise_or, np.bitwise_xor, np.positive
Sorting np.sort (+ ndarray.sort; np.argsort reimplemented) — radix line-kernel on NpyIter, stable, NaN-last, all axes / orders (IterAllButAxis drive mirroring NumPy's _new_sortlike)
Manipulation np.pad (all 11 NumPy modes + callable), np.tile, np.delete, np.insert, np.append
Indexing/selection np.take, np.put, np.place, np.extract, np.compress, np.argwhere, np.flatnonzero, np.diagonal, np.trace, np.unravel_index, np.ravel_multi_index, np.indices
Statistics np.median, np.percentile, np.quantile (all 13 interpolation methods, tuple axis, out=, keepdims, QuickSelect engine), np.average (weights, returned, tuple-axis; fused kernel 1.3–1.6× faster than NumPy at 1M), np.ptp, np.nanmedian, np.nanpercentile, np.nanquantile
Math np.diff, np.ediff1d
Creation np.asfortranarray, np.ascontiguousarray
Runtime np.multithreading(enabled, max_threads) — opt-in threaded kernels

Rebuilt to full NumPy 2.x parity:

  • np.clipmin=/max= keyword aliases, default-None bounds, NumPy 2.x dtype promotion, out= validation.
  • np.unique — 5 missing kwargs, sort+mask algorithm (up to 43× faster), NaN partitioning, n > Array.MaxLength fallback.
  • np.searchsortedside=, sorter=, multidim validation; IL binary-search kernels 5–25× faster (beats NumPy on 20/22 benchmarks).
  • np.copytocasting=, where= masked copies at NumPy speed (was 7–72× slower).
  • np.asarraycopy=, like=, device=, dtype-as-string. np.concatenate — full parity + C/F fast paths. np.all/np.any — tuple-axis, out=, where=. np.expand_dims — tuple axis. np.repeataxis= parameter. np.power — integer-power semantics, negative-exponent ValueError, crash fix.
  • np.broadcast — N-operand form (0..64, then unlimited — NumPy parity, was 2-operand only), live index cursor, lazy .iters, .numiter.
  • Engine completeness: bool/char max/min, Complex quantile, IsInf implemented (was a stub); the six Complex transcendentals sinh/cosh/tanh/arcsin/arccos/arctan implemented (hybrid BCL + C99 edge fix-ups, NumPy 2.4.2 parity — were NotSupportedException).
  • Full 15-dtype coverage pushed through the hot paths — the SByte/Half/Complex dtypes introduced in [new dtypes, NEP50] fully supported Half/Complex/SByte, np.* alias overhaul, NumPy 2.x type alias alignment #612 now work across every kernel family this PR touches (reductions, indexing, trace, casts, quantile, …).

out= / where= / dtype= ufunc kwargs (NumPy parity):

The kwargs present on every NumPy ufunc now span the elementwise core — binary (add, subtract, multiply, divide, true_divide, mod, power, floor_divide), unary-math (sqrt, exp, log, sin, cos, tan, abs/absolute, negative, square), the six comparisons, predicates (isnan/isfinite/isinf), bitwise, invert, arctan2 — each as one NumPy-shaped overload, every rule pinned against NumPy 2.4.2:

  • out joins the broadcast but never stretches (mismatched/stretchable out raise NumPy's exact texts, trailing space included); loop dtype resolved from inputs (NEP50), out only needs a same_kind cast; the provided instance is returned (reference identity).
  • where must be exactly bool (mask cast under 'safe'); it broadcasts over operands and participates in output shape; mask-false slots keep prior out contents.
  • out aliasing an input is well-defined via COPY_IF_OVERLAPadd(x[:-1], x[:-1], out=x[1:]) matches NumPy exactly.
  • dtype= computes in the loop dtype (subtract(300, 5, dtype=i16) = 295), with the bool add→OR / multiply→AND remap keyed off the final loop dtype so add(True, True, dtype=i32) = 2.

6. Linear algebra

  • Stride-native GEMM for all 12 numeric dtypes — BLIS-style GEBP with stride-aware packers; the 8×16 Vector256 FMA micro-kernel reads packed panels, so transposed/sliced inputs cost nothing extra. Eliminates the ~100× fallback penalty (np.dot(x.T, grad): 240 ms → ~1 ms) and the boxing GetValue fallback chain.
  • Full matmul gufunc semantics — batched stacking, 1-D promotion/squeeze rules, validated by a dedicated differential matrix (816 cases).
  • Fused single-pass 1-D dot — 3.5–9× faster, zero GC (was up to 446 gen-0 collections per call at 100K).
  • np.multithreading — opt-in parallel 1-D dot: 1M float dot 172 → 60 µs, ~2× faster than NumPy's default build. Off by default; bitwise-identical summation order when off.

7. Performance (beyond NpyIter and linalg)

Op Improvement
Axis reductions, narrow ints Widening SIMD (int16→int32 accum etc.): sum(int16, axis=1) 1058 ms → 2.7 ms (389×, now faster than NumPy); int32/uint32 2.3–4.6×; also fixes a uint32 axis-sum corruption bug
mean (axis) 217× (Phase-0 bug surgery); var/std 21×; count_nonzero 20×
np.nonzero IL SIMD kernel closes an 8–241× gap to NumPy
np.where IL kernels for scalar-broadcast & non-contiguous (1.2–2× NumPy on broadcast conditions)
Strided 1-D unary Fused strided-SIMD kernel: 0.55 ns/elem flat — beats NumPy at every size; strided sqrt reached parity via gather→tile→SIMD buffering
Strided flat reductions Incremental-advance path: strided sum 8.3× faster (11.8× behind NumPy → 1.4×)
Comparisons PDEP-based packed mask→bool store; broadcast/strided compares routed via NpyIter
Axis-0 reductions Column-tiled accumulation (breaks the output RAW dependency); 8× pairwise unrolled flat reductions
Allocation tcache-style size-bucketed buffer pool with a 1 B – 64 MiB window (covers both the small-N ufunc result and 4M+ outputs that previously paid a fresh VirtualAlloc + demand-zero faults); ≥1 MiB buckets capped at 2 buffers; pool-side GC memory pressure tracking live state; GC.SuppressFinalize on free; using/ARC adopted across concatenate, allclose, convolve, tile, eye, masking, shuffle, …
Casts (SIMD campaign) Strided/gathered SIMD kernels across the full 15×8×15 astype matrix — cvtt float→int, Giesen f16↔ widen/narrow, complex deinterleave, sub-word VPSHUFB shuffles, fused VPGATHER whole-array kernels, single-pass KEEPORDER same-type copy. Cliffs eliminated: 716 → ~391 lagging cells, 852 → 1,177 winning cells vs NumPy
np.zeros calloc / Windows VirtualAlloc demand-zero — O(1) regardless of size (10M f64: 14.3 ms → ~0.01 ms, was ~1000× slower)
Broadcast-reduce Stride-0 axes folded algebraically in the flat-reduction chokepoint (no O(D×N) materialize) — sum(broadcast_to(...)) now ~534–700× faster, beats NumPy, bit-exact
sum/mean (float) Bit-exact NumPy pairwise summation ported onto the per-chunk reduce path — matches np.add.reduce bit-for-bit (unblocks float32)
np.any/np.all (bool/char) Reinterpret to byte/ushort → existing integer SIMD path (was a 5–12× scalar cliff); fixes a latent AVX2 32-lane mask-overflow correctness bug
Complex/Half/Decimal reductions NpyIter chunked ForEach axis reductions — Decimal 5–13×, Half mean 1.6–3.7×, Complex mean 15–45×→parity; float16 negate ~10× via sign-bit flip
Casts (float→int32) NumPy-faithful SIMD cvtt, strided/reversed/gathered variants
np.split family O(1) sub-shape derivation, direct views — 1.5–4× faster than NumPy
Where/copyto/searchsorted/unique see §5

8. Official benchmark suite + honest methodology

  • New cross-platform run_benchmark.py entry point: BenchmarkDotNet Full rigor (50 iters, InProcessEmit) × all suites × {1K, 100K, 10M} vs NumPy 2.x — 1,813 C# measurements, 1,111 matched op×dtype×size comparisons, structural op-name join, tracked markdown report + per-suite artifacts + history snapshots. Coverage spans all 15 dtypes (SByte/Half/Complex suites added).
  • Headline: geomean NumSharp÷NumPy = 1.00× at N=10M (166 ops faster / 171 close / 36 slower) — parity across the whole op surface at memory-bound sizes; ~1.9× at 1K where per-call dispatch dominates (tracked as the next focus).
  • Found and neutralized a benchmark-invalidating tooling bug: dotnet run file-based apps compile the project reference in Debug (optimizations off) even with Configuration=Release properties — hand loops measured ~2× slow while DynamicMethod IL was immune. Benchmarks now assert IsJITOptimizerDisabled == false and refuse to mislead; the rule is documented.
  • Canonical NpyIter benchmark — a section-addressable harness covering 33 op families × {scalar/1K/100K/1M/10M}, integrated into run_benchmark.py, plus a post-release CI workflow (.github/workflows/benchmark.yml) that auto-commits report cards to master.
  • Frontier findings — found, then fixed. Adversarial probes flagged real losses; the headline ones are now closed: np.sum over a broadcast_to view (was 54× slower) folds stride-0 axes algebraically and runs ~534–700× faster than NumPy, bit-exact; scalar np.any/np.all on bool/char (was 5–12× slower) reinterpret onto the integer SIMD path; np.zeros (was ~1000× slower) is calloc-backed. Remaining tracked items: small-N (~1K) per-call dispatch overhead and a few iterator edge cases pinned as [OpenBugs]/skipped repros. A win surfaced too: hand-rolled 8-band parallel iteration 4.7×.

9. Differential fuzzing vs NumPy (new infrastructure)

  • 37,445 bit-exact corpus cases across 24 JSONL tiers generated from real NumPy 2.4.2 outputs: casts (full 15×15 matrix), binary arith (NEP50), div/mod/power, comparisons, unary (incl. float16 inputs + all narrow ints), reductions, NaN-aware reductions, cumulative, statistics, logic/extrema, bitwise+shift, where/place, manipulation, matmul, modf multi-output, sorting/searching, parameter sweeps, SIMD-tail boundaries (900 cases around vector-width edges), operand aliasing, and error-parity (exception-for-exception).
  • Seeded random fuzzer with an element-wise shrinker for minimal repros; metamorphic invariant tier (11 algebraic properties).
  • CI integration: FuzzMatrix gate wired into the build workflow + a new nightly fuzz-soak workflow (.github/workflows/fuzz-soak.yml).
  • Findings inventoried in docs/FUZZ_FINDINGS.md; every fixed class re-armed as a permanent regression gate. The error-parity tier alone surfaced 1 critical crash; the op tiers surfaced 17+ distinct bug classes that are now fixed (see §10).

10. Correctness — NumPy-parity bug fixes

Semantics (behavioral changes, may affect callers):

  • floor_divide / mod: NumPy-exact floored semantics and divide-by-zero results.
  • Comparisons: <= / >= now return False for NaN (IEEE/NumPy).
  • Flat min/max propagate NaN.
  • np.negative(uint) wraps modulo 2ⁿ instead of throwing; bool - bool and -bool/np.negative(bool) now throw (NumPy behavior).
  • Transcendental ufuncs use NEP50 width-based float promotion.
  • np.power: negative integer exponent raises ValueError; exact integer-power semantics.
  • Cast semantics aligned with NumPy across all dtype pairs (IL kernels + ConvertValue); complex→bool no longer drops the imaginary part; float→int SIMD uses truncation (cvtt) like NumPy.
  • Broadcasting keeps rank when a 1-D [1] meets a lower-rank operand; quantile-family dtype & bool handling; Complex np.where.
  • Integer reciprocal(0) is per-width exact: int32/int64MinValue, uint64 → 2⁶³, but 0 for int8/int16/uint8/uint16/uint32 (was MinValue/0 across the board); bool → int8.
  • clip/maximum/minimum: float16 signed-zero scalar tail, NaN propagation through the SIMD kernel, and correct F-contiguous/strided element pairing.
  • float16 axis sum accumulates in float32 (NumPy parity); Complex flat min/max return the NaN-bearing element verbatim; Complex unary math ported from NumPy's own C99 algorithms.

Crashes & corruption:

  • Overlapping-operand corruption eliminated iterator-wide (COPY_IF_OVERLAP, §1).
  • Masked iteration: a buffered WRITEMASKED write landed garbage in exactly the slots NumPy preserves (silent corruption of the elements the caller asked to protect) — now writes back only mask-nonzero elements.
  • uint32 axis-sum produced wrong values past 8 distinct columns (widening-SIMD rewrite).
  • np.pad: 5 correctness/crash bugs (battle-tested against NumPy 2.4.2); linear_ramp preserved Complex dtype.
  • UnmanagedStorage/ArraySlice: CopyTo direction + bounds bugs; CloneData partial-buffer bug; scalar offset lost on Clone; buffered NpyIter.Clone shared buffers; DTypeSize reported Marshal.SizeOf instead of in-memory stride; NPTypeCode.Char.SizeOf returned 1 (real: 2); stale Decimal priority.
  • TensorEngine now propagates through Cast/Transpose/copy/reshape/ravel (custom engines were silently dropped).
  • take with out= enforces NumPy's safe-cast direction; put/place non-contiguous writeback fixes; argsort on non-C-contiguous input.
  • NpyIter ForEach/ExecuteGeneric/ExecuteReducing read past the end without EXTERNAL_LOOP.
  • np.exp2 float32-output IL kernel was malformed (InvalidProgramException); np.power with a Half exponent threw InvalidCastException; a narrowing dtype= on a complex float-ufunc segfaulted — all fixed.
  • Complex nansum axis reduction read uninitialized memory for ndim ≥ 3; the AVX2 32-lane any() mask overflow (byte/sbyte) returned wrong results; net8.0 complex abs and axis min/max NaN propagation corrected.

11. Memory management — ARC + IDisposable

  • NDArray now implements IDisposable backed by atomic reference counting on the unmanaged block: CAS-driven TryAddRef/Release, idempotent Dispose, finalizer safety net, immortal non-owning wraps. Views keep parents alive; parent disposal never invalidates live views.
  • Hammered by a 15-case lifecycle suite incl. 32-thread × 1,000-op concurrency races and 50-way parallel dispose — zero corruption.
  • Deterministic release means hot loops no longer wait on the finalizer queue; combined with the buffer pool this removes most steady-state GC pressure (dot at 100K: 446 collections → 0).

12. Char8 primitive

New 1-byte character type (NumSharp.Char8) — the NumPy S1/Python bytes(1) equivalent — with conversions, operators, span helpers, and 100% Python bytes API parity validated against a Python oracle. Vendored .NET ASCII/Latin-1 reference sources under src/dotnet/ document the upstream implementations it mirrors.

13. Examples — trainable MNIST MLP

New examples/NeuralNetwork.NumSharp: a 2-layer MLP with a naive implementation and a fused one (single-NpyIter bias+ReLU fusion, fused softmax-cross-entropy backward, Adam optimizer). Originally needed a "copy transposed views before np.dot" workaround (31× training speedup at the time); the stride-native GEMM (§6) made the workaround unnecessary. Converges to >99% test accuracy in the bundled demo.

14. Kernel architecture & hygiene

  • ILKernelGenerator split into DirectILKernelGenerator (legacy whole-array kernels, 51 partials under Kernels/Direct/) and ILKernelGenerator (NpyIter-driven per-chunk kernels — the target model matching NumPy's PyUFuncGenericFunction); migration path documented per kernel family.
  • All Vector128/256/512 and Math/MathF reflection centralized in VectorMethodCache / ScalarMethodCache; IL-emitted typed-field copier replaces the UnmanagedStorage.Alias switch.
  • 24 dead kernel methods removed outright (were [Obsolete(error: true)] tombstones, referenced by nothing); dead axis-reduction SIMD paths removed.

15. Documentation

  • NpyIter/NDIter book: docs/website-src/docs/NDIter.md (7-technique quick reference, decision tree, memory model, gotchas) + ndarray.md.
  • DocFX website — Benchmarks vs NumPy: benchmarks.md (head-to-head evidence companion to the IL-generation page), benchmark-iterator.md, benchmark-matrix.md, driven by the auto-committed report artifacts.
  • Engineering ledgers: PERF_LEDGER.md (every optimization with before/after), NPYITER_GAPS_AND_ROADMAP.md (gap analysis vs NumPy 2.4.2 + prioritized roadmap), MIGRATE_NPYITER.md, IL-kernel playbook, fuzz findings/coverage.
  • Branch quality audit findings are pinned as test/NumSharp.UnitTest/AuditV2/AuditV2_*.cs — every Tier-1 finding fixed or reproduced as an [OpenBugs] test.

16. Tests & CI

  • +2,600 test methods; suite now 9,990 passed / 0 failed on net8.0 + net10.0. Zero regressions maintained commit-by-commit.
  • New suites: np.evaluate (per-node wraparound, dtype matrices, weak scalars + overflow, fused-vs-unfused, out= identity/cast/aliasing, fused reductions), out=/where=/dtype= parity suites (broadcast/cast/error-text pins), WRITEMASKED/VIRTUAL parity; NpyIter battletests (566 scenarios), order-support sections 41–51, ARC lifecycle, clone regression, np.pad/average/median/percentile/ptp/diff battle tests, IL-kernel battle tests, behavioral audit harness.
  • CI: fuzz gate in build-and-release.yml, nightly fuzz-soak.yml, new post-release benchmark.yml (auto-commits NumPy-comparison report cards to master).
  • Known gaps stay visible: the still-unimplemented NumPy functions are flip/fliplr/flipud/rot90, diag, gradient, and round (np.sort is now done); small-N (~1K) per-call dispatch overhead is the headline performance focus (docs/NPYITER_GAPS_AND_ROADMAP.md); a few iterator edge cases remain pinned as [OpenBugs]/skipped repros. Every open issue found by the audits/fuzzers/benches is checked in as a failing-by-design test rather than ignored.

Breaking changes

Change Impact Migration
bool - bool, -bool, np.negative(bool) now throw Matches NumPy Use ^ / cast to int first
NaN <= / >= returns False Matches IEEE & NumPy Use np.isnan explicitly
floor_divide/mod divide-by-zero & floored results Matches NumPy
np.negative(uint) wraps instead of throwing Matches NumPy
np.power(int, negative int) raises ValueError Matches NumPy Use float exponents
Cast edge cases (overflow/NaN/complex→bool/float→int truncation) Matches NumPy
Transcendental ufuncs: NEP50 width-based promotion Return dtype may change
np.clip/quantile-family dtype promotion Return dtype may change
Broadcast views are read-only; broadcasting keeps rank for 1-D [1] Matches NumPy .copy() to write
MultiIterator and NDIterator (+ NDIterator<T>, AsIterator) removed Public types removed (threw at runtime anyway) Use NpyIter / NpyIter.Copy / NpyFlatIterator
NpyIter: MaxOperands=8 and 64-dim limits removed None (loosening)
np.copyto unwriteable-destination error type corrected Exception type change

Everything above was validated against NumPy 2.4.2 ground truth — by 37k differential corpus cases, 566 iterator parity scenarios, and per-feature battle tests run on actual NumPy output.

Nucs added 27 commits May 29, 2026 20:44
… throws)

NumPy has no subtract ufunc loop for the bool dtype: bool - bool,
np.subtract(bool, bool) and np.ptp(bool) (which computes max - min) all raise
TypeError ("numpy boolean subtract, the `-` operator, is not supported, use
the bitwise_xor, the `^` operator, or the logical_xor function instead.").
NumSharp previously computed a boolean result instead - a silent divergence
surfaced while implementing np.ediff1d.

DefaultEngine.Subtract now guards the both-operands-boolean case and throws
NotSupportedException with NumPy's message. Mixed bool + numeric still
promotes to the numeric type and subtracts (e.g. bool - int32 -> int32), so
only the genuine bool/bool case is rejected. The guard sits at the single
engine chokepoint, so it covers the - operator, np.subtract, the
scalar-scalar path, and any op that subtracts internally (e.g. np.ptp).

Side effect: np.ptp(bool) now raises the same error as NumPy instead of
returning a bool, fixing a previously-documented [Misaligned] divergence.

Tests
- np_subtract_tests.cs: dropped the (Boolean, Boolean) DataRow from
  SubtractAllPossabilitiesBoolean (it contradicted the test's own _REGEN
  template, which excludes Boolean) and added SubtractBoolBool_ThrowsLikeNumPy
  asserting the operator, np.subtract and the scalar path all throw.
- np.ptp.BattleTests.cs: Ptp_Bool_DivergesFromNumPy -> Ptp_Bool_ThrowsLikeNumPy
  (dropped [Misaligned]; now asserts the throw, matching NumPy).

Full suite green on net10.0 (9404 passed, 0 failed). Two net8.0 failures
(Trunc_FContig_PreservesFContig, Max_WithNaN_PropagatesNaN) are pre-existing
and unrelated: they fail identically with this change reverted and involve no
boolean subtraction.
…e(bool) throw)

NumPy has no negative ufunc loop for the bool dtype: unary -bool,
np.negative(bool) and ndarray.negative() all raise TypeError ("The numpy
boolean negative, the `-` operator, is not supported, use the `~` operator or
the logical_not function instead.") - even for empty arrays. NumSharp
previously returned logical_not instead, a silent divergence (companion to the
bool-subtract divergence fixed in the previous commit).

Changes
- DefaultEngine.Negate (the unary `-` operator / TensorEngine.Negate path):
  the bool branch now throws instead of routing to LogicalNot.
- NDArray.negative() (the np.negative / ndarray.negative() path, a separate
  implementation): guards bool up-front - before the size==0 short-circuit and
  the cast allocation, so empty bool also throws - and drops the now-dead
  boolean switch case.
- np.logical_not(bool) previously relied on Negate(bool) routing to LogicalNot.
  It now routes through np.invert (for bool, logical_not == ~ == invert), so
  logical_not keeps working after the Negate guard. The `!` and `~` operators
  were already independent of Negate and are unaffected.

Numeric negate (int / float / decimal / complex / half) is unchanged.

Tests
- NDArray.negative.Test.cs: Negative_Bool_ThrowsLikeNumPy asserts -bool,
  np.negative(bool), ndarray.negative() and empty-bool all throw, and that
  ~bool / np.logical_not(bool) still return the boolean flip [F, T, F].

Full suite green on net10.0 (9405 passed, 0 failed). The two net8.0-only
failures (Max_WithNaN_PropagatesNaN, Trunc_FContig_PreservesFContig) are
pre-existing and unrelated: they reproduce on a clean build with this change
reverted and involve no boolean negation.
…cross all dtype pairs

NumSharp had THREE cast implementations with divergent out-of-range / NaN semantics:
  - Converts (scalar; used by astype/CastTo): NumPy-faithful on all 3276 cast cases.
  - DirectILKernelGenerator cast kernel (SIMD; used by NpyIter.Copy contig/strided):
    SATURATED float->int and sign-extended signed->UInt64 to only 32 bits -> 142 wrong.
  - NpyIterCasting.ConvertValue (scalar fallback; used by NpyIter.Copy/copyto/where/
    concatenate): read every source through a double intermediate (dropping Int64/UInt64
    low bits) and used C# saturating casts -> 346 wrong.

NumPy uses MODULAR WRAPPING for integer narrowing/sign-change and C truncation
(type-min / int.MinValue sentinel on NaN/overflow) for float->int. The IL kernel and
ConvertValue instead SATURATED (clamped to dst min/max), so every out-of-range,
negative->unsigned, float-overflow or NaN value produced a wrong result through
NpyIter.Copy -- a latent bug reachable today via np.where, np.copyto and np.concatenate
cross-dtype casts (not just the astype fast-path under development).

Spec established empirically: a 15x15 dtype x edge-value ground-truth matrix generated
from NumPy 2.x (.claude/scripts/cast_spec_numpy.py), replayed bit-exactly through all
three NumSharp paths (cast_spec_numsharp.cs). Converts matched NumPy on every case and is
the single source of truth.

Fixes:
  - NpyIterCasting.ConvertValue: read the source through a LOSSLESS intermediate
    (long / ulong / double / decimal categories instead of blanket double) and route
    integer narrowing and float->int conversions through the Converts.* table. Integer
    casts now wrap (unchecked); float->int truncates with the NumPy overflow/NaN sentinel.
    Result: 0/3276 divergences (was 346).
  - DirectILKernelGenerator: TryGetCastKernel / TryGetStridedCastKernel now decline the two
    families whose emitted IL diverges from NumPy (float->integer; signed-narrow->UInt64)
    via DivergesFromNumpyCast(), so those casts fall back to the NumPy-exact ConvertValue
    path. All other pairs keep their SIMD kernels. NumPy-exact SIMD emission for the
    declined families (raw cvtt truncation, full 64-bit sign extension) is a perf follow-up.

Result: all three cast paths are now bit-exact with NumPy across all 3276 cast cases
(was il=142, cv=346, effective fast path=271 wrong). Full unit-test suite: 0 failures,
9405 passed (TestCategory!=OpenBugs,HighMemory).
astype's cross-dtype path used a scalar per-element Converts loop
(UnmanagedMemoryBlock.CastTo) that was 1.5-3x slower than NumPy on contiguous arrays
and paid a cache-hostile reorder for transposed/F-contiguous sources. Replace it with a
SIMD copy-with-cast through NpyIter.Copy into a KEEPORDER output: the output's memory
order mirrors the source's contiguity, matching NumPy's array_astype default order='K' /
PyArray_NewLikeArray (numpy/_core/src/multiarray/methods.c:769). An F-contiguous or
transposed source now casts as a flat both-F copy instead of a strided reorder.

Wins (min of 3 runs, old astype -> new, vs NumPy 2.x), 2048x2048:
  i32->f64 contig     8.55ms -> 4.56ms   (0.89x NumPy - beats it)
  i32->f64 transpose 10.76ms -> 4.54ms   (0.89x NumPy)
  f32->f64 transpose 10.99ms -> 4.63ms   (0.93x NumPy)
  i32->i32 transpose  1.83ms -> 1.82ms   (0.91x NumPy)
All non-narrowing cast families (int->float, float->float, int->int, same-width) are now
within 10% of NumPy or faster across contig/strided/transpose/fortran/broadcast/negstride.

float->int and signed-narrow->UInt64 are excluded via DivergesFromNumpyCast: they have no
NumPy-faithful SIMD kernel yet (see 71b6c38), and NpyIter.Copy's scalar strided fallback
is slower than the legacy contiguous Converts loop, so they keep the legacy path - correct
and no slower than before. Restoring SIMD for float->int with NumPy wrapping semantics
(raw cvtt truncation, no saturation) is the follow-up that closes the f64->i32 downcast gap.

Correctness rests on the cast-semantics alignment in 71b6c38 (all three cast paths now
bit-exact with NumPy across the full 15x15 matrix). Full suite: 0 failures, 9405 passed.
…float->int & signed->UInt64

Implements float->int casting correctly AND fast, closing the f64->i32 downcast gap and
un-declining the family from DivergesFromNumpyCast.

Root cause (proven in .claude/scripts/f64_to_i32_shootout.cs): the hardware truncating convert
VCVTTPD2DQ/VCVTTPS2DQ already yields NumPy's exact result -- truncate toward zero, INT_MIN
sentinel on NaN/overflow -- while C#'s (int)d and the IL conv.* opcodes SATURATE. So the fix is
to USE cvtt, not to clamp.

Changes:
  - EmitConvertTo (scalar element conversion; used by every cast-kernel tail and the ScalarOnly
    float->int pairs): route float->{integer,char} through the Converts.* table (NumPy-exact)
    instead of the saturating conv.* opcodes, and sign-extend signed->UInt64 via Conv_I8 instead
    of the zero-extending Conv_U8. The entire scalar cast path is now bit-exact with NumPy.
  - CastDoubleToInt32Contig / CastSingleToInt32Contig: width-adaptive cvtt helpers (AVX512
    8/16-wide, AVX 4/8-wide, SSE2 2/4-wide, scalar Converts tail) registered for the float->int32
    contig pairs in TryGetCastKernel. The IL generator's width-matched emitter can't express the
    double->int32 lane halving cleanly, so these use the "tidy helper -> single Call" pattern.
  - DivergesFromNumpyCast narrowed from {all float->int, signed->UInt64} to just signed->UInt64
    (its vectorized widen still sign-extends to only 32 bits, so it stays on the correct
    ConvertValue scalar fallback). float->int is NumPy-faithful on every path now.

Perf (min of 3, 2048x2048, new vs NumPy 2.x astype):
  f64->i32 contig     9.01ms -> 2.90ms  (0.85x NumPy - beats)
  f64->i32 transpose 13.68ms -> 2.95ms  (0.87x)
  f64->i32 fortran   13.76ms -> 3.04ms  (0.88x)
  f64->i32 broadcast 14.35ms -> 2.39ms  (0.92x)
f64->i32 now beats NumPy on contig/transpose/fortran/broadcast. Strided/negstride f64->i32
improved 2.4-2.6x (8.3->3.4ms, 16.9->6.5ms) but remain on the scalar strided path (1.4-2.0x
NumPy); a strided-cvtt inner loop is the follow-up.

Correctness: all three cast paths bit-exact with NumPy across the full 3276-case matrix
(cast_spec) plus multi-element SIMD-body verification (cast_multi_correctness: float->all 9 int
dsts x contig/strided/reversed, signed->u64). Full suite: 0 failures, 9405 passed.
…non-contiguous gap)

The contig float->int32 cvtt helper (26b710b) left strided and negative-stride f64->i32 on
the scalar strided path (1.4-2.0x NumPy). Add a NumPy-faithful cvtt strided kernel that
vectorizes the innermost axis for the three common dst-contiguous run shapes:
  - unit-stride source      -> VCVTTPD2DQ forward
  - reversed source ([::-1])-> load forward chunk, cvtt, reverse lanes (Vector128.Shuffle)
  - positive-strided source -> AVX2 gather (a[:,::2] etc.), then cvtt
Outer dims walk via an incremental-offset odometer (element strides). Anything else
(non-contiguous dst, exotic strides, no AVX2) falls back to the NumPy-faithful scalar
Converts path. Registered in TryGetStridedCastKernel for double->int32.

Perf (min of 3, 2048x2048, new vs NumPy 2.x):
  f64->i32 strided   large  3.41ms -> 1.98ms  (0.83x NumPy - beats)
  f64->i32 negstride large  6.48ms -> 2.99ms  (0.85x)
  f64->i32 negstride med    0.385  -> 0.154   (0.80x)

With this, the astype variation matrix has ZERO real-array (>=256K-element) failures: every
meaningful-size cast cell across f64<->i32 / f32<->f64 / i32<->i32 x
contig/strided/transpose/fortran/broadcast/negstride is now within 10% of NumPy or faster.
The remaining 21 matrix fails are all trivial-size (<=4K elements, sub-microsecond), bound by
NDArray allocation overhead rather than the cast kernel.

Correctness: cast_spec 0/3276 mismatches on all three paths; multi-element SIMD-body test
(double->int32 strided/reversed/contig) 0 failures. Full suite: 0 failures, 9405 passed.
…ex->bool dropping imaginary part

Builds the differential-fuzzer harness (Plan A items #1+#2 foundation): an offline,
committed NumPy 2.4.2 oracle corpus replayed bit-exact by MSTest, so every
NpyIter-backed operation can be proven identical to NumPy across the full input
space -- caught systematically, not by hand-picked cases. The motivating failure
class (the cast saturate-vs-wrap bug, latent in where/copyto/concatenate) must be
impossible to ship again.

Harness
- oracle/gen_oracle.py + oracle/layout_catalog.py: Python generator emitting JSONL
  cases {op, params, operands[], expected}. Each operand is serialized bytes-exact as
  (dtype, shape, element-strides, element-offset, bufferSize, base-buffer-hex). 26
  single-array layouts: C/F-contiguous (1/2/3-D), transposed, strided, negative-stride,
  offset slices, broadcast (stride-0), scalar-broadcast, 0-D views, newaxis, singleton
  dim, empty, empty-composed, 5-D. Edge values are FRONT-LOADED (NaN / +/-inf / overflow
  / -0.0) so even an 8-element operand exercises the float->int cvtt sentinel paths.
  describe() self-validates that every operand is a genuine VIEW into its base buffer,
  so a layout that accidentally yields a copy/scalar fails loudly at generation.
- test/NumSharp.UnitTest/Fuzz/FuzzCorpus.cs: reconstructs the EXACT operand views from
  raw bytes via contiguous-storage + Alias (handles broadcast where buffer < size,
  negative strides, offset slices, 0-D, empty) -- no Python at test time.
- BitDiff.cs: element-wise bit-exact compare with NaN tokenized (payload non-contractual);
  -0.0 and +/-inf are bit-compared since NumPy preserves signed zero / canonical infinity.
- OpRegistry.cs: op-name -> NumSharp call (astype today; binary/comparison/unary/reduction
  /where tiers extend the switch).
- FuzzCorpusTests.cs: one data-driven [TestCategory("FuzzMatrix")] test per corpus file;
  a failure lists every divergent cell (not first-failure-wins).
- HarnessSelfTests.cs: proves the harness has teeth -- BitDiff detects value/NaN-vs-number
  /signed-zero divergence and treats differing NaN payloads as equal; and the corpus
  pipeline genuinely exercises the float->int wrap-not-saturate semantics on real edges.
- Corpus committed under Fuzz/corpus/ and copied to test output: astype_smoke (312 cases)
  + astype_full (4394 = 13 x 13 NumPy-representable dtypes x 26 layouts).

Bug found and fixed (the matrix's first real catch)
- complex->bool through the NpyIter scalar-cast fallback (NpyIterCasting.ConvertValue)
  dropped the imaginary part: it took c.Real and applied real != 0, so complex(0, nonzero)
  and complex(-0.0, nonzero) returned False. NumPy defines bool(z) == (z != 0): truthy when
  EITHER part is non-zero. Latent wherever complex->bool casts through the iterator (astype,
  np.where, np.copyto, np.concatenate). Fixed to honor both parts, matching the already-correct
  Converts.ToBoolean(Complex). All 15 divergent cells across every layout now match NumPy.

Verification
- astype_full: 0 / 4394 divergences from NumPy.
- Full CI suite (net10.0, TestCategory!=OpenBugs&!=HighMemory): 0 failed / 9412 passed.

Scope note: Char and Decimal have no NumPy analog, so they are excluded from the
differential corpus; both remain covered by the existing Converts-oracle cast tests.
…classifier

Extends the differential fuzzer to binary arithmetic (add/subtract/multiply/divide)
across 20 NEP50 dtype pairs x 9 pairwise operand layouts, bit-comparing value AND
result dtype AND broadcast shape against NumPy 2.4.2. 720 cases.

Pairwise layouts (layout_catalog.PAIR_LAYOUTS): pp_contig_contig (SimdFull),
pp_contig_fortran, pp_contig_strided (SimdChunk), pp_strided_strided (General),
pp_scalar_right / pp_scalar_left (SimdScalar*), pp_broadcast_row / pp_broadcast_col,
pp_negstride_both. Operands are emitted at natural shapes; the op performs broadcasting.

The matrix now asserts three things per case: result dtype == NumPy (the headline NEP50
promotion failure), result shape == NumPy (broadcasting), and bit-exact bytes.

Misaligned classifier (never-silent intended divergences)
- test/.../Fuzz/MisalignedRegistry.cs: the explicit, documented set of intended
  NumSharp-vs-NumPy differences. The matrix excuses ONLY these patterns (logging a
  per-reason count) and fails on anything else. Cases stay in the corpus so a future
  fix simply starts passing, and any drift past tolerance turns back into a hard failure.
- BitDiff.WithinUlp: ULP-distance helper used ONLY to classify documented near-misses,
  never to relax the default bit-exact gate.

Two intended divergences documented (maintainer decision: keep as Misaligned)
1. NEP50 weak-scalar (41 cases): NumSharp treats a 0-D array operand as a weak scalar
   (the other operand's dtype drives promotion), e.g. array(int32)+zerod(int64) -> int32
   where NumPy gives int64; array(uint8)+zerod(int8) -> uint8 vs NumPy int16. NEP50 makes
   0-D arrays full promotion participants; only Python scalar literals are weak. NumSharp
   cannot distinguish the two (both are 0-D NDArrays), and `arr + 5` ergonomics were chosen
   over strict parity. Array+array promotion is correct across all layouts.
2. Complex true-division ~1 ULP (11 cases): complex / {float,complex,int} differs from
   NumPy's npy_cdivide by one ULP (System.Numerics.Complex uses different scaling).
   Complex add/sub/mul are bit-exact; only division diverges.

Verification: Binary_Arith excuses exactly 52 divergences (11 complex-divide + 41
weak-scalar) and bit-matches the other 668 cases; full FuzzMatrix category 0 failed /
8 passed. No NumSharp.Core changes in this commit (harness + corpus + generator only).
…Bugs]

Extends the binary differential generator to floor_divide / mod / power and runs them as
a known-failing [OpenBugs] reproduction (excluded from CI; remove the tag when fixed).
The bit-exact add/sub/mul/divide matrix (Binary_Arith) stays green; these three ops are
split into a separate binary_divmod_power.jsonl corpus + Binary_DivModPower_KnownDivergences
so the CI gate is not broken while the divergences are tracked.

206/430 cases diverge from NumPy 2.4.2 (27 NEP50 weak-scalar cases auto-excused). The real
bugs, confirmed directly against NumPy:

  integer ÷0 / mod 0   NumPy returns 0 (with RuntimeWarning). NumSharp:
                       - int32 //0  -> 2147483647 / -2147483648 (saturated sentinel), not 0
                       - int32  %0  -> the dividend (1,-7,5), not 0
                       - uint8 //0 and %0 -> throw DivideByZeroException
  float //0            NumPy: 1.0//0 -> +inf, -1.0//0 -> -inf, 0.0//0 -> nan.
                       NumSharp: NaN for all (loses the ±inf sign result).
  mixed-precision mod  mod(float32, float64) computes in float32 then widens (low mantissa
                       zeroed) instead of promoting to float64 first. add/sub/mul/div promote
                       correctly; only mod loses precision.
  complex power        complex ** {float,complex,int} differs by ~1 ULP (like complex divide)
                       plus some gross edge divergences (NumPy NaN vs NumSharp 0).

Correct already: mod sign convention is floored (mod(-7,3)==2, matches NumPy), and float %0
is nan on both sides.

CI suite (net10.0, !OpenBugs&!HighMemory): 0 failed / 9413 passed. FuzzMatrix: 8 passed.
Adds the comparison tier (equal/not_equal/less/greater/less_equal/greater_equal -> bool)
across the 20 NEP50 dtype pairs x 9 pairwise layouts. 1080 cases.

Found: NumSharp's <= and >= return True when an operand is NaN; IEEE/NumPy return False
(only != is True against NaN). <, >, ==, != all handle NaN correctly. 122 cases, every one
the NumPy-false -> NumSharp-true signature on the NaN element. Documented in MisalignedRegistry
(tagged [known bug]) with a tight signature so the matrix stays green; tracked for fix.

Comparison matrix excuses exactly 122 NaN-ordering divergences and bit-matches the other 958.
FuzzMatrix category green.
…classes

Adds the unary tier (negative/abs/sign/sqrt/cbrt/square/reciprocal/floor/ceil/trunc/
sin/cos/tan/exp/log) across 7 dtypes x 26 single-array layouts. 2574 cases. Adds Half-ULP
support to BitDiff and routes THREW through the Misaligned classifier (new DivergenceKind.Threw).

Also fixes a generator artifact: np.ascontiguousarray forces ndim>=1, which corrupted the
expected shape of 0-D results to [1]. gen_unary/gen_binary now read r.shape BEFORE
ascontiguousarray (used only for the bytes), so scalar results compare as [] (matching NumSharp).

Documented divergence classes (excused, never silent; remove from MisalignedRegistry as fixed):
  494  unary NEP50 promotion  NumSharp promotes int->float64 for unary float ops regardless of
                              width; NumPy uses width-based float (bool/int8->float16,
                              int16->float32, int32+->float64). square/reciprocal/floor dtype also differ.
   90  complex unary          square catastrophic cancellation (re^2-im^2 -> 0); sin/cos/tan/log
                              inf/NaN edge handling differs (System.Numerics vs npy_c*).
   72  unary ~ULP             transcendental / magnitude (abs complex) within 2 ULP.
   61  reciprocal             integer reciprocal returns 0 vs NumPy ÷0 sentinel; non-contiguous
                              operand throws "Can't return a memory address".
   24  negative on unsigned   np.negative(uint8) throws NotSupportedException; NumPy wraps modulo.

Unary matrix excuses these and bit-matches the rest. FuzzMatrix category: 10 passed.
Tracked for fixes in task #9.
…ergences

Adds the reduction tier (sum/prod/min/max/mean/std/var/argmax/argmin/all/any) across axis
in {None, 0, ndim-1} x keepdims {false,true} x 7 dtypes x 10 layouts. 3640 cases. OpRegistry
gains a reduction dispatch handling the heterogeneous NumSharp signatures (mean/std/var split
None-vs-int overloads; argmax/argmin require an int axis; prod takes a Type arg).

Reductions are mostly solid — only 80/3640 diverge. all/any/argmax/argmin/prod are bit-exact.
Documented divergence classes (excused in MisalignedRegistry; remove as fixed, task #10):
  reduction NaN propagation   min/max/mean/std/var return the non-NaN result; NumPy keeps NaN
                              (regular reductions propagate; only nan* variants skip).
  complex axis reduction      mean/std/var/sum of complex along an axis throws
                              "Can't construct NDCoordinatesAxisIncrementor with a vector shape".
  bool min/max along axis     returns True where NumPy returns False.
  summation precision         sum/mean/std/var floating accumulation differs from NumPy
                              pairwise / two-pass (algorithm order).
  reduction result dtype      NEP50 accumulator width / complex->real (std/var) differences.

FuzzMatrix category: 11 passed.
Adds the selection tier: np.where(cond,x,y) across 8 dtype pairs x 5 triple-operand layouts
(contig / broadcast-x-y / strided / scalar-cond / broadcast-cond) and np.place(arr,mask,vals)
across 3 contiguous layouts x 5 dtypes. where=40 cases, place=15 cases.

np.place is bit-exact with NumPy on all 15 cases. np.where bit-matches 33/40; documented:
  3  NEP50 weak-scalar   where(cond, array, 0-D scalar) promotes weakly (generalized the
                         weak-scalar rule to any multi-operand op with a 0-D operand).
  4  complex np.where    throws NotSupportedException "Zero-push unsupported for Complex".

With this, all six A1 op tiers (T1 cast, T2 binary, T3 comparison, T4 unary, T5 reduce,
T6 where/place) are implemented and green. FuzzMatrix category: 13 passed. Every divergence
is either bit-exact, documented in MisalignedRegistry, or a tracked [OpenBugs] reproduction.
Adds oracle/fuzz_random.py: a seeded, reproducible random generator that draws random
transposed/strided/reversed views (random ndim<=4, dims, stride perms/signs), random dtypes,
and random ops across all tiers, computes the NumPy 2.4.2 result, and writes a committed
corpus the C# harness replays. FuzzRandom replays a 2000-case seed (1234); FuzzRegression
replays shrunk soak failures from corpus/regressions/ (empty until the soak finds one).

Shrinker.cs minimizes a failing element-wise case to a single 0-D element using the corpus's
own expected byte at the diverging index (no NumPy needed); RunCorpus emits the minimal repro
in the failure message so a soak find is immediately committable. A self-test proves the
shrunk case reproduces the divergence (bool add True+True).

The random fuzzer is scoped to NumSharp-PRODUCIBLE layouts: NumSharp's slicing normalizes
offset into the storage base (Shape.offset stays 0) and keeps consistent size-1 strides, so
reconstructing NumPy's RAW offset!=0 / junk-size-1-stride representation exposes ops that
assume the normalized form (real but unreachable via the API; tracked as task #11). With that
scoping the random gate is green; it surfaced three NEW divergence classes the fixed matrices
missed (tasks #11, #12):
  bool arithmetic    True+True -> NumSharp 2 vs NumPy bool True.
  size-1 collapse    NumSharp collapses an all-size-1 result to 0-D; NumPy keeps [1].
  complex binary     multiply (and divide on some data) catastrophic cancellation + ~1 ULP.

Shape divergences now route through the classifier too. FuzzMatrix category: 16 passed.
…EADME

The FuzzMatrix differential gate (cast/binary/comparison/unary/reduce/where/place matrices +
seeded FuzzRandom + FuzzRegression) already runs on every push/PR: the tests are
[TestCategory("FuzzMatrix")], not excluded by the CI filter, and replay the committed offline
corpora with no Python. Documented that in build-and-release.yml's filter comment.

Adds .github/workflows/fuzz-soak.yml: a nightly (04:00 UTC) + manual job that installs NumPy
2.4.2, sweeps seeds through oracle/fuzz_random.py (~200k cases/seed, ~1M/night), and replays
each through the harness. On a divergence the shrunk minimal repro is printed and the failing
corpus uploaded as an artifact, to be pinned under corpus/regressions/ for FuzzRegression.

Adds test/NumSharp.UnitTest/Fuzz/README.md: how the offline oracle/replay model works, corpus
regeneration commands, the CI cadence, and the full documented-divergence ledger with task refs.

Completes Plan A (A0 harness, A1 T1-T6 op matrices, A2 random fuzzer + shrinker, A3 CI). Full
suite (net10.0, !OpenBugs&!HighMemory): 0 failed / 9421 passed; FuzzMatrix: 16 passed.
Adds docs/FUZZ_FINDINGS.md: every NumSharp-vs-NumPy-2.4.2 divergence the differential fuzzer
surfaced, each bit-exact verified, with minimal NumPy-vs-NumSharp reproductions, root cause where
known, and disposition (FIXED / BUG / INTENDED / SCOPING) mapped to tasks #7-#12.

22 findings: 1 fixed (complex->bool), 18 confirmed bugs (integer ÷0/mod0, float //0,
mixed-precision mod, complex power; NaN <=/>=; NEP50 unary promotion, negative-unsigned,
reciprocal, complex unary; reduction NaN propagation, complex-axis throw, bool min/max,
summation precision, result dtype; complex where throw; bool arithmetic, size-1 collapse,
complex binary cancellation), 2 intended Misaligned (NEP50 weak-scalar, complex-divide ULP),
1 scoping note (ops vs raw offset!=0 / junk size-1 strides).
…r behaviors)

Adds docs/FUZZ_PLAN_NEXT.md: an actionable plan for the two open coverage rows.

Part 1 — #2 sections C (per-operand flags) + E (composite paths): black-box, extends the existing
differential harness with out= / where= params and operand-relationship layouts (aliased,
overlapping, write-masked, out-strided/broadcast/cross-dtype). C5 + error parity fold in test-kind
#4. Prerequisite: OpRegistry/corpus support for a pre-seeded out operand and a mask.

Part 2 — #3 NpyIter behaviors (= section D, white-box): port NumPy's test_nditer.py (~106 fns).
B0 coverage ledger -> B1 expose iterator state (ChosenOrder/CoalescedNDim/IsBuffered/InnerStrides)
-> B2 order+coalescing (highest value, guards the KEEPORDER stride-sort) -> B3 buffered casting ->
B4 op_axes/remove_axis/ranged (triage feature gaps) -> B5 nditer-config error parity. Acceptance:
every test_iter_* green / [Misaligned] / tracked [OpenBugs] - nothing untracked.

Plus an adjacent-gaps appendix (op breadth incl matmul/manipulation/bitwise/nan-aware, order=/dtype=/
ddof params, SIMD-tail shapes, leak/perf/metamorphic) and an effort-shape table.
…ion north star)

Adds docs/ROADMAP.md consolidating the entire path from the current baseline to the
/np-function goal (every np.* bit-identical to NumPy 2.4.2 OR >=1.5x faster), correctness-first
with the differential fuzzer as the regression net for the perf refactor.

Five phases:
  0  Baseline (Plan A done) — harness, T1-T6 matrices, random fuzzer, CI gate.
  1  Correctness backlog — fix the 22 findings, grouped by shared root (F1 div-by-zero, F2 NaN,
     F3 NEP50 promotion, F4 unsupported throws, F5 complex algorithms, F6 bool, F7 size-1 shape,
     F8 summation precision, F9 representation). Each fix flips its classifier branch and re-arms
     the gate; F5/F8/F9 are implement-vs-document judgment calls.
  2  Coverage breadth — 2A finish #2 sections C+E (out=/where=/aliasing/overlap/mask), 2B op
     tiers T7-T15 (manipulation, matmul/dot, bitwise, nan-aware, cumulative, stat, logic,
     sorting, multi-output) = the ~75 untested transformation ops.
  3  NpyIter behavioral parity (#3 / Plan B) — port test_nditer.py; de-risks Phase 5.
  4  Depth — params (order=/dtype=/ddof/axis), SIMD-tail & large shapes, error parity (#4),
     unmanaged lifecycle (#5), metamorphic invariants (#7).
  5  Performance — the >=1.5x-NumPy mission (#6): the DirectILKernelGenerator -> ILKernelGenerator
     (NpyIter-driven) migration in CLAUDE.md priority order, benchmark ledger, perf CI gates;
     the differential matrices keep the kernel rewrite from breaking parity.

Includes a dependency graph, value-weighted recommended order, and an effort-shape table. Detail
for phases 2A/3 lives in FUZZ_PLAN_NEXT.md; findings in FUZZ_FINDINGS.md.
… (Phase 1 F1)

Ports NumPy's division/modulo C kernels into a new NpyDivision helper and routes
both IL emission paths through it, fixing the floor_divide/mod parity defects the
differential fuzzer surfaced (FUZZ_FINDINGS #2, #3, #4 — all now bit-exact).

Root cause
----------
The old EmitFloorDivideOperation/EmitModOperation emitted inline IL that:
  * threw DivideByZeroException for unsigned integer ÷0 (and returned a saturated
    sentinel / the raw dividend for signed int ÷0 and mod-0);
  * force-converted float //0 to NaN via EmitFloorWithInfToNaN, losing NumPy's
    ±inf (sign-of-dividend) result and mis-handling inf//x, x//inf, overflow;
  * computed signed-integer floor division by round-tripping through double,
    losing precision for large int64 and skipping NumPy's snap-to-nearest.

Fix
---
New src/NumSharp.Core/Utilities/NpyDivision.cs replicates NumPy exactly:
  * Integer floor_div_@TYPE@ (loops_arithmetic.dispatch.c.src): ÷0 -> 0, signed
    floor toward -inf, MIN/-1 wraps to MIN (overflow), sub-int types compute in
    the int domain so the #DE trap can't fire.
  * Integer remainder (loops_modulo.dispatch.c.src): ÷0 -> 0, floored (Python)
    sign convention, MIN % -1 == 0.
  * Float npy_floor_divide / npy_remainder (CPython divmod port in
    npy_math_internal.h.src): fmod -> sign-fixup -> snap-to-nearest; b==0 returns
    a/b (±inf / nan), never a forced NaN. Computed in float32 for Single to stay
    bit-exact.

Both the same-type generic path (DirectILKernelGenerator.Binary.cs
EmitFloorDivideOperation<T>) and the resultType path (DirectILKernelGenerator.cs
EmitFloorDivideOperation/EmitModOperation) now emit a single Call to the matching
helper, mirroring the existing NpyIntegerPower pattern. Decimal/Half/Complex are
unaffected (handled by their own emit paths before this point).

Verification
------------
  * 26 hand-picked edge cases (0.7//0.1==6.0, -2.0//inf==-1.0, inf//2.0==nan,
    1e308//1e-300==inf, MIN//-1, MIN%-1, signed/unsigned ÷0) all bit-exact.
  * mod(f32,f64) now promotes to float64 and matches NumPy bit-exact (#4).
  * binary_divmod_power.jsonl promoted from [OpenBugs] to [FuzzMatrix]: floor_divide
    and mod are now CI-gated bit-exact; the only remaining excused divergence is
    complex power (~ULP/edge), documented for Phase 1 F5.
  * net10.0 FuzzMatrix gate: 16/16 green.

Clears FUZZ_FINDINGS #2, #3, #4 (the reciprocal-int part of F1 tracks with the
unary batch). Complex power (#5) reclassified BUG -> F5 (algorithmic, deferred).
…se 1 F2)

The scalar comparison kernel emitted 'a <= b' as '!(a > b)' using the ordered Cgt
opcode (Clt for '>='). For a NaN operand the ordered compare yields false, so the
negation produced **true** — but IEEE/NumPy make every ordered comparison with NaN
false (only '!=' is true). FUZZ_FINDINGS #6, surfaced by the comparison matrix
(122 cases, all on the NaN element).

Fix: for float operands use the UNORDERED compare (Cgt_Un / Clt_Un). With NaN these
yield true, so the negation is false — correct. The unordered opcodes were already
used for unsigned-integer comparison, so the change is just isUnsigned -> (isUnsigned
|| isFloat); signed integers keep the ordered Cgt/Clt (no NaN domain). '<' and '>'
were already correct (Clt/Cgt already yield false for NaN), as were '==' / '!='.

The SIMD path (Vector.LessThanOrEqual/GreaterThanOrEqual) was already IEEE-correct,
so only EmitComparisonOperation (scalar/strided/tail) needed the change.

Verified bit-exact on scalar (1-D), SIMD (NaN at lanes 0 and 17 of a 32-wide array),
strided (every-2nd view), and float32 paths. The MisalignedRegistry NaN-ordering
branch is removed so the comparison matrix now verifies this bit-exact; net10.0
FuzzMatrix gate 17/17 green. Clears FUZZ_FINDINGS #6.
…ase 1 F4)

np.negative(NDArray) called the legacy Regen-templated NDArray.negative(), a
per-dtype switch that threw NotSupportedException for every unsigned dtype (Byte,
UInt16, UInt32, UInt64, Char) and read @out.Address (so non-contiguous operands
also threw 'Can't return a memory address when sliced/broadcasted'). NumPy negates
unsigned integers by two's-complement wrap: np.negative(uint8(1)) == 255.
FUZZ_FINDINGS #8.

Fix: route np.negative through nd.TensorEngine.Negate(nd) — the same engine path
the unary '-' operator (NDArray.Primitive.cs) and nd.negate() already use. Its IL
kernel emits the two's-complement (~v + 1) for unsigned types and drives strided /
non-contiguous operands through NpyIter. The legacy hand-written nd.negative() is
left untouched (still the public method) but is no longer on the np.negative path.

Verified bit-exact vs NumPy on contiguous and strided uint8/uint16/uint32 (e.g.
[0,1,2,255] -> [0,255,254,1]) and unchanged for signed/float. The MisalignedRegistry
'negative on unsigned throws' branch is removed so the unary matrix verifies it
bit-exact. Clears FUZZ_FINDINGS #8 (reciprocal non-contiguous, the sibling Address
bug in #9, tracked separately).
…cs (Phase 1 F3a)

NumSharp promoted every integer input to float64 for the float-producing unary
ufuncs (sqrt/cbrt/exp/log/trig/...), regardless of input width. NumPy NEP50 uses
WIDTH-BASED promotion: bool/int8/uint8 -> float16, int16/uint16 -> float32,
int32/uint32/int64/uint64 -> float64 (float/complex preserved). FUZZ_FINDINGS #7.

Fix: new DefaultEngine.ResolveUnaryFloatReturnType implements the width-based rule;
the 20 transcendental Default.<Op>.cs files (ACos/ASin/ATan/Cbrt/Cos/Cosh/Deg2Rad/
Exp/Exp2/Expm1/Log/Log10/Log1p/Log2/Rad2Deg/Sin/Sinh/Sqrt/Tan/Tanh) now call it
instead of ResolveUnaryReturnType (which widened to float64 via GetComputingType).
The int->Half / int->Single unary kernels already existed, so this is a pure
result-dtype routing change; int32/int64 are unaffected (already float64).

The dtype-PRESERVING ufuncs (square/floor/ceil/trunc/round/reciprocal) are
intentionally left on the old resolver pending F3b (they need integer identity /
x*x / int-reciprocal kernels to preserve the integer dtype).

Verification:
  * 364 of 494 unary dtype divergences clear bit-exact; the remaining 130 are the
    F3b-pending preserve-dtype ops, now scoped in MisalignedRegistry so a
    transcendental promotion regression fails the gate (not silently excused).
  * Half/Single transcendental values stay within 2 ULP of NumPy's float16/float32
    libm (documented algorithm difference, same class as complex-divide ULP).
  * DtypeCoverageTests.Sqrt_IntegerDtypes updated: it asserted the old uniform
    float64 (codified the bug); now asserts the NumPy width-based dtype per input.
  * Full net10.0 suite green: 9422 passed / 0 failed; FuzzMatrix 17/17.

Clears the transcendental half of FUZZ_FINDINGS #7 (and #15 dtype for those ops).
…Phase 1 F2-reductions)

NumPy's (non-nan*) min/max PROPAGATE a NaN operand; NumSharp's SIMD reduction used
hardware Vector.Min/Max (MINPS/MAXPS), which drop NaN and return the other lane, so
np.min([nan,-inf,1]) gave -inf instead of nan. sum/mean/std/var already propagated
(NaN arithmetic). FUZZ_FINDINGS #11.

Fix: emit/compute the NaN-propagating form for float/double Min/Max —
  ConditionalSelect(Equals(acc,acc) & Equals(v,v), MinMax(acc,v), acc + v)
— hardware min/max where both lanes are ordered (non-NaN), and acc+v (= NaN) where
either is NaN (the self-equality mask is the standard SIMD NaN test). Applied at
every combine site of the FLAT reduction:
  * EmitVectorBinaryReductionOp — unrolled body, pairwise tree merge, remainder
    (DirectILKernelGenerator.Reduction.cs);
  * EmitVectorReductionOp — the GetLower/GetUpper horizontal tree-reduce fold;
  * CombineVectors256/128 — the C# axis-SIMD/gather helpers
    (DirectILKernelGenerator.Reduction.Axis.Simd.cs); the typeof(T) guard is a
    JIT-time constant, so integer Min/Max keep the single-instruction path.
The scalar tail / lane combine already used Math.Min/Max (which propagate).

Verified: flat np.min/np.max now propagate NaN for every array size 3..257 and every
NaN position, double and float32; integer min/max unchanged; full net10.0 suite green
(9422 passed / 0 failed).

Scope: the axis (vertical/strided) SIMD min/max in CreateAxisReductionKernelTyped has
a further combine site that still drops NaN (e.g. np.min(a2d, axis=0)); the classifier
now excuses ONLY axis!=null NaN-propagation, so the flat path is gate-verified and a
flat regression fails. Partially clears FUZZ_FINDINGS #11 (flat); axis tracked under
F2-reductions.
…Phase 1 F6 + F4)

Two independent parity defects the differential fuzzer surfaced, both now bit-exact
and CI-gated (their MisalignedRegistry branches removed).

F6 — bool arithmetic (FUZZ_FINDINGS #17). NumPy's bool dtype has no integer add/
multiply loop: '+' is logical OR, '*' is logical AND, so True + True == True (raw
byte 1). NumSharp did byte arithmetic and stored 2 in a bool slot. Fix: in
ExecuteBinaryOp, when both operands are bool (resultType == Boolean) remap Add ->
BitwiseOr and Multiply -> BitwiseAnd before kernel dispatch, so every SIMD/scalar
and same-type/mixed path writes a normalized 0/1 byte. '-' has no bool loop and
keeps throwing like NumPy. Verified scalar + 32-wide SIMD.

F4 — complex np.where (FUZZ_FINDINGS #16). where(cond, x, y) threw
'NotSupportedException: Zero-push unsupported for Complex' whenever the promoted
result was complex: NpyExpr.EmitPushZeroPublic (WhereNode pushes a typed zero for the
unselected branch) had no Complex case. Both-complex where already worked; only the
MIXED promotion where(cond, complex, real) hit it. Fix: add Complex (Complex.Zero)
and Half cases. Now where([T,F,T], complex, float) == [(1+2j),(8+0j),(5+6j)] bit-exact.

Harness: Shrinker_MinimalCaseReproducesDivergence used bool True+True as its fixture
(now fixed, so it no longer diverges); retargeted to a synthetic int32-add case with a
planted-wrong expected buffer, so it exercises the shrinker mechanism without
depending on a live bug.

Verified: net10.0 FuzzMatrix 17/17; full suite 9422 passed / 0 failed. Clears
FUZZ_FINDINGS #16 and #17.
…ank operand (Phase 1 F7)

Shape.Broadcast's size-1 fast path treated a 1-D [1] (NDim==1, size==1) identically
to a 0-D scalar and broadcast it to the OTHER operand's dimensions. When that other
operand had fewer dimensions, [1] adopted the lower rank — so [1] + 0-D scalar gave
shape [] instead of [1] (a dropped dimension). The path was asymmetric: scalar + [1]
was already correct. FUZZ_FINDINGS #18 (found by the random fuzzer).

Fix: guard each size-1 collapse on the other operand having at least equal rank —
left collapses only when rightShape.NDim >= leftShape.NDim, and symmetrically for the
right — so the result keeps ndim == max(ndims), matching NumPy. The [1] -> higher-rank
broadcasts (e.g. [1] + [2,3] -> [2,3]) are unchanged (guard is satisfied there).

Verified: [1]+scalar -> [1], scalar+[1] -> [1], [1]+[1] -> [1], [2]+scalar -> [2],
[1]+[2,3] -> [2,3], [3]+[1] -> [3], [2,1]+[3] -> [2,3], [[1]]+scalar -> [1,1],
scalar+scalar -> []. Shape.Broadcast is core (used by every broadcasting op); full
net10.0 suite green (9422 passed / 0 failed), FuzzMatrix 17/17. Clears FUZZ_FINDINGS #18.
…al-noncontig (Phase 1)

Two findings turned out PARTIALLY resolved in committed code; tightened the classifier
to verify the resolved sub-cases bit-exact while excusing only the precise residuals.

FUZZ_FINDINGS #12 (complex axis reduction): reducing a complex 2-D+ array along an axis
(sum/mean/prod/std/var/min/max) now works bit-exact (std(complex2d, axis=0) -> real
float64, verified without the uncommitted NpyIter.cs). Only a 1-D complex array reduced
along its only axis still throws 'NDCoordinatesAxisIncrementor with a vector shape'.
The classifier now excuses just that 1-D Threw case; the 2-D matrix is gate-verified.

FUZZ_FINDINGS #9 (reciprocal non-contiguous): reciprocal on a non-contiguous FLOAT
operand (transposed/strided) now works. reciprocal(int) still returns the float64
reciprocal (vs NumPy's integer ÷0 sentinel) and still throws on a non-contiguous
integer operand (the int->float path needs a flat Address) — both pending F3b. The
classifier scoped to integer input (value diff or non-contig throw); float non-contig
is gate-verified.

No production code change — corpus replay confirmed the resolutions are in committed
code. net10.0 FuzzMatrix 17/17 green. Documents the exact resolved-vs-pending split
for #9 and #12.
…l matrix (Phase 2B)

Built the T8 linear-algebra differential corpus (matmul/dot/outer, 408 cases across the
gufunc shape space x 6 dtypes x C/F layouts). It immediately surfaced three real bugs —
all now fixed and bit-exact CI-gated:

1. np.matmul only handled 2-D x 2-D. Every other NumPy-supported shape was broken:
     matmul(1d,1d) -> Debug.Assert crash; matmul(2d,1d) -> [n,1] (extra dim);
     matmul(1d,2d) -> throws 'core dimension mismatch'; batched (b,n,k)@(b,k,m) -> throws
     'cannot be broadcast'.
   Rewrote Matmul to the NumPy gufunc (n?,k),(k,m?)->(n?,m?): promote 1-D operands
   (lhs->[1,k], rhs->[k,1]), run a batched matmul that broadcasts ONLY the stack dims
   (the old path broadcast the full shapes incl. the matrix dims, hence the throw) and
   does a 2-D MultiplyMatrix per batch element, then squeeze the inserted dims. Verified
   bit-exact vs NumPy for 2-D, 1-D@1-D, 2-D@1-D, 1-D@2-D, 3-D/4-D batched, broadcast batch.

2. np.squeeze(arr, axis) over-collapsed: after removing the named axis it also collapsed a
   remaining length-1 dim to 0-D (squeeze([1,1],0) -> [] instead of [1]), diverging from
   NumPy and breaking matmul's 1-D-promotion squeeze. Fixed squeeze_fast to drop ONLY the
   named axis (collapse to scalar only when no axis remains).

3. np.dot(N-D, 1-D) returned the wrong dtype: it did np.sum(a*b, axis=1), which used the
   reduction accumulator (int32 -> int64, uint8 -> uint64) and a hard-coded axis=1. NumPy's
   dot preserves the input dtype and contracts the last axis. Routed the N-D-dot-1-D case
   through Matmul (same contraction, correct dtype + axis).

Harness: OpRegistry gains matmul/dot/outer; gen_oracle gains a  mode (gen_matmul
with C/F operand layouts via the C-contig-base + transposed-view pattern); new
matmul.jsonl corpus + [FuzzMatrix] Matmul test.

Verified: matmul.jsonl 408/408 bit-exact; net10.0 FuzzMatrix 18/18; full suite 9423/0.
Nucs added 25 commits June 20, 2026 19:57
Continues SubwordCopy into the cross-WIDTH family: {i16,u16,char} -> {i8,u8}
(low-byte truncate) and {i16,u16,char} -> bool (!=0) for strided views. 2-byte
sources are not VPGATHER-eligible, so TryGetIntToNarrowStridedKernel (4/8-byte
gather) declines them and the generic NarrowInt IL kernel fell to a scalar inner
loop on a non-unit inner stride. All 36 domain cells (3 src x 3 dst x 4 strided
layouts) now win, clean best-of-7:

  i16|strided|i8  0.76 -> 1.50   char|strided|i8  0.79 -> 1.73
  i16|strided|bool 0.65 -> 2.10  char|strided|bool 0.68 -> 2.39
  i16|negcol|i8   0.89 -> 2.73   i16|negcol|bool  ~   -> 4.63
  sliced/negrow (ss==1 path) 1.19-5.x — NO regression vs the prior winners.

Kernel (DirectILKernelGenerator.Cast.SubwordNarrow.cs), two-stage SIMD reusing
the SubwordCopy shuffles, inner loop inlined in the odometer (per the SubwordCopy
PERF NOTE — a per-row helper costs ~6x):
  * ss==2  deinterleave even 2B elems (VPACKUSDW+VPERMQ), then narrow;
  * ss==-1 reverse 2B elems (VPSHUFB+VPERMQ), then narrow;
  * ss==1  contiguous truncating Vector256.Narrow (32/iter — the sliced/negrow
    path the generic kernel already won, kept fast so no regression);
  * narrow step: int = mask low byte + VPACKUSWB (== unchecked (byte) == NumPy
    wrap); bool = AndNot(CompareEqual(v,0), 1) then pack.
  f16 source is excluded (float->int is a value truncate, kept on the Half
  kernels); only C-NON-contiguous layouts reach here (C-contig uses
  TryGetCastKernel, untouched). Routed in TryGetStridedCastKernel after
  SubwordCopy, ahead of the int-narrow / to-bool resolvers.

Validation
  * BIT-EXACT vs NumPy 2.4.2: 208/208 hashes (26 byte-copy + narrow pairs x 8
    layouts), base arange%65521 (incl. 0 for the bool zero-branch + diverse high
    bytes for the truncation) — subword_cross_npref.py / subword_cross_verify.cs.
  * Standalone math proven 0-diff + benchmarked (subword_narrow_poc.cs).
  * Full suite green: 9956 passed / 0 failed (net10.0, excl OpenBugs/HighMemory).

Note: scoreboard refresh deferred — this session's repeated sweeps ran in a
degraded environment (accumulated dotnet build-server/run processes inflating the
best-of-3 NS times ~7pp globally; the clean baseline committed at 9aea3ac is
93.4%). Per-cell wins above are best-of-7 (min over rounds, robust to load); the
full-matrix refresh should be re-run on a clean box.
…+ plan docs)

Reflects the Wave 16/16b/16c kernels in the project docs:

- .claude/CLAUDE.md: DirectILKernelGenerator partial count 60 -> 62; Cast & Copy
  list adds .Cast.SubwordCopy.cs + .Cast.SubwordNarrow.cs; the cast coverage row
  now documents the sub-word strided/negcol SIMD lane shuffles (VPACKUS
  deinterleave / VPSHUFB reverse for same-type & same-size-bit copies; +low-byte/
  !=0 narrow for 2B-int->{1B,bool}) and the inline-in-odometer perf rule (a per-row
  helper costs ~6x). The old "2-byte/1-byte strided stay scalar" caveat is removed.

- CAST_BEAT_NUMPY_PLAN.md: new §12 (Wave 16 table + kernels + the inline gotcha +
  208/208 bit-exact + the best-of-3/throttle measurement caveat + the noise
  finding); §11 residual list updated (sub-word strided no longer hardware-limited).

- CAST_REMAINDER_133_PLAN.md: §0 banner marking §2 CLOSED by Wave 16 and confirming
  the §6 noise hypothesis (true >=0.9 ~96%, not the sweep headline); remaining real
  laggards narrowed to §3 (alloc-bound), §4 (AVX512 i64/u64->f16), §5 (memory-bound),
  + a couple genuine large-type cells.

Scoreboard refresh deferred to a clean environment (this session's repeated sweeps
ran degraded by accumulated dotnet processes; per-cell wins are best-of-7, robust).
…6 shipped)

Full cast-matrix sweep on a settled environment (no stray dotnet processes):
1438/1568 (91.7%) >=0.9, 1374 (87.6%) win >=1.0, geomean 1.705. The net count vs
the prior 93.4% snapshot is within the best-of-3 noise band (~140 narrowly-winning
cells jitter ±0.9 run-to-run); the Wave 16/16b/16c cells now win in this run.

Confirmed via clean best-of-7 (robust to load) that the sweep's in-domain LAGs are
noise: char|strided|i16 1.66, char|strided|u16 1.84, char|strided|u8 1.82 (sweep
read 0.77-0.88). The remaining sub-word strided LAGs are a DIFFERENT, unaddressed
family: 1B->2B WIDENING (i8/u8/bool -> i16/u16/char) at 0.55-0.91 (the inverse of
Wave 16c narrowing) — the natural next target.
The inverse of Wave 16c: {bool,u8,i8} -> {i16,u16,char} for strided views. 1-byte
sources aren't VPGATHER-eligible, so the generic WidenInt IL kernel fell to a scalar
inner loop on a non-unit inner stride (the i8|strided|i16 0.73 / bool|strided|i16
0.66 / u8|strided|char 0.62 cliff). All 36 domain cells (3 src x 3 dst x 4 strided
layouts) now win, clean best-of-7 / best-of-3 sweep:

  i8|strided|i16  0.73 -> 2.09   bool|strided|i16 0.66 -> 4.17
  u8|strided|u16  0.67 -> 2.61   bool|strided|u16 0.67 -> 7.03
  i8|strided|char 0.71 -> 2.94   u8|strided|char  0.62 -> 2.96
  + negcol/sliced/negrow (ss==1/-1) 1.9-3.5x — sliced/negrow already won via the
  generic kernel (1.1-1.8x); the new ss==1 path is as-fast-or-faster (no regression).

Kernel (DirectILKernelGenerator.Cast.SubwordWiden.cs), two-stage SIMD reusing the
SubwordCopy shuffles, inner loop inlined in the odometer:
  * ss==2  deinterleave even bytes (VPACKUSWB+VPERMQ), then widen;
  * ss==-1 reverse bytes (VPSHUFB+VPERMQ), then widen;
  * ss==1  contiguous Vector256.WidenLower/Upper (32/iter);
  * widen step: signed src (i8) = WidenLower/Upper on sbyte = SIGN-extend; unsigned
    src (u8,bool) = on byte = ZERO-extend. The dst signedness (i16/u16/char) is
    irrelevant — same 16-bit pattern (i8 -56 -> 0xFFC8 == int16 -56 / uint16 65480).
  dst==Half excluded (int->float value cast, kept on the Half kernels). Routed in
  TryGetStridedCastKernel after SubwordNarrow; C-contig untouched (TryGetCastKernel).

Validation
  * BIT-EXACT vs NumPy 2.4.2: 280/280 hashes (35 pairs x 8 layouts), incl. i8 sign-
    extend across -128..127 (subword_cross_npref.py / subword_cross_verify.cs).
  * Standalone math 0-diff + benchmarked (subword_widen_poc.cs).
  * Full suite green: 9977 passed / 0 failed (net10.0, excl OpenBugs/HighMemory).

Note: scoreboard refresh deferred (this session's sweeps degraded ~7pp by accumulated
dotnet processes; the 36 widening cells win even in the degraded best-of-3).
… (bcast-reduce ~700x NumPy)

The flat reduction's broadcast handling (DefaultEngine.ExecuteElementReduction) folded each
stride-0 (broadcast) axis by ITERATING its D identical copies via ExecuteAxisReduction —
cache-hot but still O(D×unique), landing at ~1.0x NumPy (the prior "50x->0.9x" fix). A
stride-0 axis repeats its data exactly, so reducing over it is CLOSED-FORM — nothing to
iterate:
    sum -> *D,  prod -> ^D,  min/max -> identity   (mean = sum/count, in the caller).
Drop every broadcast axis to a non-broadcast view of the UNIQUE data (index 0 along each
stride-0 axis), reduce THAT once via the existing fast contiguous/IL-SIMD path, then apply
the collapsed multiplicity to the scalar. O(unique), not O(D×unique).

Measured (8192-unique x 1024 broadcast = 8.4M logical, -c Release, best-of-9):
  before (iterate fold) : 1.10   ms  (1.03x NumPy)
  after  (algebraic)    : 0.0016 ms  (~700x NumPy) — AT the floor: the SIMD sum of the
                          8192-row alone is 0.00149 ms; C# dispatch overhead is 0.11 us.
No per-element kernel can approach this: a perfect contiguous SIMD reduce of the
materialized 8.4M is 2.16 ms = 1350x slower — it pays O(D×N) while the collapse is
O(unique). IL/NpyIter/IL-generation are throughput tools for that slower class and cannot
beat the algorithmic win; the inner unique-reduce already rides the IL SIMD kernel.

Correctness:
- Integer & min/max BIT-EXACT (modular */pow compose under truncation; min/max are
  order-independent). Float sum/prod differ only ULP-level — the summation-order class the
  codebase already documents here, well within the broadcast fuzz's 1e-5 tolerance.
- CombineWithCount applies the closed form in the accumulator dtype (IPowS/IPowU wrap like a
  materialized integer reduction; DPow throws on decimal overflow, matching np.prod).
- DropBroadcastAxes returns the unique view CONTIGUOUS (an O(unique) copy only when the
  broadcast wraps a strided view): the prior per-axis fold always finished on a fresh
  contiguous intermediate, and reducing a strided unique view directly tripped a PRE-EXISTING
  sub-word strided-prod bug (np.prod over a strided view returns 0 — collected for filing).

Verified: bcast_consistency fuzz 4689/4727 pass (the 38 fails are all pre-existing — 30
NaN-engine + 8 decimal argmax/argmin — zero on the flat sum/prod/min/max surface this
touches); 586/586 non-OpenBugs reduction unit tests pass (net8 + net10).
…with a radix line kernel

Closes one of the 8 documented Missing Functions (np.sort) and replaces the
pathological argsort. Sorting is structured exactly like NumPy's _new_sortlike /
_new_argsortlike (item_selection.c): an NpyIter "IterAllButAxis" drive hands one
1-D line at a time to a high-performance inner sort kernel.

ARCHITECTURE
- Driver (AxisSort.DriveAllButAxis): NpyIterRef.AdvancedNew with opAxesNDim=ndim-1
  and op_axes listing every axis EXCEPT the sort axis. Dropping the sort axis from
  the iterator stops it coalescing into the contiguous run (the bug that made an
  earlier EXTERNAL_LOOP+CORDER attempt hand the kernel count=20 into an N=5 buffer);
  the kernel then walks each line itself via the axis stride. Handles 1-D..N-D, any
  axis (incl. negative and None=flatten), and strided/transposed/F-contig inputs.
- Inner kernel = native-width LSD radix (RadixSort): u32 core for 1/2/4-byte dtypes,
  u64 core for 8-byte; 8-bit digits, one count + one scatter per byte, double-buffered,
  trivial-pass skip. Stable, so argsort ties resolve in ascending index order
  (== NumPy kind='stable'). Caller fills keys with a monotonic unsigned transform
  (so unsigned ascending order == NumPy value order) and un-transforms on the way out.
- Floats partition NaN to the end before radix (NumPy sorts NaN last); the float->key
  transform is the standard IEEE sign-flip so -inf<..<+inf orders correctly.
- Half/Complex/Decimal route to BCL introsort (Span.Sort) with exact NumPy comparators
  (npysort_common.h): real NaN-last; complex CDOUBLE_LT lexicographic real-then-imag,
  any-NaN-part last. Complex sorts at all now (it has no IComparable).
- argsort always returns int64 indices; the radix carries a parallel index column.

WHY RADIX (POC, i9-13900K, ratios are NumPy_ms/NS_ms, >1 = NS faster)
- int32 sort:    radix 164-217 M/s = 0.72-0.90x NumPy, 8-9x over BCL Span.Sort.
- int32 argsort: radix 87-135 M/s  = 2.1-4.3x FASTER than NumPy (NumPy uses slow
  indirect quicksort; radix carries indices for one extra scatter stream).
- Variation shoot-out @1m int32 sort: radix 197 M/s >> AVX2 vectorized quicksort
  52 M/s (correct, but out-of-place partition + no index path) >> Span.Sort 22 M/s.
  8-bit beats 11-bit (L1 thrash) and beats ulong-widened combined-histogram for
  <=4-byte. A SIMD quicksort could only edge radix on int *sort* and would forfeit
  the argsort win, so radix is the right max-perf, all-dtype, correct-and-stable choice.
  IL-generating the radix gives no gain (memory-bandwidth-bound, not dispatch-bound).

CORRECTNESS (all vs NumPy 2.4.2)
- 82/82 .npy round-trip parity: int32/int64/uint8/int16/float32/float64 x 1-D/2-D/3-D
  x every axis (incl. negative + None) x NaN/+-inf x empty/single/ties/constant.
- Half/Complex/Decimal verified (incl. the explicit complex lexicographic NaN-last case).
- 20 new unit tests (NpSortTests) incl. transposed-view, in-place ndarray.sort,
  3-D argsort-reconstructs-sort. Sort_ApiGap retagged [OpenBugs] -> [Fixed].
- Full suite: 9977 passed, 0 failed.

API
- np.sort(a, axis=-1, kind=null), np.argsort(a, axis=-1), ndarray.sort(axis=-1) in-place.
- ndarray.argsort<T> kept for source compat (delegates; element type from the array).
  Deleted the old LINQ implementation (per-element tuple + NDArray view allocation).
…esentative run)

Full cast-matrix sweep reflecting Wave 16a-d: 1503/1568 (95.8%) >=0.9, 92.5% win
>=1.0, geomean 1.941. The geomean rise from the pre-Wave-16 1.70 is largely the real
Wave 16 contribution — ~100 sub-word strided cells moved from ~0.6-0.9 to 2-7x.

All 36 1B->2B widening cells (Wave 16d) now win 1.9-5.0x; the full sub-word strided
family (copy + same-size + narrow + widen) is done. Remaining <0.9 (~48-65 depending
on run): AVX512-gated i64/u64->f16, memory-bound i64/u64->narrow strided, alloc-bound
1M same-type contig, + a few genuine large-type cells and best-of-3 jitter.

Measurement caveat: the sweep headline oscillates run-to-run (this session saw 87-97%
across back-to-back runs) — degraded by accumulated dotnet build-server/run processes
inflating either the NumSharp or NumPy subprocess. This run (geomean ~1.94) is
representative; per-cell Wave 16 wins are best-of-7 + 280/280 bit-exact, robust to load.
- .claude/CLAUDE.md: partial count 62 -> 63; Cast & Copy list adds
  .Cast.SubwordWiden.cs; coverage row notes 1B-int->2B via deinterleave/reverse +
  sign/zero Vector256.Widen.
- CAST_BEAT_NUMPY_PLAN.md §12: Wave 16d row (1B->2B widen 0.55-0.92 -> 1.9-7.0x) +
  SubwordWiden kernel in the kernels list.
- CAST_REMAINDER_133_PLAN.md §0: banner now covers the WHOLE sub-word family
  (copy + narrow + widen, ~100 cells, 280/280 bit-exact).
…VX2 mask overflow

np.any / np.all on the two non-JIT-vectorizable dtypes (bool, char) fell through to
the scalar per-element loop in AnySimdHelper/AllSimdHelper because
Vector256<bool>/<char>.IsSupported == false. For the benchmarked any(all-false)
case at scale this was the 0.08-0.18x cliff (5-12x SLOWER than NumPy): a full
scalar scan of every element, no SIMD. bool/char are bitwise identical to
byte/ushort for a zero/non-zero test (bool: 0/1; char: UTF-16 code unit), so
reinterpret at the top of both helpers and run the existing vectorized integer
path. (Applies to the direct ReduceBool route AND the IL-emitted reduction route,
which both call these helpers.)

While fixing that, found and fixed a latent CORRECTNESS bug in AnySimdHelper: the
"all lanes zero" reference mask was `(1u << vectorCount) - 1`, which for
Vector256<byte>/<sbyte> (32 lanes) is `(1u << 32) - 1`. C# masks uint shift counts
to 5 bits, so `1u << 32 == 1u << 0 == 1`, making allZeroMask = 0 instead of
0xFFFFFFFF. An all-zero vector then hit `bits (0xFFFFFFFF) != 0` -> returned true on
the FIRST vector, so:
    np.any(np.zeros(1_000_000, dtype=byte)) -> True   (NumPy: False)
byte/sbyte any() was WRONG for any all-zero array >= 32 elements on AVX2 hardware.
Masked until now because the benchmark uses bool (scalar path) and arrays < 32 use
the Vector128 path where the shift doesn't overflow. Widened to
`(uint)((1UL << vectorCount) - 1)` (correct for all 4..32 lane counts). This is also
prerequisite for the bool->byte redirect, which now routes through the byte path.

Result (i9-13900K, Release, best-of-9, vs NumPy 2.4.2), np.any(bool all-false):
    100K  0.18x -> 1.90x     1M  0.08x -> 1.04x     10M  0.09x -> 1.04x
NumSharp now runs at 79-95 GB/s (memory-bandwidth-bound) -- at/above NumPy; the
block-unroll rewrite considered next was unnecessary once at the memory ceiling.

Verified: 33 correctness checks (bool/byte/sbyte/char/int x all-false/all-true/
early-hit/all-zero x sizes 1M/40/10 hitting V256 / V256+tail / scalar paths) pass,
incl. byte all-zero any=False at 1M; 19 NpAnyTest/np_all_Test pass on net8+net10;
broader reduction sweep green except 3 pre-existing [OpenBugs] (convolve int64,
var/std ddof>n) in unrelated kernels.
…p the broadcast-reduce copy

Follow-up to the broadcast-reduce algebraic collapse. That commit added a .copy() in
DropBroadcastAxes to dodge a "np.prod over a strided view returns 0" bug — a symptom patch.
This removes the copy and fixes the bug at its root so the strided unique view reduces
through NpyIter directly (NpyIter handles non-contiguous via coalesce + stride-ordered
axis-permute, and is the always-faster path — no materialization).

ROOT CAUSE — NpyIter.Execution.DetermineAccumulatorType had a hand-rolled integer-widening
switch that had DRIFTED from the canonical NPTypeCode.GetAccumulatingType (the map the
direct/contiguous reduction path uses):
  - SByte and Char were MISSING -> fell through to `_ => src`, so a strided sub-word
    sum/prod/cumsum/cumprod accumulated in 1-/2-byte lanes and OVERFLOWED. np.prod over a
    strided sbyte/char view returned garbage (e.g. 47 for a true 3375) or 0.
  - Byte was mis-widened to Int64 instead of UInt64 (latent: a UInt64 TResult was returned
    through an Int64 accumulator; positive products happened to fit).
The contiguous path was unaffected because it never uses this method (it passes the
GetAccumulatingType result down explicitly). The bug was dormant until the algebraic collapse
started reducing strided unique views directly.

FIX:
- DetermineAccumulatorType now delegates the Sum/Prod/CumSum/CumProd widening to
  src.GetAccumulatingType() — single source of truth, can never drift again. (Mean/Var/Std
  -> Double and the pass-through for floats/decimal/complex are unchanged; GetAccumulatingType
  preserves Half/Single/Double/Decimal/Complex, so it is an exact drop-in.)
- DropBroadcastAxes drops the .copy(): returns the (possibly strided) unique view as-is; the
  flat reduction routes non-contiguous inputs through NpyIter, now correct.

This also fixes strided sub-word reductions OUTSIDE the broadcast path (any np.sum/np.prod/
np.cumsum/np.cumprod over a strided/transposed sbyte/char view) — the fuzz improved 38 -> 37
fails (the remaining 37 are all pre-existing: 30 NaN-engine + 7 decimal argmax/argmin).

Verified (-c Release):
- strided-prod pinpoint: sbyte/char/int32/int64/double x {1-D strided, transposed, sliced} all
  match the contiguous reference (was sbyte 47/0, char garbage).
- bcast-reduce canary still 0.0021 ms (~534x NumPy), bit-exact, NO copy.
- broadcast fuzz 4690/4727 (zero failures on the sum/prod/min/max surface).
- 2522 unit tests green across reduction/cast/strided/cumulative/scan/statistics/iterator
  (net10, excl OpenBugs).
…e 17, bucket A)

Closes the last large genuine cast laggard: i64->f16 and u64->f16 across all 8
layouts were the only AVX512-gated family left (was 0.68-0.89x NumPy on best-of-3,
falling to the scalar generic emitter). There is NO AVX2 i64->f32 (cvtqq2ps is
AVX512DQ, and this box is AVX2-only: Vector512 not accelerated, Avx512F/DQ/BW all
false), so synthesize the conversion exploiting f16's tiny finite range.

KEY INSIGHT: f16 saturates to +-inf at 65520, so EVERY finite i64/u64->f16 result
has |v| < 65520 << 2^17 -- which fits in i32 *exactly* (well under the 2^24 f32
mantissa limit, so the i32->f32 step never rounds, single RTNE at f32->f16). Any
|v| >= 65520 is inf regardless of the low bits.

KERNEL (DirectILKernelGenerator.Cast.ToHalf.cs):
  - Int64x8ToHalfBits / UInt64x8ToHalfBits: clamp out-of-range lanes to a +-70000
    sentinel (VPCMPGTQ + byte-granular VPBLENDVB; u64 uses the sign-bias unsigned
    compare), so the low-32-bit narrow is exact and the sentinel maps to +-inf.
    Pack 8 lanes' low words via PermuteVar8x32(RtoSel)+GetLower, then the exact
    cvtdq2ps and the already-proven Giesen FloatToHalfBits.
  - BulkInt64ToHalf / BulkUInt64ToHalf: 16-at-a-time contig bulk + scalar tail.
  - Int64ToHalfBits / UInt64ToHalfBits scalar ports (tail + strided conv); finite
    |v| < 2^24 so (float)v is exact, then SingleToHalfBits.
  - Contig (CastInt64/UInt64ToHalfContig) + strided (StridedInt64/UInt64ToHalf via
    the existing StridedNarrowDriver, srcSize=8) wired into TryGetXToHalfKernel /
    TryGetXToHalfStridedKernel (Int64/UInt64 switch cases). No Cast.cs routing
    change -- the X->Half resolvers were already in the chain.

VALIDATION (bit-exact vs NumPy 2.4.2):
  - 190,043 i64 values (every int in the full +-70010 boundary region + 2^16..2^63
    magnitudes + 50K randoms) and 120,024 u64 values: SIMD-diff=0, scalar-diff=0.
  - All 8 layouts x {i64,u64} = 16 SHA256 hashes through the real astype path,
    with boundary-spanning values (-inf/finite/+inf/0): 16/16 bit-exact.
  - Full suite 9977/9977 on net10.0. (The lone net8.0 failure is a PRE-EXISTING
    log1p/complex128 1-ULP/NaN-payload libm difference, confirmed by stash-test;
    unrelated to this cast.)

PERF (best-of-7, canonical (arange%17)+1 base, 1M, NPY/NS):
  i64->f16  C 1.95  F 1.99  T 1.95  sliced 1.46  negrow 1.90  negcol 1.34  strided 1.20  bcast 1.95
  u64->f16  C 1.66  F 2.53  T 2.68  sliced 2.50  negrow 2.57  negcol 1.69  strided 1.55  bcast 2.58
  geomean 1.914x (was ~0.80). Every cell now >0.9; whole bucket beats NumPy.

PoCs: benchmark/poc/i64_u64_half_poc.cs (value proof), i64u64_half_layout_{npref.py,
verify.cs} (8-layout hash proof), i64u64_half_bench.cs (best-of-7 timing).
…(Wave 17, bucket B)

Bucket B was ~30 cells the best-of-3 scoreboard flagged at 0.77-0.90 for
{f32,f64,c128} -> {i64,u64}. A clean best-of-7 re-measure (all 6 src->dst x 8
layouts, canonical (arange%17)+1 base, GC between cells) showed 25 of them were
pure best-of-3 JITTER -- now 0.9-2.5x, many >1.0 (e.g. f32/f64/c128 -> i64 all
contig/F/T/sliced/negrow >=0.95, bcast 2.3-2.5x). float->i64 has no dedicated
kernel (scalar Converts == NumPy's cvttsd2si) and is already at parity.

Only 5 cells were GENUINELY <0.9, all -> u64 on negcol/strided -- and all because
the strided u64 path fell to a gather even when the data is contiguous:

  - f32|negcol|u64 0.88->0.96, f64|negcol|u64 0.86->0.93: negcol ([:, ::-1]) is a
    contiguous-REVERSED row, so load contiguously + VPERMQ/VPERMD lane-reverse
    instead of a -1-stride gather (Cast.FloatToUInt.cs). Reads stay inside the row.
  - f64|strided|u64 0.86->0.97: ::2 deinterleaves even doubles from two contiguous
    loads (Permute4x64 0x08/0xDD + Blend 0x0C). The 2nd vector is loaded at 2i+3
    (not 2i+4) so the max read is 2i+6 = the last needed element L(i+3) -> provably
    no over-read past the row (the gather's only safety advantage, recovered).
  - c128|negcol|u64 0.80->0.96: the reals are interleaved with imags, but for negcol
    the 8 surrounding doubles are contiguous (re3,im3,..,re0,im0), so two loads +
    UnpackLow + VPERMQ-0x72 extract the 4 reals in logical order, beating the
    -2-stride real gather (Cast.Complex.cs FusedComplexToUInt64). Over-read-safe.
  - c128|strided|u64 ~0.87 (memory-bound): the ::2 reals sit every 4th double; a
    contiguous-load alternative reads 2x the data, so the gather is already near
    optimal. Left as-is; straddles 0.9 on jitter.

VALIDATION: {f32,f64,c128} -> u64 across all 8 layouts = 24 SHA256 hashes vs NumPy
2.4.2, with values that exercise the modular wrap (negatives), the 2^63 sentinel
(NaN/+-inf/overflow) and truncation: 24/24 bit-exact. Full suite 9977/9977 on
net10.0 (net8.0's lone log1p/complex128 failure is pre-existing, stash-confirmed).

PoCs: benchmark/poc/float_wideint_bench.cs (best-of-7 6x8 matrix),
float_u64_layout_{npref.py,verify.cs} (24-hash layout proof).
…tude mask (Wave 17, bucket E)

f16|strided|bool was a genuine 0.14x cliff (clean best-of-7: NumSharp 1.53ms vs
NumPy 0.22ms, ~7x slower) -- the scoreboard's best-of-3 0.87 was a lucky sample.
f16 is 2-byte so it is not VPGATHER-eligible; StridedHalfToBool routed through the
generic StridedNarrowDriver, which scalar-converts the ss==2 ([:, ::2]) and ss==-1
([:, ::-1]) inner rows. Wave 16's SubwordNarrow excluded f16 (it needs a magnitude
mask, not a plain low-byte narrow), so f16->bool never got the shuffle treatment.

FIX (DirectILKernelGenerator.Cast.ToBool.cs): a dedicated StridedHalfToBool that
applies the SubwordNarrow deinterleave (ss==2: mask even words -> VPACKUSDW ->
VPERMQ 0xD8) / reverse (ss==-1: VPSHUFB _revWords -> VPERMQ 0x4E) shuffles, then
NumPy's half-truthiness (bits AND 0x7fff) != 0 (so +-0.0 -> False, NaN/inf -> True).
Inner ss==1 rows (sliced/negrow/bcast) run the contiguous magnitude compare;
arbitrary strides (F/T column-gather) and the tail keep the scalar test. Inner loops
inlined in the odometer (a per-row helper costs ~6x -- SubwordCopy PERF NOTE).

RESULT (best-of-7, 1M, NPY/NS): every f16->bool layout now beats NumPy 3.9-7.3x:
  C 4.00  F 3.87  T 4.02  sliced 7.25  negrow 7.01  negcol 6.49  strided 5.27  bcast 6.92
  (strided 0.14 -> 5.27). NumPy's f16->bool is uniformly slow (0.30-0.57ms).

VALIDATION: 8 layouts x SHA256 vs NumPy 2.4.2 with a base carrying +-0.0 / NaN /
+-inf / subnormal / 65504 -> 8/8 bit-exact (-0.0 bits 0x8000 -> False confirmed).
Full suite 9977/9977 net10.0 (net8.0's lone log1p/complex128 failure pre-existing).

Bucket E also RE-MEASURED 6 other scoreboard singletons (f64->u8/i32/char,
u64->f32, c128->u8, u64->u8) -- all were best-of-3 jitter, clean best-of-7 is
1.08-1.87x. PoCs: benchmark/poc/f16bool_{bench.cs,layout_verify.cs,layout_npref.py},
subword_arb_bench.cs.
Scoreboard: 1508/1568 >=0.9 (96.2%, was 95.8%); the i64/u64->f16 (A) and
f16->bool (E) families dropped off the <0.9 list entirely. geomean 1.89 on this
best-of-3 run (clean best-of-7 across the touched buckets is 1.94) -- this sweep
was mildly degraded by accumulated dotnet build-server processes (cannot be
cleared without taskkill), so the float->i64/u64 cells (bucket B) read 0.84-0.89
here despite measuring 0.95-2.5x under clean back-to-back best-of-7 (documented in
the plan + commit faa2754). The scoreboard is best-of-3 and noisy by construction.

Remaining <0.9 (60), by class:
 - ~15 same-type small-dtype contig (bool/u8/i8 -> self, C/T/sliced/negrow/bcast,
   0.16-0.29): NOT a kernel deficiency -- a cache-residency artifact. A warm
   Buffer.MemoryCopy of 1M is 0.0138ms = NumPy's 0.014ms exactly; NumSharp wins
   1.9-3.5x at every cache-busting size (4M/16M/64M). NumPy only wins at 1M because
   CPython refcounting reuses one warm buffer per iteration while .NET GC hands back
   cold ones. Unwinnable without deterministic disposal; left as-is (optimal).
 - float/c128 -> i64/u64 (best-of-3 jitter on this degraded run; clean >=0.95).
 - c128->bool / c128->u64 strided (memory-bound, gather of strided reals already
   near-optimal -- the contiguous alternative reads 2-4x the data).

Plan (CAST_REMAINDER_133_PLAN.md) gets a full Wave 17 section: bucket-by-bucket
A-E with the kernel approach, clean best-of-7 numbers, bit-exact proof counts, and
the cache-residency characterization for C/D so it is not re-chased.
…s+imags (Wave 17)

c128|negcol|bool was a genuine 0.66x (clean best-of-7: NumSharp 0.635ms vs NumPy
0.418ms) -- FusedComplexToBool gathered reals (idx) and imags (idx+1) with a -2-stride
double gather for the reversed inner row. As with c128->u64 negcol, the 8 doubles
around p are contiguous-reversed (re3,im3,re2,im2,re1,im1,re0,im0), so two loads +
UnpackLow/UnpackHigh + VPERMQ-0x72 deinterleave BOTH reals and imags into logical
order, then the existing OrderedEqual !=0 | test. Over-read-safe (p-6 of the last
group is the row's first physical real).

RESULT (best-of-7, 1M): c128|negcol|bool 0.66 -> 1.50x. (c128|strided|bool ~0.9-1.07
is NumPy-jitter / memory-bound -- the ::2 reals+imags sit every 4th double, so the
contiguous alternative reads 4x the data; gather stays.) The other 6 c128->bool
layouts were already 1.08-1.19x.

VALIDATION: 8 layouts x SHA256 vs NumPy 2.4.2 with a base carrying re=0&im=0 (False),
im!=0 (True), NaN/inf (True), -0.0-0.0j (False): 8/8 bit-exact. Full suite 9977/9977
net10.0 (net8.0's lone log1p/complex128 failure pre-existing).

PoCs: benchmark/poc/c128bool_{bench.cs,layout_verify.cs,layout_npref.py}.
np.sort, ndarray.sort and np.argsort were O(N^2) for every 1-D array and
every axis=None call - the most common cases. Root cause: AxisSort's
IterAllButAxis driver (DriveAllButAxis) drops the sort axis via op_axes,
so a 1-D input yields a 0-dimensional iterator (oa_ndim=0, op_axes=[[]]).
NpyIter mis-drives that degenerate shape as `size` single-element
iterations instead of 1, and because each line kernel re-sorts the whole
line (it walks by LineCtx.n, ignoring the per-call count) total work is
N x O(N) = O(N^2).

Measured before fix (i9-13900K, Release), 1-D int32:
  N=10K   sort    514 ms   argsort    768 ms
  N=100K  sort 53,419 ms   argsort 83,132 ms   (~104x per 10x in N = quadratic)
1M+ effectively hung (minutes). Unit tests never caught it (5-12 element
arrays, N^2 trivial) and the radix POC benched RadixSort directly,
bypassing the driver.

Fix: at the single choke point DriveAllButAxis, promote a 1-D operand to
a (1, N) memory-sharing view (np.expand_dims aliases storage, so in-place
sort still mutates the original; a leading size-1 axis leaves the
data-axis stride - and thus LineCtx in/out strides - unchanged). Now
exactly one axis (size 1) is kept, so the iterator makes exactly one
line-kernel call. Multi-D was already correct (one call per line) and is
untouched.

Measured after fix:
  N=100K  sort 0.68 ms (147 M/s)  argsort 1.06 ms (94 M/s)  ~78,000x faster
  1-D throughput now flat across N (quadratic gone).
vs NumPy 2.4.2 argsort: int32 1.47x (1M) / 2.1x (4M) FASTER;
  int64 ~parity; f64 0.6x (1M) -> 0.96x (4M).

Correctness: NpSortTests 20/20 pass; validated 1-D sort/argsort, in-place
on strided a[::2] view, axis=None, int64/f64.

NOTE (separate latent bug, not fixed here): the underlying NpyIter
0-dimensional / empty-op_axes iteration emits `size` iterations instead of
1; this still affects any other 0-dim op_axes use. The sort path now works
around it rather than relying on the iterator's 0-dim behavior.
…ializing a contiguous copy

Removes the copy-to-contiguous anti-pattern (materializing a strided/transposed/
sliced/F-order/broadcast view because a kernel only handled C-contiguous data)
from seven compute paths, replacing each `.copy()`/`.copy('C')` with in-place
strided iteration — the same principle applied to the reduction fold earlier on
this branch ("must use NpyIter, not copy"). Every element is now read through its
own strides in C-order (via a shared `FlatStrideOffset` odometer) or through the
existing IL strided kernel; results are bit-identical to the prior copy path and
verified against NumPy across all dtypes x layouts.

CumSum / CumProd flat (Default.Reduction.CumAdd.cs, CumMul.cs):
  - Both flat scans now share ScanElementwiseFlat, which drives the IL scan kernel
    directly on strided/transposed/sliced/reversed/F-order/3-D inputs. The flat
    scan generator already emitted a correct C-order strided walk
    (EmitScanStridedLoop); the old `cumsum_elementwise(arr.copy())` simply prevented
    it from ever being reached.
  - Enabled CumProd in the IL scan kernel (DirectILKernelGenerator.Scan.cs):
    EmitScanCombine now emits Mul, EmitScanHelperCall maps to a new
    CumProdHelperSameType. cumprod previously had NO IL kernel at all (TryGet
    returned null at generation, always falling to the scalar copy path).
  - Half/Complex accumulators can't be combined with il Add/Mul except same-type-
    contiguous, so for non-contiguous Half/Complex (and decimal cumprod) we keep a
    single C-order ravel — exactly what NumPy does — not a kernel-dodge copy.
  - Base address folds Shape.offset in (sliced views retain it; simple contiguous
    slices bake it into Address). NEP50 accumulator widening preserved.

Var / Std flat (Default.Reduction.Var.cs, Std.cs):
  - The fallback two-pass now iterates the input in place via FlatStrideOffset
    (var/std are order-independent, so any visiting order is valid — mirrors NumPy,
    which reduces strided arrays without copying). Shared moment helpers
    (VarMomentsReal<T> via INumberBase, VarMomentsDecimal, VarMomentsComplex) are
    used by both var and std (std = sqrt). Removes the old double-materialization
    (arr.copy() + Cast-to-double).

Integer Reciprocal (Default.Reciprocal.cs):
  - The 1/x integer loop reads the source through its strides instead of
    materializing sliced/strided/broadcast views (the comment admitted the previous
    raw read "THREW for views"). Broadcast stride=0 is handled naturally.
  - Pre-existing 1/0 semantics preserved verbatim (see known-bug note below).

Left / Right Shift (Default.Shift.cs):
  - Array-operand shift walks both the values and the per-element shift counts
    through their own strides (ExecuteShiftArrayStrided), so broadcast (stride=0),
    transposed, sliced and mixed-layout operands no longer get copied. The
    contiguous fast path still uses the IL scalar-loop kernel.

Clip (Default.ClipNDArray.cs, DirectILKernelGenerator.Clip.cs):
  - New ClipStrided reads src and (array) bounds through their own strides in
    C-order, writing a fresh C-contiguous result — eliminating the copy('C')
    normalizations in the general path, PrepareBound, and the @out path. The
    contiguous fast path still drives the IL SIMD kernel. One strided loop is
    shared across dtypes via Max/Min function pointers.
  - Clamp semantics mirror the IL kernel bit-for-bit: Math.Max/Min for sized
    numerics+decimal, the (now internal) Half/Complex NaN helpers, Char as UInt16,
    and NaN-propagating float/double Max/Min that resolve the signed-zero tie like
    hardware MAXPS/MINPD (maximum(+0,-0) = -0) rather than Math.Max (+0). np.maximum
    /minimum/fmax/fmin are implemented on top of clip, so this keeps their
    signed-zero/NaN results identical to the SIMD path and to NumPy.
  - Fixed a latent allocation bug: result buffers are now built from a fresh
    contiguous Shape, never np.empty(lhs.Shape) which would clone a strided/F-order
    layout and mis-address the linear dst writes.

Verification: per-op fuzz harnesses compare strided/transposed/sliced/reversed/
F-order/3-D/broadcast views against the contiguous ground truth and against live
NumPy across all dtypes (scan 308/308, var-std 660/660, recip+shift 220/220, clip
350/350). Full unit suite green (9987 passed, 0 failed); net8.0 + net10.0 build clean.

Known pre-existing bug surfaced (NOT changed here): NumPy integer reciprocal(0) is
0 for int8/int16/uint8/uint16/uint32 and the sign-bit sentinel only for
int32/int64/uint64 (uint64 -> 2^63); NumSharp uses MinValue for all signed and 0
for all unsigned, so reciprocal(int8/int16 0) and reciprocal(uint64 0) diverge.
This is the existing documented behavior, faithfully preserved by the strided
rewrite; flagged for a separate fix.
…/maximum signed-zero scalar tail

Two NumPy-parity bugs, both surfaced while removing the strided copy-hacks.

Integer reciprocal 1/0 (Default.Reciprocal.cs)
  Re-probed NumPy 2.4.2 bit-exact and deterministic across array sizes / lane
  positions: reciprocal(0) is the sign-bit sentinel 0x80..0 ONLY for the 32/64-bit
  loops (int32 -> int.MinValue, int64 -> long.MinValue, uint64 -> 2^63) and 0 for
  every narrower type (int8, int16, uint8, uint16, AND uint32). NumSharp returned
  MinValue for ALL signed (so int8 -> -128, int16 -> -32768) and 0 for ALL unsigned
  (so uint64 -> 0) — wrong for int8/int16/uint64. Fixed the three per-type 1/0
  results; the other five already matched. (A prior pin had mis-probed int8 1/0 as
  -128; corrected.)

  reciprocal(bool): NumPy promotes bool to the int8 loop -> int8 [True->1, False->0]
  (probed np.reciprocal([True,False]) -> int8 [1,0]). NumSharp took the float loop
  (Double [1, inf]). The dispatcher now maps a bool loop type to int8 and casts the
  input, so bool reciprocal returns int8 with the int8 1/0 = 0 rule. dtype was
  already preserved for the int types (the registry's "widen to float64" note was
  stale).

clip / maximum / minimum / fmax / fmin signed-zero scalar tail
  (DirectILKernelGenerator.Clip.cs, Default.ClipNDArray.cs)
  np.maximum etc. are built on clip. For float32/float64 the SIGNED-ZERO tie resolves
  to the SECOND operand (hardware MAXPS/MINPD: maximum(+0,-0) = -0) — the OPPOSITE of
  float16 (ties to the first operand) and of Math.Max (returns +0). The IL kernel's
  SIMD body already matched NumPy, but its SCALAR TAIL (EmitScalarClamp) used Math.Max
  and so diverged on a signed-zero tie landing in the last partial vector. Added
  NaN-propagating, tie-to-second FloatMaxNaN/FloatMinNaN/DoubleMaxNaN/DoubleMinNaN
  helpers and wired EmitScalarClamp to them for Single/Double; the engine's ClipStrided
  now shares the same helpers (dropping its local copies), so the contiguous SIMD body,
  contiguous scalar tail, and strided path all agree bit-for-bit with NumPy. The Half
  helper (tie-to-first) is correct for float16 and left unchanged.

Tests
  - MisalignedRegistry: removed the two now-obsolete reciprocal excuses (the int->float64
    dtype excuse and the int ÷0 / non-contig-throw value excuse); the unary fuzz corpus
    now verifies reciprocal(int/bool) bit-exact.
  - NewDtypesCoverageSweep: corrected B36_SByte (int8 1/0 -> 0, not MinValue) and added
    B36_Int16 (1/0 -> 0), B36_UInt64 (1/0 -> 2^63), B36_Bool (-> int8 [1,0]) pinning the
    per-type rule.

Verified: contiguous maximum/minimum with -0/+0 across SIMD body AND scalar tail match
NumPy bit-for-bit (f32 and f64); reciprocal(all 8 int types + bool) matches NumPy;
strided reciprocal still equals the contiguous result. Full unit suite green
(9990 passed, 0 failed); net8.0 + net10.0 build clean.
…O(N^2))

CalculateBroadcastShape gated the op_axes branch on `opAxesNDim > 0`, so an
explicit 0-dimensional iteration (opAxesNDim == 0, op_axes = [[]]) fell through
to natural broadcasting and returned the operand's own shape [N]. That made
IterSize = N, so a kernel driven without EXTERNAL_LOOP ran N times instead of
once. NumPy's NpyIter_RemoveAxis leaves a 0-dim iterator with GetIterSize() == 1
(one iteration, data pointer = operand base; any dropped axis is the caller's
responsibility via its saved stride).

This is the ROOT of the 1-D / axis=None sort & argsort O(N^2) addressed at the
symptom level in b4d2128: AxisSort's IterAllButAxis drops the only axis of a
1-D operand -> opAxesNDim == 0 -> the line kernel (which re-sorts the whole line
per call, ignoring the per-call count) was invoked N times.

Fix: treat opAxesNDim == 0 as an explicit 0-dim iteration and return an empty
iteration shape (IterSize == 1). The `> 0` -> `>= 0` gate plus an early
`return Array.Empty<long>()` leaves the opAxesNDim > 0 and the unspecified
(-1, natural-broadcast) paths untouched.

TDD (NpyIterZeroDimOpAxesTests, 10 tests, direct NpyIterRef.AdvancedNew):
- RED against the unfixed iterator: 7 fail / 3 pass, every failure the exact
  symptom (NDim 1 not 0; Calls/IterSize = 100000/1000/500/256/10/20 not 1).
- GREEN after fix: 10/10. Functional cases (single-call whole-line walk,
  in-place fill, two-operand lockstep copy, sliced-offset view) confirm the
  NDim=0 base pointer is offset-correct and read/write-valid; 3 regression
  guards (no-op_axes 1-D -> N calls; drop-axis-1 -> per-row; drop-axis-0 ->
  per-col) hold before and after.

Regression: 42/42 (NpSortTests 20 + NpyIterZeroDimOpAxes 10 + AxisStride 12).

Note: AxisSort's (1,N) promotion workaround from b4d2128 is now redundant (the
sort path could rely on this root fix directly) but left in place; it is correct
and independent.
…iterals (NEP50)

`a * 2`, `a + 5`, `a / 3.0` route through operator(NDArray, object) ->
np.asanyarray(literal), so a Python int literal is wrapped as an int32 0-d scalar
(a float literal as float64). For `f32[] * 2` the operands therefore arrive as
f32 x int32. resultType already reflects NumPy's weak promotion (f32), but the
fast-path gates key off the RAW operand dtypes:
  - TryTrivialContiguousBinaryOp: `if (lhsType != rhsType ...) return null;` bails
    BEFORE the scalar-broadcast arm, and
  - TryExecuteBinaryOpViaNpyIter: sameDtype=false -> per-element-convert mixed loop
    (vector body null).
So the op fell to the heavy mixed path instead of the SimdScalarRight/Left
scalar-broadcast loop it should take. Measured (100K f32, Release):
  a * 2  (int32 scalar) 0.080 ms   vs   a * 2f (f32 scalar) 0.035 ms  -- 2.25x slower
for IDENTICAL work and an identical f32 result. This hit EVERY `array <op>
python-literal` and dominated the 100K-tier elementwise losses in the full benchmark
(`a * 2 (literal) f32 @100K` 0.067x was among the worst non-degenerate op-matrix cells).

Root cause is path-check ORDERING: the same-dtype gate is evaluated first, against
un-promoted operand dtypes, so a case that is effectively same-dtype after NumPy's
weak-scalar promotion never reaches the scalar-broadcast fast path.

Fix (ExecuteBinaryOp, before the trivial bypass): when exactly one operand is a
0-d/size-1 scalar, the OTHER (array) operand already IS resultType, and only the
scalar's dtype differs, materialize the scalar at resultType (one element -- exactly
what NumPy does, casting the scalar to the loop dtype). Both operands then share
resultType and the same-dtype SIMD paths (contiguous SimdScalarRight/Left and the
same-dtype NpyIter SIMD loop, incl. strided) light up. When the ARRAY operand also
differs (e.g. int32[] * 2.5 -> float64) the guard skips: the array genuinely needs a
cast, which is the mixed path's job.

Value-identical: the mixed path converted the same scalar to resultType per element;
this converts it once up front. Verified across f32/f64/i16/i32/i64/u8/i8 (contig +
strided): (a * 2) == (a * same-type-scalar 2), every dtype eq=True; i32 * 2.5 still
-> float64 = 2.5 (guard correctly skips). Post-fix: a*2 ~= a*2f ~= a*s32 (gap closed).
778 binary/arithmetic/promotion/scalar/broadcast + 2345 math/reduction/statistics/
ufunc/comparison/evaluate tests pass on net8.0 + net10.0.

Settled while investigating (the suspected second ordering bug): DivergesFromNumpyCast
is a dead stub (returns false), so the cast *->bool slowness is NOT a gating-order bug
-- it is fixed overhead on a 1-byte output, like the same-type copy diagonal.
These 14 documents were working notes for work that has since shipped and
is covered by durable artifacts (tests, the benchmark scoreboards, and the
website docs). Removed as dead weight:

Audit working notes (findings already fixed or pinned as [OpenBugs] under
test/NumSharp.UnitTest/AuditV2/AuditV2_*.cs, which remain):
  docs/plans/NDITER_BRANCH_QUALITY_AUDIT_V2.md
  docs/plans/audit_v2/01_iterators.md
  docs/plans/audit_v2/02_ilkernel_simd.md
  docs/plans/audit_v2/03_default_math_reductions.md
  docs/plans/audit_v2/04_logic_shape_storage.md
  docs/plans/audit_v2/05_ndarray_creation.md
  docs/plans/audit_v2/06_manipulation_apis_logic.md
  docs/plans/audit_v2/07_math_ops_selection_sorting_stats.md
  docs/plans/audit_v2/08_casting_random_utilities.md

Implementation plans for shipped features:
  docs/OUT_WHERE_NPYITER_FAMILIES_PLAN.md   (out=/where= families — shipped)
  docs/plans/CAST_BEAT_NUMPY_PLAN.md        (cast matrix — shipped)
  docs/plans/CAST_REMAINDER_133_PLAN.md     (cast continuation — shipped)

POC scratch (the real implementation landed in NpyIter):
  benchmark/poc/NPYITER_REDUCTION_PLAN.md
  benchmark/poc/POC_RESULTS.md

Build/tests unaffected — every inbound reference is a prose/comment citation,
not a code dependency. Four stale citations remain to be scrubbed separately:
  src/.../Math/DefaultEngine.BitwiseOp.cs:13   -> OUT_WHERE_NPYITER_FAMILIES_PLAN.md
  src/.../Unmanaged/UnmanagedStorage.Cloning.cs:426 -> CAST_BEAT_NUMPY_PLAN.md
  docs/NPYITER_GAPS_AND_ROADMAP.md:50          -> benchmark/poc/POC_RESULTS.md
  docs/PR611_CHANGELOG.md:210                  -> docs/plans/audit_v2/
…32b0 wave)

The branch advanced ~143 commits past the last changelog refresh. Updated the
doc to represent the current/final state, not the prior snapshot:

Stats: 312 -> 455 commits, 615 -> 806 files, +217,949/-16,402 ->
+234,348/-19,179. Tests: 9,709 -> 9,990 passed / 0 failed (net8.0 + net10.0).

Flipped now-false claims:
- np.sort: was 'remains unimplemented' -> implemented (radix line-kernel on
  NpyIter, stable, NaN-last, all axes/orders); np.argsort reimplemented.
  Missing-functions list narrowed to flip family / diag / gradient / round.
- NDIterator: was 'survives as a thin lazy wrapper' -> deleted entirely
  (interface + class + AsIterator); all iteration via NpyIter/NpyFlatIterator/
  GetAtIndex. Breaking-changes row + TL;DR updated.
- Frontier 'open losses' -> fixed: broadcast-reduce (was 54x slow) now
  ~534-700x NumPy via algebraic stride-0 fold; scalar any/all (bool/char)
  on the integer SIMD path; np.zeros calloc-backed (was ~1000x slow).

Added current-state coverage:
- §5: np.sort row, np.broadcast N-operand parity, six Complex transcendentals
  (sinh..arctan). New-API count 30/35 -> 36 (consistent across TL;DR + §5).
- §7: SIMD strided-cast campaign (astype 15x8x15: 716->~391 lagging,
  852->1,177 wins), np.zeros calloc, broadcast-reduce fold, bit-exact pairwise
  sum, any/all SIMD, Complex/Half/Decimal axis reductions, f16 negate.
- §10: integer reciprocal 1/0 per-width, clip/maximum signed-zero + NaN,
  f16 axis-sum float32 accum, exp2/power-Half/complex-narrowing crashes,
  complex nansum uninit read, AVX2 any() mask overflow, net8.0 complex abs.

Also fixed the dangling docs/plans/audit_v2/ reference (those docs were
removed last commit) -> points at test/.../AuditV2/AuditV2_*.cs instead.

Live PR #611 description updated to match via REST PATCH.
…+ scrub dangling plan-doc refs

The Direct/ kernels carried 24 methods marked [Obsolete("Unused...", error: true)]
as 'pending removal' tombstones. error:true makes any reference a compile error,
so they were provably unreachable. Removed outright across 12 partials:

  Argwhere(1) Binary(1) Cast(1) Comparison(1) Masking.NaN(4) MatMul(1)
  MixedType(1) Reduction.Axis.Simd(8 + orphaned header comment) Reduction.NaN(1)
  Scan(3) Search(1) Unary(1)

The Reduction.Axis.Simd block (Add/Mul/Min/Max 256/128) was superseded by the
type-specialized AddOp*/MulOp* structs (435a07e); the rest by their non-throwing
Get*/TryGet* twins. Net -683 lines, zero behavior change.

Also scrubbed 3 comments/prose that still cited the plan docs deleted earlier in
this branch (kept each substantive note, dropped the dead file path):
  - DefaultEngine.BitwiseOp.cs   (OUT_WHERE_NPYITER_FAMILIES_PLAN.md)
  - UnmanagedStorage.Cloning.cs  (CAST_BEAT_NUMPY_PLAN.md)
  - docs/NPYITER_GAPS_AND_ROADMAP.md (benchmark/poc/POC_RESULTS.md)
PR611 changelog §14 updated: 'pending deletion' -> removed.

Verified: NumSharp.Core builds clean (0 errors); full suite 9990 passed / 0 failed
/ 11 skipped on net10.0 (CI filter). No dangling refs to deleted docs remain.
Dead-code sweep (Roslyn IDE analyzers + whole-repo reference-frequency probe):
each method's name occurs exactly once across all .cs in src/test/benchmark/
examples/tools — i.e. declared and never referenced anywhere, so unreachable.
All private/protected, so scope confirms no external caller. Removed:

  CreateLogicalResultShape         Default.LogicalReduction.cs
  ExpandStartDim, ExpandEndDim     Default.Dot.cs
  ExecuteShiftOpScalar             Default.Shift.cs
  BulkHalfToBoolV, ConvHalfToBool  DirectILKernelGenerator.Cast.ToBool.cs
  FindFirstNaN_Double/_Float       NDArray.unique.Kwargs.cs
  TryUniformConstant               np.pad.cs
  uniqueComplex (protected)        NDArray.unique.cs

Checked for cascade: BulkHalfToBool (non-V) stays — still used by
CastHalfToBoolContig; the removed BulkHalfToBoolV was the unused strided-driver
adapter. No other symbol orphaned.

Verified: NumSharp.Core builds clean; full suite 9990 passed / 0 failed / 11
skipped on net10.0 (CI filter). Zero references to the 10 names remain.
…bsystems)

Full run of benchmark/run_benchmark.py (~3h wall, numpy 2.4.2, Release): the
BenchmarkDotNet op/dtype/N matrix across all 14 comparison suites, the NpyIter
iterator benchmark (+ README cards), and the layout/operand/cast/fusion matrix
subsystems. The committed report files were stale (predated several perf fixes
already in HEAD); this refreshes them. layout_results.md/.tsv are generated for
the FIRST time (the Layout subsystem had never been run on this branch).

Recent fixes now visible in the numbers:
- bcast_reduce canary: 0.02x (51.8x SLOWER) -> 538.56x faster        [cb62e37]
- any(F) reduction:    ~0.1x -> 8.89x/8.41x small-N, ~1x at scale,
                       geomean 2.74x (+ byte/sbyte AVX2 all-zero-mask fix) [c79ed6f]
- np.evaluate F/T:     fusion F 0.11x->1.85x, T 0.12x->1.74x         [16afbac]
- forder_out canary:   0.46x -> 1.28x

Op-matrix summary moved the right way (fewer bad cells):
  before  1851 ops | OK 797  close 289  slow 219  red 81  negl 389  n/a 76
  after   1851 ops | OK 792  close 357  slow 177  red 72  negl 384  n/a 69
NpyIter headline 1.17x -> 1.18x geomean; iterator construction 2.88x -> 3.33x.

Known weak spots persist (NOT new regressions - pre-existing, predate this run):
- left_shift/right_shift @100k ~0.05x (scalar-loop shift path, not vectorized)
- np.sum(f64) @100k 0.074x (np.sum() API dispatch overhead; the isolated
  iterator reduction still wins 2.79x)
- copy/cast + index-math at small N, float16 scalar path (no Vector<Half> in BCL)

Generated against the current working tree (branch HEAD + local engine WIP);
result files only - no source changed.
@Nucs

Nucs commented Jun 23, 2026

Copy link
Copy Markdown
Member Author

📊 Performance snapshot vs NumPy 2.4.2

Full benchmark/run_benchmark.py regen on this branch — the BenchmarkDotNet op/dtype/N matrix (1,851 ops × 1K/100K/10M), the NpyIter iterator harness (+ cards), and the layout/operand/cast/fusion subsystems. Convention throughout is NPY/NS = NumPy_ms ÷ NumSharp_ms, so >1.0× = NumSharp faster. Absolute ms aren't comparable across machines; the ratios are (same box, same run).

Headline: op-matrix geomean 1.18×, scaling to 1.26× at 10M. Of 1,851 ops: 792 faster · 357 within 2× · 177 + 72 slower — and the slow buckets shrank this cycle (🔴 81→72, 🟠 219→177).

🟢 Where it's great (often an order of magnitude)

Area Result
Reductions / stats prod 13.8×, nanstd/nanvar/percentile/quantile 11–13×, axis-sum (int8) ~11× at 10M
dot / matmul dot(f64) 14.2× @100k
Fused np.evaluate (a-b)/(a+b) 4.16×, a*b+c 1.56× — and it now holds on F/transposed (1.85× / 1.74×) and broadcast (3.60×) operands
Broadcast-then-reduce 538× vs NumPy (algebraic axis-collapse)
any / all 2.7–8.4× (SIMD bool/char scan + early-exit)
astype matrix 1,439 / 1,568 cells win; per-layout geomean 1.47–2.21×; the old float→narrow-int cliff is now a win (f16→narrow 3.77×)
Iterator construction 3.3× vs np.nditer

🔴 Where it's still bad (the honest worklist)

Area Result Cause
left_shift / right_shift @100k ~0.05× (≈20× slower) array-by-scalar shift runs a scalar loop, not vectorized
np.sum(f64) @100k 0.074× np.sum() API dispatch overhead (~200µs); the raw iterator reduction still wins 2.79×
copy/cast & index-math @ small N 0.3–0.6× iterator-build cost dominates µs-scale ops (flatten / astype / unravel / ravel_multi_index @1k–100K)
float16 across the board ~0.65× no Vector<Half> in the .NET BCL → scalar path
less → bool 0.54× comparison-to-bool at small N
100K tier overall 0.90× small/medium arrays are where per-call overhead bites; 10M is a clean win

TL;DR

Reductions, fusion, broadcasting, casts, and large-array throughput are clearly ahead of NumPy — frequently 5–500×. The losses cluster in per-call overhead at small/medium N (API dispatch + iterator construction), the non-vectorized scalar-shift path, and float16 (a BCL limitation, not a design one). All performance, no correctness — it's the optimization backlog, not a bug list.

numpy 2.4.2 · Release · best-of-rounds · shared-hardware caveat applies to absolute ms, not ratios.

@Nucs

Nucs commented Jun 23, 2026

Copy link
Copy Markdown
Member Author

📐 Performance matrices — >1.0× = NumSharp faster

Every cell is NPY/NS = NumPy_ms ÷ NumSharp_ms>1.0× = NumSharp faster · <1.0× = NumSharp slower · =1.0× = parity. (numpy 2.4.2 · Release · best-of-rounds · same box.)

1) Operations — category × cache tier (geomean)

category scalar 1K 100K 1M 10M
elementwise 1.05 1.54 1.18 1.09 1.11
reductions 2.67 1.99 1.51 1.44 1.42
copy/cast 0.61 0.59 0.40 1.39 1.06
index-math 0.32 0.51 0.97 1.22 1.22
dtypes 0.71 0.85 1.97 1.54 1.47

2) Per-family × cache tier

family scalar 1K 100K 1M 10M geomean
— elementwise —
add 1.01 1.48 1.03 0.88 1.01 1.06
sqrt 0.85 1.15 1.00 1.01 1.02 1.00
copy 0.88 2.59 1.78 1.33 1.72 1.56
strided 0.89 1.12 1.00 1.02 0.99 1.00
bcast 0.89 1.13 1.02 0.98 1.03 1.01
reversed 0.85 1.28 0.90 0.99 1.00 0.99
castbuf 1.98 2.29 1.65 1.35 1.16 1.64
mixbuf 1.49 1.94 1.40 1.24 1.09 1.40
— reductions —
sum 1.84 1.78 2.79 2.21 1.76 2.04
sum axis=0 1.90 0.86 0.96 1.00 0.94 1.08
sum axis=1 1.85 0.86 1.51 1.83 1.57 1.47
sum dtype= 1.97 1.47 0.49 0.47 0.55 0.82
amin 1.70 1.61 0.71 0.70 0.82 1.02
cumsum 1.47 1.09 1.06 1.80 1.68 1.39
any (all-false) 8.89 8.41 2.12 0.98 1.00 2.74
any (hit) 9.01 8.50 8.50 7.87 8.22 8.41
— copy/cast —
flatten 0.43 0.44 0.17 2.17 0.90 0.57
astype 0.30 0.53 0.59 1.97 1.90 0.81
ravel.T 0.45 0.73 0.48 2.11 1.01 0.80
copy in-place 1.77 0.81 0.81 1.06 1.02 1.05
less → bool 0.81 0.52 0.26 0.54 0.76 0.54
— index-math —
unravel_index 0.33 0.50 0.95 1.01 0.97 0.68
ravel_multi_index 0.32 0.52 0.99 1.49 1.53 0.82
— dtypes —
complex128 0.74 0.63 1.01 0.76 0.89 0.80
float16 0.72 0.65 0.62 0.62 0.62 0.65
int8 0.67 1.47 12.09 7.70 5.78 3.51

3) Operand & broadcast layouts — case × dtype (1M elements)

case f64 f32 f16 i32 i64 c128 geomean
1-D contiguous (a+a) 2.55 2.18 0.62 2.04 2.54 2.38 1.87
1-D strided a[::2] 1.83 1.36 0.52 1.37 1.77 1.86 1.34
1-D reversed a[::-1] 2.26 2.09 0.56 2.00 2.14 2.23 1.71
array + scalar 2.63 2.11 0.63 1.80 2.17 2.58 1.81
scalar + array 2.35 2.10 0.64 1.98 2.48 2.59 1.85
mixed C + F 2.12 2.01 0.62 1.98 2.02 2.26 1.70
mixed C + T 2.59 2.13 0.62 2.04 2.20 2.51 1.84
broadcast +row(1,C) 2.72 2.28 0.63 2.00 2.42 2.42 1.89
broadcast +col(R,1) 2.68 2.14 0.56 1.99 2.45 2.82 1.88
col-broadcast unary 2.55 1.76 0.86 1.69 2.66 6.09 2.18

4) astype cast matrix — geomeans (1M, over all 1,568 src→dst × layout cells)

By memory layout

C F T sliced negrow negcol strided bcast
1.82 1.97 1.88 1.87 1.89 1.81 1.47 2.21

By source dtype

bool u8 i8 i16 u16 i32 u32 i64 u64 char f16 f32 f64 c128
2.06 1.99 2.02 2.26 2.21 1.96 1.94 1.64 1.58 1.98 2.41 1.75 1.45 1.16

By destination dtype

bool u8 i8 i16 u16 i32 u32 i64 u64 char f16 f32 f64 c128
1.95 1.76 1.80 1.63 1.61 1.82 1.66 1.93 1.76 1.63 2.48 1.63 2.04 2.55

Reading: 8.89 = NumSharp ~8.9× faster; 0.40 = NumSharp ~2.5× slower. The float16 ≈0.6 column reflects the missing Vector<Half> in the .NET BCL (scalar path). decimal omitted from cast geomeans (no NumPy dtype). Absolute ms aren't cross-machine comparable; these ratios are.

Nucs added 3 commits June 24, 2026 07:59
The benchmark/history/<date>_<sha>/ provenance convention held only the
2026-06-05 snapshot; the just-regenerated full run (report commit e3b7c26) was
never persisted there — its raw archive sat only in the gitignored
benchmark/results/20260623-065155/. This promotes the four core artifacts + a
MANIFEST, matching the existing snapshot's shape.

Snapshot: benchmark/history/2026-06-23_e3b7c268/
- MANIFEST.md            provenance / env / methodology / headline geomeans
- benchmark-report.md    op-matrix + appended NpyIter/Layout/Operand/Cast/Fusion
- benchmark-report.json  unified machine-readable results
- benchmark-report.csv   spreadsheet form
- numpy-results.json     raw NumPy timings (merge input)

Run 20260623-065155 · i9-13900K / Win11 Pro N / .NET 10.0.101 / Python 3.12.12 /
numpy 2.4.2 · Release · ~3h. Benchmarked branch HEAD b8623db + local engine WIP
(built clean). Convention NPY/NS (>1 = NumSharp faster). Headline: op-matrix
1.14x/0.90x/1.26x at 1K/100K/10M; NpyIter 1.18x geomean (bcast_reduce 538x,
any(F) 2.74x); fusion (a-b)/(a+b) 4.16x. Raw BDN per-class JSON omitted
(regenerable per the MANIFEST).
… symlink + publish ritual

Make a benchmark run produce a complete, committable provenance snapshot under
benchmark/history/ (the only tracked home for the json/csv/numpy-results that are
gitignored at the benchmark root), and expose the newest run through a git-tracked
benchmark/history/latest symlink so docs and tooling can reference a stable path.
Previously a run only left the gitignored results/<ts>/ raw scratch plus root-level
report files (json/csv ignored), so "what we commit and reference" had no home and
history/ snapshots were assembled by hand.

WHY
  results/<ts>/         = gitignored raw scratch (BDN per-class JSON, ~tens of MB)
  history/<date>_<sha>/ = tracked provenance snapshot (MANIFEST + report + all sheets)
  history/latest        = git symlink -> newest snapshot (the stable reference path)
The benchmark/.gitignore drops benchmark-report.{json,csv} and numpy-results.json at
the root, so history/<snap>/ is the ONLY committed location that carries the full,
machine-readable result set alongside the rendered .md.

NEW  benchmark/scripts/snapshot_history.py
  Assembles a snapshot from the newest (or --results-dir) run + the rendered subsystem
  sheets into history/<date>_<sha>/ and (re)points history/latest at it:
    - 15 artifacts: benchmark-report.{md,json,csv}, numpy-results.json,
      npyiter/layout/operand/cast/fusion *_results.{md,tsv}, cards/{ops,cat}.png
    - auto-generated MANIFEST.md: provenance + environment (CPU parsed from the BDN
      ProcessorName, OS, .NET SDK, Python, NumPy) + NPY/NS convention + methodology +
      the per-size geomean headline parsed back out of benchmark-report.md + the
      NpyIter/Cast headlines.
    - PROVENANCE accuracy: the working-tree marker reads DIRTY only when BUILD-AFFECTING
      sources differ from HEAD (uncommitted-or-new src/ code or *.csproj/props/targets),
      NOT on untracked scratch (benchmark/poc/*.cs), gitignored output, or the report
      files a run always regenerates. So a clean-checkout CI run correctly reads
      "clean (HEAD exactly)" instead of a permanent false DIRTY. All MANIFEST cells are
      |-escaped so a commit subject / WIP path containing '|' can't inject a table column.
    - history/latest as a relative mode-120000 symlink (repo has core.symlinks=true and
      already tracks symlinks); text-file fallback if the OS refuses symlinks.
    - <date>_<sha> names the BENCHMARKED commit; --head lets an after-the-fact
      re-snapshot record the run's real report commit (and reads THAT commit's subject).
    - flags: --results-dir / --snap-name / --head / --commit / --no-stage.
  Raw BDN per-class JSON is intentionally not persisted (regenerable).

WIRE run_benchmark.py
  New step 6 after the root-copy: unless --no-history, shell out to snapshot_history.py
  with the just-produced --results-dir, so every full run ends with a committable
  snapshot + repointed latest. Adds HISTORY_DIR const, the --no-history opt-out, and a
  closing "Snapshot: .../latest -> ..." line. Passes --no-stage: writing the snapshot
  must NOT mutate the git index — staging is the human's "review" step (run -> review ->
  commit) and CI stages benchmark/history/ explicitly; a local perf check shouldn't
  silently `git add` ~16 files.

WIRE .github/workflows/benchmark.yml (the post-release publish ritual)
  git add now includes benchmark/history/ so the snapshot + latest symlink are committed
  to master alongside the refreshed report/cards; commit message and header comment
  updated to mention the history snapshot.

DOCS
  - benchmark/CLAUDE.md: documents history/ (latest + <date>_<sha>/), results/ as
    gitignored scratch, scripts/snapshot_history.py, and a new
    "History snapshots & the publish ritual" section (results/ vs history/<snap>/ vs
    latest table + run -> review -> commit ritual).
  - .claude/CLAUDE.md (Performance Convention): "what we commit & reference is
    benchmark/history/, not results/<ts>/"; reference the stable
    benchmark/history/latest/benchmark-report.md. Also reconciles the canonical-harness
    line to FIVE appended subsystems (adds the previously omitted benchmark/operand/).

BACKFILL benchmark/history/2026-06-23_e3b7c268/
  Re-ran the script over the e3b7c26 report run: the snapshot carries the full 15
  artifacts (was 4 hand-copied core files) + a regenerated MANIFEST that correctly reads
  "clean (HEAD exactly)" (no build-affecting WIP at that commit), and history/latest
  points at it.
Add a detailed compatibility matrix comparing NumPy (v2.4.2) string dtype/features to NumSharp (nditer branch f7be5f7). Documents support across dtypes, parsing, scalars, array creation, memory layout, casting, comparisons, string operations, array functions and I/O, and highlights that NumSharp lacks a real string dtype (only a UTF-16 char vector and several hacks). Notes two spec corrections found against live NumPy, links to NUMPY_STRING_TYPES.md, and tracks issue #592.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant