[Major Rewrite] NumPy nditer port, NpyExpr DSL with 3-tier custom-op API, C/F/A/K memory layout support, stride-native matmul by Nucs · Pull Request #611 · SciSharp/NumSharp

Nucs · 2026-04-22T09:19:43Z

Complete changelog of the nditer branch — everything in this PR since #612 merged.

455 commits · 806 files · +234,348 / −19,179 (vs master, after #612)

TL;DR

NpyIter — full port of NumPy 2.4.2's nditer (~12.5K lines): all iteration orders (C/F/A/K), all indexing modes, buffered casting, buffered-reduce double-loop, masking, memory-overlap protection (COPY_IF_OVERLAP), windowed buffering (DELAY_BUFALLOC), unlimited operands and dimensions. 566+ byte-for-byte NumPy parity scenarios.
NpyExpr DSL + three-tier custom-op API — write your own ufuncs: raw IL (Tier 3A), element-wise scalar/SIMD (Tier 3B), or composable expression trees with operator overloads (Tier 3C). Exposed as the public np.evaluate, which runs fused expressions 3.2–6.1× faster than NumPy (which can't fuse), with per-node NumPy result_type typing and fused reductions.
out= / where= / dtype= ufunc kwargs across the elementwise API — the kwargs on every NumPy ufunc, spanning the binary, unary-math, comparison, predicate, and bitwise families with exact NumPy broadcast/cast/error-text semantics. Plus np.bitwise_and/or/xor and np.positive at the np.* surface.
NumPy-parity benchmark: geomean 1.00× at 10M elements across ~409 ops (166 faster / 171 close / 36 slower) — measured by a new official BenchmarkDotNet-vs-NumPy suite committed with the report.
36 new np.* APIs — sort, pad (11 modes), tile, median/percentile/quantile (all 13 interpolation methods) + their nan* variants, average, ptp, take/put/place, extract/compress, diagonal/trace, argwhere/flatnonzero, unravel_index/ravel_multi_index/indices, delete/insert/append, diff/ediff1d, asfortranarray/ascontiguousarray, np.multithreading.
C/F/A/K order support wired through the whole API — Shape understands F-contiguity, OrderResolver resolves NumPy order modes, ~68 layout bugs fixed across 9 fix groups.
Stride-native matmul/dot — BLIS-style GEBP GEMM absorbs arbitrary strides for all dtypes (kills a ~100× penalty on transposed inputs); fused 1-D dot is 3.5–9× faster with zero GC; opt-in multithreaded dot ~2× faster than NumPy's default on 1M vectors.
Sorting, casts & Complex finished — np.sort/np.argsort on a radix line-kernel (closes a Missing Function); a SIMD strided-cast campaign that killed the cast cliffs (15×8×15 astype matrix: 716 → ~391 lagging cells, 852 → 1,177 winning cells vs NumPy); np.zeros via calloc/demand-zero (O(1), was ~1000× slower); the six Complex transcendentals (sinh…arctan); and bit-exact pairwise summation for sum/mean.
Deterministic memory management — atomic reference counting + IDisposable on NDArray, plus a tcache-style buffer pool (1 B – 64 MiB window).
Differential fuzzing infrastructure — 37,445 bit-exact NumPy-comparison cases across 24 corpus tiers, a seeded random fuzzer with shrinker, a CI FuzzMatrix gate, and a nightly soak workflow.
Legacy iterator stack deleted outright — MultiIterator, the Regen-generated cast templates, and NDIterator itself (interface + class + AsIterator extensions) are all gone; every code path now iterates through NpyIter / NpyFlatIterator / GetAtIndex.
Test suite: 9,990 passed / 0 failed on net8.0 + net10.0 (+2,600 new test methods), plus the 37,445-case fuzz corpus replayed by the FuzzMatrix gate.

1. NpyIter — full NumPy `nditer` port

From-scratch C# port of NumPy 2.4.2's iterator machinery under src/NumSharp.Core/Backends/Iterators/ (~12,557 lines), promoted to public API with NDArray overloads.

Capability	Detail
Iteration orders	C, F, A, K — incl. NEGPERM negative-stride handling, axis reordering + coalescing to full 1-D collapse
Indexing modes	`MULTI_INDEX`, `C_INDEX`, `F_INDEX`, `RANGE` (parallel chunking), `GotoIndex` / `GotoMultiIndex` / `GotoIterIndex`
Buffering	Buffered casting with all 5 casting rules, windowed buffered iteration, `DELAY_BUFALLOC`, buffered-reduce double-loop (incl. `bufferSize < coreSize`)
Reductions	`op_axes` with `-1` reduction axes, `REDUCE_OK`, `IsFirstVisit`, `REUSE_REDUCE_LOOPS` slab accumulation
Overlap safety	`COPY_IF_OVERLAP` via a port of NumPy's `mem_overlap` solver (`NpyMemOverlap.cs`) — overlapping in/out operands no longer silently corrupt
Masking	`WRITEMASKED` + `ARRAYMASK` executed — the buffered window flush writes back only mask-nonzero elements; `VIRTUAL` operands (null op slots) construct with NumPy 2.x semantics
Operands / dims	Unlimited operands (NumPy caps at `NPY_MAXARGS=64`) and unlimited dimensions (NumPy caps at `NPY_MAXDIMS=64`) via dynamic allocation
APIs	`Copy`, `GetIterView`, `RemoveAxis`, `RemoveMultiIndex`, `ResetBasePointers`, `IterRange`, `DebugPrint`, fixed/axis stride queries, `GetValue<T>`/`SetValue<T>`, …
Casting parity	`NpyIterCasting.CanCast` matches NumPy's `safe`/`same_kind` lattice exactly

Validated by a dedicated battletest harness: 566 scenarios replayed against NumPy 2.4.2 byte-for-byte, a permanent variation-probe harness, and tools/iterator_parity. Dozens of parity bugs found and fixed against NumPy ground truth: negative-stride flipping, NO_BROADCAST enforcement, F_INDEX coalescing, buffered-reduction stride inversion, K-order on broadcast inputs, EXLOOP iternext, buffered-cast Advance, ranged Reset() desync, buffer free-list corruption, the size-1 stride-0 invariant (a (1,4) view with nonzero stride corrupted RemoveMultiIndex), op_axes out-of-bounds reads on stretched size-1 axes, write-broadcast validation, PARALLEL_SAFE wiring, and unit-axis absorption — each reproduced against NumPy first, then fixed by adopting NumPy's constructor structure.

Execution at NumPy speed

NpyIter isn't just correct — it is now the production execution engine: DefaultEngine's binary, unary, and comparison ops (same- and mixed-dtype) route through the NpyIter Tier-3B shell, and it measures at-or-faster than NumPy on every probed aspect (Release, i9-13900K, NumPy 2.4.2):

Aspect (float32)	NumSharp	NumPy	Ratio
contig sqrt 10M	2.98 ms	3.24 ms	0.92×
contig add 10M	3.91 ms	4.09 ms	0.96×
strided add 1M	319 µs	416 µs	0.77×
strided sqrt 1M	206 µs	374 µs	0.55×
strided sum 1M	109 µs	205 µs	0.53×
fused `a*b+c` 10M	4.77 ms	13.38 ms	0.36×
fused `(a-b)/(a+b)` 10M	4.12 ms	22.33 ms	0.18×

Key mechanisms: an O(1) trivial-loop bypass that skips iterator construction for contiguous operands, identity-broadcast fast paths, AVX2 hardware-gather (vgatherdps) strided SIMD in the Tier-3B shell (NumPy uses scalar loops for strided binary/reduce — its floors are beatable), and strided-reduction kernels (2-D strided sqrt 1.36× faster than NumPy, strided sum 2.2× faster).

2. NpyExpr DSL + three-tier custom-op API

User-extensible kernel layer on top of NpyIter — the public answer to "how do I write my own ufunc":

Tier 3A — ExecuteRawIL: emit raw IL against the NumPy ufunc signature void(void** dataptrs, long* strides, long count, void* aux).
Tier 3B — ExecuteElementWise: provide scalar + vector IL; the shell supplies a 4×-unrolled SIMD loop, remainder vector, scalar tail, and strided fallback.
Tier 3C — ExecuteExpression: compose NpyExpr trees with C# operators ((a - b) / (a + b)), 50+ node types (arithmetic, trig, exp/log, rounding, predicates, comparisons, Min/Max/Clamp/Where), plus Call() to splice any delegate/MethodInfo into a fused kernel. Compiled once, cached by structural key, ~5 ns dispatch.

This is what powers the fusion wins — one pass, no temporaries — and it is exposed publicly as np.evaluate(expr[, operands][, out]):

Per-node NumPy result_type typing — every node resolves to its NumPy 2.4.2 dtype, so mixed trees wrap correctly: (i4*i4)+f8 wraps the multiply in int32 (→ 1410065408) before promoting. Strong-strong NEP50 (incl. int/float tier crossing), weak python-scalar literals (i4+2 → i4, i4/2 → f8) with NumPy's exact OverflowError, and special resolvers (true_divide, arctan2, negative-integer-literal power → ValueError, bool add=OR/multiply=AND).
Fused reductions — NpyExpr.Sum/Prod/Min/Max/Mean compile a one-pass inner loop; sum(a*b) reads a and b once and never materializes the product. NumPy reduction dtypes (int→i64, uint→u64, mean→f64).
out= joins via the ufunc rules (same_kind validation, reference identity, overlap-safe aliasing through COPY_IF_OVERLAP); an EXTERNAL_LOOP guard prevents the silent count==1 slow path.
Measured (Release, 4M f64, NumPy 2.4.2): a*b+c 3.2×, (a-b)/(a+b) 6.1×, sum(a*b) 3.6×, sum f32 2.9×, i4*2+f8 3.5× faster. Permanent gate in benchmark/fusion/evaluate_bench.{cs,py}.

3. Legacy iterator stack retired

MultiIterator deleted; all callers migrated to NpyIter.Copy / multi-operand execution.
The Regen template NDIterator.template.cs + 16 generated NDIterator.Cast.* partials deleted (−3,870 LOC in one commit).
NDIterator (interface + NDIterator<T> + AsIterator extensions) deleted entirely — [Obsolete] tombstones that threw at runtime after the migration and were referenced by nothing live. Production iteration runs through NpyIter/NpyIterRef (kernels), GetAtIndex (flat reads), and NpyFlatIterator (np.broadcast(...).iters).
~400 per-dtype NPTypeCode switch sites replaced by a generic NpFunc dispatch utility.

4. C/F/A/K memory-layout support

Shape now tracks F-contiguity with NumPy-convention contiguity computation; new OrderResolver resolves C/F/A/K for every API with an order parameter.
Order support wired through: copy, array, asarray, asanyarray, *_like, astype, flatten, ravel, reshape, eye, concatenate, cumsum, argsort, tile, plus post-hoc F-contig preservation across the IL-kernel dispatchers.
New: np.asfortranarray, np.ascontiguousarray.
np.where selects C/F output layout the way NumPy does; ravel('F') of an F-contig source returns a view (was a 3,000× copy).
~68 layout bugs fixed across 9 TDD fix groups, backed by ~3,300 lines of new order tests (Sections 41–51: reductions/keepdims, matmul/dot/outer/convolve, broadcasting-from-F, manipulation, file I/O fortran_order, Decimal scalar path, fancy-write isolation, …).

5. New & completed `np.*` APIs

New functions (36):

Area	APIs
Fused / ufunc	`np.evaluate` (fused expressions — see §2), `np.bitwise_and`, `np.bitwise_or`, `np.bitwise_xor`, `np.positive`
Sorting	`np.sort` (+ `ndarray.sort`; `np.argsort` reimplemented) — radix line-kernel on NpyIter, stable, NaN-last, all axes / orders (`IterAllButAxis` drive mirroring NumPy's `_new_sortlike`)
Manipulation	`np.pad` (all 11 NumPy modes + callable), `np.tile`, `np.delete`, `np.insert`, `np.append`
Indexing/selection	`np.take`, `np.put`, `np.place`, `np.extract`, `np.compress`, `np.argwhere`, `np.flatnonzero`, `np.diagonal`, `np.trace`, `np.unravel_index`, `np.ravel_multi_index`, `np.indices`
Statistics	`np.median`, `np.percentile`, `np.quantile` (all 13 interpolation methods, tuple axis, `out=`, `keepdims`, QuickSelect engine), `np.average` (`weights`, `returned`, tuple-axis; fused kernel 1.3–1.6× faster than NumPy at 1M), `np.ptp`, `np.nanmedian`, `np.nanpercentile`, `np.nanquantile`
Math	`np.diff`, `np.ediff1d`
Creation	`np.asfortranarray`, `np.ascontiguousarray`
Runtime	`np.multithreading(enabled, max_threads)` — opt-in threaded kernels

Rebuilt to full NumPy 2.x parity:

np.clip — min=/max= keyword aliases, default-None bounds, NumPy 2.x dtype promotion, out= validation.
np.unique — 5 missing kwargs, sort+mask algorithm (up to 43× faster), NaN partitioning, n > Array.MaxLength fallback.
np.searchsorted — side=, sorter=, multidim validation; IL binary-search kernels 5–25× faster (beats NumPy on 20/22 benchmarks).
np.copyto — casting=, where= masked copies at NumPy speed (was 7–72× slower).
np.asarray — copy=, like=, device=, dtype-as-string. np.concatenate — full parity + C/F fast paths. np.all/np.any — tuple-axis, out=, where=. np.expand_dims — tuple axis. np.repeat — axis= parameter. np.power — integer-power semantics, negative-exponent ValueError, crash fix.
np.broadcast — N-operand form (0..64, then unlimited — NumPy parity, was 2-operand only), live index cursor, lazy .iters, .numiter.
Engine completeness: bool/char max/min, Complex quantile, IsInf implemented (was a stub); the six Complex transcendentals sinh/cosh/tanh/arcsin/arccos/arctan implemented (hybrid BCL + C99 edge fix-ups, NumPy 2.4.2 parity — were NotSupportedException).
Full 15-dtype coverage pushed through the hot paths — the SByte/Half/Complex dtypes introduced in [new dtypes, NEP50] fully supported Half/Complex/SByte, np.* alias overhaul, NumPy 2.x type alias alignment #612 now work across every kernel family this PR touches (reductions, indexing, trace, casts, quantile, …).

out= / where= / dtype= ufunc kwargs (NumPy parity):

The kwargs present on every NumPy ufunc now span the elementwise core — binary (add, subtract, multiply, divide, true_divide, mod, power, floor_divide), unary-math (sqrt, exp, log, sin, cos, tan, abs/absolute, negative, square), the six comparisons, predicates (isnan/isfinite/isinf), bitwise, invert, arctan2 — each as one NumPy-shaped overload, every rule pinned against NumPy 2.4.2:

out joins the broadcast but never stretches (mismatched/stretchable out raise NumPy's exact texts, trailing space included); loop dtype resolved from inputs (NEP50), out only needs a same_kind cast; the provided instance is returned (reference identity).
where must be exactly bool (mask cast under 'safe'); it broadcasts over operands and participates in output shape; mask-false slots keep prior out contents.
out aliasing an input is well-defined via COPY_IF_OVERLAP — add(x[:-1], x[:-1], out=x[1:]) matches NumPy exactly.
dtype= computes in the loop dtype (subtract(300, 5, dtype=i16) = 295), with the bool add→OR / multiply→AND remap keyed off the final loop dtype so add(True, True, dtype=i32) = 2.

6. Linear algebra

Stride-native GEMM for all 12 numeric dtypes — BLIS-style GEBP with stride-aware packers; the 8×16 Vector256 FMA micro-kernel reads packed panels, so transposed/sliced inputs cost nothing extra. Eliminates the ~100× fallback penalty (np.dot(x.T, grad): 240 ms → ~1 ms) and the boxing GetValue fallback chain.
Full matmul gufunc semantics — batched stacking, 1-D promotion/squeeze rules, validated by a dedicated differential matrix (816 cases).
Fused single-pass 1-D dot — 3.5–9× faster, zero GC (was up to 446 gen-0 collections per call at 100K).
np.multithreading — opt-in parallel 1-D dot: 1M float dot 172 → 60 µs, ~2× faster than NumPy's default build. Off by default; bitwise-identical summation order when off.

7. Performance (beyond NpyIter and linalg)

Op	Improvement
Axis reductions, narrow ints	Widening SIMD (int16→int32 accum etc.): `sum(int16, axis=1)` 1058 ms → 2.7 ms (389×, now faster than NumPy); int32/uint32 2.3–4.6×; also fixes a uint32 axis-sum corruption bug
`mean` (axis)	217× (Phase-0 bug surgery); `var`/`std` 21×; `count_nonzero` 20×
`np.nonzero`	IL SIMD kernel closes an 8–241× gap to NumPy
`np.where`	IL kernels for scalar-broadcast & non-contiguous (1.2–2× NumPy on broadcast conditions)
Strided 1-D unary	Fused strided-SIMD kernel: 0.55 ns/elem flat — beats NumPy at every size; strided `sqrt` reached parity via gather→tile→SIMD buffering
Strided flat reductions	Incremental-advance path: strided sum 8.3× faster (11.8× behind NumPy → 1.4×)
Comparisons	PDEP-based packed mask→bool store; broadcast/strided compares routed via NpyIter
Axis-0 reductions	Column-tiled accumulation (breaks the output RAW dependency); 8× pairwise unrolled flat reductions
Allocation	tcache-style size-bucketed buffer pool with a 1 B – 64 MiB window (covers both the small-N ufunc result and 4M+ outputs that previously paid a fresh `VirtualAlloc` + demand-zero faults); ≥1 MiB buckets capped at 2 buffers; pool-side GC memory pressure tracking live state; `GC.SuppressFinalize` on free; `using`/ARC adopted across `concatenate`, `allclose`, `convolve`, `tile`, `eye`, masking, shuffle, …
Casts (SIMD campaign)	Strided/gathered SIMD kernels across the full 15×8×15 `astype` matrix — `cvtt` float→int, Giesen f16↔ widen/narrow, complex deinterleave, sub-word VPSHUFB shuffles, fused VPGATHER whole-array kernels, single-pass KEEPORDER same-type copy. Cliffs eliminated: 716 → ~391 lagging cells, 852 → 1,177 winning cells vs NumPy
`np.zeros`	`calloc` / Windows `VirtualAlloc` demand-zero — O(1) regardless of size (10M f64: 14.3 ms → ~0.01 ms, was ~1000× slower)
Broadcast-reduce	Stride-0 axes folded algebraically in the flat-reduction chokepoint (no O(D×N) materialize) — `sum(broadcast_to(...))` now ~534–700× faster, beats NumPy, bit-exact
`sum`/`mean` (float)	Bit-exact NumPy pairwise summation ported onto the per-chunk reduce path — matches `np.add.reduce` bit-for-bit (unblocks float32)
`np.any`/`np.all` (bool/char)	Reinterpret to byte/ushort → existing integer SIMD path (was a 5–12× scalar cliff); fixes a latent AVX2 32-lane mask-overflow correctness bug
Complex/Half/Decimal reductions	NpyIter chunked `ForEach` axis reductions — Decimal 5–13×, Half mean 1.6–3.7×, Complex mean 15–45×→parity; float16 negate ~10× via sign-bit flip
Casts (`float→int32`)	NumPy-faithful SIMD `cvtt`, strided/reversed/gathered variants
`np.split` family	O(1) sub-shape derivation, direct views — 1.5–4× faster than NumPy
Where/copyto/searchsorted/unique	see §5

8. Official benchmark suite + honest methodology

New cross-platform run_benchmark.py entry point: BenchmarkDotNet Full rigor (50 iters, InProcessEmit) × all suites × {1K, 100K, 10M} vs NumPy 2.x — 1,813 C# measurements, 1,111 matched op×dtype×size comparisons, structural op-name join, tracked markdown report + per-suite artifacts + history snapshots. Coverage spans all 15 dtypes (SByte/Half/Complex suites added).
Headline: geomean NumSharp÷NumPy = 1.00× at N=10M (166 ops faster / 171 close / 36 slower) — parity across the whole op surface at memory-bound sizes; ~1.9× at 1K where per-call dispatch dominates (tracked as the next focus).
Found and neutralized a benchmark-invalidating tooling bug: dotnet run file-based apps compile the project reference in Debug (optimizations off) even with Configuration=Release properties — hand loops measured ~2× slow while DynamicMethod IL was immune. Benchmarks now assert IsJITOptimizerDisabled == false and refuse to mislead; the rule is documented.
Canonical NpyIter benchmark — a section-addressable harness covering 33 op families × {scalar/1K/100K/1M/10M}, integrated into run_benchmark.py, plus a post-release CI workflow (.github/workflows/benchmark.yml) that auto-commits report cards to master.
Frontier findings — found, then fixed. Adversarial probes flagged real losses; the headline ones are now closed: np.sum over a broadcast_to view (was 54× slower) folds stride-0 axes algebraically and runs ~534–700× faster than NumPy, bit-exact; scalar np.any/np.all on bool/char (was 5–12× slower) reinterpret onto the integer SIMD path; np.zeros (was ~1000× slower) is calloc-backed. Remaining tracked items: small-N (~1K) per-call dispatch overhead and a few iterator edge cases pinned as [OpenBugs]/skipped repros. A win surfaced too: hand-rolled 8-band parallel iteration 4.7×.

9. Differential fuzzing vs NumPy (new infrastructure)

37,445 bit-exact corpus cases across 24 JSONL tiers generated from real NumPy 2.4.2 outputs: casts (full 15×15 matrix), binary arith (NEP50), div/mod/power, comparisons, unary (incl. float16 inputs + all narrow ints), reductions, NaN-aware reductions, cumulative, statistics, logic/extrema, bitwise+shift, where/place, manipulation, matmul, modf multi-output, sorting/searching, parameter sweeps, SIMD-tail boundaries (900 cases around vector-width edges), operand aliasing, and error-parity (exception-for-exception).
Seeded random fuzzer with an element-wise shrinker for minimal repros; metamorphic invariant tier (11 algebraic properties).
CI integration: FuzzMatrix gate wired into the build workflow + a new nightly fuzz-soak workflow (.github/workflows/fuzz-soak.yml).
Findings inventoried in docs/FUZZ_FINDINGS.md; every fixed class re-armed as a permanent regression gate. The error-parity tier alone surfaced 1 critical crash; the op tiers surfaced 17+ distinct bug classes that are now fixed (see §10).

10. Correctness — NumPy-parity bug fixes

Semantics (behavioral changes, may affect callers):

floor_divide / mod: NumPy-exact floored semantics and divide-by-zero results.
Comparisons: <= / >= now return False for NaN (IEEE/NumPy).
Flat min/max propagate NaN.
np.negative(uint) wraps modulo 2ⁿ instead of throwing; bool - bool and -bool/np.negative(bool) now throw (NumPy behavior).
Transcendental ufuncs use NEP50 width-based float promotion.
np.power: negative integer exponent raises ValueError; exact integer-power semantics.
Cast semantics aligned with NumPy across all dtype pairs (IL kernels + ConvertValue); complex→bool no longer drops the imaginary part; float→int SIMD uses truncation (cvtt) like NumPy.
Broadcasting keeps rank when a 1-D [1] meets a lower-rank operand; quantile-family dtype & bool handling; Complex np.where.
Integer reciprocal(0) is per-width exact: int32/int64 → MinValue, uint64 → 2⁶³, but 0 for int8/int16/uint8/uint16/uint32 (was MinValue/0 across the board); bool → int8.
clip/maximum/minimum: float16 signed-zero scalar tail, NaN propagation through the SIMD kernel, and correct F-contiguous/strided element pairing.
float16 axis sum accumulates in float32 (NumPy parity); Complex flat min/max return the NaN-bearing element verbatim; Complex unary math ported from NumPy's own C99 algorithms.

Crashes & corruption:

Overlapping-operand corruption eliminated iterator-wide (COPY_IF_OVERLAP, §1).
Masked iteration: a buffered WRITEMASKED write landed garbage in exactly the slots NumPy preserves (silent corruption of the elements the caller asked to protect) — now writes back only mask-nonzero elements.
uint32 axis-sum produced wrong values past 8 distinct columns (widening-SIMD rewrite).
np.pad: 5 correctness/crash bugs (battle-tested against NumPy 2.4.2); linear_ramp preserved Complex dtype.
UnmanagedStorage/ArraySlice: CopyTo direction + bounds bugs; CloneData partial-buffer bug; scalar offset lost on Clone; buffered NpyIter.Clone shared buffers; DTypeSize reported Marshal.SizeOf instead of in-memory stride; NPTypeCode.Char.SizeOf returned 1 (real: 2); stale Decimal priority.
TensorEngine now propagates through Cast/Transpose/copy/reshape/ravel (custom engines were silently dropped).
take with out= enforces NumPy's safe-cast direction; put/place non-contiguous writeback fixes; argsort on non-C-contiguous input.
NpyIter ForEach/ExecuteGeneric/ExecuteReducing read past the end without EXTERNAL_LOOP.
np.exp2 float32-output IL kernel was malformed (InvalidProgramException); np.power with a Half exponent threw InvalidCastException; a narrowing dtype= on a complex float-ufunc segfaulted — all fixed.
Complex nansum axis reduction read uninitialized memory for ndim ≥ 3; the AVX2 32-lane any() mask overflow (byte/sbyte) returned wrong results; net8.0 complex abs and axis min/max NaN propagation corrected.

11. Memory management — ARC + `IDisposable`

NDArray now implements IDisposable backed by atomic reference counting on the unmanaged block: CAS-driven TryAddRef/Release, idempotent Dispose, finalizer safety net, immortal non-owning wraps. Views keep parents alive; parent disposal never invalidates live views.
Hammered by a 15-case lifecycle suite incl. 32-thread × 1,000-op concurrency races and 50-way parallel dispose — zero corruption.
Deterministic release means hot loops no longer wait on the finalizer queue; combined with the buffer pool this removes most steady-state GC pressure (dot at 100K: 446 collections → 0).

12. `Char8` primitive

New 1-byte character type (NumSharp.Char8) — the NumPy S1/Python bytes(1) equivalent — with conversions, operators, span helpers, and 100% Python bytes API parity validated against a Python oracle. Vendored .NET ASCII/Latin-1 reference sources under src/dotnet/ document the upstream implementations it mirrors.

13. Examples — trainable MNIST MLP

New examples/NeuralNetwork.NumSharp: a 2-layer MLP with a naive implementation and a fused one (single-NpyIter bias+ReLU fusion, fused softmax-cross-entropy backward, Adam optimizer). Originally needed a "copy transposed views before np.dot" workaround (31× training speedup at the time); the stride-native GEMM (§6) made the workaround unnecessary. Converges to >99% test accuracy in the bundled demo.

14. Kernel architecture & hygiene

ILKernelGenerator split into DirectILKernelGenerator (legacy whole-array kernels, 51 partials under Kernels/Direct/) and ILKernelGenerator (NpyIter-driven per-chunk kernels — the target model matching NumPy's PyUFuncGenericFunction); migration path documented per kernel family.
All Vector128/256/512 and Math/MathF reflection centralized in VectorMethodCache / ScalarMethodCache; IL-emitted typed-field copier replaces the UnmanagedStorage.Alias switch.
24 dead kernel methods removed outright (were [Obsolete(error: true)] tombstones, referenced by nothing); dead axis-reduction SIMD paths removed.

15. Documentation

NpyIter/NDIter book: docs/website-src/docs/NDIter.md (7-technique quick reference, decision tree, memory model, gotchas) + ndarray.md.
DocFX website — Benchmarks vs NumPy: benchmarks.md (head-to-head evidence companion to the IL-generation page), benchmark-iterator.md, benchmark-matrix.md, driven by the auto-committed report artifacts.
Engineering ledgers: PERF_LEDGER.md (every optimization with before/after), NPYITER_GAPS_AND_ROADMAP.md (gap analysis vs NumPy 2.4.2 + prioritized roadmap), MIGRATE_NPYITER.md, IL-kernel playbook, fuzz findings/coverage.
Branch quality audit findings are pinned as test/NumSharp.UnitTest/AuditV2/AuditV2_*.cs — every Tier-1 finding fixed or reproduced as an [OpenBugs] test.

16. Tests & CI

+2,600 test methods; suite now 9,990 passed / 0 failed on net8.0 + net10.0. Zero regressions maintained commit-by-commit.
New suites: np.evaluate (per-node wraparound, dtype matrices, weak scalars + overflow, fused-vs-unfused, out= identity/cast/aliasing, fused reductions), out=/where=/dtype= parity suites (broadcast/cast/error-text pins), WRITEMASKED/VIRTUAL parity; NpyIter battletests (566 scenarios), order-support sections 41–51, ARC lifecycle, clone regression, np.pad/average/median/percentile/ptp/diff battle tests, IL-kernel battle tests, behavioral audit harness.
CI: fuzz gate in build-and-release.yml, nightly fuzz-soak.yml, new post-release benchmark.yml (auto-commits NumPy-comparison report cards to master).
Known gaps stay visible: the still-unimplemented NumPy functions are flip/fliplr/flipud/rot90, diag, gradient, and round (np.sort is now done); small-N (~1K) per-call dispatch overhead is the headline performance focus (docs/NPYITER_GAPS_AND_ROADMAP.md); a few iterator edge cases remain pinned as [OpenBugs]/skipped repros. Every open issue found by the audits/fuzzers/benches is checked in as a failing-by-design test rather than ignored.

Breaking changes

Change	Impact	Migration
`bool - bool`, `-bool`, `np.negative(bool)` now throw	Matches NumPy	Use `^` / cast to int first
NaN `<=` / `>=` returns `False`	Matches IEEE & NumPy	Use `np.isnan` explicitly
`floor_divide`/`mod` divide-by-zero & floored results	Matches NumPy	—
`np.negative(uint)` wraps instead of throwing	Matches NumPy	—
`np.power(int, negative int)` raises `ValueError`	Matches NumPy	Use float exponents
Cast edge cases (overflow/NaN/complex→bool/float→int truncation)	Matches NumPy	—
Transcendental ufuncs: NEP50 width-based promotion	Return dtype may change	—
`np.clip`/quantile-family dtype promotion	Return dtype may change	—
Broadcast views are read-only; broadcasting keeps rank for 1-D `[1]`	Matches NumPy	`.copy()` to write
`MultiIterator` and `NDIterator` (+ `NDIterator<T>`, `AsIterator`) removed	Public types removed (threw at runtime anyway)	Use `NpyIter` / `NpyIter.Copy` / `NpyFlatIterator`
NpyIter: `MaxOperands=8` and 64-dim limits removed	None (loosening)	—
`np.copyto` unwriteable-destination error type corrected	Exception type change	—

Everything above was validated against NumPy 2.4.2 ground truth — by 37k differential corpus cases, 566 iterator parity scenarios, and per-feature battle tests run on actual NumPy output.

… throws) NumPy has no subtract ufunc loop for the bool dtype: bool - bool, np.subtract(bool, bool) and np.ptp(bool) (which computes max - min) all raise TypeError ("numpy boolean subtract, the `-` operator, is not supported, use the bitwise_xor, the `^` operator, or the logical_xor function instead."). NumSharp previously computed a boolean result instead - a silent divergence surfaced while implementing np.ediff1d. DefaultEngine.Subtract now guards the both-operands-boolean case and throws NotSupportedException with NumPy's message. Mixed bool + numeric still promotes to the numeric type and subtracts (e.g. bool - int32 -> int32), so only the genuine bool/bool case is rejected. The guard sits at the single engine chokepoint, so it covers the - operator, np.subtract, the scalar-scalar path, and any op that subtracts internally (e.g. np.ptp). Side effect: np.ptp(bool) now raises the same error as NumPy instead of returning a bool, fixing a previously-documented [Misaligned] divergence. Tests - np_subtract_tests.cs: dropped the (Boolean, Boolean) DataRow from SubtractAllPossabilitiesBoolean (it contradicted the test's own _REGEN template, which excludes Boolean) and added SubtractBoolBool_ThrowsLikeNumPy asserting the operator, np.subtract and the scalar path all throw. - np.ptp.BattleTests.cs: Ptp_Bool_DivergesFromNumPy -> Ptp_Bool_ThrowsLikeNumPy (dropped [Misaligned]; now asserts the throw, matching NumPy). Full suite green on net10.0 (9404 passed, 0 failed). Two net8.0 failures (Trunc_FContig_PreservesFContig, Max_WithNaN_PropagatesNaN) are pre-existing and unrelated: they fail identically with this change reverted and involve no boolean subtraction.

…e(bool) throw) NumPy has no negative ufunc loop for the bool dtype: unary -bool, np.negative(bool) and ndarray.negative() all raise TypeError ("The numpy boolean negative, the `-` operator, is not supported, use the `~` operator or the logical_not function instead.") - even for empty arrays. NumSharp previously returned logical_not instead, a silent divergence (companion to the bool-subtract divergence fixed in the previous commit). Changes - DefaultEngine.Negate (the unary `-` operator / TensorEngine.Negate path): the bool branch now throws instead of routing to LogicalNot. - NDArray.negative() (the np.negative / ndarray.negative() path, a separate implementation): guards bool up-front - before the size==0 short-circuit and the cast allocation, so empty bool also throws - and drops the now-dead boolean switch case. - np.logical_not(bool) previously relied on Negate(bool) routing to LogicalNot. It now routes through np.invert (for bool, logical_not == ~ == invert), so logical_not keeps working after the Negate guard. The `!` and `~` operators were already independent of Negate and are unaffected. Numeric negate (int / float / decimal / complex / half) is unchanged. Tests - NDArray.negative.Test.cs: Negative_Bool_ThrowsLikeNumPy asserts -bool, np.negative(bool), ndarray.negative() and empty-bool all throw, and that ~bool / np.logical_not(bool) still return the boolean flip [F, T, F]. Full suite green on net10.0 (9405 passed, 0 failed). The two net8.0-only failures (Max_WithNaN_PropagatesNaN, Trunc_FContig_PreservesFContig) are pre-existing and unrelated: they reproduce on a clean build with this change reverted and involve no boolean negation.

…cross all dtype pairs NumSharp had THREE cast implementations with divergent out-of-range / NaN semantics: - Converts (scalar; used by astype/CastTo): NumPy-faithful on all 3276 cast cases. - DirectILKernelGenerator cast kernel (SIMD; used by NpyIter.Copy contig/strided): SATURATED float->int and sign-extended signed->UInt64 to only 32 bits -> 142 wrong. - NpyIterCasting.ConvertValue (scalar fallback; used by NpyIter.Copy/copyto/where/ concatenate): read every source through a double intermediate (dropping Int64/UInt64 low bits) and used C# saturating casts -> 346 wrong. NumPy uses MODULAR WRAPPING for integer narrowing/sign-change and C truncation (type-min / int.MinValue sentinel on NaN/overflow) for float->int. The IL kernel and ConvertValue instead SATURATED (clamped to dst min/max), so every out-of-range, negative->unsigned, float-overflow or NaN value produced a wrong result through NpyIter.Copy -- a latent bug reachable today via np.where, np.copyto and np.concatenate cross-dtype casts (not just the astype fast-path under development). Spec established empirically: a 15x15 dtype x edge-value ground-truth matrix generated from NumPy 2.x (.claude/scripts/cast_spec_numpy.py), replayed bit-exactly through all three NumSharp paths (cast_spec_numsharp.cs). Converts matched NumPy on every case and is the single source of truth. Fixes: - NpyIterCasting.ConvertValue: read the source through a LOSSLESS intermediate (long / ulong / double / decimal categories instead of blanket double) and route integer narrowing and float->int conversions through the Converts.* table. Integer casts now wrap (unchecked); float->int truncates with the NumPy overflow/NaN sentinel. Result: 0/3276 divergences (was 346). - DirectILKernelGenerator: TryGetCastKernel / TryGetStridedCastKernel now decline the two families whose emitted IL diverges from NumPy (float->integer; signed-narrow->UInt64) via DivergesFromNumpyCast(), so those casts fall back to the NumPy-exact ConvertValue path. All other pairs keep their SIMD kernels. NumPy-exact SIMD emission for the declined families (raw cvtt truncation, full 64-bit sign extension) is a perf follow-up. Result: all three cast paths are now bit-exact with NumPy across all 3276 cast cases (was il=142, cv=346, effective fast path=271 wrong). Full unit-test suite: 0 failures, 9405 passed (TestCategory!=OpenBugs,HighMemory).

astype's cross-dtype path used a scalar per-element Converts loop (UnmanagedMemoryBlock.CastTo) that was 1.5-3x slower than NumPy on contiguous arrays and paid a cache-hostile reorder for transposed/F-contiguous sources. Replace it with a SIMD copy-with-cast through NpyIter.Copy into a KEEPORDER output: the output's memory order mirrors the source's contiguity, matching NumPy's array_astype default order='K' / PyArray_NewLikeArray (numpy/_core/src/multiarray/methods.c:769). An F-contiguous or transposed source now casts as a flat both-F copy instead of a strided reorder. Wins (min of 3 runs, old astype -> new, vs NumPy 2.x), 2048x2048: i32->f64 contig 8.55ms -> 4.56ms (0.89x NumPy - beats it) i32->f64 transpose 10.76ms -> 4.54ms (0.89x NumPy) f32->f64 transpose 10.99ms -> 4.63ms (0.93x NumPy) i32->i32 transpose 1.83ms -> 1.82ms (0.91x NumPy) All non-narrowing cast families (int->float, float->float, int->int, same-width) are now within 10% of NumPy or faster across contig/strided/transpose/fortran/broadcast/negstride. float->int and signed-narrow->UInt64 are excluded via DivergesFromNumpyCast: they have no NumPy-faithful SIMD kernel yet (see 71b6c38), and NpyIter.Copy's scalar strided fallback is slower than the legacy contiguous Converts loop, so they keep the legacy path - correct and no slower than before. Restoring SIMD for float->int with NumPy wrapping semantics (raw cvtt truncation, no saturation) is the follow-up that closes the f64->i32 downcast gap. Correctness rests on the cast-semantics alignment in 71b6c38 (all three cast paths now bit-exact with NumPy across the full 15x15 matrix). Full suite: 0 failures, 9405 passed.

…float->int & signed->UInt64 Implements float->int casting correctly AND fast, closing the f64->i32 downcast gap and un-declining the family from DivergesFromNumpyCast. Root cause (proven in .claude/scripts/f64_to_i32_shootout.cs): the hardware truncating convert VCVTTPD2DQ/VCVTTPS2DQ already yields NumPy's exact result -- truncate toward zero, INT_MIN sentinel on NaN/overflow -- while C#'s (int)d and the IL conv.* opcodes SATURATE. So the fix is to USE cvtt, not to clamp. Changes: - EmitConvertTo (scalar element conversion; used by every cast-kernel tail and the ScalarOnly float->int pairs): route float->{integer,char} through the Converts.* table (NumPy-exact) instead of the saturating conv.* opcodes, and sign-extend signed->UInt64 via Conv_I8 instead of the zero-extending Conv_U8. The entire scalar cast path is now bit-exact with NumPy. - CastDoubleToInt32Contig / CastSingleToInt32Contig: width-adaptive cvtt helpers (AVX512 8/16-wide, AVX 4/8-wide, SSE2 2/4-wide, scalar Converts tail) registered for the float->int32 contig pairs in TryGetCastKernel. The IL generator's width-matched emitter can't express the double->int32 lane halving cleanly, so these use the "tidy helper -> single Call" pattern. - DivergesFromNumpyCast narrowed from {all float->int, signed->UInt64} to just signed->UInt64 (its vectorized widen still sign-extends to only 32 bits, so it stays on the correct ConvertValue scalar fallback). float->int is NumPy-faithful on every path now. Perf (min of 3, 2048x2048, new vs NumPy 2.x astype): f64->i32 contig 9.01ms -> 2.90ms (0.85x NumPy - beats) f64->i32 transpose 13.68ms -> 2.95ms (0.87x) f64->i32 fortran 13.76ms -> 3.04ms (0.88x) f64->i32 broadcast 14.35ms -> 2.39ms (0.92x) f64->i32 now beats NumPy on contig/transpose/fortran/broadcast. Strided/negstride f64->i32 improved 2.4-2.6x (8.3->3.4ms, 16.9->6.5ms) but remain on the scalar strided path (1.4-2.0x NumPy); a strided-cvtt inner loop is the follow-up. Correctness: all three cast paths bit-exact with NumPy across the full 3276-case matrix (cast_spec) plus multi-element SIMD-body verification (cast_multi_correctness: float->all 9 int dsts x contig/strided/reversed, signed->u64). Full suite: 0 failures, 9405 passed.

…non-contiguous gap) The contig float->int32 cvtt helper (26b710b) left strided and negative-stride f64->i32 on the scalar strided path (1.4-2.0x NumPy). Add a NumPy-faithful cvtt strided kernel that vectorizes the innermost axis for the three common dst-contiguous run shapes: - unit-stride source -> VCVTTPD2DQ forward - reversed source ([::-1])-> load forward chunk, cvtt, reverse lanes (Vector128.Shuffle) - positive-strided source -> AVX2 gather (a[:,::2] etc.), then cvtt Outer dims walk via an incremental-offset odometer (element strides). Anything else (non-contiguous dst, exotic strides, no AVX2) falls back to the NumPy-faithful scalar Converts path. Registered in TryGetStridedCastKernel for double->int32. Perf (min of 3, 2048x2048, new vs NumPy 2.x): f64->i32 strided large 3.41ms -> 1.98ms (0.83x NumPy - beats) f64->i32 negstride large 6.48ms -> 2.99ms (0.85x) f64->i32 negstride med 0.385 -> 0.154 (0.80x) With this, the astype variation matrix has ZERO real-array (>=256K-element) failures: every meaningful-size cast cell across f64<->i32 / f32<->f64 / i32<->i32 x contig/strided/transpose/fortran/broadcast/negstride is now within 10% of NumPy or faster. The remaining 21 matrix fails are all trivial-size (<=4K elements, sub-microsecond), bound by NDArray allocation overhead rather than the cast kernel. Correctness: cast_spec 0/3276 mismatches on all three paths; multi-element SIMD-body test (double->int32 strided/reversed/contig) 0 failures. Full suite: 0 failures, 9405 passed.

…ex->bool dropping imaginary part Builds the differential-fuzzer harness (Plan A items #1+#2 foundation): an offline, committed NumPy 2.4.2 oracle corpus replayed bit-exact by MSTest, so every NpyIter-backed operation can be proven identical to NumPy across the full input space -- caught systematically, not by hand-picked cases. The motivating failure class (the cast saturate-vs-wrap bug, latent in where/copyto/concatenate) must be impossible to ship again. Harness - oracle/gen_oracle.py + oracle/layout_catalog.py: Python generator emitting JSONL cases {op, params, operands[], expected}. Each operand is serialized bytes-exact as (dtype, shape, element-strides, element-offset, bufferSize, base-buffer-hex). 26 single-array layouts: C/F-contiguous (1/2/3-D), transposed, strided, negative-stride, offset slices, broadcast (stride-0), scalar-broadcast, 0-D views, newaxis, singleton dim, empty, empty-composed, 5-D. Edge values are FRONT-LOADED (NaN / +/-inf / overflow / -0.0) so even an 8-element operand exercises the float->int cvtt sentinel paths. describe() self-validates that every operand is a genuine VIEW into its base buffer, so a layout that accidentally yields a copy/scalar fails loudly at generation. - test/NumSharp.UnitTest/Fuzz/FuzzCorpus.cs: reconstructs the EXACT operand views from raw bytes via contiguous-storage + Alias (handles broadcast where buffer < size, negative strides, offset slices, 0-D, empty) -- no Python at test time. - BitDiff.cs: element-wise bit-exact compare with NaN tokenized (payload non-contractual); -0.0 and +/-inf are bit-compared since NumPy preserves signed zero / canonical infinity. - OpRegistry.cs: op-name -> NumSharp call (astype today; binary/comparison/unary/reduction /where tiers extend the switch). - FuzzCorpusTests.cs: one data-driven [TestCategory("FuzzMatrix")] test per corpus file; a failure lists every divergent cell (not first-failure-wins). - HarnessSelfTests.cs: proves the harness has teeth -- BitDiff detects value/NaN-vs-number /signed-zero divergence and treats differing NaN payloads as equal; and the corpus pipeline genuinely exercises the float->int wrap-not-saturate semantics on real edges. - Corpus committed under Fuzz/corpus/ and copied to test output: astype_smoke (312 cases) + astype_full (4394 = 13 x 13 NumPy-representable dtypes x 26 layouts). Bug found and fixed (the matrix's first real catch) - complex->bool through the NpyIter scalar-cast fallback (NpyIterCasting.ConvertValue) dropped the imaginary part: it took c.Real and applied real != 0, so complex(0, nonzero) and complex(-0.0, nonzero) returned False. NumPy defines bool(z) == (z != 0): truthy when EITHER part is non-zero. Latent wherever complex->bool casts through the iterator (astype, np.where, np.copyto, np.concatenate). Fixed to honor both parts, matching the already-correct Converts.ToBoolean(Complex). All 15 divergent cells across every layout now match NumPy. Verification - astype_full: 0 / 4394 divergences from NumPy. - Full CI suite (net10.0, TestCategory!=OpenBugs&!=HighMemory): 0 failed / 9412 passed. Scope note: Char and Decimal have no NumPy analog, so they are excluded from the differential corpus; both remain covered by the existing Converts-oracle cast tests.

…classifier Extends the differential fuzzer to binary arithmetic (add/subtract/multiply/divide) across 20 NEP50 dtype pairs x 9 pairwise operand layouts, bit-comparing value AND result dtype AND broadcast shape against NumPy 2.4.2. 720 cases. Pairwise layouts (layout_catalog.PAIR_LAYOUTS): pp_contig_contig (SimdFull), pp_contig_fortran, pp_contig_strided (SimdChunk), pp_strided_strided (General), pp_scalar_right / pp_scalar_left (SimdScalar*), pp_broadcast_row / pp_broadcast_col, pp_negstride_both. Operands are emitted at natural shapes; the op performs broadcasting. The matrix now asserts three things per case: result dtype == NumPy (the headline NEP50 promotion failure), result shape == NumPy (broadcasting), and bit-exact bytes. Misaligned classifier (never-silent intended divergences) - test/.../Fuzz/MisalignedRegistry.cs: the explicit, documented set of intended NumSharp-vs-NumPy differences. The matrix excuses ONLY these patterns (logging a per-reason count) and fails on anything else. Cases stay in the corpus so a future fix simply starts passing, and any drift past tolerance turns back into a hard failure. - BitDiff.WithinUlp: ULP-distance helper used ONLY to classify documented near-misses, never to relax the default bit-exact gate. Two intended divergences documented (maintainer decision: keep as Misaligned) 1. NEP50 weak-scalar (41 cases): NumSharp treats a 0-D array operand as a weak scalar (the other operand's dtype drives promotion), e.g. array(int32)+zerod(int64) -> int32 where NumPy gives int64; array(uint8)+zerod(int8) -> uint8 vs NumPy int16. NEP50 makes 0-D arrays full promotion participants; only Python scalar literals are weak. NumSharp cannot distinguish the two (both are 0-D NDArrays), and `arr + 5` ergonomics were chosen over strict parity. Array+array promotion is correct across all layouts. 2. Complex true-division ~1 ULP (11 cases): complex / {float,complex,int} differs from NumPy's npy_cdivide by one ULP (System.Numerics.Complex uses different scaling). Complex add/sub/mul are bit-exact; only division diverges. Verification: Binary_Arith excuses exactly 52 divergences (11 complex-divide + 41 weak-scalar) and bit-matches the other 668 cases; full FuzzMatrix category 0 failed / 8 passed. No NumSharp.Core changes in this commit (harness + corpus + generator only).

…Bugs] Extends the binary differential generator to floor_divide / mod / power and runs them as a known-failing [OpenBugs] reproduction (excluded from CI; remove the tag when fixed). The bit-exact add/sub/mul/divide matrix (Binary_Arith) stays green; these three ops are split into a separate binary_divmod_power.jsonl corpus + Binary_DivModPower_KnownDivergences so the CI gate is not broken while the divergences are tracked. 206/430 cases diverge from NumPy 2.4.2 (27 NEP50 weak-scalar cases auto-excused). The real bugs, confirmed directly against NumPy: integer ÷0 / mod 0 NumPy returns 0 (with RuntimeWarning). NumSharp: - int32 //0 -> 2147483647 / -2147483648 (saturated sentinel), not 0 - int32 %0 -> the dividend (1,-7,5), not 0 - uint8 //0 and %0 -> throw DivideByZeroException float //0 NumPy: 1.0//0 -> +inf, -1.0//0 -> -inf, 0.0//0 -> nan. NumSharp: NaN for all (loses the ±inf sign result). mixed-precision mod mod(float32, float64) computes in float32 then widens (low mantissa zeroed) instead of promoting to float64 first. add/sub/mul/div promote correctly; only mod loses precision. complex power complex ** {float,complex,int} differs by ~1 ULP (like complex divide) plus some gross edge divergences (NumPy NaN vs NumSharp 0). Correct already: mod sign convention is floored (mod(-7,3)==2, matches NumPy), and float %0 is nan on both sides. CI suite (net10.0, !OpenBugs&!HighMemory): 0 failed / 9413 passed. FuzzMatrix: 8 passed.

Adds the comparison tier (equal/not_equal/less/greater/less_equal/greater_equal -> bool) across the 20 NEP50 dtype pairs x 9 pairwise layouts. 1080 cases. Found: NumSharp's <= and >= return True when an operand is NaN; IEEE/NumPy return False (only != is True against NaN). <, >, ==, != all handle NaN correctly. 122 cases, every one the NumPy-false -> NumSharp-true signature on the NaN element. Documented in MisalignedRegistry (tagged [known bug]) with a tight signature so the matrix stays green; tracked for fix. Comparison matrix excuses exactly 122 NaN-ordering divergences and bit-matches the other 958. FuzzMatrix category green.

…classes Adds the unary tier (negative/abs/sign/sqrt/cbrt/square/reciprocal/floor/ceil/trunc/ sin/cos/tan/exp/log) across 7 dtypes x 26 single-array layouts. 2574 cases. Adds Half-ULP support to BitDiff and routes THREW through the Misaligned classifier (new DivergenceKind.Threw). Also fixes a generator artifact: np.ascontiguousarray forces ndim>=1, which corrupted the expected shape of 0-D results to [1]. gen_unary/gen_binary now read r.shape BEFORE ascontiguousarray (used only for the bytes), so scalar results compare as [] (matching NumSharp). Documented divergence classes (excused, never silent; remove from MisalignedRegistry as fixed): 494 unary NEP50 promotion NumSharp promotes int->float64 for unary float ops regardless of width; NumPy uses width-based float (bool/int8->float16, int16->float32, int32+->float64). square/reciprocal/floor dtype also differ. 90 complex unary square catastrophic cancellation (re^2-im^2 -> 0); sin/cos/tan/log inf/NaN edge handling differs (System.Numerics vs npy_c*). 72 unary ~ULP transcendental / magnitude (abs complex) within 2 ULP. 61 reciprocal integer reciprocal returns 0 vs NumPy ÷0 sentinel; non-contiguous operand throws "Can't return a memory address". 24 negative on unsigned np.negative(uint8) throws NotSupportedException; NumPy wraps modulo. Unary matrix excuses these and bit-matches the rest. FuzzMatrix category: 10 passed. Tracked for fixes in task #9.

…ergences Adds the reduction tier (sum/prod/min/max/mean/std/var/argmax/argmin/all/any) across axis in {None, 0, ndim-1} x keepdims {false,true} x 7 dtypes x 10 layouts. 3640 cases. OpRegistry gains a reduction dispatch handling the heterogeneous NumSharp signatures (mean/std/var split None-vs-int overloads; argmax/argmin require an int axis; prod takes a Type arg). Reductions are mostly solid — only 80/3640 diverge. all/any/argmax/argmin/prod are bit-exact. Documented divergence classes (excused in MisalignedRegistry; remove as fixed, task #10): reduction NaN propagation min/max/mean/std/var return the non-NaN result; NumPy keeps NaN (regular reductions propagate; only nan* variants skip). complex axis reduction mean/std/var/sum of complex along an axis throws "Can't construct NDCoordinatesAxisIncrementor with a vector shape". bool min/max along axis returns True where NumPy returns False. summation precision sum/mean/std/var floating accumulation differs from NumPy pairwise / two-pass (algorithm order). reduction result dtype NEP50 accumulator width / complex->real (std/var) differences. FuzzMatrix category: 11 passed.

Adds the selection tier: np.where(cond,x,y) across 8 dtype pairs x 5 triple-operand layouts (contig / broadcast-x-y / strided / scalar-cond / broadcast-cond) and np.place(arr,mask,vals) across 3 contiguous layouts x 5 dtypes. where=40 cases, place=15 cases. np.place is bit-exact with NumPy on all 15 cases. np.where bit-matches 33/40; documented: 3 NEP50 weak-scalar where(cond, array, 0-D scalar) promotes weakly (generalized the weak-scalar rule to any multi-operand op with a 0-D operand). 4 complex np.where throws NotSupportedException "Zero-push unsupported for Complex". With this, all six A1 op tiers (T1 cast, T2 binary, T3 comparison, T4 unary, T5 reduce, T6 where/place) are implemented and green. FuzzMatrix category: 13 passed. Every divergence is either bit-exact, documented in MisalignedRegistry, or a tracked [OpenBugs] reproduction.

Adds oracle/fuzz_random.py: a seeded, reproducible random generator that draws random transposed/strided/reversed views (random ndim<=4, dims, stride perms/signs), random dtypes, and random ops across all tiers, computes the NumPy 2.4.2 result, and writes a committed corpus the C# harness replays. FuzzRandom replays a 2000-case seed (1234); FuzzRegression replays shrunk soak failures from corpus/regressions/ (empty until the soak finds one). Shrinker.cs minimizes a failing element-wise case to a single 0-D element using the corpus's own expected byte at the diverging index (no NumPy needed); RunCorpus emits the minimal repro in the failure message so a soak find is immediately committable. A self-test proves the shrunk case reproduces the divergence (bool add True+True). The random fuzzer is scoped to NumSharp-PRODUCIBLE layouts: NumSharp's slicing normalizes offset into the storage base (Shape.offset stays 0) and keeps consistent size-1 strides, so reconstructing NumPy's RAW offset!=0 / junk-size-1-stride representation exposes ops that assume the normalized form (real but unreachable via the API; tracked as task #11). With that scoping the random gate is green; it surfaced three NEW divergence classes the fixed matrices missed (tasks #11, #12): bool arithmetic True+True -> NumSharp 2 vs NumPy bool True. size-1 collapse NumSharp collapses an all-size-1 result to 0-D; NumPy keeps [1]. complex binary multiply (and divide on some data) catastrophic cancellation + ~1 ULP. Shape divergences now route through the classifier too. FuzzMatrix category: 16 passed.

…EADME The FuzzMatrix differential gate (cast/binary/comparison/unary/reduce/where/place matrices + seeded FuzzRandom + FuzzRegression) already runs on every push/PR: the tests are [TestCategory("FuzzMatrix")], not excluded by the CI filter, and replay the committed offline corpora with no Python. Documented that in build-and-release.yml's filter comment. Adds .github/workflows/fuzz-soak.yml: a nightly (04:00 UTC) + manual job that installs NumPy 2.4.2, sweeps seeds through oracle/fuzz_random.py (~200k cases/seed, ~1M/night), and replays each through the harness. On a divergence the shrunk minimal repro is printed and the failing corpus uploaded as an artifact, to be pinned under corpus/regressions/ for FuzzRegression. Adds test/NumSharp.UnitTest/Fuzz/README.md: how the offline oracle/replay model works, corpus regeneration commands, the CI cadence, and the full documented-divergence ledger with task refs. Completes Plan A (A0 harness, A1 T1-T6 op matrices, A2 random fuzzer + shrinker, A3 CI). Full suite (net10.0, !OpenBugs&!HighMemory): 0 failed / 9421 passed; FuzzMatrix: 16 passed.

Adds docs/FUZZ_FINDINGS.md: every NumSharp-vs-NumPy-2.4.2 divergence the differential fuzzer surfaced, each bit-exact verified, with minimal NumPy-vs-NumSharp reproductions, root cause where known, and disposition (FIXED / BUG / INTENDED / SCOPING) mapped to tasks #7-#12. 22 findings: 1 fixed (complex->bool), 18 confirmed bugs (integer ÷0/mod0, float //0, mixed-precision mod, complex power; NaN <=/>=; NEP50 unary promotion, negative-unsigned, reciprocal, complex unary; reduction NaN propagation, complex-axis throw, bool min/max, summation precision, result dtype; complex where throw; bool arithmetic, size-1 collapse, complex binary cancellation), 2 intended Misaligned (NEP50 weak-scalar, complex-divide ULP), 1 scoping note (ops vs raw offset!=0 / junk size-1 strides).

…r behaviors) Adds docs/FUZZ_PLAN_NEXT.md: an actionable plan for the two open coverage rows. Part 1 — #2 sections C (per-operand flags) + E (composite paths): black-box, extends the existing differential harness with out= / where= params and operand-relationship layouts (aliased, overlapping, write-masked, out-strided/broadcast/cross-dtype). C5 + error parity fold in test-kind #4. Prerequisite: OpRegistry/corpus support for a pre-seeded out operand and a mask. Part 2 — #3 NpyIter behaviors (= section D, white-box): port NumPy's test_nditer.py (~106 fns). B0 coverage ledger -> B1 expose iterator state (ChosenOrder/CoalescedNDim/IsBuffered/InnerStrides) -> B2 order+coalescing (highest value, guards the KEEPORDER stride-sort) -> B3 buffered casting -> B4 op_axes/remove_axis/ranged (triage feature gaps) -> B5 nditer-config error parity. Acceptance: every test_iter_* green / [Misaligned] / tracked [OpenBugs] - nothing untracked. Plus an adjacent-gaps appendix (op breadth incl matmul/manipulation/bitwise/nan-aware, order=/dtype=/ ddof params, SIMD-tail shapes, leak/perf/metamorphic) and an effort-shape table.

…ion north star) Adds docs/ROADMAP.md consolidating the entire path from the current baseline to the /np-function goal (every np.* bit-identical to NumPy 2.4.2 OR >=1.5x faster), correctness-first with the differential fuzzer as the regression net for the perf refactor. Five phases: 0 Baseline (Plan A done) — harness, T1-T6 matrices, random fuzzer, CI gate. 1 Correctness backlog — fix the 22 findings, grouped by shared root (F1 div-by-zero, F2 NaN, F3 NEP50 promotion, F4 unsupported throws, F5 complex algorithms, F6 bool, F7 size-1 shape, F8 summation precision, F9 representation). Each fix flips its classifier branch and re-arms the gate; F5/F8/F9 are implement-vs-document judgment calls. 2 Coverage breadth — 2A finish #2 sections C+E (out=/where=/aliasing/overlap/mask), 2B op tiers T7-T15 (manipulation, matmul/dot, bitwise, nan-aware, cumulative, stat, logic, sorting, multi-output) = the ~75 untested transformation ops. 3 NpyIter behavioral parity (#3 / Plan B) — port test_nditer.py; de-risks Phase 5. 4 Depth — params (order=/dtype=/ddof/axis), SIMD-tail & large shapes, error parity (#4), unmanaged lifecycle (#5), metamorphic invariants (#7). 5 Performance — the >=1.5x-NumPy mission (#6): the DirectILKernelGenerator -> ILKernelGenerator (NpyIter-driven) migration in CLAUDE.md priority order, benchmark ledger, perf CI gates; the differential matrices keep the kernel rewrite from breaking parity. Includes a dependency graph, value-weighted recommended order, and an effort-shape table. Detail for phases 2A/3 lives in FUZZ_PLAN_NEXT.md; findings in FUZZ_FINDINGS.md.

… (Phase 1 F1) Ports NumPy's division/modulo C kernels into a new NpyDivision helper and routes both IL emission paths through it, fixing the floor_divide/mod parity defects the differential fuzzer surfaced (FUZZ_FINDINGS #2, #3, #4 — all now bit-exact). Root cause ---------- The old EmitFloorDivideOperation/EmitModOperation emitted inline IL that: * threw DivideByZeroException for unsigned integer ÷0 (and returned a saturated sentinel / the raw dividend for signed int ÷0 and mod-0); * force-converted float //0 to NaN via EmitFloorWithInfToNaN, losing NumPy's ±inf (sign-of-dividend) result and mis-handling inf//x, x//inf, overflow; * computed signed-integer floor division by round-tripping through double, losing precision for large int64 and skipping NumPy's snap-to-nearest. Fix --- New src/NumSharp.Core/Utilities/NpyDivision.cs replicates NumPy exactly: * Integer floor_div_@TYPE@ (loops_arithmetic.dispatch.c.src): ÷0 -> 0, signed floor toward -inf, MIN/-1 wraps to MIN (overflow), sub-int types compute in the int domain so the #DE trap can't fire. * Integer remainder (loops_modulo.dispatch.c.src): ÷0 -> 0, floored (Python) sign convention, MIN % -1 == 0. * Float npy_floor_divide / npy_remainder (CPython divmod port in npy_math_internal.h.src): fmod -> sign-fixup -> snap-to-nearest; b==0 returns a/b (±inf / nan), never a forced NaN. Computed in float32 for Single to stay bit-exact. Both the same-type generic path (DirectILKernelGenerator.Binary.cs EmitFloorDivideOperation<T>) and the resultType path (DirectILKernelGenerator.cs EmitFloorDivideOperation/EmitModOperation) now emit a single Call to the matching helper, mirroring the existing NpyIntegerPower pattern. Decimal/Half/Complex are unaffected (handled by their own emit paths before this point). Verification ------------ * 26 hand-picked edge cases (0.7//0.1==6.0, -2.0//inf==-1.0, inf//2.0==nan, 1e308//1e-300==inf, MIN//-1, MIN%-1, signed/unsigned ÷0) all bit-exact. * mod(f32,f64) now promotes to float64 and matches NumPy bit-exact (#4). * binary_divmod_power.jsonl promoted from [OpenBugs] to [FuzzMatrix]: floor_divide and mod are now CI-gated bit-exact; the only remaining excused divergence is complex power (~ULP/edge), documented for Phase 1 F5. * net10.0 FuzzMatrix gate: 16/16 green. Clears FUZZ_FINDINGS #2, #3, #4 (the reciprocal-int part of F1 tracks with the unary batch). Complex power (#5) reclassified BUG -> F5 (algorithmic, deferred).

…se 1 F2) The scalar comparison kernel emitted 'a <= b' as '!(a > b)' using the ordered Cgt opcode (Clt for '>='). For a NaN operand the ordered compare yields false, so the negation produced **true** — but IEEE/NumPy make every ordered comparison with NaN false (only '!=' is true). FUZZ_FINDINGS #6, surfaced by the comparison matrix (122 cases, all on the NaN element). Fix: for float operands use the UNORDERED compare (Cgt_Un / Clt_Un). With NaN these yield true, so the negation is false — correct. The unordered opcodes were already used for unsigned-integer comparison, so the change is just isUnsigned -> (isUnsigned || isFloat); signed integers keep the ordered Cgt/Clt (no NaN domain). '<' and '>' were already correct (Clt/Cgt already yield false for NaN), as were '==' / '!='. The SIMD path (Vector.LessThanOrEqual/GreaterThanOrEqual) was already IEEE-correct, so only EmitComparisonOperation (scalar/strided/tail) needed the change. Verified bit-exact on scalar (1-D), SIMD (NaN at lanes 0 and 17 of a 32-wide array), strided (every-2nd view), and float32 paths. The MisalignedRegistry NaN-ordering branch is removed so the comparison matrix now verifies this bit-exact; net10.0 FuzzMatrix gate 17/17 green. Clears FUZZ_FINDINGS #6.

…ase 1 F4) np.negative(NDArray) called the legacy Regen-templated NDArray.negative(), a per-dtype switch that threw NotSupportedException for every unsigned dtype (Byte, UInt16, UInt32, UInt64, Char) and read @out.Address (so non-contiguous operands also threw 'Can't return a memory address when sliced/broadcasted'). NumPy negates unsigned integers by two's-complement wrap: np.negative(uint8(1)) == 255. FUZZ_FINDINGS #8. Fix: route np.negative through nd.TensorEngine.Negate(nd) — the same engine path the unary '-' operator (NDArray.Primitive.cs) and nd.negate() already use. Its IL kernel emits the two's-complement (~v + 1) for unsigned types and drives strided / non-contiguous operands through NpyIter. The legacy hand-written nd.negative() is left untouched (still the public method) but is no longer on the np.negative path. Verified bit-exact vs NumPy on contiguous and strided uint8/uint16/uint32 (e.g. [0,1,2,255] -> [0,255,254,1]) and unchanged for signed/float. The MisalignedRegistry 'negative on unsigned throws' branch is removed so the unary matrix verifies it bit-exact. Clears FUZZ_FINDINGS #8 (reciprocal non-contiguous, the sibling Address bug in #9, tracked separately).

…cs (Phase 1 F3a) NumSharp promoted every integer input to float64 for the float-producing unary ufuncs (sqrt/cbrt/exp/log/trig/...), regardless of input width. NumPy NEP50 uses WIDTH-BASED promotion: bool/int8/uint8 -> float16, int16/uint16 -> float32, int32/uint32/int64/uint64 -> float64 (float/complex preserved). FUZZ_FINDINGS #7. Fix: new DefaultEngine.ResolveUnaryFloatReturnType implements the width-based rule; the 20 transcendental Default.<Op>.cs files (ACos/ASin/ATan/Cbrt/Cos/Cosh/Deg2Rad/ Exp/Exp2/Expm1/Log/Log10/Log1p/Log2/Rad2Deg/Sin/Sinh/Sqrt/Tan/Tanh) now call it instead of ResolveUnaryReturnType (which widened to float64 via GetComputingType). The int->Half / int->Single unary kernels already existed, so this is a pure result-dtype routing change; int32/int64 are unaffected (already float64). The dtype-PRESERVING ufuncs (square/floor/ceil/trunc/round/reciprocal) are intentionally left on the old resolver pending F3b (they need integer identity / x*x / int-reciprocal kernels to preserve the integer dtype). Verification: * 364 of 494 unary dtype divergences clear bit-exact; the remaining 130 are the F3b-pending preserve-dtype ops, now scoped in MisalignedRegistry so a transcendental promotion regression fails the gate (not silently excused). * Half/Single transcendental values stay within 2 ULP of NumPy's float16/float32 libm (documented algorithm difference, same class as complex-divide ULP). * DtypeCoverageTests.Sqrt_IntegerDtypes updated: it asserted the old uniform float64 (codified the bug); now asserts the NumPy width-based dtype per input. * Full net10.0 suite green: 9422 passed / 0 failed; FuzzMatrix 17/17. Clears the transcendental half of FUZZ_FINDINGS #7 (and #15 dtype for those ops).

…Phase 1 F2-reductions) NumPy's (non-nan*) min/max PROPAGATE a NaN operand; NumSharp's SIMD reduction used hardware Vector.Min/Max (MINPS/MAXPS), which drop NaN and return the other lane, so np.min([nan,-inf,1]) gave -inf instead of nan. sum/mean/std/var already propagated (NaN arithmetic). FUZZ_FINDINGS #11. Fix: emit/compute the NaN-propagating form for float/double Min/Max — ConditionalSelect(Equals(acc,acc) & Equals(v,v), MinMax(acc,v), acc + v) — hardware min/max where both lanes are ordered (non-NaN), and acc+v (= NaN) where either is NaN (the self-equality mask is the standard SIMD NaN test). Applied at every combine site of the FLAT reduction: * EmitVectorBinaryReductionOp — unrolled body, pairwise tree merge, remainder (DirectILKernelGenerator.Reduction.cs); * EmitVectorReductionOp — the GetLower/GetUpper horizontal tree-reduce fold; * CombineVectors256/128 — the C# axis-SIMD/gather helpers (DirectILKernelGenerator.Reduction.Axis.Simd.cs); the typeof(T) guard is a JIT-time constant, so integer Min/Max keep the single-instruction path. The scalar tail / lane combine already used Math.Min/Max (which propagate). Verified: flat np.min/np.max now propagate NaN for every array size 3..257 and every NaN position, double and float32; integer min/max unchanged; full net10.0 suite green (9422 passed / 0 failed). Scope: the axis (vertical/strided) SIMD min/max in CreateAxisReductionKernelTyped has a further combine site that still drops NaN (e.g. np.min(a2d, axis=0)); the classifier now excuses ONLY axis!=null NaN-propagation, so the flat path is gate-verified and a flat regression fails. Partially clears FUZZ_FINDINGS #11 (flat); axis tracked under F2-reductions.

…Phase 1 F6 + F4) Two independent parity defects the differential fuzzer surfaced, both now bit-exact and CI-gated (their MisalignedRegistry branches removed). F6 — bool arithmetic (FUZZ_FINDINGS #17). NumPy's bool dtype has no integer add/ multiply loop: '+' is logical OR, '*' is logical AND, so True + True == True (raw byte 1). NumSharp did byte arithmetic and stored 2 in a bool slot. Fix: in ExecuteBinaryOp, when both operands are bool (resultType == Boolean) remap Add -> BitwiseOr and Multiply -> BitwiseAnd before kernel dispatch, so every SIMD/scalar and same-type/mixed path writes a normalized 0/1 byte. '-' has no bool loop and keeps throwing like NumPy. Verified scalar + 32-wide SIMD. F4 — complex np.where (FUZZ_FINDINGS #16). where(cond, x, y) threw 'NotSupportedException: Zero-push unsupported for Complex' whenever the promoted result was complex: NpyExpr.EmitPushZeroPublic (WhereNode pushes a typed zero for the unselected branch) had no Complex case. Both-complex where already worked; only the MIXED promotion where(cond, complex, real) hit it. Fix: add Complex (Complex.Zero) and Half cases. Now where([T,F,T], complex, float) == [(1+2j),(8+0j),(5+6j)] bit-exact. Harness: Shrinker_MinimalCaseReproducesDivergence used bool True+True as its fixture (now fixed, so it no longer diverges); retargeted to a synthetic int32-add case with a planted-wrong expected buffer, so it exercises the shrinker mechanism without depending on a live bug. Verified: net10.0 FuzzMatrix 17/17; full suite 9422 passed / 0 failed. Clears FUZZ_FINDINGS #16 and #17.

…ank operand (Phase 1 F7) Shape.Broadcast's size-1 fast path treated a 1-D [1] (NDim==1, size==1) identically to a 0-D scalar and broadcast it to the OTHER operand's dimensions. When that other operand had fewer dimensions, [1] adopted the lower rank — so [1] + 0-D scalar gave shape [] instead of [1] (a dropped dimension). The path was asymmetric: scalar + [1] was already correct. FUZZ_FINDINGS #18 (found by the random fuzzer). Fix: guard each size-1 collapse on the other operand having at least equal rank — left collapses only when rightShape.NDim >= leftShape.NDim, and symmetrically for the right — so the result keeps ndim == max(ndims), matching NumPy. The [1] -> higher-rank broadcasts (e.g. [1] + [2,3] -> [2,3]) are unchanged (guard is satisfied there). Verified: [1]+scalar -> [1], scalar+[1] -> [1], [1]+[1] -> [1], [2]+scalar -> [2], [1]+[2,3] -> [2,3], [3]+[1] -> [3], [2,1]+[3] -> [2,3], [[1]]+scalar -> [1,1], scalar+scalar -> []. Shape.Broadcast is core (used by every broadcasting op); full net10.0 suite green (9422 passed / 0 failed), FuzzMatrix 17/17. Clears FUZZ_FINDINGS #18.

…al-noncontig (Phase 1) Two findings turned out PARTIALLY resolved in committed code; tightened the classifier to verify the resolved sub-cases bit-exact while excusing only the precise residuals. FUZZ_FINDINGS #12 (complex axis reduction): reducing a complex 2-D+ array along an axis (sum/mean/prod/std/var/min/max) now works bit-exact (std(complex2d, axis=0) -> real float64, verified without the uncommitted NpyIter.cs). Only a 1-D complex array reduced along its only axis still throws 'NDCoordinatesAxisIncrementor with a vector shape'. The classifier now excuses just that 1-D Threw case; the 2-D matrix is gate-verified. FUZZ_FINDINGS #9 (reciprocal non-contiguous): reciprocal on a non-contiguous FLOAT operand (transposed/strided) now works. reciprocal(int) still returns the float64 reciprocal (vs NumPy's integer ÷0 sentinel) and still throws on a non-contiguous integer operand (the int->float path needs a flat Address) — both pending F3b. The classifier scoped to integer input (value diff or non-contig throw); float non-contig is gate-verified. No production code change — corpus replay confirmed the resolutions are in committed code. net10.0 FuzzMatrix 17/17 green. Documents the exact resolved-vs-pending split for #9 and #12.

…l matrix (Phase 2B) Built the T8 linear-algebra differential corpus (matmul/dot/outer, 408 cases across the gufunc shape space x 6 dtypes x C/F layouts). It immediately surfaced three real bugs — all now fixed and bit-exact CI-gated: 1. np.matmul only handled 2-D x 2-D. Every other NumPy-supported shape was broken: matmul(1d,1d) -> Debug.Assert crash; matmul(2d,1d) -> [n,1] (extra dim); matmul(1d,2d) -> throws 'core dimension mismatch'; batched (b,n,k)@(b,k,m) -> throws 'cannot be broadcast'. Rewrote Matmul to the NumPy gufunc (n?,k),(k,m?)->(n?,m?): promote 1-D operands (lhs->[1,k], rhs->[k,1]), run a batched matmul that broadcasts ONLY the stack dims (the old path broadcast the full shapes incl. the matrix dims, hence the throw) and does a 2-D MultiplyMatrix per batch element, then squeeze the inserted dims. Verified bit-exact vs NumPy for 2-D, 1-D@1-D, 2-D@1-D, 1-D@2-D, 3-D/4-D batched, broadcast batch. 2. np.squeeze(arr, axis) over-collapsed: after removing the named axis it also collapsed a remaining length-1 dim to 0-D (squeeze([1,1],0) -> [] instead of [1]), diverging from NumPy and breaking matmul's 1-D-promotion squeeze. Fixed squeeze_fast to drop ONLY the named axis (collapse to scalar only when no axis remains). 3. np.dot(N-D, 1-D) returned the wrong dtype: it did np.sum(a*b, axis=1), which used the reduction accumulator (int32 -> int64, uint8 -> uint64) and a hard-coded axis=1. NumPy's dot preserves the input dtype and contracts the last axis. Routed the N-D-dot-1-D case through Matmul (same contraction, correct dtype + axis). Harness: OpRegistry gains matmul/dot/outer; gen_oracle gains a mode (gen_matmul with C/F operand layouts via the C-contig-base + transposed-view pattern); new matmul.jsonl corpus + [FuzzMatrix] Matmul test. Verified: matmul.jsonl 408/408 bit-exact; net10.0 FuzzMatrix 18/18; full suite 9423/0.

Continues SubwordCopy into the cross-WIDTH family: {i16,u16,char} -> {i8,u8} (low-byte truncate) and {i16,u16,char} -> bool (!=0) for strided views. 2-byte sources are not VPGATHER-eligible, so TryGetIntToNarrowStridedKernel (4/8-byte gather) declines them and the generic NarrowInt IL kernel fell to a scalar inner loop on a non-unit inner stride. All 36 domain cells (3 src x 3 dst x 4 strided layouts) now win, clean best-of-7: i16|strided|i8 0.76 -> 1.50 char|strided|i8 0.79 -> 1.73 i16|strided|bool 0.65 -> 2.10 char|strided|bool 0.68 -> 2.39 i16|negcol|i8 0.89 -> 2.73 i16|negcol|bool ~ -> 4.63 sliced/negrow (ss==1 path) 1.19-5.x — NO regression vs the prior winners. Kernel (DirectILKernelGenerator.Cast.SubwordNarrow.cs), two-stage SIMD reusing the SubwordCopy shuffles, inner loop inlined in the odometer (per the SubwordCopy PERF NOTE — a per-row helper costs ~6x): * ss==2 deinterleave even 2B elems (VPACKUSDW+VPERMQ), then narrow; * ss==-1 reverse 2B elems (VPSHUFB+VPERMQ), then narrow; * ss==1 contiguous truncating Vector256.Narrow (32/iter — the sliced/negrow path the generic kernel already won, kept fast so no regression); * narrow step: int = mask low byte + VPACKUSWB (== unchecked (byte) == NumPy wrap); bool = AndNot(CompareEqual(v,0), 1) then pack. f16 source is excluded (float->int is a value truncate, kept on the Half kernels); only C-NON-contiguous layouts reach here (C-contig uses TryGetCastKernel, untouched). Routed in TryGetStridedCastKernel after SubwordCopy, ahead of the int-narrow / to-bool resolvers. Validation * BIT-EXACT vs NumPy 2.4.2: 208/208 hashes (26 byte-copy + narrow pairs x 8 layouts), base arange%65521 (incl. 0 for the bool zero-branch + diverse high bytes for the truncation) — subword_cross_npref.py / subword_cross_verify.cs. * Standalone math proven 0-diff + benchmarked (subword_narrow_poc.cs). * Full suite green: 9956 passed / 0 failed (net10.0, excl OpenBugs/HighMemory). Note: scoreboard refresh deferred — this session's repeated sweeps ran in a degraded environment (accumulated dotnet build-server/run processes inflating the best-of-3 NS times ~7pp globally; the clean baseline committed at 9aea3ac is 93.4%). Per-cell wins above are best-of-7 (min over rounds, robust to load); the full-matrix refresh should be re-run on a clean box.

…+ plan docs) Reflects the Wave 16/16b/16c kernels in the project docs: - .claude/CLAUDE.md: DirectILKernelGenerator partial count 60 -> 62; Cast & Copy list adds .Cast.SubwordCopy.cs + .Cast.SubwordNarrow.cs; the cast coverage row now documents the sub-word strided/negcol SIMD lane shuffles (VPACKUS deinterleave / VPSHUFB reverse for same-type & same-size-bit copies; +low-byte/ !=0 narrow for 2B-int->{1B,bool}) and the inline-in-odometer perf rule (a per-row helper costs ~6x). The old "2-byte/1-byte strided stay scalar" caveat is removed. - CAST_BEAT_NUMPY_PLAN.md: new §12 (Wave 16 table + kernels + the inline gotcha + 208/208 bit-exact + the best-of-3/throttle measurement caveat + the noise finding); §11 residual list updated (sub-word strided no longer hardware-limited). - CAST_REMAINDER_133_PLAN.md: §0 banner marking §2 CLOSED by Wave 16 and confirming the §6 noise hypothesis (true >=0.9 ~96%, not the sweep headline); remaining real laggards narrowed to §3 (alloc-bound), §4 (AVX512 i64/u64->f16), §5 (memory-bound), + a couple genuine large-type cells. Scoreboard refresh deferred to a clean environment (this session's repeated sweeps ran degraded by accumulated dotnet processes; per-cell wins are best-of-7, robust).

…6 shipped) Full cast-matrix sweep on a settled environment (no stray dotnet processes): 1438/1568 (91.7%) >=0.9, 1374 (87.6%) win >=1.0, geomean 1.705. The net count vs the prior 93.4% snapshot is within the best-of-3 noise band (~140 narrowly-winning cells jitter ±0.9 run-to-run); the Wave 16/16b/16c cells now win in this run. Confirmed via clean best-of-7 (robust to load) that the sweep's in-domain LAGs are noise: char|strided|i16 1.66, char|strided|u16 1.84, char|strided|u8 1.82 (sweep read 0.77-0.88). The remaining sub-word strided LAGs are a DIFFERENT, unaddressed family: 1B->2B WIDENING (i8/u8/bool -> i16/u16/char) at 0.55-0.91 (the inverse of Wave 16c narrowing) — the natural next target.

The inverse of Wave 16c: {bool,u8,i8} -> {i16,u16,char} for strided views. 1-byte sources aren't VPGATHER-eligible, so the generic WidenInt IL kernel fell to a scalar inner loop on a non-unit inner stride (the i8|strided|i16 0.73 / bool|strided|i16 0.66 / u8|strided|char 0.62 cliff). All 36 domain cells (3 src x 3 dst x 4 strided layouts) now win, clean best-of-7 / best-of-3 sweep: i8|strided|i16 0.73 -> 2.09 bool|strided|i16 0.66 -> 4.17 u8|strided|u16 0.67 -> 2.61 bool|strided|u16 0.67 -> 7.03 i8|strided|char 0.71 -> 2.94 u8|strided|char 0.62 -> 2.96 + negcol/sliced/negrow (ss==1/-1) 1.9-3.5x — sliced/negrow already won via the generic kernel (1.1-1.8x); the new ss==1 path is as-fast-or-faster (no regression). Kernel (DirectILKernelGenerator.Cast.SubwordWiden.cs), two-stage SIMD reusing the SubwordCopy shuffles, inner loop inlined in the odometer: * ss==2 deinterleave even bytes (VPACKUSWB+VPERMQ), then widen; * ss==-1 reverse bytes (VPSHUFB+VPERMQ), then widen; * ss==1 contiguous Vector256.WidenLower/Upper (32/iter); * widen step: signed src (i8) = WidenLower/Upper on sbyte = SIGN-extend; unsigned src (u8,bool) = on byte = ZERO-extend. The dst signedness (i16/u16/char) is irrelevant — same 16-bit pattern (i8 -56 -> 0xFFC8 == int16 -56 / uint16 65480). dst==Half excluded (int->float value cast, kept on the Half kernels). Routed in TryGetStridedCastKernel after SubwordNarrow; C-contig untouched (TryGetCastKernel). Validation * BIT-EXACT vs NumPy 2.4.2: 280/280 hashes (35 pairs x 8 layouts), incl. i8 sign- extend across -128..127 (subword_cross_npref.py / subword_cross_verify.cs). * Standalone math 0-diff + benchmarked (subword_widen_poc.cs). * Full suite green: 9977 passed / 0 failed (net10.0, excl OpenBugs/HighMemory). Note: scoreboard refresh deferred (this session's sweeps degraded ~7pp by accumulated dotnet processes; the 36 widening cells win even in the degraded best-of-3).

… (bcast-reduce ~700x NumPy) The flat reduction's broadcast handling (DefaultEngine.ExecuteElementReduction) folded each stride-0 (broadcast) axis by ITERATING its D identical copies via ExecuteAxisReduction — cache-hot but still O(D×unique), landing at ~1.0x NumPy (the prior "50x->0.9x" fix). A stride-0 axis repeats its data exactly, so reducing over it is CLOSED-FORM — nothing to iterate: sum -> *D, prod -> ^D, min/max -> identity (mean = sum/count, in the caller). Drop every broadcast axis to a non-broadcast view of the UNIQUE data (index 0 along each stride-0 axis), reduce THAT once via the existing fast contiguous/IL-SIMD path, then apply the collapsed multiplicity to the scalar. O(unique), not O(D×unique). Measured (8192-unique x 1024 broadcast = 8.4M logical, -c Release, best-of-9): before (iterate fold) : 1.10 ms (1.03x NumPy) after (algebraic) : 0.0016 ms (~700x NumPy) — AT the floor: the SIMD sum of the 8192-row alone is 0.00149 ms; C# dispatch overhead is 0.11 us. No per-element kernel can approach this: a perfect contiguous SIMD reduce of the materialized 8.4M is 2.16 ms = 1350x slower — it pays O(D×N) while the collapse is O(unique). IL/NpyIter/IL-generation are throughput tools for that slower class and cannot beat the algorithmic win; the inner unique-reduce already rides the IL SIMD kernel. Correctness: - Integer & min/max BIT-EXACT (modular */pow compose under truncation; min/max are order-independent). Float sum/prod differ only ULP-level — the summation-order class the codebase already documents here, well within the broadcast fuzz's 1e-5 tolerance. - CombineWithCount applies the closed form in the accumulator dtype (IPowS/IPowU wrap like a materialized integer reduction; DPow throws on decimal overflow, matching np.prod). - DropBroadcastAxes returns the unique view CONTIGUOUS (an O(unique) copy only when the broadcast wraps a strided view): the prior per-axis fold always finished on a fresh contiguous intermediate, and reducing a strided unique view directly tripped a PRE-EXISTING sub-word strided-prod bug (np.prod over a strided view returns 0 — collected for filing). Verified: bcast_consistency fuzz 4689/4727 pass (the 38 fails are all pre-existing — 30 NaN-engine + 8 decimal argmax/argmin — zero on the flat sum/prod/min/max surface this touches); 586/586 non-OpenBugs reduction unit tests pass (net8 + net10).

@1m

…with a radix line kernel Closes one of the 8 documented Missing Functions (np.sort) and replaces the pathological argsort. Sorting is structured exactly like NumPy's _new_sortlike / _new_argsortlike (item_selection.c): an NpyIter "IterAllButAxis" drive hands one 1-D line at a time to a high-performance inner sort kernel. ARCHITECTURE - Driver (AxisSort.DriveAllButAxis): NpyIterRef.AdvancedNew with opAxesNDim=ndim-1 and op_axes listing every axis EXCEPT the sort axis. Dropping the sort axis from the iterator stops it coalescing into the contiguous run (the bug that made an earlier EXTERNAL_LOOP+CORDER attempt hand the kernel count=20 into an N=5 buffer); the kernel then walks each line itself via the axis stride. Handles 1-D..N-D, any axis (incl. negative and None=flatten), and strided/transposed/F-contig inputs. - Inner kernel = native-width LSD radix (RadixSort): u32 core for 1/2/4-byte dtypes, u64 core for 8-byte; 8-bit digits, one count + one scatter per byte, double-buffered, trivial-pass skip. Stable, so argsort ties resolve in ascending index order (== NumPy kind='stable'). Caller fills keys with a monotonic unsigned transform (so unsigned ascending order == NumPy value order) and un-transforms on the way out. - Floats partition NaN to the end before radix (NumPy sorts NaN last); the float->key transform is the standard IEEE sign-flip so -inf<..<+inf orders correctly. - Half/Complex/Decimal route to BCL introsort (Span.Sort) with exact NumPy comparators (npysort_common.h): real NaN-last; complex CDOUBLE_LT lexicographic real-then-imag, any-NaN-part last. Complex sorts at all now (it has no IComparable). - argsort always returns int64 indices; the radix carries a parallel index column. WHY RADIX (POC, i9-13900K, ratios are NumPy_ms/NS_ms, >1 = NS faster) - int32 sort: radix 164-217 M/s = 0.72-0.90x NumPy, 8-9x over BCL Span.Sort. - int32 argsort: radix 87-135 M/s = 2.1-4.3x FASTER than NumPy (NumPy uses slow indirect quicksort; radix carries indices for one extra scatter stream). - Variation shoot-out @1m int32 sort: radix 197 M/s >> AVX2 vectorized quicksort 52 M/s (correct, but out-of-place partition + no index path) >> Span.Sort 22 M/s. 8-bit beats 11-bit (L1 thrash) and beats ulong-widened combined-histogram for <=4-byte. A SIMD quicksort could only edge radix on int *sort* and would forfeit the argsort win, so radix is the right max-perf, all-dtype, correct-and-stable choice. IL-generating the radix gives no gain (memory-bandwidth-bound, not dispatch-bound). CORRECTNESS (all vs NumPy 2.4.2) - 82/82 .npy round-trip parity: int32/int64/uint8/int16/float32/float64 x 1-D/2-D/3-D x every axis (incl. negative + None) x NaN/+-inf x empty/single/ties/constant. - Half/Complex/Decimal verified (incl. the explicit complex lexicographic NaN-last case). - 20 new unit tests (NpSortTests) incl. transposed-view, in-place ndarray.sort, 3-D argsort-reconstructs-sort. Sort_ApiGap retagged [OpenBugs] -> [Fixed]. - Full suite: 9977 passed, 0 failed. API - np.sort(a, axis=-1, kind=null), np.argsort(a, axis=-1), ndarray.sort(axis=-1) in-place. - ndarray.argsort<T> kept for source compat (delegates; element type from the array). Deleted the old LINQ implementation (per-element tuple + NDArray view allocation).

…esentative run) Full cast-matrix sweep reflecting Wave 16a-d: 1503/1568 (95.8%) >=0.9, 92.5% win >=1.0, geomean 1.941. The geomean rise from the pre-Wave-16 1.70 is largely the real Wave 16 contribution — ~100 sub-word strided cells moved from ~0.6-0.9 to 2-7x. All 36 1B->2B widening cells (Wave 16d) now win 1.9-5.0x; the full sub-word strided family (copy + same-size + narrow + widen) is done. Remaining <0.9 (~48-65 depending on run): AVX512-gated i64/u64->f16, memory-bound i64/u64->narrow strided, alloc-bound 1M same-type contig, + a few genuine large-type cells and best-of-3 jitter. Measurement caveat: the sweep headline oscillates run-to-run (this session saw 87-97% across back-to-back runs) — degraded by accumulated dotnet build-server/run processes inflating either the NumSharp or NumPy subprocess. This run (geomean ~1.94) is representative; per-cell Wave 16 wins are best-of-7 + 280/280 bit-exact, robust to load.

- .claude/CLAUDE.md: partial count 62 -> 63; Cast & Copy list adds .Cast.SubwordWiden.cs; coverage row notes 1B-int->2B via deinterleave/reverse + sign/zero Vector256.Widen. - CAST_BEAT_NUMPY_PLAN.md §12: Wave 16d row (1B->2B widen 0.55-0.92 -> 1.9-7.0x) + SubwordWiden kernel in the kernels list. - CAST_REMAINDER_133_PLAN.md §0: banner now covers the WHOLE sub-word family (copy + narrow + widen, ~100 cells, 280/280 bit-exact).

…VX2 mask overflow np.any / np.all on the two non-JIT-vectorizable dtypes (bool, char) fell through to the scalar per-element loop in AnySimdHelper/AllSimdHelper because Vector256<bool>/<char>.IsSupported == false. For the benchmarked any(all-false) case at scale this was the 0.08-0.18x cliff (5-12x SLOWER than NumPy): a full scalar scan of every element, no SIMD. bool/char are bitwise identical to byte/ushort for a zero/non-zero test (bool: 0/1; char: UTF-16 code unit), so reinterpret at the top of both helpers and run the existing vectorized integer path. (Applies to the direct ReduceBool route AND the IL-emitted reduction route, which both call these helpers.) While fixing that, found and fixed a latent CORRECTNESS bug in AnySimdHelper: the "all lanes zero" reference mask was `(1u << vectorCount) - 1`, which for Vector256<byte>/<sbyte> (32 lanes) is `(1u << 32) - 1`. C# masks uint shift counts to 5 bits, so `1u << 32 == 1u << 0 == 1`, making allZeroMask = 0 instead of 0xFFFFFFFF. An all-zero vector then hit `bits (0xFFFFFFFF) != 0` -> returned true on the FIRST vector, so: np.any(np.zeros(1_000_000, dtype=byte)) -> True (NumPy: False) byte/sbyte any() was WRONG for any all-zero array >= 32 elements on AVX2 hardware. Masked until now because the benchmark uses bool (scalar path) and arrays < 32 use the Vector128 path where the shift doesn't overflow. Widened to `(uint)((1UL << vectorCount) - 1)` (correct for all 4..32 lane counts). This is also prerequisite for the bool->byte redirect, which now routes through the byte path. Result (i9-13900K, Release, best-of-9, vs NumPy 2.4.2), np.any(bool all-false): 100K 0.18x -> 1.90x 1M 0.08x -> 1.04x 10M 0.09x -> 1.04x NumSharp now runs at 79-95 GB/s (memory-bandwidth-bound) -- at/above NumPy; the block-unroll rewrite considered next was unnecessary once at the memory ceiling. Verified: 33 correctness checks (bool/byte/sbyte/char/int x all-false/all-true/ early-hit/all-zero x sizes 1M/40/10 hitting V256 / V256+tail / scalar paths) pass, incl. byte all-zero any=False at 1M; 19 NpAnyTest/np_all_Test pass on net8+net10; broader reduction sweep green except 3 pre-existing [OpenBugs] (convolve int64, var/std ddof>n) in unrelated kernels.

…p the broadcast-reduce copy Follow-up to the broadcast-reduce algebraic collapse. That commit added a .copy() in DropBroadcastAxes to dodge a "np.prod over a strided view returns 0" bug — a symptom patch. This removes the copy and fixes the bug at its root so the strided unique view reduces through NpyIter directly (NpyIter handles non-contiguous via coalesce + stride-ordered axis-permute, and is the always-faster path — no materialization). ROOT CAUSE — NpyIter.Execution.DetermineAccumulatorType had a hand-rolled integer-widening switch that had DRIFTED from the canonical NPTypeCode.GetAccumulatingType (the map the direct/contiguous reduction path uses): - SByte and Char were MISSING -> fell through to `_ => src`, so a strided sub-word sum/prod/cumsum/cumprod accumulated in 1-/2-byte lanes and OVERFLOWED. np.prod over a strided sbyte/char view returned garbage (e.g. 47 for a true 3375) or 0. - Byte was mis-widened to Int64 instead of UInt64 (latent: a UInt64 TResult was returned through an Int64 accumulator; positive products happened to fit). The contiguous path was unaffected because it never uses this method (it passes the GetAccumulatingType result down explicitly). The bug was dormant until the algebraic collapse started reducing strided unique views directly. FIX: - DetermineAccumulatorType now delegates the Sum/Prod/CumSum/CumProd widening to src.GetAccumulatingType() — single source of truth, can never drift again. (Mean/Var/Std -> Double and the pass-through for floats/decimal/complex are unchanged; GetAccumulatingType preserves Half/Single/Double/Decimal/Complex, so it is an exact drop-in.) - DropBroadcastAxes drops the .copy(): returns the (possibly strided) unique view as-is; the flat reduction routes non-contiguous inputs through NpyIter, now correct. This also fixes strided sub-word reductions OUTSIDE the broadcast path (any np.sum/np.prod/ np.cumsum/np.cumprod over a strided/transposed sbyte/char view) — the fuzz improved 38 -> 37 fails (the remaining 37 are all pre-existing: 30 NaN-engine + 7 decimal argmax/argmin). Verified (-c Release): - strided-prod pinpoint: sbyte/char/int32/int64/double x {1-D strided, transposed, sliced} all match the contiguous reference (was sbyte 47/0, char garbage). - bcast-reduce canary still 0.0021 ms (~534x NumPy), bit-exact, NO copy. - broadcast fuzz 4690/4727 (zero failures on the sum/prod/min/max surface). - 2522 unit tests green across reduction/cast/strided/cumulative/scan/statistics/iterator (net10, excl OpenBugs).

…e 17, bucket A) Closes the last large genuine cast laggard: i64->f16 and u64->f16 across all 8 layouts were the only AVX512-gated family left (was 0.68-0.89x NumPy on best-of-3, falling to the scalar generic emitter). There is NO AVX2 i64->f32 (cvtqq2ps is AVX512DQ, and this box is AVX2-only: Vector512 not accelerated, Avx512F/DQ/BW all false), so synthesize the conversion exploiting f16's tiny finite range. KEY INSIGHT: f16 saturates to +-inf at 65520, so EVERY finite i64/u64->f16 result has |v| < 65520 << 2^17 -- which fits in i32 *exactly* (well under the 2^24 f32 mantissa limit, so the i32->f32 step never rounds, single RTNE at f32->f16). Any |v| >= 65520 is inf regardless of the low bits. KERNEL (DirectILKernelGenerator.Cast.ToHalf.cs): - Int64x8ToHalfBits / UInt64x8ToHalfBits: clamp out-of-range lanes to a +-70000 sentinel (VPCMPGTQ + byte-granular VPBLENDVB; u64 uses the sign-bias unsigned compare), so the low-32-bit narrow is exact and the sentinel maps to +-inf. Pack 8 lanes' low words via PermuteVar8x32(RtoSel)+GetLower, then the exact cvtdq2ps and the already-proven Giesen FloatToHalfBits. - BulkInt64ToHalf / BulkUInt64ToHalf: 16-at-a-time contig bulk + scalar tail. - Int64ToHalfBits / UInt64ToHalfBits scalar ports (tail + strided conv); finite |v| < 2^24 so (float)v is exact, then SingleToHalfBits. - Contig (CastInt64/UInt64ToHalfContig) + strided (StridedInt64/UInt64ToHalf via the existing StridedNarrowDriver, srcSize=8) wired into TryGetXToHalfKernel / TryGetXToHalfStridedKernel (Int64/UInt64 switch cases). No Cast.cs routing change -- the X->Half resolvers were already in the chain. VALIDATION (bit-exact vs NumPy 2.4.2): - 190,043 i64 values (every int in the full +-70010 boundary region + 2^16..2^63 magnitudes + 50K randoms) and 120,024 u64 values: SIMD-diff=0, scalar-diff=0. - All 8 layouts x {i64,u64} = 16 SHA256 hashes through the real astype path, with boundary-spanning values (-inf/finite/+inf/0): 16/16 bit-exact. - Full suite 9977/9977 on net10.0. (The lone net8.0 failure is a PRE-EXISTING log1p/complex128 1-ULP/NaN-payload libm difference, confirmed by stash-test; unrelated to this cast.) PERF (best-of-7, canonical (arange%17)+1 base, 1M, NPY/NS): i64->f16 C 1.95 F 1.99 T 1.95 sliced 1.46 negrow 1.90 negcol 1.34 strided 1.20 bcast 1.95 u64->f16 C 1.66 F 2.53 T 2.68 sliced 2.50 negrow 2.57 negcol 1.69 strided 1.55 bcast 2.58 geomean 1.914x (was ~0.80). Every cell now >0.9; whole bucket beats NumPy. PoCs: benchmark/poc/i64_u64_half_poc.cs (value proof), i64u64_half_layout_{npref.py, verify.cs} (8-layout hash proof), i64u64_half_bench.cs (best-of-7 timing).

…(Wave 17, bucket B) Bucket B was ~30 cells the best-of-3 scoreboard flagged at 0.77-0.90 for {f32,f64,c128} -> {i64,u64}. A clean best-of-7 re-measure (all 6 src->dst x 8 layouts, canonical (arange%17)+1 base, GC between cells) showed 25 of them were pure best-of-3 JITTER -- now 0.9-2.5x, many >1.0 (e.g. f32/f64/c128 -> i64 all contig/F/T/sliced/negrow >=0.95, bcast 2.3-2.5x). float->i64 has no dedicated kernel (scalar Converts == NumPy's cvttsd2si) and is already at parity. Only 5 cells were GENUINELY <0.9, all -> u64 on negcol/strided -- and all because the strided u64 path fell to a gather even when the data is contiguous: - f32|negcol|u64 0.88->0.96, f64|negcol|u64 0.86->0.93: negcol ([:, ::-1]) is a contiguous-REVERSED row, so load contiguously + VPERMQ/VPERMD lane-reverse instead of a -1-stride gather (Cast.FloatToUInt.cs). Reads stay inside the row. - f64|strided|u64 0.86->0.97: ::2 deinterleaves even doubles from two contiguous loads (Permute4x64 0x08/0xDD + Blend 0x0C). The 2nd vector is loaded at 2i+3 (not 2i+4) so the max read is 2i+6 = the last needed element L(i+3) -> provably no over-read past the row (the gather's only safety advantage, recovered). - c128|negcol|u64 0.80->0.96: the reals are interleaved with imags, but for negcol the 8 surrounding doubles are contiguous (re3,im3,..,re0,im0), so two loads + UnpackLow + VPERMQ-0x72 extract the 4 reals in logical order, beating the -2-stride real gather (Cast.Complex.cs FusedComplexToUInt64). Over-read-safe. - c128|strided|u64 ~0.87 (memory-bound): the ::2 reals sit every 4th double; a contiguous-load alternative reads 2x the data, so the gather is already near optimal. Left as-is; straddles 0.9 on jitter. VALIDATION: {f32,f64,c128} -> u64 across all 8 layouts = 24 SHA256 hashes vs NumPy 2.4.2, with values that exercise the modular wrap (negatives), the 2^63 sentinel (NaN/+-inf/overflow) and truncation: 24/24 bit-exact. Full suite 9977/9977 on net10.0 (net8.0's lone log1p/complex128 failure is pre-existing, stash-confirmed). PoCs: benchmark/poc/float_wideint_bench.cs (best-of-7 6x8 matrix), float_u64_layout_{npref.py,verify.cs} (24-hash layout proof).

…tude mask (Wave 17, bucket E) f16|strided|bool was a genuine 0.14x cliff (clean best-of-7: NumSharp 1.53ms vs NumPy 0.22ms, ~7x slower) -- the scoreboard's best-of-3 0.87 was a lucky sample. f16 is 2-byte so it is not VPGATHER-eligible; StridedHalfToBool routed through the generic StridedNarrowDriver, which scalar-converts the ss==2 ([:, ::2]) and ss==-1 ([:, ::-1]) inner rows. Wave 16's SubwordNarrow excluded f16 (it needs a magnitude mask, not a plain low-byte narrow), so f16->bool never got the shuffle treatment. FIX (DirectILKernelGenerator.Cast.ToBool.cs): a dedicated StridedHalfToBool that applies the SubwordNarrow deinterleave (ss==2: mask even words -> VPACKUSDW -> VPERMQ 0xD8) / reverse (ss==-1: VPSHUFB _revWords -> VPERMQ 0x4E) shuffles, then NumPy's half-truthiness (bits AND 0x7fff) != 0 (so +-0.0 -> False, NaN/inf -> True). Inner ss==1 rows (sliced/negrow/bcast) run the contiguous magnitude compare; arbitrary strides (F/T column-gather) and the tail keep the scalar test. Inner loops inlined in the odometer (a per-row helper costs ~6x -- SubwordCopy PERF NOTE). RESULT (best-of-7, 1M, NPY/NS): every f16->bool layout now beats NumPy 3.9-7.3x: C 4.00 F 3.87 T 4.02 sliced 7.25 negrow 7.01 negcol 6.49 strided 5.27 bcast 6.92 (strided 0.14 -> 5.27). NumPy's f16->bool is uniformly slow (0.30-0.57ms). VALIDATION: 8 layouts x SHA256 vs NumPy 2.4.2 with a base carrying +-0.0 / NaN / +-inf / subnormal / 65504 -> 8/8 bit-exact (-0.0 bits 0x8000 -> False confirmed). Full suite 9977/9977 net10.0 (net8.0's lone log1p/complex128 failure pre-existing). Bucket E also RE-MEASURED 6 other scoreboard singletons (f64->u8/i32/char, u64->f32, c128->u8, u64->u8) -- all were best-of-3 jitter, clean best-of-7 is 1.08-1.87x. PoCs: benchmark/poc/f16bool_{bench.cs,layout_verify.cs,layout_npref.py}, subword_arb_bench.cs.

Scoreboard: 1508/1568 >=0.9 (96.2%, was 95.8%); the i64/u64->f16 (A) and f16->bool (E) families dropped off the <0.9 list entirely. geomean 1.89 on this best-of-3 run (clean best-of-7 across the touched buckets is 1.94) -- this sweep was mildly degraded by accumulated dotnet build-server processes (cannot be cleared without taskkill), so the float->i64/u64 cells (bucket B) read 0.84-0.89 here despite measuring 0.95-2.5x under clean back-to-back best-of-7 (documented in the plan + commit faa2754). The scoreboard is best-of-3 and noisy by construction. Remaining <0.9 (60), by class: - ~15 same-type small-dtype contig (bool/u8/i8 -> self, C/T/sliced/negrow/bcast, 0.16-0.29): NOT a kernel deficiency -- a cache-residency artifact. A warm Buffer.MemoryCopy of 1M is 0.0138ms = NumPy's 0.014ms exactly; NumSharp wins 1.9-3.5x at every cache-busting size (4M/16M/64M). NumPy only wins at 1M because CPython refcounting reuses one warm buffer per iteration while .NET GC hands back cold ones. Unwinnable without deterministic disposal; left as-is (optimal). - float/c128 -> i64/u64 (best-of-3 jitter on this degraded run; clean >=0.95). - c128->bool / c128->u64 strided (memory-bound, gather of strided reals already near-optimal -- the contiguous alternative reads 2-4x the data). Plan (CAST_REMAINDER_133_PLAN.md) gets a full Wave 17 section: bucket-by-bucket A-E with the kernel approach, clean best-of-7 numbers, bit-exact proof counts, and the cache-residency characterization for C/D so it is not re-chased.

…s+imags (Wave 17) c128|negcol|bool was a genuine 0.66x (clean best-of-7: NumSharp 0.635ms vs NumPy 0.418ms) -- FusedComplexToBool gathered reals (idx) and imags (idx+1) with a -2-stride double gather for the reversed inner row. As with c128->u64 negcol, the 8 doubles around p are contiguous-reversed (re3,im3,re2,im2,re1,im1,re0,im0), so two loads + UnpackLow/UnpackHigh + VPERMQ-0x72 deinterleave BOTH reals and imags into logical order, then the existing OrderedEqual !=0 | test. Over-read-safe (p-6 of the last group is the row's first physical real). RESULT (best-of-7, 1M): c128|negcol|bool 0.66 -> 1.50x. (c128|strided|bool ~0.9-1.07 is NumPy-jitter / memory-bound -- the ::2 reals+imags sit every 4th double, so the contiguous alternative reads 4x the data; gather stays.) The other 6 c128->bool layouts were already 1.08-1.19x. VALIDATION: 8 layouts x SHA256 vs NumPy 2.4.2 with a base carrying re=0&im=0 (False), im!=0 (True), NaN/inf (True), -0.0-0.0j (False): 8/8 bit-exact. Full suite 9977/9977 net10.0 (net8.0's lone log1p/complex128 failure pre-existing). PoCs: benchmark/poc/c128bool_{bench.cs,layout_verify.cs,layout_npref.py}.

np.sort, ndarray.sort and np.argsort were O(N^2) for every 1-D array and every axis=None call - the most common cases. Root cause: AxisSort's IterAllButAxis driver (DriveAllButAxis) drops the sort axis via op_axes, so a 1-D input yields a 0-dimensional iterator (oa_ndim=0, op_axes=[[]]). NpyIter mis-drives that degenerate shape as `size` single-element iterations instead of 1, and because each line kernel re-sorts the whole line (it walks by LineCtx.n, ignoring the per-call count) total work is N x O(N) = O(N^2). Measured before fix (i9-13900K, Release), 1-D int32: N=10K sort 514 ms argsort 768 ms N=100K sort 53,419 ms argsort 83,132 ms (~104x per 10x in N = quadratic) 1M+ effectively hung (minutes). Unit tests never caught it (5-12 element arrays, N^2 trivial) and the radix POC benched RadixSort directly, bypassing the driver. Fix: at the single choke point DriveAllButAxis, promote a 1-D operand to a (1, N) memory-sharing view (np.expand_dims aliases storage, so in-place sort still mutates the original; a leading size-1 axis leaves the data-axis stride - and thus LineCtx in/out strides - unchanged). Now exactly one axis (size 1) is kept, so the iterator makes exactly one line-kernel call. Multi-D was already correct (one call per line) and is untouched. Measured after fix: N=100K sort 0.68 ms (147 M/s) argsort 1.06 ms (94 M/s) ~78,000x faster 1-D throughput now flat across N (quadratic gone). vs NumPy 2.4.2 argsort: int32 1.47x (1M) / 2.1x (4M) FASTER; int64 ~parity; f64 0.6x (1M) -> 0.96x (4M). Correctness: NpSortTests 20/20 pass; validated 1-D sort/argsort, in-place on strided a[::2] view, axis=None, int64/f64. NOTE (separate latent bug, not fixed here): the underlying NpyIter 0-dimensional / empty-op_axes iteration emits `size` iterations instead of 1; this still affects any other 0-dim op_axes use. The sort path now works around it rather than relying on the iterator's 0-dim behavior.

@out

…ializing a contiguous copy Removes the copy-to-contiguous anti-pattern (materializing a strided/transposed/ sliced/F-order/broadcast view because a kernel only handled C-contiguous data) from seven compute paths, replacing each `.copy()`/`.copy('C')` with in-place strided iteration — the same principle applied to the reduction fold earlier on this branch ("must use NpyIter, not copy"). Every element is now read through its own strides in C-order (via a shared `FlatStrideOffset` odometer) or through the existing IL strided kernel; results are bit-identical to the prior copy path and verified against NumPy across all dtypes x layouts. CumSum / CumProd flat (Default.Reduction.CumAdd.cs, CumMul.cs): - Both flat scans now share ScanElementwiseFlat, which drives the IL scan kernel directly on strided/transposed/sliced/reversed/F-order/3-D inputs. The flat scan generator already emitted a correct C-order strided walk (EmitScanStridedLoop); the old `cumsum_elementwise(arr.copy())` simply prevented it from ever being reached. - Enabled CumProd in the IL scan kernel (DirectILKernelGenerator.Scan.cs): EmitScanCombine now emits Mul, EmitScanHelperCall maps to a new CumProdHelperSameType. cumprod previously had NO IL kernel at all (TryGet returned null at generation, always falling to the scalar copy path). - Half/Complex accumulators can't be combined with il Add/Mul except same-type- contiguous, so for non-contiguous Half/Complex (and decimal cumprod) we keep a single C-order ravel — exactly what NumPy does — not a kernel-dodge copy. - Base address folds Shape.offset in (sliced views retain it; simple contiguous slices bake it into Address). NEP50 accumulator widening preserved. Var / Std flat (Default.Reduction.Var.cs, Std.cs): - The fallback two-pass now iterates the input in place via FlatStrideOffset (var/std are order-independent, so any visiting order is valid — mirrors NumPy, which reduces strided arrays without copying). Shared moment helpers (VarMomentsReal<T> via INumberBase, VarMomentsDecimal, VarMomentsComplex) are used by both var and std (std = sqrt). Removes the old double-materialization (arr.copy() + Cast-to-double). Integer Reciprocal (Default.Reciprocal.cs): - The 1/x integer loop reads the source through its strides instead of materializing sliced/strided/broadcast views (the comment admitted the previous raw read "THREW for views"). Broadcast stride=0 is handled naturally. - Pre-existing 1/0 semantics preserved verbatim (see known-bug note below). Left / Right Shift (Default.Shift.cs): - Array-operand shift walks both the values and the per-element shift counts through their own strides (ExecuteShiftArrayStrided), so broadcast (stride=0), transposed, sliced and mixed-layout operands no longer get copied. The contiguous fast path still uses the IL scalar-loop kernel. Clip (Default.ClipNDArray.cs, DirectILKernelGenerator.Clip.cs): - New ClipStrided reads src and (array) bounds through their own strides in C-order, writing a fresh C-contiguous result — eliminating the copy('C') normalizations in the general path, PrepareBound, and the @out path. The contiguous fast path still drives the IL SIMD kernel. One strided loop is shared across dtypes via Max/Min function pointers. - Clamp semantics mirror the IL kernel bit-for-bit: Math.Max/Min for sized numerics+decimal, the (now internal) Half/Complex NaN helpers, Char as UInt16, and NaN-propagating float/double Max/Min that resolve the signed-zero tie like hardware MAXPS/MINPD (maximum(+0,-0) = -0) rather than Math.Max (+0). np.maximum /minimum/fmax/fmin are implemented on top of clip, so this keeps their signed-zero/NaN results identical to the SIMD path and to NumPy. - Fixed a latent allocation bug: result buffers are now built from a fresh contiguous Shape, never np.empty(lhs.Shape) which would clone a strided/F-order layout and mis-address the linear dst writes. Verification: per-op fuzz harnesses compare strided/transposed/sliced/reversed/ F-order/3-D/broadcast views against the contiguous ground truth and against live NumPy across all dtypes (scan 308/308, var-std 660/660, recip+shift 220/220, clip 350/350). Full unit suite green (9987 passed, 0 failed); net8.0 + net10.0 build clean. Known pre-existing bug surfaced (NOT changed here): NumPy integer reciprocal(0) is 0 for int8/int16/uint8/uint16/uint32 and the sign-bit sentinel only for int32/int64/uint64 (uint64 -> 2^63); NumSharp uses MinValue for all signed and 0 for all unsigned, so reciprocal(int8/int16 0) and reciprocal(uint64 0) diverge. This is the existing documented behavior, faithfully preserved by the strided rewrite; flagged for a separate fix.

…/maximum signed-zero scalar tail Two NumPy-parity bugs, both surfaced while removing the strided copy-hacks. Integer reciprocal 1/0 (Default.Reciprocal.cs) Re-probed NumPy 2.4.2 bit-exact and deterministic across array sizes / lane positions: reciprocal(0) is the sign-bit sentinel 0x80..0 ONLY for the 32/64-bit loops (int32 -> int.MinValue, int64 -> long.MinValue, uint64 -> 2^63) and 0 for every narrower type (int8, int16, uint8, uint16, AND uint32). NumSharp returned MinValue for ALL signed (so int8 -> -128, int16 -> -32768) and 0 for ALL unsigned (so uint64 -> 0) — wrong for int8/int16/uint64. Fixed the three per-type 1/0 results; the other five already matched. (A prior pin had mis-probed int8 1/0 as -128; corrected.) reciprocal(bool): NumPy promotes bool to the int8 loop -> int8 [True->1, False->0] (probed np.reciprocal([True,False]) -> int8 [1,0]). NumSharp took the float loop (Double [1, inf]). The dispatcher now maps a bool loop type to int8 and casts the input, so bool reciprocal returns int8 with the int8 1/0 = 0 rule. dtype was already preserved for the int types (the registry's "widen to float64" note was stale). clip / maximum / minimum / fmax / fmin signed-zero scalar tail (DirectILKernelGenerator.Clip.cs, Default.ClipNDArray.cs) np.maximum etc. are built on clip. For float32/float64 the SIGNED-ZERO tie resolves to the SECOND operand (hardware MAXPS/MINPD: maximum(+0,-0) = -0) — the OPPOSITE of float16 (ties to the first operand) and of Math.Max (returns +0). The IL kernel's SIMD body already matched NumPy, but its SCALAR TAIL (EmitScalarClamp) used Math.Max and so diverged on a signed-zero tie landing in the last partial vector. Added NaN-propagating, tie-to-second FloatMaxNaN/FloatMinNaN/DoubleMaxNaN/DoubleMinNaN helpers and wired EmitScalarClamp to them for Single/Double; the engine's ClipStrided now shares the same helpers (dropping its local copies), so the contiguous SIMD body, contiguous scalar tail, and strided path all agree bit-for-bit with NumPy. The Half helper (tie-to-first) is correct for float16 and left unchanged. Tests - MisalignedRegistry: removed the two now-obsolete reciprocal excuses (the int->float64 dtype excuse and the int ÷0 / non-contig-throw value excuse); the unary fuzz corpus now verifies reciprocal(int/bool) bit-exact. - NewDtypesCoverageSweep: corrected B36_SByte (int8 1/0 -> 0, not MinValue) and added B36_Int16 (1/0 -> 0), B36_UInt64 (1/0 -> 2^63), B36_Bool (-> int8 [1,0]) pinning the per-type rule. Verified: contiguous maximum/minimum with -0/+0 across SIMD body AND scalar tail match NumPy bit-for-bit (f32 and f64); reciprocal(all 8 int types + bool) matches NumPy; strided reciprocal still equals the contiguous result. Full unit suite green (9990 passed, 0 failed); net8.0 + net10.0 build clean.

…O(N^2)) CalculateBroadcastShape gated the op_axes branch on `opAxesNDim > 0`, so an explicit 0-dimensional iteration (opAxesNDim == 0, op_axes = [[]]) fell through to natural broadcasting and returned the operand's own shape [N]. That made IterSize = N, so a kernel driven without EXTERNAL_LOOP ran N times instead of once. NumPy's NpyIter_RemoveAxis leaves a 0-dim iterator with GetIterSize() == 1 (one iteration, data pointer = operand base; any dropped axis is the caller's responsibility via its saved stride). This is the ROOT of the 1-D / axis=None sort & argsort O(N^2) addressed at the symptom level in b4d2128: AxisSort's IterAllButAxis drops the only axis of a 1-D operand -> opAxesNDim == 0 -> the line kernel (which re-sorts the whole line per call, ignoring the per-call count) was invoked N times. Fix: treat opAxesNDim == 0 as an explicit 0-dim iteration and return an empty iteration shape (IterSize == 1). The `> 0` -> `>= 0` gate plus an early `return Array.Empty<long>()` leaves the opAxesNDim > 0 and the unspecified (-1, natural-broadcast) paths untouched. TDD (NpyIterZeroDimOpAxesTests, 10 tests, direct NpyIterRef.AdvancedNew): - RED against the unfixed iterator: 7 fail / 3 pass, every failure the exact symptom (NDim 1 not 0; Calls/IterSize = 100000/1000/500/256/10/20 not 1). - GREEN after fix: 10/10. Functional cases (single-call whole-line walk, in-place fill, two-operand lockstep copy, sliced-offset view) confirm the NDim=0 base pointer is offset-correct and read/write-valid; 3 regression guards (no-op_axes 1-D -> N calls; drop-axis-1 -> per-row; drop-axis-0 -> per-col) hold before and after. Regression: 42/42 (NpSortTests 20 + NpyIterZeroDimOpAxes 10 + AxisStride 12). Note: AxisSort's (1,N) promotion workaround from b4d2128 is now redundant (the sort path could rely on this root fix directly) but left in place; it is correct and independent.

…iterals (NEP50) `a * 2`, `a + 5`, `a / 3.0` route through operator(NDArray, object) -> np.asanyarray(literal), so a Python int literal is wrapped as an int32 0-d scalar (a float literal as float64). For `f32[] * 2` the operands therefore arrive as f32 x int32. resultType already reflects NumPy's weak promotion (f32), but the fast-path gates key off the RAW operand dtypes: - TryTrivialContiguousBinaryOp: `if (lhsType != rhsType ...) return null;` bails BEFORE the scalar-broadcast arm, and - TryExecuteBinaryOpViaNpyIter: sameDtype=false -> per-element-convert mixed loop (vector body null). So the op fell to the heavy mixed path instead of the SimdScalarRight/Left scalar-broadcast loop it should take. Measured (100K f32, Release): a * 2 (int32 scalar) 0.080 ms vs a * 2f (f32 scalar) 0.035 ms -- 2.25x slower for IDENTICAL work and an identical f32 result. This hit EVERY `array <op> python-literal` and dominated the 100K-tier elementwise losses in the full benchmark (`a * 2 (literal) f32 @100K` 0.067x was among the worst non-degenerate op-matrix cells). Root cause is path-check ORDERING: the same-dtype gate is evaluated first, against un-promoted operand dtypes, so a case that is effectively same-dtype after NumPy's weak-scalar promotion never reaches the scalar-broadcast fast path. Fix (ExecuteBinaryOp, before the trivial bypass): when exactly one operand is a 0-d/size-1 scalar, the OTHER (array) operand already IS resultType, and only the scalar's dtype differs, materialize the scalar at resultType (one element -- exactly what NumPy does, casting the scalar to the loop dtype). Both operands then share resultType and the same-dtype SIMD paths (contiguous SimdScalarRight/Left and the same-dtype NpyIter SIMD loop, incl. strided) light up. When the ARRAY operand also differs (e.g. int32[] * 2.5 -> float64) the guard skips: the array genuinely needs a cast, which is the mixed path's job. Value-identical: the mixed path converted the same scalar to resultType per element; this converts it once up front. Verified across f32/f64/i16/i32/i64/u8/i8 (contig + strided): (a * 2) == (a * same-type-scalar 2), every dtype eq=True; i32 * 2.5 still -> float64 = 2.5 (guard correctly skips). Post-fix: a*2 ~= a*2f ~= a*s32 (gap closed). 778 binary/arithmetic/promotion/scalar/broadcast + 2345 math/reduction/statistics/ ufunc/comparison/evaluate tests pass on net8.0 + net10.0. Settled while investigating (the suspected second ordering bug): DivergesFromNumpyCast is a dead stub (returns false), so the cast *->bool slowness is NOT a gating-order bug -- it is fixed overhead on a 1-byte output, like the same-type copy diagonal.

These 14 documents were working notes for work that has since shipped and is covered by durable artifacts (tests, the benchmark scoreboards, and the website docs). Removed as dead weight: Audit working notes (findings already fixed or pinned as [OpenBugs] under test/NumSharp.UnitTest/AuditV2/AuditV2_*.cs, which remain): docs/plans/NDITER_BRANCH_QUALITY_AUDIT_V2.md docs/plans/audit_v2/01_iterators.md docs/plans/audit_v2/02_ilkernel_simd.md docs/plans/audit_v2/03_default_math_reductions.md docs/plans/audit_v2/04_logic_shape_storage.md docs/plans/audit_v2/05_ndarray_creation.md docs/plans/audit_v2/06_manipulation_apis_logic.md docs/plans/audit_v2/07_math_ops_selection_sorting_stats.md docs/plans/audit_v2/08_casting_random_utilities.md Implementation plans for shipped features: docs/OUT_WHERE_NPYITER_FAMILIES_PLAN.md (out=/where= families — shipped) docs/plans/CAST_BEAT_NUMPY_PLAN.md (cast matrix — shipped) docs/plans/CAST_REMAINDER_133_PLAN.md (cast continuation — shipped) POC scratch (the real implementation landed in NpyIter): benchmark/poc/NPYITER_REDUCTION_PLAN.md benchmark/poc/POC_RESULTS.md Build/tests unaffected — every inbound reference is a prose/comment citation, not a code dependency. Four stale citations remain to be scrubbed separately: src/.../Math/DefaultEngine.BitwiseOp.cs:13 -> OUT_WHERE_NPYITER_FAMILIES_PLAN.md src/.../Unmanaged/UnmanagedStorage.Cloning.cs:426 -> CAST_BEAT_NUMPY_PLAN.md docs/NPYITER_GAPS_AND_ROADMAP.md:50 -> benchmark/poc/POC_RESULTS.md docs/PR611_CHANGELOG.md:210 -> docs/plans/audit_v2/

…32b0 wave) The branch advanced ~143 commits past the last changelog refresh. Updated the doc to represent the current/final state, not the prior snapshot: Stats: 312 -> 455 commits, 615 -> 806 files, +217,949/-16,402 -> +234,348/-19,179. Tests: 9,709 -> 9,990 passed / 0 failed (net8.0 + net10.0). Flipped now-false claims: - np.sort: was 'remains unimplemented' -> implemented (radix line-kernel on NpyIter, stable, NaN-last, all axes/orders); np.argsort reimplemented. Missing-functions list narrowed to flip family / diag / gradient / round. - NDIterator: was 'survives as a thin lazy wrapper' -> deleted entirely (interface + class + AsIterator); all iteration via NpyIter/NpyFlatIterator/ GetAtIndex. Breaking-changes row + TL;DR updated. - Frontier 'open losses' -> fixed: broadcast-reduce (was 54x slow) now ~534-700x NumPy via algebraic stride-0 fold; scalar any/all (bool/char) on the integer SIMD path; np.zeros calloc-backed (was ~1000x slow). Added current-state coverage: - §5: np.sort row, np.broadcast N-operand parity, six Complex transcendentals (sinh..arctan). New-API count 30/35 -> 36 (consistent across TL;DR + §5). - §7: SIMD strided-cast campaign (astype 15x8x15: 716->~391 lagging, 852->1,177 wins), np.zeros calloc, broadcast-reduce fold, bit-exact pairwise sum, any/all SIMD, Complex/Half/Decimal axis reductions, f16 negate. - §10: integer reciprocal 1/0 per-width, clip/maximum signed-zero + NaN, f16 axis-sum float32 accum, exp2/power-Half/complex-narrowing crashes, complex nansum uninit read, AVX2 any() mask overflow, net8.0 complex abs. Also fixed the dangling docs/plans/audit_v2/ reference (those docs were removed last commit) -> points at test/.../AuditV2/AuditV2_*.cs instead. Live PR #611 description updated to match via REST PATCH.

…+ scrub dangling plan-doc refs The Direct/ kernels carried 24 methods marked [Obsolete("Unused...", error: true)] as 'pending removal' tombstones. error:true makes any reference a compile error, so they were provably unreachable. Removed outright across 12 partials: Argwhere(1) Binary(1) Cast(1) Comparison(1) Masking.NaN(4) MatMul(1) MixedType(1) Reduction.Axis.Simd(8 + orphaned header comment) Reduction.NaN(1) Scan(3) Search(1) Unary(1) The Reduction.Axis.Simd block (Add/Mul/Min/Max 256/128) was superseded by the type-specialized AddOp*/MulOp* structs (435a07e); the rest by their non-throwing Get*/TryGet* twins. Net -683 lines, zero behavior change. Also scrubbed 3 comments/prose that still cited the plan docs deleted earlier in this branch (kept each substantive note, dropped the dead file path): - DefaultEngine.BitwiseOp.cs (OUT_WHERE_NPYITER_FAMILIES_PLAN.md) - UnmanagedStorage.Cloning.cs (CAST_BEAT_NUMPY_PLAN.md) - docs/NPYITER_GAPS_AND_ROADMAP.md (benchmark/poc/POC_RESULTS.md) PR611 changelog §14 updated: 'pending deletion' -> removed. Verified: NumSharp.Core builds clean (0 errors); full suite 9990 passed / 0 failed / 11 skipped on net10.0 (CI filter). No dangling refs to deleted docs remain.

Dead-code sweep (Roslyn IDE analyzers + whole-repo reference-frequency probe): each method's name occurs exactly once across all .cs in src/test/benchmark/ examples/tools — i.e. declared and never referenced anywhere, so unreachable. All private/protected, so scope confirms no external caller. Removed: CreateLogicalResultShape Default.LogicalReduction.cs ExpandStartDim, ExpandEndDim Default.Dot.cs ExecuteShiftOpScalar Default.Shift.cs BulkHalfToBoolV, ConvHalfToBool DirectILKernelGenerator.Cast.ToBool.cs FindFirstNaN_Double/_Float NDArray.unique.Kwargs.cs TryUniformConstant np.pad.cs uniqueComplex (protected) NDArray.unique.cs Checked for cascade: BulkHalfToBool (non-V) stays — still used by CastHalfToBoolContig; the removed BulkHalfToBoolV was the unused strided-driver adapter. No other symbol orphaned. Verified: NumSharp.Core builds clean; full suite 9990 passed / 0 failed / 11 skipped on net10.0 (CI filter). Zero references to the 10 names remain.

…bsystems) Full run of benchmark/run_benchmark.py (~3h wall, numpy 2.4.2, Release): the BenchmarkDotNet op/dtype/N matrix across all 14 comparison suites, the NpyIter iterator benchmark (+ README cards), and the layout/operand/cast/fusion matrix subsystems. The committed report files were stale (predated several perf fixes already in HEAD); this refreshes them. layout_results.md/.tsv are generated for the FIRST time (the Layout subsystem had never been run on this branch). Recent fixes now visible in the numbers: - bcast_reduce canary: 0.02x (51.8x SLOWER) -> 538.56x faster [cb62e37] - any(F) reduction: ~0.1x -> 8.89x/8.41x small-N, ~1x at scale, geomean 2.74x (+ byte/sbyte AVX2 all-zero-mask fix) [c79ed6f] - np.evaluate F/T: fusion F 0.11x->1.85x, T 0.12x->1.74x [16afbac] - forder_out canary: 0.46x -> 1.28x Op-matrix summary moved the right way (fewer bad cells): before 1851 ops | OK 797 close 289 slow 219 red 81 negl 389 n/a 76 after 1851 ops | OK 792 close 357 slow 177 red 72 negl 384 n/a 69 NpyIter headline 1.17x -> 1.18x geomean; iterator construction 2.88x -> 3.33x. Known weak spots persist (NOT new regressions - pre-existing, predate this run): - left_shift/right_shift @100k ~0.05x (scalar-loop shift path, not vectorized) - np.sum(f64) @100k 0.074x (np.sum() API dispatch overhead; the isolated iterator reduction still wins 2.79x) - copy/cast + index-math at small N, float16 scalar path (no Vector<Half> in BCL) Generated against the current working tree (branch HEAD + local engine WIP); result files only - no source changed.

Nucs · 2026-06-23T07:31:05Z

📊 Performance snapshot vs NumPy 2.4.2

Full benchmark/run_benchmark.py regen on this branch — the BenchmarkDotNet op/dtype/N matrix (1,851 ops × 1K/100K/10M), the NpyIter iterator harness (+ cards), and the layout/operand/cast/fusion subsystems. Convention throughout is NPY/NS = NumPy_ms ÷ NumSharp_ms, so >1.0× = NumSharp faster. Absolute ms aren't comparable across machines; the ratios are (same box, same run).

Headline: op-matrix geomean 1.18×, scaling to 1.26× at 10M. Of 1,851 ops: 792 faster · 357 within 2× · 177 + 72 slower — and the slow buckets shrank this cycle (🔴 81→72, 🟠 219→177).

🟢 Where it's great (often an order of magnitude)

Area	Result
Reductions / stats	`prod` 13.8×, `nanstd`/`nanvar`/`percentile`/`quantile` 11–13×, axis-`sum` (int8) ~11× at 10M
`dot` / matmul	`dot(f64)` 14.2× @100k
Fused `np.evaluate`	`(a-b)/(a+b)` 4.16×, `ab+c` 1.56× — and it now holds on F/transposed (1.85× / 1.74×) and broadcast (3.60×*) operands
Broadcast-then-reduce	538× vs NumPy (algebraic axis-collapse)
`any` / `all`	2.7–8.4× (SIMD bool/char scan + early-exit)
`astype` matrix	1,439 / 1,568 cells win; per-layout geomean 1.47–2.21×; the old float→narrow-int cliff is now a win (f16→narrow 3.77×)
Iterator construction	3.3× vs `np.nditer`

🔴 Where it's still bad (the honest worklist)

Area	Result	Cause
`left_shift` / `right_shift` @100k	~0.05× (≈20× slower)	array-by-scalar shift runs a scalar loop, not vectorized
`np.sum(f64)` @100k	0.074×	`np.sum()` API dispatch overhead (~200µs); the raw iterator reduction still wins 2.79×
copy/cast & index-math @ small N	0.3–0.6×	iterator-build cost dominates µs-scale ops (`flatten` / `astype` / `unravel` / `ravel_multi_index` @1k–100K)
`float16` across the board	~0.65×	no `Vector<Half>` in the .NET BCL → scalar path
`less` → bool	0.54×	comparison-to-bool at small N
100K tier overall	0.90×	small/medium arrays are where per-call overhead bites; 10M is a clean win

TL;DR

Reductions, fusion, broadcasting, casts, and large-array throughput are clearly ahead of NumPy — frequently 5–500×. The losses cluster in per-call overhead at small/medium N (API dispatch + iterator construction), the non-vectorized scalar-shift path, and float16 (a BCL limitation, not a design one). All performance, no correctness — it's the optimization backlog, not a bug list.

_{numpy 2.4.2 · Release · best-of-rounds · shared-hardware caveat applies to absolute ms, not ratios.}

Nucs · 2026-06-23T16:36:51Z

📐 Performance matrices — `>1.0×` = NumSharp faster

Every cell is NPY/NS = NumPy_ms ÷ NumSharp_ms → >1.0× = NumSharp faster · <1.0× = NumSharp slower · =1.0× = parity. (numpy 2.4.2 · Release · best-of-rounds · same box.)

1) Operations — category × cache tier (geomean)

category	scalar	1K	100K	1M	10M
elementwise	1.05	1.54	1.18	1.09	1.11
reductions	2.67	1.99	1.51	1.44	1.42
copy/cast	0.61	0.59	0.40	1.39	1.06
index-math	0.32	0.51	0.97	1.22	1.22
dtypes	0.71	0.85	1.97	1.54	1.47

2) Per-family × cache tier

family	scalar	1K	100K	1M	10M	geomean
— elementwise —
add	1.01	1.48	1.03	0.88	1.01	1.06
sqrt	0.85	1.15	1.00	1.01	1.02	1.00
copy	0.88	2.59	1.78	1.33	1.72	1.56
strided	0.89	1.12	1.00	1.02	0.99	1.00
bcast	0.89	1.13	1.02	0.98	1.03	1.01
reversed	0.85	1.28	0.90	0.99	1.00	0.99
castbuf	1.98	2.29	1.65	1.35	1.16	1.64
mixbuf	1.49	1.94	1.40	1.24	1.09	1.40
— reductions —
sum	1.84	1.78	2.79	2.21	1.76	2.04
sum axis=0	1.90	0.86	0.96	1.00	0.94	1.08
sum axis=1	1.85	0.86	1.51	1.83	1.57	1.47
sum dtype=	1.97	1.47	0.49	0.47	0.55	0.82
amin	1.70	1.61	0.71	0.70	0.82	1.02
cumsum	1.47	1.09	1.06	1.80	1.68	1.39
any (all-false)	8.89	8.41	2.12	0.98	1.00	2.74
any (hit)	9.01	8.50	8.50	7.87	8.22	8.41
— copy/cast —
flatten	0.43	0.44	0.17	2.17	0.90	0.57
astype	0.30	0.53	0.59	1.97	1.90	0.81
ravel.T	0.45	0.73	0.48	2.11	1.01	0.80
copy in-place	1.77	0.81	0.81	1.06	1.02	1.05
less → bool	0.81	0.52	0.26	0.54	0.76	0.54
— index-math —
unravel_index	0.33	0.50	0.95	1.01	0.97	0.68
ravel_multi_index	0.32	0.52	0.99	1.49	1.53	0.82
— dtypes —
complex128	0.74	0.63	1.01	0.76	0.89	0.80
float16	0.72	0.65	0.62	0.62	0.62	0.65
int8	0.67	1.47	12.09	7.70	5.78	3.51

3) Operand & broadcast layouts — case × dtype (1M elements)

case	f64	f32	f16	i32	i64	c128	geomean
1-D contiguous (a+a)	2.55	2.18	0.62	2.04	2.54	2.38	1.87
1-D strided a[::2]	1.83	1.36	0.52	1.37	1.77	1.86	1.34
1-D reversed a[::-1]	2.26	2.09	0.56	2.00	2.14	2.23	1.71
array + scalar	2.63	2.11	0.63	1.80	2.17	2.58	1.81
scalar + array	2.35	2.10	0.64	1.98	2.48	2.59	1.85
mixed C + F	2.12	2.01	0.62	1.98	2.02	2.26	1.70
mixed C + T	2.59	2.13	0.62	2.04	2.20	2.51	1.84
broadcast +row(1,C)	2.72	2.28	0.63	2.00	2.42	2.42	1.89
broadcast +col(R,1)	2.68	2.14	0.56	1.99	2.45	2.82	1.88
col-broadcast unary	2.55	1.76	0.86	1.69	2.66	6.09	2.18

4) `astype` cast matrix — geomeans (1M, over all 1,568 src→dst × layout cells)

By memory layout

C	F	T	sliced	negrow	negcol	strided	bcast
1.82	1.97	1.88	1.87	1.89	1.81	1.47	2.21

By source dtype

bool	u8	i8	i16	u16	i32	u32	i64	u64	char	f16	f32	f64	c128
2.06	1.99	2.02	2.26	2.21	1.96	1.94	1.64	1.58	1.98	2.41	1.75	1.45	1.16

By destination dtype

bool	u8	i8	i16	u16	i32	u32	i64	u64	char	f16	f32	f64	c128
1.95	1.76	1.80	1.63	1.61	1.82	1.66	1.93	1.76	1.63	2.48	1.63	2.04	2.55

_{Reading: 8.89 = NumSharp ~8.9× faster; 0.40 = NumSharp ~2.5× slower. The float16 ≈0.6 column reflects the missing Vector<Half> in the .NET BCL (scalar path). decimal omitted from cast geomeans (no NumPy dtype). Absolute ms aren't cross-machine comparable; these ratios are.}

The benchmark/history/<date>_<sha>/ provenance convention held only the 2026-06-05 snapshot; the just-regenerated full run (report commit e3b7c26) was never persisted there — its raw archive sat only in the gitignored benchmark/results/20260623-065155/. This promotes the four core artifacts + a MANIFEST, matching the existing snapshot's shape. Snapshot: benchmark/history/2026-06-23_e3b7c268/ - MANIFEST.md provenance / env / methodology / headline geomeans - benchmark-report.md op-matrix + appended NpyIter/Layout/Operand/Cast/Fusion - benchmark-report.json unified machine-readable results - benchmark-report.csv spreadsheet form - numpy-results.json raw NumPy timings (merge input) Run 20260623-065155 · i9-13900K / Win11 Pro N / .NET 10.0.101 / Python 3.12.12 / numpy 2.4.2 · Release · ~3h. Benchmarked branch HEAD b8623db + local engine WIP (built clean). Convention NPY/NS (>1 = NumSharp faster). Headline: op-matrix 1.14x/0.90x/1.26x at 1K/100K/10M; NpyIter 1.18x geomean (bcast_reduce 538x, any(F) 2.74x); fusion (a-b)/(a+b) 4.16x. Raw BDN per-class JSON omitted (regenerable per the MANIFEST).

… symlink + publish ritual Make a benchmark run produce a complete, committable provenance snapshot under benchmark/history/ (the only tracked home for the json/csv/numpy-results that are gitignored at the benchmark root), and expose the newest run through a git-tracked benchmark/history/latest symlink so docs and tooling can reference a stable path. Previously a run only left the gitignored results/<ts>/ raw scratch plus root-level report files (json/csv ignored), so "what we commit and reference" had no home and history/ snapshots were assembled by hand. WHY results/<ts>/ = gitignored raw scratch (BDN per-class JSON, ~tens of MB) history/<date>_<sha>/ = tracked provenance snapshot (MANIFEST + report + all sheets) history/latest = git symlink -> newest snapshot (the stable reference path) The benchmark/.gitignore drops benchmark-report.{json,csv} and numpy-results.json at the root, so history/<snap>/ is the ONLY committed location that carries the full, machine-readable result set alongside the rendered .md. NEW benchmark/scripts/snapshot_history.py Assembles a snapshot from the newest (or --results-dir) run + the rendered subsystem sheets into history/<date>_<sha>/ and (re)points history/latest at it: - 15 artifacts: benchmark-report.{md,json,csv}, numpy-results.json, npyiter/layout/operand/cast/fusion *_results.{md,tsv}, cards/{ops,cat}.png - auto-generated MANIFEST.md: provenance + environment (CPU parsed from the BDN ProcessorName, OS, .NET SDK, Python, NumPy) + NPY/NS convention + methodology + the per-size geomean headline parsed back out of benchmark-report.md + the NpyIter/Cast headlines. - PROVENANCE accuracy: the working-tree marker reads DIRTY only when BUILD-AFFECTING sources differ from HEAD (uncommitted-or-new src/ code or *.csproj/props/targets), NOT on untracked scratch (benchmark/poc/*.cs), gitignored output, or the report files a run always regenerates. So a clean-checkout CI run correctly reads "clean (HEAD exactly)" instead of a permanent false DIRTY. All MANIFEST cells are |-escaped so a commit subject / WIP path containing '|' can't inject a table column. - history/latest as a relative mode-120000 symlink (repo has core.symlinks=true and already tracks symlinks); text-file fallback if the OS refuses symlinks. - <date>_<sha> names the BENCHMARKED commit; --head lets an after-the-fact re-snapshot record the run's real report commit (and reads THAT commit's subject). - flags: --results-dir / --snap-name / --head / --commit / --no-stage. Raw BDN per-class JSON is intentionally not persisted (regenerable). WIRE run_benchmark.py New step 6 after the root-copy: unless --no-history, shell out to snapshot_history.py with the just-produced --results-dir, so every full run ends with a committable snapshot + repointed latest. Adds HISTORY_DIR const, the --no-history opt-out, and a closing "Snapshot: .../latest -> ..." line. Passes --no-stage: writing the snapshot must NOT mutate the git index — staging is the human's "review" step (run -> review -> commit) and CI stages benchmark/history/ explicitly; a local perf check shouldn't silently `git add` ~16 files. WIRE .github/workflows/benchmark.yml (the post-release publish ritual) git add now includes benchmark/history/ so the snapshot + latest symlink are committed to master alongside the refreshed report/cards; commit message and header comment updated to mention the history snapshot. DOCS - benchmark/CLAUDE.md: documents history/ (latest + <date>_<sha>/), results/ as gitignored scratch, scripts/snapshot_history.py, and a new "History snapshots & the publish ritual" section (results/ vs history/<snap>/ vs latest table + run -> review -> commit ritual). - .claude/CLAUDE.md (Performance Convention): "what we commit & reference is benchmark/history/, not results/<ts>/"; reference the stable benchmark/history/latest/benchmark-report.md. Also reconciles the canonical-harness line to FIVE appended subsystems (adds the previously omitted benchmark/operand/). BACKFILL benchmark/history/2026-06-23_e3b7c268/ Re-ran the script over the e3b7c26 report run: the snapshot carries the full 15 artifacts (was 4 hand-copied core files) + a regenerated MANIFEST that correctly reads "clean (HEAD exactly)" (no build-affecting WIP at that commit), and history/latest points at it.

Add a detailed compatibility matrix comparing NumPy (v2.4.2) string dtype/features to NumSharp (nditer branch f7be5f7). Documents support across dtypes, parsing, scalars, array creation, memory layout, casting, comparisons, string operations, array functions and I/O, and highlights that NumSharp lacks a real string dtype (only a UTF-16 char vector and several hacks). Notes two spec corrections found against live NumPy, links to NUMPY_STRING_TYPES.md, and tracks issue #592.

Nucs force-pushed the nditer branch from f5c05a7 to 574a0d8 Compare April 23, 2026 09:34

Nucs mentioned this pull request Apr 28, 2026

Add NDIterator<T> overload with support for specific axis. #363

Open

Nucs mentioned this pull request May 17, 2026

[Core] Layout 'F/A/K' support #610

Open

8 tasks

Nucs added 27 commits May 29, 2026 20:44

Nucs added 25 commits June 20, 2026 19:57

Nucs added 3 commits June 24, 2026 07:59

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

[Major Rewrite] NumPy nditer port, NpyExpr DSL with 3-tier custom-op API, C/F/A/K memory layout support, stride-native matmul#611

[Major Rewrite] NumPy nditer port, NpyExpr DSL with 3-tier custom-op API, C/F/A/K memory layout support, stride-native matmul#611
Nucs wants to merge 462 commits into
masterfrom
nditer

Nucs commented Apr 22, 2026 •

edited

Loading

Uh oh!

Nucs commented Jun 23, 2026

Uh oh!

Nucs commented Jun 23, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Uh oh!

Conversation

Nucs commented Apr 22, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

TL;DR

1. NpyIter — full NumPy nditer port

Execution at NumPy speed

2. NpyExpr DSL + three-tier custom-op API

3. Legacy iterator stack retired

4. C/F/A/K memory-layout support

5. New & completed np.* APIs

6. Linear algebra

7. Performance (beyond NpyIter and linalg)

8. Official benchmark suite + honest methodology

9. Differential fuzzing vs NumPy (new infrastructure)

10. Correctness — NumPy-parity bug fixes

11. Memory management — ARC + IDisposable

12. Char8 primitive

13. Examples — trainable MNIST MLP

14. Kernel architecture & hygiene

15. Documentation

16. Tests & CI

Breaking changes

Uh oh!

Nucs commented Jun 23, 2026

📊 Performance snapshot vs NumPy 2.4.2

🟢 Where it's great (often an order of magnitude)

🔴 Where it's still bad (the honest worklist)

TL;DR

Uh oh!

Nucs commented Jun 23, 2026

📐 Performance matrices — >1.0× = NumSharp faster

1) Operations — category × cache tier (geomean)

2) Per-family × cache tier

3) Operand & broadcast layouts — case × dtype (1M elements)

4) astype cast matrix — geomeans (1M, over all 1,568 src→dst × layout cells)

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Nucs commented Apr 22, 2026 •

edited

Loading

1. NpyIter — full NumPy `nditer` port

5. New & completed `np.*` APIs

11. Memory management — ARC + `IDisposable`

12. `Char8` primitive

📐 Performance matrices — `>1.0×` = NumSharp faster

4) `astype` cast matrix — geomeans (1M, over all 1,568 src→dst × layout cells)