Conversation
… throws)
NumPy has no subtract ufunc loop for the bool dtype: bool - bool,
np.subtract(bool, bool) and np.ptp(bool) (which computes max - min) all raise
TypeError ("numpy boolean subtract, the `-` operator, is not supported, use
the bitwise_xor, the `^` operator, or the logical_xor function instead.").
NumSharp previously computed a boolean result instead - a silent divergence
surfaced while implementing np.ediff1d.
DefaultEngine.Subtract now guards the both-operands-boolean case and throws
NotSupportedException with NumPy's message. Mixed bool + numeric still
promotes to the numeric type and subtracts (e.g. bool - int32 -> int32), so
only the genuine bool/bool case is rejected. The guard sits at the single
engine chokepoint, so it covers the - operator, np.subtract, the
scalar-scalar path, and any op that subtracts internally (e.g. np.ptp).
Side effect: np.ptp(bool) now raises the same error as NumPy instead of
returning a bool, fixing a previously-documented [Misaligned] divergence.
Tests
- np_subtract_tests.cs: dropped the (Boolean, Boolean) DataRow from
SubtractAllPossabilitiesBoolean (it contradicted the test's own _REGEN
template, which excludes Boolean) and added SubtractBoolBool_ThrowsLikeNumPy
asserting the operator, np.subtract and the scalar path all throw.
- np.ptp.BattleTests.cs: Ptp_Bool_DivergesFromNumPy -> Ptp_Bool_ThrowsLikeNumPy
(dropped [Misaligned]; now asserts the throw, matching NumPy).
Full suite green on net10.0 (9404 passed, 0 failed). Two net8.0 failures
(Trunc_FContig_PreservesFContig, Max_WithNaN_PropagatesNaN) are pre-existing
and unrelated: they fail identically with this change reverted and involve no
boolean subtraction.
…e(bool) throw)
NumPy has no negative ufunc loop for the bool dtype: unary -bool,
np.negative(bool) and ndarray.negative() all raise TypeError ("The numpy
boolean negative, the `-` operator, is not supported, use the `~` operator or
the logical_not function instead.") - even for empty arrays. NumSharp
previously returned logical_not instead, a silent divergence (companion to the
bool-subtract divergence fixed in the previous commit).
Changes
- DefaultEngine.Negate (the unary `-` operator / TensorEngine.Negate path):
the bool branch now throws instead of routing to LogicalNot.
- NDArray.negative() (the np.negative / ndarray.negative() path, a separate
implementation): guards bool up-front - before the size==0 short-circuit and
the cast allocation, so empty bool also throws - and drops the now-dead
boolean switch case.
- np.logical_not(bool) previously relied on Negate(bool) routing to LogicalNot.
It now routes through np.invert (for bool, logical_not == ~ == invert), so
logical_not keeps working after the Negate guard. The `!` and `~` operators
were already independent of Negate and are unaffected.
Numeric negate (int / float / decimal / complex / half) is unchanged.
Tests
- NDArray.negative.Test.cs: Negative_Bool_ThrowsLikeNumPy asserts -bool,
np.negative(bool), ndarray.negative() and empty-bool all throw, and that
~bool / np.logical_not(bool) still return the boolean flip [F, T, F].
Full suite green on net10.0 (9405 passed, 0 failed). The two net8.0-only
failures (Max_WithNaN_PropagatesNaN, Trunc_FContig_PreservesFContig) are
pre-existing and unrelated: they reproduce on a clean build with this change
reverted and involve no boolean negation.
…cross all dtype pairs
NumSharp had THREE cast implementations with divergent out-of-range / NaN semantics:
- Converts (scalar; used by astype/CastTo): NumPy-faithful on all 3276 cast cases.
- DirectILKernelGenerator cast kernel (SIMD; used by NpyIter.Copy contig/strided):
SATURATED float->int and sign-extended signed->UInt64 to only 32 bits -> 142 wrong.
- NpyIterCasting.ConvertValue (scalar fallback; used by NpyIter.Copy/copyto/where/
concatenate): read every source through a double intermediate (dropping Int64/UInt64
low bits) and used C# saturating casts -> 346 wrong.
NumPy uses MODULAR WRAPPING for integer narrowing/sign-change and C truncation
(type-min / int.MinValue sentinel on NaN/overflow) for float->int. The IL kernel and
ConvertValue instead SATURATED (clamped to dst min/max), so every out-of-range,
negative->unsigned, float-overflow or NaN value produced a wrong result through
NpyIter.Copy -- a latent bug reachable today via np.where, np.copyto and np.concatenate
cross-dtype casts (not just the astype fast-path under development).
Spec established empirically: a 15x15 dtype x edge-value ground-truth matrix generated
from NumPy 2.x (.claude/scripts/cast_spec_numpy.py), replayed bit-exactly through all
three NumSharp paths (cast_spec_numsharp.cs). Converts matched NumPy on every case and is
the single source of truth.
Fixes:
- NpyIterCasting.ConvertValue: read the source through a LOSSLESS intermediate
(long / ulong / double / decimal categories instead of blanket double) and route
integer narrowing and float->int conversions through the Converts.* table. Integer
casts now wrap (unchecked); float->int truncates with the NumPy overflow/NaN sentinel.
Result: 0/3276 divergences (was 346).
- DirectILKernelGenerator: TryGetCastKernel / TryGetStridedCastKernel now decline the two
families whose emitted IL diverges from NumPy (float->integer; signed-narrow->UInt64)
via DivergesFromNumpyCast(), so those casts fall back to the NumPy-exact ConvertValue
path. All other pairs keep their SIMD kernels. NumPy-exact SIMD emission for the
declined families (raw cvtt truncation, full 64-bit sign extension) is a perf follow-up.
Result: all three cast paths are now bit-exact with NumPy across all 3276 cast cases
(was il=142, cv=346, effective fast path=271 wrong). Full unit-test suite: 0 failures,
9405 passed (TestCategory!=OpenBugs,HighMemory).
astype's cross-dtype path used a scalar per-element Converts loop (UnmanagedMemoryBlock.CastTo) that was 1.5-3x slower than NumPy on contiguous arrays and paid a cache-hostile reorder for transposed/F-contiguous sources. Replace it with a SIMD copy-with-cast through NpyIter.Copy into a KEEPORDER output: the output's memory order mirrors the source's contiguity, matching NumPy's array_astype default order='K' / PyArray_NewLikeArray (numpy/_core/src/multiarray/methods.c:769). An F-contiguous or transposed source now casts as a flat both-F copy instead of a strided reorder. Wins (min of 3 runs, old astype -> new, vs NumPy 2.x), 2048x2048: i32->f64 contig 8.55ms -> 4.56ms (0.89x NumPy - beats it) i32->f64 transpose 10.76ms -> 4.54ms (0.89x NumPy) f32->f64 transpose 10.99ms -> 4.63ms (0.93x NumPy) i32->i32 transpose 1.83ms -> 1.82ms (0.91x NumPy) All non-narrowing cast families (int->float, float->float, int->int, same-width) are now within 10% of NumPy or faster across contig/strided/transpose/fortran/broadcast/negstride. float->int and signed-narrow->UInt64 are excluded via DivergesFromNumpyCast: they have no NumPy-faithful SIMD kernel yet (see 71b6c38), and NpyIter.Copy's scalar strided fallback is slower than the legacy contiguous Converts loop, so they keep the legacy path - correct and no slower than before. Restoring SIMD for float->int with NumPy wrapping semantics (raw cvtt truncation, no saturation) is the follow-up that closes the f64->i32 downcast gap. Correctness rests on the cast-semantics alignment in 71b6c38 (all three cast paths now bit-exact with NumPy across the full 15x15 matrix). Full suite: 0 failures, 9405 passed.
…float->int & signed->UInt64
Implements float->int casting correctly AND fast, closing the f64->i32 downcast gap and
un-declining the family from DivergesFromNumpyCast.
Root cause (proven in .claude/scripts/f64_to_i32_shootout.cs): the hardware truncating convert
VCVTTPD2DQ/VCVTTPS2DQ already yields NumPy's exact result -- truncate toward zero, INT_MIN
sentinel on NaN/overflow -- while C#'s (int)d and the IL conv.* opcodes SATURATE. So the fix is
to USE cvtt, not to clamp.
Changes:
- EmitConvertTo (scalar element conversion; used by every cast-kernel tail and the ScalarOnly
float->int pairs): route float->{integer,char} through the Converts.* table (NumPy-exact)
instead of the saturating conv.* opcodes, and sign-extend signed->UInt64 via Conv_I8 instead
of the zero-extending Conv_U8. The entire scalar cast path is now bit-exact with NumPy.
- CastDoubleToInt32Contig / CastSingleToInt32Contig: width-adaptive cvtt helpers (AVX512
8/16-wide, AVX 4/8-wide, SSE2 2/4-wide, scalar Converts tail) registered for the float->int32
contig pairs in TryGetCastKernel. The IL generator's width-matched emitter can't express the
double->int32 lane halving cleanly, so these use the "tidy helper -> single Call" pattern.
- DivergesFromNumpyCast narrowed from {all float->int, signed->UInt64} to just signed->UInt64
(its vectorized widen still sign-extends to only 32 bits, so it stays on the correct
ConvertValue scalar fallback). float->int is NumPy-faithful on every path now.
Perf (min of 3, 2048x2048, new vs NumPy 2.x astype):
f64->i32 contig 9.01ms -> 2.90ms (0.85x NumPy - beats)
f64->i32 transpose 13.68ms -> 2.95ms (0.87x)
f64->i32 fortran 13.76ms -> 3.04ms (0.88x)
f64->i32 broadcast 14.35ms -> 2.39ms (0.92x)
f64->i32 now beats NumPy on contig/transpose/fortran/broadcast. Strided/negstride f64->i32
improved 2.4-2.6x (8.3->3.4ms, 16.9->6.5ms) but remain on the scalar strided path (1.4-2.0x
NumPy); a strided-cvtt inner loop is the follow-up.
Correctness: all three cast paths bit-exact with NumPy across the full 3276-case matrix
(cast_spec) plus multi-element SIMD-body verification (cast_multi_correctness: float->all 9 int
dsts x contig/strided/reversed, signed->u64). Full suite: 0 failures, 9405 passed.
…non-contiguous gap) The contig float->int32 cvtt helper (26b710b) left strided and negative-stride f64->i32 on the scalar strided path (1.4-2.0x NumPy). Add a NumPy-faithful cvtt strided kernel that vectorizes the innermost axis for the three common dst-contiguous run shapes: - unit-stride source -> VCVTTPD2DQ forward - reversed source ([::-1])-> load forward chunk, cvtt, reverse lanes (Vector128.Shuffle) - positive-strided source -> AVX2 gather (a[:,::2] etc.), then cvtt Outer dims walk via an incremental-offset odometer (element strides). Anything else (non-contiguous dst, exotic strides, no AVX2) falls back to the NumPy-faithful scalar Converts path. Registered in TryGetStridedCastKernel for double->int32. Perf (min of 3, 2048x2048, new vs NumPy 2.x): f64->i32 strided large 3.41ms -> 1.98ms (0.83x NumPy - beats) f64->i32 negstride large 6.48ms -> 2.99ms (0.85x) f64->i32 negstride med 0.385 -> 0.154 (0.80x) With this, the astype variation matrix has ZERO real-array (>=256K-element) failures: every meaningful-size cast cell across f64<->i32 / f32<->f64 / i32<->i32 x contig/strided/transpose/fortran/broadcast/negstride is now within 10% of NumPy or faster. The remaining 21 matrix fails are all trivial-size (<=4K elements, sub-microsecond), bound by NDArray allocation overhead rather than the cast kernel. Correctness: cast_spec 0/3276 mismatches on all three paths; multi-element SIMD-body test (double->int32 strided/reversed/contig) 0 failures. Full suite: 0 failures, 9405 passed.
…ex->bool dropping imaginary part Builds the differential-fuzzer harness (Plan A items #1+#2 foundation): an offline, committed NumPy 2.4.2 oracle corpus replayed bit-exact by MSTest, so every NpyIter-backed operation can be proven identical to NumPy across the full input space -- caught systematically, not by hand-picked cases. The motivating failure class (the cast saturate-vs-wrap bug, latent in where/copyto/concatenate) must be impossible to ship again. Harness - oracle/gen_oracle.py + oracle/layout_catalog.py: Python generator emitting JSONL cases {op, params, operands[], expected}. Each operand is serialized bytes-exact as (dtype, shape, element-strides, element-offset, bufferSize, base-buffer-hex). 26 single-array layouts: C/F-contiguous (1/2/3-D), transposed, strided, negative-stride, offset slices, broadcast (stride-0), scalar-broadcast, 0-D views, newaxis, singleton dim, empty, empty-composed, 5-D. Edge values are FRONT-LOADED (NaN / +/-inf / overflow / -0.0) so even an 8-element operand exercises the float->int cvtt sentinel paths. describe() self-validates that every operand is a genuine VIEW into its base buffer, so a layout that accidentally yields a copy/scalar fails loudly at generation. - test/NumSharp.UnitTest/Fuzz/FuzzCorpus.cs: reconstructs the EXACT operand views from raw bytes via contiguous-storage + Alias (handles broadcast where buffer < size, negative strides, offset slices, 0-D, empty) -- no Python at test time. - BitDiff.cs: element-wise bit-exact compare with NaN tokenized (payload non-contractual); -0.0 and +/-inf are bit-compared since NumPy preserves signed zero / canonical infinity. - OpRegistry.cs: op-name -> NumSharp call (astype today; binary/comparison/unary/reduction /where tiers extend the switch). - FuzzCorpusTests.cs: one data-driven [TestCategory("FuzzMatrix")] test per corpus file; a failure lists every divergent cell (not first-failure-wins). - HarnessSelfTests.cs: proves the harness has teeth -- BitDiff detects value/NaN-vs-number /signed-zero divergence and treats differing NaN payloads as equal; and the corpus pipeline genuinely exercises the float->int wrap-not-saturate semantics on real edges. - Corpus committed under Fuzz/corpus/ and copied to test output: astype_smoke (312 cases) + astype_full (4394 = 13 x 13 NumPy-representable dtypes x 26 layouts). Bug found and fixed (the matrix's first real catch) - complex->bool through the NpyIter scalar-cast fallback (NpyIterCasting.ConvertValue) dropped the imaginary part: it took c.Real and applied real != 0, so complex(0, nonzero) and complex(-0.0, nonzero) returned False. NumPy defines bool(z) == (z != 0): truthy when EITHER part is non-zero. Latent wherever complex->bool casts through the iterator (astype, np.where, np.copyto, np.concatenate). Fixed to honor both parts, matching the already-correct Converts.ToBoolean(Complex). All 15 divergent cells across every layout now match NumPy. Verification - astype_full: 0 / 4394 divergences from NumPy. - Full CI suite (net10.0, TestCategory!=OpenBugs&!=HighMemory): 0 failed / 9412 passed. Scope note: Char and Decimal have no NumPy analog, so they are excluded from the differential corpus; both remain covered by the existing Converts-oracle cast tests.
…classifier
Extends the differential fuzzer to binary arithmetic (add/subtract/multiply/divide)
across 20 NEP50 dtype pairs x 9 pairwise operand layouts, bit-comparing value AND
result dtype AND broadcast shape against NumPy 2.4.2. 720 cases.
Pairwise layouts (layout_catalog.PAIR_LAYOUTS): pp_contig_contig (SimdFull),
pp_contig_fortran, pp_contig_strided (SimdChunk), pp_strided_strided (General),
pp_scalar_right / pp_scalar_left (SimdScalar*), pp_broadcast_row / pp_broadcast_col,
pp_negstride_both. Operands are emitted at natural shapes; the op performs broadcasting.
The matrix now asserts three things per case: result dtype == NumPy (the headline NEP50
promotion failure), result shape == NumPy (broadcasting), and bit-exact bytes.
Misaligned classifier (never-silent intended divergences)
- test/.../Fuzz/MisalignedRegistry.cs: the explicit, documented set of intended
NumSharp-vs-NumPy differences. The matrix excuses ONLY these patterns (logging a
per-reason count) and fails on anything else. Cases stay in the corpus so a future
fix simply starts passing, and any drift past tolerance turns back into a hard failure.
- BitDiff.WithinUlp: ULP-distance helper used ONLY to classify documented near-misses,
never to relax the default bit-exact gate.
Two intended divergences documented (maintainer decision: keep as Misaligned)
1. NEP50 weak-scalar (41 cases): NumSharp treats a 0-D array operand as a weak scalar
(the other operand's dtype drives promotion), e.g. array(int32)+zerod(int64) -> int32
where NumPy gives int64; array(uint8)+zerod(int8) -> uint8 vs NumPy int16. NEP50 makes
0-D arrays full promotion participants; only Python scalar literals are weak. NumSharp
cannot distinguish the two (both are 0-D NDArrays), and `arr + 5` ergonomics were chosen
over strict parity. Array+array promotion is correct across all layouts.
2. Complex true-division ~1 ULP (11 cases): complex / {float,complex,int} differs from
NumPy's npy_cdivide by one ULP (System.Numerics.Complex uses different scaling).
Complex add/sub/mul are bit-exact; only division diverges.
Verification: Binary_Arith excuses exactly 52 divergences (11 complex-divide + 41
weak-scalar) and bit-matches the other 668 cases; full FuzzMatrix category 0 failed /
8 passed. No NumSharp.Core changes in this commit (harness + corpus + generator only).
…Bugs]
Extends the binary differential generator to floor_divide / mod / power and runs them as
a known-failing [OpenBugs] reproduction (excluded from CI; remove the tag when fixed).
The bit-exact add/sub/mul/divide matrix (Binary_Arith) stays green; these three ops are
split into a separate binary_divmod_power.jsonl corpus + Binary_DivModPower_KnownDivergences
so the CI gate is not broken while the divergences are tracked.
206/430 cases diverge from NumPy 2.4.2 (27 NEP50 weak-scalar cases auto-excused). The real
bugs, confirmed directly against NumPy:
integer ÷0 / mod 0 NumPy returns 0 (with RuntimeWarning). NumSharp:
- int32 //0 -> 2147483647 / -2147483648 (saturated sentinel), not 0
- int32 %0 -> the dividend (1,-7,5), not 0
- uint8 //0 and %0 -> throw DivideByZeroException
float //0 NumPy: 1.0//0 -> +inf, -1.0//0 -> -inf, 0.0//0 -> nan.
NumSharp: NaN for all (loses the ±inf sign result).
mixed-precision mod mod(float32, float64) computes in float32 then widens (low mantissa
zeroed) instead of promoting to float64 first. add/sub/mul/div promote
correctly; only mod loses precision.
complex power complex ** {float,complex,int} differs by ~1 ULP (like complex divide)
plus some gross edge divergences (NumPy NaN vs NumSharp 0).
Correct already: mod sign convention is floored (mod(-7,3)==2, matches NumPy), and float %0
is nan on both sides.
CI suite (net10.0, !OpenBugs&!HighMemory): 0 failed / 9413 passed. FuzzMatrix: 8 passed.
Adds the comparison tier (equal/not_equal/less/greater/less_equal/greater_equal -> bool) across the 20 NEP50 dtype pairs x 9 pairwise layouts. 1080 cases. Found: NumSharp's <= and >= return True when an operand is NaN; IEEE/NumPy return False (only != is True against NaN). <, >, ==, != all handle NaN correctly. 122 cases, every one the NumPy-false -> NumSharp-true signature on the NaN element. Documented in MisalignedRegistry (tagged [known bug]) with a tight signature so the matrix stays green; tracked for fix. Comparison matrix excuses exactly 122 NaN-ordering divergences and bit-matches the other 958. FuzzMatrix category green.
…classes
Adds the unary tier (negative/abs/sign/sqrt/cbrt/square/reciprocal/floor/ceil/trunc/
sin/cos/tan/exp/log) across 7 dtypes x 26 single-array layouts. 2574 cases. Adds Half-ULP
support to BitDiff and routes THREW through the Misaligned classifier (new DivergenceKind.Threw).
Also fixes a generator artifact: np.ascontiguousarray forces ndim>=1, which corrupted the
expected shape of 0-D results to [1]. gen_unary/gen_binary now read r.shape BEFORE
ascontiguousarray (used only for the bytes), so scalar results compare as [] (matching NumSharp).
Documented divergence classes (excused, never silent; remove from MisalignedRegistry as fixed):
494 unary NEP50 promotion NumSharp promotes int->float64 for unary float ops regardless of
width; NumPy uses width-based float (bool/int8->float16,
int16->float32, int32+->float64). square/reciprocal/floor dtype also differ.
90 complex unary square catastrophic cancellation (re^2-im^2 -> 0); sin/cos/tan/log
inf/NaN edge handling differs (System.Numerics vs npy_c*).
72 unary ~ULP transcendental / magnitude (abs complex) within 2 ULP.
61 reciprocal integer reciprocal returns 0 vs NumPy ÷0 sentinel; non-contiguous
operand throws "Can't return a memory address".
24 negative on unsigned np.negative(uint8) throws NotSupportedException; NumPy wraps modulo.
Unary matrix excuses these and bit-matches the rest. FuzzMatrix category: 10 passed.
Tracked for fixes in task #9.
…ergences
Adds the reduction tier (sum/prod/min/max/mean/std/var/argmax/argmin/all/any) across axis
in {None, 0, ndim-1} x keepdims {false,true} x 7 dtypes x 10 layouts. 3640 cases. OpRegistry
gains a reduction dispatch handling the heterogeneous NumSharp signatures (mean/std/var split
None-vs-int overloads; argmax/argmin require an int axis; prod takes a Type arg).
Reductions are mostly solid — only 80/3640 diverge. all/any/argmax/argmin/prod are bit-exact.
Documented divergence classes (excused in MisalignedRegistry; remove as fixed, task #10):
reduction NaN propagation min/max/mean/std/var return the non-NaN result; NumPy keeps NaN
(regular reductions propagate; only nan* variants skip).
complex axis reduction mean/std/var/sum of complex along an axis throws
"Can't construct NDCoordinatesAxisIncrementor with a vector shape".
bool min/max along axis returns True where NumPy returns False.
summation precision sum/mean/std/var floating accumulation differs from NumPy
pairwise / two-pass (algorithm order).
reduction result dtype NEP50 accumulator width / complex->real (std/var) differences.
FuzzMatrix category: 11 passed.
Adds the selection tier: np.where(cond,x,y) across 8 dtype pairs x 5 triple-operand layouts
(contig / broadcast-x-y / strided / scalar-cond / broadcast-cond) and np.place(arr,mask,vals)
across 3 contiguous layouts x 5 dtypes. where=40 cases, place=15 cases.
np.place is bit-exact with NumPy on all 15 cases. np.where bit-matches 33/40; documented:
3 NEP50 weak-scalar where(cond, array, 0-D scalar) promotes weakly (generalized the
weak-scalar rule to any multi-operand op with a 0-D operand).
4 complex np.where throws NotSupportedException "Zero-push unsupported for Complex".
With this, all six A1 op tiers (T1 cast, T2 binary, T3 comparison, T4 unary, T5 reduce,
T6 where/place) are implemented and green. FuzzMatrix category: 13 passed. Every divergence
is either bit-exact, documented in MisalignedRegistry, or a tracked [OpenBugs] reproduction.
Adds oracle/fuzz_random.py: a seeded, reproducible random generator that draws random transposed/strided/reversed views (random ndim<=4, dims, stride perms/signs), random dtypes, and random ops across all tiers, computes the NumPy 2.4.2 result, and writes a committed corpus the C# harness replays. FuzzRandom replays a 2000-case seed (1234); FuzzRegression replays shrunk soak failures from corpus/regressions/ (empty until the soak finds one). Shrinker.cs minimizes a failing element-wise case to a single 0-D element using the corpus's own expected byte at the diverging index (no NumPy needed); RunCorpus emits the minimal repro in the failure message so a soak find is immediately committable. A self-test proves the shrunk case reproduces the divergence (bool add True+True). The random fuzzer is scoped to NumSharp-PRODUCIBLE layouts: NumSharp's slicing normalizes offset into the storage base (Shape.offset stays 0) and keeps consistent size-1 strides, so reconstructing NumPy's RAW offset!=0 / junk-size-1-stride representation exposes ops that assume the normalized form (real but unreachable via the API; tracked as task #11). With that scoping the random gate is green; it surfaced three NEW divergence classes the fixed matrices missed (tasks #11, #12): bool arithmetic True+True -> NumSharp 2 vs NumPy bool True. size-1 collapse NumSharp collapses an all-size-1 result to 0-D; NumPy keeps [1]. complex binary multiply (and divide on some data) catastrophic cancellation + ~1 ULP. Shape divergences now route through the classifier too. FuzzMatrix category: 16 passed.
…EADME
The FuzzMatrix differential gate (cast/binary/comparison/unary/reduce/where/place matrices +
seeded FuzzRandom + FuzzRegression) already runs on every push/PR: the tests are
[TestCategory("FuzzMatrix")], not excluded by the CI filter, and replay the committed offline
corpora with no Python. Documented that in build-and-release.yml's filter comment.
Adds .github/workflows/fuzz-soak.yml: a nightly (04:00 UTC) + manual job that installs NumPy
2.4.2, sweeps seeds through oracle/fuzz_random.py (~200k cases/seed, ~1M/night), and replays
each through the harness. On a divergence the shrunk minimal repro is printed and the failing
corpus uploaded as an artifact, to be pinned under corpus/regressions/ for FuzzRegression.
Adds test/NumSharp.UnitTest/Fuzz/README.md: how the offline oracle/replay model works, corpus
regeneration commands, the CI cadence, and the full documented-divergence ledger with task refs.
Completes Plan A (A0 harness, A1 T1-T6 op matrices, A2 random fuzzer + shrinker, A3 CI). Full
suite (net10.0, !OpenBugs&!HighMemory): 0 failed / 9421 passed; FuzzMatrix: 16 passed.
Adds docs/FUZZ_FINDINGS.md: every NumSharp-vs-NumPy-2.4.2 divergence the differential fuzzer surfaced, each bit-exact verified, with minimal NumPy-vs-NumSharp reproductions, root cause where known, and disposition (FIXED / BUG / INTENDED / SCOPING) mapped to tasks #7-#12. 22 findings: 1 fixed (complex->bool), 18 confirmed bugs (integer ÷0/mod0, float //0, mixed-precision mod, complex power; NaN <=/>=; NEP50 unary promotion, negative-unsigned, reciprocal, complex unary; reduction NaN propagation, complex-axis throw, bool min/max, summation precision, result dtype; complex where throw; bool arithmetic, size-1 collapse, complex binary cancellation), 2 intended Misaligned (NEP50 weak-scalar, complex-divide ULP), 1 scoping note (ops vs raw offset!=0 / junk size-1 strides).
…r behaviors) Adds docs/FUZZ_PLAN_NEXT.md: an actionable plan for the two open coverage rows. Part 1 — #2 sections C (per-operand flags) + E (composite paths): black-box, extends the existing differential harness with out= / where= params and operand-relationship layouts (aliased, overlapping, write-masked, out-strided/broadcast/cross-dtype). C5 + error parity fold in test-kind #4. Prerequisite: OpRegistry/corpus support for a pre-seeded out operand and a mask. Part 2 — #3 NpyIter behaviors (= section D, white-box): port NumPy's test_nditer.py (~106 fns). B0 coverage ledger -> B1 expose iterator state (ChosenOrder/CoalescedNDim/IsBuffered/InnerStrides) -> B2 order+coalescing (highest value, guards the KEEPORDER stride-sort) -> B3 buffered casting -> B4 op_axes/remove_axis/ranged (triage feature gaps) -> B5 nditer-config error parity. Acceptance: every test_iter_* green / [Misaligned] / tracked [OpenBugs] - nothing untracked. Plus an adjacent-gaps appendix (op breadth incl matmul/manipulation/bitwise/nan-aware, order=/dtype=/ ddof params, SIMD-tail shapes, leak/perf/metamorphic) and an effort-shape table.
…ion north star)
Adds docs/ROADMAP.md consolidating the entire path from the current baseline to the
/np-function goal (every np.* bit-identical to NumPy 2.4.2 OR >=1.5x faster), correctness-first
with the differential fuzzer as the regression net for the perf refactor.
Five phases:
0 Baseline (Plan A done) — harness, T1-T6 matrices, random fuzzer, CI gate.
1 Correctness backlog — fix the 22 findings, grouped by shared root (F1 div-by-zero, F2 NaN,
F3 NEP50 promotion, F4 unsupported throws, F5 complex algorithms, F6 bool, F7 size-1 shape,
F8 summation precision, F9 representation). Each fix flips its classifier branch and re-arms
the gate; F5/F8/F9 are implement-vs-document judgment calls.
2 Coverage breadth — 2A finish #2 sections C+E (out=/where=/aliasing/overlap/mask), 2B op
tiers T7-T15 (manipulation, matmul/dot, bitwise, nan-aware, cumulative, stat, logic,
sorting, multi-output) = the ~75 untested transformation ops.
3 NpyIter behavioral parity (#3 / Plan B) — port test_nditer.py; de-risks Phase 5.
4 Depth — params (order=/dtype=/ddof/axis), SIMD-tail & large shapes, error parity (#4),
unmanaged lifecycle (#5), metamorphic invariants (#7).
5 Performance — the >=1.5x-NumPy mission (#6): the DirectILKernelGenerator -> ILKernelGenerator
(NpyIter-driven) migration in CLAUDE.md priority order, benchmark ledger, perf CI gates;
the differential matrices keep the kernel rewrite from breaking parity.
Includes a dependency graph, value-weighted recommended order, and an effort-shape table. Detail
for phases 2A/3 lives in FUZZ_PLAN_NEXT.md; findings in FUZZ_FINDINGS.md.
… (Phase 1 F1) Ports NumPy's division/modulo C kernels into a new NpyDivision helper and routes both IL emission paths through it, fixing the floor_divide/mod parity defects the differential fuzzer surfaced (FUZZ_FINDINGS #2, #3, #4 — all now bit-exact). Root cause ---------- The old EmitFloorDivideOperation/EmitModOperation emitted inline IL that: * threw DivideByZeroException for unsigned integer ÷0 (and returned a saturated sentinel / the raw dividend for signed int ÷0 and mod-0); * force-converted float //0 to NaN via EmitFloorWithInfToNaN, losing NumPy's ±inf (sign-of-dividend) result and mis-handling inf//x, x//inf, overflow; * computed signed-integer floor division by round-tripping through double, losing precision for large int64 and skipping NumPy's snap-to-nearest. Fix --- New src/NumSharp.Core/Utilities/NpyDivision.cs replicates NumPy exactly: * Integer floor_div_@TYPE@ (loops_arithmetic.dispatch.c.src): ÷0 -> 0, signed floor toward -inf, MIN/-1 wraps to MIN (overflow), sub-int types compute in the int domain so the #DE trap can't fire. * Integer remainder (loops_modulo.dispatch.c.src): ÷0 -> 0, floored (Python) sign convention, MIN % -1 == 0. * Float npy_floor_divide / npy_remainder (CPython divmod port in npy_math_internal.h.src): fmod -> sign-fixup -> snap-to-nearest; b==0 returns a/b (±inf / nan), never a forced NaN. Computed in float32 for Single to stay bit-exact. Both the same-type generic path (DirectILKernelGenerator.Binary.cs EmitFloorDivideOperation<T>) and the resultType path (DirectILKernelGenerator.cs EmitFloorDivideOperation/EmitModOperation) now emit a single Call to the matching helper, mirroring the existing NpyIntegerPower pattern. Decimal/Half/Complex are unaffected (handled by their own emit paths before this point). Verification ------------ * 26 hand-picked edge cases (0.7//0.1==6.0, -2.0//inf==-1.0, inf//2.0==nan, 1e308//1e-300==inf, MIN//-1, MIN%-1, signed/unsigned ÷0) all bit-exact. * mod(f32,f64) now promotes to float64 and matches NumPy bit-exact (#4). * binary_divmod_power.jsonl promoted from [OpenBugs] to [FuzzMatrix]: floor_divide and mod are now CI-gated bit-exact; the only remaining excused divergence is complex power (~ULP/edge), documented for Phase 1 F5. * net10.0 FuzzMatrix gate: 16/16 green. Clears FUZZ_FINDINGS #2, #3, #4 (the reciprocal-int part of F1 tracks with the unary batch). Complex power (#5) reclassified BUG -> F5 (algorithmic, deferred).
…se 1 F2) The scalar comparison kernel emitted 'a <= b' as '!(a > b)' using the ordered Cgt opcode (Clt for '>='). For a NaN operand the ordered compare yields false, so the negation produced **true** — but IEEE/NumPy make every ordered comparison with NaN false (only '!=' is true). FUZZ_FINDINGS #6, surfaced by the comparison matrix (122 cases, all on the NaN element). Fix: for float operands use the UNORDERED compare (Cgt_Un / Clt_Un). With NaN these yield true, so the negation is false — correct. The unordered opcodes were already used for unsigned-integer comparison, so the change is just isUnsigned -> (isUnsigned || isFloat); signed integers keep the ordered Cgt/Clt (no NaN domain). '<' and '>' were already correct (Clt/Cgt already yield false for NaN), as were '==' / '!='. The SIMD path (Vector.LessThanOrEqual/GreaterThanOrEqual) was already IEEE-correct, so only EmitComparisonOperation (scalar/strided/tail) needed the change. Verified bit-exact on scalar (1-D), SIMD (NaN at lanes 0 and 17 of a 32-wide array), strided (every-2nd view), and float32 paths. The MisalignedRegistry NaN-ordering branch is removed so the comparison matrix now verifies this bit-exact; net10.0 FuzzMatrix gate 17/17 green. Clears FUZZ_FINDINGS #6.
…ase 1 F4) np.negative(NDArray) called the legacy Regen-templated NDArray.negative(), a per-dtype switch that threw NotSupportedException for every unsigned dtype (Byte, UInt16, UInt32, UInt64, Char) and read @out.Address (so non-contiguous operands also threw 'Can't return a memory address when sliced/broadcasted'). NumPy negates unsigned integers by two's-complement wrap: np.negative(uint8(1)) == 255. FUZZ_FINDINGS #8. Fix: route np.negative through nd.TensorEngine.Negate(nd) — the same engine path the unary '-' operator (NDArray.Primitive.cs) and nd.negate() already use. Its IL kernel emits the two's-complement (~v + 1) for unsigned types and drives strided / non-contiguous operands through NpyIter. The legacy hand-written nd.negative() is left untouched (still the public method) but is no longer on the np.negative path. Verified bit-exact vs NumPy on contiguous and strided uint8/uint16/uint32 (e.g. [0,1,2,255] -> [0,255,254,1]) and unchanged for signed/float. The MisalignedRegistry 'negative on unsigned throws' branch is removed so the unary matrix verifies it bit-exact. Clears FUZZ_FINDINGS #8 (reciprocal non-contiguous, the sibling Address bug in #9, tracked separately).
…cs (Phase 1 F3a) NumSharp promoted every integer input to float64 for the float-producing unary ufuncs (sqrt/cbrt/exp/log/trig/...), regardless of input width. NumPy NEP50 uses WIDTH-BASED promotion: bool/int8/uint8 -> float16, int16/uint16 -> float32, int32/uint32/int64/uint64 -> float64 (float/complex preserved). FUZZ_FINDINGS #7. Fix: new DefaultEngine.ResolveUnaryFloatReturnType implements the width-based rule; the 20 transcendental Default.<Op>.cs files (ACos/ASin/ATan/Cbrt/Cos/Cosh/Deg2Rad/ Exp/Exp2/Expm1/Log/Log10/Log1p/Log2/Rad2Deg/Sin/Sinh/Sqrt/Tan/Tanh) now call it instead of ResolveUnaryReturnType (which widened to float64 via GetComputingType). The int->Half / int->Single unary kernels already existed, so this is a pure result-dtype routing change; int32/int64 are unaffected (already float64). The dtype-PRESERVING ufuncs (square/floor/ceil/trunc/round/reciprocal) are intentionally left on the old resolver pending F3b (they need integer identity / x*x / int-reciprocal kernels to preserve the integer dtype). Verification: * 364 of 494 unary dtype divergences clear bit-exact; the remaining 130 are the F3b-pending preserve-dtype ops, now scoped in MisalignedRegistry so a transcendental promotion regression fails the gate (not silently excused). * Half/Single transcendental values stay within 2 ULP of NumPy's float16/float32 libm (documented algorithm difference, same class as complex-divide ULP). * DtypeCoverageTests.Sqrt_IntegerDtypes updated: it asserted the old uniform float64 (codified the bug); now asserts the NumPy width-based dtype per input. * Full net10.0 suite green: 9422 passed / 0 failed; FuzzMatrix 17/17. Clears the transcendental half of FUZZ_FINDINGS #7 (and #15 dtype for those ops).
…Phase 1 F2-reductions) NumPy's (non-nan*) min/max PROPAGATE a NaN operand; NumSharp's SIMD reduction used hardware Vector.Min/Max (MINPS/MAXPS), which drop NaN and return the other lane, so np.min([nan,-inf,1]) gave -inf instead of nan. sum/mean/std/var already propagated (NaN arithmetic). FUZZ_FINDINGS #11. Fix: emit/compute the NaN-propagating form for float/double Min/Max — ConditionalSelect(Equals(acc,acc) & Equals(v,v), MinMax(acc,v), acc + v) — hardware min/max where both lanes are ordered (non-NaN), and acc+v (= NaN) where either is NaN (the self-equality mask is the standard SIMD NaN test). Applied at every combine site of the FLAT reduction: * EmitVectorBinaryReductionOp — unrolled body, pairwise tree merge, remainder (DirectILKernelGenerator.Reduction.cs); * EmitVectorReductionOp — the GetLower/GetUpper horizontal tree-reduce fold; * CombineVectors256/128 — the C# axis-SIMD/gather helpers (DirectILKernelGenerator.Reduction.Axis.Simd.cs); the typeof(T) guard is a JIT-time constant, so integer Min/Max keep the single-instruction path. The scalar tail / lane combine already used Math.Min/Max (which propagate). Verified: flat np.min/np.max now propagate NaN for every array size 3..257 and every NaN position, double and float32; integer min/max unchanged; full net10.0 suite green (9422 passed / 0 failed). Scope: the axis (vertical/strided) SIMD min/max in CreateAxisReductionKernelTyped has a further combine site that still drops NaN (e.g. np.min(a2d, axis=0)); the classifier now excuses ONLY axis!=null NaN-propagation, so the flat path is gate-verified and a flat regression fails. Partially clears FUZZ_FINDINGS #11 (flat); axis tracked under F2-reductions.
…Phase 1 F6 + F4) Two independent parity defects the differential fuzzer surfaced, both now bit-exact and CI-gated (their MisalignedRegistry branches removed). F6 — bool arithmetic (FUZZ_FINDINGS #17). NumPy's bool dtype has no integer add/ multiply loop: '+' is logical OR, '*' is logical AND, so True + True == True (raw byte 1). NumSharp did byte arithmetic and stored 2 in a bool slot. Fix: in ExecuteBinaryOp, when both operands are bool (resultType == Boolean) remap Add -> BitwiseOr and Multiply -> BitwiseAnd before kernel dispatch, so every SIMD/scalar and same-type/mixed path writes a normalized 0/1 byte. '-' has no bool loop and keeps throwing like NumPy. Verified scalar + 32-wide SIMD. F4 — complex np.where (FUZZ_FINDINGS #16). where(cond, x, y) threw 'NotSupportedException: Zero-push unsupported for Complex' whenever the promoted result was complex: NpyExpr.EmitPushZeroPublic (WhereNode pushes a typed zero for the unselected branch) had no Complex case. Both-complex where already worked; only the MIXED promotion where(cond, complex, real) hit it. Fix: add Complex (Complex.Zero) and Half cases. Now where([T,F,T], complex, float) == [(1+2j),(8+0j),(5+6j)] bit-exact. Harness: Shrinker_MinimalCaseReproducesDivergence used bool True+True as its fixture (now fixed, so it no longer diverges); retargeted to a synthetic int32-add case with a planted-wrong expected buffer, so it exercises the shrinker mechanism without depending on a live bug. Verified: net10.0 FuzzMatrix 17/17; full suite 9422 passed / 0 failed. Clears FUZZ_FINDINGS #16 and #17.
…ank operand (Phase 1 F7) Shape.Broadcast's size-1 fast path treated a 1-D [1] (NDim==1, size==1) identically to a 0-D scalar and broadcast it to the OTHER operand's dimensions. When that other operand had fewer dimensions, [1] adopted the lower rank — so [1] + 0-D scalar gave shape [] instead of [1] (a dropped dimension). The path was asymmetric: scalar + [1] was already correct. FUZZ_FINDINGS #18 (found by the random fuzzer). Fix: guard each size-1 collapse on the other operand having at least equal rank — left collapses only when rightShape.NDim >= leftShape.NDim, and symmetrically for the right — so the result keeps ndim == max(ndims), matching NumPy. The [1] -> higher-rank broadcasts (e.g. [1] + [2,3] -> [2,3]) are unchanged (guard is satisfied there). Verified: [1]+scalar -> [1], scalar+[1] -> [1], [1]+[1] -> [1], [2]+scalar -> [2], [1]+[2,3] -> [2,3], [3]+[1] -> [3], [2,1]+[3] -> [2,3], [[1]]+scalar -> [1,1], scalar+scalar -> []. Shape.Broadcast is core (used by every broadcasting op); full net10.0 suite green (9422 passed / 0 failed), FuzzMatrix 17/17. Clears FUZZ_FINDINGS #18.
…al-noncontig (Phase 1) Two findings turned out PARTIALLY resolved in committed code; tightened the classifier to verify the resolved sub-cases bit-exact while excusing only the precise residuals. FUZZ_FINDINGS #12 (complex axis reduction): reducing a complex 2-D+ array along an axis (sum/mean/prod/std/var/min/max) now works bit-exact (std(complex2d, axis=0) -> real float64, verified without the uncommitted NpyIter.cs). Only a 1-D complex array reduced along its only axis still throws 'NDCoordinatesAxisIncrementor with a vector shape'. The classifier now excuses just that 1-D Threw case; the 2-D matrix is gate-verified. FUZZ_FINDINGS #9 (reciprocal non-contiguous): reciprocal on a non-contiguous FLOAT operand (transposed/strided) now works. reciprocal(int) still returns the float64 reciprocal (vs NumPy's integer ÷0 sentinel) and still throws on a non-contiguous integer operand (the int->float path needs a flat Address) — both pending F3b. The classifier scoped to integer input (value diff or non-contig throw); float non-contig is gate-verified. No production code change — corpus replay confirmed the resolutions are in committed code. net10.0 FuzzMatrix 17/17 green. Documents the exact resolved-vs-pending split for #9 and #12.
…l matrix (Phase 2B)
Built the T8 linear-algebra differential corpus (matmul/dot/outer, 408 cases across the
gufunc shape space x 6 dtypes x C/F layouts). It immediately surfaced three real bugs —
all now fixed and bit-exact CI-gated:
1. np.matmul only handled 2-D x 2-D. Every other NumPy-supported shape was broken:
matmul(1d,1d) -> Debug.Assert crash; matmul(2d,1d) -> [n,1] (extra dim);
matmul(1d,2d) -> throws 'core dimension mismatch'; batched (b,n,k)@(b,k,m) -> throws
'cannot be broadcast'.
Rewrote Matmul to the NumPy gufunc (n?,k),(k,m?)->(n?,m?): promote 1-D operands
(lhs->[1,k], rhs->[k,1]), run a batched matmul that broadcasts ONLY the stack dims
(the old path broadcast the full shapes incl. the matrix dims, hence the throw) and
does a 2-D MultiplyMatrix per batch element, then squeeze the inserted dims. Verified
bit-exact vs NumPy for 2-D, 1-D@1-D, 2-D@1-D, 1-D@2-D, 3-D/4-D batched, broadcast batch.
2. np.squeeze(arr, axis) over-collapsed: after removing the named axis it also collapsed a
remaining length-1 dim to 0-D (squeeze([1,1],0) -> [] instead of [1]), diverging from
NumPy and breaking matmul's 1-D-promotion squeeze. Fixed squeeze_fast to drop ONLY the
named axis (collapse to scalar only when no axis remains).
3. np.dot(N-D, 1-D) returned the wrong dtype: it did np.sum(a*b, axis=1), which used the
reduction accumulator (int32 -> int64, uint8 -> uint64) and a hard-coded axis=1. NumPy's
dot preserves the input dtype and contracts the last axis. Routed the N-D-dot-1-D case
through Matmul (same contraction, correct dtype + axis).
Harness: OpRegistry gains matmul/dot/outer; gen_oracle gains a mode (gen_matmul
with C/F operand layouts via the C-contig-base + transposed-view pattern); new
matmul.jsonl corpus + [FuzzMatrix] Matmul test.
Verified: matmul.jsonl 408/408 bit-exact; net10.0 FuzzMatrix 18/18; full suite 9423/0.
Continues SubwordCopy into the cross-WIDTH family: {i16,u16,char} -> {i8,u8}
(low-byte truncate) and {i16,u16,char} -> bool (!=0) for strided views. 2-byte
sources are not VPGATHER-eligible, so TryGetIntToNarrowStridedKernel (4/8-byte
gather) declines them and the generic NarrowInt IL kernel fell to a scalar inner
loop on a non-unit inner stride. All 36 domain cells (3 src x 3 dst x 4 strided
layouts) now win, clean best-of-7:
i16|strided|i8 0.76 -> 1.50 char|strided|i8 0.79 -> 1.73
i16|strided|bool 0.65 -> 2.10 char|strided|bool 0.68 -> 2.39
i16|negcol|i8 0.89 -> 2.73 i16|negcol|bool ~ -> 4.63
sliced/negrow (ss==1 path) 1.19-5.x — NO regression vs the prior winners.
Kernel (DirectILKernelGenerator.Cast.SubwordNarrow.cs), two-stage SIMD reusing
the SubwordCopy shuffles, inner loop inlined in the odometer (per the SubwordCopy
PERF NOTE — a per-row helper costs ~6x):
* ss==2 deinterleave even 2B elems (VPACKUSDW+VPERMQ), then narrow;
* ss==-1 reverse 2B elems (VPSHUFB+VPERMQ), then narrow;
* ss==1 contiguous truncating Vector256.Narrow (32/iter — the sliced/negrow
path the generic kernel already won, kept fast so no regression);
* narrow step: int = mask low byte + VPACKUSWB (== unchecked (byte) == NumPy
wrap); bool = AndNot(CompareEqual(v,0), 1) then pack.
f16 source is excluded (float->int is a value truncate, kept on the Half
kernels); only C-NON-contiguous layouts reach here (C-contig uses
TryGetCastKernel, untouched). Routed in TryGetStridedCastKernel after
SubwordCopy, ahead of the int-narrow / to-bool resolvers.
Validation
* BIT-EXACT vs NumPy 2.4.2: 208/208 hashes (26 byte-copy + narrow pairs x 8
layouts), base arange%65521 (incl. 0 for the bool zero-branch + diverse high
bytes for the truncation) — subword_cross_npref.py / subword_cross_verify.cs.
* Standalone math proven 0-diff + benchmarked (subword_narrow_poc.cs).
* Full suite green: 9956 passed / 0 failed (net10.0, excl OpenBugs/HighMemory).
Note: scoreboard refresh deferred — this session's repeated sweeps ran in a
degraded environment (accumulated dotnet build-server/run processes inflating the
best-of-3 NS times ~7pp globally; the clean baseline committed at 9aea3ac is
93.4%). Per-cell wins above are best-of-7 (min over rounds, robust to load); the
full-matrix refresh should be re-run on a clean box.
…+ plan docs)
Reflects the Wave 16/16b/16c kernels in the project docs:
- .claude/CLAUDE.md: DirectILKernelGenerator partial count 60 -> 62; Cast & Copy
list adds .Cast.SubwordCopy.cs + .Cast.SubwordNarrow.cs; the cast coverage row
now documents the sub-word strided/negcol SIMD lane shuffles (VPACKUS
deinterleave / VPSHUFB reverse for same-type & same-size-bit copies; +low-byte/
!=0 narrow for 2B-int->{1B,bool}) and the inline-in-odometer perf rule (a per-row
helper costs ~6x). The old "2-byte/1-byte strided stay scalar" caveat is removed.
- CAST_BEAT_NUMPY_PLAN.md: new §12 (Wave 16 table + kernels + the inline gotcha +
208/208 bit-exact + the best-of-3/throttle measurement caveat + the noise
finding); §11 residual list updated (sub-word strided no longer hardware-limited).
- CAST_REMAINDER_133_PLAN.md: §0 banner marking §2 CLOSED by Wave 16 and confirming
the §6 noise hypothesis (true >=0.9 ~96%, not the sweep headline); remaining real
laggards narrowed to §3 (alloc-bound), §4 (AVX512 i64/u64->f16), §5 (memory-bound),
+ a couple genuine large-type cells.
Scoreboard refresh deferred to a clean environment (this session's repeated sweeps
ran degraded by accumulated dotnet processes; per-cell wins are best-of-7, robust).
…6 shipped) Full cast-matrix sweep on a settled environment (no stray dotnet processes): 1438/1568 (91.7%) >=0.9, 1374 (87.6%) win >=1.0, geomean 1.705. The net count vs the prior 93.4% snapshot is within the best-of-3 noise band (~140 narrowly-winning cells jitter ±0.9 run-to-run); the Wave 16/16b/16c cells now win in this run. Confirmed via clean best-of-7 (robust to load) that the sweep's in-domain LAGs are noise: char|strided|i16 1.66, char|strided|u16 1.84, char|strided|u8 1.82 (sweep read 0.77-0.88). The remaining sub-word strided LAGs are a DIFFERENT, unaddressed family: 1B->2B WIDENING (i8/u8/bool -> i16/u16/char) at 0.55-0.91 (the inverse of Wave 16c narrowing) — the natural next target.
The inverse of Wave 16c: {bool,u8,i8} -> {i16,u16,char} for strided views. 1-byte
sources aren't VPGATHER-eligible, so the generic WidenInt IL kernel fell to a scalar
inner loop on a non-unit inner stride (the i8|strided|i16 0.73 / bool|strided|i16
0.66 / u8|strided|char 0.62 cliff). All 36 domain cells (3 src x 3 dst x 4 strided
layouts) now win, clean best-of-7 / best-of-3 sweep:
i8|strided|i16 0.73 -> 2.09 bool|strided|i16 0.66 -> 4.17
u8|strided|u16 0.67 -> 2.61 bool|strided|u16 0.67 -> 7.03
i8|strided|char 0.71 -> 2.94 u8|strided|char 0.62 -> 2.96
+ negcol/sliced/negrow (ss==1/-1) 1.9-3.5x — sliced/negrow already won via the
generic kernel (1.1-1.8x); the new ss==1 path is as-fast-or-faster (no regression).
Kernel (DirectILKernelGenerator.Cast.SubwordWiden.cs), two-stage SIMD reusing the
SubwordCopy shuffles, inner loop inlined in the odometer:
* ss==2 deinterleave even bytes (VPACKUSWB+VPERMQ), then widen;
* ss==-1 reverse bytes (VPSHUFB+VPERMQ), then widen;
* ss==1 contiguous Vector256.WidenLower/Upper (32/iter);
* widen step: signed src (i8) = WidenLower/Upper on sbyte = SIGN-extend; unsigned
src (u8,bool) = on byte = ZERO-extend. The dst signedness (i16/u16/char) is
irrelevant — same 16-bit pattern (i8 -56 -> 0xFFC8 == int16 -56 / uint16 65480).
dst==Half excluded (int->float value cast, kept on the Half kernels). Routed in
TryGetStridedCastKernel after SubwordNarrow; C-contig untouched (TryGetCastKernel).
Validation
* BIT-EXACT vs NumPy 2.4.2: 280/280 hashes (35 pairs x 8 layouts), incl. i8 sign-
extend across -128..127 (subword_cross_npref.py / subword_cross_verify.cs).
* Standalone math 0-diff + benchmarked (subword_widen_poc.cs).
* Full suite green: 9977 passed / 0 failed (net10.0, excl OpenBugs/HighMemory).
Note: scoreboard refresh deferred (this session's sweeps degraded ~7pp by accumulated
dotnet processes; the 36 widening cells win even in the degraded best-of-3).
… (bcast-reduce ~700x NumPy)
The flat reduction's broadcast handling (DefaultEngine.ExecuteElementReduction) folded each
stride-0 (broadcast) axis by ITERATING its D identical copies via ExecuteAxisReduction —
cache-hot but still O(D×unique), landing at ~1.0x NumPy (the prior "50x->0.9x" fix). A
stride-0 axis repeats its data exactly, so reducing over it is CLOSED-FORM — nothing to
iterate:
sum -> *D, prod -> ^D, min/max -> identity (mean = sum/count, in the caller).
Drop every broadcast axis to a non-broadcast view of the UNIQUE data (index 0 along each
stride-0 axis), reduce THAT once via the existing fast contiguous/IL-SIMD path, then apply
the collapsed multiplicity to the scalar. O(unique), not O(D×unique).
Measured (8192-unique x 1024 broadcast = 8.4M logical, -c Release, best-of-9):
before (iterate fold) : 1.10 ms (1.03x NumPy)
after (algebraic) : 0.0016 ms (~700x NumPy) — AT the floor: the SIMD sum of the
8192-row alone is 0.00149 ms; C# dispatch overhead is 0.11 us.
No per-element kernel can approach this: a perfect contiguous SIMD reduce of the
materialized 8.4M is 2.16 ms = 1350x slower — it pays O(D×N) while the collapse is
O(unique). IL/NpyIter/IL-generation are throughput tools for that slower class and cannot
beat the algorithmic win; the inner unique-reduce already rides the IL SIMD kernel.
Correctness:
- Integer & min/max BIT-EXACT (modular */pow compose under truncation; min/max are
order-independent). Float sum/prod differ only ULP-level — the summation-order class the
codebase already documents here, well within the broadcast fuzz's 1e-5 tolerance.
- CombineWithCount applies the closed form in the accumulator dtype (IPowS/IPowU wrap like a
materialized integer reduction; DPow throws on decimal overflow, matching np.prod).
- DropBroadcastAxes returns the unique view CONTIGUOUS (an O(unique) copy only when the
broadcast wraps a strided view): the prior per-axis fold always finished on a fresh
contiguous intermediate, and reducing a strided unique view directly tripped a PRE-EXISTING
sub-word strided-prod bug (np.prod over a strided view returns 0 — collected for filing).
Verified: bcast_consistency fuzz 4689/4727 pass (the 38 fails are all pre-existing — 30
NaN-engine + 8 decimal argmax/argmin — zero on the flat sum/prod/min/max surface this
touches); 586/586 non-OpenBugs reduction unit tests pass (net8 + net10).
…with a radix line kernel Closes one of the 8 documented Missing Functions (np.sort) and replaces the pathological argsort. Sorting is structured exactly like NumPy's _new_sortlike / _new_argsortlike (item_selection.c): an NpyIter "IterAllButAxis" drive hands one 1-D line at a time to a high-performance inner sort kernel. ARCHITECTURE - Driver (AxisSort.DriveAllButAxis): NpyIterRef.AdvancedNew with opAxesNDim=ndim-1 and op_axes listing every axis EXCEPT the sort axis. Dropping the sort axis from the iterator stops it coalescing into the contiguous run (the bug that made an earlier EXTERNAL_LOOP+CORDER attempt hand the kernel count=20 into an N=5 buffer); the kernel then walks each line itself via the axis stride. Handles 1-D..N-D, any axis (incl. negative and None=flatten), and strided/transposed/F-contig inputs. - Inner kernel = native-width LSD radix (RadixSort): u32 core for 1/2/4-byte dtypes, u64 core for 8-byte; 8-bit digits, one count + one scatter per byte, double-buffered, trivial-pass skip. Stable, so argsort ties resolve in ascending index order (== NumPy kind='stable'). Caller fills keys with a monotonic unsigned transform (so unsigned ascending order == NumPy value order) and un-transforms on the way out. - Floats partition NaN to the end before radix (NumPy sorts NaN last); the float->key transform is the standard IEEE sign-flip so -inf<..<+inf orders correctly. - Half/Complex/Decimal route to BCL introsort (Span.Sort) with exact NumPy comparators (npysort_common.h): real NaN-last; complex CDOUBLE_LT lexicographic real-then-imag, any-NaN-part last. Complex sorts at all now (it has no IComparable). - argsort always returns int64 indices; the radix carries a parallel index column. WHY RADIX (POC, i9-13900K, ratios are NumPy_ms/NS_ms, >1 = NS faster) - int32 sort: radix 164-217 M/s = 0.72-0.90x NumPy, 8-9x over BCL Span.Sort. - int32 argsort: radix 87-135 M/s = 2.1-4.3x FASTER than NumPy (NumPy uses slow indirect quicksort; radix carries indices for one extra scatter stream). - Variation shoot-out @1m int32 sort: radix 197 M/s >> AVX2 vectorized quicksort 52 M/s (correct, but out-of-place partition + no index path) >> Span.Sort 22 M/s. 8-bit beats 11-bit (L1 thrash) and beats ulong-widened combined-histogram for <=4-byte. A SIMD quicksort could only edge radix on int *sort* and would forfeit the argsort win, so radix is the right max-perf, all-dtype, correct-and-stable choice. IL-generating the radix gives no gain (memory-bandwidth-bound, not dispatch-bound). CORRECTNESS (all vs NumPy 2.4.2) - 82/82 .npy round-trip parity: int32/int64/uint8/int16/float32/float64 x 1-D/2-D/3-D x every axis (incl. negative + None) x NaN/+-inf x empty/single/ties/constant. - Half/Complex/Decimal verified (incl. the explicit complex lexicographic NaN-last case). - 20 new unit tests (NpSortTests) incl. transposed-view, in-place ndarray.sort, 3-D argsort-reconstructs-sort. Sort_ApiGap retagged [OpenBugs] -> [Fixed]. - Full suite: 9977 passed, 0 failed. API - np.sort(a, axis=-1, kind=null), np.argsort(a, axis=-1), ndarray.sort(axis=-1) in-place. - ndarray.argsort<T> kept for source compat (delegates; element type from the array). Deleted the old LINQ implementation (per-element tuple + NDArray view allocation).
…esentative run) Full cast-matrix sweep reflecting Wave 16a-d: 1503/1568 (95.8%) >=0.9, 92.5% win >=1.0, geomean 1.941. The geomean rise from the pre-Wave-16 1.70 is largely the real Wave 16 contribution — ~100 sub-word strided cells moved from ~0.6-0.9 to 2-7x. All 36 1B->2B widening cells (Wave 16d) now win 1.9-5.0x; the full sub-word strided family (copy + same-size + narrow + widen) is done. Remaining <0.9 (~48-65 depending on run): AVX512-gated i64/u64->f16, memory-bound i64/u64->narrow strided, alloc-bound 1M same-type contig, + a few genuine large-type cells and best-of-3 jitter. Measurement caveat: the sweep headline oscillates run-to-run (this session saw 87-97% across back-to-back runs) — degraded by accumulated dotnet build-server/run processes inflating either the NumSharp or NumPy subprocess. This run (geomean ~1.94) is representative; per-cell Wave 16 wins are best-of-7 + 280/280 bit-exact, robust to load.
- .claude/CLAUDE.md: partial count 62 -> 63; Cast & Copy list adds .Cast.SubwordWiden.cs; coverage row notes 1B-int->2B via deinterleave/reverse + sign/zero Vector256.Widen. - CAST_BEAT_NUMPY_PLAN.md §12: Wave 16d row (1B->2B widen 0.55-0.92 -> 1.9-7.0x) + SubwordWiden kernel in the kernels list. - CAST_REMAINDER_133_PLAN.md §0: banner now covers the WHOLE sub-word family (copy + narrow + widen, ~100 cells, 280/280 bit-exact).
…VX2 mask overflow
np.any / np.all on the two non-JIT-vectorizable dtypes (bool, char) fell through to
the scalar per-element loop in AnySimdHelper/AllSimdHelper because
Vector256<bool>/<char>.IsSupported == false. For the benchmarked any(all-false)
case at scale this was the 0.08-0.18x cliff (5-12x SLOWER than NumPy): a full
scalar scan of every element, no SIMD. bool/char are bitwise identical to
byte/ushort for a zero/non-zero test (bool: 0/1; char: UTF-16 code unit), so
reinterpret at the top of both helpers and run the existing vectorized integer
path. (Applies to the direct ReduceBool route AND the IL-emitted reduction route,
which both call these helpers.)
While fixing that, found and fixed a latent CORRECTNESS bug in AnySimdHelper: the
"all lanes zero" reference mask was `(1u << vectorCount) - 1`, which for
Vector256<byte>/<sbyte> (32 lanes) is `(1u << 32) - 1`. C# masks uint shift counts
to 5 bits, so `1u << 32 == 1u << 0 == 1`, making allZeroMask = 0 instead of
0xFFFFFFFF. An all-zero vector then hit `bits (0xFFFFFFFF) != 0` -> returned true on
the FIRST vector, so:
np.any(np.zeros(1_000_000, dtype=byte)) -> True (NumPy: False)
byte/sbyte any() was WRONG for any all-zero array >= 32 elements on AVX2 hardware.
Masked until now because the benchmark uses bool (scalar path) and arrays < 32 use
the Vector128 path where the shift doesn't overflow. Widened to
`(uint)((1UL << vectorCount) - 1)` (correct for all 4..32 lane counts). This is also
prerequisite for the bool->byte redirect, which now routes through the byte path.
Result (i9-13900K, Release, best-of-9, vs NumPy 2.4.2), np.any(bool all-false):
100K 0.18x -> 1.90x 1M 0.08x -> 1.04x 10M 0.09x -> 1.04x
NumSharp now runs at 79-95 GB/s (memory-bandwidth-bound) -- at/above NumPy; the
block-unroll rewrite considered next was unnecessary once at the memory ceiling.
Verified: 33 correctness checks (bool/byte/sbyte/char/int x all-false/all-true/
early-hit/all-zero x sizes 1M/40/10 hitting V256 / V256+tail / scalar paths) pass,
incl. byte all-zero any=False at 1M; 19 NpAnyTest/np_all_Test pass on net8+net10;
broader reduction sweep green except 3 pre-existing [OpenBugs] (convolve int64,
var/std ddof>n) in unrelated kernels.
…p the broadcast-reduce copy
Follow-up to the broadcast-reduce algebraic collapse. That commit added a .copy() in
DropBroadcastAxes to dodge a "np.prod over a strided view returns 0" bug — a symptom patch.
This removes the copy and fixes the bug at its root so the strided unique view reduces
through NpyIter directly (NpyIter handles non-contiguous via coalesce + stride-ordered
axis-permute, and is the always-faster path — no materialization).
ROOT CAUSE — NpyIter.Execution.DetermineAccumulatorType had a hand-rolled integer-widening
switch that had DRIFTED from the canonical NPTypeCode.GetAccumulatingType (the map the
direct/contiguous reduction path uses):
- SByte and Char were MISSING -> fell through to `_ => src`, so a strided sub-word
sum/prod/cumsum/cumprod accumulated in 1-/2-byte lanes and OVERFLOWED. np.prod over a
strided sbyte/char view returned garbage (e.g. 47 for a true 3375) or 0.
- Byte was mis-widened to Int64 instead of UInt64 (latent: a UInt64 TResult was returned
through an Int64 accumulator; positive products happened to fit).
The contiguous path was unaffected because it never uses this method (it passes the
GetAccumulatingType result down explicitly). The bug was dormant until the algebraic collapse
started reducing strided unique views directly.
FIX:
- DetermineAccumulatorType now delegates the Sum/Prod/CumSum/CumProd widening to
src.GetAccumulatingType() — single source of truth, can never drift again. (Mean/Var/Std
-> Double and the pass-through for floats/decimal/complex are unchanged; GetAccumulatingType
preserves Half/Single/Double/Decimal/Complex, so it is an exact drop-in.)
- DropBroadcastAxes drops the .copy(): returns the (possibly strided) unique view as-is; the
flat reduction routes non-contiguous inputs through NpyIter, now correct.
This also fixes strided sub-word reductions OUTSIDE the broadcast path (any np.sum/np.prod/
np.cumsum/np.cumprod over a strided/transposed sbyte/char view) — the fuzz improved 38 -> 37
fails (the remaining 37 are all pre-existing: 30 NaN-engine + 7 decimal argmax/argmin).
Verified (-c Release):
- strided-prod pinpoint: sbyte/char/int32/int64/double x {1-D strided, transposed, sliced} all
match the contiguous reference (was sbyte 47/0, char garbage).
- bcast-reduce canary still 0.0021 ms (~534x NumPy), bit-exact, NO copy.
- broadcast fuzz 4690/4727 (zero failures on the sum/prod/min/max surface).
- 2522 unit tests green across reduction/cast/strided/cumulative/scan/statistics/iterator
(net10, excl OpenBugs).
…e 17, bucket A)
Closes the last large genuine cast laggard: i64->f16 and u64->f16 across all 8
layouts were the only AVX512-gated family left (was 0.68-0.89x NumPy on best-of-3,
falling to the scalar generic emitter). There is NO AVX2 i64->f32 (cvtqq2ps is
AVX512DQ, and this box is AVX2-only: Vector512 not accelerated, Avx512F/DQ/BW all
false), so synthesize the conversion exploiting f16's tiny finite range.
KEY INSIGHT: f16 saturates to +-inf at 65520, so EVERY finite i64/u64->f16 result
has |v| < 65520 << 2^17 -- which fits in i32 *exactly* (well under the 2^24 f32
mantissa limit, so the i32->f32 step never rounds, single RTNE at f32->f16). Any
|v| >= 65520 is inf regardless of the low bits.
KERNEL (DirectILKernelGenerator.Cast.ToHalf.cs):
- Int64x8ToHalfBits / UInt64x8ToHalfBits: clamp out-of-range lanes to a +-70000
sentinel (VPCMPGTQ + byte-granular VPBLENDVB; u64 uses the sign-bias unsigned
compare), so the low-32-bit narrow is exact and the sentinel maps to +-inf.
Pack 8 lanes' low words via PermuteVar8x32(RtoSel)+GetLower, then the exact
cvtdq2ps and the already-proven Giesen FloatToHalfBits.
- BulkInt64ToHalf / BulkUInt64ToHalf: 16-at-a-time contig bulk + scalar tail.
- Int64ToHalfBits / UInt64ToHalfBits scalar ports (tail + strided conv); finite
|v| < 2^24 so (float)v is exact, then SingleToHalfBits.
- Contig (CastInt64/UInt64ToHalfContig) + strided (StridedInt64/UInt64ToHalf via
the existing StridedNarrowDriver, srcSize=8) wired into TryGetXToHalfKernel /
TryGetXToHalfStridedKernel (Int64/UInt64 switch cases). No Cast.cs routing
change -- the X->Half resolvers were already in the chain.
VALIDATION (bit-exact vs NumPy 2.4.2):
- 190,043 i64 values (every int in the full +-70010 boundary region + 2^16..2^63
magnitudes + 50K randoms) and 120,024 u64 values: SIMD-diff=0, scalar-diff=0.
- All 8 layouts x {i64,u64} = 16 SHA256 hashes through the real astype path,
with boundary-spanning values (-inf/finite/+inf/0): 16/16 bit-exact.
- Full suite 9977/9977 on net10.0. (The lone net8.0 failure is a PRE-EXISTING
log1p/complex128 1-ULP/NaN-payload libm difference, confirmed by stash-test;
unrelated to this cast.)
PERF (best-of-7, canonical (arange%17)+1 base, 1M, NPY/NS):
i64->f16 C 1.95 F 1.99 T 1.95 sliced 1.46 negrow 1.90 negcol 1.34 strided 1.20 bcast 1.95
u64->f16 C 1.66 F 2.53 T 2.68 sliced 2.50 negrow 2.57 negcol 1.69 strided 1.55 bcast 2.58
geomean 1.914x (was ~0.80). Every cell now >0.9; whole bucket beats NumPy.
PoCs: benchmark/poc/i64_u64_half_poc.cs (value proof), i64u64_half_layout_{npref.py,
verify.cs} (8-layout hash proof), i64u64_half_bench.cs (best-of-7 timing).
…(Wave 17, bucket B)
Bucket B was ~30 cells the best-of-3 scoreboard flagged at 0.77-0.90 for
{f32,f64,c128} -> {i64,u64}. A clean best-of-7 re-measure (all 6 src->dst x 8
layouts, canonical (arange%17)+1 base, GC between cells) showed 25 of them were
pure best-of-3 JITTER -- now 0.9-2.5x, many >1.0 (e.g. f32/f64/c128 -> i64 all
contig/F/T/sliced/negrow >=0.95, bcast 2.3-2.5x). float->i64 has no dedicated
kernel (scalar Converts == NumPy's cvttsd2si) and is already at parity.
Only 5 cells were GENUINELY <0.9, all -> u64 on negcol/strided -- and all because
the strided u64 path fell to a gather even when the data is contiguous:
- f32|negcol|u64 0.88->0.96, f64|negcol|u64 0.86->0.93: negcol ([:, ::-1]) is a
contiguous-REVERSED row, so load contiguously + VPERMQ/VPERMD lane-reverse
instead of a -1-stride gather (Cast.FloatToUInt.cs). Reads stay inside the row.
- f64|strided|u64 0.86->0.97: ::2 deinterleaves even doubles from two contiguous
loads (Permute4x64 0x08/0xDD + Blend 0x0C). The 2nd vector is loaded at 2i+3
(not 2i+4) so the max read is 2i+6 = the last needed element L(i+3) -> provably
no over-read past the row (the gather's only safety advantage, recovered).
- c128|negcol|u64 0.80->0.96: the reals are interleaved with imags, but for negcol
the 8 surrounding doubles are contiguous (re3,im3,..,re0,im0), so two loads +
UnpackLow + VPERMQ-0x72 extract the 4 reals in logical order, beating the
-2-stride real gather (Cast.Complex.cs FusedComplexToUInt64). Over-read-safe.
- c128|strided|u64 ~0.87 (memory-bound): the ::2 reals sit every 4th double; a
contiguous-load alternative reads 2x the data, so the gather is already near
optimal. Left as-is; straddles 0.9 on jitter.
VALIDATION: {f32,f64,c128} -> u64 across all 8 layouts = 24 SHA256 hashes vs NumPy
2.4.2, with values that exercise the modular wrap (negatives), the 2^63 sentinel
(NaN/+-inf/overflow) and truncation: 24/24 bit-exact. Full suite 9977/9977 on
net10.0 (net8.0's lone log1p/complex128 failure is pre-existing, stash-confirmed).
PoCs: benchmark/poc/float_wideint_bench.cs (best-of-7 6x8 matrix),
float_u64_layout_{npref.py,verify.cs} (24-hash layout proof).
…tude mask (Wave 17, bucket E)
f16|strided|bool was a genuine 0.14x cliff (clean best-of-7: NumSharp 1.53ms vs
NumPy 0.22ms, ~7x slower) -- the scoreboard's best-of-3 0.87 was a lucky sample.
f16 is 2-byte so it is not VPGATHER-eligible; StridedHalfToBool routed through the
generic StridedNarrowDriver, which scalar-converts the ss==2 ([:, ::2]) and ss==-1
([:, ::-1]) inner rows. Wave 16's SubwordNarrow excluded f16 (it needs a magnitude
mask, not a plain low-byte narrow), so f16->bool never got the shuffle treatment.
FIX (DirectILKernelGenerator.Cast.ToBool.cs): a dedicated StridedHalfToBool that
applies the SubwordNarrow deinterleave (ss==2: mask even words -> VPACKUSDW ->
VPERMQ 0xD8) / reverse (ss==-1: VPSHUFB _revWords -> VPERMQ 0x4E) shuffles, then
NumPy's half-truthiness (bits AND 0x7fff) != 0 (so +-0.0 -> False, NaN/inf -> True).
Inner ss==1 rows (sliced/negrow/bcast) run the contiguous magnitude compare;
arbitrary strides (F/T column-gather) and the tail keep the scalar test. Inner loops
inlined in the odometer (a per-row helper costs ~6x -- SubwordCopy PERF NOTE).
RESULT (best-of-7, 1M, NPY/NS): every f16->bool layout now beats NumPy 3.9-7.3x:
C 4.00 F 3.87 T 4.02 sliced 7.25 negrow 7.01 negcol 6.49 strided 5.27 bcast 6.92
(strided 0.14 -> 5.27). NumPy's f16->bool is uniformly slow (0.30-0.57ms).
VALIDATION: 8 layouts x SHA256 vs NumPy 2.4.2 with a base carrying +-0.0 / NaN /
+-inf / subnormal / 65504 -> 8/8 bit-exact (-0.0 bits 0x8000 -> False confirmed).
Full suite 9977/9977 net10.0 (net8.0's lone log1p/complex128 failure pre-existing).
Bucket E also RE-MEASURED 6 other scoreboard singletons (f64->u8/i32/char,
u64->f32, c128->u8, u64->u8) -- all were best-of-3 jitter, clean best-of-7 is
1.08-1.87x. PoCs: benchmark/poc/f16bool_{bench.cs,layout_verify.cs,layout_npref.py},
subword_arb_bench.cs.
Scoreboard: 1508/1568 >=0.9 (96.2%, was 95.8%); the i64/u64->f16 (A) and f16->bool (E) families dropped off the <0.9 list entirely. geomean 1.89 on this best-of-3 run (clean best-of-7 across the touched buckets is 1.94) -- this sweep was mildly degraded by accumulated dotnet build-server processes (cannot be cleared without taskkill), so the float->i64/u64 cells (bucket B) read 0.84-0.89 here despite measuring 0.95-2.5x under clean back-to-back best-of-7 (documented in the plan + commit faa2754). The scoreboard is best-of-3 and noisy by construction. Remaining <0.9 (60), by class: - ~15 same-type small-dtype contig (bool/u8/i8 -> self, C/T/sliced/negrow/bcast, 0.16-0.29): NOT a kernel deficiency -- a cache-residency artifact. A warm Buffer.MemoryCopy of 1M is 0.0138ms = NumPy's 0.014ms exactly; NumSharp wins 1.9-3.5x at every cache-busting size (4M/16M/64M). NumPy only wins at 1M because CPython refcounting reuses one warm buffer per iteration while .NET GC hands back cold ones. Unwinnable without deterministic disposal; left as-is (optimal). - float/c128 -> i64/u64 (best-of-3 jitter on this degraded run; clean >=0.95). - c128->bool / c128->u64 strided (memory-bound, gather of strided reals already near-optimal -- the contiguous alternative reads 2-4x the data). Plan (CAST_REMAINDER_133_PLAN.md) gets a full Wave 17 section: bucket-by-bucket A-E with the kernel approach, clean best-of-7 numbers, bit-exact proof counts, and the cache-residency characterization for C/D so it is not re-chased.
…s+imags (Wave 17)
c128|negcol|bool was a genuine 0.66x (clean best-of-7: NumSharp 0.635ms vs NumPy
0.418ms) -- FusedComplexToBool gathered reals (idx) and imags (idx+1) with a -2-stride
double gather for the reversed inner row. As with c128->u64 negcol, the 8 doubles
around p are contiguous-reversed (re3,im3,re2,im2,re1,im1,re0,im0), so two loads +
UnpackLow/UnpackHigh + VPERMQ-0x72 deinterleave BOTH reals and imags into logical
order, then the existing OrderedEqual !=0 | test. Over-read-safe (p-6 of the last
group is the row's first physical real).
RESULT (best-of-7, 1M): c128|negcol|bool 0.66 -> 1.50x. (c128|strided|bool ~0.9-1.07
is NumPy-jitter / memory-bound -- the ::2 reals+imags sit every 4th double, so the
contiguous alternative reads 4x the data; gather stays.) The other 6 c128->bool
layouts were already 1.08-1.19x.
VALIDATION: 8 layouts x SHA256 vs NumPy 2.4.2 with a base carrying re=0&im=0 (False),
im!=0 (True), NaN/inf (True), -0.0-0.0j (False): 8/8 bit-exact. Full suite 9977/9977
net10.0 (net8.0's lone log1p/complex128 failure pre-existing).
PoCs: benchmark/poc/c128bool_{bench.cs,layout_verify.cs,layout_npref.py}.
np.sort, ndarray.sort and np.argsort were O(N^2) for every 1-D array and every axis=None call - the most common cases. Root cause: AxisSort's IterAllButAxis driver (DriveAllButAxis) drops the sort axis via op_axes, so a 1-D input yields a 0-dimensional iterator (oa_ndim=0, op_axes=[[]]). NpyIter mis-drives that degenerate shape as `size` single-element iterations instead of 1, and because each line kernel re-sorts the whole line (it walks by LineCtx.n, ignoring the per-call count) total work is N x O(N) = O(N^2). Measured before fix (i9-13900K, Release), 1-D int32: N=10K sort 514 ms argsort 768 ms N=100K sort 53,419 ms argsort 83,132 ms (~104x per 10x in N = quadratic) 1M+ effectively hung (minutes). Unit tests never caught it (5-12 element arrays, N^2 trivial) and the radix POC benched RadixSort directly, bypassing the driver. Fix: at the single choke point DriveAllButAxis, promote a 1-D operand to a (1, N) memory-sharing view (np.expand_dims aliases storage, so in-place sort still mutates the original; a leading size-1 axis leaves the data-axis stride - and thus LineCtx in/out strides - unchanged). Now exactly one axis (size 1) is kept, so the iterator makes exactly one line-kernel call. Multi-D was already correct (one call per line) and is untouched. Measured after fix: N=100K sort 0.68 ms (147 M/s) argsort 1.06 ms (94 M/s) ~78,000x faster 1-D throughput now flat across N (quadratic gone). vs NumPy 2.4.2 argsort: int32 1.47x (1M) / 2.1x (4M) FASTER; int64 ~parity; f64 0.6x (1M) -> 0.96x (4M). Correctness: NpSortTests 20/20 pass; validated 1-D sort/argsort, in-place on strided a[::2] view, axis=None, int64/f64. NOTE (separate latent bug, not fixed here): the underlying NpyIter 0-dimensional / empty-op_axes iteration emits `size` iterations instead of 1; this still affects any other 0-dim op_axes use. The sort path now works around it rather than relying on the iterator's 0-dim behavior.
…ializing a contiguous copy
Removes the copy-to-contiguous anti-pattern (materializing a strided/transposed/
sliced/F-order/broadcast view because a kernel only handled C-contiguous data)
from seven compute paths, replacing each `.copy()`/`.copy('C')` with in-place
strided iteration — the same principle applied to the reduction fold earlier on
this branch ("must use NpyIter, not copy"). Every element is now read through its
own strides in C-order (via a shared `FlatStrideOffset` odometer) or through the
existing IL strided kernel; results are bit-identical to the prior copy path and
verified against NumPy across all dtypes x layouts.
CumSum / CumProd flat (Default.Reduction.CumAdd.cs, CumMul.cs):
- Both flat scans now share ScanElementwiseFlat, which drives the IL scan kernel
directly on strided/transposed/sliced/reversed/F-order/3-D inputs. The flat
scan generator already emitted a correct C-order strided walk
(EmitScanStridedLoop); the old `cumsum_elementwise(arr.copy())` simply prevented
it from ever being reached.
- Enabled CumProd in the IL scan kernel (DirectILKernelGenerator.Scan.cs):
EmitScanCombine now emits Mul, EmitScanHelperCall maps to a new
CumProdHelperSameType. cumprod previously had NO IL kernel at all (TryGet
returned null at generation, always falling to the scalar copy path).
- Half/Complex accumulators can't be combined with il Add/Mul except same-type-
contiguous, so for non-contiguous Half/Complex (and decimal cumprod) we keep a
single C-order ravel — exactly what NumPy does — not a kernel-dodge copy.
- Base address folds Shape.offset in (sliced views retain it; simple contiguous
slices bake it into Address). NEP50 accumulator widening preserved.
Var / Std flat (Default.Reduction.Var.cs, Std.cs):
- The fallback two-pass now iterates the input in place via FlatStrideOffset
(var/std are order-independent, so any visiting order is valid — mirrors NumPy,
which reduces strided arrays without copying). Shared moment helpers
(VarMomentsReal<T> via INumberBase, VarMomentsDecimal, VarMomentsComplex) are
used by both var and std (std = sqrt). Removes the old double-materialization
(arr.copy() + Cast-to-double).
Integer Reciprocal (Default.Reciprocal.cs):
- The 1/x integer loop reads the source through its strides instead of
materializing sliced/strided/broadcast views (the comment admitted the previous
raw read "THREW for views"). Broadcast stride=0 is handled naturally.
- Pre-existing 1/0 semantics preserved verbatim (see known-bug note below).
Left / Right Shift (Default.Shift.cs):
- Array-operand shift walks both the values and the per-element shift counts
through their own strides (ExecuteShiftArrayStrided), so broadcast (stride=0),
transposed, sliced and mixed-layout operands no longer get copied. The
contiguous fast path still uses the IL scalar-loop kernel.
Clip (Default.ClipNDArray.cs, DirectILKernelGenerator.Clip.cs):
- New ClipStrided reads src and (array) bounds through their own strides in
C-order, writing a fresh C-contiguous result — eliminating the copy('C')
normalizations in the general path, PrepareBound, and the @out path. The
contiguous fast path still drives the IL SIMD kernel. One strided loop is
shared across dtypes via Max/Min function pointers.
- Clamp semantics mirror the IL kernel bit-for-bit: Math.Max/Min for sized
numerics+decimal, the (now internal) Half/Complex NaN helpers, Char as UInt16,
and NaN-propagating float/double Max/Min that resolve the signed-zero tie like
hardware MAXPS/MINPD (maximum(+0,-0) = -0) rather than Math.Max (+0). np.maximum
/minimum/fmax/fmin are implemented on top of clip, so this keeps their
signed-zero/NaN results identical to the SIMD path and to NumPy.
- Fixed a latent allocation bug: result buffers are now built from a fresh
contiguous Shape, never np.empty(lhs.Shape) which would clone a strided/F-order
layout and mis-address the linear dst writes.
Verification: per-op fuzz harnesses compare strided/transposed/sliced/reversed/
F-order/3-D/broadcast views against the contiguous ground truth and against live
NumPy across all dtypes (scan 308/308, var-std 660/660, recip+shift 220/220, clip
350/350). Full unit suite green (9987 passed, 0 failed); net8.0 + net10.0 build clean.
Known pre-existing bug surfaced (NOT changed here): NumPy integer reciprocal(0) is
0 for int8/int16/uint8/uint16/uint32 and the sign-bit sentinel only for
int32/int64/uint64 (uint64 -> 2^63); NumSharp uses MinValue for all signed and 0
for all unsigned, so reciprocal(int8/int16 0) and reciprocal(uint64 0) diverge.
This is the existing documented behavior, faithfully preserved by the strided
rewrite; flagged for a separate fix.
…/maximum signed-zero scalar tail
Two NumPy-parity bugs, both surfaced while removing the strided copy-hacks.
Integer reciprocal 1/0 (Default.Reciprocal.cs)
Re-probed NumPy 2.4.2 bit-exact and deterministic across array sizes / lane
positions: reciprocal(0) is the sign-bit sentinel 0x80..0 ONLY for the 32/64-bit
loops (int32 -> int.MinValue, int64 -> long.MinValue, uint64 -> 2^63) and 0 for
every narrower type (int8, int16, uint8, uint16, AND uint32). NumSharp returned
MinValue for ALL signed (so int8 -> -128, int16 -> -32768) and 0 for ALL unsigned
(so uint64 -> 0) — wrong for int8/int16/uint64. Fixed the three per-type 1/0
results; the other five already matched. (A prior pin had mis-probed int8 1/0 as
-128; corrected.)
reciprocal(bool): NumPy promotes bool to the int8 loop -> int8 [True->1, False->0]
(probed np.reciprocal([True,False]) -> int8 [1,0]). NumSharp took the float loop
(Double [1, inf]). The dispatcher now maps a bool loop type to int8 and casts the
input, so bool reciprocal returns int8 with the int8 1/0 = 0 rule. dtype was
already preserved for the int types (the registry's "widen to float64" note was
stale).
clip / maximum / minimum / fmax / fmin signed-zero scalar tail
(DirectILKernelGenerator.Clip.cs, Default.ClipNDArray.cs)
np.maximum etc. are built on clip. For float32/float64 the SIGNED-ZERO tie resolves
to the SECOND operand (hardware MAXPS/MINPD: maximum(+0,-0) = -0) — the OPPOSITE of
float16 (ties to the first operand) and of Math.Max (returns +0). The IL kernel's
SIMD body already matched NumPy, but its SCALAR TAIL (EmitScalarClamp) used Math.Max
and so diverged on a signed-zero tie landing in the last partial vector. Added
NaN-propagating, tie-to-second FloatMaxNaN/FloatMinNaN/DoubleMaxNaN/DoubleMinNaN
helpers and wired EmitScalarClamp to them for Single/Double; the engine's ClipStrided
now shares the same helpers (dropping its local copies), so the contiguous SIMD body,
contiguous scalar tail, and strided path all agree bit-for-bit with NumPy. The Half
helper (tie-to-first) is correct for float16 and left unchanged.
Tests
- MisalignedRegistry: removed the two now-obsolete reciprocal excuses (the int->float64
dtype excuse and the int ÷0 / non-contig-throw value excuse); the unary fuzz corpus
now verifies reciprocal(int/bool) bit-exact.
- NewDtypesCoverageSweep: corrected B36_SByte (int8 1/0 -> 0, not MinValue) and added
B36_Int16 (1/0 -> 0), B36_UInt64 (1/0 -> 2^63), B36_Bool (-> int8 [1,0]) pinning the
per-type rule.
Verified: contiguous maximum/minimum with -0/+0 across SIMD body AND scalar tail match
NumPy bit-for-bit (f32 and f64); reciprocal(all 8 int types + bool) matches NumPy;
strided reciprocal still equals the contiguous result. Full unit suite green
(9990 passed, 0 failed); net8.0 + net10.0 build clean.
…O(N^2)) CalculateBroadcastShape gated the op_axes branch on `opAxesNDim > 0`, so an explicit 0-dimensional iteration (opAxesNDim == 0, op_axes = [[]]) fell through to natural broadcasting and returned the operand's own shape [N]. That made IterSize = N, so a kernel driven without EXTERNAL_LOOP ran N times instead of once. NumPy's NpyIter_RemoveAxis leaves a 0-dim iterator with GetIterSize() == 1 (one iteration, data pointer = operand base; any dropped axis is the caller's responsibility via its saved stride). This is the ROOT of the 1-D / axis=None sort & argsort O(N^2) addressed at the symptom level in b4d2128: AxisSort's IterAllButAxis drops the only axis of a 1-D operand -> opAxesNDim == 0 -> the line kernel (which re-sorts the whole line per call, ignoring the per-call count) was invoked N times. Fix: treat opAxesNDim == 0 as an explicit 0-dim iteration and return an empty iteration shape (IterSize == 1). The `> 0` -> `>= 0` gate plus an early `return Array.Empty<long>()` leaves the opAxesNDim > 0 and the unspecified (-1, natural-broadcast) paths untouched. TDD (NpyIterZeroDimOpAxesTests, 10 tests, direct NpyIterRef.AdvancedNew): - RED against the unfixed iterator: 7 fail / 3 pass, every failure the exact symptom (NDim 1 not 0; Calls/IterSize = 100000/1000/500/256/10/20 not 1). - GREEN after fix: 10/10. Functional cases (single-call whole-line walk, in-place fill, two-operand lockstep copy, sliced-offset view) confirm the NDim=0 base pointer is offset-correct and read/write-valid; 3 regression guards (no-op_axes 1-D -> N calls; drop-axis-1 -> per-row; drop-axis-0 -> per-col) hold before and after. Regression: 42/42 (NpSortTests 20 + NpyIterZeroDimOpAxes 10 + AxisStride 12). Note: AxisSort's (1,N) promotion workaround from b4d2128 is now redundant (the sort path could rely on this root fix directly) but left in place; it is correct and independent.
…iterals (NEP50)
`a * 2`, `a + 5`, `a / 3.0` route through operator(NDArray, object) ->
np.asanyarray(literal), so a Python int literal is wrapped as an int32 0-d scalar
(a float literal as float64). For `f32[] * 2` the operands therefore arrive as
f32 x int32. resultType already reflects NumPy's weak promotion (f32), but the
fast-path gates key off the RAW operand dtypes:
- TryTrivialContiguousBinaryOp: `if (lhsType != rhsType ...) return null;` bails
BEFORE the scalar-broadcast arm, and
- TryExecuteBinaryOpViaNpyIter: sameDtype=false -> per-element-convert mixed loop
(vector body null).
So the op fell to the heavy mixed path instead of the SimdScalarRight/Left
scalar-broadcast loop it should take. Measured (100K f32, Release):
a * 2 (int32 scalar) 0.080 ms vs a * 2f (f32 scalar) 0.035 ms -- 2.25x slower
for IDENTICAL work and an identical f32 result. This hit EVERY `array <op>
python-literal` and dominated the 100K-tier elementwise losses in the full benchmark
(`a * 2 (literal) f32 @100K` 0.067x was among the worst non-degenerate op-matrix cells).
Root cause is path-check ORDERING: the same-dtype gate is evaluated first, against
un-promoted operand dtypes, so a case that is effectively same-dtype after NumPy's
weak-scalar promotion never reaches the scalar-broadcast fast path.
Fix (ExecuteBinaryOp, before the trivial bypass): when exactly one operand is a
0-d/size-1 scalar, the OTHER (array) operand already IS resultType, and only the
scalar's dtype differs, materialize the scalar at resultType (one element -- exactly
what NumPy does, casting the scalar to the loop dtype). Both operands then share
resultType and the same-dtype SIMD paths (contiguous SimdScalarRight/Left and the
same-dtype NpyIter SIMD loop, incl. strided) light up. When the ARRAY operand also
differs (e.g. int32[] * 2.5 -> float64) the guard skips: the array genuinely needs a
cast, which is the mixed path's job.
Value-identical: the mixed path converted the same scalar to resultType per element;
this converts it once up front. Verified across f32/f64/i16/i32/i64/u8/i8 (contig +
strided): (a * 2) == (a * same-type-scalar 2), every dtype eq=True; i32 * 2.5 still
-> float64 = 2.5 (guard correctly skips). Post-fix: a*2 ~= a*2f ~= a*s32 (gap closed).
778 binary/arithmetic/promotion/scalar/broadcast + 2345 math/reduction/statistics/
ufunc/comparison/evaluate tests pass on net8.0 + net10.0.
Settled while investigating (the suspected second ordering bug): DivergesFromNumpyCast
is a dead stub (returns false), so the cast *->bool slowness is NOT a gating-order bug
-- it is fixed overhead on a 1-byte output, like the same-type copy diagonal.
These 14 documents were working notes for work that has since shipped and is covered by durable artifacts (tests, the benchmark scoreboards, and the website docs). Removed as dead weight: Audit working notes (findings already fixed or pinned as [OpenBugs] under test/NumSharp.UnitTest/AuditV2/AuditV2_*.cs, which remain): docs/plans/NDITER_BRANCH_QUALITY_AUDIT_V2.md docs/plans/audit_v2/01_iterators.md docs/plans/audit_v2/02_ilkernel_simd.md docs/plans/audit_v2/03_default_math_reductions.md docs/plans/audit_v2/04_logic_shape_storage.md docs/plans/audit_v2/05_ndarray_creation.md docs/plans/audit_v2/06_manipulation_apis_logic.md docs/plans/audit_v2/07_math_ops_selection_sorting_stats.md docs/plans/audit_v2/08_casting_random_utilities.md Implementation plans for shipped features: docs/OUT_WHERE_NPYITER_FAMILIES_PLAN.md (out=/where= families — shipped) docs/plans/CAST_BEAT_NUMPY_PLAN.md (cast matrix — shipped) docs/plans/CAST_REMAINDER_133_PLAN.md (cast continuation — shipped) POC scratch (the real implementation landed in NpyIter): benchmark/poc/NPYITER_REDUCTION_PLAN.md benchmark/poc/POC_RESULTS.md Build/tests unaffected — every inbound reference is a prose/comment citation, not a code dependency. Four stale citations remain to be scrubbed separately: src/.../Math/DefaultEngine.BitwiseOp.cs:13 -> OUT_WHERE_NPYITER_FAMILIES_PLAN.md src/.../Unmanaged/UnmanagedStorage.Cloning.cs:426 -> CAST_BEAT_NUMPY_PLAN.md docs/NPYITER_GAPS_AND_ROADMAP.md:50 -> benchmark/poc/POC_RESULTS.md docs/PR611_CHANGELOG.md:210 -> docs/plans/audit_v2/
…32b0 wave) The branch advanced ~143 commits past the last changelog refresh. Updated the doc to represent the current/final state, not the prior snapshot: Stats: 312 -> 455 commits, 615 -> 806 files, +217,949/-16,402 -> +234,348/-19,179. Tests: 9,709 -> 9,990 passed / 0 failed (net8.0 + net10.0). Flipped now-false claims: - np.sort: was 'remains unimplemented' -> implemented (radix line-kernel on NpyIter, stable, NaN-last, all axes/orders); np.argsort reimplemented. Missing-functions list narrowed to flip family / diag / gradient / round. - NDIterator: was 'survives as a thin lazy wrapper' -> deleted entirely (interface + class + AsIterator); all iteration via NpyIter/NpyFlatIterator/ GetAtIndex. Breaking-changes row + TL;DR updated. - Frontier 'open losses' -> fixed: broadcast-reduce (was 54x slow) now ~534-700x NumPy via algebraic stride-0 fold; scalar any/all (bool/char) on the integer SIMD path; np.zeros calloc-backed (was ~1000x slow). Added current-state coverage: - §5: np.sort row, np.broadcast N-operand parity, six Complex transcendentals (sinh..arctan). New-API count 30/35 -> 36 (consistent across TL;DR + §5). - §7: SIMD strided-cast campaign (astype 15x8x15: 716->~391 lagging, 852->1,177 wins), np.zeros calloc, broadcast-reduce fold, bit-exact pairwise sum, any/all SIMD, Complex/Half/Decimal axis reductions, f16 negate. - §10: integer reciprocal 1/0 per-width, clip/maximum signed-zero + NaN, f16 axis-sum float32 accum, exp2/power-Half/complex-narrowing crashes, complex nansum uninit read, AVX2 any() mask overflow, net8.0 complex abs. Also fixed the dangling docs/plans/audit_v2/ reference (those docs were removed last commit) -> points at test/.../AuditV2/AuditV2_*.cs instead. Live PR #611 description updated to match via REST PATCH.
…+ scrub dangling plan-doc refs
The Direct/ kernels carried 24 methods marked [Obsolete("Unused...", error: true)]
as 'pending removal' tombstones. error:true makes any reference a compile error,
so they were provably unreachable. Removed outright across 12 partials:
Argwhere(1) Binary(1) Cast(1) Comparison(1) Masking.NaN(4) MatMul(1)
MixedType(1) Reduction.Axis.Simd(8 + orphaned header comment) Reduction.NaN(1)
Scan(3) Search(1) Unary(1)
The Reduction.Axis.Simd block (Add/Mul/Min/Max 256/128) was superseded by the
type-specialized AddOp*/MulOp* structs (435a07e); the rest by their non-throwing
Get*/TryGet* twins. Net -683 lines, zero behavior change.
Also scrubbed 3 comments/prose that still cited the plan docs deleted earlier in
this branch (kept each substantive note, dropped the dead file path):
- DefaultEngine.BitwiseOp.cs (OUT_WHERE_NPYITER_FAMILIES_PLAN.md)
- UnmanagedStorage.Cloning.cs (CAST_BEAT_NUMPY_PLAN.md)
- docs/NPYITER_GAPS_AND_ROADMAP.md (benchmark/poc/POC_RESULTS.md)
PR611 changelog §14 updated: 'pending deletion' -> removed.
Verified: NumSharp.Core builds clean (0 errors); full suite 9990 passed / 0 failed
/ 11 skipped on net10.0 (CI filter). No dangling refs to deleted docs remain.
Dead-code sweep (Roslyn IDE analyzers + whole-repo reference-frequency probe): each method's name occurs exactly once across all .cs in src/test/benchmark/ examples/tools — i.e. declared and never referenced anywhere, so unreachable. All private/protected, so scope confirms no external caller. Removed: CreateLogicalResultShape Default.LogicalReduction.cs ExpandStartDim, ExpandEndDim Default.Dot.cs ExecuteShiftOpScalar Default.Shift.cs BulkHalfToBoolV, ConvHalfToBool DirectILKernelGenerator.Cast.ToBool.cs FindFirstNaN_Double/_Float NDArray.unique.Kwargs.cs TryUniformConstant np.pad.cs uniqueComplex (protected) NDArray.unique.cs Checked for cascade: BulkHalfToBool (non-V) stays — still used by CastHalfToBoolContig; the removed BulkHalfToBoolV was the unused strided-driver adapter. No other symbol orphaned. Verified: NumSharp.Core builds clean; full suite 9990 passed / 0 failed / 11 skipped on net10.0 (CI filter). Zero references to the 10 names remain.
…bsystems) Full run of benchmark/run_benchmark.py (~3h wall, numpy 2.4.2, Release): the BenchmarkDotNet op/dtype/N matrix across all 14 comparison suites, the NpyIter iterator benchmark (+ README cards), and the layout/operand/cast/fusion matrix subsystems. The committed report files were stale (predated several perf fixes already in HEAD); this refreshes them. layout_results.md/.tsv are generated for the FIRST time (the Layout subsystem had never been run on this branch). Recent fixes now visible in the numbers: - bcast_reduce canary: 0.02x (51.8x SLOWER) -> 538.56x faster [cb62e37] - any(F) reduction: ~0.1x -> 8.89x/8.41x small-N, ~1x at scale, geomean 2.74x (+ byte/sbyte AVX2 all-zero-mask fix) [c79ed6f] - np.evaluate F/T: fusion F 0.11x->1.85x, T 0.12x->1.74x [16afbac] - forder_out canary: 0.46x -> 1.28x Op-matrix summary moved the right way (fewer bad cells): before 1851 ops | OK 797 close 289 slow 219 red 81 negl 389 n/a 76 after 1851 ops | OK 792 close 357 slow 177 red 72 negl 384 n/a 69 NpyIter headline 1.17x -> 1.18x geomean; iterator construction 2.88x -> 3.33x. Known weak spots persist (NOT new regressions - pre-existing, predate this run): - left_shift/right_shift @100k ~0.05x (scalar-loop shift path, not vectorized) - np.sum(f64) @100k 0.074x (np.sum() API dispatch overhead; the isolated iterator reduction still wins 2.79x) - copy/cast + index-math at small N, float16 scalar path (no Vector<Half> in BCL) Generated against the current working tree (branch HEAD + local engine WIP); result files only - no source changed.
📊 Performance snapshot vs NumPy 2.4.2Full Headline: op-matrix geomean 1.18×, scaling to 1.26× at 10M. Of 1,851 ops: 792 faster · 357 within 2× · 177 + 72 slower — and the slow buckets shrank this cycle (🔴 81→72, 🟠 219→177). 🟢 Where it's great (often an order of magnitude)
🔴 Where it's still bad (the honest worklist)
TL;DRReductions, fusion, broadcasting, casts, and large-array throughput are clearly ahead of NumPy — frequently 5–500×. The losses cluster in per-call overhead at small/medium N (API dispatch + iterator construction), the non-vectorized scalar-shift path, and float16 (a BCL limitation, not a design one). All performance, no correctness — it's the optimization backlog, not a bug list. numpy 2.4.2 · Release · best-of-rounds · shared-hardware caveat applies to absolute ms, not ratios. |
📐 Performance matrices —
|
| category | scalar | 1K | 100K | 1M | 10M |
|---|---|---|---|---|---|
| elementwise | 1.05 | 1.54 | 1.18 | 1.09 | 1.11 |
| reductions | 2.67 | 1.99 | 1.51 | 1.44 | 1.42 |
| copy/cast | 0.61 | 0.59 | 0.40 | 1.39 | 1.06 |
| index-math | 0.32 | 0.51 | 0.97 | 1.22 | 1.22 |
| dtypes | 0.71 | 0.85 | 1.97 | 1.54 | 1.47 |
2) Per-family × cache tier
| family | scalar | 1K | 100K | 1M | 10M | geomean |
|---|---|---|---|---|---|---|
| — elementwise — | ||||||
| add | 1.01 | 1.48 | 1.03 | 0.88 | 1.01 | 1.06 |
| sqrt | 0.85 | 1.15 | 1.00 | 1.01 | 1.02 | 1.00 |
| copy | 0.88 | 2.59 | 1.78 | 1.33 | 1.72 | 1.56 |
| strided | 0.89 | 1.12 | 1.00 | 1.02 | 0.99 | 1.00 |
| bcast | 0.89 | 1.13 | 1.02 | 0.98 | 1.03 | 1.01 |
| reversed | 0.85 | 1.28 | 0.90 | 0.99 | 1.00 | 0.99 |
| castbuf | 1.98 | 2.29 | 1.65 | 1.35 | 1.16 | 1.64 |
| mixbuf | 1.49 | 1.94 | 1.40 | 1.24 | 1.09 | 1.40 |
| — reductions — | ||||||
| sum | 1.84 | 1.78 | 2.79 | 2.21 | 1.76 | 2.04 |
| sum axis=0 | 1.90 | 0.86 | 0.96 | 1.00 | 0.94 | 1.08 |
| sum axis=1 | 1.85 | 0.86 | 1.51 | 1.83 | 1.57 | 1.47 |
| sum dtype= | 1.97 | 1.47 | 0.49 | 0.47 | 0.55 | 0.82 |
| amin | 1.70 | 1.61 | 0.71 | 0.70 | 0.82 | 1.02 |
| cumsum | 1.47 | 1.09 | 1.06 | 1.80 | 1.68 | 1.39 |
| any (all-false) | 8.89 | 8.41 | 2.12 | 0.98 | 1.00 | 2.74 |
| any (hit) | 9.01 | 8.50 | 8.50 | 7.87 | 8.22 | 8.41 |
| — copy/cast — | ||||||
| flatten | 0.43 | 0.44 | 0.17 | 2.17 | 0.90 | 0.57 |
| astype | 0.30 | 0.53 | 0.59 | 1.97 | 1.90 | 0.81 |
| ravel.T | 0.45 | 0.73 | 0.48 | 2.11 | 1.01 | 0.80 |
| copy in-place | 1.77 | 0.81 | 0.81 | 1.06 | 1.02 | 1.05 |
| less → bool | 0.81 | 0.52 | 0.26 | 0.54 | 0.76 | 0.54 |
| — index-math — | ||||||
| unravel_index | 0.33 | 0.50 | 0.95 | 1.01 | 0.97 | 0.68 |
| ravel_multi_index | 0.32 | 0.52 | 0.99 | 1.49 | 1.53 | 0.82 |
| — dtypes — | ||||||
| complex128 | 0.74 | 0.63 | 1.01 | 0.76 | 0.89 | 0.80 |
| float16 | 0.72 | 0.65 | 0.62 | 0.62 | 0.62 | 0.65 |
| int8 | 0.67 | 1.47 | 12.09 | 7.70 | 5.78 | 3.51 |
3) Operand & broadcast layouts — case × dtype (1M elements)
| case | f64 | f32 | f16 | i32 | i64 | c128 | geomean |
|---|---|---|---|---|---|---|---|
| 1-D contiguous (a+a) | 2.55 | 2.18 | 0.62 | 2.04 | 2.54 | 2.38 | 1.87 |
| 1-D strided a[::2] | 1.83 | 1.36 | 0.52 | 1.37 | 1.77 | 1.86 | 1.34 |
| 1-D reversed a[::-1] | 2.26 | 2.09 | 0.56 | 2.00 | 2.14 | 2.23 | 1.71 |
| array + scalar | 2.63 | 2.11 | 0.63 | 1.80 | 2.17 | 2.58 | 1.81 |
| scalar + array | 2.35 | 2.10 | 0.64 | 1.98 | 2.48 | 2.59 | 1.85 |
| mixed C + F | 2.12 | 2.01 | 0.62 | 1.98 | 2.02 | 2.26 | 1.70 |
| mixed C + T | 2.59 | 2.13 | 0.62 | 2.04 | 2.20 | 2.51 | 1.84 |
| broadcast +row(1,C) | 2.72 | 2.28 | 0.63 | 2.00 | 2.42 | 2.42 | 1.89 |
| broadcast +col(R,1) | 2.68 | 2.14 | 0.56 | 1.99 | 2.45 | 2.82 | 1.88 |
| col-broadcast unary | 2.55 | 1.76 | 0.86 | 1.69 | 2.66 | 6.09 | 2.18 |
4) astype cast matrix — geomeans (1M, over all 1,568 src→dst × layout cells)
By memory layout
| C | F | T | sliced | negrow | negcol | strided | bcast |
|---|---|---|---|---|---|---|---|
| 1.82 | 1.97 | 1.88 | 1.87 | 1.89 | 1.81 | 1.47 | 2.21 |
By source dtype
| bool | u8 | i8 | i16 | u16 | i32 | u32 | i64 | u64 | char | f16 | f32 | f64 | c128 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 2.06 | 1.99 | 2.02 | 2.26 | 2.21 | 1.96 | 1.94 | 1.64 | 1.58 | 1.98 | 2.41 | 1.75 | 1.45 | 1.16 |
By destination dtype
| bool | u8 | i8 | i16 | u16 | i32 | u32 | i64 | u64 | char | f16 | f32 | f64 | c128 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1.95 | 1.76 | 1.80 | 1.63 | 1.61 | 1.82 | 1.66 | 1.93 | 1.76 | 1.63 | 2.48 | 1.63 | 2.04 | 2.55 |
Reading: 8.89 = NumSharp ~8.9× faster; 0.40 = NumSharp ~2.5× slower. The float16 ≈0.6 column reflects the missing Vector<Half> in the .NET BCL (scalar path). decimal omitted from cast geomeans (no NumPy dtype). Absolute ms aren't cross-machine comparable; these ratios are.
The benchmark/history/<date>_<sha>/ provenance convention held only the 2026-06-05 snapshot; the just-regenerated full run (report commit e3b7c26) was never persisted there — its raw archive sat only in the gitignored benchmark/results/20260623-065155/. This promotes the four core artifacts + a MANIFEST, matching the existing snapshot's shape. Snapshot: benchmark/history/2026-06-23_e3b7c268/ - MANIFEST.md provenance / env / methodology / headline geomeans - benchmark-report.md op-matrix + appended NpyIter/Layout/Operand/Cast/Fusion - benchmark-report.json unified machine-readable results - benchmark-report.csv spreadsheet form - numpy-results.json raw NumPy timings (merge input) Run 20260623-065155 · i9-13900K / Win11 Pro N / .NET 10.0.101 / Python 3.12.12 / numpy 2.4.2 · Release · ~3h. Benchmarked branch HEAD b8623db + local engine WIP (built clean). Convention NPY/NS (>1 = NumSharp faster). Headline: op-matrix 1.14x/0.90x/1.26x at 1K/100K/10M; NpyIter 1.18x geomean (bcast_reduce 538x, any(F) 2.74x); fusion (a-b)/(a+b) 4.16x. Raw BDN per-class JSON omitted (regenerable per the MANIFEST).
… symlink + publish ritual
Make a benchmark run produce a complete, committable provenance snapshot under
benchmark/history/ (the only tracked home for the json/csv/numpy-results that are
gitignored at the benchmark root), and expose the newest run through a git-tracked
benchmark/history/latest symlink so docs and tooling can reference a stable path.
Previously a run only left the gitignored results/<ts>/ raw scratch plus root-level
report files (json/csv ignored), so "what we commit and reference" had no home and
history/ snapshots were assembled by hand.
WHY
results/<ts>/ = gitignored raw scratch (BDN per-class JSON, ~tens of MB)
history/<date>_<sha>/ = tracked provenance snapshot (MANIFEST + report + all sheets)
history/latest = git symlink -> newest snapshot (the stable reference path)
The benchmark/.gitignore drops benchmark-report.{json,csv} and numpy-results.json at
the root, so history/<snap>/ is the ONLY committed location that carries the full,
machine-readable result set alongside the rendered .md.
NEW benchmark/scripts/snapshot_history.py
Assembles a snapshot from the newest (or --results-dir) run + the rendered subsystem
sheets into history/<date>_<sha>/ and (re)points history/latest at it:
- 15 artifacts: benchmark-report.{md,json,csv}, numpy-results.json,
npyiter/layout/operand/cast/fusion *_results.{md,tsv}, cards/{ops,cat}.png
- auto-generated MANIFEST.md: provenance + environment (CPU parsed from the BDN
ProcessorName, OS, .NET SDK, Python, NumPy) + NPY/NS convention + methodology +
the per-size geomean headline parsed back out of benchmark-report.md + the
NpyIter/Cast headlines.
- PROVENANCE accuracy: the working-tree marker reads DIRTY only when BUILD-AFFECTING
sources differ from HEAD (uncommitted-or-new src/ code or *.csproj/props/targets),
NOT on untracked scratch (benchmark/poc/*.cs), gitignored output, or the report
files a run always regenerates. So a clean-checkout CI run correctly reads
"clean (HEAD exactly)" instead of a permanent false DIRTY. All MANIFEST cells are
|-escaped so a commit subject / WIP path containing '|' can't inject a table column.
- history/latest as a relative mode-120000 symlink (repo has core.symlinks=true and
already tracks symlinks); text-file fallback if the OS refuses symlinks.
- <date>_<sha> names the BENCHMARKED commit; --head lets an after-the-fact
re-snapshot record the run's real report commit (and reads THAT commit's subject).
- flags: --results-dir / --snap-name / --head / --commit / --no-stage.
Raw BDN per-class JSON is intentionally not persisted (regenerable).
WIRE run_benchmark.py
New step 6 after the root-copy: unless --no-history, shell out to snapshot_history.py
with the just-produced --results-dir, so every full run ends with a committable
snapshot + repointed latest. Adds HISTORY_DIR const, the --no-history opt-out, and a
closing "Snapshot: .../latest -> ..." line. Passes --no-stage: writing the snapshot
must NOT mutate the git index — staging is the human's "review" step (run -> review ->
commit) and CI stages benchmark/history/ explicitly; a local perf check shouldn't
silently `git add` ~16 files.
WIRE .github/workflows/benchmark.yml (the post-release publish ritual)
git add now includes benchmark/history/ so the snapshot + latest symlink are committed
to master alongside the refreshed report/cards; commit message and header comment
updated to mention the history snapshot.
DOCS
- benchmark/CLAUDE.md: documents history/ (latest + <date>_<sha>/), results/ as
gitignored scratch, scripts/snapshot_history.py, and a new
"History snapshots & the publish ritual" section (results/ vs history/<snap>/ vs
latest table + run -> review -> commit ritual).
- .claude/CLAUDE.md (Performance Convention): "what we commit & reference is
benchmark/history/, not results/<ts>/"; reference the stable
benchmark/history/latest/benchmark-report.md. Also reconciles the canonical-harness
line to FIVE appended subsystems (adds the previously omitted benchmark/operand/).
BACKFILL benchmark/history/2026-06-23_e3b7c268/
Re-ran the script over the e3b7c26 report run: the snapshot carries the full 15
artifacts (was 4 hand-copied core files) + a regenerated MANIFEST that correctly reads
"clean (HEAD exactly)" (no build-affecting WIP at that commit), and history/latest
points at it.
Add a detailed compatibility matrix comparing NumPy (v2.4.2) string dtype/features to NumSharp (nditer branch f7be5f7). Documents support across dtypes, parsing, scalars, array creation, memory layout, casting, comparisons, string operations, array functions and I/O, and highlights that NumSharp lacks a real string dtype (only a UTF-16 char vector and several hacks). Notes two spec corrections found against live NumPy, links to NUMPY_STRING_TYPES.md, and tracks issue #592.
Complete changelog of the
nditerbranch — everything in this PR since #612 merged.455 commits · 806 files · +234,348 / −19,179 (vs
master, after #612)TL;DR
NpyIter— full port of NumPy 2.4.2'snditer(~12.5K lines): all iteration orders (C/F/A/K), all indexing modes, buffered casting, buffered-reduce double-loop, masking, memory-overlap protection (COPY_IF_OVERLAP), windowed buffering (DELAY_BUFALLOC), unlimited operands and dimensions. 566+ byte-for-byte NumPy parity scenarios.NpyExprDSL + three-tier custom-op API — write your own ufuncs: raw IL (Tier 3A), element-wise scalar/SIMD (Tier 3B), or composable expression trees with operator overloads (Tier 3C). Exposed as the publicnp.evaluate, which runs fused expressions 3.2–6.1× faster than NumPy (which can't fuse), with per-node NumPyresult_typetyping and fused reductions.out=/where=/dtype=ufunc kwargs across the elementwise API — the kwargs on every NumPy ufunc, spanning the binary, unary-math, comparison, predicate, and bitwise families with exact NumPy broadcast/cast/error-text semantics. Plusnp.bitwise_and/or/xorandnp.positiveat thenp.*surface.np.*APIs —sort,pad(11 modes),tile,median/percentile/quantile(all 13 interpolation methods) + theirnan*variants,average,ptp,take/put/place,extract/compress,diagonal/trace,argwhere/flatnonzero,unravel_index/ravel_multi_index/indices,delete/insert/append,diff/ediff1d,asfortranarray/ascontiguousarray,np.multithreading.Shapeunderstands F-contiguity,OrderResolverresolves NumPy order modes, ~68 layout bugs fixed across 9 fix groups.np.sort/np.argsorton a radix line-kernel (closes a Missing Function); a SIMD strided-cast campaign that killed the cast cliffs (15×8×15astypematrix: 716 → ~391 lagging cells, 852 → 1,177 winning cells vs NumPy);np.zerosviacalloc/demand-zero (O(1), was ~1000× slower); the six Complex transcendentals (sinh…arctan); and bit-exact pairwise summation forsum/mean.IDisposableonNDArray, plus a tcache-style buffer pool (1 B – 64 MiB window).MultiIterator, the Regen-generated cast templates, andNDIteratoritself (interface + class +AsIteratorextensions) are all gone; every code path now iterates throughNpyIter/NpyFlatIterator/GetAtIndex.1. NpyIter — full NumPy
nditerportFrom-scratch C# port of NumPy 2.4.2's iterator machinery under
src/NumSharp.Core/Backends/Iterators/(~12,557 lines), promoted to public API with NDArray overloads.MULTI_INDEX,C_INDEX,F_INDEX,RANGE(parallel chunking),GotoIndex/GotoMultiIndex/GotoIterIndexDELAY_BUFALLOC, buffered-reduce double-loop (incl.bufferSize < coreSize)op_axeswith-1reduction axes,REDUCE_OK,IsFirstVisit,REUSE_REDUCE_LOOPSslab accumulationCOPY_IF_OVERLAPvia a port of NumPy'smem_overlapsolver (NpyMemOverlap.cs) — overlapping in/out operands no longer silently corruptWRITEMASKED+ARRAYMASKexecuted — the buffered window flush writes back only mask-nonzero elements;VIRTUALoperands (null op slots) construct with NumPy 2.x semanticsNPY_MAXARGS=64) and unlimited dimensions (NumPy caps atNPY_MAXDIMS=64) via dynamic allocationCopy,GetIterView,RemoveAxis,RemoveMultiIndex,ResetBasePointers,IterRange,DebugPrint, fixed/axis stride queries,GetValue<T>/SetValue<T>, …NpyIterCasting.CanCastmatches NumPy'ssafe/same_kindlattice exactlyValidated by a dedicated battletest harness: 566 scenarios replayed against NumPy 2.4.2 byte-for-byte, a permanent variation-probe harness, and
tools/iterator_parity. Dozens of parity bugs found and fixed against NumPy ground truth: negative-stride flipping,NO_BROADCASTenforcement,F_INDEXcoalescing, buffered-reduction stride inversion, K-order on broadcast inputs, EXLOOPiternext, buffered-castAdvance, rangedReset()desync, buffer free-list corruption, the size-1 stride-0 invariant (a(1,4)view with nonzero stride corruptedRemoveMultiIndex),op_axesout-of-bounds reads on stretched size-1 axes, write-broadcast validation,PARALLEL_SAFEwiring, and unit-axis absorption — each reproduced against NumPy first, then fixed by adopting NumPy's constructor structure.Execution at NumPy speed
NpyIterisn't just correct — it is now the production execution engine:DefaultEngine's binary, unary, and comparison ops (same- and mixed-dtype) route through the NpyIter Tier-3B shell, and it measures at-or-faster than NumPy on every probed aspect (Release, i9-13900K, NumPy 2.4.2):a*b+c10M(a-b)/(a+b)10MKey mechanisms: an O(1) trivial-loop bypass that skips iterator construction for contiguous operands, identity-broadcast fast paths, AVX2 hardware-gather (
vgatherdps) strided SIMD in the Tier-3B shell (NumPy uses scalar loops for strided binary/reduce — its floors are beatable), and strided-reduction kernels (2-D strided sqrt 1.36× faster than NumPy, strided sum 2.2× faster).2. NpyExpr DSL + three-tier custom-op API
User-extensible kernel layer on top of
NpyIter— the public answer to "how do I write my own ufunc":ExecuteRawIL: emit raw IL against the NumPy ufunc signaturevoid(void** dataptrs, long* strides, long count, void* aux).ExecuteElementWise: provide scalar + vector IL; the shell supplies a 4×-unrolled SIMD loop, remainder vector, scalar tail, and strided fallback.ExecuteExpression: composeNpyExprtrees with C# operators ((a - b) / (a + b)), 50+ node types (arithmetic, trig, exp/log, rounding, predicates, comparisons,Min/Max/Clamp/Where), plusCall()to splice any delegate/MethodInfointo a fused kernel. Compiled once, cached by structural key, ~5 ns dispatch.This is what powers the fusion wins — one pass, no temporaries — and it is exposed publicly as
np.evaluate(expr[, operands][, out]):result_typetyping — every node resolves to its NumPy 2.4.2 dtype, so mixed trees wrap correctly:(i4*i4)+f8wraps the multiply in int32 (→1410065408) before promoting. Strong-strong NEP50 (incl. int/float tier crossing), weak python-scalar literals (i4+2 → i4,i4/2 → f8) with NumPy's exactOverflowError, and special resolvers (true_divide,arctan2, negative-integer-literalpower→ValueError, booladd=OR/multiply=AND).NpyExpr.Sum/Prod/Min/Max/Meancompile a one-pass inner loop;sum(a*b)readsaandbonce and never materializes the product. NumPy reduction dtypes (int→i64, uint→u64, mean→f64).out=joins via the ufunc rules (same_kind validation, reference identity, overlap-safe aliasing throughCOPY_IF_OVERLAP); anEXTERNAL_LOOPguard prevents the silentcount==1slow path.a*b+c3.2×,(a-b)/(a+b)6.1×,sum(a*b)3.6×,sum f322.9×,i4*2+f83.5× faster. Permanent gate inbenchmark/fusion/evaluate_bench.{cs,py}.3. Legacy iterator stack retired
MultiIteratordeleted; all callers migrated toNpyIter.Copy/ multi-operand execution.NDIterator.template.cs+ 16 generatedNDIterator.Cast.*partials deleted (−3,870 LOC in one commit).NDIterator(interface +NDIterator<T>+AsIteratorextensions) deleted entirely —[Obsolete]tombstones that threw at runtime after the migration and were referenced by nothing live. Production iteration runs throughNpyIter/NpyIterRef(kernels),GetAtIndex(flat reads), andNpyFlatIterator(np.broadcast(...).iters).~400per-dtypeNPTypeCodeswitch sites replaced by a genericNpFuncdispatch utility.4. C/F/A/K memory-layout support
Shapenow tracks F-contiguity with NumPy-convention contiguity computation; newOrderResolverresolvesC/F/A/Kfor every API with anorderparameter.copy,array,asarray,asanyarray,*_like,astype,flatten,ravel,reshape,eye,concatenate,cumsum,argsort,tile, plus post-hoc F-contig preservation across the IL-kernel dispatchers.np.asfortranarray,np.ascontiguousarray.np.whereselects C/F output layout the way NumPy does;ravel('F')of an F-contig source returns a view (was a 3,000× copy).fortran_order, Decimal scalar path, fancy-write isolation, …).5. New & completed
np.*APIsNew functions (36):
np.evaluate(fused expressions — see §2),np.bitwise_and,np.bitwise_or,np.bitwise_xor,np.positivenp.sort(+ndarray.sort;np.argsortreimplemented) — radix line-kernel on NpyIter, stable, NaN-last, all axes / orders (IterAllButAxisdrive mirroring NumPy's_new_sortlike)np.pad(all 11 NumPy modes + callable),np.tile,np.delete,np.insert,np.appendnp.take,np.put,np.place,np.extract,np.compress,np.argwhere,np.flatnonzero,np.diagonal,np.trace,np.unravel_index,np.ravel_multi_index,np.indicesnp.median,np.percentile,np.quantile(all 13 interpolation methods, tuple axis,out=,keepdims, QuickSelect engine),np.average(weights,returned, tuple-axis; fused kernel 1.3–1.6× faster than NumPy at 1M),np.ptp,np.nanmedian,np.nanpercentile,np.nanquantilenp.diff,np.ediff1dnp.asfortranarray,np.ascontiguousarraynp.multithreading(enabled, max_threads)— opt-in threaded kernelsRebuilt to full NumPy 2.x parity:
np.clip—min=/max=keyword aliases, default-None bounds, NumPy 2.x dtype promotion,out=validation.np.unique— 5 missing kwargs, sort+mask algorithm (up to 43× faster), NaN partitioning,n > Array.MaxLengthfallback.np.searchsorted—side=,sorter=, multidim validation; IL binary-search kernels 5–25× faster (beats NumPy on 20/22 benchmarks).np.copyto—casting=,where=masked copies at NumPy speed (was 7–72× slower).np.asarray—copy=,like=,device=, dtype-as-string.np.concatenate— full parity + C/F fast paths.np.all/np.any— tuple-axis,out=,where=.np.expand_dims— tuple axis.np.repeat—axis=parameter.np.power— integer-power semantics, negative-exponentValueError, crash fix.np.broadcast— N-operand form (0..64, then unlimited — NumPy parity, was 2-operand only), live index cursor, lazy.iters,.numiter.max/min, Complex quantile,IsInfimplemented (was a stub); the six Complex transcendentalssinh/cosh/tanh/arcsin/arccos/arctanimplemented (hybrid BCL + C99 edge fix-ups, NumPy 2.4.2 parity — wereNotSupportedException).out=/where=/dtype=ufunc kwargs (NumPy parity):The kwargs present on every NumPy ufunc now span the elementwise core — binary (
add,subtract,multiply,divide,true_divide,mod,power,floor_divide), unary-math (sqrt,exp,log,sin,cos,tan,abs/absolute,negative,square), the six comparisons, predicates (isnan/isfinite/isinf), bitwise,invert,arctan2— each as one NumPy-shaped overload, every rule pinned against NumPy 2.4.2:outjoins the broadcast but never stretches (mismatched/stretchableoutraise NumPy's exact texts, trailing space included); loop dtype resolved from inputs (NEP50),outonly needs a same_kind cast; the provided instance is returned (reference identity).wheremust be exactlybool(mask cast under 'safe'); it broadcasts over operands and participates in output shape; mask-false slots keep prioroutcontents.outaliasing an input is well-defined viaCOPY_IF_OVERLAP—add(x[:-1], x[:-1], out=x[1:])matches NumPy exactly.dtype=computes in the loop dtype (subtract(300, 5, dtype=i16) = 295), with the booladd→OR /multiply→AND remap keyed off the final loop dtype soadd(True, True, dtype=i32) = 2.6. Linear algebra
Vector256FMA micro-kernel reads packed panels, so transposed/sliced inputs cost nothing extra. Eliminates the ~100× fallback penalty (np.dot(x.T, grad): 240 ms → ~1 ms) and the boxingGetValuefallback chain.matmulgufunc semantics — batched stacking, 1-D promotion/squeeze rules, validated by a dedicated differential matrix (816 cases).np.multithreading— opt-in parallel 1-D dot: 1M float dot 172 → 60 µs, ~2× faster than NumPy's default build. Off by default; bitwise-identical summation order when off.7. Performance (beyond NpyIter and linalg)
sum(int16, axis=1)1058 ms → 2.7 ms (389×, now faster than NumPy); int32/uint32 2.3–4.6×; also fixes a uint32 axis-sum corruption bugmean(axis)var/std21×;count_nonzero20×np.nonzeronp.wheresqrtreached parity via gather→tile→SIMD bufferingVirtualAlloc+ demand-zero faults); ≥1 MiB buckets capped at 2 buffers; pool-side GC memory pressure tracking live state;GC.SuppressFinalizeon free;using/ARC adopted acrossconcatenate,allclose,convolve,tile,eye, masking, shuffle, …astypematrix —cvttfloat→int, Giesen f16↔ widen/narrow, complex deinterleave, sub-word VPSHUFB shuffles, fused VPGATHER whole-array kernels, single-pass KEEPORDER same-type copy. Cliffs eliminated: 716 → ~391 lagging cells, 852 → 1,177 winning cells vs NumPynp.zeroscalloc/ WindowsVirtualAllocdemand-zero — O(1) regardless of size (10M f64: 14.3 ms → ~0.01 ms, was ~1000× slower)sum(broadcast_to(...))now ~534–700× faster, beats NumPy, bit-exactsum/mean(float)np.add.reducebit-for-bit (unblocks float32)np.any/np.all(bool/char)ForEachaxis reductions — Decimal 5–13×, Half mean 1.6–3.7×, Complex mean 15–45×→parity; float16 negate ~10× via sign-bit flipfloat→int32)cvtt, strided/reversed/gathered variantsnp.splitfamily8. Official benchmark suite + honest methodology
run_benchmark.pyentry point: BenchmarkDotNet Full rigor (50 iters, InProcessEmit) × all suites × {1K, 100K, 10M} vs NumPy 2.x — 1,813 C# measurements, 1,111 matched op×dtype×size comparisons, structural op-name join, tracked markdown report + per-suite artifacts + history snapshots. Coverage spans all 15 dtypes (SByte/Half/Complex suites added).dotnet runfile-based apps compile the project reference in Debug (optimizations off) even withConfiguration=Releaseproperties — hand loops measured ~2× slow while DynamicMethod IL was immune. Benchmarks now assertIsJITOptimizerDisabled == falseand refuse to mislead; the rule is documented.run_benchmark.py, plus a post-release CI workflow (.github/workflows/benchmark.yml) that auto-commits report cards to master.np.sumover abroadcast_toview (was 54× slower) folds stride-0 axes algebraically and runs ~534–700× faster than NumPy, bit-exact; scalarnp.any/np.allon bool/char (was 5–12× slower) reinterpret onto the integer SIMD path;np.zeros(was ~1000× slower) is calloc-backed. Remaining tracked items: small-N (~1K) per-call dispatch overhead and a few iterator edge cases pinned as[OpenBugs]/skipped repros. A win surfaced too: hand-rolled 8-band parallel iteration 4.7×.9. Differential fuzzing vs NumPy (new infrastructure)
.github/workflows/fuzz-soak.yml).docs/FUZZ_FINDINGS.md; every fixed class re-armed as a permanent regression gate. The error-parity tier alone surfaced 1 critical crash; the op tiers surfaced 17+ distinct bug classes that are now fixed (see §10).10. Correctness — NumPy-parity bug fixes
Semantics (behavioral changes, may affect callers):
floor_divide/mod: NumPy-exact floored semantics and divide-by-zero results.<=/>=now returnFalsefor NaN (IEEE/NumPy).min/maxpropagate NaN.np.negative(uint)wraps modulo 2ⁿ instead of throwing;bool - booland-bool/np.negative(bool)now throw (NumPy behavior).np.power: negative integer exponent raisesValueError; exact integer-power semantics.ConvertValue);complex→boolno longer drops the imaginary part;float→intSIMD uses truncation (cvtt) like NumPy.[1]meets a lower-rank operand; quantile-family dtype & bool handling; Complexnp.where.reciprocal(0)is per-width exact:int32/int64→MinValue,uint64→ 2⁶³, but0for int8/int16/uint8/uint16/uint32 (was MinValue/0 across the board);bool→ int8.clip/maximum/minimum: float16 signed-zero scalar tail, NaN propagation through the SIMD kernel, and correct F-contiguous/strided element pairing.float16axis sum accumulates infloat32(NumPy parity); Complex flatmin/maxreturn the NaN-bearing element verbatim; Complex unary math ported from NumPy's own C99 algorithms.Crashes & corruption:
COPY_IF_OVERLAP, §1).WRITEMASKEDwrite landed garbage in exactly the slots NumPy preserves (silent corruption of the elements the caller asked to protect) — now writes back only mask-nonzero elements.np.pad: 5 correctness/crash bugs (battle-tested against NumPy 2.4.2); linear_ramp preserved Complex dtype.UnmanagedStorage/ArraySlice:CopyTodirection + bounds bugs;CloneDatapartial-buffer bug; scalar offset lost onClone; bufferedNpyIter.Cloneshared buffers;DTypeSizereportedMarshal.SizeOfinstead of in-memory stride;NPTypeCode.Char.SizeOfreturned 1 (real: 2); stale Decimal priority.TensorEnginenow propagates throughCast/Transpose/copy/reshape/ravel(custom engines were silently dropped).takewithout=enforces NumPy's safe-cast direction;put/placenon-contiguous writeback fixes;argsorton non-C-contiguous input.ForEach/ExecuteGeneric/ExecuteReducingread past the end withoutEXTERNAL_LOOP.np.exp2float32-output IL kernel was malformed (InvalidProgramException);np.powerwith a Half exponent threwInvalidCastException; a narrowingdtype=on a complex float-ufunc segfaulted — all fixed.nansumaxis reduction read uninitialized memory forndim ≥ 3; the AVX2 32-laneany()mask overflow (byte/sbyte) returned wrong results; net8.0 complexabsand axismin/maxNaN propagation corrected.11. Memory management — ARC +
IDisposableNDArraynow implementsIDisposablebacked by atomic reference counting on the unmanaged block: CAS-drivenTryAddRef/Release, idempotentDispose, finalizer safety net, immortal non-owning wraps. Views keep parents alive; parent disposal never invalidates live views.dotat 100K: 446 collections → 0).12.
Char8primitiveNew 1-byte character type (
NumSharp.Char8) — the NumPyS1/Pythonbytes(1)equivalent — with conversions, operators, span helpers, and 100% PythonbytesAPI parity validated against a Python oracle. Vendored .NET ASCII/Latin-1 reference sources undersrc/dotnet/document the upstream implementations it mirrors.13. Examples — trainable MNIST MLP
New
examples/NeuralNetwork.NumSharp: a 2-layer MLP with a naive implementation and a fused one (single-NpyIterbias+ReLU fusion, fused softmax-cross-entropy backward, Adam optimizer). Originally needed a "copy transposed views beforenp.dot" workaround (31× training speedup at the time); the stride-native GEMM (§6) made the workaround unnecessary. Converges to >99% test accuracy in the bundled demo.14. Kernel architecture & hygiene
ILKernelGeneratorsplit intoDirectILKernelGenerator(legacy whole-array kernels, 51 partials underKernels/Direct/) andILKernelGenerator(NpyIter-driven per-chunk kernels — the target model matching NumPy'sPyUFuncGenericFunction); migration path documented per kernel family.Vector128/256/512andMath/MathFreflection centralized inVectorMethodCache/ScalarMethodCache; IL-emitted typed-field copier replaces theUnmanagedStorage.Aliasswitch.[Obsolete(error: true)]tombstones, referenced by nothing); dead axis-reduction SIMD paths removed.15. Documentation
docs/website-src/docs/NDIter.md(7-technique quick reference, decision tree, memory model, gotchas) +ndarray.md.benchmarks.md(head-to-head evidence companion to the IL-generation page),benchmark-iterator.md,benchmark-matrix.md, driven by the auto-committed report artifacts.PERF_LEDGER.md(every optimization with before/after),NPYITER_GAPS_AND_ROADMAP.md(gap analysis vs NumPy 2.4.2 + prioritized roadmap),MIGRATE_NPYITER.md, IL-kernel playbook, fuzz findings/coverage.test/NumSharp.UnitTest/AuditV2/AuditV2_*.cs— every Tier-1 finding fixed or reproduced as an[OpenBugs]test.16. Tests & CI
np.evaluate(per-node wraparound, dtype matrices, weak scalars + overflow, fused-vs-unfused,out=identity/cast/aliasing, fused reductions),out=/where=/dtype=parity suites (broadcast/cast/error-text pins), WRITEMASKED/VIRTUAL parity; NpyIter battletests (566 scenarios), order-support sections 41–51, ARC lifecycle, clone regression, np.pad/average/median/percentile/ptp/diff battle tests, IL-kernel battle tests, behavioral audit harness.build-and-release.yml, nightlyfuzz-soak.yml, new post-releasebenchmark.yml(auto-commits NumPy-comparison report cards to master).flip/fliplr/flipud/rot90,diag,gradient, andround(np.sortis now done); small-N (~1K) per-call dispatch overhead is the headline performance focus (docs/NPYITER_GAPS_AND_ROADMAP.md); a few iterator edge cases remain pinned as[OpenBugs]/skipped repros. Every open issue found by the audits/fuzzers/benches is checked in as a failing-by-design test rather than ignored.Breaking changes
bool - bool,-bool,np.negative(bool)now throw^/ cast to int first<=/>=returnsFalsenp.isnanexplicitlyfloor_divide/moddivide-by-zero & floored resultsnp.negative(uint)wraps instead of throwingnp.power(int, negative int)raisesValueErrornp.clip/quantile-family dtype promotion[1].copy()to writeMultiIteratorandNDIterator(+NDIterator<T>,AsIterator) removedNpyIter/NpyIter.Copy/NpyFlatIteratorMaxOperands=8and 64-dim limits removednp.copytounwriteable-destination error type correctedEverything above was validated against NumPy 2.4.2 ground truth — by 37k differential corpus cases, 566 iterator parity scenarios, and per-feature battle tests run on actual NumPy output.