Draft
462 commits
2dbc466
fix(math): reject boolean subtraction to match NumPy (bool - bool now…
Nucs May 29, 2026
bed9a43
fix(math): reject boolean negative to match NumPy (-bool / np.negativ…
Nucs May 29, 2026
71b6c38
fix(cast): align IL kernel + ConvertValue cast semantics with NumPy a…
Nucs May 29, 2026
e55a513
perf(astype): route cross-dtype casts through NpyIter KEEPORDER copy
Nucs May 30, 2026
26b710b
perf(cast): NumPy-faithful SIMD float->int32 (cvtt) + correct scalar …
Nucs May 30, 2026
06b117f
perf(cast): strided/reversed/gathered cvtt for double->int32 (closes …
Nucs May 30, 2026
908ee7f
test(fuzz): NumPy differential cast matrix (Plan A A0+T1) + fix compl…
Nucs May 30, 2026
f36752b
test(fuzz): T2 binary-arith differential matrix (NEP50) + Misaligned …
Nucs May 30, 2026
1c95be8
test(fuzz): catalog floor_divide/mod/power NumPy-parity bugs as [Open…
Nucs May 30, 2026
edb6493
test(fuzz): T3 comparison differential matrix + document NaN <=/>= bug
Nucs May 30, 2026
38e8960
test(fuzz): T4 unary differential matrix + document unary divergence …
Nucs May 30, 2026
9279548
test(fuzz): T5 reduction differential matrix + document reduction div…
Nucs May 30, 2026
22fd2b2
test(fuzz): T6 where/place differential matrix (completes A1 op tiers)
Nucs May 30, 2026
565afb2
test(fuzz): A2 seeded random fuzzer + element-wise shrinker
Nucs May 30, 2026
8c473d2
ci(fuzz): A3 wire FuzzMatrix gate + nightly soak workflow + harness R…
Nucs May 30, 2026
86d7447
docs(fuzz): inventory all NumPy differential-fuzzer findings
Nucs May 30, 2026
5c45555
docs(fuzz): plan to finish #2 (44-variation C/D/E) + build #3 (NpyIte…
Nucs May 30, 2026
013712c
docs: master parity & performance roadmap (full plan to the /np-funct…
Nucs May 30, 2026
8aee0f2
fix(floor_divide/mod): NumPy-exact divide-by-zero & floored semantics…
Nucs May 30, 2026
1dc0f43
fix(comparison): <=/>= return False for NaN, matching IEEE/NumPy (Pha…
Nucs May 30, 2026
f845683
fix(negative): np.negative(uint) wraps modulo instead of throwing (Ph…
Nucs May 30, 2026
fb17369
fix(unary): NEP50 width-based float promotion for transcendental ufun…
Nucs May 30, 2026
75c2eb7
fix(reductions): propagate NaN through flat min/max, matching NumPy (…
Nucs May 30, 2026
7e9bbfb
fix(bool arith + complex where): NumPy-parity for two semantic bugs (…
Nucs May 30, 2026
11d5cff
fix(broadcast): keep rank when a 1-D [1] broadcasts against a lower-r…
Nucs May 30, 2026
b63009c
test(fuzz): re-arm gate for resolved complex-axis-2D & float-reciproc…
Nucs May 30, 2026
dcb9cfa
feat(linalg): full matmul gufunc + dot/squeeze fixes + T8 differentia…
Nucs May 30, 2026
1d8080e
docs(perf): start PERF_LEDGER with the matmul/dot baseline (Phase 5 a…
Nucs May 30, 2026
c0179f3
test(fuzz/W1): widen differential corpus to float16-input + all narro…
Nucs May 30, 2026
e273e0d
test(fuzz/W2): add T9 bitwise + shift differential tier (655 cases, b…
Nucs May 30, 2026
a1b69d2
test(fuzz/W3): add unary-stragglers tier (4654 cases) — 3 bug classes…
Nucs May 30, 2026
fdd12bc
test(fuzz/W4): add T10 NaN-aware reduction tier (2040 cases) — 5 bug …
Nucs May 30, 2026
4283096
test(fuzz/W5): add T11 cumulative tier (scan.jsonl, 544 cases) — 1 bu…
Nucs May 30, 2026
781a519
test(fuzz/W6): add T12 statistics tier (stat.jsonl, 2304 cases) — 4 b…
Nucs May 30, 2026
fb00ed1
test(fuzz/W7): add T13 logic + element-wise extrema tier (logic.jsonl…
Nucs May 30, 2026
c9f64de
test(fuzz/W8): add T15 multi-output tier (modf.jsonl, 64 cases) — 1 b…
Nucs May 30, 2026
acb1aea
test(fuzz/W9): add T7 manipulation tier (manip.jsonl, 1516 cases) — 3…
Nucs May 30, 2026
4b165e9
test(fuzz/W10): add T14 sorting/searching tier (sort.jsonl, 35 cases,…
Nucs May 30, 2026
3f2b510
test(fuzz/W13): add SIMD-tail boundary tier (tail.jsonl, 900 cases, b…
Nucs May 30, 2026
aad46a0
test(fuzz/W12): add parameter-sweep tier (params.jsonl, 288 cases, bi…
Nucs May 30, 2026
4ce81ce
test(fuzz/W11): add section-C operand-relationship tier (aliasing.jso…
Nucs May 30, 2026
f1f3d2d
test(fuzz/W14): add error-parity tier (errors.jsonl, 10 cases) — 1 cr…
Nucs May 30, 2026
0a3adcb
test(fuzz/W15): add metamorphic invariant tier (MetamorphicTests.cs, …
Nucs May 30, 2026
6e81ea6
perf(axis reduce): REUSE_REDUCE_LOOPS slab-accumulation for strided/t…
Nucs May 31, 2026
e062f53
perf(elementwise): O(1) trivial-loop bypass skips NpyIter constructio…
Nucs Jun 3, 2026
1ff75ae
perf(NpyIter): identity-broadcast fast paths + long-shape parity in i…
Nucs Jun 4, 2026
8364bdb
perf(elementwise/NpyIter): hyperoptimize trivial-loop bypass + identi…
Nucs Jun 4, 2026
d50a0e2
perf(reduction): incremental-advance fast path for 1-D non-contiguous…
Nucs Jun 4, 2026
9cebc83
perf(unary): buffered-SIMD path for 1-D strided inputs (gather->conti…
Nucs Jun 4, 2026
d01f1d6
perf(unary): fused strided-SIMD IL kernel for 1-D non-contiguous unar…
Nucs Jun 5, 2026
8da2d7a
bench: official NumSharp-vs-NumPy suite — all-op coverage, 3 sizes, f…
Nucs Jun 5, 2026
1f129c4
bench: per-suite artifact copy + per-size geomean summary in report
Nucs Jun 5, 2026
6038990
bench: official NumSharp-vs-NumPy report (all ops, 3 sizes) + structu…
Nucs Jun 5, 2026
d98c319
bench(history): persist official run snapshot 2026-06-05_6038990f
Nucs Jun 5, 2026
48e8552
bench: cover all 15 NumSharp dtypes — add SByte, Half, Complex (no re…
Nucs Jun 5, 2026
96a5ffc
perf(nditer): Phase 0 hygiene — drop dead axis-reduction SIMD methods…
Nucs Jun 5, 2026
cb0a072
fix(nditer): exact NumPy parity for NpyIterCasting.CanCast (safe + sa…
Nucs Jun 5, 2026
c882c0d
perf(dot): fused single-pass 1-D inner product — 3.5–9× faster, zero GC
Nucs Jun 5, 2026
fea62d2
perf(where): migrate non-contiguous np.where to an NpyIter multi-oper…
Nucs Jun 5, 2026
c5c49b8
feat(dot): np.multithreading toggle + parallel 1-D dot — ~2–3× more o…
Nucs Jun 5, 2026
845f5e0
perf(reduction): widening-SIMD axis reductions for narrow ints — 230-…
Nucs Jun 9, 2026
0c8a5d6
poc(nditer): NpyIter-driven execution at NumPy parity, fusion 2.1-4.6…
Nucs Jun 9, 2026
eae64d8
poc(nditer): close strided C/D/E gaps — NpyIter now at-or-faster than…
Nucs Jun 10, 2026
4eb9749
docs(nditer): NpyIter gap analysis vs NumPy 2.4.2 + prioritized 6-wav…
Nucs Jun 10, 2026
e37f349
feat(nditer): COPY_IF_OVERLAP — overlapping operands no longer silent…
Nucs Jun 10, 2026
fd2f630
perf(nditer): AVX2 hardware-gather strided SIMD in the Tier-3B shell …
Nucs Jun 10, 2026
33058b8
feat(nditer): windowed buffered iteration + DELAY_BUFALLOC + buffered…
Nucs Jun 10, 2026
c9b1588
fix(nditer): size-1 stride-0 invariant, op_axes OOB, bug-(b) sites #6…
Nucs Jun 10, 2026
4140f4d
docs(pr): full PR #611 changelog — 272-commit double-pass audit of th…
Nucs Jun 10, 2026
7c8e058
docs(nditer): correct Wave-1.4 (N,1) A/B figures to the clean interle…
Nucs Jun 10, 2026
224e522
feat(nditer): WRITEMASKED/ARRAYMASK execution + VIRTUAL operands (roa…
Nucs Jun 10, 2026
5962a5e
feat(ufunc): out= and where= parameters across the elementwise np.* A…
Nucs Jun 10, 2026
a8b6160
perf(alloc): open the buffer-pool window + live-state GC pressure + f…
Nucs Jun 10, 2026
d0c122f
feat(evaluate): np.evaluate fused expressions — NumPy result_type per…
Nucs Jun 10, 2026
5fee783
docs(claude): document np.evaluate fused expressions + ufunc out=/whe…
Nucs Jun 10, 2026
fac04f7
bench(nditer): route-audit harness — which non-out= families ride Npy…
Nucs Jun 10, 2026
57e9467
docs(nditer): out=/where= design plan for the NpyIter-routed families…
Nucs Jun 10, 2026
6a566e4
feat(ufunc): ONE NumPy-shaped overload per elementwise ufunc — dtype=…
Nucs Jun 10, 2026
42c8a3d
feat(ufunc): np.bitwise_and/or/xor created with out=/where= + NumPy n…
Nucs Jun 11, 2026
5716f86
feat(ufunc): out=/where=/dtype= across the unary-math batch + invert …
Nucs Jun 11, 2026
eb2abd6
feat(ufunc): out=/where= for the six comparisons + isnan/isfinite/isi…
Nucs Jun 11, 2026
9173268
refactor(ufunc): merge the comparison/predicate bare+out overload pai…
Nucs Jun 12, 2026
450fc93
feat(ufunc): close the 1-to-1 signature sweep — dtype= on the binary …
Nucs Jun 12, 2026
595a8c4
bench(npyiter): iterator-core benchmark — NpyIter machinery itself vs…
Nucs Jun 12, 2026
58de1ee
bench(npyiter): chart renderer for the iterator-core bench results
Nucs Jun 12, 2026
23b4dd5
bench(npyiter): frontier bench — adversarial probe of the NOT-winning…
Nucs Jun 12, 2026
0ee5b70
bench(npyiter): frontier round 2 — broadcast-reduce 54x + scalar np.a…
Nucs Jun 12, 2026
bc48c14
bench(npyiter): round 3 — NumPy's internal NpyIter consumers mapped f…
Nucs Jun 12, 2026
9a16743
bench(npyiter): ASCII geomean bar summary over all 89 measured pairs …
Nucs Jun 13, 2026
86abd65
bench(npyiter): reorient bar summary to the official-report axis (slo…
Nucs Jun 13, 2026
14ef533
bench(npyiter): size-tier sweep (scalar/1K/100K/1M) — the official-re…
Nucs Jun 13, 2026
022b3ee
bench(npyiter): FULL family sweep — all 33 distinct op families x sca…
Nucs Jun 13, 2026
b974c66
bench(npyiter): canonical NpyIter benchmark — one section-addressable…
Nucs Jun 13, 2026
4dc5080
ci(npyiter): decoupled post-release benchmark workflow + README cards…
Nucs Jun 13, 2026
416397d
chore(npyiter): retire exploratory POC rounds; finalize benchmark/npy…
Nucs Jun 13, 2026
9d4502b
feat(bench): integrate NpyIter into run_benchmark.py (Option B); +10M…
Nucs Jun 13, 2026
e98bbd2
bench(npyiter): refresh canonical sheet with 10M tier; AV->NA proven …
Nucs Jun 13, 2026
f26dc24
docs(website): add Benchmarks vs NumPy page driven by the auto-commit…
Nucs Jun 13, 2026
93aee7f
docs(bench): richer two-card story + render full reports into the Doc…
Nucs Jun 13, 2026
336b7dc
fix(bench): op-matrix report ranked measurement artifacts as the "Top…
Nucs Jun 13, 2026
42c65e0
docs(pr): amend PR #611 changelog with the post-changelog wave (Waves…
Nucs Jun 13, 2026
af2386b
docs(bench): prototype a dense, numbers-first NumPy-vs-NumSharp dashb…
Nucs Jun 13, 2026
7649854
fix(net8.0): NumPy-correct complex abs and axis min/max NaN propagation
Nucs Jun 13, 2026
c0a5346
fix(bench): align the dashboard to the house NS/NP convention (<1× = …
Nucs Jun 13, 2026
c829e12
docs(pr): represent final state — drop roadmap-wave/phase framing fro…
Nucs Jun 13, 2026
49af3af
fix(bench): dashboard back to NP/NS speedup + add 🕐 %NumPy time-share…
Nucs Jun 13, 2026
5eac49f
bench: one convention everywhere — NP/NS speedup + 🕐 %NumPy time-share
Nucs Jun 13, 2026
1061f5b
bench: stick the 🕐 after the % (NN%🕐), drop the leading-gap "🕐 NN%"
Nucs Jun 13, 2026
36f7756
bench: %NumPy is ALWAYS a percentage — drop the 880×NP / 880× compact…
Nucs Jun 13, 2026
17c99c5
bench(dashboard): read the merge's canonical ratio/pct — stop driftin…
Nucs Jun 13, 2026
c32d0c3
bench(merge): canonicalize 3 op-name aliases — recover 10 falsely-⚪ "…
Nucs Jun 13, 2026
1657d89
feat(complex): implement sinh/cosh/tanh/arcsin/arccos/arctan for Comp…
Nucs Jun 13, 2026
ef9e1c6
perf(complex): inline-friendly hot/cold split for the complex transce…
Nucs Jun 13, 2026
4536b27
fix(memory): dispose owned intermediates in np.isclose and np.random.…
Nucs Jun 13, 2026
416affc
fix(complex): port NumPy's own algorithms for complex unary math (par…
Nucs Jun 13, 2026
224e96e
bench: close the op-matrix coverage gaps — add missing defs + full re…
Nucs Jun 13, 2026
6879b47
fix(complex): reject narrowing dtype= on complex float-ufuncs (was a …
Nucs Jun 14, 2026
92129bb
perf(creation): np.zeros via calloc + Windows VirtualAlloc demand-zer…
Nucs Jun 14, 2026
6d415b4
fix(exp2): correct malformed float32-output IL kernel (was InvalidPro…
Nucs Jun 17, 2026
5de1138
fix(power): Half exponent no longer throws InvalidCastException (W1-B)
Nucs Jun 17, 2026
ed7cee5
fix(maximum/minimum/clip): propagate NaN through the clip SIMD kernel…
Nucs Jun 17, 2026
4cbb5bc
fix(maximum/minimum/clip): correct F-contiguous/strided element pairi…
Nucs Jun 17, 2026
d72822f
perf(clip): aggressively inline/optimize the per-element extrema helpers
Nucs Jun 17, 2026
06d942c
perf(kernels): aggressively inline/optimize the remaining per-element…
Nucs Jun 17, 2026
e3685be
perf(kernels): complete AggressiveInlining|AggressiveOptimization on …
Nucs Jun 17, 2026
5ef3bba
perf: extend AggressiveInlining|AggressiveOptimization to all small h…
Nucs Jun 17, 2026
20679be
bench(complex-reduce): POC suite diagnosing + benchmarking complex128…
Nucs Jun 17, 2026
7481541
docs(reduce): NpyIter reduction parity + fusion execution plan, with …
Nucs Jun 17, 2026
1b04807
perf(complex): NpyIter axis reductions — fix complex mean (15–45×→par…
Nucs Jun 17, 2026
3a0b53f
perf(half/decimal): NpyIter axis reductions — Decimal all-ops 5–13×, …
Nucs Jun 17, 2026
25e1cc9
perf(reduce): transposed/strided axis reductions to parity-or-better …
Nucs Jun 17, 2026
9c5bad5
feat(evaluate): axis-aware fused reductions — evaluate(Sum(a*b, axis:…
Nucs Jun 17, 2026
8af3faa
docs(reduce): record Phase 5b/6 as skipped — premise-invalidated by m…
Nucs Jun 17, 2026
cbd3706
perf(reduce): Phase 6 step 1 — migrate Double Sum/Mean to per-chunk S…
Nucs Jun 17, 2026
c94aa34
docs(reduce): record parallel-reduce proof (2-6x vs NumPy) + decision…
Nucs Jun 18, 2026
0eeef0f
fix(complex): flat min/max return the NaN-bearing element verbatim (N…
Nucs Jun 18, 2026
d373315
feat(reduce): pairwise summation on NpyIter — bit-exact NumPy sum/mea…
Nucs Jun 18, 2026
84d3b4e
perf(reduce): IL-emit SIMD pairwise sum per dtype — bit-exact NumPy, …
Nucs Jun 18, 2026
f251fc5
perf(reduce): IL-emit Vector128 pairwise sum for Complex128 — bit-exa…
Nucs Jun 18, 2026
abcf3be
fix(reduce): float16 axis sum accumulates in float32, not float16 (Nu…
Nucs Jun 18, 2026
609a43d
fix(reduce): bool axis min/max identity + NpyIter reduce Shape.offset…
Nucs Jun 18, 2026
de1d4d9
bench(reduce): add reduction × layout × dtype × op parity matrix (Npy…
Nucs Jun 18, 2026
81bf489
perf(reduce): SIMD-route non-C-contiguous int-widening axis sum (kill…
Nucs Jun 19, 2026
3fadd11
perf(reduce): materialize broadcast views in flat reduction (bcast-re…
Nucs Jun 19, 2026
327ffbf
bench(elementwise): add op × layout × dtype matrix to probe non-conti…
Nucs Jun 19, 2026
7ca8ede
bench(reduce): refresh layout×dtype matrix after the reduction perf f…
Nucs Jun 19, 2026
e9a998c
bench(copy): NpyIter copy path vs NumPy across all 15 dtypes × 7 layouts
Nucs Jun 19, 2026
c84693c
docs(claude): canonical Performance Convention — NPY/NS, >1 = NumShar…
Nucs Jun 19, 2026
c79ce60
perf(reduce): bool/char/half min/max-along-axis — kill the scalar dou…
Nucs Jun 19, 2026
878240c
perf(reduce): fold broadcast axes in place instead of materializing (…
Nucs Jun 19, 2026
0a3a872
perf(copy): strided/broadcast same-type clone for the Vector-less dty…
Nucs Jun 19, 2026
158ccc5
perf(reduce): flat char/half min/max — reuse the per-dtype trick on t…
Nucs Jun 19, 2026
ebd50cd
fix(reduce)+perf: half flat sum/prod/mean — boxing-free contiguous sc…
Nucs Jun 19, 2026
fd6434f
test(reduce): adversarial broadcast-reduce sweep — fold verified bug-…
Nucs Jun 19, 2026
d628c7c
perf(cast): typed strided cross-dtype cast for the Vector-less dtypes…
Nucs Jun 19, 2026
474b47d
perf(reduce): migrate non-contig Half/Complex reduction fallbacks fro…
Nucs Jun 19, 2026
690a4d5
perf(unary): float16 negate via sign-bit flip, not the BCL float roun…
Nucs Jun 19, 2026
f1ffa06
fix(reduce): complex nansum axis reduction read uninitialized memory …
Nucs Jun 19, 2026
a2b6cba
perf(cast): IL-emitted scalar cast kernel — direct `call Converts.ToX…
Nucs Jun 20, 2026
b38a1d4
docs(plan): final cast-optimization plan — beat NumPy at every execution
Nucs Jun 20, 2026
873f3ad
docs(plan): cast entry-point unification — route buffered + assignmen…
Nucs Jun 20, 2026
ba8db75
bench(cast): Phase 0 full-matrix discovery — astype 15×8×15 sweep rep…
Nucs Jun 20, 2026
12b97f5
docs(cast-plan): PROVE float→narrow-int kernel by benchmark — correct…
Nucs Jun 20, 2026
84c2572
perf(reduce)+fix: Half/Complex/bool flat reductions via struct-generi…
Nucs Jun 20, 2026
5a34e37
perf(cast): SIMD float->narrow-int (cvtt+truncating-Narrow) — kills P…
Nucs Jun 20, 2026
cff88f7
docs(cast-plan): PROVE all 5 remaining cast cliff families by benchma…
Nucs Jun 20, 2026
0378596
perf(cast): SIMD {int,float}->bool (!=0 compare) — Phase-0's worst ds…
Nucs Jun 20, 2026
7e2cb11
perf(broadcast): make np.broadcast(...).iters lazy — no eager iterato…
Nucs Jun 20, 2026
c13b9b8
docs(cast-plan): ROOT-CAUSE + prove the last routing cliff (same-type…
Nucs Jun 20, 2026
68d0e0c
perf(cast): vectorized f16->{bool,i8,u8,i16,u16,char,i32} via bit-fid…
Nucs Jun 20, 2026
9eff2c0
revert(cast): remove this session's cast SIMD kernels — superseded by…
Nucs Jun 20, 2026
f7be5f7
refactor(iterators): retire legacy NDIterator — [Obsolete] tombstones…
Nucs Jun 20, 2026
fe42a89
test(broadcast): NumPy-parity coverage for np.broadcast(...).iters ac…
Nucs Jun 20, 2026
53f18f0
feat(broadcast): np.broadcast accepts 0..64 operands (NumPy parity), …
Nucs Jun 20, 2026
f83de45
feat(broadcast): live index cursor + iteration + reset() — align np.b…
Nucs Jun 20, 2026
7c283f3
perf(cast): SIMD float->narrow-int + complex->int (Waves 1-2) — kill …
Nucs Jun 20, 2026
33d93bd
perf(cast): SIMD Half->int via Giesen bit-fiddle widen (Wave 3)
Nucs Jun 20, 2026
b1a6382
perf(cast): SIMD {int,float,half,char}->bool via !=0 compare (Wave 4)
Nucs Jun 20, 2026
54b9589
perf(broadcast): scalar-broadcast same-type clone via fast fill, not …
Nucs Jun 20, 2026
e6579d6
bench(cast): post-waves full matrix re-run — 716 lagging cells -> 461…
Nucs Jun 20, 2026
0d6c232
feat(broadcast): drop the 64-operand cap — match NumSharp's unlimited…
Nucs Jun 20, 2026
0ec2f71
test(broadcast): prove np.broadcast scales to N operands like NpyIter…
Nucs Jun 20, 2026
e660f84
perf(cast): fused VPGATHER whole-array kernels for f32/f64->narrow st…
Nucs Jun 20, 2026
4a91531
perf(cast): fused VPGATHER whole-array kernels for {f32,f64,i32,u32,i…
Nucs Jun 20, 2026
47b20eb
bench(cast): refresh cast-matrix scoreboard after Wave 7 fused-gather…
Nucs Jun 20, 2026
ffc7417
perf(cast): fused VPGATHER whole-array kernel for f32->i32 strided (W…
Nucs Jun 20, 2026
cb255eb
bench(cast): refresh scoreboard after Wave 7c (f32->i32 strided)
Nucs Jun 20, 2026
a313c56
refactor(iterators): delete NDIterator entirely — it was fully dead code
Nucs Jun 20, 2026
d45a9d2
perf(cast): fused VPGATHER whole-array kernels for int->narrow stride…
Nucs Jun 20, 2026
357bbce
perf(cast): drop stale signed->UInt64 SIMD-widen rejection (Wave 9)
Nucs Jun 20, 2026
5cf136d
docs(website): refocus iterator docs on NpyIter; drop deleted NDItera…
Nucs Jun 20, 2026
401deaa
perf(cast): gather-real deinterleave for c128->narrow inner-strided (…
Nucs Jun 20, 2026
e5b2707
perf(cast): SIMD Giesen float->f16 narrow for {bool,u8,i8,i16,u16,cha…
Nucs Jun 20, 2026
e5875cf
perf(cast): SIMD u32/f64/c128 -> f16 — finish the f16 column except i…
Nucs Jun 20, 2026
b7af39e
docs: remove 10 stale/superseded planning & handover docs + fix dangl…
Nucs Jun 20, 2026
e28a71e
perf(cast): single-pass KEEPORDER same-type copy — kill the F-contig …
Nucs Jun 20, 2026
dd62afe
docs(NDIter): correct stale bug status, add memory-overlap + adoption…
Nucs Jun 20, 2026
cbca606
docs(CLAUDE.md): retarget to current-state snapshot, drop migration n…
Nucs Jun 20, 2026
30fa835
perf(cast): SIMD char->{i8,u8} contiguous narrow — close the generic …
Nucs Jun 20, 2026
5832f4f
docs(NDIter): make the page a final-state reference (drop bug list + …
Nucs Jun 20, 2026
055fd4a
perf(cast): SIMD complex128 -> bool via deinterleave + nonzero compar…
Nucs Jun 20, 2026
ce70ed8
bench(cast): refresh cast-matrix scoreboard at HEAD (Waves 8-14)
Nucs Jun 20, 2026
24a9b2c
docs(CLAUDE.md): second current-state pass — deleted class, random su…
Nucs Jun 20, 2026
76c99f5
docs(NDIter): second final-state pass — neutral wording + GROWINNER a…
Nucs Jun 20, 2026
d9dadfe
docs(CLAUDE.md): third pass — close API-list completeness gaps, fix t…
Nucs Jun 20, 2026
2f9e1bd
perf(cast): bit-exact AVX2 f32/f64 -> u32 (Wave 15) — 16 cells 0.46-0…
Nucs Jun 20, 2026
d881ad9
perf(cast): route f16->u32 and c128->u32 through the AVX2 f64/f32->u3…
Nucs Jun 20, 2026
e54a15c
perf(cast): bit-exact vectorized f16 -> f32 widen (Wave 15c) — 8 cell…
Nucs Jun 20, 2026
b889391
feat(bench): promote layout/cast/fusion harnesses from poc into run_b…
Nucs Jun 20, 2026
4d4d460
perf(cast): bit-exact AVX2 {f16,f32,f64,c128} -> u64 (Wave 15d)
Nucs Jun 20, 2026
429476b
perf(cast): bit-exact vectorized f16 -> i64 (Wave 15e) — 8 cells 0.73…
Nucs Jun 20, 2026
e667fc6
bench(cast): fix UTF-8 piping in bench_common + refresh scoreboard at…
Nucs Jun 20, 2026
f3faaaf
docs(cast): record Wave 15 cast kernels — float/c128->u32/u64, f16->f…
Nucs Jun 20, 2026
f12b71f
docs(cast): continuation plan for the remaining 133 <0.9 cells
Nucs Jun 20, 2026
83dfccb
bench(layout): POC operand/extra layout classes + harmonize subsystem…
Nucs Jun 20, 2026
7e6ac1e
bench(operand): promote operand/broadcast layout POC into a run_bench…
Nucs Jun 20, 2026
835939b
docs(NDIter): third final-state pass — accuracy fixes in Iteration Me…
Nucs Jun 20, 2026
c5fb126
bench(fusion): add operand-layout sweep to the fusion gate (C/F/T/str…
Nucs Jun 20, 2026
f60a170
perf(cast): SIMD deinterleave/reverse for same-type sub-word strided …
Nucs Jun 20, 2026
9aea3ac
perf(cast): extend SubwordCopy to same-size cross-type bit reinterpre…
Nucs Jun 20, 2026
16afbac
perf(evaluate): fix np.evaluate F-order/transpose cliff (~15x) via F-…
Nucs Jun 20, 2026
03647e8
bench: review fixes — stale operand labels + fusion out= apples-to-or…
Nucs Jun 20, 2026
3b44603
perf(cast): SIMD 2-byte-int -> 1-byte/bool strided narrowing (Wave 16c)
Nucs Jun 20, 2026
6e3bd83
docs(cast): record Wave 16 sub-word strided SIMD shuffles (CLAUDE.md …
Nucs Jun 20, 2026
97f5ab2
bench(cast): refresh scoreboard at HEAD (clean foreground run, Wave 1…
Nucs Jun 20, 2026
12041fe
perf(cast): SIMD 1-byte-int -> 2-byte strided widening (Wave 16d)
Nucs Jun 20, 2026
cb62e37
perf(reduce): collapse broadcast axes algebraically in flat reduction…
Nucs Jun 20, 2026
97b2012
feat(sort): implement np.sort + np.argsort + ndarray.sort on NpyIter …
Nucs Jun 20, 2026
710ef15
bench(cast): refresh scoreboard at HEAD incl. Wave 16d widening (repr…
Nucs Jun 20, 2026
c47971a
docs(cast): record Wave 16d 1B->2B widening (CLAUDE.md + plan docs)
Nucs Jun 20, 2026
c79ed6f
perf+fix(any/all): SIMD bool/char reductions + fix byte/sbyte any() A…
Nucs Jun 20, 2026
30891f5
fix(reduce): root-fix NpyIter strided sub-word accumulator drift; dro…
Nucs Jun 20, 2026
750059a
perf(cast): SIMD i64/u64 -> f16 via clamp-sentinel + low-32 pack (Wav…
Nucs Jun 20, 2026
faa2754
perf(cast): kill the -1/2-stride gather in float/c128 -> u64 strided …
Nucs Jun 20, 2026
0a0a42f
perf(cast): SIMD f16 -> bool strided via deinterleave/reverse + magni…
Nucs Jun 20, 2026
f4a34ba
docs(cast): Wave 17 scoreboard refresh + bucket A-E plan writeup
Nucs Jun 20, 2026
2d93eb6
perf(cast): SIMD c128 -> bool negcol via deinterleave-reverse of real…
Nucs Jun 20, 2026
b4d2128
fix(sort): eliminate O(N^2) blowup on 1-D / axis=None sort & argsort
Nucs Jun 21, 2026
861f069
perf(engine): route 7 strided ops through the kernel instead of mater…
Nucs Jun 21, 2026
cf000e8
fix(math): NumPy-exact integer reciprocal 1/0 (+ bool->int8) and clip…
Nucs Jun 21, 2026
41f295a
fix(npyiter): 0-dim op_axes iterates once, not N times (root of sort …
Nucs Jun 21, 2026
d1932b0
perf(binaryop): fix scalar-broadcast fast-path miss for weak-scalar l…
Nucs Jun 21, 2026
d1fd3cd
docs: remove shipped plan/audit/POC scratch from the nditer branch
Nucs Jun 23, 2026
b8623db
docs(pr): refresh PR #611 changelog to current branch state (HEAD d19…
Nucs Jun 23, 2026
07041e5
refactor(kernels): delete the 24 [Obsolete(error:true)] dead methods …
Nucs Jun 23, 2026
76ec6dd
refactor: remove 10 confirmed-dead private/protected methods
Nucs Jun 23, 2026
e3b7c26
bench: regenerate full NumSharp-vs-NumPy report (op-matrix + all 5 su…
Nucs Jun 23, 2026
dc4ca8b
bench(history): archive the 2026-06-23 full-run snapshot (e3b7c268)
Nucs Jun 24, 2026
9fd5666
bench(history): committable history/<date>_<sha> snapshots + `latest`…
Nucs Jun 24, 2026
7d44d49
Add NumPy-NumSharp string compatibility matrix
Nucs Jun 24, 2026
1 change: 1 addition & 0 deletions .agents/skills/np-function/SKILL.md
1 change: 1 addition & 0 deletions .agents/skills/np-tests/SKILL.md
213 changes: 159 additions & 54 deletions .claude/CLAUDE.md

Large diffs are not rendered by default.

191 changes: 191 additions & 0 deletions .claude/commands/np-function.md
@@ -0,0 +1,191 @@
---
name: np-function
description: Implement a NumPy np.* function in NumSharp with full API parity, optimizations, and variation coverage (NumPy 2.4.2 source of truth).
argument-hint: <np.function_name or description>
---

When the user requests /np-function, follow these instructions carefully:

# np-function command

We aim to support NumPy's np.* API to the fullest. We align with NumPy 2.4.2 as the source of truth and must provide the exact same API (np.* overloads) as NumPy does.
This session we are focusing on: """$ARGUMENTS"""
Your job is to work on np.* functions, no more than one at a time unless they are closely related.

The high-level development cycle for an np.* function is defined as follows:

## 1. Read, investigate, learn and experiment
Read how NumPy (src\NumPy\) implements the np function(s) you are about to implement, noting all parameters and overloads.
NumPy is the source of truth: if NumPy does A, we do A, but in NumSharp's C# way.

### Definition of Done:
- At the end of step (1) you fully understand:
- How the np function works internally in NumPy and how it reacts to inputs / parameters.
- Which parameters the np function accepts and which modes it works in.
- Which optimizations NumPy uses and which optimizations we can use.
- How best to integrate with our existing infrastructure.
- Whether to use ILKernelGenerator or NpyIter to implement the loop.
- Do not implement a struct kernel.
## 2. Implement np method/s
- Implement np methods to the fullest, integrating into our existing infrastructure and patterns.
- Our implementation might differ from NumPy's because NumPy uses C++ macros while we generate IL methods at runtime to achieve peak performance and CPU acceleration. Even so, any input given to NumPy must produce the same output in NumSharp, with complete parity.
- Our implementation must provide the same parameters as the NumPy function and support all dtypes NumSharp currently supports.
- Do not create a function per dtype/NPTypeCode or an if-else/switch-case per dtype/NPTypeCode to call a specialized path.
- Do not use the struct kernel pattern.
- Do utilize IL generation (ILKernelGenerator) and/or NpyIter to implement the function, including fast paths.
- Any loops must be implemented via NpyIter or via ILKernelGenerator.

## Tools:
### Asserting, Validating, Comparing, Experimenting and Probing
Both "dotnet run <<'EOFDOTNET'" and "python <<'EOFPYTHON'" can be used to assert, validate, compare, test, and confirm how behaviors, edge cases, parameter variations, happy flows, and unhappy flows respond to given inputs.
These CLI commands allow rapid development and experimentation.
When specifying '#:project' and other '#' directives, paths must be absolute.
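
As an illustration of the Python side of such a probe, here is a minimal sketch that records inputs and outputs as raw IEEE-754 bit patterns, so a result can be compared bit-for-bit against another runtime. The record layout, `f64_bits`, and `probe` are hypothetical names chosen for this example, and `math.sqrt` stands in for whichever np.* function is under test so that the sketch needs only the standard library.

```python
import json
import math
import struct


def f64_bits(x: float) -> str:
    """Hex of the big-endian IEEE-754 double; distinguishes -0.0 from +0.0."""
    return struct.pack(">d", x).hex()


def probe(name, fn, inputs):
    """Run fn over inputs and return one JSON line per case (hypothetical format)."""
    lines = []
    for x in inputs:
        try:
            rec = {"op": name, "in": f64_bits(x), "out": f64_bits(fn(x))}
        except (ValueError, OverflowError) as exc:
            # Record the error class so error parity can be compared too.
            rec = {"op": name, "in": f64_bits(x), "error": type(exc).__name__}
        lines.append(json.dumps(rec, sort_keys=True))
    return lines


if __name__ == "__main__":
    for line in probe("sqrt", math.sqrt, [4.0, 0.0, -0.0, -1.0]):
        print(line)
```

Comparing bit patterns rather than printed decimals avoids false matches on signed zeros and rounding in the text representation.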

### Benchmarking
Use "dotnet run <<'EOFDOTNET'" and "python <<'EOFPYTHON'" to produce professional benchmarks.

#### Benchmarking Rules of Thumb
- We must be at least 1.5x as fast as NumPy across all execution variations and modes (all dtypes, all parameter combinations; see "Variations for Asserting, Validating, Comparing and Experimenting").
- There is a reason towards why NumPy does

## Optimizations and Implementation
Our codebase uses and follows the following techniques:

### A. Specialization & code generation

- Runtime IL emission per cache key — DynamicMethod generates a kernel once per (op, dtypes, layout) and the JIT compiles it to native; subsequent calls hit a ConcurrentDictionary lookup.
- Per-startup SIMD width baking — VectorBits resolved once via IsHardwareAccelerated; the emitted IL targets exactly one of V128/V256/V512 with no runtime width branch.
- Layout-specialized kernel paths — Generate distinct kernels for SimdFull / SimdScalarLeft / SimdScalarRight / SimdChunk / General instead of one kernel with runtime layout branches; layout becomes part of
the cache key.
- Signature collapse for fast paths — Contig kernels drop stride/shape args; scalar-broadcast kernels take T scalar not T*; cuts indirection and shrinks the IL body.
- Helper-call vs inline-IL choice — When an op has a tidy generic-constrained C# helper (e.g. CumSumHelperSameType<T>), the kernel emits a single Call and lets the JIT inline; only complex bodies inline the
IL loop themselves.
- Negative cache for unsupported combos — _castUnsupported/_maskedCastUnsupported record dtype pairs that fail IL gen so retries are O(1) instead of re-attempting emission.
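
The cache-key and negative-cache ideas above can be modelled in a few lines. This is a language-neutral sketch in Python, not the NumSharp implementation: `_emit` is a stand-in for IL emission, the `(op, dtype, layout)` tuple shape and the "decimal is unsupported" rule are assumptions made only for the example.

```python
from typing import Callable, Dict, Optional, Set, Tuple

Key = Tuple[str, str, str]  # (op, dtype, layout): hypothetical cache-key shape

_kernels: Dict[Key, Callable[[list, list], list]] = {}
_unsupported: Set[Key] = set()  # negative cache of combos that failed to emit
stats = {"emits": 0}            # counts how often "emission" actually runs


def _emit(key: Key) -> Optional[Callable[[list, list], list]]:
    """Stand-in for kernel emission; returns None when the combo is unsupported."""
    stats["emits"] += 1
    op, dtype, _layout = key
    if dtype == "decimal":  # assumption: pretend this combo cannot be specialized
        return None
    if op == "add":
        return lambda a, b: [x + y for x, y in zip(a, b)]
    if op == "mul":
        return lambda a, b: [x * y for x, y in zip(a, b)]
    return None


def get_kernel(key: Key) -> Optional[Callable[[list, list], list]]:
    """Return the cached kernel, emitting it at most once per key."""
    if key in _unsupported:
        return None  # O(1) rejection, no second emission attempt
    kernel = _kernels.get(key)
    if kernel is None:
        kernel = _emit(key)
        if kernel is None:
            _unsupported.add(key)
            return None
        _kernels[key] = kernel
    return kernel
```

Repeated requests for the same key run emission once; a failed combo is also attempted only once, after which the caller can fall back to a general path.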

### B. Loop shaping

- 4x-8x unrolling with independent accumulators — Body processes 4-8 vectors per iter into 4-8 separate accumulators; breaks the carried dependency so the CPU dispatches 4-8 SIMD ops/cycle.
- Three-stage loop — Unrolled SIMD body + 1-vector remainder + scalar tail; handles any count without padding.
- Inner-contig runtime dispatch — Inside strided kernels, compare each operand's stride to its element size; branch into the SIMD inner body when all match, else strided.
- Cache-friendly loop ordering — IKJ in MatMul so the inner SIMD walk is over sequential B[k,:] memory; A[i,k] is broadcast once and reused across all j.
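
The three-stage loop with independent accumulators can be sketched as follows. This is a scalar Python model of the control flow only: a "vector" is simulated as a list of `LANES` elements, and `LANES` and the fixed 4x unroll are assumptions chosen for the example rather than values taken from the codebase.

```python
from typing import List, Sequence

LANES = 4  # hypothetical vector width in elements


def _load_add(acc: List[int], data: Sequence[int], i: int) -> List[int]:
    """Lane-wise add of one simulated vector (LANES elements at i) into acc."""
    return [acc[lane] + data[i + lane] for lane in range(LANES)]


def _vadd(a: List[int], b: List[int]) -> List[int]:
    return [x + y for x, y in zip(a, b)]


def staged_sum(data: Sequence[int]) -> int:
    n = len(data)
    acc0, acc1, acc2, acc3 = ([0] * LANES for _ in range(4))
    i = 0
    # Stage 1: 4x-unrolled body, each vector feeds its own accumulator.
    while i + 4 * LANES <= n:
        acc0 = _load_add(acc0, data, i)
        acc1 = _load_add(acc1, data, i + LANES)
        acc2 = _load_add(acc2, data, i + 2 * LANES)
        acc3 = _load_add(acc3, data, i + 3 * LANES)
        i += 4 * LANES
    # Stage 2: one-vector remainder.
    while i + LANES <= n:
        acc0 = _load_add(acc0, data, i)
        i += LANES
    # Tree merge of the accumulators, then a horizontal reduce across lanes.
    acc0 = _vadd(acc0, acc1)
    acc2 = _vadd(acc2, acc3)
    acc0 = _vadd(acc0, acc2)
    total = sum(acc0)
    # Stage 3: scalar tail for the last n % LANES elements.
    while i < n:
        total += data[i]
        i += 1
    return total
```

Because the three stages together consume every element exactly once, no padding is needed for any length, including zero.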

### C. SIMD primitives

- Mask→uint via ExtractMostSignificantBits — Convert a Vector mask to packed bits in a uint — the universal building block for All/Any/NonZero/CountTrue/CopyMasked.
- Bit-scan loop (TrailingZeroCount + bits &= bits-1) — Materialize lane indices from a packed mask one-at-a-time without a per-lane branch; standard idiom for sparse-extract.
- Self-equality NaN mask — Equals(v, v) produces lanes that are true for non-NaN (NaN ≠ NaN); used to zero/count out NaNs in NaN-aware reductions.
- Branchless ConditionalSelect — Per-lane gating without a branch; used by Where and masked cross-dtype copy.
- Scalar pre-broadcast — Vector.Create(scalar) hoisted into a local before the loop so the body re-uses it instead of reloading; used by scalar-broadcast variants of binary/where/clip.
- Op-specific identity seeding — Reduction accumulators are pre-loaded with 0 (Sum), 1 (Prod), MinValue (Max), MaxValue (Min) — also defines the empty-array result.
- Tree merge + horizontal halving — Multi-accumulator finalization: acc0 op= acc1; acc2 op= acc3; acc0 op= acc2, then horizontal reduce across the lanes.
- Early-exit on mask state — All/Any/IsAllZero return immediately when the packed bits hit the terminal pattern, skipping the rest of the array.
- Vectorized index discovery, scalar scatter — Even when the data store can't be vectorized (gather/scatter limits), the mask scan that finds the indices is fully SIMD.
- AVX2 gather for strided float/double — Strided axis reductions use intrinsic gather when the dtype is gather-capable.
- Width-adaptive emit via GetVectorContainerType() — One emission function picks Vector{128|256|512} methods through a cache; the same source code path covers all widths.
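
The mask-to-bits and bit-scan idioms above can be shown with plain integers. In this Python sketch, `mask_to_bits` plays the role that the list assigns to ExtractMostSignificantBits, and the trailing-zero count is computed from the isolated lowest bit; both function names are made up for the example.

```python
from typing import List, Sequence


def mask_to_bits(mask: Sequence[bool]) -> int:
    """Pack per-lane booleans into an integer, lane 0 in bit 0."""
    bits = 0
    for lane, flag in enumerate(mask):
        if flag:
            bits |= 1 << lane
    return bits


def set_lane_indices(bits: int) -> List[int]:
    """Yield the indices of set bits, lowest first, one per loop iteration."""
    out = []
    while bits:
        lowest = bits & -bits                # isolate the lowest set bit
        out.append(lowest.bit_length() - 1)  # its position == trailing zero count
        bits &= bits - 1                     # clear the lowest set bit
    return out


def non_nan_mask(values: Sequence[float]) -> List[bool]:
    """Self-equality test: True for ordinary values, False for NaN."""
    return [v == v for v in values]
```

The loop runs once per set bit rather than once per lane, which is why the idiom suits sparse masks.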

### D. Memory & pointer

- Cpblk IL intrinsic — Same-type contiguous copy emits the CLR block-memcpy opcode directly instead of a loop.
- Incremental coord advance — Outer-dim walks update offsets by adding strides rather than recomputing via flat → div/mod per element.
- Pre-computed dim strides in stack array — Axis kernels pre-build output-dim strides on the stack so each output index → input offset is O(ndim) muladds, no divmods.
- Pointer/stride prologue hoisting — Inner-loop factory snapshots dataptrs[i] and strides[i] into locals once at the top so the loop body works against locals, not memory loads.
- Pre-size-then-fill — np.nonzero runs an IL-emitted popcount first to size the output buffer, then a second IL-emitted bit-scan kernel writes indices; avoids the "alloc max-size temp" pathology.
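
The difference between incremental coordinate advance and per-element div/mod can be illustrated with a small Python model. Both function names are hypothetical, and strides are assumed to be in element units; the sketch only shows that the odometer-style walk visits the same offsets without any division.

```python
def offsets_divmod(shape, strides):
    """Reference: recompute every offset from the flat index (div/mod per axis)."""
    total = 1
    for dim in shape:
        total *= dim
    out = []
    for flat in range(total):
        offset, rem = 0, flat
        for dim, stride in zip(reversed(shape), reversed(strides)):
            rem, coord = divmod(rem, dim)
            offset += coord * stride
        out.append(offset)
    return out


def offsets_incremental(shape, strides):
    """Odometer walk: add one stride per step, rewind an axis when it carries."""
    total = 1
    for dim in shape:
        total *= dim
    coords = [0] * len(shape)
    offset = 0
    out = []
    for _ in range(total):
        out.append(offset)
        for axis in range(len(shape) - 1, -1, -1):
            coords[axis] += 1
            offset += strides[axis]
            if coords[axis] < shape[axis]:
                break
            offset -= strides[axis] * shape[axis]  # rewind this axis, carry outward
            coords[axis] = 0
    return out


# A transposed 3x2 array viewed as 2x3: strides (1, 2) in elements.
print(offsets_incremental((2, 3), (1, 2)))  # [0, 2, 4, 1, 3, 5]
```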

### E. Algorithmic

- Two-pass algorithms — ArgMax (find value → find index), Var/Std (mean → squared diffs), masked-copy (count → place). First pass enables vectorization; second pass exploits the known result.
- Monotonic-bound carry — searchsorted carries the lower bound L from the previous iteration when consecutive keys ascend, mirroring NumPy's binsearch.cpp.
- Short-circuit prescan — Quick SIMD all-zero check on a boolean mask short-circuits the whole np.where(cond) pipeline when the condition is fully false.
- Type-promotion-aware path skip — SIMD reduction is skipped when the input dtype differs from the accumulator dtype (e.g. sum(int32)→int64) because Vector<T> can't widen lanes; the kernel falls back to scalar IL.
- Two-tier inner-loop API — Callers choose between Tier 3A (raw IL body) for full control or Tier 3B (scalar/vector body lambdas wrapped in the standard 4×-unrolled shell) for boilerplate elimination.
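
The monotonic-bound carry can be sketched as a left-sided binary search that keeps its lower bound while keys ascend. This is a simplified Python model with a hypothetical function name; it is not a transcription of NumPy's binsearch.cpp, which also adjusts the upper bound.

```python
def searchsorted_left(sorted_values, keys):
    """Left insertion points, reusing the previous lower bound for ascending keys."""
    n = len(sorted_values)
    lo = 0
    prev = None
    out = []
    for key in keys:
        if prev is None or key < prev:
            lo = 0  # key went down: the carried bound is no longer valid
        hi = n
        while lo < hi:
            mid = (lo + hi) // 2
            if sorted_values[mid] < key:
                lo = mid + 1
            else:
                hi = mid
        out.append(lo)
        prev = key
    return out


print(searchsorted_left([1, 3, 3, 5, 8], [0, 3, 4, 4, 9, 2]))  # [0, 1, 3, 3, 5, 1]
```

The carry is safe because the left insertion point never decreases as the key grows, so the previous result is always a valid lower bound for the next ascending key.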

### F. Cross-type bridging

- Decimal-via-double bridge — All transcendental decimal ops emit decimal→double→Math.*→decimal inline IL.
- Bool-mask lane expansion — The 1-byte mask is widened through a WidenLower chain to match the 1/2/4/8-byte data lane width before ConditionalSelect.
- Magnitude comparison for Complex — ArgMax/ArgMin on Complex compares |z|, since Complex has no native ordering.
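
The decimal-via-double bridge amounts to a round trip through a binary floating-point value. The sketch below uses Python's `decimal.Decimal` as a stand-in for the CLR decimal type; `decimal_via_double` is a hypothetical helper, and the result carries only the precision of a double.

```python
import math
from decimal import Decimal


def decimal_via_double(fn, value: Decimal) -> Decimal:
    """decimal -> double -> fn -> decimal, mirroring the bridge described above."""
    as_double = float(value)   # narrowing step: extra decimal digits are lost here
    result = fn(as_double)     # the transcendental runs in double precision
    return Decimal(result)     # exact conversion of the double back to decimal


print(decimal_via_double(math.sqrt, Decimal("4")))  # 2
```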

### G. NumPy semantic compliance

- NumPy-overflow shift semantics — Branch on shift >= bitWidth returns 0 (or -1 for signed-negative right shift) instead of C# x << (n & 31) masking.
- Sign-preserving zero in Modf — Explicit fixup so modf(-0.0) = (-0.0, -0.0) and modf(+inf) = (+0.0, +inf) per the C standard.
- Vacuous truth for empty reductions — all([])=True, any([])=False, identity-valued Sum/Prod/Max/Min for empty arrays.
- NEP50-aligned accumulator types — Reduction kernels promote int32→int64 for Sum/Prod/CumSum, dropping out of SIMD when needed.

### H. Reflection & caching

- MethodInfo cache (fail-fast at type load) — Math.*, Vector*.*, Decimal.* reflection resolved in static initializers with ?? throw; emission never pays GetMethod cost.
- Width-resolved generic method cache — VectorMethodCache.V(VectorBits, clrType) returns the right Vector{W}<T> type and Generic(VectorBits, name, clrType, paramCount) returns the right method handle.
- ConcurrentDictionary.GetOrAdd keyed by structural value — All kernel caches use struct keys with stable Equals/GetHashCode; thread-safe lazy init via GetOrAdd.
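
A cache keyed by structural value can be modeled as follows. This is a Python sketch with hypothetical names (`KernelKey`, `KernelCache`); a frozen dataclass supplies value-based equality and hashing. Unlike ConcurrentDictionary.GetOrAdd, which may invoke the value factory more than once under contention, this sketch takes a lock so the factory runs once per key.

```python
import threading
from dataclasses import dataclass


@dataclass(frozen=True)
class KernelKey:
    """Hypothetical structural key: equal fields mean equal keys."""
    op: str
    dtype: str
    vector_bits: int


class KernelCache:
    def __init__(self):
        self._lock = threading.Lock()
        self._kernels = {}
        self.builds = 0  # counts factory invocations, for illustration only

    def get_or_add(self, key, factory):
        kernel = self._kernels.get(key)
        if kernel is None:
            with self._lock:
                kernel = self._kernels.get(key)  # re-check under the lock
                if kernel is None:
                    kernel = factory(key)
                    self.builds += 1
                    self._kernels[key] = kernel
        return kernel


cache = KernelCache()
first = cache.get_or_add(KernelKey("add", "int32", 256), lambda k: object())
second = cache.get_or_add(KernelKey("add", "int32", 256), lambda k: object())
print(first is second, cache.builds)  # True 1
```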


## Variations for Asserting, Validating, Comparing and Experimenting
These variations cover the range of possible inputs for which our output must match NumPy's to reach complete parity.
Total: 51 listed variations (25 single-array layouts, 6 pairwise paths, 8 per-operand flags, 8 iteration flags, 4 composite execution paths).

### A. Single-array layouts

- C-contiguous — Row-major, stride[-1]==1 and stride[i]==shape[i+1]*stride[i+1]; baseline fast path via IsContiguous.
- F-contiguous — Column-major, stride[0]==1; 1-D arrays are both. Detected via IsFContiguous.
- Strided / non-contiguous — Arbitrary strides, neither C nor F; built via step slicing or axis swap.
- Transposed — Strides permuted by .T / swapaxes / moveaxis; usually non-contig.
- Negative-stride view — Reversed slicing ([::-1]); strides are signed-negative.
- Simple slice — offset!=0, not broadcast; fast GetOffsetSimple path (IsSimpleSlice).
- Sliced + composed — a[1:5].T, a[1:3][:,None,:]; offset combined with permutation or broadcast.
- Broadcasted — stride=0 with dim>1 (BROADCASTED flag); read-only per NumPy.
- Scalar-broadcast — All strides zero (IsScalarBroadcast); load value once and reuse.
- Partial broadcast — Some axes stride=0, others not; common (1,N)→(M,N) case.
- Scalar (0-d) — ndim==0, size==1, no strides.
- 0-D view from integer indexing — a[0,0,0] shares storage; distinct from np.array(5.0) which owns.
- 1-element 1-D — ndim==1, size==1; ambiguous against 0-d in some paths.
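- Empty — size==0 (e.g. np.zeros((0,3))); reductions must return the identity where one exists (sum→0, prod→1), while NumPy raises ValueError for max/min, which have no identity.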
- Empty + composed — np.zeros((0,3))[::2,:]; rare but must not crash.
- NewAxis-inserted dim — a[None,:] adds dim=1, stride=0; not flagged broadcast since dim=1.
- Singleton dim (dim=1) — Stride is moot; NumPy treats as contig.
- Higher-rank (5+D) — Stack-allocated coord/stride arrays in kernels may have bounds.
- Stride > bufferSize — Negative-stride views can have offset + stride*(dim-1) >= bufferSize.
- Reshape view vs copy — Reshape returns a view when contig allows, materializes otherwise.
- Fancy-indexed result — Always a fresh C-contig owning array, never a view.
- Boolean-mask result — Always a contig owning copy.
- Read-only / non-writeable — IsWriteable==false (set on broadcast views); writes throw.
- Non-owning view — OwnsData==false; writes alias the parent.
- Aligned — ALIGNED flag; always true for managed allocs but a real NumPy axis.
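
The contiguity rules in this list can be written down as two small predicates. This is a sketch under assumptions, not the library's IsContiguous/IsFContiguous: the function names are hypothetical, strides are in element units, and size-1 dims are ignored and empty arrays count as contiguous, following NumPy's convention.

```python
def is_c_contiguous(shape, strides):
    """Row-major check: the last axis has stride 1, each outer stride is the product of inner dims."""
    if 0 in shape:
        return True
    expected = 1
    for dim, stride in zip(reversed(shape), reversed(strides)):
        if dim != 1 and stride != expected:  # the stride of a size-1 dim is moot
            return False
        expected *= dim
    return True


def is_f_contiguous(shape, strides):
    """Column-major check: same rule, walking the axes from first to last."""
    if 0 in shape:
        return True
    expected = 1
    for dim, stride in zip(shape, strides):
        if dim != 1 and stride != expected:
            return False
        expected *= dim
    return True


print(is_c_contiguous((3, 4), (4, 1)), is_f_contiguous((3, 4), (4, 1)))  # True False
print(is_c_contiguous((3, 4), (1, 3)), is_f_contiguous((3, 4), (1, 3)))  # False True
```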

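The broadcast layouts above (scalar-broadcast, partial broadcast) all come from one rule: an axis that is stretched gets stride 0. A minimal Python sketch, with the hypothetical name `broadcast_strides` and element-unit strides:

```python
def broadcast_strides(shape, strides, target_shape):
    """Strides of `shape` viewed as `target_shape`; stretched or new axes get stride 0."""
    gap = len(target_shape) - len(shape)
    if gap < 0:
        raise ValueError("target has fewer dimensions than the source")
    out = []
    for axis, target in enumerate(target_shape):
        src = axis - gap
        if src < 0:
            out.append(0)             # axis prepended by broadcasting
        elif shape[src] == target:
            out.append(strides[src])  # axis kept as-is
        elif shape[src] == 1:
            out.append(0)             # size-1 axis stretched to `target`
        else:
            raise ValueError("shapes are not broadcast-compatible")
    return tuple(out)


print(broadcast_strides((1, 4), (4, 1), (3, 4)))  # (0, 1)  partial broadcast
print(broadcast_strides((), (), (2, 3)))          # (0, 0)  scalar-broadcast
```
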
### B. Pairwise (binary-op) paths — MixedTypeKernelKey.Path

- SimdFull — Both operands C-contig same dtype; SIMD baseline.
- SimdScalarRight — RHS is 0-d / scalar-broadcast, LHS is array.
- SimdScalarLeft — LHS is 0-d / scalar-broadcast, RHS is array.
- SimdChunk — Inner dim contig for both, outer strided.
- General — Arbitrary strides on either side; coordinate iteration.
- Mixed dtypes — Orthogonal axis: same layout, different LHS/RHS/result dtypes (NEP50 promotion).
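
One way to picture how a pair of operands maps onto these paths is the classifier below. It is a hypothetical Python sketch: the order of the checks, the `(shape, strides)` operand description and all names are assumptions rather than the library's selection logic, and dtype (the orthogonal axis) is left out.

```python
from enum import Enum


class Path(Enum):
    SIMD_FULL = "SimdFull"
    SIMD_SCALAR_RIGHT = "SimdScalarRight"
    SIMD_SCALAR_LEFT = "SimdScalarLeft"
    SIMD_CHUNK = "SimdChunk"
    GENERAL = "General"


def _c_contiguous(shape, strides):
    expected = 1
    for dim, stride in zip(reversed(shape), reversed(strides)):
        if dim != 1 and stride != expected:
            return False
        expected *= dim
    return True


def classify(lhs, rhs):
    """lhs/rhs are (shape, strides) pairs already broadcast to a common shape."""
    (lshape, lstrides), (rshape, rstrides) = lhs, rhs
    l_contig = _c_contiguous(lshape, lstrides)
    r_contig = _c_contiguous(rshape, rstrides)
    if l_contig and r_contig:
        return Path.SIMD_FULL
    if all(s == 0 for s in rstrides) and l_contig:
        return Path.SIMD_SCALAR_RIGHT
    if all(s == 0 for s in lstrides) and r_contig:
        return Path.SIMD_SCALAR_LEFT
    if lstrides and lstrides[-1] == 1 and rstrides[-1] == 1:
        return Path.SIMD_CHUNK  # inner axis contiguous on both sides
    return Path.GENERAL


print(classify(((3, 4), (4, 1)), ((3, 4), (0, 0))))  # Path.SIMD_SCALAR_RIGHT
```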

### C. Per-operand variations — NpyIterOpFlags

- Aliased operands — Same buffer on both sides (a + a, out=a); no non-aliasing assumption.
- Overlapping views — Two views with partial overlap (a[1:] and a[:-1]); writes can clobber unread reads.
- In-place output (out=) — Output aliases an input; loop order must respect read-before-write.
- Reduction operand — Output has stride=0 along the reduction axis (REDUCE flag).
- Write-masked operand — WRITEMASKED: write only where mask (ARRAYMASK) is true. Enforced ONLY at buffered copy-back (NumPy parity); unbuffered = kernel contract.
- Virtual operand — VIRTUAL: null operand, allocate-equivalent in NumPy 2.x (real backing array, dtype request discarded → common dtype).
- Buffered / casting operand — CAST / FORCECOPY / HAS_WRITEBACK: type conversion needs a temp.
- Read-only operand — READ without WRITE; matters for output selection.
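
The overlapping-views hazard is easiest to see on the assignment a[1:] = a[:-1]. In this Python model a list stands in for the shared buffer and both function names are hypothetical; the forward loop overwrites elements before they are read, while the backward loop gives the result that copying the source first would produce.

```python
def shift_right_naive(buf):
    """a[1:] = a[:-1] with a forward loop: each write clobbers the next read."""
    for i in range(len(buf) - 1):
        buf[i + 1] = buf[i]
    return buf


def shift_right_safe(buf):
    """Same assignment, iterating backwards so every read precedes its overwrite."""
    for i in range(len(buf) - 2, -1, -1):
        buf[i + 1] = buf[i]
    return buf


print(shift_right_naive([1, 2, 3, 4]))  # [1, 1, 1, 1]
print(shift_right_safe([1, 2, 3, 4]))   # [1, 1, 2, 3]
```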

### D. Iteration-level variations — NpyIterFlags

- Coalesced dimensions — Consecutive axes with matching strides collapsed; ndim=4 may arrive as ndim=1.
- IDENTPERM vs NEGPERM — Axis iteration order: identity vs flipped (negative stride on some axis).
- External loop (EXLOOP) — Kernel sees only the inner axis; outer loop driven by iterator.
- Ranged iteration (RANGE) — Partial traversal of a subset.
- GROWINNER — Inner-loop length varies across outer iterations.
- GATHER_ELIGIBLE — Strided inner axis but dtype supports AVX2 gather.
- Early exit — Short-circuit (All/Any/IsAllZero) is a KERNEL property (`SupportsEarlyExit`/`ShouldExit`), not an iterator flag.
- PARALLEL_SAFE — Iteration range splittable across workers: no REDUCE operand, ≤1 WRITE operand with COPY_IF_OVERLAP-resolved overlap (`IsParallelSafe`).
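
Dimension coalescing can be sketched for a single operand as below. This is a simplified Python model with a hypothetical `coalesce` function and element-unit strides; a real iterator has to apply the same test to every operand at once before merging an axis pair.

```python
def coalesce(shape, strides):
    """Merge adjacent axes that can be walked as a single axis."""
    out_shape, out_strides = [], []
    for dim, stride in zip(shape, strides):
        if dim == 1:
            continue  # size-1 axes contribute nothing to the walk
        if out_shape and out_strides[-1] == dim * stride:
            out_shape[-1] *= dim      # outer axis steps exactly over the inner one
            out_strides[-1] = stride
        else:
            out_shape.append(dim)
            out_strides.append(stride)
    return tuple(out_shape), tuple(out_strides)


print(coalesce((2, 3, 4, 5), (60, 20, 5, 1)))  # ((120,), (1,))
print(coalesce((3, 4), (8, 2)))                # ((12,), (2,))
print(coalesce((3, 4), (1, 3)))                # ((3, 4), (1, 3))
```

The first call is the ndim=4 to ndim=1 case mentioned above; the transposed layout in the last call cannot be merged and passes through unchanged.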

### E. NpyIter composite execution paths

- Source-broadcast + dest-contig — Common reduction shape.
- Source-contig + dest-strided — Writing into a sliced output.
- Buffer-required path — Dtype mismatch or alignment forces NpyIter to insert a temp; kernel sees contig but indirect.
- Reused reduce loops — REUSE_REDUCE_LOOPS: inner-loop kernel runs against successive output positions without re-derivation.

37 changes: 0 additions & 37 deletions .claude/skills/np-function/SKILL.md

This file was deleted.
