feat(durable): large-payload overflow for Durable Execution (batch ReplayChildren + batcher byte cap) #2411

Open
GarrettBeatty wants to merge 1 commit into gcbeatty/flat from gcbeatty/durable-payload-overflow

GarrettBeatty (Contributor) commented Jun 8, 2026

Summary

Adds large-payload overflow handling to Amazon.Lambda.DurableExecution, bringing the .NET SDK to parity with the Python/Java/JS SDKs.

A single durable operation's checkpoint can't exceed 256 KB. Today, a Parallel/Map/child-context op whose results are larger than that fails to checkpoint. This PR makes those results survive replay by not storing them — instead the SDK records just enough to re-derive them, and re-runs the bodies on the next invoke.

How it works

When a FLAT concurrent op (or a child context) finishes and its serialized result exceeds 256 KB:

  1. The inline per-unit results are stripped from the checkpoint — only Index / Name / Status + the CompletionReason are kept.
  2. ContextOptions.ReplayChildren = true is set on the parent CONTEXT op.
  3. The full result stays in memory, so the current invoke returns normally.

On a later replay, the inbound ReplayChildren flag routes the op to re-execute the unit bodies and recover the stripped values. Status and completion reason are read from the frozen summary, which stays authoritative, and the already-terminal parent is never re-checkpointed.
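The strip-on-overflow decision described above can be sketched roughly as follows. This is a minimal illustration, not the SDK's code: `UnitSummary`, `BatchSummary`, `CheckpointPayload`, and `OverflowSketch` are hypothetical names, and measuring the payload as UTF-8 bytes of `System.Text.Json` output is an assumption.

```csharp
#nullable enable
using System.Collections.Generic;
using System.Linq;
using System.Text;
using System.Text.Json;

// Hypothetical shapes for illustration only; not the SDK's actual types.
public sealed record UnitSummary(int Index, string Name, string Status, string? Result);
public sealed record BatchSummary(string CompletionReason, IReadOnlyList<UnitSummary> Units);
public sealed record CheckpointPayload(string Json, bool ReplayChildren);

public static class OverflowSketch
{
    // Per-operation checkpoint limit stated in this PR description.
    public const int MaxCheckpointBytes = 256 * 1024;

    public static CheckpointPayload BuildCheckpoint(BatchSummary summary)
    {
        string json = JsonSerializer.Serialize(summary);

        // Assumption: size is measured as UTF-8 bytes of the serialized summary.
        if (Encoding.UTF8.GetByteCount(json) <= MaxCheckpointBytes)
            return new CheckpointPayload(json, ReplayChildren: false);

        // Over the limit: drop the inline per-unit results and keep only
        // Index / Name / Status plus the CompletionReason.
        var stripped = summary with
        {
            Units = summary.Units.Select(u => u with { Result = null }).ToList()
        };

        // The caller still holds the full `summary` in memory for this invoke.
        return new CheckpointPayload(JsonSerializer.Serialize(stripped), ReplayChildren: true);
    }
}
```

The original `summary` is never mutated, which mirrors step 3: the current invoke keeps returning the full result while only the checkpointed copy is stripped.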

Example

var result = await ctx.ParallelAsync(
    new Func<IDurableContext, CancellationToken, Task<string>>[]
    {
        (c, t) => FetchLargeDocAsync("a"),  // each returns ~200 KB
        (c, t) => FetchLargeDocAsync("b"),
    },
    config: new ParallelConfig { NestingType = NestingType.Flat });

The combined summary is ~400 KB (2 × ~200 KB), which exceeds the 256 KB limit.

  • First invoke: both branches run, result holds the full 400 KB. The checkpoint is written without the inline values, flagged ReplayChildren=true.
  • Replay invoke: the parent is already SUCCEEDED, so its statuses/reason come from the frozen summary; the two branch bodies are re-executed to rebuild the 400 KB in memory. No new checkpoint is written.

Units recorded as Started (short-circuited, never dispatched) are not re-run on replay, so there are no spurious side effects — matching Python/Java/JS.
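The replay-side gating can be sketched as below, under simplifying assumptions. `UnitStatus`, `FrozenUnit`, and `ReplaySketch` are hypothetical names; the sketch re-runs units one at a time rather than concurrently, and it only covers units frozen as succeeded, leaving out the failed-unit error recovery that the PR also handles.

```csharp
#nullable enable
using System;
using System.Collections.Generic;
using System.Threading.Tasks;

// Hypothetical types for illustration only; not the SDK's actual types.
public enum UnitStatus { Started, Succeeded, Failed }
public sealed record FrozenUnit(int Index, UnitStatus Status);

public static class ReplaySketch
{
    // Rebuilds stripped values in memory; writes no checkpoint.
    public static async Task<string?[]> RecoverAsync(
        IReadOnlyList<FrozenUnit> frozen,
        IReadOnlyList<Func<Task<string>>> bodies)
    {
        var results = new string?[bodies.Count];
        foreach (var unit in frozen)
        {
            // The frozen status is authoritative: a unit recorded as Started
            // was never dispatched, so its body must not run on replay.
            if (unit.Status != UnitStatus.Succeeded)
                continue;

            results[unit.Index] = await bodies[unit.Index]();
        }
        return results;
    }
}
```

Gating on the frozen status rather than on whatever the bodies would do today is what keeps replay free of side effects that the first invoke never produced.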

Scope: which operations get this

ReplayChildren recovery works by re-executing the body on replay. That's only safe for operations whose children replay from their own checkpoints, so it is applied to exactly three operation types:

  • RunInChildContextAsync
  • ParallelAsync
  • MapAsync

It is deliberately not applied to plain StepAsync or chained InvokeAsync. Re-running a step body to recover a stripped result would re-trigger its side effects, breaking step's at-most-once guarantee — so an oversized step/invoke result is checkpointed as-is and rejected by the service, exactly as it is today. All three sibling SDKs (Python/Java/JS) make the same choice — the overflow path lives only in their child-context/parallel/map code, never in step/invoke.

Also included

CheckpointBatcher now enforces the MaxBatchBytes (~750 KB) service request limit: it pre-flushes before an item would push a batch over the byte/count cap, and sends a lone oversized item by itself. In practice, Map/parallel fan-out is the workload that fills a batch to this cap.
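A minimal sketch of the pre-flush rule, assuming a synchronous send callback and raw byte arrays as items. `ByteCappedBatcher` and its members are hypothetical names for illustration, not the real CheckpointBatcher API.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical batcher for illustration only; not the SDK's CheckpointBatcher.
public sealed class ByteCappedBatcher
{
    private readonly int _maxBytes;
    private readonly int _maxCount;
    private readonly Action<IReadOnlyList<byte[]>> _send;
    private readonly List<byte[]> _pending = new List<byte[]>();
    private int _pendingBytes;

    public ByteCappedBatcher(int maxBytes, int maxCount, Action<IReadOnlyList<byte[]>> send)
    {
        _maxBytes = maxBytes;
        _maxCount = maxCount;
        _send = send;
    }

    public void Add(byte[] item)
    {
        bool overBytes = _pendingBytes + item.Length > _maxBytes;
        bool overCount = _pending.Count + 1 > _maxCount;

        // Pre-flush: send what is already queued before this item would
        // push the batch past either cap.
        if (_pending.Count > 0 && (overBytes || overCount))
            Flush();

        // An item larger than the cap lands in an empty batch here, and the
        // next Add or Flush sends it alone.
        _pending.Add(item);
        _pendingBytes += item.Length;
    }

    public void Flush()
    {
        if (_pending.Count == 0) return;
        _send(_pending.ToArray()); // copy, so clearing below cannot affect the sent batch
        _pending.Clear();
        _pendingBytes = 0;
    }
}
```

Checking the caps before appending, rather than after, means a batch never exceeds the byte cap unless it consists of a single oversized item.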

Deferred: final Lambda response > 6 MB (depends on the service accepting a function-emitted EXECUTION SUCCEED checkpoint — follow-up PR).

Test Plan

  • Unit tests: 398/398 on net8.0 and net10.0. Covers strip-on-checkpoint at both overflow sites, replay re-execution recovering values, Started-unit skip, failed-unit error recovery, no-re-checkpoint assertions, inbound ReplayChildren mapping, and batcher byte-cap split.
  • Release build clean (warnings-as-errors); AOT/trim analysis passes (no IL2026/IL3050).
  • Integration test (ParallelFlatOverflowTest) — authored; to be run live (requires AWS credentials + Docker).

github-advanced-security AI left a comment

Semgrep OSS found more than 20 potential problems in the proposed changes. Check the Files changed tab for more details.

GarrettBeatty changed the base branch from master to gcbeatty/durable-parallel June 8, 2026 17:32
GarrettBeatty changed the base branch from gcbeatty/durable-parallel to gcbeatty/flat June 8, 2026 17:32

COPY bin/publish/ ${LAMBDA_TASK_ROOT}

ENTRYPOINT ["/var/task/bootstrap"]
GarrettBeatty force-pushed the gcbeatty/flat branch 2 times, most recently from f01d9f7 to bb1b112 Compare June 17, 2026 18:41
GarrettBeatty force-pushed the gcbeatty/durable-payload-overflow branch 3 times, most recently from 6130bb9 to 8eaa8b8 Compare June 18, 2026 17:54
feat(durable): plumb inbound ContextDetails.ReplayChildren

feat(durable): strip Flat batch summary + set ReplayChildren on overflow

feat(durable): re-execute units on ReplayChildren overflow replay

feat(durable): gate overflow-replay re-execution by frozen unit status

feat(durable): ChildContext single-child overflow via ReplayChildren

feat(durable): suppress terminal re-checkpoint on ChildContext overflow replay

feat(durable): enforce CheckpointBatcher byte cap

docs(durable): document ChildContext overflow replay branch

docs(durable): note payload-size bound makes byte estimate int-safe

test(durable): integration test for large-payload overflow replay
GarrettBeatty force-pushed the gcbeatty/durable-payload-overflow branch from 8eaa8b8 to 425765d Compare June 18, 2026 17:58
GarrettBeatty marked this pull request as ready for review June 18, 2026 17:58
GarrettBeatty requested review from a team as code owners June 18, 2026 17:58
GarrettBeatty requested review from normj and philasmar and removed request for a team June 18, 2026 17:58