feat(durable): large-payload overflow for Durable Execution (batch ReplayChildren + batcher byte cap) by GarrettBeatty · Pull Request #2411 · aws/aws-lambda-dotnet

GarrettBeatty · 2026-06-08T17:30:59Z

Summary

Adds large-payload overflow handling to Amazon.Lambda.DurableExecution, bringing the .NET SDK to parity with the Python/Java/JS SDKs.

A single durable operation's checkpoint can't exceed 256 KB. Today, a Parallel/Map/child-context op whose results exceed that limit fails to checkpoint. This PR lets those results survive replay without storing them: the SDK records just enough to re-derive them and re-runs the bodies on the next invoke.

How it works

When a FLAT concurrent op (or a child context) finishes and its serialized result exceeds 256 KB:

The inline per-unit results are stripped from the checkpoint — only Index / Name / Status + the CompletionReason are kept.
ContextOptions.ReplayChildren = true is set on the parent CONTEXT op.
The full result stays in memory, so the current invoke returns normally.

On a later replay, the inbound ReplayChildren flag routes the op to re-execute the unit bodies to recover the stripped values — reading status/reason from the frozen summary (authoritative) and never re-checkpointing the already-terminal parent.
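The strip-and-flag step described above can be sketched roughly as follows. This is an illustration only, not the code in this PR: `UnitSummary`, `BatchSummary`, `OverflowSketch`, and `PrepareForCheckpoint` are hypothetical names, the limit is assumed to be exactly 256 * 1024 bytes, and `System.Text.Json` is used as a stand-in for whatever serializer the SDK actually uses.

```csharp
#nullable enable
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text.Json;

// Hypothetical shapes for illustration; not the SDK's real types.
public sealed record UnitSummary(int Index, string Name, string Status, string? Result);

public sealed record BatchSummary(
    IReadOnlyList<UnitSummary> Units,
    string CompletionReason,
    bool ReplayChildren = false);

public static class OverflowSketch
{
    // The description gives 256 KB; the exact byte count is an assumption.
    public const int MaxCheckpointBytes = 256 * 1024;

    // Returns the summary to checkpoint. The caller keeps the full summary
    // in memory so the current invoke can still return the real values.
    public static BatchSummary PrepareForCheckpoint(BatchSummary full)
    {
        int serializedBytes = JsonSerializer.SerializeToUtf8Bytes(full).Length;
        if (serializedBytes <= MaxCheckpointBytes)
        {
            return full; // fits: checkpoint as-is
        }

        // Over the limit: drop the inline results, keep Index/Name/Status
        // and the completion reason, and flag the op for replay.
        var stripped = full.Units.Select(u => u with { Result = null }).ToList();
        return new BatchSummary(stripped, full.CompletionReason, ReplayChildren: true);
    }
}
```

In this sketch the stripped copy is what gets written, while the original `BatchSummary` instance is what the caller returns to user code.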

Example

var result = await ctx.ParallelAsync(
    new Func<IDurableContext, CancellationToken, Task<string>>[]
    {
        (c, t) => FetchLargeDocAsync("a"),  // each returns ~200 KB
        (c, t) => FetchLargeDocAsync("b"),
    },
    config: new ParallelConfig { NestingType = NestingType.Flat });

The combined summary is ~400 KB → over the limit.

First invoke: both branches run, result holds the full 400 KB. The checkpoint is written without the inline values, flagged ReplayChildren=true.
Replay invoke: the parent is already SUCCEEDED, so its statuses/reason come from the frozen summary; the two branch bodies are re-executed to rebuild the 400 KB in memory. No new checkpoint is written.

Units recorded as Started (short-circuited, never dispatched) are not re-run on replay, so there are no spurious side effects — matching Python/Java/JS.
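The status gate on replay might look something like the sketch below. It is a simplified illustration under stated assumptions: `UnitStatus`, `FrozenUnit`, and `ReplaySketch.RecoverValuesAsync` are hypothetical names, bodies are modeled as plain `Func<Task<string>>` delegates, and only units frozen as succeeded are re-run (failed-unit error recovery, which the PR also covers, is left out).

```csharp
#nullable enable
using System;
using System.Collections.Generic;
using System.Threading.Tasks;

// Hypothetical status values and summary entry; not the SDK's real types.
public enum UnitStatus { Started, Succeeded, Failed }

public sealed record FrozenUnit(int Index, UnitStatus Status);

public static class ReplaySketch
{
    // Rebuilds stripped values in memory. The frozen summary decides which
    // bodies run; nothing here writes a checkpoint.
    public static async Task<string?[]> RecoverValuesAsync(
        IReadOnlyList<FrozenUnit> frozen,
        IReadOnlyList<Func<Task<string>>> bodies)
    {
        var values = new string?[bodies.Count];
        foreach (var unit in frozen)
        {
            // Units frozen as Started were never dispatched, so re-running
            // them would introduce side effects that never happened.
            if (unit.Status != UnitStatus.Succeeded)
            {
                continue;
            }
            values[unit.Index] = await bodies[unit.Index]();
        }
        return values;
    }
}
```

Entries for skipped units stay `null` in the returned array; their status still comes from the frozen summary.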

Scope: which operations get this

ReplayChildren recovery works by re-executing the body on replay. That's only safe for operations whose children replay from their own checkpoints, so it is applied to exactly three operation types:

✅ RunInChildContextAsync
✅ ParallelAsync
✅ MapAsync

It is deliberately not applied to plain StepAsync or chained InvokeAsync. Re-running a step body to recover a stripped result would re-trigger its side effects, breaking step's at-most-once guarantee — so an oversized step/invoke result is checkpointed as-is and rejected by the service, exactly as it is today. All three sibling SDKs (Python/Java/JS) make the same choice — the overflow path lives only in their child-context/parallel/map code, never in step/invoke.

Also included

CheckpointBatcher now enforces the MaxBatchBytes (~750 KB) service request limit: it pre-flushes before an item would push a batch over the byte/count cap, and sends a lone oversized item by itself. Map/parallel fan-out is what actually fills this.
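As a rough illustration of the pre-flush rule, here is a minimal synchronous sketch. It is not the actual `CheckpointBatcher`: `ByteCappedBatcher`, its constructor parameters, and the `send` callback are hypothetical, and items are modeled as already-serialized byte arrays.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical batcher for illustration; not the SDK's CheckpointBatcher.
public sealed class ByteCappedBatcher
{
    private readonly int _maxBytes;
    private readonly int _maxCount;
    private readonly Action<IReadOnlyList<byte[]>> _send;
    private List<byte[]> _pending = new List<byte[]>();
    private int _pendingBytes;

    public ByteCappedBatcher(int maxBytes, int maxCount, Action<IReadOnlyList<byte[]>> send)
    {
        _maxBytes = maxBytes;
        _maxCount = maxCount;
        _send = send;
    }

    public void Add(byte[] item)
    {
        // Pre-flush: if this item would push a non-empty batch over the
        // byte or count cap, send what is pending first.
        bool wouldOverflow = _pending.Count > 0 &&
            (_pendingBytes + item.Length > _maxBytes || _pending.Count + 1 > _maxCount);
        if (wouldOverflow)
        {
            Flush();
        }

        _pending.Add(item);
        _pendingBytes += item.Length;

        // A batch that has reached a cap is sent right away. This is also
        // how a lone oversized item ends up in a request by itself.
        if (_pendingBytes >= _maxBytes || _pending.Count >= _maxCount)
        {
            Flush();
        }
    }

    public void Flush()
    {
        if (_pending.Count == 0)
        {
            return;
        }
        var batch = _pending;
        _pending = new List<byte[]>();
        _pendingBytes = 0;
        _send(batch);
    }
}
```

A real batcher would also flush on a timer or on demand and would be safe for concurrent callers; both are omitted here to keep the cap logic visible.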

Deferred: final Lambda response > 6 MB (depends on the service accepting a function-emitted EXECUTION SUCCEED checkpoint — follow-up PR).

Test Plan

Unit tests: 398/398 on net8.0 and net10.0. Covers strip-on-checkpoint at both overflow sites, replay re-execution recovering values, Started-unit skip, failed-unit error recovery, no-re-checkpoint assertions, inbound ReplayChildren mapping, and batcher byte-cap split.
Release build clean (warnings-as-errors); AOT/trim analysis passes (no IL2026/IL3050).
Integration test (ParallelFlatOverflowTest) — authored; to be run live (requires AWS credentials + Docker).

github-advanced-security

Semgrep OSS found more than 20 potential problems in the proposed changes. Check the Files changed tab for more details.

+
+COPY bin/publish/ ${LAMBDA_TASK_ROOT}
+
+ENTRYPOINT ["/var/task/bootstrap"]


feat(durable): plumb inbound ContextDetails.ReplayChildren
feat(durable): strip Flat batch summary + set ReplayChildren on overflow
feat(durable): re-execute units on ReplayChildren overflow replay
feat(durable): gate overflow-replay re-execution by frozen unit status
feat(durable): ChildContext single-child overflow via ReplayChildren
feat(durable): suppress terminal re-checkpoint on ChildContext overflow replay
feat(durable): enforce CheckpointBatcher byte cap
docs(durable): document ChildContext overflow replay branch
docs(durable): note payload-size bound makes byte estimate int-safe
test(durable): integration test for large-payload overflow replay

github-advanced-security AI found potential problems Jun 8, 2026


GarrettBeatty changed the base branch from master to gcbeatty/durable-parallel June 8, 2026 17:32

GarrettBeatty changed the base branch from gcbeatty/durable-parallel to gcbeatty/flat June 8, 2026 17:32

GarrettBeatty force-pushed the gcbeatty/flat branch from 54a24e4 to e1db9a7 Compare June 8, 2026 18:38

github-advanced-security AI found potential problems Jun 10, 2026


GarrettBeatty force-pushed the gcbeatty/flat branch 2 times, most recently from f01d9f7 to bb1b112 Compare June 17, 2026 18:41

GarrettBeatty force-pushed the gcbeatty/durable-payload-overflow branch 3 times, most recently from 6130bb9 to 8eaa8b8 Compare June 18, 2026 17:54

GarrettBeatty force-pushed the gcbeatty/durable-payload-overflow branch from 8eaa8b8 to 425765d Compare June 18, 2026 17:58

GarrettBeatty marked this pull request as ready for review June 18, 2026 17:58

GarrettBeatty requested review from a team as code owners June 18, 2026 17:58

GarrettBeatty requested review from normj and philasmar and removed request for a team June 18, 2026 17:58

normj approved these changes Jun 19, 2026

GarrettBeatty wants to merge 1 commit into gcbeatty/flat from gcbeatty/durable-payload-overflow

3 participants