feat(durable): large-payload overflow for Durable Execution (batch ReplayChildren + batcher byte cap)#2411
Open
GarrettBeatty wants to merge 1 commit into
Semgrep OSS found more than 20 potential problems in the proposed changes. Check the Files changed tab for more details.
54a24e4 to e1db9a7
COPY bin/publish/ ${LAMBDA_TASK_ROOT}
ENTRYPOINT ["/var/task/bootstrap"]
f01d9f7 to bb1b112
6130bb9 to 8eaa8b8
feat(durable): plumb inbound ContextDetails.ReplayChildren
feat(durable): strip Flat batch summary + set ReplayChildren on overflow
feat(durable): re-execute units on ReplayChildren overflow replay
feat(durable): gate overflow-replay re-execution by frozen unit status
feat(durable): ChildContext single-child overflow via ReplayChildren
feat(durable): suppress terminal re-checkpoint on ChildContext overflow replay
feat(durable): enforce CheckpointBatcher byte cap
docs(durable): document ChildContext overflow replay branch
docs(durable): note payload-size bound makes byte estimate int-safe
test(durable): integration test for large-payload overflow replay
8eaa8b8 to 425765d
normj
approved these changes
Jun 19, 2026
Summary

Adds large-payload overflow handling to `Amazon.Lambda.DurableExecution`, bringing the .NET SDK to parity with the Python/Java/JS SDKs.

A single durable operation's checkpoint can't exceed 256 KB. Today, a Parallel/Map/child-context op whose results are larger than that fails to checkpoint. This PR makes those results survive replay by not storing them: instead the SDK records just enough to re-derive them, and re-runs the bodies on the next invoke.

How it works
When a FLAT concurrent op (or a child context) finishes and its serialized result exceeds 256 KB:

- The inline result values are left out of the checkpoint; `Index`/`Name`/`Status` + the `CompletionReason` are kept.
- `ContextOptions.ReplayChildren = true` is set on the parent `CONTEXT` op.

On a later replay, the inbound `ReplayChildren` flag routes the op to re-execute the unit bodies to recover the stripped values, reading status/reason from the frozen summary (authoritative) and never re-checkpointing the already-terminal parent.

Example
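A minimal, self-contained sketch of this scenario follows. It is a model of the behavior described in this PR, not the SDK's API: `UnitSummary`, `Checkpoint`, `OverflowSketch`, the `"ALL_COMPLETED"` reason string, and the byte estimate are all hypothetical names and simplifications, and the two ~200 KB branches are an assumed split of the ~400 KB total.

```csharp
#nullable enable
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text;

// Two branch bodies, each returning ~200 KB (assumed sizes for illustration).
var bodies = new List<Func<string>>
{
    () => new string('a', 200 * 1024),
    () => new string('b', 200 * 1024),
};

// First invoke: run both bodies, then build the checkpoint.
var units = bodies
    .Select((body, i) => new UnitSummary(i, $"branch-{i}", UnitStatus.Succeeded, body()))
    .ToList();
var checkpoint = OverflowSketch.BuildCheckpoint(units, "ALL_COMPLETED");
Console.WriteLine($"ReplayChildren={checkpoint.ReplayChildren}"); // prints "ReplayChildren=True"

// Replay: re-run the bodies to rebuild the stripped values in memory.
var recovered = OverflowSketch.Replay(checkpoint, bodies);
Console.WriteLine($"Recovered chars: {recovered.Sum(u => u.Result?.Length ?? 0)}"); // prints "Recovered chars: 409600"

// Hypothetical model types; none of these are the SDK's real types.
enum UnitStatus { Started, Succeeded }

record UnitSummary(int Index, string Name, UnitStatus Status, string? Result);

record Checkpoint(IReadOnlyList<UnitSummary> Units, string CompletionReason, bool ReplayChildren);

static class OverflowSketch
{
    // Per-operation checkpoint limit stated in this PR.
    public const int MaxCheckpointBytes = 256 * 1024;

    // Simplified size estimate: UTF-8 bytes of the inline results only.
    static long EstimateBytes(IEnumerable<UnitSummary> units) =>
        units.Sum(u => (long)Encoding.UTF8.GetByteCount(u.Result ?? ""));

    // Keep inline results if they fit; otherwise strip them and flag the parent.
    public static Checkpoint BuildCheckpoint(IReadOnlyList<UnitSummary> units, string reason)
    {
        if (EstimateBytes(units) <= MaxCheckpointBytes)
            return new Checkpoint(units, reason, ReplayChildren: false);

        var stripped = units.Select(u => u with { Result = null }).ToList();
        return new Checkpoint(stripped, reason, ReplayChildren: true);
    }

    // Statuses come from the frozen summary; units frozen as Started are never re-run.
    public static IReadOnlyList<UnitSummary> Replay(Checkpoint frozen, IReadOnlyList<Func<string>> bodies)
    {
        if (!frozen.ReplayChildren) return frozen.Units;
        return frozen.Units
            .Select(u => u.Status == UnitStatus.Started ? u : u with { Result = bodies[u.Index]() })
            .ToList();
    }
}
```

Note that `Replay` returns new in-memory summaries and writes nothing back, mirroring the "no new checkpoint" behavior described for the already-terminal parent.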
The combined summary is ~400 KB → over the limit.

1. First invoke: `result` holds the full 400 KB. The checkpoint is written without the inline values, flagged `ReplayChildren=true`.
2. Replay: the parent op is already `SUCCEEDED`, so its statuses/reason come from the frozen summary; the two branch bodies are re-executed to rebuild the 400 KB in memory. No new checkpoint is written.

Units recorded as `Started` (short-circuited, never dispatched) are not re-run on replay, so there are no spurious side effects, matching Python/Java/JS.

Scope: which operations get this
`ReplayChildren` recovery works by re-executing the body on replay. That's only safe for operations whose children replay from their own checkpoints, so it is applied to exactly three operation types:

- `RunInChildContextAsync`
- `ParallelAsync`
- `MapAsync`

It is deliberately not applied to plain `StepAsync` or chained `InvokeAsync`. Re-running a `step` body to recover a stripped result would re-trigger its side effects, breaking step's at-most-once guarantee, so an oversized step/invoke result is checkpointed as-is and rejected by the service, exactly as it is today. All three sibling SDKs (Python/Java/JS) make the same choice: the overflow path lives only in their child-context/parallel/map code, never in `step`/`invoke`.

Also included
`CheckpointBatcher` now enforces the `MaxBatchBytes` (~750 KB) service request limit: it pre-flushes before an item would push a batch over the byte/count cap, and sends a lone oversized item by itself. Map/parallel fan-out is what actually fills this.

Deferred: final Lambda response > 6 MB (depends on the service accepting a function-emitted `EXECUTION SUCCEED` checkpoint; follow-up PR).

Test Plan
- `Started`-unit skip, failed-unit error recovery, no-re-checkpoint assertions, inbound `ReplayChildren` mapping, and batcher byte-cap split.
- … `IL2026`/`IL3050`).
- Integration test (`ParallelFlatOverflowTest`): authored; to be run live (requires AWS credentials + Docker).
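The batcher byte-cap split exercised above can be pictured with a small stand-in. This is a sketch under stated assumptions, not the `CheckpointBatcher` implementation: `ByteCappedBatcher` is a hypothetical class, items are represented only by their byte size, the caps are tiny demo values rather than the real ~750 KB limit, and sending a lone oversized item immediately (rather than on the next flush) is an assumption.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Demo with small, assumed limits so the splits are easy to see.
var sent = new List<int[]>();
var batcher = new ByteCappedBatcher(maxBatchBytes: 100, maxBatchCount: 3, send: b => sent.Add(b.ToArray()));

foreach (var size in new[] { 40, 40, 40, 250, 10 })
    batcher.Add(size);
batcher.Flush();

// prints "40,40 | 40 | 250 | 10"
Console.WriteLine(string.Join(" | ", sent.Select(b => string.Join(",", b))));

// Hypothetical stand-in for the batcher; each item is just its size in bytes.
sealed class ByteCappedBatcher
{
    private readonly int _maxBatchBytes;
    private readonly int _maxBatchCount;
    private readonly Action<IReadOnlyList<int>> _send;
    private readonly List<int> _pending = new();
    private long _pendingBytes;

    public ByteCappedBatcher(int maxBatchBytes, int maxBatchCount, Action<IReadOnlyList<int>> send)
    {
        _maxBatchBytes = maxBatchBytes;
        _maxBatchCount = maxBatchCount;
        _send = send;
    }

    public void Add(int itemBytes)
    {
        // Pre-flush if this item would push the pending batch over either cap.
        bool overBytes = _pendingBytes + itemBytes > _maxBatchBytes;
        bool overCount = _pending.Count + 1 > _maxBatchCount;
        if (_pending.Count > 0 && (overBytes || overCount))
            Flush();

        _pending.Add(itemBytes);
        _pendingBytes += itemBytes;

        // A lone item larger than the byte cap goes out by itself (assumed: right away).
        if (itemBytes > _maxBatchBytes)
            Flush();
    }

    public void Flush()
    {
        if (_pending.Count == 0) return;
        _send(_pending.ToArray());
        _pending.Clear();
        _pendingBytes = 0;
    }
}
```

Checking the caps before appending, rather than after, is what keeps every multi-item batch within the limit; only a single item that is itself over the cap can produce an oversized request.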