LCORE-: Wired compaction turn persistence into agentic query flows by asimurka · Pull Request #1953 · lightspeed-core/lightspeed-stack

asimurka · 2026-06-19T12:41:15Z

Description

Wires explicit turn compaction for compacted turns in query agentic flows.

Type of change

Tools used to create PR

Identify any AI code assistants used in this PR (for transparency and review context)

Assisted-by: (e.g., Claude, CodeRabbit, Ollama, etc., N/A if not used)
Generated by: (e.g., tool name and version; N/A if not used)

Related Tickets & Documents

Related Issue #
Closes #

Checklist before requesting a review

I have performed a self-review of my code.
PR has passed all pre-merge test jobs.
If it is a core feature, I have added thorough tests.

Testing

Please provide detailed steps to perform tests related to this code change.
How were the fix/results from this change verified? Please provide relevant screenshots or results.

Summary by CodeRabbit

Refactor
- Updated conversation compaction persistence to no longer store structured output items, focusing on the core response content.
- Improved handling of the original user input during compaction for both standard and streaming queries.
- Expanded the public API to export CompactionTurnContext for compaction-related usage.
Tests
- Adjusted and removed tests to match the updated compacted-mode persistence and streaming/refusal behavior.

coderabbitai · 2026-06-19T12:41:23Z

Warning

Review limit reached

@asimurka, you've reached your PR review limit, so we couldn't start this review.

Next review available in: 15 minutes

Enable usage-based reviews in Billing to review now. Otherwise, wait until the next included review is available.
You're only billed for reviews past your plan's rate limits ($0.25/file).

How can I continue?

After more reviews become available, a review can be triggered using the @coderabbitai review command as a PR comment. Alternatively, push new commits to this PR.

To avoid repeated limits, reduce automatic review volume by pausing incremental auto-reviews earlier, using label-based review opt-in, excluding WIP or generated PR titles, or requesting reviews manually when the PR is ready. If your team needs uninterrupted high-volume reviews, an organization admin can enable usage-based reviews.

How do review limits work?

CodeRabbit enforces per-developer PR review limits for each organization. Most developers receive the normal plan review availability.

For paid Pro and Pro+ PR reviews, CodeRabbit uses adaptive limits for sustained high-volume activity. When a developer's recent PR review activity reaches the 95th percentile or higher among CodeRabbit users, additional reviews become available more gradually as earlier reviews age out of the rolling window.

Please refer docs for additional details.

Review details

⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: ASSERTIVE

Plan: Pro

Run ID: e0ba1a80-f813-4880-9643-8980d9364a6b

📥 Commits

Reviewing files that changed from the base of the PR and between fec773e and 4a19bc4.

📒 Files selected for processing (10)

src/app/endpoints/query.py
src/app/endpoints/streaming_query.py
src/models/common/turn_summary.py
src/pydantic_ai_lightspeed/llamastack/__init__.py
src/pydantic_ai_lightspeed/llamastack/_model.py
src/utils/agents/query.py
src/utils/agents/streaming.py
src/utils/pydantic_ai.py
tests/unit/app/endpoints/test_streaming_query.py
tests/unit/utils/agents/test_streaming.py

Walkthrough

The PR removes TurnSummary.output_items, adds compaction-state handling in the LlamaStack model layer, threads original_input through agent construction and retrieval helpers, and updates query/streaming endpoints and tests to use the new persistence path.

Changes

Compaction Turn Persistence Refactor

Layer / File(s)	Summary
Turn summary and stream cleanup `src/models/common/turn_summary.py`, `src/app/endpoints/streaming_query.py`	Removes `output_items` from `TurnSummary` and deletes the streaming code that filled or consumed it in refusal, completion, failure, and compacted-turn storage paths.
LlamaStack compaction context `src/pydantic_ai_lightspeed/llamastack/__init__.py`, `src/pydantic_ai_lightspeed/llamastack/_model.py`	Adds `CompactionTurnContext`, exports it, and updates the model and filtered stream to append completed turns through compaction state in streaming and non-streaming flows.
Agent original_input wiring `src/utils/pydantic_ai.py`, `src/utils/agents/query.py`, `src/utils/agents/streaming.py`	`build_agent` accepts `original_input` and constructs compaction context when present; retrieval helpers pass it through and use it for moderation-blocked conversation appends.
Endpoint threading and tests `src/app/endpoints/query.py`, `src/app/endpoints/streaming_query.py`, `tests/unit/app/endpoints/test_streaming_query.py`, `tests/unit/utils/agents/test_streaming.py`	The query and streaming endpoints now pass `original_input` explicitly, and tests are updated to remove `output_items` assertions and validate the new blocked-moderation append behavior.

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~60 minutes

Possibly related PRs

lightspeed-core/lightspeed-stack#1796: Also threads an original_input hint through the compaction-aware query/streaming flow and affects the same persistence path.
lightspeed-core/lightspeed-stack#1865: Updates /query and retrieve_agent_response to pass original_input through the same retrieval orchestration path.
lightspeed-core/lightspeed-stack#1870: Modifies the streaming query compaction flow around retrieve_agent_response_generator and generate_agent_response, matching the same call chain.

Suggested reviewers

tisnik
jrobertboos

🚥 Pre-merge checks | ✅ 5

✅ Passed checks (5 passed)

Check name	Status	Explanation
Description Check	✅ Passed	Check skipped - CodeRabbit’s high-level summary is enabled.
Title check	✅ Passed	The title accurately summarizes the main change: wiring compaction turn persistence into agentic query flows.
Docstring Coverage	✅ Passed	Docstring coverage is 95.65% which is sufficient. The required threshold is 80.00%.
Linked Issues check	✅ Passed	Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check	✅ Passed	Check skipped because no linked issues were found for this pull request.

✨ Finishing Touches

🧪 Generate unit tests (beta)

Create PR with unit tests

✨ Simplify code

Create PR with simplified code

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

_{Comment @coderabbitai help to get the list of available commands.}

coderabbitai

Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@src/pydantic_ai_lightspeed/llamastack/_model.py`:
- Around line 246-277: The _patch_responses_create method mutates the shared
singleton client.responses.create method, causing a race condition where
concurrent LlamaStackResponsesModel instances overwrite each other's patches and
execute the wrong compaction context. Instead of directly assigning to
responses_api.create, implement per-instance method wrapping that preserves the
original create method behavior while maintaining instance-specific state
without mutating the shared client, or create a per-request client instance to
avoid sharing mutable state between concurrent requests.

🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

Push a commit to this branch (recommended)
Create a new PR with the fixes

ℹ️ Review info

⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: ASSERTIVE

Plan: Pro

Run ID: 3a16c48b-ee7d-4f57-a12e-b55fc68e3576

📥 Commits

Reviewing files that changed from the base of the PR and between bcc47ed and 62527e3.

📒 Files selected for processing (10)

src/app/endpoints/query.py
src/app/endpoints/streaming_query.py
src/models/common/turn_summary.py
src/pydantic_ai_lightspeed/llamastack/__init__.py
src/pydantic_ai_lightspeed/llamastack/_model.py
src/utils/agents/query.py
src/utils/agents/streaming.py
src/utils/pydantic_ai.py
tests/unit/app/endpoints/test_streaming_query.py
tests/unit/utils/agents/test_streaming.py

💤 Files with no reviewable changes (2)

tests/unit/app/endpoints/test_streaming_query.py
src/models/common/turn_summary.py

📜 Review details

⏰ Context from checks skipped due to timeout. (2)

GitHub Check: Konflux kflux-prd-rh02 / lightspeed-stack-0-6-on-pull-request
GitHub Check: Konflux kflux-prd-rh02 / lightspeed-stack-on-pull-request

🧰 Additional context used

📓 Path-based instructions (4)

src/**/*.py

📄 CodeRabbit inference engine (AGENTS.md)

src/**/*.py: Use absolute imports for internal modules: from authentication import get_auth_dependency
Llama Stack imports: Use from llama_stack_client import AsyncLlamaStackClient
Check constants.py for shared constants before defining new ones
All modules must start with descriptive docstrings explaining purpose
Use logger = get_logger(__name__) from log.py for module logging
All functions must have complete type annotations for parameters and return types, use modern syntax (str | int), and include descriptive docstrings
Use snake_case with descriptive, action-oriented names for functions (get_, validate_, check_)
Avoid in-place parameter modification anti-patterns; return new data structures instead of modifying function parameters
Use async def for I/O operations and external API calls
Use standard log levels with clear purposes: debug() for diagnostic info, info() for program execution, warning() for unexpected events, error() for serious problems
All classes must have descriptive docstrings explaining purpose and use PascalCase with standard suffixes: Configuration, Error/Exception, Resolver, Interface
Abstract classes must use ABC with @abstractmethod decorators
Follow Google Python docstring conventions with required sections: Parameters, Returns, Raises, and Attributes for classes

Files:

src/pydantic_ai_lightspeed/llamastack/__init__.py
src/app/endpoints/query.py
src/utils/pydantic_ai.py
src/utils/agents/query.py
src/utils/agents/streaming.py
src/app/endpoints/streaming_query.py
src/pydantic_ai_lightspeed/llamastack/_model.py

src/**/__init__.py

📄 CodeRabbit inference engine (AGENTS.md)

Package __init__.py files must contain brief package descriptions

Files:

src/pydantic_ai_lightspeed/llamastack/__init__.py

src/app/**/*.py

📄 CodeRabbit inference engine (AGENTS.md)

src/app/**/*.py: FastAPI dependencies: Import from fastapi module for APIRouter, HTTPException, Request, status, Depends
Use FastAPI HTTPException with appropriate status codes for API endpoints and handle APIConnectionError from Llama Stack

Files:

src/app/endpoints/query.py
src/app/endpoints/streaming_query.py

tests/**/*.py

📄 CodeRabbit inference engine (AGENTS.md)

tests/**/*.py: Use pytest for all unit and integration tests; do not use unittest
Use pytest.mark.asyncio marker for async tests

Files:

tests/unit/utils/agents/test_streaming.py

🧠 Learnings (2)

📚 Learning: 2026-01-14T09:37:51.612Z

Learnt from: asimurka
Repo: lightspeed-core/lightspeed-stack PR: 988
File: src/app/endpoints/query.py:319-339
Timestamp: 2026-01-14T09:37:51.612Z
Learning: In the lightspeed-stack repository, when provider_id == "azure", the Azure provider with provider_type "remote::azure" is guaranteed to be present in the providers list. Therefore, avoid defensive StopIteration handling for next() when locating the Azure provider in providers within src/app/endpoints/query.py. This change applies specifically to this file (or nearby provider lookup code) and relies on the invariant that the Azure provider exists; if the invariant could be violated, keep the existing StopIteration handling.

Applied to files:

src/app/endpoints/query.py

📚 Learning: 2026-04-06T20:18:07.852Z

Learnt from: major
Repo: lightspeed-core/lightspeed-stack PR: 1463
File: src/app/endpoints/rlsapi_v1.py:266-271
Timestamp: 2026-04-06T20:18:07.852Z
Learning: In the lightspeed-stack codebase, within `src/app/endpoints/` inference/MCP endpoints, treat `tools: Optional[list[Any]]` in MCP tool definitions as an intentional, consistent typing pattern (used across `query`, `responses`, `streaming_query`, `rlsapi_v1`). Do not raise or suggest this as a typing issue during code review; changing it in isolation could break endpoint typing consistency across the codebase.

Applied to files:

src/app/endpoints/query.py
src/app/endpoints/streaming_query.py

🔇 Additional comments (18)

src/app/endpoints/streaming_query.py (5)
337-342: LGTM!

350-361: LGTM!

507-512: LGTM!

872-895: LGTM!

701-706: This code path is unreachable dead code; the compacted branch in generate_response is never executed.

The streaming_query_endpoint_handler routes all requests through either generate_response_with_compaction (for compacted paths) or generate_agent_response (standard paths). The deprecated generate_response function is never called from the handler, and all test calls use the default compacted=False. Lines 701–706 are therefore unreachable. Either remove this dead code or mark it explicitly as such with a comment.
			> Likely an incorrect or invalid review comment.
src/pydantic_ai_lightspeed/llamastack/_model.py (5)

1-26: LGTM!

28-47: LGTM!

49-90: LGTM!

91-138: LGTM!

171-190: Document the assumption about MCP argument delta format and consider adding validation.

Line 175 unconditionally appends "}" to combined MCP argument deltas. This assumes Llama Stack's MCP event stream always delivers arguments without the closing brace, but this assumption lacks documentation and validation.

If the buffered deltas already include the closing brace (or if the format changes), this produces malformed JSON. Add a comment explaining why the closing brace is required based on Llama Stack's MCP streaming behavior, and consider a defensive check: combined_args.rstrip() + "}" or verify the last buffered delta doesn't already contain }.

src/pydantic_ai_lightspeed/llamastack/__init__.py (1)

1-14: LGTM!

src/utils/pydantic_ai.py (2)

13-17: LGTM!

Also applies to: 116-121

142-157: LGTM!

src/utils/agents/query.py (1)

280-333: LGTM!

src/utils/agents/streaming.py (2)

83-138: LGTM!

141-266: LGTM!

src/app/endpoints/query.py (1)

231-237: LGTM!

tests/unit/utils/agents/test_streaming.py (1)

530-566: LGTM!

coderabbitai

Actionable comments posted: 9

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (3)

src/utils/agents/query.py (2)
285-296: 📐 Maintainability & Code Quality | 🟠 Major | ⚡ Quick win

Align the changed signature and docstring with project conventions.

Use ResponseInput | None, and prefer Parameters: over Args: in changed docstrings.

As per coding guidelines, functions should use modern union syntax. Based on learnings, docstrings should use Parameters: rather than Args:.
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@src/utils/agents/query.py` around lines 285 - 296, The function signature in
retrieve_response still uses Optional[ResponseInput] for original_input; update
it to modern union syntax ResponseInput | None to match project conventions.
Also revise the retrieve_response docstring section header from Args: to
Parameters: while keeping the existing parameter descriptions aligned with the
function’s named arguments and symbols.
Sources: Coding guidelines, Learnings

304-310: 🗄️ Data Integrity & Integration | 🟡 Minor | ⚡ Quick win

Preserve falsy original_input values.

Using or falls back to the rewritten input when original_input is a valid falsy value, such as "" or []. Check for None explicitly.
Proposed fix
-            original_input or responses_params.input,
+            original_input if original_input is not None else responses_params.input,
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@src/utils/agents/query.py` around lines 304 - 310, The blocked-moderation
branch in query logic is incorrectly using a truthiness fallback for
original_input, which can replace valid falsy values like an empty string or
list with responses_params.input. Update the append_turn_items_to_conversation
call in the moderation handling path to preserve original_input whenever it is
provided by checking explicitly for None instead of using or. Use the nearby
identifiers moderation_result, original_input, responses_params.input, and
append_turn_items_to_conversation to locate the change.
src/app/endpoints/streaming_query.py (1)
703-712: 🗄️ Data Integrity & Integration | 🟠 Major | 🏗️ Heavy lift

Compacted streaming turns now persist no assistant output.

store_compacted_turn() appends original_input plus the supplied output_items. Passing [] here means compacted streams store only the user side of the turn, so the assistant response disappears from conversation history. The blocking path in src/app/endpoints/query.py:381-386 still passes response.output, so streaming compaction now diverges from non-streaming persistence.
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@src/app/endpoints/streaming_query.py` around lines 703 - 712, The compacted
streaming path in `streaming_query.py` is dropping assistant output because
`store_compacted_turn()` is being called with an empty output list. Update the
call in the streaming compaction flow to pass the actual streamed assistant
output items, matching the non-streaming `query.py` persistence path that uses
`response.output`, so `store_compacted_turn()` records both sides of the turn.

🤖 Prompt for all review comments with AI agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@src/app/endpoints/streaming_query.py`:
- Around line 640-655: The interrupted-stream path is replaying already-emitted
text because build_interrupted_response() returns a suffix that duplicates the
partial response, and this call site then persists full_text and yields the
duplicate suffix. Update the helper contract in build_interrupted_response() so
the returned suffix contains only the new interrupt tail, not the previously
streamed prefix, and ensure full_text used by persist_interrupted_turn() does
not repeat client-visible content. Verify the streaming path in
streaming_query.py still passes turn_summary.partial_tokens through without
re-adding the prefix, and keep the token event payload limited to the
non-duplicated suffix.

In `@src/pydantic_ai_lightspeed/llamastack/_model.py`:
- Around line 91-100: The __init__ signature in _model.py should follow the
project’s modern type and docstring conventions: replace
Optional[CompactionTurnContext] with CompactionTurnContext | None, and update
the docstring header from Args: to Parameters:. Apply the same cleanup
consistently to the other affected constructor or method block referenced in the
review, keeping the naming aligned with the existing symbols like __init__ and
any related stream wrapper methods in this module.

In `@src/utils/agents/streaming.py`:
- Around line 88-98: Update the affected function signatures in streaming.py to
use modern union syntax by replacing Optional[...] with the corresponding | None
form, including the original_input parameter in the streaming helper and any
other touched signatures in this change. Also revise the related docstrings to
follow project conventions by changing the “Args:” section header to
“Parameters:” while keeping the existing parameter descriptions aligned with the
named symbols such as original_input and the streaming return helper.
- Around line 108-113: The append_turn_items_to_conversation call in
streaming.py is using `original_input or responses_params.input`, which can
incorrectly fall back when `original_input` is a valid falsy value. Update the
logic around `append_turn_items_to_conversation` to preserve falsy originals by
checking explicitly for `None` before choosing the fallback input, using the
existing `original_input` and `responses_params.input` symbols.
- Around line 236-258: The HTTPException handling in streaming.py is aborting
the whole stream too early in the topic-summary path. In the try/except around
maybe_get_topic_summary, stop yielding an error event and returning; instead log
the warning, set topic_summary = None, and continue so token consumption, the
end event, and store_query_results() still run. Keep the existing identifiers
maybe_get_topic_summary, topic_summary, and serialize_event in place while
removing the early-exit behavior.
- Around line 205-215: The interruption path is reusing already-streamed tokens,
so the client-visible text and persisted response are duplicated. Update the
handling around build_interrupted_response() in streaming.py so the returned
suffix excludes the partial text that was already yielded, and only the new
interrupt-specific tail is appended. Make sure turn_summary.llm_response is set
to the non-duplicating interrupted content before calling
persist_interrupted_turn(), and keep the persist_guard flow unchanged.

In `@src/utils/pydantic_ai.py`:
- Line 103: Update the changed type annotations in the pydantic_ai helpers to
use modern union syntax instead of Optional. Specifically, adjust the return
annotation on the relevant function signature and the other affected annotation
referenced in the review to use list[AgentCapability[Any]] | None and
ResponseInput | None, keeping the existing symbols and behavior unchanged.
- Around line 144-149: The compaction flow is still typed too narrowly for
library mode, so update the type annotations used by
CompactionTurnContext.client and append_turn_items_to_conversation to the shared
client union accepted by build_agent() (AsyncLlamaStackClient plus
AsyncLlamaStackAsLibraryClient). Adjust the relevant signatures and any related
context construction in pydantic_ai.py so compaction can receive either client
type without a type mismatch.

In `@tests/unit/utils/agents/test_streaming.py`:
- Around line 1053-1054: The interrupted-response test for
build_interrupted_response() is too weak because it only checks containment and
can miss duplicated streamed prefixes. Update the test in test_streaming.py to
assert the exact turn_summary.llm_response for the simple "Hello world" case,
and also verify the final token payload if available. Use the existing
turn_summary and INTERRUPTED_RESPONSE_MESSAGE assertions as anchors, but replace
the partial string checks with exact expected output matching the interrupted
assembly behavior.

---

Outside diff comments:
In `@src/app/endpoints/streaming_query.py`:
- Around line 703-712: The compacted streaming path in `streaming_query.py` is
dropping assistant output because `store_compacted_turn()` is being called with
an empty output list. Update the call in the streaming compaction flow to pass
the actual streamed assistant output items, matching the non-streaming
`query.py` persistence path that uses `response.output`, so
`store_compacted_turn()` records both sides of the turn.

In `@src/utils/agents/query.py`:
- Around line 285-296: The function signature in retrieve_response still uses
Optional[ResponseInput] for original_input; update it to modern union syntax
ResponseInput | None to match project conventions. Also revise the
retrieve_response docstring section header from Args: to Parameters: while
keeping the existing parameter descriptions aligned with the function’s named
arguments and symbols.
- Around line 304-310: The blocked-moderation branch in query logic is
incorrectly using a truthiness fallback for original_input, which can replace
valid falsy values like an empty string or list with responses_params.input.
Update the append_turn_items_to_conversation call in the moderation handling
path to preserve original_input whenever it is provided by checking explicitly
for None instead of using or. Use the nearby identifiers moderation_result,
original_input, responses_params.input, and append_turn_items_to_conversation to
locate the change.

🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

Push a commit to this branch (recommended)
Create a new PR with the fixes

ℹ️ Review info

⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: ASSERTIVE

Plan: Pro

Run ID: 39b58bda-ff8e-48e5-90e1-2f688dd44870

📥 Commits

Reviewing files that changed from the base of the PR and between 62527e3 and fec773e.

📒 Files selected for processing (10)

src/app/endpoints/query.py
src/app/endpoints/streaming_query.py
src/models/common/turn_summary.py
src/pydantic_ai_lightspeed/llamastack/__init__.py
src/pydantic_ai_lightspeed/llamastack/_model.py
src/utils/agents/query.py
src/utils/agents/streaming.py
src/utils/pydantic_ai.py
tests/unit/app/endpoints/test_streaming_query.py
tests/unit/utils/agents/test_streaming.py

📜 Review details

⏰ Context from checks skipped due to timeout. (15)

GitHub Check: Konflux kflux-prd-rh02 / lightspeed-stack-0-6-on-pull-request
GitHub Check: Konflux kflux-prd-rh02 / lightspeed-stack-on-pull-request
GitHub Check: unit_tests (3.12)
GitHub Check: unit_tests (3.13)
GitHub Check: build-pr
GitHub Check: Pylinter
GitHub Check: integration_tests (3.12)
GitHub Check: integration_tests (3.13)
GitHub Check: E2E Tests for Lightspeed Evaluation job
GitHub Check: E2E: server mode / ci / group 2
GitHub Check: E2E: library mode / ci / group 3
GitHub Check: E2E: library mode / ci / group 1
GitHub Check: E2E: server mode / ci / group 1
GitHub Check: E2E: server mode / ci / group 3
GitHub Check: E2E: library mode / ci / group 2

⚠️ CI failures not shown inline (2)

GitHub Actions: OpenAPI (Spectral) / 0_spectral.txt: LCORE-: Wired compaction turn persistence into agentic query flows

Conclusion: failure

View job details

##[group]Run set -euo pipefail
 �[36;1mset -euo pipefail�[0m
 �[36;1muv run python scripts/generate_openapi_schema.py /tmp/openapi-generated.json�[0m
 �[36;1mif ! diff -u docs/openapi.json /tmp/openapi-generated.json; then�[0m
 �[36;1m  echo "::error::docs/openapi.json is out of date. Regenerate with: uv run scripts/generate_openapi_schema.py docs/openapi.json"�[0m

GitHub Actions: OpenAPI (Spectral) / spectral: LCORE-: Wired compaction turn persistence into agentic query flows

Conclusion: failure

View job details

##[group]Run set -euo pipefail
 �[36;1mset -euo pipefail�[0m
 �[36;1muv run python scripts/generate_openapi_schema.py /tmp/openapi-generated.json�[0m
 �[36;1mif ! diff -u docs/openapi.json /tmp/openapi-generated.json; then�[0m
 �[36;1m  echo "::error::docs/openapi.json is out of date. Regenerate with: uv run scripts/generate_openapi_schema.py docs/openapi.json"�[0m

🧰 Additional context used

📓 Path-based instructions (5)

src/**/*.py

📄 CodeRabbit inference engine (AGENTS.md)

src/**/*.py: Use absolute imports for internal modules: from authentication import get_auth_dependency
Llama Stack imports: Use from llama_stack_client import AsyncLlamaStackClient
Check constants.py for shared constants before defining new ones
All modules must start with descriptive docstrings explaining purpose
Use logger = get_logger(__name__) from log.py for module logging
All functions must have complete type annotations for parameters and return types, use modern syntax (str | int), and include descriptive docstrings
Use snake_case with descriptive, action-oriented names for functions (get_, validate_, check_)
Avoid in-place parameter modification anti-patterns; return new data structures instead of modifying function parameters
Use async def for I/O operations and external API calls
Use standard log levels with clear purposes: debug() for diagnostic info, info() for program execution, warning() for unexpected events, error() for serious problems
All classes must have descriptive docstrings explaining purpose and use PascalCase with standard suffixes: Configuration, Error/Exception, Resolver, Interface
Abstract classes must use ABC with @abstractmethod decorators
Follow Google Python docstring conventions with required sections: Parameters, Returns, Raises, and Attributes for classes

Files:

src/pydantic_ai_lightspeed/llamastack/__init__.py
src/utils/agents/query.py
src/app/endpoints/query.py
src/utils/pydantic_ai.py
src/models/common/turn_summary.py
src/app/endpoints/streaming_query.py
src/pydantic_ai_lightspeed/llamastack/_model.py
src/utils/agents/streaming.py

src/**/__init__.py

📄 CodeRabbit inference engine (AGENTS.md)

Package __init__.py files must contain brief package descriptions

Files:

src/pydantic_ai_lightspeed/llamastack/__init__.py

tests/**/*.py

📄 CodeRabbit inference engine (AGENTS.md)

tests/**/*.py: Use pytest for all unit and integration tests; do not use unittest
Use pytest.mark.asyncio marker for async tests

Files:

tests/unit/utils/agents/test_streaming.py
tests/unit/app/endpoints/test_streaming_query.py

src/app/**/*.py

📄 CodeRabbit inference engine (AGENTS.md)

src/app/**/*.py: FastAPI dependencies: Import from fastapi module for APIRouter, HTTPException, Request, status, Depends
Use FastAPI HTTPException with appropriate status codes for API endpoints and handle APIConnectionError from Llama Stack

Files:

src/app/endpoints/query.py
src/app/endpoints/streaming_query.py

src/models/**/*.py

📄 CodeRabbit inference engine (AGENTS.md)

Pydantic models must use @model_validator and @field_validator for validation and complete type annotations for all attributes, avoiding Any type

Files:

src/models/common/turn_summary.py

🧠 Learnings (5)

📚 Learning: 2026-06-24T13:45:37.249Z

Learnt from: Jdubrick
Repo: lightspeed-core/lightspeed-stack PR: 1971
File: src/utils/markdown_repair.py:31-36
Timestamp: 2026-06-24T13:45:37.249Z
Learning: In the lightspeed-stack repository, docstrings must use the section header name "Parameters:" (not "Args:") for function arguments, even if the project references Google Python docstring conventions. Ensure docstrings follow the project’s established "Parameters:" header format for any documented function parameters.

Applied to files:

src/pydantic_ai_lightspeed/llamastack/__init__.py
tests/unit/utils/agents/test_streaming.py
src/utils/agents/query.py
src/app/endpoints/query.py
src/utils/pydantic_ai.py
src/models/common/turn_summary.py
src/app/endpoints/streaming_query.py
tests/unit/app/endpoints/test_streaming_query.py
src/pydantic_ai_lightspeed/llamastack/_model.py
src/utils/agents/streaming.py

📚 Learning: 2026-01-14T09:37:51.612Z

Learnt from: asimurka
Repo: lightspeed-core/lightspeed-stack PR: 988
File: src/app/endpoints/query.py:319-339
Timestamp: 2026-01-14T09:37:51.612Z
Learning: In the lightspeed-stack repository, when provider_id == "azure", the Azure provider with provider_type "remote::azure" is guaranteed to be present in the providers list. Therefore, avoid defensive StopIteration handling for next() when locating the Azure provider in providers within src/app/endpoints/query.py. This change applies specifically to this file (or nearby provider lookup code) and relies on the invariant that the Azure provider exists; if the invariant could be violated, keep the existing StopIteration handling.

Applied to files:

src/app/endpoints/query.py

📚 Learning: 2026-04-06T20:18:07.852Z

Learnt from: major
Repo: lightspeed-core/lightspeed-stack PR: 1463
File: src/app/endpoints/rlsapi_v1.py:266-271
Timestamp: 2026-04-06T20:18:07.852Z
Learning: In the lightspeed-stack codebase, within `src/app/endpoints/` inference/MCP endpoints, treat `tools: Optional[list[Any]]` in MCP tool definitions as an intentional, consistent typing pattern (used across `query`, `responses`, `streaming_query`, `rlsapi_v1`). Do not raise or suggest this as a typing issue during code review; changing it in isolation could break endpoint typing consistency across the codebase.

Applied to files:

src/app/endpoints/query.py
src/app/endpoints/streaming_query.py

📚 Learning: 2026-01-12T10:58:40.230Z

Learnt from: blublinsky
Repo: lightspeed-core/lightspeed-stack PR: 972
File: src/models/config.py:459-513
Timestamp: 2026-01-12T10:58:40.230Z
Learning: In lightspeed-core/lightspeed-stack, for Python files under src/models, when a user claims a fix is done but the issue persists, verify the current code state before accepting the fix. Steps: review the diff, fetch the latest changes, run relevant tests, reproduce the issue, search the codebase for lingering references to the original problem, confirm the fix is applied and not undone by subsequent commits, and validate with local checks to ensure the issue is resolved.

Applied to files:

src/models/common/turn_summary.py

📚 Learning: 2026-02-25T07:46:33.545Z

Learnt from: asimurka
Repo: lightspeed-core/lightspeed-stack PR: 1211
File: src/models/responses.py:8-16
Timestamp: 2026-02-25T07:46:33.545Z
Learning: In the Python codebase, requests.py should use OpenAIResponseInputTool as Tool while responses.py uses OpenAIResponseTool as Tool. This difference is intentional due to differing schemas for input vs output tools in llama-stack-api. Apply this distinction consistently to other models under src/models (e.g., ensure request-related tools use the InputTool variant and response-related tools use the ResponseTool variant). If adding new tools, choose the corresponding InputTool or Tool class based on whether the tool represents input or output, and document the rationale in code comments.

Applied to files:

src/models/common/turn_summary.py

🪛 GitHub Actions: Type checks / 0_mypy.txt

src/pydantic_ai_lightspeed/llamastack/_model.py

[error] 305-305: mypy: No overload variant of "_responses_create" of "OpenAIResponsesModel" matches argument types "list[ModelRequest | ModelResponse]", "bool", "OpenAIResponsesModelSettings", "ModelRequestParameters". [call-overload]

🪛 GitHub Actions: Type checks / mypy

src/pydantic_ai_lightspeed/llamastack/_model.py

[error] 305-305: mypy call-overload: No overload variant of "_responses_create" of "OpenAIResponsesModel" matches argument types "list[ModelRequest | ModelResponse]", "bool", "OpenAIResponsesModelSettings", "ModelRequestParameters".

🔇 Additional comments (1)

src/pydantic_ai_lightspeed/llamastack/__init__.py (1)

3-13: LGTM!

coderabbitai

Caution

Inline review comments failed to post. This is likely due to GitHub's internal server error or limits when posting large numbers of comments. If you are seeing this consistently it is likely a permissions issue. Please check "Moderation" -> "Code review limits" under your organization settings.

Actionable comments posted: 9

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (3)

src/utils/agents/query.py (2)
285-296: 📐 Maintainability & Code Quality | 🟠 Major | ⚡ Quick win

Align the changed signature and docstring with project conventions.

Use ResponseInput | None, and prefer Parameters: over Args: in changed docstrings.

As per coding guidelines, functions should use modern union syntax. Based on learnings, docstrings should use Parameters: rather than Args:.
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@src/utils/agents/query.py` around lines 285 - 296, The function signature in
retrieve_response still uses Optional[ResponseInput] for original_input; update
it to modern union syntax ResponseInput | None to match project conventions.
Also revise the retrieve_response docstring section header from Args: to
Parameters: while keeping the existing parameter descriptions aligned with the
function’s named arguments and symbols.
Sources: Coding guidelines, Learnings

304-310: 🗄️ Data Integrity & Integration | 🟡 Minor | ⚡ Quick win

Preserve falsy original_input values.

Using or falls back to the rewritten input when original_input is a valid falsy value, such as "" or []. Check for None explicitly.
Proposed fix
-            original_input or responses_params.input,
+            original_input if original_input is not None else responses_params.input,
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@src/utils/agents/query.py` around lines 304 - 310, The blocked-moderation
branch in query logic is incorrectly using a truthiness fallback for
original_input, which can replace valid falsy values like an empty string or
list with responses_params.input. Update the append_turn_items_to_conversation
call in the moderation handling path to preserve original_input whenever it is
provided by checking explicitly for None instead of using or. Use the nearby
identifiers moderation_result, original_input, responses_params.input, and
append_turn_items_to_conversation to locate the change.
src/app/endpoints/streaming_query.py (1)
703-712: 🗄️ Data Integrity & Integration | 🟠 Major | 🏗️ Heavy lift

Compacted streaming turns now persist no assistant output.

store_compacted_turn() appends original_input plus the supplied output_items. Passing [] here means compacted streams store only the user side of the turn, so the assistant response disappears from conversation history. The blocking path in src/app/endpoints/query.py:381-386 still passes response.output, so streaming compaction now diverges from non-streaming persistence.
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@src/app/endpoints/streaming_query.py` around lines 703 - 712, The compacted
streaming path in `streaming_query.py` is dropping assistant output because
`store_compacted_turn()` is being called with an empty output list. Update the
call in the streaming compaction flow to pass the actual streamed assistant
output items, matching the non-streaming `query.py` persistence path that uses
`response.output`, so `store_compacted_turn()` records both sides of the turn.

🤖 Prompt for all review comments with AI agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@src/app/endpoints/streaming_query.py`:
- Around line 640-655: The interrupted-stream path is replaying already-emitted
text because build_interrupted_response() returns a suffix that duplicates the
partial response, and this call site then persists full_text and yields the
duplicate suffix. Update the helper contract in build_interrupted_response() so
the returned suffix contains only the new interrupt tail, not the previously
streamed prefix, and ensure full_text used by persist_interrupted_turn() does
not repeat client-visible content. Verify the streaming path in
streaming_query.py still passes turn_summary.partial_tokens through without
re-adding the prefix, and keep the token event payload limited to the
non-duplicated suffix.

In `@src/pydantic_ai_lightspeed/llamastack/_model.py`:
- Around line 91-100: The __init__ signature in _model.py should follow the
project’s modern type and docstring conventions: replace
Optional[CompactionTurnContext] with CompactionTurnContext | None, and update
the docstring header from Args: to Parameters:. Apply the same cleanup
consistently to the other affected constructor or method block referenced in the
review, keeping the naming aligned with the existing symbols like __init__ and
any related stream wrapper methods in this module.

In `@src/utils/agents/streaming.py`:
- Around line 88-98: Update the affected function signatures in streaming.py to
use modern union syntax by replacing Optional[...] with the corresponding | None
form, including the original_input parameter in the streaming helper and any
other touched signatures in this change. Also revise the related docstrings to
follow project conventions by changing the “Args:” section header to
“Parameters:” while keeping the existing parameter descriptions aligned with the
named symbols such as original_input and the streaming return helper.
- Around line 108-113: The append_turn_items_to_conversation call in
streaming.py is using `original_input or responses_params.input`, which can
incorrectly fall back when `original_input` is a valid falsy value. Update the
logic around `append_turn_items_to_conversation` to preserve falsy originals by
checking explicitly for `None` before choosing the fallback input, using the
existing `original_input` and `responses_params.input` symbols.
- Around line 236-258: The HTTPException handling in streaming.py is aborting
the whole stream too early in the topic-summary path. In the try/except around
maybe_get_topic_summary, stop yielding an error event and returning; instead log
the warning, set topic_summary = None, and continue so token consumption, the
end event, and store_query_results() still run. Keep the existing identifiers
maybe_get_topic_summary, topic_summary, and serialize_event in place while
removing the early-exit behavior.
- Around line 205-215: The interruption path is reusing already-streamed tokens,
so the client-visible text and persisted response are duplicated. Update the
handling around build_interrupted_response() in streaming.py so the returned
suffix excludes the partial text that was already yielded, and only the new
interrupt-specific tail is appended. Make sure turn_summary.llm_response is set
to the non-duplicating interrupted content before calling
persist_interrupted_turn(), and keep the persist_guard flow unchanged.

In `@src/utils/pydantic_ai.py`:
- Line 103: Update the changed type annotations in the pydantic_ai helpers to
use modern union syntax instead of Optional. Specifically, adjust the return
annotation on the relevant function signature and the other affected annotation
referenced in the review to use list[AgentCapability[Any]] | None and
ResponseInput | None, keeping the existing symbols and behavior unchanged.
- Around line 144-149: The compaction flow is still typed too narrowly for
library mode, so update the type annotations used by
CompactionTurnContext.client and append_turn_items_to_conversation to the shared
client union accepted by build_agent() (AsyncLlamaStackClient plus
AsyncLlamaStackAsLibraryClient). Adjust the relevant signatures and any related
context construction in pydantic_ai.py so compaction can receive either client
type without a type mismatch.

In `@tests/unit/utils/agents/test_streaming.py`:
- Around line 1053-1054: The interrupted-response test for
build_interrupted_response() is too weak because it only checks containment and
can miss duplicated streamed prefixes. Update the test in test_streaming.py to
assert the exact turn_summary.llm_response for the simple "Hello world" case,
and also verify the final token payload if available. Use the existing
turn_summary and INTERRUPTED_RESPONSE_MESSAGE assertions as anchors, but replace
the partial string checks with exact expected output matching the interrupted
assembly behavior.

---

Outside diff comments:
In `@src/app/endpoints/streaming_query.py`:
- Around line 703-712: The compacted streaming path in `streaming_query.py` is
dropping assistant output because `store_compacted_turn()` is being called with
an empty output list. Update the call in the streaming compaction flow to pass
the actual streamed assistant output items, matching the non-streaming
`query.py` persistence path that uses `response.output`, so
`store_compacted_turn()` records both sides of the turn.

In `@src/utils/agents/query.py`:
- Around line 285-296: The function signature in retrieve_response still uses
Optional[ResponseInput] for original_input; update it to modern union syntax
ResponseInput | None to match project conventions. Also revise the
retrieve_response docstring section header from Args: to Parameters: while
keeping the existing parameter descriptions aligned with the function’s named
arguments and symbols.
- Around line 304-310: The blocked-moderation branch in query logic is
incorrectly using a truthiness fallback for original_input, which can replace
valid falsy values like an empty string or list with responses_params.input.
Update the append_turn_items_to_conversation call in the moderation handling
path to preserve original_input whenever it is provided by checking explicitly
for None instead of using or. Use the nearby identifiers moderation_result,
original_input, responses_params.input, and append_turn_items_to_conversation to
locate the change.

🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

Push a commit to this branch (recommended)
Create a new PR with the fixes

ℹ️ Review info

⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: ASSERTIVE

Plan: Pro

Run ID: 39b58bda-ff8e-48e5-90e1-2f688dd44870

📥 Commits

Reviewing files that changed from the base of the PR and between 62527e3 and fec773e.

📒 Files selected for processing (10)

src/app/endpoints/query.py
src/app/endpoints/streaming_query.py
src/models/common/turn_summary.py
src/pydantic_ai_lightspeed/llamastack/__init__.py
src/pydantic_ai_lightspeed/llamastack/_model.py
src/utils/agents/query.py
src/utils/agents/streaming.py
src/utils/pydantic_ai.py
tests/unit/app/endpoints/test_streaming_query.py
tests/unit/utils/agents/test_streaming.py

📜 Review details

🔇 Additional comments (1)

src/pydantic_ai_lightspeed/llamastack/__init__.py (1)

3-13: LGTM!

🛑 Comments failed to post (9)

src/app/endpoints/streaming_query.py (1)

640-655: 🎯 Functional Correctness | 🟠 Major | ⚡ Quick win

Don't replay the full partial response in the interrupt suffix.

build_interrupted_response() currently returns a suffix that includes the already-streamed text again, and full_text repeats that same prefix in storage. Once any partial tokens have been emitted, interrupted streams will duplicate content in both the client-visible token stream and the persisted transcript. The fix belongs in src/utils/stream_interrupts.py, but this call path now exposes the bad output contract.

Proposed helper fix

 def build_interrupted_response(partial_tokens: list[str]) -> tuple[str, str]:
     partial_text = "".join(partial_tokens)
     repaired_text = close_open_markdown(partial_text)
     interrupted_indicator = f"\n\n*{INTERRUPTED_RESPONSE_MESSAGE}*"
-    suffix = repaired_text + interrupted_indicator
-    final_text = partial_text + suffix
+    final_text = repaired_text + interrupted_indicator
+    suffix = final_text[len(partial_text) :]
     return final_text, suffix

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@src/app/endpoints/streaming_query.py` around lines 640 - 655, The
interrupted-stream path is replaying already-emitted text because
build_interrupted_response() returns a suffix that duplicates the partial
response, and this call site then persists full_text and yields the duplicate
suffix. Update the helper contract in build_interrupted_response() so the
returned suffix contains only the new interrupt tail, not the previously
streamed prefix, and ensure full_text used by persist_interrupted_turn() does
not repeat client-visible content. Verify the streaming path in
streaming_query.py still passes turn_summary.partial_tokens through without
re-adding the prefix, and keep the token event payload limited to the
non-duplicated suffix.

src/pydantic_ai_lightspeed/llamastack/_model.py (1)

91-100: 📐 Maintainability & Code Quality | 🟠 Major | ⚡ Quick win

Align changed annotations and docstring headers with project conventions.

The new optional annotations should use CompactionTurnContext | None, and changed docstrings should use Parameters: instead of Args:.

Proposed cleanup

-from typing import Any, Literal, Optional, cast, overload
+from typing import Any, Literal, cast, overload
@@
-        compaction: Optional[CompactionTurnContext] = None,
+        compaction: CompactionTurnContext | None = None,
@@
-        Args:
+        Parameters:
@@
-        compaction: Optional[CompactionTurnContext] = None,
+        compaction: CompactionTurnContext | None = None,
@@
-        Args:
+        Parameters:

As per coding guidelines, functions should use modern union syntax. Based on learnings, docstrings should use Parameters: rather than Args:.

Also applies to: 236-253

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@src/pydantic_ai_lightspeed/llamastack/_model.py` around lines 91 - 100, The
__init__ signature in _model.py should follow the project’s modern type and
docstring conventions: replace Optional[CompactionTurnContext] with
CompactionTurnContext | None, and update the docstring header from Args: to
Parameters:. Apply the same cleanup consistently to the other affected
constructor or method block referenced in the review, keeping the naming aligned
with the existing symbols like __init__ and any related stream wrapper methods
in this module.

Sources: Coding guidelines, Learnings

src/utils/agents/streaming.py (4)

88-98: 📐 Maintainability & Code Quality | 🟠 Major | ⚡ Quick win

Align changed signatures and docstrings with project conventions.

Use ResponseInput | None, and prefer Parameters: over Args: in changed docstrings.

As per coding guidelines, functions should use modern union syntax. Based on learnings, docstrings should use Parameters: rather than Args:.

Also applies to: 149-167

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@src/utils/agents/streaming.py` around lines 88 - 98, Update the affected
function signatures in streaming.py to use modern union syntax by replacing
Optional[...] with the corresponding | None form, including the original_input
parameter in the streaming helper and any other touched signatures in this
change. Also revise the related docstrings to follow project conventions by
changing the “Args:” section header to “Parameters:” while keeping the existing
parameter descriptions aligned with the named symbols such as original_input and
the streaming return helper.

Sources: Coding guidelines, Learnings

108-113: 🗄️ Data Integrity & Integration | 🟡 Minor | ⚡ Quick win

Preserve falsy original_input values.

Using or can persist the rewritten compacted input instead of a valid falsy original input. Use an explicit None check.

Proposed fix

-                original_input or responses_params.input,
+                original_input if original_input is not None else responses_params.input,

📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

            await append_turn_items_to_conversation(
                context.client,
                responses_params.conversation,
                original_input if original_input is not None else responses_params.input,
                [context.moderation_result.refusal_response],
            )

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@src/utils/agents/streaming.py` around lines 108 - 113, The
append_turn_items_to_conversation call in streaming.py is using `original_input
or responses_params.input`, which can incorrectly fall back when
`original_input` is a valid falsy value. Update the logic around
`append_turn_items_to_conversation` to preserve falsy originals by checking
explicitly for `None` before choosing the fallback input, using the existing
`original_input` and `responses_params.input` symbols.

205-215: 🎯 Functional Correctness | 🟠 Major | ⚡ Quick win

Avoid duplicating already-streamed partial tokens on interruption.

build_interrupted_response() currently returns a suffix that includes the full partial text, but those tokens were already yielded. This duplicates client-visible text and the persisted interrupted response.

Proposed helper fix

-    suffix = repaired_text + interrupted_indicator
-    final_text = partial_text + suffix
+    suffix = repaired_text.removeprefix(partial_text) + interrupted_indicator
+    final_text = repaired_text + interrupted_indicator

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@src/utils/agents/streaming.py` around lines 205 - 215, The interruption path
is reusing already-streamed tokens, so the client-visible text and persisted
response are duplicated. Update the handling around build_interrupted_response()
in streaming.py so the returned suffix excludes the partial text that was
already yielded, and only the new interrupt-specific tail is appended. Make sure
turn_summary.llm_response is set to the non-duplicating interrupted content
before calling persist_interrupted_turn(), and keep the persist_guard flow
unchanged.

236-258: 🗄️ Data Integrity & Integration | 🟠 Major | ⚡ Quick win

Do not drop a completed response when topic-summary generation fails.

This path returns before token consumption, the end event, and store_query_results(). Since topic summaries are ancillary, log the failure and continue with topic_summary = None.

Proposed fix

     except HTTPException as exc:
         logger.warning(
             "Topic summary failed for request %s: %s",
             context.request_id,
             exc.detail,
         )
-        detail: dict[str, str] = exc.detail if isinstance(exc.detail, dict) else {}
-        yield serialize_event(
-            ErrorStreamPayload.create(
-                status_code=exc.status_code,
-                response=detail.get("response", "Internal server error"),
-                cause=detail.get("cause", str(exc.detail)),
-            ),
-            media_type,
-        )
-        return
+        topic_summary = None

📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

    try:
        topic_summary = await maybe_get_topic_summary(
            generate_topic_summary=should_generate_topic_summary,
            input_text=context.query_request.query,
            client=context.client,
            model_id=responses_params.model,
        )
    except HTTPException as exc:
        logger.warning(
            "Topic summary failed for request %s: %s",
            context.request_id,
            exc.detail,
        )
        topic_summary = None

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@src/utils/agents/streaming.py` around lines 236 - 258, The HTTPException
handling in streaming.py is aborting the whole stream too early in the
topic-summary path. In the try/except around maybe_get_topic_summary, stop
yielding an error event and returning; instead log the warning, set
topic_summary = None, and continue so token consumption, the end event, and
store_query_results() still run. Keep the existing identifiers
maybe_get_topic_summary, topic_summary, and serialize_event in place while
removing the early-exit behavior.

src/utils/pydantic_ai.py (2)

103-103: 📐 Maintainability & Code Quality | 🟠 Major | ⚡ Quick win

Use modern optional syntax in changed annotations.

Prefer list[AgentCapability[Any]] | None and ResponseInput | None for new/changed annotations.

As per coding guidelines, functions should use modern union syntax.

Also applies to: 122-122

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@src/utils/pydantic_ai.py` at line 103, Update the changed type annotations in
the pydantic_ai helpers to use modern union syntax instead of Optional.
Specifically, adjust the return annotation on the relevant function signature
and the other affected annotation referenced in the review to use
list[AgentCapability[Any]] | None and ResponseInput | None, keeping the existing
symbols and behavior unchanged.

Source: Coding guidelines

144-149: 📐 Maintainability & Code Quality | 🟡 Minor | ⚡ Quick win

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash
rg -n "class CompactionTurnContext|def append_turn_items_to_conversation|AsyncLlamaStackAsLibraryClient" --type py -C 4

Repository: lightspeed-core/lightspeed-stack

Length of output: 170

🏁 Script executed:

#!/bin/bash
set -euo pipefail

# Map the relevant file and nearby symbols.
git ls-files 'src/utils/pydantic_ai.py' 'src/**/_model.py' 'src/**/model.py' | sed 's#^`#FILE` #'
echo '--- OUTLINE pydantic_ai.py ---'
ast-grep outline src/utils/pydantic_ai.py --view expanded || true
echo '--- SEARCH relevant symbols ---'
rg -n "CompactionTurnContext|append_turn_items_to_conversation|AsyncLlamaStackAsLibraryClient|AsyncLlamaStackClient" src -C 3

Repository: lightspeed-core/lightspeed-stack

Length of output: 50388

🏁 Script executed:

#!/bin/bash
set -euo pipefail

# Read the target file and likely model definitions in small slices.
for f in src/utils/pydantic_ai.py; do
  echo "--- $f (lines 1-260) ---"
  nl -ba "$f" | sed -n '1,260p'
done

echo "--- candidate model files containing CompactionTurnContext or append helper ---"
rg -n "CompactionTurnContext|append_turn_items_to_conversation|AsyncLlamaStackAsLibraryClient" src -l

Repository: lightspeed-core/lightspeed-stack

Length of output: 257

🏁 Script executed:

#!/bin/bash
set -euo pipefail

# Read the target file and the type definitions it uses.
nl -ba src/utils/pydantic_ai.py | sed -n '1,260p'

echo '--- related definitions ---'
rg -n "class CompactionTurnContext|def append_turn_items_to_conversation|AsyncLlamaStackAsLibraryClient|AsyncLlamaStackClient" src -C 4

Repository: lightspeed-core/lightspeed-stack

Length of output: 210

🏁 Script executed:

#!/bin/bash
set -euo pipefail

echo '--- src/utils/pydantic_ai.py (118-155) ---'
sed -n '118,155p' src/utils/pydantic_ai.py

echo '--- src/pydantic_ai_lightspeed/llamastack/_model.py (57-76, 150-160, 320-330) ---'
sed -n '57,76p;150,160p;320,330p' src/pydantic_ai_lightspeed/llamastack/_model.py

echo '--- src/utils/conversations.py (453-470) ---'
sed -n '453,470p' src/utils/conversations.py

Repository: lightspeed-core/lightspeed-stack

Length of output: 4395

Broaden the compaction client type. build_agent() accepts AsyncLlamaStackAsLibraryClient, but CompactionTurnContext.client and append_turn_items_to_conversation() are still typed as AsyncLlamaStackClient. Widen those annotations to the shared client union, or library-mode compaction stays mismatched.

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@src/utils/pydantic_ai.py` around lines 144 - 149, The compaction flow is
still typed too narrowly for library mode, so update the type annotations used
by CompactionTurnContext.client and append_turn_items_to_conversation to the
shared client union accepted by build_agent() (AsyncLlamaStackClient plus
AsyncLlamaStackAsLibraryClient). Adjust the relevant signatures and any related
context construction in pydantic_ai.py so compaction can receive either client
type without a type mismatch.

tests/unit/utils/agents/test_streaming.py (1)

1053-1054: 🎯 Functional Correctness | 🟡 Minor | ⚡ Quick win

Assert the exact interrupted text here.

These containment checks still pass when interrupted-response assembly duplicates the already-streamed prefix, so they won't catch the regression in build_interrupted_response(). Please assert the exact turn_summary.llm_response (and ideally the final token payload) for this simple "Hello world" case.
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@tests/unit/utils/agents/test_streaming.py` around lines 1053 - 1054, The
interrupted-response test for build_interrupted_response() is too weak because
it only checks containment and can miss duplicated streamed prefixes. Update the
test in test_streaming.py to assert the exact turn_summary.llm_response for the
simple "Hello world" case, and also verify the final token payload if available.
Use the existing turn_summary and INTERRUPTED_RESPONSE_MESSAGE assertions as
anchors, but replace the partial string checks with exact expected output
matching the interrupted assembly behavior.

asimurka · 2026-06-29T11:16:57Z

        responses_params,
        moderation_result,
        endpoint_path,
-        compaction.original_input if compaction.compacted else None,


compaction.compacted is True iff compaction.original_input is not None

asimurka · 2026-06-29T11:17:36Z

                    else context.query_request.query
                ),
-                turn_summary.output_items,
+                [],  # field was removed from TurnSummary


This is inside deprecated helper function so it has no effect.

asimurka · 2026-06-29T11:18:06Z

    rag_chunks: list[RAGChunk] = Field(default_factory=list)
    referenced_documents: list[ReferencedDocument] = Field(default_factory=list)
    token_usage: TokenCounter = Field(default_factory=TokenCounter)
-    output_items: list[OpenAIResponseOutput] = Field(


This attribute is unnecessary now, we capture the output inside the agentic loop

asimurka · 2026-06-29T14:36:18Z


        assert isinstance(turn_summary, TurnSummary)
        assert turn_summary.llm_response == "Content blocked"
-        # Structured refusal captured for compacted-mode persistence (LCORE-1572).


Removed some irrelevant tests - persistence now takes place within agentic loop

asimurka · 2026-06-29T14:37:19Z

-                    responses_params.conversation,
-                    responses_params.input,
-                    [context.moderation_result.refusal_response],
-                )


Blocked compacted turn never reaches agentic loop, so has to be persisted here

asimurka · 2026-06-29T14:38:40Z

            client,
            responses_params.conversation,
-            responses_params.input,
+            original_input or responses_params.input,


If original input is not None -> store the original prompt, responses_api.input contains the whole compacted history.

asimurka · 2026-06-29T14:40:21Z

+        model_request_parameters: ModelRequestParameters,
+    ) -> AsyncStream[responses.ResponseStreamEvent]: ...
+
+    async def _responses_create(


Patching _responses_create function in order to intercept input and output of each responses api call

asimurka · 2026-06-29T14:40:44Z

+            and not stream
+            and isinstance(result, responses.Response)
+        ):
+            await append_turn_items_to_conversation(


Store here for non-streaming query

asimurka · 2026-06-29T14:42:12Z

+                compaction.latest_round_input,
+                cast(Sequence[Any], result.output),
+            )
+            compaction.original_input_persisted = True


First LLM call contains the whole compacted history, therefore we need to store onpy the original input, subsequent turns are stored normally

asimurka · 2026-06-29T14:42:55Z

+                and self._compaction is not None
+            ):
+                compaction = self._compaction
+                await append_turn_items_to_conversation(


Store streaming compacted turn from final event

asimurka marked this pull request as draft June 19, 2026 12:41

asimurka marked this pull request as ready for review June 22, 2026 06:36

coderabbitai Bot reviewed Jun 22, 2026

View reviewed changes

Comment thread src/pydantic_ai_lightspeed/llamastack/_model.py Outdated

asimurka marked this pull request as draft June 22, 2026 08:42

asimurka force-pushed the integrate_compaction_to_agentic_flow branch 2 times, most recently from 32f06e4 to fec773e Compare June 29, 2026 11:00

asimurka marked this pull request as ready for review June 29, 2026 11:01

asimurka force-pushed the integrate_compaction_to_agentic_flow branch from fec773e to 662e633 Compare June 29, 2026 11:07

coderabbitai Bot reviewed Jun 29, 2026

View reviewed changes

asimurka commented Jun 29, 2026

View reviewed changes

Wired compaction turn persistence into agentic query flows

4a19bc4

asimurka force-pushed the integrate_compaction_to_agentic_flow branch from 662e633 to 4a19bc4 Compare June 29, 2026 11:45

asimurka commented Jun 29, 2026

View reviewed changes

asimurka requested review from jrobertboos, max-svistunov and tisnik June 29, 2026 14:43

Uh oh!

Conversation

asimurka commented Jun 19, 2026 • edited by coderabbitai Bot Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Description

Type of change

Tools used to create PR

Related Tickets & Documents

Checklist before requesting a review

Testing

Summary by CodeRabbit

Uh oh!

coderabbitai Bot commented Jun 19, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Review limit reached

Walkthrough

Changes

Estimated code review effort

Possibly related PRs

Suggested reviewers

Uh oh!

coderabbitai Bot left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

coderabbitai Bot left a comment

Choose a reason for hiding this comment

Uh oh!

coderabbitai Bot left a comment

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

asimurka commented Jun 19, 2026 •

edited by coderabbitai Bot

Loading

coderabbitai Bot commented Jun 19, 2026 •

edited

Loading