feat(schema): represent, serialize and validate v3 column default values (1/4)#746
feat(schema): represent, serialize and validate v3 column default values (1/4)#746huan233usc wants to merge 14 commits into
Conversation
445f7ae to
4aade00
Compare
1ee5b32 to
34470af
Compare
There was a problem hiding this comment.
Pull request overview
Adds Iceberg v3 column default value support at the schema layer by carrying, JSON-(de)serializing, validating, and projecting initial-default / write-default literals (foundation for later read/write-path work).
Changes:
- Extend
SchemaFieldto storeinitial-default/write-defaultas shared immutable literals and include them in equality/ID-reassignment rebuilds. - Add JSON serde for
initial-default/write-defaultusing existing single-value literal serialization. - Update schema projection to use
FieldProjection::Kind::kDefaultwhen an expected field is missing but hasinitial-default, and add/extend unit tests + v3 metadata fixture.
Reviewed changes
Copilot reviewed 9 out of 9 changed files in this pull request and generated 2 comments.
Show a summary per file
| File | Description |
|---|---|
| src/iceberg/test/schema_util_test.cc | Adds projection tests for missing fields with initial-default and for ignoring defaults when the field is present. |
| src/iceberg/test/schema_test.cc | Adds schema validation/version-gating and ID-reassignment preservation tests for default values. |
| src/iceberg/test/schema_json_test.cc | Adds round-trip and failure tests for JSON serialization/parsing of default values (incl. nested structs). |
| src/iceberg/test/resources/TableMetadataV3Valid.json | Adds a v3-valid table metadata JSON fixture. |
| src/iceberg/schema.cc | Preserves defaults during ID reassignment and adds default-related validation in Schema::Validate. |
| src/iceberg/schema_util.cc | Projects missing fields with initial-default as kDefault rather than error/null. |
| src/iceberg/schema_field.h | Extends SchemaField API/storage to carry default literals and expose them via optional reference accessors. |
| src/iceberg/schema_field.cc | Implements default accessors, validation of defaults, and includes defaults in equality. |
| src/iceberg/json_serde.cc | Serializes/parses initial-default / write-default on schema fields via literal single-value serde. |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
First of a multi-part split of column default value support (apache#730) — the schema foundation the read and evolution paths build on. Purely additive; no read/write behavior change on its own. - SchemaField carries `initial-default` / `write-default` (immutable std::shared_ptr<const Literal>) with copy-preserving WithInitialDefault / WithWriteDefault modifiers; getters return optional<reference_wrapper>. - JSON serde reads/writes `initial-default` / `write-default` via the existing single-value serialization. - Schema::Validate rejects default values below format v3 and validates they are non-null primitive literals matching the field type. - Generic schema projection maps a column missing from a data file with an initial-default to FieldProjection::Kind::kDefault. Read-path application (Parquet/Avro) and schema evolution follow in separate PRs. See apache#731 for the full end-to-end proof-of-concept.
34470af to
f663e0e
Compare
This file was added by this PR but is referenced by no test (table-metadata resources are loaded by explicit name via ReadTableMetadataFromResource) and carries no column defaults, so it is dead. Remove it.
Add SchemaField default-value coverage directly to json_serde_test (the binary that exercises json_serde.cc): one test pins the initial-default / write-default wire format (incl. their absence when unset), and one round-trips a field with both defaults through ToJson -> FieldFromJson for every primitive type, exercising the non-trivial date/timestamp/decimal/uuid/binary encodings the default path reuses.
…-schema # Conflicts: # src/iceberg/test/json_serde_test.cc
wgtmac
left a comment
There was a problem hiding this comment.
Static review notes; I did not build or run tests.
Addresses wgtmac's review feedback: - Enforce UTC offset for timestamptz / timestamptz_ns default values in FieldFromJson. The shared timestamp parser accepts any offset and silently normalizes to UTC, so C++ could accept default metadata Java rejects and rewrite the offset on write. Added TemporalUtils::IsUtcOffset, which reuses the existing timezone-suffix parser. - ValidateDefault now casts the default literal to the field type (Literal::CastTo) instead of requiring an exact type match, matching Java's Types.NestedField (e.g. an int default on a long field). Uncastable or out-of-range defaults are still rejected. - Documented that non-primitive (struct/list/map) defaults remain unsupported, matching Java's current model. - Extended the json_serde round-trip tests with time/timestamptz/timestamp_ns/ timestamptz_ns and a negative case for non-UTC timestamptz defaults; added a schema test for the cast-to-field-type behavior.
Directly cover the UTC/non-UTC/unparseable offset cases the timestamptz default-value enforcement relies on.
The AMD64 Windows job failed in its cleanup step (rm -rf example/build: Device or resource busy) after the example compiled and linked successfully; re-run CI.
wgtmac
left a comment
There was a problem hiding this comment.
Two remaining static review notes; I did not build or run tests.
| // be cast to the field type or fall outside its range (CastTo signals out-of-range as | ||
| // an above-max/below-min sentinel). | ||
| auto field_type = std::static_pointer_cast<PrimitiveType>(field.type()); | ||
| auto cast = value.CastTo(field_type); |
There was a problem hiding this comment.
This validates the cast but keeps storing the original literal. Java stores the result of castDefault, so a programmatic default like Literal::Long(...) on a timestamp field would validate here, then later serialize or project as a long instead of a timestamp default. Either the field needs to retain the casted literal, or this should keep requiring an exact type match.
|
|
||
| Status ValidateDefault(const SchemaField& field, const Literal& value, | ||
| std::string_view kind) { | ||
| if (value.IsNull() || value.IsAboveMax() || value.IsBelowMin()) { |
There was a problem hiding this comment.
This rejects explicit null defaults. The spec allows optional field defaults to be null and says they may be explicitly set. Java mostly models a null default as absence, but C++ now has a present-null Literal path that will fail validation and, for initial-default, also trips the format-version gate by presence rather than by non-null value.
wgtmac
left a comment
There was a problem hiding this comment.
I've just finished a human review over non-test files. Let me know what you think. Thanks!
| /// | ||
| /// The returned reference is a non-owning view into a value owned by this field; | ||
| /// it remains valid for the lifetime of this SchemaField. | ||
| [[nodiscard]] std::optional<std::reference_wrapper<const Literal>> initial_default() |
There was a problem hiding this comment.
I think we should remove [[nodiscard]] because users should always know what they are doing when these are called. I think we need also clean up existing functions as well.
| /// \brief Get the shared owning pointer to the `initial-default` value, or null if | ||
| /// absent. Prefer initial_default() for reading; this exists so a rebuilt field can | ||
| /// share the (immutable) value rather than copy it. | ||
| [[nodiscard]] const std::shared_ptr<const Literal>& initial_default_ptr() const; |
There was a problem hiding this comment.
It looks odd to provide both initial_default and initial_default_ptr. Should we keep one, like const std::shared_ptr<const Literal>& initial_default() const?
| std::shared_ptr<Type> type_; | ||
| bool optional_; | ||
| std::string doc_; | ||
| // Default values are owned by this field and never mutated after being set; copies |
There was a problem hiding this comment.
Not a strong opinion but the comment looks too verbose.
| } | ||
|
|
||
| Result<bool> TemporalUtils::IsUtcOffset(std::string_view str) { | ||
| ICEBERG_ASSIGN_OR_RAISE(auto timestamp_with_offset, ParseTimestampWithZoneSuffix(str)); |
There was a problem hiding this comment.
This allows Z and -00:00 as well. This is not allowed by the spec, right? However, Java impl seems doing the same thing. Perhaps we need to document this behavior explicitly?
| #include "iceberg/util/macros.h" | ||
|
|
||
| namespace iceberg { | ||
|
|
There was a problem hiding this comment.
This always create a new literal even when their types are the same which is unlikely?
| // an above-max/below-min sentinel). | ||
| auto field_type = std::static_pointer_cast<PrimitiveType>(field.type()); | ||
| auto cast = value.CastTo(field_type); | ||
| if (!cast.has_value() || cast->IsAboveMax() || cast->IsBelowMin()) { |
There was a problem hiding this comment.
Perhaps it is better to use ICEBERG_ASSIGN_OR_RAISE to respect the original error instead of checking cast.has_value()?
| // Defaults are validated to be primitive literals matching the field type, so | ||
| // single-value serialization cannot fail here. | ||
| if (field.initial_default().has_value()) { | ||
| ICEBERG_ASSIGN_OR_THROW(json[kInitialDefault], |
There was a problem hiding this comment.
Though it cannot not fail, we shouldn't use ICEBERG_ASSIGN_OR_THROW here. Let's change the signature to return Result<nlohmann::json> just in case any error.
There was a problem hiding this comment.
| if (!is_utc) { | ||
| return JsonParseError( | ||
| "Invalid timestamptz default '{}' for {}: default values must use UTC " | ||
| "(offset 'Z' or '+00:00')", |
There was a problem hiding this comment.
As I have commented earlier, this can also accept -00:00 so the error message is not consistent.
| return {}; | ||
| } | ||
| if (!value.is_string()) { | ||
| // Let LiteralFromJson report the type mismatch. |
There was a problem hiding this comment.
Why not directly error out here?
| ICEBERG_ASSIGN_OR_RAISE(auto name, GetJsonValue<std::string>(json, kName)); | ||
| ICEBERG_ASSIGN_OR_RAISE(auto required, GetJsonValue<bool>(json, kRequired)); | ||
| ICEBERG_ASSIGN_OR_RAISE(auto doc, GetJsonValueOrDefault<std::string>(json, kDoc)); | ||
| ICEBERG_ASSIGN_OR_RAISE(std::optional<nlohmann::json> initial_default_json, |
- ValidateDefault: use internal::checked_pointer_cast and ICEBERG_ASSIGN_OR_RAISE so an uncastable default surfaces the original CastTo error instead of a generic message; keep the out-of-range (sentinel) case as an invalid-schema error. - json_serde: error out directly when a timestamptz default is not a string, drop the offset-specific wording from the UTC error message (it also accepts -00:00), and use auto for the parsed default JSON. - Update schema tests for the propagated cast error and the out-of-range case.
…erializers Column default values make single-value (Literal) serialization fallible, so the schema/type serializers must propagate errors instead of throwing. Change ToJson for SchemaField, Type, Schema, TableMetadata and TableUpdate (plus the REST CreateTableRequest/CommitTableRequest/LoadTableResult/CommitTableResponse, which embed them) to return Result<nlohmann::json>, replacing the ICEBERG_ASSIGN_OR_THROW in SchemaField default serialization with ICEBERG_ASSIGN_OR_RAISE. - The shared ToJsonList template (also used by infallible serializers) is left unchanged; TableMetadata serializes its schema list with an explicit loop. - REST adds an ICEBERG_DECLARE_JSON_SERDE_FALLIBLE macro variant for the four models that embed a schema/metadata; the other REST models keep the bare-json ToJson. - Callers bottom out at existing Result boundaries (ToJsonString wrappers, rest_catalog.cc, TableMetadataUtil::Write). - Tests unwrap via ICEBERG_UNWRAP_OR_FAIL.
Extract the FromJson declarations into ICEBERG_DECLARE_FROM_JSON so the four schema/metadata-bearing models declare a Result-returning ToJson without duplicating the whole serde macro.
Keep ICEBERG_DECLARE_JSON_SERDE identical to upstream and declare the four schema/metadata-bearing models (LoadTableResult, CreateTableRequest, CommitTableRequest, CommitTableResponse) explicitly, since only their ToJson return type (Result) differs.
…-schema # Conflicts: # src/iceberg/catalog/rest/rest_catalog.cc
…lizers (#785) ## What Change the schema/type/metadata `ToJson` serializers from returning bare `nlohmann::json` to `Result<nlohmann::json>`: - `json_serde`: `ToJson(SchemaField | Type | Schema | TableMetadata | TableUpdate)` - REST: `ToJson(CreateTableRequest | CommitTableRequest | LoadTableResult | CommitTableResponse)` Errors propagate via `ICEBERG_ASSIGN_OR_RAISE`; callers bottom out at the existing `Result`-returning boundaries (`ToJsonString`, `rest_catalog`, `TableMetadataUtil::Write`). ## Why Preparation for v3 column default values (#730 / #746). Single-value (`Literal`) serialization is fallible, and column defaults invoke it from schema serialization, so these serializers need to propagate the error instead of throwing. Splitting it out keeps the feature PR focused on the feature. ## Notes - **No behavior change** — every conversion still succeeds today; only the return type changes. - The shared `ToJsonList` template stays bare (it also serializes infallible types such as partition specs and snapshots); `TableMetadata` serializes its schema list with an explicit loop. - The REST `ICEBERG_DECLARE_JSON_SERDE` macro is unchanged; the four schema/metadata-bearing models are declared explicitly so only their `ToJson` return type differs. ## Testing No behavior change; existing tests are adapted to the new return type. Full build and `ctest` pass locally.
| if (lhs == nullptr || rhs == nullptr) { | ||
| return lhs == rhs; | ||
| } | ||
| return *lhs == *rhs; |
There was a problem hiding this comment.
Just a reminder that Literal::operator<=> returns unordered when any side IsNull() returns true so two null defaults do not equal. I think we should fix Literal::operator<=> to be null safe?
There was a problem hiding this comment.
Or add a overload Literal::NullSafeEquals()
Part 1 of a multi-part split of #730 (column default values, item 2 of #637). The full
end-to-end implementation is in #731, kept open as the proof-of-concept; this series
lands it in reviewable pieces.
This PR is the schema foundation — representing, serializing and validating v3
column default values. It is purely additive and changes no read or write behavior on
its own.
What's in this PR
SchemaFieldcarriesinitial-default/write-default, stored asstd::shared_ptr<const Literal>(immutable payload shared across copies, like theadjacent
type_; the C++ analog of Java'sfinal Literal<?>). They are set via theconstructor. Getters return
std::optional<std::reference_wrapper<const Literal>>forreading (the
Schema::FindFieldByNameidiom);initial_default_ptr()/write_default_ptr()expose the shared pointer so a rebuilt field (e.g. IDreassignment) shares the value instead of copying it.
initial-default/write-defaultusing the existingsingle-value serialization (all primitive types).
Schema::Validate: version-gates theinitial-defaultto format v3(
kMinFormatVersionDefaultValues) — it reinterprets how existing data files are read,so it requires the v3 reader contract. The
write-defaultonly affects values writtengoing forward and is not version-gated (matching Java's
Schema.checkCompatibility,which gates only the initial default). Both defaults are otherwise validated to be
non-null primitive literals matching the field type.
initial-defaultmaps to
FieldProjection::Kind::kDefaultcarrying the literal (the per-format readersconsume this in the follow-up PRs).
Follow-ups (stacked on this PR)
literal_util+ parquet projection/materialization)UpdateSchemaadd/update column defaults)Testing
Added tests