Skip to content

feat(schema): represent, serialize and validate v3 column default values (1/4)#746

Open
huan233usc wants to merge 14 commits into
apache:mainfrom
huan233usc:feat/default-values-schema
Open

feat(schema): represent, serialize and validate v3 column default values (1/4)#746
huan233usc wants to merge 14 commits into
apache:mainfrom
huan233usc:feat/default-values-schema

Conversation

@huan233usc

@huan233usc huan233usc commented Jun 15, 2026

Copy link
Copy Markdown
Contributor

Part 1 of a multi-part split of #730 (column default values, item 2 of #637). The full
end-to-end implementation is in #731, kept open as the proof-of-concept; this series
lands it in reviewable pieces.

This PR is the schema foundation — representing, serializing and validating v3
column default values. It is purely additive and changes no read or write behavior on
its own.

What's in this PR

  • SchemaField carries initial-default / write-default, stored as
    std::shared_ptr<const Literal> (immutable payload shared across copies, like the
    adjacent type_; the C++ analog of Java's final Literal<?>). They are set via the
    constructor. Getters return std::optional<std::reference_wrapper<const Literal>> for
    reading (the Schema::FindFieldByName idiom); initial_default_ptr() /
    write_default_ptr() expose the shared pointer so a rebuilt field (e.g. ID
    reassignment) shares the value instead of copying it.
  • JSON serde: parse/write initial-default / write-default using the existing
    single-value serialization (all primitive types).
  • Schema::Validate: version-gates the initial-default to format v3
    (kMinFormatVersionDefaultValues) — it reinterprets how existing data files are read,
    so it requires the v3 reader contract. The write-default only affects values written
    going forward and is not version-gated (matching Java's Schema.checkCompatibility,
    which gates only the initial default). Both defaults are otherwise validated to be
    non-null primitive literals matching the field type.
  • Generic projection: a column missing from a data file with an initial-default
    maps to FieldProjection::Kind::kDefault carrying the literal (the per-format readers
    consume this in the follow-up PRs).

Follow-ups (stacked on this PR)

  • read path — Parquet (literal_util + parquet projection/materialization)
  • read path — Avro
  • schema evolution (UpdateSchema add/update column defaults)

Testing

Added tests

Comment thread src/iceberg/schema_field.h
Comment thread src/iceberg/json_serde.cc
Comment thread src/iceberg/schema_field.cc Outdated
@huan233usc huan233usc force-pushed the feat/default-values-schema branch 3 times, most recently from 1ee5b32 to 34470af Compare June 16, 2026 05:30
@huan233usc huan233usc requested a review from WZhuo June 16, 2026 05:38

@WZhuo WZhuo left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

Copilot AI left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Adds Iceberg v3 column default value support at the schema layer by carrying, JSON-(de)serializing, validating, and projecting initial-default / write-default literals (foundation for later read/write-path work).

Changes:

  • Extend SchemaField to store initial-default / write-default as shared immutable literals and include them in equality/ID-reassignment rebuilds.
  • Add JSON serde for initial-default / write-default using existing single-value literal serialization.
  • Update schema projection to use FieldProjection::Kind::kDefault when an expected field is missing but has initial-default, and add/extend unit tests + v3 metadata fixture.

Reviewed changes

Copilot reviewed 9 out of 9 changed files in this pull request and generated 2 comments.

Show a summary per file
File Description
src/iceberg/test/schema_util_test.cc Adds projection tests for missing fields with initial-default and for ignoring defaults when the field is present.
src/iceberg/test/schema_test.cc Adds schema validation/version-gating and ID-reassignment preservation tests for default values.
src/iceberg/test/schema_json_test.cc Adds round-trip and failure tests for JSON serialization/parsing of default values (incl. nested structs).
src/iceberg/test/resources/TableMetadataV3Valid.json Adds a v3-valid table metadata JSON fixture.
src/iceberg/schema.cc Preserves defaults during ID reassignment and adds default-related validation in Schema::Validate.
src/iceberg/schema_util.cc Projects missing fields with initial-default as kDefault rather than error/null.
src/iceberg/schema_field.h Extends SchemaField API/storage to carry default literals and expose them via optional reference accessors.
src/iceberg/schema_field.cc Implements default accessors, validation of defaults, and includes defaults in equality.
src/iceberg/json_serde.cc Serializes/parses initial-default / write-default on schema fields via literal single-value serde.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment thread src/iceberg/schema_field.cc
Comment thread src/iceberg/schema.cc
First of a multi-part split of column default value support (apache#730) — the
schema foundation the read and evolution paths build on. Purely additive;
no read/write behavior change on its own.

- SchemaField carries `initial-default` / `write-default` (immutable
  std::shared_ptr<const Literal>) with copy-preserving WithInitialDefault /
  WithWriteDefault modifiers; getters return optional<reference_wrapper>.
- JSON serde reads/writes `initial-default` / `write-default` via the
  existing single-value serialization.
- Schema::Validate rejects default values below format v3 and validates
  they are non-null primitive literals matching the field type.
- Generic schema projection maps a column missing from a data file with an
  initial-default to FieldProjection::Kind::kDefault.

Read-path application (Parquet/Avro) and schema evolution follow in separate
PRs. See apache#731 for the full end-to-end proof-of-concept.
@huan233usc huan233usc force-pushed the feat/default-values-schema branch from 34470af to f663e0e Compare June 18, 2026 17:21
Comment thread src/iceberg/test/resources/TableMetadataV3Valid.json Outdated
Comment thread src/iceberg/json_serde.cc Outdated
This file was added by this PR but is referenced by no test (table-metadata
resources are loaded by explicit name via ReadTableMetadataFromResource) and
carries no column defaults, so it is dead. Remove it.
Add SchemaField default-value coverage directly to json_serde_test (the
binary that exercises json_serde.cc): one test pins the initial-default /
write-default wire format (incl. their absence when unset), and one round-trips
a field with both defaults through ToJson -> FieldFromJson for every primitive
type, exercising the non-trivial date/timestamp/decimal/uuid/binary encodings
the default path reuses.
@huan233usc huan233usc requested a review from zhjwpku June 20, 2026 18:10
@huan233usc

Copy link
Copy Markdown
Contributor Author

Thank you @zhjwpku and @WZhuo for review!
Hi @wgtmac, do you mind take another look as you requested copilot to review on your behalf. Thanks

@wgtmac wgtmac left a comment

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Static review notes; I did not build or run tests.

Comment thread src/iceberg/json_serde.cc
Comment thread src/iceberg/schema_field.cc
Comment thread src/iceberg/schema_field.cc Outdated
Comment thread src/iceberg/test/json_serde_test.cc
Addresses wgtmac's review feedback:

- Enforce UTC offset for timestamptz / timestamptz_ns default values in
  FieldFromJson. The shared timestamp parser accepts any offset and silently
  normalizes to UTC, so C++ could accept default metadata Java rejects and
  rewrite the offset on write. Added TemporalUtils::IsUtcOffset, which reuses
  the existing timezone-suffix parser.
- ValidateDefault now casts the default literal to the field type (Literal::CastTo)
  instead of requiring an exact type match, matching Java's Types.NestedField
  (e.g. an int default on a long field). Uncastable or out-of-range defaults are
  still rejected.
- Documented that non-primitive (struct/list/map) defaults remain unsupported,
  matching Java's current model.
- Extended the json_serde round-trip tests with time/timestamptz/timestamp_ns/
  timestamptz_ns and a negative case for non-UTC timestamptz defaults; added a
  schema test for the cast-to-field-type behavior.
Directly cover the UTC/non-UTC/unparseable offset cases the timestamptz
default-value enforcement relies on.
The AMD64 Windows job failed in its cleanup step (rm -rf example/build:
Device or resource busy) after the example compiled and linked successfully;
re-run CI.
@huan233usc huan233usc requested a review from wgtmac June 23, 2026 21:33

@wgtmac wgtmac left a comment

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Two remaining static review notes; I did not build or run tests.

Comment thread src/iceberg/schema_field.cc Outdated
// be cast to the field type or fall outside its range (CastTo signals out-of-range as
// an above-max/below-min sentinel).
auto field_type = std::static_pointer_cast<PrimitiveType>(field.type());
auto cast = value.CastTo(field_type);

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This validates the cast but keeps storing the original literal. Java stores the result of castDefault, so a programmatic default like Literal::Long(...) on a timestamp field would validate here, then later serialize or project as a long instead of a timestamp default. Either the field needs to retain the casted literal, or this should keep requiring an exact type match.


Status ValidateDefault(const SchemaField& field, const Literal& value,
std::string_view kind) {
if (value.IsNull() || value.IsAboveMax() || value.IsBelowMin()) {

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This rejects explicit null defaults. The spec allows optional field defaults to be null and says they may be explicitly set. Java mostly models a null default as absence, but C++ now has a present-null Literal path that will fail validation and, for initial-default, also trips the format-version gate by presence rather than by non-null value.

@wgtmac wgtmac left a comment

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've just finished a human review over non-test files. Let me know what you think. Thanks!

///
/// The returned reference is a non-owning view into a value owned by this field;
/// it remains valid for the lifetime of this SchemaField.
[[nodiscard]] std::optional<std::reference_wrapper<const Literal>> initial_default()

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we should remove [[nodiscard]] because users should always know what they are doing when these are called. I think we need also clean up existing functions as well.

/// \brief Get the shared owning pointer to the `initial-default` value, or null if
/// absent. Prefer initial_default() for reading; this exists so a rebuilt field can
/// share the (immutable) value rather than copy it.
[[nodiscard]] const std::shared_ptr<const Literal>& initial_default_ptr() const;

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It looks odd to provide both initial_default and initial_default_ptr. Should we keep one, like const std::shared_ptr<const Literal>& initial_default() const?

std::shared_ptr<Type> type_;
bool optional_;
std::string doc_;
// Default values are owned by this field and never mutated after being set; copies

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not a strong opinion but the comment looks too verbose.

}

Result<bool> TemporalUtils::IsUtcOffset(std::string_view str) {
ICEBERG_ASSIGN_OR_RAISE(auto timestamp_with_offset, ParseTimestampWithZoneSuffix(str));

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This allows Z and -00:00 as well. This is not allowed by the spec, right? However, Java impl seems doing the same thing. Perhaps we need to document this behavior explicitly?

#include "iceberg/util/macros.h"

namespace iceberg {

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This always create a new literal even when their types are the same which is unlikely?

Comment thread src/iceberg/schema_field.cc Outdated
// an above-max/below-min sentinel).
auto field_type = std::static_pointer_cast<PrimitiveType>(field.type());
auto cast = value.CastTo(field_type);
if (!cast.has_value() || cast->IsAboveMax() || cast->IsBelowMin()) {

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Perhaps it is better to use ICEBERG_ASSIGN_OR_RAISE to respect the original error instead of checking cast.has_value()?

Comment thread src/iceberg/json_serde.cc Outdated
// Defaults are validated to be primitive literals matching the field type, so
// single-value serialization cannot fail here.
if (field.initial_default().has_value()) {
ICEBERG_ASSIGN_OR_THROW(json[kInitialDefault],

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Though it cannot not fail, we shouldn't use ICEBERG_ASSIGN_OR_THROW here. Let's change the signature to return Result<nlohmann::json> just in case any error.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @wgtmac, this is in general a good call. But the changing the signature to return Resultnlohmann::json just in case any error requires changing lots of places that is not related on default values.

I updated the pr but also create a pure refactor in #785

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks! Merged that PR.

Comment thread src/iceberg/json_serde.cc Outdated
if (!is_utc) {
return JsonParseError(
"Invalid timestamptz default '{}' for {}: default values must use UTC "
"(offset 'Z' or '+00:00')",

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As I have commented earlier, this can also accept -00:00 so the error message is not consistent.

Comment thread src/iceberg/json_serde.cc Outdated
return {};
}
if (!value.is_string()) {
// Let LiteralFromJson report the type mismatch.

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why not directly error out here?

Comment thread src/iceberg/json_serde.cc Outdated
ICEBERG_ASSIGN_OR_RAISE(auto name, GetJsonValue<std::string>(json, kName));
ICEBERG_ASSIGN_OR_RAISE(auto required, GetJsonValue<bool>(json, kRequired));
ICEBERG_ASSIGN_OR_RAISE(auto doc, GetJsonValueOrDefault<std::string>(json, kDoc));
ICEBERG_ASSIGN_OR_RAISE(std::optional<nlohmann::json> initial_default_json,

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why not use auto?

- ValidateDefault: use internal::checked_pointer_cast and ICEBERG_ASSIGN_OR_RAISE
  so an uncastable default surfaces the original CastTo error instead of a generic
  message; keep the out-of-range (sentinel) case as an invalid-schema error.
- json_serde: error out directly when a timestamptz default is not a string, drop
  the offset-specific wording from the UTC error message (it also accepts -00:00),
  and use auto for the parsed default JSON.
- Update schema tests for the propagated cast error and the out-of-range case.
…erializers

Column default values make single-value (Literal) serialization fallible, so the
schema/type serializers must propagate errors instead of throwing. Change ToJson for
SchemaField, Type, Schema, TableMetadata and TableUpdate (plus the REST
CreateTableRequest/CommitTableRequest/LoadTableResult/CommitTableResponse, which embed
them) to return Result<nlohmann::json>, replacing the ICEBERG_ASSIGN_OR_THROW in
SchemaField default serialization with ICEBERG_ASSIGN_OR_RAISE.

- The shared ToJsonList template (also used by infallible serializers) is left
  unchanged; TableMetadata serializes its schema list with an explicit loop.
- REST adds an ICEBERG_DECLARE_JSON_SERDE_FALLIBLE macro variant for the four models
  that embed a schema/metadata; the other REST models keep the bare-json ToJson.
- Callers bottom out at existing Result boundaries (ToJsonString wrappers,
  rest_catalog.cc, TableMetadataUtil::Write).
- Tests unwrap via ICEBERG_UNWRAP_OR_FAIL.
Extract the FromJson declarations into ICEBERG_DECLARE_FROM_JSON so the four
schema/metadata-bearing models declare a Result-returning ToJson without
duplicating the whole serde macro.
Keep ICEBERG_DECLARE_JSON_SERDE identical to upstream and declare the four
schema/metadata-bearing models (LoadTableResult, CreateTableRequest,
CommitTableRequest, CommitTableResponse) explicitly, since only their ToJson
return type (Result) differs.
…-schema

# Conflicts:
#	src/iceberg/catalog/rest/rest_catalog.cc
wgtmac pushed a commit that referenced this pull request Jun 27, 2026
…lizers (#785)

## What

Change the schema/type/metadata `ToJson` serializers from returning bare
`nlohmann::json` to `Result<nlohmann::json>`:

- `json_serde`: `ToJson(SchemaField | Type | Schema | TableMetadata |
TableUpdate)`
- REST: `ToJson(CreateTableRequest | CommitTableRequest |
LoadTableResult | CommitTableResponse)`

Errors propagate via `ICEBERG_ASSIGN_OR_RAISE`; callers bottom out at
the existing `Result`-returning boundaries (`ToJsonString`,
`rest_catalog`, `TableMetadataUtil::Write`).

## Why

Preparation for v3 column default values (#730 / #746). Single-value
(`Literal`) serialization is fallible, and column defaults invoke it
from schema serialization, so these serializers need to propagate the
error instead of throwing. Splitting it out keeps the feature PR focused
on the feature.

## Notes

- **No behavior change** — every conversion still succeeds today; only
the return type changes.
- The shared `ToJsonList` template stays bare (it also serializes
infallible types such as partition specs and snapshots); `TableMetadata`
serializes its schema list with an explicit loop.
- The REST `ICEBERG_DECLARE_JSON_SERDE` macro is unchanged; the four
schema/metadata-bearing models are declared explicitly so only their
`ToJson` return type differs.

## Testing

No behavior change; existing tests are adapted to the new return type.
Full build and `ctest` pass locally.
if (lhs == nullptr || rhs == nullptr) {
return lhs == rhs;
}
return *lhs == *rhs;

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just a reminder that Literal::operator<=> returns unordered when any side IsNull() returns true so two null defaults do not equal. I think we should fix Literal::operator<=> to be null safe?

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Or add a overload Literal::NullSafeEquals()

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants