feat(inspect): add ArrowRowBuilder for materializing Arrow batches#780
feat(inspect): add ArrowRowBuilder for materializing Arrow batches#780WZhuo wants to merge 1 commit into
Conversation
Add ArrowRowBuilder (inspect/row_builder_internal) to materialize in-memory rows into an ArrowArray for an arbitrary Iceberg schema, with typed append helpers (AppendNull/Boolean/Int/String/StringMap) reused by later metadata tables.
| file_writer.cc | ||
| inspect/history_table.cc | ||
| inspect/metadata_table.cc | ||
| inspect/row_builder_internal.cc |
There was a problem hiding this comment.
We can just name it to row_builder.cc because _internal suffix is used for marking header files that do not need to be installed.
|
|
||
| #pragma once | ||
|
|
||
| /// \file iceberg/inspect/row_builder_internal.h |
There was a problem hiding this comment.
Why putting it here instead of treating it as a general utility for building arrow arrays?
| ArrowError error; | ||
| ICEBERG_NANOARROW_RETURN_UNEXPECTED_WITH_ERROR( | ||
| ArrowArrayInitFromSchema(array.get(), &arrow_schema, &error), error); | ||
| ICEBERG_NANOARROW_RETURN_UNEXPECTED(ArrowArrayStartAppending(array.get())); |
There was a problem hiding this comment.
If this call fails, the ArrowArray initialized above is destroyed without ArrowArrayRelease, so nanoarrow-owned state leaks. Please guard it right after ArrowArrayInitFromSchema and only disarm the guard once ownership has moved into ArrowRowBuilder.
|
I think this is a good point to factor out a small internal nanoarrow builder/helper layer instead of adding another local copy of the same lifecycle code. We already have the same pattern in a few production paths: initialize from schema, start appending, append child values, finish each row/list/map element, then finish building. For example, A concrete way to keep this scoped would be:
That should fix the leak path here and avoid repeating this fragile nanoarrow lifecycle in the next metadata-table PRs without turning this into a broad refactor. |
Summary
Adds
ArrowRowBuilder(inspect/row_builder_internal), a schema-driven helper that materializes in-memory rows into an ArrowArrowArray(a batch) for an arbitrary Iceberg schema. It wraps the nanoarrow boilerplate and exposes per-column access plus typed append helpers, so metadata tables (snapshots, history, manifests, …) can emit rows without re-implementing it.This is the first of a series splitting metadata-table support into focused PRs; the
InMemoryBatchReaderand theSnapshotsTable::Scanintegration are intended to follow in separate PRs that build on this.What's included
ArrowRowBuilder::Make/column/FinishRow/Finishand typed helpersAppendNull/AppendBoolean/AppendInt/AppendString/AppendStringMap.iceberglibrary — it only needs nanoarrow +ToArrowSchema(no Apache Arrow), matching peers likemanifest_adapterandarrow_c_data_util.row_builder_test.cccovering typed appends (int32/string/int64/boolean/map), null handling for optional columns, multi-entry/empty string maps, zero-row batches, and column-index bounds. Compiled into themetadata_table_testtarget.Testing
cmake --build build --target metadata_table_testthen ran it — 9/9 tests pass (4 existingMetadataTableTest+ 5 newArrowRowBuilderTest).ctestgreen.arrow::ImportRecordBatch), so its target isUSE_BUNDLE.Notes
metadata_table_test.cccontinues to build there.