@Zouxxyy Zouxxyy commented Jan 13, 2026

Purpose

Subtask of #4471.

New interface:

// Interface for FormatWriterFactory implementations that support variant schema inference.
public interface SupportsVariantInference {

    FormatWriter createWithShreddingSchema(
            PositionOutputStream out, String compression, RowType inferredShreddingSchema)
            throws IOException;
}
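
For illustration only, here is a minimal sketch of how a caller might pick between the regular factory path and the new one. The nested types are simplified stand-ins declared for this example (an `OutputStream` and a `String` replace `PositionOutputStream` and `RowType`), and the `open` helper is hypothetical, not code from this PR.

```java
import java.io.IOException;
import java.io.OutputStream;

/** Hypothetical dispatch sketch; the nested interfaces are stand-ins, not the real Paimon types. */
final class VariantInferenceDispatchSketch {

    /** Stand-in for a format writer. */
    interface FormatWriter {
        void addElement(String row) throws IOException;
    }

    /** Stand-in for the regular writer factory. */
    interface FormatWriterFactory {
        FormatWriter create(OutputStream out, String compression) throws IOException;
    }

    /** Stand-in for the interface introduced above; the schema is a plain String here. */
    interface SupportsVariantInference {
        FormatWriter createWithShreddingSchema(
                OutputStream out, String compression, String inferredShreddingSchema)
                throws IOException;
    }

    /** Uses the inferred schema only when the factory opts in; otherwise falls back. */
    static FormatWriter open(
            FormatWriterFactory factory,
            OutputStream out,
            String compression,
            String inferredSchema)
            throws IOException {
        if (factory instanceof SupportsVariantInference) {
            return ((SupportsVariantInference) factory)
                    .createWithShreddingSchema(out, compression, inferredSchema);
        }
        return factory.create(out, compression);
    }
}
```

Formats that do not implement the interface are unaffected in this sketch, which is the main appeal of an opt-in marker interface.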

New implementation:

/**
 * A generic writer that infers the shredding schema from buffered rows before writing.
 *
 * <p>This writer buffers rows up to a threshold, infers the optimal schema from them, then writes
 * all data using the inferred schema. It works with any format that implements {@link
 * SupportsVariantInference}.
 */
InferVariantShreddingWriter
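
The buffer-then-infer flow described in the Javadoc can be sketched roughly as follows. This is not the actual `InferVariantShreddingWriter`: the class name, the `Sink` interface, and the two function parameters are hypothetical placeholders for the schema inference step and for opening the underlying format writer.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Function;

/** Hypothetical sketch of a writer that buffers rows, infers a schema once, then streams. */
final class BufferThenInferSketch<R, S> {

    /** Stand-in for a format writer that is already bound to a schema. */
    interface Sink<R> {
        void write(R row);
    }

    private final int maxBufferRows;
    private final Function<List<R>, S> inferSchema;
    private final Function<S, Sink<R>> openSink;
    private final List<R> buffer = new ArrayList<>();
    private Sink<R> sink; // stays null until the schema has been inferred

    BufferThenInferSketch(
            int maxBufferRows,
            Function<List<R>, S> inferSchema,
            Function<S, Sink<R>> openSink) {
        this.maxBufferRows = maxBufferRows;
        this.inferSchema = inferSchema;
        this.openSink = openSink;
    }

    void write(R row) {
        if (sink != null) {
            // Schema already fixed: write straight through.
            sink.write(row);
            return;
        }
        buffer.add(row);
        if (buffer.size() >= maxBufferRows) {
            flushBuffer();
        }
    }

    /** Handles inputs that never reached the buffer threshold (possibly zero rows). */
    void close() {
        if (sink == null) {
            flushBuffer();
        }
    }

    private void flushBuffer() {
        S schema = inferSchema.apply(buffer);
        sink = openSink.apply(schema);
        for (R row : buffer) {
            sink.write(row);
        }
        buffer.clear();
    }
}
```

In this sketch the schema is inferred exactly once per writer, so rows arriving after the threshold are written with whatever schema the first buffered rows produced.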

New options:

    public static final ConfigOption<Integer> VARIANT_SHREDDING_MAX_SCHEMA_WIDTH =
            key("variant.shredding.maxSchemaWidth")
                    .intType()
                    .defaultValue(300)
                    .withDescription(
                            "Maximum number of shredded fields allowed in an inferred schema.");

    public static final ConfigOption<Integer> VARIANT_SHREDDING_MAX_SCHEMA_DEPTH =
            key("variant.shredding.maxSchemaDepth")
                    .intType()
                    .defaultValue(50)
                    .withDescription(
                            "Maximum traversal depth in Variant values during schema inference.");

    public static final ConfigOption<Double> VARIANT_SHREDDING_MIN_FIELD_CARDINALITY_RATIO =
            key("variant.shredding.minFieldCardinalityRatio")
                    .doubleType()
                    .defaultValue(0.1)
                    .withDescription(
                            "Minimum fraction of rows that must contain a field for it to be shredded. "
                                    + "Fields below this threshold will remain in the un-shredded Variant binary.");

    public static final ConfigOption<Integer> VARIANT_SHREDDING_MAX_INFER_BUFFER_ROW =
            key("variant.shredding.maxInferBufferRow")
                    .intType()
                    .defaultValue(4096)
                    .withDescription("Maximum number of rows to buffer for schema inference.");
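
As a rough illustration of how `variant.shredding.minFieldCardinalityRatio` and `variant.shredding.maxSchemaWidth` could interact, the sketch below counts how often each field appears in the buffered rows and keeps the ones that pass the ratio, up to the width limit. The class and method are hypothetical, rows are reduced to sets of top-level field names, and both the inclusive threshold and the alphabetical tie-breaking under the width cap are assumptions; the PR does not state how the real inference orders or truncates fields.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;

/** Hypothetical sketch of field selection; not the inference code from this PR. */
final class ShreddingFieldSelectionSketch {

    /**
     * @param rows one set of field names per buffered row
     * @param minRatio minimum fraction of rows that must contain a field (assumed inclusive)
     * @param maxWidth maximum number of fields to shred
     */
    static List<String> selectFields(List<Set<String>> rows, double minRatio, int maxWidth) {
        // TreeMap keeps fields in alphabetical order, an arbitrary choice for this sketch.
        Map<String, Integer> counts = new TreeMap<>();
        for (Set<String> row : rows) {
            for (String field : row) {
                counts.merge(field, 1, Integer::sum);
            }
        }
        List<String> selected = new ArrayList<>();
        for (Map.Entry<String, Integer> entry : counts.entrySet()) {
            if (selected.size() >= maxWidth) {
                break; // width limit reached; remaining fields stay un-shredded
            }
            double ratio = (double) entry.getValue() / rows.size();
            if (ratio >= minRatio) {
                selected.add(entry.getKey());
            }
        }
        return selected;
    }
}
```

With the default ratio of 0.1 and a full buffer of 4096 rows, a field would need to appear in at least 410 rows to be selected under these assumptions.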

The code was generated with the help of Qoder AI.

@Zouxxyy Zouxxyy force-pushed the dev/variant-write-infer1 branch from 6892290 to 50f1666 Compare January 14, 2026 03:54
@Zouxxyy Zouxxyy force-pushed the dev/variant-write-infer1 branch from 50f1666 to b8e25ba Compare January 14, 2026 04:04
@Zouxxyy Zouxxyy changed the title from "[variant] Introduce InferVariantShreddingParquetWriter" to "[variant] Introduce InferVariantShreddingWriter" Jan 14, 2026
@JingsongLi
Contributor

+1

@JingsongLi JingsongLi merged commit 049e1e6 into apache:master Jan 14, 2026
14 of 15 checks passed