Add example ALP files #105

@alamb

As part of adding the Adaptive Lossless Floating-Point (ALP) encoding to the Parquet standard, we should provide sample files in the parquet-testing repo that other implementations can use to verify that they read such files correctly.

Quoting @CurtHagenlocher on the dev list

As part of the process of amending the Parquet format, perhaps it would be a good idea for early implementations to generate sample files and commit them to apache/parquet-testing: Apache Parquet Testing for other implementations to leverage?

Suggested Requirements

Size

Given that the parquet-testing repository is checked out many times by many different repositories as part of CI, it is important to keep the size of these example files down. Each file should be no more than a few KB.

Reference Values

I suggest we follow the model of BYTE_STREAM_SPLIT (see here) and create a single parquet file that has multiple columns with the different test and validation sets.

For example, the file could have one column of PLAIN encoded f32 and one column of PLAIN encoded f64 as baselines, followed by several columns of the same data encoded using ALP with different parameters (to cover different parts of the spec).
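A reader could then check each ALP column against its PLAIN baseline. The sketch below assumes both columns have already been decoded into Python lists (with `None` for nulls); the helper names `f64_bits` and `first_mismatch` are hypothetical and not part of any Parquet library. It compares bit patterns rather than using `==`, so that NaN matches NaN and `0.0` is distinguished from `-0.0`.

```python
import struct
from typing import Optional, Sequence


def f64_bits(value: Optional[float]) -> Optional[int]:
    """Return the raw 64-bit pattern of a double, or None for a null."""
    if value is None:
        return None
    return struct.unpack("<Q", struct.pack("<d", value))[0]


def first_mismatch(
    baseline: Sequence[Optional[float]],
    decoded: Sequence[Optional[float]],
) -> Optional[int]:
    """Return the index of the first differing value, or None if identical.

    A length difference is reported at the index where the shorter
    column ends.
    """
    if len(baseline) != len(decoded):
        return min(len(baseline), len(decoded))
    for i, (expected, actual) in enumerate(zip(baseline, decoded)):
        if f64_bits(expected) != f64_bits(actual):
            return i
    return None
```

For f32 columns the same idea applies with the `"<f"` and `"<I"` struct formats, provided the values are kept as 32-bit floats end to end.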

ALP patterns

We should ensure the dataset has ALP data with the following properties:

  • Vectors with no exceptions
  • Vectors with 1 exception
  • Vectors with many exceptions
  • Vectors with NaN, Inf, etc.
  • Vectors where most values are exceptions (e.g. random float data)
  • Vectors where ALL values are exceptions
  • Vectors with NULL values
  • A page with multiple Vectors that have different exponents/values
  • All possible ALP bit widths (1 -> 15 == 65k)
  • Both f32 and f64
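As a rough illustration of how some of these patterns could be generated, here is a sketch using a simplified ALP round trip based on the ALP paper (encode as `round(v * 10^e * 10^-f)`, decode as `d * 10^f * 10^-e`, and treat any value that does not round-trip bit-exactly as an exception). Everything here is an assumption for illustration: the vector length of 1024, the fixed `e = f = 0`, and the names `is_exception`, `count_exceptions`, and `build_vectors`. A real encoder chooses `e` and `f` per vector, and the Parquet spec may define details differently.

```python
import math
import struct

VECTOR_SIZE = 1024  # assumed vector length, taken from the ALP paper


def is_exception(value: float, e: int = 0, f: int = 0) -> bool:
    """True if `value` fails the simplified ALP round trip for (e, f)."""
    if math.isnan(value) or math.isinf(value):
        return True
    encoded = round(value * 10.0**e * 10.0**-f)
    decoded = encoded * 10.0**f * 10.0**-e
    # Compare bit patterns so that -0.0 is not mistaken for 0.0.
    return struct.pack("<d", decoded) != struct.pack("<d", value)


def count_exceptions(vector, e: int = 0, f: int = 0) -> int:
    """Count exceptions among the non-null values of a vector."""
    return sum(is_exception(v, e, f) for v in vector if v is not None)


def build_vectors() -> dict:
    """Build one vector per pattern, all valid for e = f = 0."""
    whole = [float(i) for i in range(VECTOR_SIZE)]  # integers round-trip

    one_exception = list(whole)
    one_exception[7] = 0.5  # rounds to 0, so it cannot round-trip

    special = list(whole)
    special[:4] = [float("nan"), float("inf"), float("-inf"), -0.0]

    all_exceptions = [i + 0.5 for i in range(VECTOR_SIZE)]

    with_nulls = [None if i % 10 == 0 else v for i, v in enumerate(whole)]

    return {
        "no_exceptions": whole,
        "one_exception": one_exception,
        "special_values": special,
        "all_exceptions": all_exceptions,
        "with_nulls": with_nulls,
    }
```

Counting exceptions per vector with `count_exceptions` gives a quick check that each generated vector actually exercises the pattern it is named after before it is written to a file.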

Documentation

Here is some other documentation that I think shows best practice.
