Description
Hello GraphAr community,
I would like to propose the creation of a native Go SDK for GraphAr and am willing to kickstart the implementation and maintain it.
Motivation
Go (Golang) is a dominant language in the cloud-native landscape, widely used for building high-performance backend services, data pipelines, and infrastructure tooling. Currently, a Go application that needs to interact with GraphAr data must rely on workarounds such as cgo bindings to the C++ library or inter-process communication with a Java/Spark service.
A native Go SDK would significantly lower the barrier to adoption for the vast Go ecosystem by providing an idiomatic, efficient, pure-Go way to read and write GraphAr-formatted data. This would enable direct integration with Go-based graph databases, analysis tools, and data processing frameworks.
Preliminary Design Proposal (Open for Feedback)
To ensure consistency and maintainability, the Go SDK's design will closely follow the architecture of the existing C++, Java/Spark, Python, and Rust libraries. The core idea is the same layered approach:
- info package: pure Go data structures representing the GraphAr schema (GraphInfo, VertexInfo, EdgeInfo, PropertyGroup, etc.), plus logic for parsing and serializing the .info.yml files. This will leverage the existing Proto definitions introduced in PR #573.
- storage package: an abstraction layer over the underlying storage (e.g., local filesystem, S3), keeping the SDK storage-agnostic.
- parquet package: a dedicated module for Parquet file I/O, since Parquet is the most common payload format. This will build on a robust Go Parquet library (e.g., parquet-go or arrow/go).
- reader package: high-level APIs to read vertex/edge chunks, handling the different AdjList types and navigating property groups.
- writer package: high-level APIs to write vertex/edge data into the correct chunked and partitioned directory structure, including generating offset and metadata files.
Phased Development Plan
I propose tackling this in a phased approach to deliver value incrementally and gather feedback:
Phase 1: Core Schema & Reader (Read-Only)
- Implement the info package for YAML parsing and validation (based on the generated Protos).
- Implement the storage layer with an initial local filesystem backend.
- Implement the parquet reader logic.
- Implement the reader package to read existing GraphAr vertex and edge data (all AdjList types).
- Add comprehensive unit tests using sample data generated by the Spark library to ensure compatibility.
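As a taste of what the reader layer involves, the sketch below shows the chunk-addressing arithmetic: with a fixed chunk size declared in the schema, a vertex's internal id maps to a chunk index by integer division, and vertex property chunks live under a prefix/vertex/label/property-group/chunkN layout (my reading of the GraphAr spec; function and parameter names are hypothetical):

```go
package main

import (
	"fmt"
	"path"
)

// chunkIndexOf maps a vertex internal id to its chunk, given the
// fixed chunk size declared in the vertex info file.
func chunkIndexOf(vertexID, chunkSize int64) int64 {
	return vertexID / chunkSize
}

// vertexChunkPath computes where a vertex property chunk lives,
// following the layout <prefix>/vertex/<label>/<property-group>/chunk<i>.
func vertexChunkPath(prefix, label, propertyGroup string, chunkIndex int64) string {
	return path.Join(prefix, "vertex", label, propertyGroup,
		fmt.Sprintf("chunk%d", chunkIndex))
}

func main() {
	// Vertex 250 with chunk size 100 falls into chunk 2.
	idx := chunkIndexOf(250, 100)
	fmt.Println(idx)                                        // prints 2
	fmt.Println(vertexChunkPath("/data/ldbc", "person", "id", idx))
}
```

The reader package would wrap this addressing with Parquet decoding and AdjList-specific iteration.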
Phase 2: Writer Implementation (Read-Write)
- Implement the writer package to generate valid GraphAr directory structures and metadata.
- Implement logic to write vertex and edge property chunks to Parquet files.
- Implement logic for writing adjlist and offset chunks.
- Add round-trip tests (write with the Go SDK, read with the Go/Spark SDK).
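The offset chunks mentioned above are essentially CSR-style prefix sums over source-vertex degrees. A minimal sketch of that computation (function name and the in-memory []int64 representation are assumptions; the real writer would emit the result to Parquet offset chunks):

```go
package main

import "fmt"

// buildOffsets computes the offset array for an edge list sorted by
// source vertex (the ordered_by_source AdjList type):
// offsets[v+1]-offsets[v] is the out-degree of source vertex v.
func buildOffsets(srcIDs []int64, vertexCount int64) []int64 {
	offsets := make([]int64, vertexCount+1)
	// Count the degree of each source vertex.
	for _, src := range srcIDs {
		offsets[src+1]++
	}
	// Prefix-sum the degrees into offsets.
	for i := int64(1); i <= vertexCount; i++ {
		offsets[i] += offsets[i-1]
	}
	return offsets
}

func main() {
	// Four edges with sources 0, 0, 1, 3 over 4 source vertices.
	fmt.Println(buildOffsets([]int64{0, 0, 1, 3}, 4)) // prints [0 2 3 3 4]
}
```

A round-trip test would feed such arrays through the Go writer and verify the Spark reader recovers identical adjacency.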
Phase 3: Advanced Features & Optimization
- Add support for other storage backends (e.g., S3).
- Profile and optimize performance.
- Add high-level graph traversal APIs (optional, based on community needs).
Next Steps
I am excited about the possibility of bringing GraphAr to the Go community.
As I am new to this codebase, I would greatly appreciate any guidance from the mentors regarding the directory structure, CI integration, or any specific requirements I should be aware of.
If the community gives the green light on this plan, I am ready to start working on Phase 1.
Thanks!
Component(s)
Other