ID type #34

@dmgcodevil

Summary

Allow users to name the primary key column when defining a schema, instead of forcing it to be "id". The ID column remains required and auto-generated (INT64), but the name is user-defined.

Motivation

Today every node schema has a hidden, hardcoded "id" column prepended by prepend_id_field(). Users cannot:

  • Name the primary key to match their domain model (e.g. person_id, order_id, pk)
  • Distinguish primary keys across schemas in joined results without relying on schema.id dot-notation
  • Import external datasets that use a different column name for the identifier

The "id" string is baked into ~15 code locations across 10+ files via field_names::kId.

Current Behavior

  1. Schema creation (SchemaRegistry::create in src/schema/schema.cpp):
    Calls prepend_id_field() which unconditionally adds arrow::field("id", arrow::int64()) as field 0.

  2. Node creation (NodeManager::create_node in include/core/node.hpp):
    Writes the auto-generated ID into the field named field_names::kId ("id").

  3. Query execution (src/query/execution.cpp, include/query/row.hpp):
    Looks up field_names::kId by name to extract node IDs for join keys, row deduplication, and result ordering.

  4. Storage restore (src/storage/storage.cpp):
    Matches column_name == field_names::kId to extract the node ID during shard reload.

  5. Shell (apps/tundra_shell.cpp):
    Skips the auto-generated "id" field when inserting nodes via CREATE, and displays it in results.

  6. Edge store (src/core/edge_store.cpp):
    Edges have their own structural "id" column (separate concern, see non-goals).

Proposed Design

Schema-level ID metadata

Add an id_field_name attribute to Schema:

struct Schema {
  // ... existing members ...
  std::string id_field_name_;  // e.g. "person_id", default "id"

  const std::string& id_field_name() const { return id_field_name_; }
  std::shared_ptr<Field> id_field() const { return get_field(id_field_name_); }
};

Schema creation API

When creating a schema, the user specifies which field is the ID:

CREATE (Person { person_id: ID, name: STRING, age: INT32 })

The ID type marker tells the system:

  • This field is the primary key
  • It is INT64, auto-generated, non-nullable
  • Exactly one field per schema must be marked ID

If no field is marked ID, the system can either:

  • (A) Reject the schema with an error: "exactly one ID field required"
  • (B) Auto-prepend a default "id" field (backward-compatible)

Recommendation: Option B for backward compatibility, with a deprecation warning encouraging explicit ID declaration.
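
The validation rule and the Option B fallback could look roughly like the sketch below. It is deliberately independent of Arrow: FieldType, FieldDecl, and resolve_id_field are hypothetical names introduced only for illustration, and the real implementation would operate on the project's own field types.

```cpp
#include <optional>
#include <stdexcept>
#include <string>
#include <vector>

// Hypothetical, simplified field declaration (not the project's real types).
enum class FieldType { Id, Int32, Int64, String };

struct FieldDecl {
  std::string name;
  FieldType type;
};

// Returns the name of the schema's ID field.
// - Exactly one field marked ID: its name is returned, fields are unchanged.
// - No field marked ID: a default "id" field is prepended (Option B).
// - More than one field marked ID: the schema is rejected.
std::string resolve_id_field(std::vector<FieldDecl>& fields) {
  std::optional<std::string> id_name;
  for (const auto& field : fields) {
    if (field.type != FieldType::Id) continue;
    if (id_name) {
      throw std::invalid_argument("exactly one ID field required, found '" +
                                  *id_name + "' and '" + field.name + "'");
    }
    id_name = field.name;
  }
  if (id_name) return *id_name;

  // Option B: keep old schemas working by adding the default ID column.
  fields.insert(fields.begin(), FieldDecl{"id", FieldType::Id});
  return "id";
}
```

Switching to Option A would only change the last branch: instead of prepending "id", it would throw the same "exactly one ID field required" error.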

Internal changes

Replace all field_names::kId lookups on node schemas with schema->id_field_name():

| Location | Current | Proposed |
|---|---|---|
| Schema::create_arrow_schema | Prepends hardcoded "id" | Validates that exactly one ID-typed field exists |
| NodeManager::create_node | schema->get_field("id") | schema->id_field() |
| prepend_id_field() | Adds arrow::field("id", int64) | Remove; ID field is part of user schema |
| Storage::read_shard | column_name == kId | column_name == schema->id_field_name() |
| Row::extract_schema_ids | field == kId | field == schema->id_field_name() |
| Query execution (join keys, dedup) | kId | Resolve from schema at query-plan time |
| tundra_shell.cpp CREATE | Skips "id" field | Skips the schema's id_field_name() |
| Shard min_id / max_id tracking | Assumes "id" | Uses schema->id_field_name() |
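
As a rough illustration of schema-aware ID resolution for joined rows, the sketch below looks up each schema's ID under its own column name. SchemaInfo, Row, and extract_node_id are hypothetical stand-ins, and the "Schema.field" key format is an assumption made for this example, not necessarily how Row stores qualified columns.

```cpp
#include <cstdint>
#include <optional>
#include <string>
#include <unordered_map>

// Hypothetical minimal view of a schema: its name and its ID column name.
struct SchemaInfo {
  std::string name;
  std::string id_field_name = "id";  // default keeps legacy schemas working
};

// Hypothetical joined row: qualified column name -> INT64 value.
using Row = std::unordered_map<std::string, std::int64_t>;

// Resolves the node ID for `schema` from a joined row by using the schema's
// own ID column name instead of a hardcoded "id".
std::optional<std::int64_t> extract_node_id(const Row& row,
                                            const SchemaInfo& schema) {
  const auto it = row.find(schema.name + "." + schema.id_field_name);
  if (it == row.end()) return std::nullopt;
  return it->second;
}
```

Resolving the column name once at query-plan time, as the table suggests, would avoid rebuilding the qualified key for every row; the per-row lookup here is only to keep the sketch short.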

Persistence / metadata

SchemaMetadata (used for snapshot serialization) needs a new field:

{
  "name": "Person",
  "id_field": "person_id",
  "fields": [ ... ]
}

Existing snapshots without "id_field" default to "id" for backward compatibility.
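
The backward-compatible default can be expressed as an optional member that falls back to "id" when an older snapshot omits it. This is a minimal sketch: the struct below is a simplified stand-in for the real SchemaMetadata, and effective_id_field and kDefaultIdField are hypothetical names.

```cpp
#include <optional>
#include <string>

// Simplified stand-in for the snapshot metadata of one schema.
struct SchemaMetadata {
  std::string name;
  std::optional<std::string> id_field;  // absent in pre-change snapshots
};

// Hypothetical constant; mirrors the existing default ID column name.
inline constexpr const char* kDefaultIdField = "id";

// Returns the ID column name to use when restoring a schema, falling back to
// the legacy "id" when the snapshot has no "id_field" entry.
std::string effective_id_field(const SchemaMetadata& meta) {
  return meta.id_field.value_or(kDefaultIdField);
}
```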

Edge ID columns

Edge structural columns ("id", "source_id", "target_id", "created_ts") remain hardcoded. Edges are system-managed and not user-schema-defined — this is a separate concern. A future enhancement could allow custom edge property schemas, but that is out of scope here.

Affected Files

| File | Change |
|---|---|
| include/schema/schema.hpp | Add id_field_name_, id_field(), validation |
| src/schema/schema.cpp | Validate ID field on creation, remove prepend_id_field usage |
| include/arrow/utils.hpp | Deprecate/remove prepend_id_field() |
| src/arrow/utils.cpp | Remove prepend_id_field() |
| include/core/node.hpp | Use schema->id_field() instead of kId |
| src/storage/storage.cpp | Use schema->id_field_name() for ID column detection |
| include/query/row.hpp | Use schema-aware ID resolution |
| src/query/execution.cpp | Use schema-aware ID resolution in join/dedup logic |
| apps/tundra_shell.cpp | Use schema-aware ID skip in CREATE, display in results |
| include/storage/metadata.hpp | Add id_field to SchemaMetadata serialization |
| include/common/constants.hpp | kId becomes the default, not the only option |
| src/main/database.cpp | Pass schema context where ID field name is needed |

Testing

  • Unit test: create schema with custom ID name, insert nodes, verify ID column name in results
  • Unit test: create schema without explicit ID, verify "id" is auto-prepended (backward compat)
  • Unit test: reject schema with two ID fields
  • Unit test: reject schema with zero ID fields (if option A chosen)
  • Snapshot round-trip: save and restore a schema with custom ID, verify field names survive
  • Join test: join two schemas with different ID column names
  • Shell test: CREATE/MATCH with custom ID field name

Non-Goals

  • Custom ID types (e.g. UUID, STRING primary keys) — INT64 auto-increment only for now
  • Custom edge ID column names — edges remain system-managed
  • Composite primary keys — single-column only
