semanticarts/gist-deref

gist Static RDF and HTML Content for Dereferencing IRIs (gist-deref)

Static dereferenceable linked-data artifacts for the Semantic Arts gist ontology.

This repository generates and hosts one small RDF fragment per named gist: term, so that authoritative IRIs such as https://w3id.org/semanticarts/ns/ontology/gist/Account resolve to useful machine-readable descriptions, or to human-readable descriptions when these are published alongside the generated content. The content in docs/ can be hosted on any static HTTP or HTTPS server, and it can be made available at gist's authoritative IRIs through w3id.org redirects to that server, with basic content negotiation as proposed below.

Dereferencing Behavior and W3C Documents

No W3C Recommendation defines which triples a server should return when an ontology term IRI is dereferenced, and no common practice has been widely adopted.

The relevant W3C documents are:

RDF 1.1 Concepts (a Recommendation) is explicit that the RDF specs do not address dereferencing behavior:

"Perhaps the most important characteristic of IRIs in web architecture is that they can be dereferenced, and hence serve as starting points for interactions with a remote server. This specification is not concerned with such interactions. It does not define an interaction model."

Concise Bounded Description (CBD) and Symmetric CBD (SCBD)

The Concise Bounded Description (CBD) is the subgraph of all triples whose subject is the given term, plus recursively the CBDs of any blank nodes reached as objects. This provides a self-contained description of a node's outgoing properties.

The Symmetric CBD (SCBD) extends the CBD by also including all triples where the term appears as the object, pulling in incoming edges too. The result is a fully symmetric neighborhood of the node in both directions.

Because gist makes extensive use of axioms, much of the meaning of a given term is defined through incoming edges, by axioms that are grouped not with the term itself but with other, closely related terms. We use SCBD to give the consumer all axioms that directly pertain to the meaning of the referenced term.
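
The two definitions above can be sketched in a few lines of plain Python. This is only an illustration, not the code in scbd_no_orphans.py: triples are modeled as tuples, strings starting with "_:" stand in for blank nodes, and the ex: terms and the names cbd, scbd, and GRAPH are made up for the example. The SCBD here simply adds incoming triples to the CBD, as described above, and does not walk backward through blank-node chains.

```python
# Triples are plain (subject, predicate, object) tuples. Terms that start
# with "_:" stand in for blank nodes. All names and data are illustrative.
GRAPH = [
    ("ex:Account", "rdf:type", "owl:Class"),
    ("ex:Account", "rdfs:subClassOf", "_:r1"),
    ("_:r1", "owl:onProperty", "ex:hasParty"),
    ("ex:Balance", "ex:isAbout", "ex:Account"),
    ("ex:Person", "rdf:type", "owl:Class"),
]


def is_bnode(term):
    return isinstance(term, str) and term.startswith("_:")


def cbd(graph, node):
    """Triples with `node` as subject, plus the CBD of each blank-node object."""
    result, queue, seen = set(), [node], {node}
    while queue:
        current = queue.pop()
        for s, p, o in graph:
            if s == current:
                result.add((s, p, o))
                if is_bnode(o) and o not in seen:
                    seen.add(o)
                    queue.append(o)
    return result


def scbd(graph, node):
    """Simplified SCBD: the CBD plus every triple that has `node` as object."""
    incoming = {(s, p, o) for s, p, o in graph if o == node}
    return cbd(graph, node) | incoming


outgoing_only = cbd(GRAPH, "ex:Account")
both_directions = scbd(GRAPH, "ex:Account")
```

In this toy graph the CBD of ex:Account holds its two outgoing triples plus the triple hanging off the blank node _:r1, while the SCBD additionally picks up the incoming ex:isAbout edge from ex:Balance.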

What Is Included in this Repository

  • Hosted Artifacts
    • docs/terms/: generated static term files. The current checkout contains 216 terms in three serializations each:
      • Turtle: *.ttl
      • RDF/XML: *.rdf
      • JSON-LD: *.jsonld
    • docs/ontologies/: manually-added, WIDOCO-generated files for human-readable presentation.
    • docs/demo.html: demonstration page for returning results independently of redirects from authoritative IRIs. Not served as the site root, so the namespace IRI's HTML branch can route elsewhere.
  • Redirection Rules
    • tools/semanticarts.htaccess: proposed w3id.org Apache rewrite rules for the gist namespace and ontology document IRIs.
  • Artifact-Generation Code & Tests
    • scbd_no_orphans.py: core extraction function — implements the SCBD variant described below (SCBD with orphan blank-node fragments filtered out).
    • relabel.py: deterministic blank-node relabeling — replaces rdflib's parse-time bnode IDs with hashes of canonical anchor paths so re-runs don't churn IDs.
    • canonicalize.py: post-processing for the JSON-LD and RDF/XML serializer output — sorts dicts, arrays, and XML elements so element ordering is stable across runs (preserving @list order, which is semantically significant).
    • build.py: CLI that loads one or more source ontology Turtle files and writes per-term fragments to an output directory.
    • tests/: pytest suites covering the extraction logic (test_scbd_no_orphans.py, 6 cases), the relabeling (test_relabel.py, 7 cases — including a round-trip isomorphism check for every serialization), and the canonicalization (test_canonicalize.py, 8 cases — semantics preservation, idempotence, byte stability, and @list order preservation).

The generated files in this repository currently target gist 14.1.0.

How the Term-Specific RDF Serializations are Generated

build.py loads the source ontology modules into a single merged graph, then for each named term in the target namespace writes a per-term fragment to docs/terms/{LocalName}.{ttl,rdf,jsonld}.

Extraction of Symmetric Concise Bounded Description (SCBD)

Each fragment is the Symmetric Concise Bounded Description (SCBD) of the term, as defined in the W3C Member Submission CBD - Concise Bounded Description, with one minor adjustment: orphan blank-node fragments are filtered out.

The extraction proceeds in two phases:

  • Phase 1 — outgoing CBD: all triples reachable from the term via blank-node chains (restrictions, list cells, class expressions, etc.) are included.
  • Phase 2 — back-references: for each triple (s, p, term) where s is a blank node, the algorithm walks backward through blank-node chains looking for a named (IRI) ancestor. If one is found, the full chain and its CBD expansion are included. If no named ancestor exists, the fragment is dropped.

That drop step is the only departure from spec-compliant SCBD, and it's a minor one: the dropped fragments are blank-node subgraphs that have no path to any named IRI in the source graph (for example, stray one-element rdf:List cells that survive serialization but no longer belong to a containing class expression). Including them in a per-term fragment is noise — the consumer cannot interpret a list cell without the class expression that owns it.

This still captures every genuinely useful back-reference, such as owl:unionOf or owl:intersectionOf expressions on other named classes that reference the term.
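
The Phase 2 back-reference walk and the orphan filter can be sketched as follows. This is a simplified illustration under stated assumptions, not the repository's implementation: triples are plain tuples, "_:"-prefixed strings stand in for blank nodes, the function names are hypothetical, and the CBD expansion of the retained chain is left out to keep the example short.

```python
# Illustrative data: ex:Party is a union class whose list mentions
# ex:Organization, and "_:stray" is a list cell with no owner.
GRAPH = [
    ("ex:Party", "owl:unionOf", "_:l1"),
    ("_:l1", "rdf:first", "ex:Person"),
    ("_:l1", "rdf:rest", "_:l2"),
    ("_:l2", "rdf:first", "ex:Organization"),
    ("_:l2", "rdf:rest", "rdf:nil"),
    ("_:stray", "rdf:first", "ex:Organization"),
    ("_:stray", "rdf:rest", "rdf:nil"),
]


def is_bnode(term):
    return isinstance(term, str) and term.startswith("_:")


def named_ancestors(graph, bnode):
    """Walk backward from a blank node; return (named IRIs found, chain triples)."""
    named, chain = set(), set()
    queue, seen = [bnode], {bnode}
    while queue:
        current = queue.pop()
        for s, p, o in graph:
            if o != current:
                continue
            chain.add((s, p, o))
            if not is_bnode(s):
                named.add(s)
            elif s not in seen:
                seen.add(s)
                queue.append(s)
    return named, chain


def back_references(graph, term):
    """Incoming triples for `term`; blank-node chains without a named ancestor are dropped."""
    kept = set()
    for s, p, o in graph:
        if o != term:
            continue
        if not is_bnode(s):
            kept.add((s, p, o))  # direct reference from a named node
            continue
        named, chain = named_ancestors(graph, s)
        if named:  # anchored to an IRI: keep the triple and its chain
            kept.add((s, p, o))
            kept |= chain
    return kept


refs = back_references(GRAPH, "ex:Organization")
```

Here the reference through _:l2 is kept because walking backward reaches ex:Party, while the triple on _:stray is dropped because no named node leads to it.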

Deterministic output

rdflib assigns fresh blank-node identifiers on every parse, and its RDF/XML and JSON-LD serializers emit elements in set-iteration order. Both effects would otherwise produce noisy diffs on every rebuild. build.py neutralizes them in two passes after extraction:

  • relabel.py replaces each blank node with one whose identifier is hash(canonical-path-from-named-ancestor). Structurally identical bnodes collapse to a single label, which is sound under RDF simple entailment.
  • canonicalize.py post-processes the serializer output. JSON-LD dicts and arrays are sorted recursively, except for arrays under @list (which encode rdf:List and are semantically ordered). RDF/XML elements are sorted by (tag, attributes, text, subtree-signature); CR character references (such as &#13;) are swapped through a Unicode sentinel across the parse/serialize cycle so that CRLF line endings inside multi-line literals survive XML 1.0's line-ending normalization.
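
The relabeling idea can be illustrated with a small stdlib-only sketch. It is an assumption-laden approximation rather than what relabel.py does: the "canonical path" is taken to be the lexicographically smallest string of the form "IRI predicate predicate ..." that leads from a named node to the blank node, the label is a truncated SHA-256 of that string, and the names anchor_path and relabel are hypothetical.

```python
import hashlib


def is_bnode(term):
    return isinstance(term, str) and term.startswith("_:")


def anchor_path(graph, bnode, _seen=frozenset()):
    """Smallest 'IRI predicate predicate ...' path reaching `bnode`, or None."""
    candidates = []
    for s, p, o in graph:
        if o != bnode or s in _seen:
            continue
        if is_bnode(s):
            parent = anchor_path(graph, s, _seen | {bnode})
            if parent is not None:
                candidates.append(f"{parent} {p}")
        else:
            candidates.append(f"{s} {p}")
    return min(candidates) if candidates else None


def relabel(graph):
    """Replace each anchored blank node with a label derived from its path hash."""
    bnodes = {t for s, _, o in graph for t in (s, o) if is_bnode(t)}
    mapping = {}
    for node in bnodes:
        path = anchor_path(graph, node)
        if path is not None:  # unanchored blank nodes keep their label
            digest = hashlib.sha256(path.encode("utf-8")).hexdigest()[:12]
            mapping[node] = f"_:b{digest}"
    return {(mapping.get(s, s), p, mapping.get(o, o)) for s, p, o in graph}


# The same structure parsed twice, with different parse-time blank-node IDs.
first_parse = [
    ("ex:Account", "rdfs:subClassOf", "_:n1"),
    ("_:n1", "owl:onProperty", "ex:hasParty"),
]
second_parse = [
    ("ex:Account", "rdfs:subClassOf", "_:x9"),
    ("_:x9", "owl:onProperty", "ex:hasParty"),
]
stable = relabel(first_parse) == relabel(second_parse)
```

Because the label depends only on the path from the named ancestor, both parses end up with the same identifiers. Two blank nodes that share a path would also share a label in this sketch, which mirrors the collapsing behavior described above.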

Result: rerunning build.py on an unchanged source ontology produces byte-identical .ttl, .rdf, and .jsonld files. The round-trip test in test_relabel.py verifies that every serialization parses back to a graph isomorphic to the original.
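
As a rough sketch of the JSON-LD half of that canonicalization, the function below sorts dict keys and array members recursively while leaving arrays under @list in their original order. It is not the code in canonicalize.py; the function names and the choice of a JSON dump as the sort key for array members are assumptions made for the example.

```python
import json


def canonicalize(value, in_list=False):
    """Recursively sort dict keys and arrays, keeping @list arrays in order."""
    if isinstance(value, dict):
        return {
            key: canonicalize(value[key], in_list=(key == "@list"))
            for key in sorted(value)
        }
    if isinstance(value, list):
        items = [canonicalize(item) for item in value]
        if in_list:
            return items  # rdf:List order is semantically significant
        return sorted(items, key=lambda item: json.dumps(item, sort_keys=True))
    return value


def to_canonical_json(document):
    return json.dumps(canonicalize(document), indent=2, ensure_ascii=False)


# Two serializations of the same node that differ only in ordering.
run_a = {
    "@type": ["owl:Class", "ex:A"],
    "@id": "ex:Party",
    "owl:unionOf": {"@list": [{"@id": "ex:Person"}, {"@id": "ex:Organization"}]},
}
run_b = {
    "owl:unionOf": {"@list": [{"@id": "ex:Person"}, {"@id": "ex:Organization"}]},
    "@id": "ex:Party",
    "@type": ["ex:A", "owl:Class"],
}
same_bytes = to_canonical_json(run_a) == to_canonical_json(run_b)
```

Both inputs serialize to the same text, and the members of the @list array stay in the order ex:Person, ex:Organization.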

Requirements

  • Python 3.11 or newer
  • rdflib

Install rdflib with:

python -m pip install rdflib

For running the test suite:

python -m pip install pytest
python -m pytest tests/

Rebuilding the Term Files

Place the gist web download bundle inside the repository root (it is gitignored):

gist14.1.0_webDownload/
  ontologies/
    turtle/
      gistCore14.1.0.ttl
      gistRdfsAnnotations14.1.0.ttl
      gistSubClassAssertions14.1.0.ttl

Then run from the repository root:

python build.py \
  gist14.1.0_webDownload/ontologies/turtle/gistCore14.1.0.ttl \
  gist14.1.0_webDownload/ontologies/turtle/gistRdfsAnnotations14.1.0.ttl \
  gist14.1.0_webDownload/ontologies/turtle/gistSubClassAssertions14.1.0.ttl \
  docs/terms \
  --namespace https://w3id.org/semanticarts/ns/ontology/gist/

The output directory will contain:

docs/
  terms/
    Account.ttl
    Account.rdf
    Account.jsonld
    ...

Publishing

The generated docs/ directory is suitable for any static web host. The rewrite rules in tools/semanticarts.htaccess are for w3id.org; they redirect term and ontology-document IRIs to the corresponding files hosted from docs/, and redirect the namespace IRI itself to the Semantic Arts landing page (see the routing rules below).

Before deploying those rules, update the base URL in tools/semanticarts.htaccess if the published site is not:

https://semanticarts.github.io/gist-deref

The current rewrite rules cover:

  • https://w3id.org/semanticarts/ns/ontology/gist/
  • https://w3id.org/semanticarts/ns/ontology/gist/{Term}
  • https://w3id.org/semanticarts/ontology/{OntologyDocument}

The namespace IRI itself redirects to the existing Semantic Arts landing page at https://ontologies.semanticarts.com/ontology/Namespace.html, regardless of Accept header.

For term IRIs, content negotiation redirects to the published /terms/ path backed by docs/terms/ in this repository:

  • text/turtle -> /terms/{Term}.ttl
  • application/rdf+xml -> /terms/{Term}.rdf
  • application/ld+json or application/json -> /terms/{Term}.jsonld
  • text/html -> the term anchor in the WIDOCO HTML documentation
  • any other or missing Accept header -> Turtle (/terms/{Term}.ttl)
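
The term-IRI routing listed above can be modeled as a small lookup. This is only a Python model of the behavior for reasoning and testing; the real rules are Apache rewrite rules in tools/semanticarts.htaccess. The substring matching on the Accept header, the order of checks, and the names ROUTES and route_term are assumptions. The text/html branch returns None here because the exact WIDOCO target path is not spelled out in this section.

```python
# Media type -> path template, checked in order. The order is an assumption.
ROUTES = [
    ("text/turtle", "/terms/{term}.ttl"),
    ("application/rdf+xml", "/terms/{term}.rdf"),
    ("application/ld+json", "/terms/{term}.jsonld"),
    ("application/json", "/terms/{term}.jsonld"),
]
DEFAULT_ROUTE = "/terms/{term}.ttl"


def route_term(term, accept=""):
    """Return the static path for a term, or None for the HTML branch."""
    accept = accept.lower()
    for media_type, template in ROUTES:
        if media_type in accept:
            return template.format(term=term)
    if "text/html" in accept:
        return None  # goes to the WIDOCO documentation; target not modeled here
    return DEFAULT_ROUTE.format(term=term)


turtle_path = route_term("Account", "text/turtle")
jsonld_path = route_term("Account", "application/ld+json")
fallback_path = route_term("Account", "*/*")
```

A model like this ignores q-values entirely, which is typical of simple rewrite-based negotiation but may not match every client's preferences.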

The .htaccess file also contains routes for full-ontology documents and WIDOCO HTML documentation. The current repository snapshot includes generated per-term files under docs/terms; add or publish the matching ontology/ and html/ assets before relying on those routes in production. Those assets are not produced by build.py and must be added by hand: place the source ontology Turtle, RDF/XML, and JSON-LD files in docs/ontologies/ and the WIDOCO output in docs/ontologies/ (or docs/html/).

The local docs/ontologies/gist-widoco.html file has been patched so fragment URLs with bare gist local names, such as #Address, scroll to the WIDOCO entity whose HTML id is the full gist IRI. Preserve or reapply that hash-navigation change if the WIDOCO page is regenerated.

Example

After publication and w3id.org configuration, clients can request a specific serialization of a term:

curl -L -H "Accept: text/turtle" https://w3id.org/semanticarts/ns/ontology/gist/Account
curl -L -H "Accept: application/ld+json" https://w3id.org/semanticarts/ns/ontology/gist/Account

The same generated files can also be inspected directly in docs/terms/.
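
For scripted clients, the curl requests above have a close standard-library equivalent in Python. The sketch below is a minimal illustration: the helper names build_request and fetch are made up, and fetch only works once the files are published and the w3id.org redirects are configured. urllib follows redirects by default, which corresponds to curl's -L flag.

```python
import urllib.request

TERM_IRI = "https://w3id.org/semanticarts/ns/ontology/gist/Account"


def build_request(iri, media_type):
    """Prepare a GET request that asks for one specific serialization."""
    return urllib.request.Request(iri, headers={"Accept": media_type})


def fetch(iri, media_type, timeout=10):
    """Dereference the IRI and return (content type, body text)."""
    request = build_request(iri, media_type)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        body = response.read().decode("utf-8")
        return response.headers.get_content_type(), body


# Example call (performs network access, so it is left commented out):
# content_type, turtle = fetch(TERM_IRI, "text/turtle")
```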

Development Notes

  • Generated files are committed so that they can be deployed as static assets from the git repository.
  • Re-run build.py when updating to a new gist version or when the extraction logic changes.
  • Output is byte-stable across rebuilds; see the Deterministic output section above for how blank-node IDs and serializer ordering are normalized.
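
One way to spot-check the byte-stability claim locally is to hash the output directory before and after a rebuild and compare the two digests. The helper below is a hypothetical sketch and not part of this repository; it hashes relative file names and file contents in sorted order so the result does not depend on filesystem listing order.

```python
import hashlib
from pathlib import Path


def tree_digest(root):
    """SHA-256 over the relative paths and contents of all files under `root`."""
    root = Path(root)
    digest = hashlib.sha256()
    for path in sorted(p for p in root.rglob("*") if p.is_file()):
        digest.update(path.relative_to(root).as_posix().encode("utf-8"))
        digest.update(b"\0")
        digest.update(path.read_bytes())
        digest.update(b"\0")
    return digest.hexdigest()


# Intended use (run from the repository root):
#   before = tree_digest("docs/terms")
#   ... rerun build.py with the same inputs ...
#   after = tree_digest("docs/terms")
#   print("byte-stable" if before == after else "output changed")
```

Running git status or git diff after a rebuild gives the same signal for committed files; the digest is mainly useful when the output directory is outside version control.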