Create Input Files to the Entity Alignment Models from TTL/XML/NT Source and Target KGs

📢 Note:

For the latest updates and project-wide maintenance, please follow our project repository here:
DACE-DL/Create_Input_Data_to_EA_Models.
All contents of this personal repository are fully mirrored in the project repository to ensure consistency. This repository contains the original version of the code and remains valid for citation purposes.

📢 Note: The results of this work contributed to a Semantic Web Journal submission, available here:
An Analysis of the Performance of Representation Learning Methods for Entity Alignment: Benchmark vs. Real-world Data

This repository creates proper input files for various Entity Alignment (EA) models using reference alignment files and RDF-based source and target knowledge graphs in .ttl, .xml, or .nt formats.

Please note: the generated inputs are formatted to allow EA models to run on a wide variety of datasets without causing runtime errors, even on real-world and heterogeneous KGs.

We preprocess and generate the required JSON, PKL, and text files for the original implementations of the following EA methods:

- BERT-INT
- MultiKE
- RDGCN
- i-Align

In addition, given a source and target KG and a reference alignment file (in TTL/XML/NT formats), this repository supports preparing inputs for the Entity-Matchers framework. Using Entity-Matchers, you can easily run and evaluate multiple EA models on your custom datasets.

To promote fair evaluation, we move beyond the commonly used Hits@K metric and instead report Precision, Recall, and F1-Score, enabling better alignment quality analysis.
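
For readers unfamiliar with these metrics, the following is a minimal sketch of how Precision, Recall, and F1-Score can be computed from a set of predicted alignment pairs and a reference alignment. The function name and the sample pairs are hypothetical and are not taken from this repository's evaluation code.

```python
def precision_recall_f1(predicted, reference):
    """Compute Precision, Recall and F1 over sets of (source, target) pairs."""
    predicted, reference = set(predicted), set(reference)
    true_positives = len(predicted & reference)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(reference) if reference else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


# Hypothetical example: 3 predicted pairs, 4 reference pairs, 2 in common.
predicted_pairs = [("a", "x"), ("b", "y"), ("c", "z")]
reference_pairs = [("a", "x"), ("b", "y"), ("d", "w"), ("e", "v")]
p, r, f1 = precision_recall_f1(predicted_pairs, reference_pairs)
print(f"P={p:.3f} R={r:.3f} F1={f1:.3f}")  # P=0.667 R=0.500 F1=0.571
```

Unlike Hits@K, which only checks whether the correct target appears among the top K candidates for each test entity, these metrics also penalize wrong predictions and missed alignments.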

🛠️ Community Note: Any contributions that add input generation scripts for other EA models are highly encouraged!

🎯 Purpose

Most existing EA models assume that their input files are already pre-processed and formatted according to specific internal conventions. As a result, researchers often reuse limited benchmark datasets and avoid experimenting with new or real-world KGs.

This tool removes that barrier by transforming raw input files—two KGs (in .ttl, .nt, or .xml) and an alignment file—into the exact format expected by each EA model, without modifying the original EA code. This facilitates fair comparison, reproducibility, and exploration of EA performance on arbitrary datasets.
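
As a rough illustration of the first step of such a transformation, the sketch below reads N-Triples lines and separates relation triples (URI objects) from attribute triples (literal objects). It is a deliberately simplified, standard-library-only example: the function name is hypothetical, it handles only the `<s> <p> o .` line shape, and it is not the parsing code used by this repository. A real pipeline would rely on a full RDF parser.

```python
import re

# Matches "<subject> <predicate> object ." on a single line.
_LINE = re.compile(r'^\s*<([^>]*)>\s+<([^>]*)>\s+(.+?)\s*\.\s*$')


def split_triples(lines):
    """Separate relation triples (URI object) from attribute triples (literal object)."""
    relation_triples, attribute_triples = [], []
    for line in lines:
        match = _LINE.match(line)
        if match is None:
            continue  # blank lines, comments, blank-node subjects, malformed lines
        subj, pred, obj = match.groups()
        if obj.startswith("<") and obj.endswith(">"):
            relation_triples.append((subj, pred, obj[1:-1]))
        elif obj.startswith('"'):
            attribute_triples.append((subj, pred, obj))
        # Blank-node objects (e.g. _:b0) are ignored in this sketch.
    return relation_triples, attribute_triples


# Hypothetical sample data.
sample = [
    '<http://ex.org/a> <http://ex.org/knows> <http://ex.org/b> .',
    '<http://ex.org/a> <http://ex.org/name> "Alice"@en .',
    '# a comment',
    '',
]
rel, att = split_triples(sample)
```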

🔍 Scripts Overview

1. `prepare_data.py`

Purpose: Parses RDF knowledge graphs and a reference alignment file to generate input files required by downstream EA models.
Inputs:
- Set via param.py:
  - source_file_name, target_file_name, reference_alignment_file
  - file format (e.g., ttl, xml, nt)
  - EA model-specific configuration parameters
Outputs: See the 📤 Output Files section below.
Execution:
1. Place all raw files (source, target, and alignment) into the raw_files/ directory
2. Edit the param.py file to define:
  - File names (without folder prefix)
  - File types (e.g., 'ttl', 'xml')
  - Optional model-specific settings
3. Run the script via:
```
bash run.sh
```
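
A param.py along these lines might look like the sketch below. Only `source_file_name`, `target_file_name`, and `reference_alignment_file` are names mentioned in this README; the file names, the `file_format` variable, and the `check_params` helper are hypothetical and only illustrate the two constraints stated above (no folder prefix, supported file type).

```python
import os

# Names taken from this README; the values are hypothetical examples.
source_file_name = "source_kg.ttl"
target_file_name = "target_kg.ttl"
reference_alignment_file = "reference_alignment.ttl"

# Hypothetical variable name for the file type setting.
file_format = "ttl"

SUPPORTED_FORMATS = {"ttl", "xml", "nt"}


def check_params(file_names, fmt):
    """Return a list of problems found in the settings (empty if none)."""
    problems = []
    for name in file_names:
        # File names must be given without a folder prefix such as raw_files/.
        if os.path.basename(name) != name:
            problems.append(f"{name}: remove the folder prefix")
    if fmt not in SUPPORTED_FORMATS:
        problems.append(f"{fmt}: unsupported file type")
    return problems


issues = check_params(
    [source_file_name, target_file_name, reference_alignment_file], file_format
)
```

Running a check like this before `bash run.sh` can catch a misplaced path or an unsupported extension early.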

🧭 Execution Order

1. Place your .ttl, .xml, or .nt files in the raw_files/ folder
2. Edit param.py to set filenames and parameters
3. Run:
```
bash run.sh
```

📤 Output Files

The output of this tool consists of model-specific input files generated from the raw KGs and alignment file. These outputs are fully compatible with the original implementations of EA models such as BERT-INT, RDGCN, MultiKE, etc.

The generated files vary depending on the selected EA model (defined in param.py).

Example: BERT-INT Input Files

The following 11 files are produced when preparing Zh-En data for the BERT-INT model:

- ent_ids_1: Entity IDs and URIs in source KG
- ent_ids_2: Entity IDs and URIs in target KG
- ref_pairs: Test entity alignment pairs (ID-encoded)
- sup_pairs: Training entity alignment pairs (ID-encoded)
- rel_ids_1: Relation IDs and labels in source KG
- rel_ids_2: Relation IDs and labels in target KG
- triples_1: Source KG triples (ID-encoded)
- triples_2: Target KG triples (ID-encoded)
- zh_att_triples: Source KG attribute triples
- en_att_triples: Target KG attribute triples
- ent_desc.pkl: Pickled dictionary of entity descriptions

🧩 These are directly usable as inputs to BERT-INT without modifying its code.
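
To make "ID-encoded" concrete, here is a minimal sketch of how URI triples could be mapped to integer IDs, producing the kind of content stored in ent_ids_1, rel_ids_1, and triples_1. The function names, the `start_id` parameter, and the tab-separated line layout are assumptions made for illustration, not this repository's actual implementation.

```python
def encode_triples(triples, start_id=0):
    """Assign integer IDs to entities and relations in order of first appearance."""
    ent_ids, rel_ids, encoded = {}, {}, []
    for subj, pred, obj in triples:
        for entity in (subj, obj):
            ent_ids.setdefault(entity, start_id + len(ent_ids))
        rel_ids.setdefault(pred, len(rel_ids))
        encoded.append((ent_ids[subj], rel_ids[pred], ent_ids[obj]))
    return ent_ids, rel_ids, encoded


def to_lines(mapping):
    """Render an ID mapping as 'id<TAB>name' lines (assumed layout)."""
    return [f"{idx}\t{name}" for name, idx in mapping.items()]


# Hypothetical source KG triples.
source_triples = [
    ("http://ex.org/a", "http://ex.org/knows", "http://ex.org/b"),
    ("http://ex.org/b", "http://ex.org/knows", "http://ex.org/c"),
    ("http://ex.org/a", "http://ex.org/livesIn", "http://ex.org/c"),
]
ent_ids, rel_ids, encoded = encode_triples(source_triples)
ent_lines = to_lines(ent_ids)
```

The `start_id` parameter lets the target KG continue numbering after the source KG, so that entity IDs stay unique across both graphs if a model requires it.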

Notes

- For other models like RDGCN, the format and file names differ, and are automatically generated accordingly.
- All outputs are saved in a dedicated subfolder for each dataset/model combination.

🤝 Community Recommendation: Encourage Reproducible Input Pipelines

To promote reproducibility, adaptability, and broader adoption of EA models, we encourage researchers developing new Entity Alignment methods to publicly share the scripts they use to convert raw RDF knowledge graphs and alignment files into the specific input format required by their models.

Too often, EA systems are tightly coupled with benchmark-specific preprocessing pipelines, making it difficult for others to test them on new datasets. By sharing input generation code alongside model code, the EA community can:

- Enable the application of models to real-world or domain-specific KGs
- Simplify benchmarking across diverse datasets
- Support transparent and fair comparisons
- Encourage the reuse of models beyond the initial dataset they were trained on

We hope this repository serves as a practical example of how to design a modular and extensible input generation pipeline, applicable to a wide variety of EA models.
