tree-sitter-multilingual

A Tree-sitter grammar for the Multilingual Programming Language.

This repository gives you:

  • A Tree-sitter parser for .multi source files
  • Query files for highlighting, indentation, and folding
  • Node and Rust bindings for embedding the parser in applications
  • Generator scripts for TextMate and Monaco grammar outputs

What is the Multilingual Programming Language?

The Multilingual Programming Language is an experimental language where the same semantic constructs can be written with keywords from multiple human languages.

Examples:

  • English: if, def, class
  • French: si, def, classe
  • German: wenn, def, klasse
  • Japanese: supported through the shared keyword registry
  • Arabic: supported through the shared keyword registry

The full keyword registry currently covers 17 languages, including Spanish, Portuguese, Chinese, Italian, Dutch, Polish, Swedish, Danish, Finnish, Hindi, Bengali, and Tamil.

Keywords can be mixed freely in the same program. The language is indentation-sensitive and broadly Python-like in structure.

What This Repository Produces

This repo is useful when you want to:

  • Parse Multilingual source code into syntax trees
  • Add syntax highlighting to an editor or viewer
  • Support indentation and folding in Tree-sitter-aware editors
  • Generate editor grammar artifacts from a single keyword registry

After building, the main outputs are:

  • src/parser.c and src/scanner.c for the Tree-sitter parser
  • bindings/node/ for Node.js consumers
  • bindings/rust/ for Rust consumers
  • queries/highlights.scm, queries/indents.scm, queries/folds.scm
  • generated/multilingual.tmLanguage.json for TextMate-compatible editors
  • generated/monarch.json for Monaco-based editors

GitHub Linguist Positioning

This repository can serve as the upstream grammar source for future github-linguist support, but it is not the github-linguist integration itself.

For language detection and syntax-highlighting discussions, this repository treats:

  • .multi as the canonical file extension
  • source.multi as the TextMate scope

The repository no longer advertises .ml, because that extension is already strongly associated with other languages, notably OCaml, and would create avoidable detection conflicts.

Quick Start

Prerequisites

  • Node.js 14+ for the Tree-sitter CLI and Node binding build
  • Python 3.10+ for the generator scripts
  • PyYAML for the build scripts: pip install pyyaml
  • A C compiler such as GCC, Clang, or MSVC

Build Everything

git clone https://github.com/multilingualprogramming/tree-sitter-multilingual.git
cd tree-sitter-multilingual
npm install
make all

If you only want the parser and tests:

npm install
make build
make test

Build Targets

make inject     # Expand multilingual keyword aliases into grammar.js
make generate   # Run tree-sitter generate
make build      # Build the native Node binding
make test       # Run Tree-sitter corpus tests
make tmgrammar  # Generate TextMate grammar JSON
make monarch    # Generate Monaco tokenizer JSON
make validate   # Validate keyword coverage
make all        # Run the full pipeline

Repository Layout

  • data/keywords.yaml: Canonical keyword registry across all supported languages
  • grammar.js: Tree-sitter grammar source
  • src/scanner.c: External scanner for indentation handling
  • queries/: Highlight, indentation, and folding queries
  • scripts/: Build and generation scripts
  • test/corpus/: Grammar test corpus
  • examples/: Sample .multi programs
  • bindings/node/: Node.js binding entrypoint
  • bindings/rust/: Rust crate wrapper

How the Build Works

The build is driven from data/keywords.yaml.

  1. scripts/inject_aliases.py replaces // build:inject <construct> markers in grammar.js with generated choice(...) expressions.
  2. tree-sitter generate compiles the grammar into C sources.
  3. node-gyp rebuild builds the native Node binding.
  4. scripts/build_tmgrammar.py generates generated/multilingual.tmLanguage.json.
  5. scripts/build_monarch.py generates generated/monarch.json.
  6. scripts/validate_coverage.py checks that keyword coverage is complete and consistent.
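Step 1 can be pictured with a minimal Python sketch. This is not the actual scripts/inject_aliases.py: the registry shape (construct name mapped to a list of surface forms), the function name inject_aliases, and the marker regex are assumptions made for illustration.

```python
import json
import re

# Hypothetical registry shape: construct name -> surface forms.
# The real data/keywords.yaml may be organised differently.
REGISTRY = {
    "if": ["if", "si", "wenn"],
    "class": ["class", "classe", "klasse"],
}

# Matches markers of the form `// build:inject <construct>`.
MARKER = re.compile(r"//\s*build:inject\s+(\w+)")


def inject_aliases(grammar_source: str, registry: dict[str, list[str]]) -> str:
    """Replace each marker with a Tree-sitter choice(...) over the aliases."""

    def expand(match: re.Match) -> str:
        construct = match.group(1)
        aliases = registry[construct]  # KeyError flags an unknown construct
        # json.dumps yields a valid double-quoted JavaScript string literal.
        literals = ", ".join(json.dumps(a, ensure_ascii=False) for a in aliases)
        return f"choice({literals})"

    return MARKER.sub(expand, grammar_source)


template = "if_keyword: $ => // build:inject if\n"
expanded = inject_aliases(template, REGISTRY)
print(expanded)
```

Keeping the aliases out of grammar.js and expanding them at build time means a new language only touches the registry, not the grammar rules themselves.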

Keeping Generated Files in Sync

The files in generated/ are committed intentionally because they are directly useful to downstream editor and highlighting integrations.

If you change data/keywords.yaml, grammar.js, or any of the generator scripts, regenerate and review the committed outputs before opening a PR:

make tmgrammar
make monarch
make validate

At a minimum, check that these files are updated together when relevant:

  • generated/multilingual.tmLanguage.json
  • generated/monarch.json
  • README.md if user-facing behavior or supported extension guidance changed

Using This Repository in Applications

1. Use It from Node.js

After building the parser, you can load it through the bundled Node binding:

const Parser = require("tree-sitter");
// Load the native binding produced by `make build`.
const Multilingual = require("./bindings/node");

const parser = new Parser();
parser.setLanguage(Multilingual);

const source = `
def greet(name):
  return f"Hello, {name}"
`;

// Parse the source and print the syntax tree as an S-expression.
const tree = parser.parse(source);
console.log(tree.rootNode.toString());

Use this approach when you are building:

  • A CLI formatter or linter
  • A code analysis tool
  • A desktop app or Electron app
  • A custom language service prototype

2. Use It from Rust

The Rust binding exposes a language() function for tree-sitter:

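fn main() {
    // Sample Multilingual source using English keywords.
    let source = r#"
def greet(name):
  return f"Hello, {name}"
"#;

    let mut parser = tree_sitter::Parser::new();
    // Depending on the tree-sitter crate version, this call may need a reference (`&...`).
    parser
        .set_language(tree_sitter_multilingual::language())
        .expect("failed to load multilingual grammar");

    // `None` means there is no previous tree to reuse for incremental parsing.
    let tree = parser.parse(source, None).expect("failed to parse source");
    println!("{}", tree.root_node().to_sexp());
}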

Use this when you want to embed the grammar in:

  • A Rust CLI
  • A language server
  • A static analysis tool
  • A backend service that parses code snippets

3. Use It in Tree-sitter-Based Editors

This repository already includes the standard query files used by many Tree-sitter integrations:

  • queries/highlights.scm
  • queries/indents.scm
  • queries/folds.scm

These are the files editors typically use for:

  • Syntax highlighting
  • Auto-indentation
  • Code folding

4. Use It in Neovim

For nvim-treesitter, the key pieces you need are:

  • The generated parser
  • The queries/ directory
  • The language registration metadata

Example setup (this only enables highlighting; the parser still has to be registered as described below):

require("nvim-treesitter.configs").setup {
  highlight = { enable = true },
}

To fully integrate this language in Neovim, you would typically:

  1. Register the parser with nvim-treesitter
  2. Point it at this repository
  3. Install the queries/*.scm files alongside the parser

5. Use It in VS Code or Any TextMate-Based Editor

Run:

make tmgrammar

This generates:

  • generated/multilingual.tmLanguage.json

Use that file inside a VS Code extension, or any editor/tooling stack that consumes TextMate grammars.

This is the right path for:

  • VS Code extensions
  • Syntax highlighting in Shiki-compatible pipelines
  • Any tool that relies on TextMate scopes rather than Tree-sitter directly
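To give a feel for what a keyword-driven TextMate grammar can look like, here is a minimal Python sketch. It is not the actual scripts/build_tmgrammar.py: the registry contents, the function name build_tmgrammar, and the keyword.control.multi scope are assumptions. Only the source.multi scope name and the .multi extension come from this README.

```python
import json
import re

# Hypothetical subset of the keyword registry (construct -> surface forms).
REGISTRY = {
    "if": ["if", "si", "wenn"],
    "class": ["class", "classe", "klasse"],
}


def build_tmgrammar(registry: dict[str, list[str]]) -> dict:
    """Build a minimal TextMate grammar with a single keyword rule."""
    words = {alias for aliases in registry.values() for alias in aliases}
    # Longest first so that "classe" is tried before "class".
    ordered = sorted(words, key=lambda w: (-len(w), w))
    pattern = r"\b(?:" + "|".join(re.escape(w) for w in ordered) + r")\b"
    return {
        "name": "Multilingual",
        "scopeName": "source.multi",
        "fileTypes": ["multi"],
        "patterns": [{"name": "keyword.control.multi", "match": pattern}],
    }


grammar = build_tmgrammar(REGISTRY)
print(json.dumps(grammar, indent=2, ensure_ascii=False))
```

TextMate engines evaluate the pattern with their own regex engine, so word-boundary behavior for non-Latin keywords should be checked in the target editor rather than assumed from Python.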

6. Use It in Monaco Editor

Run:

make monarch

This generates:

  • generated/monarch.json

Use that file when registering a Monaco tokenizer in:

  • Monaco Editor
  • Browser IDEs
  • Web playgrounds
  • Electron apps using Monaco

7. Use It in Static Highlighting Pipelines

If your application does not need incremental parsing, you can still use the generated grammar artifacts:

  • Use generated/multilingual.tmLanguage.json for TextMate-compatible highlighters
  • Use generated/monarch.json for Monaco-based editors

That is often the simplest option for:

  • Documentation sites
  • Playground pages
  • Code preview components
  • Read-only syntax highlighting

Example Source File

See these repository examples:

  • examples/english.multi for a straightforward English-oriented sample
  • examples/french.multi for a sample written with French keywords

A small example:

def greet(name = "World"):
  return f"Hello, {name}!"

if True:
  print(greet("Alice"))

Grammar Features

Current grammar coverage includes:

  • Functions and classes
  • Assignments and expressions
  • if / elif / else
  • for and while
  • Imports
  • Arithmetic, comparison, logical, and bitwise operators
  • Strings, f-strings, numbers, lists, dicts, tuples, and sets
  • Line comments using #

Keyword Registry

data/keywords.yaml is the source of truth for keyword aliases.

It defines multilingual surface forms for constructs such as:

  • if, else, elif
  • for, while, break, continue
  • def, class, return
  • import, from, as
  • and, or, not, in, is
  • True, False, None

If you want to add or update a language, start there.
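Before running the full build after a registry edit, it can help to think about what "complete and consistent" coverage means. The Python sketch below is one possible reading, not the logic of scripts/validate_coverage.py: the registry shape (construct mapped to language code mapped to surface form), the language codes, and the two rules checked are all assumptions.

```python
# Hypothetical registry shape: construct -> {language code -> surface form}.
REGISTRY = {
    "if": {"en": "if", "fr": "si", "de": "wenn"},
    "class": {"en": "class", "fr": "classe", "de": "klasse"},
    "def": {"en": "def", "fr": "def"},  # "de" is deliberately missing
}


def find_problems(registry: dict[str, dict[str, str]]) -> list[str]:
    """Return human-readable coverage and consistency problems."""
    problems = []
    languages = sorted({lang for forms in registry.values() for lang in forms})

    # Completeness: every construct should have a form in every language.
    for construct, forms in registry.items():
        for lang in languages:
            if lang not in forms:
                problems.append(f"{construct}: missing form for {lang}")

    # Consistency: one surface form must not name two different constructs.
    owner: dict[str, str] = {}
    for construct, forms in registry.items():
        for alias in forms.values():
            if owner.setdefault(alias, construct) != construct:
                problems.append(
                    f"{alias!r} used for both {owner[alias]} and {construct}"
                )

    return problems


for problem in find_problems(REGISTRY):
    print(problem)
```

A form shared across languages for the same construct (such as def above) is treated as fine; only a form claimed by two different constructs is reported.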

Adding a New Language

  1. Edit data/keywords.yaml
  2. Run make all
  3. Run make test
  4. Inspect the updated generated outputs

Testing

Run the grammar corpus tests with:

make test

The test files live in test/corpus/*.txt.
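Tree-sitter corpus files use a simple text layout: a test name between two lines of = characters, the source to parse, a line of three or more - characters, and the expected syntax tree as an S-expression. The Python sketch below splits such a file into cases, which can be handy for listing or auditing tests. The sample entry and its node names (module, if_statement) are placeholders, not names this grammar is known to emit, and parse_corpus is a hypothetical helper.

```python
import re

# Hypothetical corpus entry; the node names in the expected tree are
# placeholders, not the names this grammar actually emits.
SAMPLE = """\
==================
Simple if statement
==================

if True:
  x = 1

---

(module
  (if_statement))
"""

HEADER = re.compile(r"^={3,}\n(?P<name>.+)\n={3,}\n", re.MULTILINE)
DIVIDER = re.compile(r"^-{3,}\n", re.MULTILINE)


def parse_corpus(text: str) -> list[tuple[str, str, str]]:
    """Split corpus text into (name, source, expected_tree) triples."""
    cases = []
    headers = list(HEADER.finditer(text))
    for i, header in enumerate(headers):
        # A case runs from the end of its header to the next header.
        end = headers[i + 1].start() if i + 1 < len(headers) else len(text)
        body = text[header.end():end]
        source, expected = DIVIDER.split(body, maxsplit=1)
        cases.append((header["name"].strip(), source.strip("\n"), expected.strip()))
    return cases


for name, source, expected in parse_corpus(SAMPLE):
    print(name)
```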

Known Limitations

  • No complex unpacking in assignments
  • No decorators or type annotations
  • No async / await
  • No with statements
  • Comments cannot appear inside expressions

License

MIT
