linguist

A fast, zero-dependency source code language statistics tool. Single static binary, ~108KB.

Written in Modula-2, compiled via m2c.

Why not github-linguist?

GitHub's linguist is the gold standard for repository language detection. It's also a Ruby gem with a transitive dependency graph that pulls in half of RubyGems, requires a working Ruby installation, and takes non-trivial effort to install on a clean machine.

This project exists because sometimes you just want to run linguist . and get a table of stats without fighting bundle install for twenty minutes.

What's the same

Recursive directory scanning
.gitignore support (nested, negation patterns, ** globs)
.gitattributes support (linguist-vendored, linguist-generated, linguist-documentation, linguist-language)
Detection by file extension, well-known filename, and shebang line
Binary file exclusion (NUL byte and control character heuristic)
Symlink skipping
Sorted by bytes descending, with percentage breakdown

What's different

Aspect	github-linguist	This tool
Runtime	Ruby + native extensions	Single static binary
Install	`gem install github-linguist` + deps	Copy one file
Binary size	~50MB installed	~108KB
Speed	Seconds on large repos	Milliseconds
Language DB	600+ languages, Bayesian classifier, heuristics	75 languages by extension/filename/shebang/classifier
Disambiguation	Statistical classifier for ambiguous extensions	Bayesian keyword classifier for ambiguous extensions
Git integration	Reads from Git blob objects	Reads the working tree directly
Configuration	Overrides via `.gitattributes`	Same
Vendored detection	Built-in path patterns	Via `.gitattributes` only
Generated detection	Content heuristics + patterns	Via `.gitattributes` only

The main trade-off is language coverage: github-linguist knows about 600+ languages with extensive heuristics. This tool covers 75 languages and uses a Bayesian keyword classifier to disambiguate shared extensions (.h, .m, .pl) and classify extensionless files. For most codebases this is more than enough.

Install

Homebrew (macOS arm64)

brew tap fitzee/tap
brew install linguist

Build from source

Requires m2c:

cd linguist
m2c build

Binary lands in .m2c/bin/linguist. Copy it wherever you like.

Usage

linguist [options] [directory]

If no directory is given, scans the current directory.

Options

Flag	Description
`-h`, `--help`	Show help and exit
`-j`, `--json`	Output as JSON instead of a table
`-b`, `--breakdown`	List individual files per language
`--no-vendored`	Exclude files marked `linguist-vendored` in `.gitattributes`
`--no-generated`	Exclude files marked `linguist-generated` in `.gitattributes`

Examples

Basic usage -- scan the current directory:

$ linguist .
Language  Lines  Bytes  Files  Percentage
--------  -----  -----  -----  ----------
Modula-2  1437   38964  16     98.4%
TOML      18     414    1      1.0%
C         10     173    1      0.2%
Total     1465   39551  18     100.0%

Scan a specific directory:

$ linguist ~/projects/my-compiler
Language       Lines   Bytes    Files  Percentage
-------------  ------  -------  -----  ----------
Modula-2       50351   1438838  409    36.1%
Rust           24180   941055   40     23.6%
Markdown       24638   830021   222    20.8%
C              17498   610041   25     15.3%
Python         1656    66402    1      1.6%
JSON           1052    30989    7      0.7%
TypeScript     647     21132    1      0.5%
Shell          651     18175    6      0.4%
TOML           524     10195    30     0.2%
Objective-C++  291     9772     2      0.2%
YAML           68      1805     1      0.0%
Total          121556  3978425  744    100.0%

JSON output for scripting:

$ linguist --json .
{"languages":{"Modula-2":{"bytes":38964,"files":16,"lines":1437,"percentage":"98.4"},"TOML":{"bytes":414,"files":1,"lines":18,"percentage":"1.0"},"C":{"bytes":173,"files":1,"lines":10,"percentage":"0.2"}},"total_bytes":39551,"total_files":18,"total_lines":1465}
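
The JSON form is easy to consume from a script. A minimal Python sketch, assuming the output keeps the shape of the sample above; the `top_language` helper is hypothetical and not part of the tool. Note that `percentage` is a string in the sample, so it needs converting before any arithmetic.

```python
import json

# Sample output copied from the `linguist --json .` example above.
SAMPLE = (
    '{"languages":{"Modula-2":{"bytes":38964,"files":16,"lines":1437,"percentage":"98.4"},'
    '"TOML":{"bytes":414,"files":1,"lines":18,"percentage":"1.0"},'
    '"C":{"bytes":173,"files":1,"lines":10,"percentage":"0.2"}},'
    '"total_bytes":39551,"total_files":18,"total_lines":1465}'
)


def top_language(report_text):
    """Return (name, bytes) for the language with the most bytes.

    Hypothetical helper for illustration only.
    """
    report = json.loads(report_text)
    name, stats = max(report["languages"].items(), key=lambda kv: kv[1]["bytes"])
    return name, stats["bytes"]


if __name__ == "__main__":
    print(top_language(SAMPLE))
```

In a real script the text would typically come from running `linguist --json .` through `subprocess.run` with `capture_output=True` rather than from a hard-coded string.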

Breakdown -- see which files belong to each language:

$ linguist --breakdown .
Language  Lines  Bytes  Files  Percentage
--------  -----  -----  -----  ----------
Modula-2  1437   38964  16     98.4%
TOML      18     414    1      1.0%
C         10     173    1      0.2%
Total     1465   39551  18     100.0%

Modula-2
  src/Stats.mod
  src/Detect.def
  src/Output.mod
  src/Attrs.mod
  src/Ignore.def
  ...
TOML
  m2.toml
C
  src/bridge.c

Combined with --json, --breakdown adds a file_list array to each language entry.

How detection works

Detection runs in order:

1. .gitattributes override -- if a file matches a pattern with linguist-language=X, that language is used unconditionally.
2. File extension -- the most common path. Maps .rs to Rust, .py to Python, etc. Case-insensitive matching. If the extension is ambiguous (see below), the classifier refines the result.
3. Well-known filename -- files like Makefile, Dockerfile, CMakeLists.txt are recognised by exact name.
4. Shebang -- if the file has no recognised extension, the first line is checked for #!. Interpreter names like python3, bash, node are mapped to languages.
5. Content classifier -- if all of the above fail, a Bayesian keyword classifier scores the file content against 27 profiled languages and picks the best match.

If none of the above match, the file is ignored (not counted).
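
The detection order above can be sketched roughly as follows. This is an illustration in Python, not the tool's Modula-2 source: the lookup tables are tiny stand-ins, the `detect` signature and the `classify` callback are hypothetical, and handling of `#!/usr/bin/env` is an assumption. Refinement of ambiguous extensions is left out for brevity.

```python
import os

# Tiny illustrative tables (assumptions); the real tool's tables are larger.
EXTENSIONS = {".rs": "Rust", ".py": "Python", ".c": "C"}
FILENAMES = {"Makefile": "Makefile", "Dockerfile": "Dockerfile", "CMakeLists.txt": "CMake"}
INTERPRETERS = {"python3": "Python", "bash": "Shell", "node": "JavaScript"}


def detect(path, first_line="", override=None, classify=None):
    """Return a language name, or None if the file should be ignored."""
    # 1. A linguist-language override from .gitattributes wins unconditionally.
    if override:
        return override
    name = os.path.basename(path)
    # 2. File extension, matched case-insensitively.
    ext = os.path.splitext(name)[1].lower()
    if ext in EXTENSIONS:
        return EXTENSIONS[ext]
    # 3. Well-known filename, matched by exact name.
    if name in FILENAMES:
        return FILENAMES[name]
    # 4. Shebang: map the interpreter named on the first line to a language.
    if first_line.startswith("#!"):
        parts = first_line[2:].split()
        if parts:
            interp = os.path.basename(parts[0])
            # Assumption: "#!/usr/bin/env python3" resolves to the next word.
            if interp == "env" and len(parts) > 1:
                interp = parts[1]
            if interp in INTERPRETERS:
                return INTERPRETERS[interp]
    # 5. Content classifier as a last resort (hypothetical callback).
    if classify is not None:
        return classify(path)
    # Nothing matched: the file is not counted.
    return None
```

For example, `detect("src/main.RS")` yields Rust through the case-insensitive extension step, while `detect("run", first_line="#!/bin/bash")` falls through to the shebang step and yields Shell.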

Ambiguous extension disambiguation

Some file extensions map to multiple possible languages. When the extension match is ambiguous, the classifier tokenizes the first 8KB of the file and scores it against the candidate languages using discriminating keywords:

Extension	Candidates
`.h`	C, C++, Objective-C
`.m`	Objective-C, MATLAB
`.pl`	Perl, Prolog

For example, a .h file containing namespace, template, and std::vector will be classified as C++, while one with typedef, malloc, and unsigned will be classified as C.
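
To make the scoring idea concrete, here is a small naive-Bayes-style sketch in Python. The keyword likelihoods, the `UNSEEN` fallback value, the uniform prior, and the tokenizer are all assumptions made for illustration; the tool's actual profiles and weights are not documented here. The sketch also slices by characters rather than bytes, which is only an approximation of the 8KB window.

```python
import math
import re

# Hypothetical keyword likelihoods P(token | language); not the tool's real data.
PROFILES = {
    "C": {"typedef": 0.3, "malloc": 0.3, "unsigned": 0.3, "struct": 0.1},
    "C++": {"namespace": 0.3, "template": 0.3, "std": 0.2, "vector": 0.1, "class": 0.1},
    "Objective-C": {"@interface": 0.4, "@end": 0.3, "NSString": 0.3},
}
UNSEEN = 1e-4      # assumed likelihood for a keyword a profile does not list
WINDOW = 8 * 1024  # only the start of the file is examined


def classify(text, candidates):
    """Pick the candidate with the highest summed log-likelihood, or None."""
    tokens = re.findall(r"@?[A-Za-z_][A-Za-z0-9_]*", text[:WINDOW])
    # Only tokens that discriminate between the candidates are scored.
    vocab = set().union(*(PROFILES[c] for c in candidates))
    known = [t for t in tokens if t in vocab]
    if not known:
        return None
    # Uniform prior over candidates, so only the likelihood terms matter.
    scores = {
        c: sum(math.log(PROFILES[c].get(t, UNSEEN)) for t in known)
        for c in candidates
    }
    return max(scores, key=scores.get)
```

With these made-up profiles, a header containing `namespace`, `template`, and `std::vector` scores highest for C++, and one containing `typedef`, `malloc`, and `unsigned` scores highest for C, matching the behaviour described above.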

Extensionless file classification

Files with no recognised extension (and no shebang match) are classified against all 27 profiled languages: C, C++, Objective-C, Java, Python, Ruby, JavaScript, TypeScript, Go, Rust, Shell, Perl, Prolog, PHP, Haskell, MATLAB, Swift, Kotlin, Scala, C#, Lua, R, Elixir, Erlang, Dart, OCaml, and SQL.

What gets skipped

Binary files -- detected by NUL bytes or high control character ratio (>5%) in the first 8KB
Hidden files -- anything starting with . (including .git, .DS_Store)
Well-known non-source directories -- .git, .hg, .svn, node_modules
Symlinks -- always skipped
.gitignore patterns -- loaded per-directory, supports nested .gitignore files, negation (!pattern), directory-only patterns (dir/), and ** globs
.gitattributes markers -- files marked linguist-vendored or linguist-generated are excluded when --no-vendored or --no-generated is passed; files marked linguist-documentation are always excluded
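
The binary-file rule in the list above can be sketched in a few lines of Python. Which bytes count as control characters is an assumption here (anything below 0x20 except tab, line feed, and carriage return); the tool's exact definition may differ, and `looks_binary` is a hypothetical name.

```python
SAMPLE_SIZE = 8 * 1024   # only the first 8KB is inspected
CONTROL_RATIO = 0.05     # more than 5% control characters means binary
# Assumption: tab, LF and CR are ordinary in text and are not counted.
TEXT_CONTROLS = {0x09, 0x0A, 0x0D}


def looks_binary(data: bytes) -> bool:
    """Heuristic: a NUL byte or a high control-character ratio means binary."""
    sample = data[:SAMPLE_SIZE]
    if not sample:
        return False  # an empty file has nothing that looks binary
    if b"\x00" in sample:
        return True
    control = sum(1 for b in sample if b < 0x20 and b not in TEXT_CONTROLS)
    return control / len(sample) > CONTROL_RATIO
```

A caller would read at most the first 8KB of each file, for instance with `open(path, "rb").read(SAMPLE_SIZE)`, and skip the file when the function returns true.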

Supported languages

75 languages detected by extension, filename, shebang, and content classification. Partial list of the more common ones:

Ada, Assembly, Awk, Batch, C, C#, C++, CMake, CSS, Clojure, COBOL, Common Lisp, D, Dart, Diff, Dockerfile, Elixir, Emacs Lisp, Erlang, F#, Fortran, Go, GraphQL, Groovy, HCL, HTML, Haskell, INI, JSON, Java, JavaScript, Julia, Just, Kotlin, Less, Lua, Makefile, Markdown, Modula-2, Nim, Nix, OCaml, Objective-C, Objective-C++, PHP, Pascal, Perl, PowerShell, Protocol Buffers, Python, R, Racket, Ruby, Rust, SCSS, SQL, SVG, Sass, Scala, Scheme, Shell, Swift, Tcl, TeX, TOML, TypeScript, V, Vim Script, Visual Basic, XML, YAML, Zig.

Limits

Max 4096 files tracked for --breakdown output (stats are always unlimited)
Max 128 distinct languages per scan
Max 512 .gitignore patterns loaded at once
Max 128 .gitattributes rules loaded at once
Paths longer than 1023 characters are truncated
Content classifier covers 27 languages; files in unlisted languages with ambiguous or missing extensions won't be classified
Working tree only -- does not read Git objects or respect .gitattributes set via git config
