A fast, zero-dependency source code language statistics tool. Single static binary, ~108KB.
Written in Modula-2, compiled via m2c.
GitHub's linguist is the gold standard for repository language detection. It's also a Ruby gem with a transitive dependency graph that pulls in half of RubyGems, requires a working Ruby installation, and takes non-trivial effort to install on a clean machine.
This project exists because sometimes you just want to run linguist . and get a table of stats without fighting bundle install for twenty minutes.
- Recursive directory scanning
- .gitignore support (nested, negation patterns, ** globs)
- .gitattributes support (linguist-vendored, linguist-generated, linguist-documentation, linguist-language)
- Detection by file extension, well-known filename, and shebang line
- Binary file exclusion (NUL byte and control character heuristic)
- Symlink skipping
- Sorted by bytes descending, with percentage breakdown
| | github-linguist | this |
|---|---|---|
| Runtime | Ruby + native extensions | Single static binary |
| Install | gem install github-linguist + deps | Copy one file |
| Binary size | ~50MB installed | 108KB |
| Speed | Seconds on large repos | Milliseconds |
| Language DB | 600+ languages, Bayesian classifier, heuristics | 75 languages by extension/filename/shebang/classifier |
| Disambiguation | Statistical classifier for ambiguous extensions | Bayesian keyword classifier for ambiguous extensions |
| Git integration | Reads from Git blob objects | Reads the working tree directly |
| Configuration | Overrides via .gitattributes | Same |
| Vendored detection | Built-in path patterns | Via .gitattributes only |
| Generated detection | Content heuristics + patterns | Via .gitattributes only |
The main trade-off is language coverage: github-linguist knows about 600+ languages with extensive heuristics. This tool covers 75 languages and uses a Bayesian keyword classifier to disambiguate shared extensions (.h, .m, .pl) and classify extensionless files. For most codebases this is more than enough.
brew tap fitzee/tap
brew install linguist
Requires m2c:
cd linguist
m2c build
Binary lands in .m2c/bin/linguist. Copy it wherever you like.
linguist [options] [directory]
If no directory is given, scans the current directory.
| Flag | Description |
|---|---|
| -h, --help | Show help and exit |
| -j, --json | Output as JSON instead of a table |
| -b, --breakdown | List individual files per language |
| --no-vendored | Exclude files marked linguist-vendored in .gitattributes |
| --no-generated | Exclude files marked linguist-generated in .gitattributes |
Basic usage -- scan the current directory:
$ linguist .
Language Lines Bytes Files Percentage
-------- ----- ----- ----- ----------
Modula-2 1437 38964 16 98.4%
TOML 18 414 1 1.0%
C 10 173 1 0.2%
Total 1465 39551 18 100.0%
Scan a specific directory:
$ linguist ~/projects/my-compiler
Language Lines Bytes Files Percentage
------------- ------ ------- ----- ----------
Modula-2 50351 1438838 409 36.1%
Rust 24180 941055 40 23.6%
Markdown 24638 830021 222 20.8%
C 17498 610041 25 15.3%
Python 1656 66402 1 1.6%
JSON 1052 30989 7 0.7%
TypeScript 647 21132 1 0.5%
Shell 651 18175 6 0.4%
TOML 524 10195 30 0.2%
Objective-C++ 291 9772 2 0.2%
YAML 68 1805 1 0.0%
Total 121556 3978425 744 100.0%
JSON output for scripting:
$ linguist --json .
{"languages":{"Modula-2":{"bytes":38964,"files":16,"lines":1437,"percentage":"98.4"},"TOML":{"bytes":414,"files":1,"lines":18,"percentage":"1.0"},"C":{"bytes":173,"files":1,"lines":10,"percentage":"0.2"}},"total_bytes":39551,"total_files":18,"total_lines":1465}
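The JSON form is convenient to consume from a script. The following is a minimal Python sketch that parses the sample output shown above and ranks languages by byte count. The helper name `top_languages` is made up for this illustration; note that in the sample output `percentage` is a string, not a number.

```python
import json

# Sample report copied verbatim from the `linguist --json .` example above.
SAMPLE = (
    '{"languages":{"Modula-2":{"bytes":38964,"files":16,"lines":1437,"percentage":"98.4"},'
    '"TOML":{"bytes":414,"files":1,"lines":18,"percentage":"1.0"},'
    '"C":{"bytes":173,"files":1,"lines":10,"percentage":"0.2"}},'
    '"total_bytes":39551,"total_files":18,"total_lines":1465}'
)


def top_languages(report_text, limit=3):
    """Return (language, bytes) pairs, largest byte count first.

    Hypothetical helper for this example; it only relies on the keys
    visible in the sample output ("languages", "bytes").
    """
    report = json.loads(report_text)
    ranked = sorted(
        report["languages"].items(),
        key=lambda item: item[1]["bytes"],
        reverse=True,
    )
    return [(name, stats["bytes"]) for name, stats in ranked[:limit]]


if __name__ == "__main__":
    for name, size in top_languages(SAMPLE):
        print(f"{name}: {size} bytes")
```

In a real script the report text would typically come from running the tool, for example with `subprocess.run(["linguist", "--json", "."], capture_output=True, text=True).stdout`.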
Breakdown -- see which files belong to each language:
$ linguist --breakdown .
Language Lines Bytes Files Percentage
-------- ----- ----- ----- ----------
Modula-2 1437 38964 16 98.4%
TOML 18 414 1 1.0%
C 10 173 1 0.2%
Total 1465 39551 18 100.0%
Modula-2
src/Stats.mod
src/Detect.def
src/Output.mod
src/Attrs.mod
src/Ignore.def
...
TOML
m2.toml
C
src/bridge.c
JSON breakdown adds a file_list array to each language entry.
Detection runs in order:
- .gitattributes override -- if a file matches a pattern with linguist-language=X, that language is used unconditionally.
- File extension -- the most common path. Maps .rs to Rust, .py to Python, etc. Case-insensitive matching. If the extension is ambiguous (see below), the classifier refines the result.
- Well-known filename -- files like Makefile, Dockerfile, CMakeLists.txt are recognised by exact name.
- Shebang -- if the file has no recognised extension, the first line is checked for #!. Interpreter names like python3, bash, node are mapped to languages.
- Content classifier -- if all of the above fail, a Bayesian keyword classifier scores the file content against 27 profiled languages and picks the best match.
If none of the above match, the file is ignored (not counted).
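The detection order described above can be pictured as a simple cascade. The Python sketch below is an illustration only, not the tool's Modula-2 implementation: the lookup tables are tiny stand-ins, the interpreter-to-language mapping and the `env` handling are assumptions, and `detect`, `shebang_language`, and the `classify` callback are hypothetical names.

```python
import posixpath

# Tiny illustrative tables (assumed); the real tool covers 75 languages.
EXTENSIONS = {".rs": "Rust", ".py": "Python", ".h": "C"}
AMBIGUOUS = {".h", ".m", ".pl"}
FILENAMES = {"Makefile": "Makefile", "Dockerfile": "Dockerfile",
             "CMakeLists.txt": "CMake"}
INTERPRETERS = {"python3": "Python", "bash": "Shell", "node": "JavaScript"}


def shebang_language(first_line):
    """Map a '#!' line to a language, or return None."""
    if not first_line.startswith("#!"):
        return None
    parts = first_line[2:].split()
    if not parts:
        return None
    name = posixpath.basename(parts[0])
    # Assumption: '#!/usr/bin/env python3' names the interpreter second.
    if name == "env" and len(parts) > 1:
        name = parts[1]
    return INTERPRETERS.get(name)


def detect(path, first_line="", override=None, classify=None):
    """Return a language name, or None if the file should be ignored."""
    if override:                          # .gitattributes linguist-language=X
        return override
    name = posixpath.basename(path)
    ext = posixpath.splitext(name)[1].lower()   # case-insensitive extension
    if ext in EXTENSIONS:
        lang = EXTENSIONS[ext]
        if ext in AMBIGUOUS and classify:       # classifier refines the guess
            lang = classify(path) or lang
        return lang
    if name in FILENAMES:                 # well-known filename, exact match
        return FILENAMES[name]
    lang = shebang_language(first_line)   # only reached without a known extension
    if lang:
        return lang
    return classify(path) if classify else None  # content classifier, else ignored
```

For example, `detect("bin/run", first_line="#!/usr/bin/env python3")` returns `"Python"`, while `detect("notes.xyz")` returns `None`, matching the "ignored" case.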
Some file extensions map to multiple possible languages. When the extension match is ambiguous, the classifier tokenizes the first 8KB of the file and scores it against the candidate languages using discriminating keywords:
| Extension | Candidates |
|---|---|
| .h | C, C++, Objective-C |
| .m | Objective-C, MATLAB |
| .pl | Perl, Prolog |
For example, a .h file containing namespace, template, and std::vector will be classified as C++, while one with typedef, malloc, and unsigned will be classified as C.
Files with no recognised extension (and no shebang match) are classified against all 27 profiled languages: C, C++, Objective-C, Java, Python, Ruby, JavaScript, TypeScript, Go, Rust, Shell, Perl, Prolog, PHP, Haskell, MATLAB, Swift, Kotlin, Scala, C#, Lua, R, Elixir, Erlang, Dart, OCaml, and SQL.
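To make the keyword scoring concrete, here is a small naive-Bayes-style sketch in Python. It is not the tool's actual classifier: the keyword probabilities, the smoothing value, the equal priors, and the function name `classify` are all assumptions chosen for illustration. Only the 8KB window and the example keywords come from the text above, and the sketch slices characters rather than bytes.

```python
import math
import re

# Hypothetical likelihoods P(keyword | language); the real profiles are
# not documented here, so these numbers are placeholders.
PROFILES = {
    "C": {"typedef": 0.30, "malloc": 0.30, "unsigned": 0.30},
    "C++": {"namespace": 0.30, "template": 0.30, "std": 0.30},
}
UNSEEN = 0.01  # assumed smoothing for keywords missing from a profile
VOCAB = set().union(*PROFILES.values())  # every keyword in any profile


def classify(text, candidates):
    """Pick the candidate with the highest summed log-likelihood.

    Returns None when no known keyword appears in the first 8192
    characters. Priors are assumed equal, so they are left out.
    """
    tokens = [t for t in re.findall(r"[A-Za-z_]\w*", text[:8192]) if t in VOCAB]
    if not tokens:
        return None

    def score(lang):
        profile = PROFILES[lang]
        return sum(math.log(profile.get(t, UNSEEN)) for t in tokens)

    return max(candidates, key=score)


if __name__ == "__main__":
    header = "namespace app { template <typename T> std::vector<T> make(); }"
    print(classify(header, ["C", "C++"]))
```

Restricting `candidates` to the languages that share an extension mirrors the ambiguous-extension case; passing every profiled language mirrors the extensionless case.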
- Binary files -- detected by NUL bytes or high control character ratio (>5%) in the first 8KB
- Hidden files -- anything starting with . (including .git, .DS_Store)
- Well-known non-source directories -- .git, .hg, .svn, node_modules
- Symlinks -- always skipped
- .gitignore patterns -- loaded per-directory, supports nested .gitignore files, negation (!pattern), directory-only patterns (dir/), and ** globs
- .gitattributes markers -- files marked linguist-vendored, linguist-generated, or linguist-documentation are excluded (when the corresponding flags are set, or for documentation always)
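The binary-file check in the list above can be sketched in a few lines of Python. This is a rough illustration under stated assumptions, not the tool's code: which control characters count as ordinary text (tab, line feed, carriage return here) and how empty files are treated are guesses, and `looks_binary` is a hypothetical name.

```python
SAMPLE_SIZE = 8192            # "first 8KB" from the description above
CONTROL_RATIO_LIMIT = 0.05    # "> 5%" from the description above
# Assumption: these control bytes are normal in text and are not counted.
TEXT_CONTROLS = {0x09, 0x0A, 0x0D}


def looks_binary(data):
    """Return True if the leading bytes look like a binary file."""
    sample = data[:SAMPLE_SIZE]
    if not sample:
        return False          # assumption: an empty file is treated as text
    if 0 in sample:           # any NUL byte marks the file as binary
        return True
    control = sum(1 for b in sample if b < 0x20 and b not in TEXT_CONTROLS)
    return control / len(sample) > CONTROL_RATIO_LIMIT


if __name__ == "__main__":
    print(looks_binary(b"hello\n\tworld\r\n"))
    print(looks_binary(b"\x00\x01\x02 payload"))
```

Reading only a fixed-size prefix keeps the check cheap even for very large files, at the cost of missing binary content that starts later in the file.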
75 languages detected by extension, filename, shebang, and content classification. Partial list of the more common ones:
Ada, Assembly, Awk, Batch, C, C#, C++, CMake, CSS, Clojure, COBOL, Common Lisp, D, Dart, Diff, Dockerfile, Elixir, Emacs Lisp, Erlang, F#, Fortran, Go, GraphQL, Groovy, HCL, HTML, Haskell, INI, JSON, Java, JavaScript, Julia, Just, Kotlin, Less, Lua, Makefile, Markdown, Modula-2, Nim, Nix, OCaml, Objective-C, Objective-C++, PHP, Pascal, Perl, PowerShell, Protocol Buffers, Python, R, Racket, Ruby, Rust, SCSS, SQL, SVG, Sass, Scala, Scheme, Shell, Swift, Tcl, TeX, TOML, TypeScript, V, Vim Script, Visual Basic, XML, YAML, Zig.
- Max 4096 files tracked for --breakdown output (stats are always unlimited)
- Max 128 distinct languages per scan
- Max 512 .gitignore patterns loaded at once
- Max 128 .gitattributes rules loaded at once
- Paths longer than 1023 characters are truncated
- Content classifier covers 27 languages; files in unlisted languages with ambiguous or missing extensions won't be classified
- Working tree only -- does not read Git objects or respect .gitattributes set via git config