google-robotstxt-parser

A pure JavaScript port of Google's official robotstxt C++ library. Runs in both Node.js and the browser with no dependencies.

Implements the same parsing rules, typo tolerance, and URL-matching logic that Google's own crawler uses to evaluate robots.txt files.

Installation

npm install google-robotstxt-parser

Usage

import { RobotsMatcher } from 'google-robotstxt-parser';

const matcher = new RobotsMatcher();
const robotsContent = `
User-agent: *
Dissallow: /secret/   # Typo accepted by Google!
`;

const isAllowed = matcher.allowedByRobots(robotsContent, ['Googlebot'], 'https://example.com/secret/page');
console.log(isAllowed); // false
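
The misspelled Dissallow line above is still treated as a Disallow rule. The following is a minimal sketch of how lenient directive-name matching can work, not the library's actual code. The names DIRECTIVE_SPELLINGS, canonicalDirective and parseLine are hypothetical, and the table lists only the spellings this README mentions.

```javascript
// Hypothetical lookup table. Only spellings named in this README are listed;
// the real parser may accept others.
const DIRECTIVE_SPELLINGS = {
  'user-agent': ['user-agent', 'user agent'],
  disallow: ['disallow', 'dissallow', 'disalow'],
  allow: ['allow'],
};

// Map a raw directive name to its canonical form, or null if unrecognised.
function canonicalDirective(rawKey) {
  const key = rawKey.trim().toLowerCase();
  for (const [canonical, spellings] of Object.entries(DIRECTIVE_SPELLINGS)) {
    if (spellings.includes(key)) return canonical;
  }
  return null;
}

// Split one robots.txt line into { directive, value }, dropping any comment.
function parseLine(line) {
  const withoutComment = line.split('#')[0];
  const colon = withoutComment.indexOf(':');
  if (colon === -1) return null;
  const directive = canonicalDirective(withoutComment.slice(0, colon));
  if (directive === null) return null;
  return { directive, value: withoutComment.slice(colon + 1).trim() };
}

console.log(parseLine('Dissallow: /secret/   # Typo accepted by Google!'));
// { directive: 'disallow', value: '/secret/' }
```

Normalising the name before dispatch means the rest of a parser only ever sees canonical directives.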

Check a single user-agent

const allowed = matcher.oneAgentAllowedByRobots(robotsContent, 'Googlebot', 'https://example.com/public/');
console.log(allowed); // true

Check multiple user-agents at once

allowedByRobots accepts an array of user-agents and returns true if the URL is allowed for at least one of them.

const allowed = matcher.allowedByRobots(robotsContent, ['Googlebot', 'Bingbot'], 'https://example.com/page');

API

`RobotsMatcher`

| Method | Description |
| --- | --- |
| `allowedByRobots(robotsTxt, userAgents, url)` | Returns `true` if the URL is accessible to at least one of the given user-agents. |
| `oneAgentAllowedByRobots(robotsTxt, userAgent, url)` | Convenience wrapper for a single user-agent string. |
| `disallow()` | Returns the raw disallow decision after a parse (useful after calling `allowedByRobots`). |
| `everSeenSpecificAgent()` | `true` if the parsed file contained a rule group for the queried agent specifically. |
| `matchingLine()` | Line number of the winning allow/disallow rule, or `0` if none matched. |

`parseRobotsTxt(robotsBody, handler)`

Low-level parser. Pass an instance of a RobotsParseHandler subclass to react to individual directives without running the full matcher.

import { parseRobotsTxt, RobotsParseHandler } from 'google-robotstxt-parser';

class MyHandler extends RobotsParseHandler {
  handleDisallow(lineNum, value) {
    console.log(`Line ${lineNum}: Disallow ${value}`);
  }
}

parseRobotsTxt(robotsContent, new MyHandler());

Compatibility with Google's parser

This library matches Google's behaviour in several ways that differ from a naive implementation:

Typo tolerance — common misspellings like Dissallow, Disalow, User agent are accepted.
Pattern priority — longer patterns win over shorter ones, regardless of order.
Specific agent beats wildcard — if the robots.txt contains a group for the queried agent, the User-agent: * group is ignored entirely for that agent.
URL normalisation — non-ASCII characters in allow/disallow patterns are percent-encoded to match Google's canonicalisation.
UTF-8 BOM — silently stripped at the start of the file.
Line length cap — lines longer than ~16 KB are truncated, matching the C++ implementation.
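
To illustrate the pattern-priority rule in the list above, here is a minimal sketch of longest-match precedence. It is not the library's code: isPathAllowed and the rule shape are hypothetical, matching is plain prefix comparison with no `*` or `$` support, and it assumes a tie between an allow and a disallow rule goes to allow.

```javascript
// Sketch only: plain prefix matching, no wildcard or end-anchor support.
function isPathAllowed(rules, path) {
  let bestAllow = -1;    // length of the longest matching allow pattern
  let bestDisallow = -1; // length of the longest matching disallow pattern
  for (const { type, pattern } of rules) {
    // Empty patterns are skipped; they match nothing in this sketch.
    if (pattern === '' || !path.startsWith(pattern)) continue;
    if (type === 'allow') bestAllow = Math.max(bestAllow, pattern.length);
    else bestDisallow = Math.max(bestDisallow, pattern.length);
  }
  // Assumption: a tie goes to allow, and no match at all means allowed.
  return bestDisallow <= bestAllow;
}

const rules = [
  { type: 'disallow', pattern: '/docs/' },
  { type: 'allow', pattern: '/docs/public/' },
];
console.log(isPathAllowed(rules, '/docs/public/intro'));  // true
console.log(isPathAllowed(rules, '/docs/private/notes')); // false
```

Because only pattern length is compared, reversing the order of the two rules gives the same answers.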

`/index.html` and `/index.htm` normalisation

When an Allow pattern ends in /index.html or /index.htm but does not match the requested URL, Google's parser applies a Google-specific fallback: the pattern is re-evaluated as the parent directory path anchored with $ — i.e. /dir/index.html is re-tried as /dir/$. This means the rule grants access to the exact directory URL (/dir/) but not to arbitrary paths beneath it or to /dir/index.htm (without the trailing l). It is therefore more precise than a plain Allow: /dir/ prefix match. This behaviour is inherited directly from the upstream C++ implementation (robots.cc) and is verified by the GoogleOnly_IndexHTMLisDirectory test.
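
The fallback described above can be sketched as follows. This is an illustration under simplifying assumptions, not the library's implementation: indexFallbackPattern, matchesPattern and allowRuleMatches are hypothetical names, wildcards are ignored, and the pattern is assumed to end exactly in /index.html or /index.htm.

```javascript
// Return the fallback pattern ("/dir/$") for an Allow rule that ends in
// /index.html or /index.htm, or null if the rule does not qualify.
function indexFallbackPattern(allowPattern) {
  const match = /^(.*\/)index\.html?$/.exec(allowPattern);
  return match ? match[1] + '$' : null;
}

// Simplified matcher: a trailing "$" demands an exact match, otherwise the
// pattern is treated as a prefix. "*" wildcards are not handled.
function matchesPattern(pattern, path) {
  if (pattern.endsWith('$')) return path === pattern.slice(0, -1);
  return path.startsWith(pattern);
}

// Try the Allow pattern as written, then its index-page fallback.
function allowRuleMatches(allowPattern, path) {
  if (matchesPattern(allowPattern, path)) return true;
  const fallback = indexFallbackPattern(allowPattern);
  return fallback !== null && matchesPattern(fallback, path);
}

console.log(indexFallbackPattern('/dir/index.html'));                // "/dir/$"
console.log(allowRuleMatches('/dir/index.html', '/dir/'));           // true
console.log(allowRuleMatches('/dir/index.html', '/dir/index.htm'));  // false
console.log(allowRuleMatches('/dir/index.html', '/dir/other'));      // false
```

The `$` anchor is what keeps the fallback narrower than a plain Allow: /dir/ rule, which would also match /dir/other.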

Browser usage

The library is a standard ES module with no Node.js-specific APIs, so it works directly in the browser:

<script type="module">
  import { RobotsMatcher } from './robots.js';

  const matcher = new RobotsMatcher();
  console.log(matcher.oneAgentAllowedByRobots('User-agent: *\nDisallow: /', 'MyBot', 'https://example.com/'));
</script>

License

Apache 2.0 — same as the upstream google/robotstxt repository.
