A pure JavaScript port of Google's official robotstxt C++ library. Runs in both Node.js and the browser with no dependencies.
Implements the same parsing rules, typo tolerance, and URL-matching logic that Google's own crawler uses to evaluate robots.txt files.
```bash
npm install google-robotstxt-parser
```

```javascript
import { RobotsMatcher } from 'google-robotstxt-parser';

const matcher = new RobotsMatcher();
const robotsContent = `
User-agent: *
Dissallow: /secret/ # Typo accepted by Google!
`;
const isAllowed = matcher.allowedByRobots(robotsContent, ['Googlebot'], 'https://example.com/secret/page');
console.log(isAllowed); // false
```

```javascript
const allowed = matcher.oneAgentAllowedByRobots(robotsContent, 'Googlebot', 'https://example.com/public/');
console.log(allowed); // true
```

`allowedByRobots` accepts an array of user-agents; the URL is allowed if at least one of them may fetch it.
```javascript
const allowed = matcher.allowedByRobots(robotsContent, ['Googlebot', 'Bingbot'], 'https://example.com/page');
```

| Method | Description |
|---|---|
| `allowedByRobots(robotsTxt, userAgents, url)` | Returns `true` if the URL is accessible to at least one of the given user-agents. |
| `oneAgentAllowedByRobots(robotsTxt, userAgent, url)` | Convenience wrapper for a single user-agent string. |
| `disallow()` | Returns the raw disallow decision after a parse (useful after calling `allowedByRobots`). |
| `everSeenSpecificAgent()` | `true` if the parsed file contained a rule group for the queried agent specifically. |
| `matchingLine()` | Line number of the winning allow/disallow rule, or `0` if none matched. |
`parseRobotsTxt` is the low-level parser. Pass it a `RobotsParseHandler` subclass to react to individual directives without running the full matcher.

```javascript
import { parseRobotsTxt, RobotsParseHandler } from 'google-robotstxt-parser';

class MyHandler extends RobotsParseHandler {
  // Called for each Disallow directive found in the file.
  handleDisallow(lineNum, value) {
    console.log(`Line ${lineNum}: Disallow ${value}`);
  }
}

parseRobotsTxt(robotsContent, new MyHandler());
```

This library matches Google's behaviour in several ways that differ from a naive implementation:
- Typo tolerance: common misspellings such as `Dissallow`, `Disalow`, and `User agent` are accepted.
- Pattern priority: longer patterns win over shorter ones, regardless of order.
- Specific agent beats wildcard: if the robots.txt contains a group for the queried agent, the `User-agent: *` group is ignored entirely for that agent.
- URL normalisation: non-ASCII characters in allow/disallow patterns are percent-encoded to match Google's canonicalisation.
- UTF-8 BOM: silently stripped at the start of the file.
- Line length cap: lines longer than ~16 KB are truncated, matching the C++ implementation.
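
The pattern-priority rule from the list above can be illustrated with a small standalone sketch. This is not the library's code: `patternMatches` and `isAllowed` are hypothetical helpers, the rule objects are an assumed shape, and ties are assumed to resolve in favour of `Allow`. It only shows how the longest matching pattern can win regardless of the order of the rules.

```javascript
// Simplified sketch of longest-match rule selection (not the library's implementation).
// Supports '*' (any run of characters) and a trailing '$' (end of path).
function patternMatches(pattern, path) {
  const anchored = pattern.endsWith('$');
  const body = anchored ? pattern.slice(0, -1) : pattern;
  // Escape regex metacharacters, then turn each escaped '*' into '.*'.
  const escaped = body
    .replace(/[.*+?^${}()|[\]\\]/g, '\\$&')
    .replace(/\\\*/g, '.*');
  return new RegExp('^' + escaped + (anchored ? '$' : '')).test(path);
}

// rules: array of { type: 'allow' | 'disallow', pattern: string } (assumed shape).
function isAllowed(rules, path) {
  let bestAllow = -1;
  let bestDisallow = -1;
  for (const { type, pattern } of rules) {
    // An empty pattern carries no restriction, so it is skipped.
    if (pattern === '' || !patternMatches(pattern, path)) continue;
    if (type === 'allow') {
      bestAllow = Math.max(bestAllow, pattern.length);
    } else {
      bestDisallow = Math.max(bestDisallow, pattern.length);
    }
  }
  // The longest matching pattern decides; a tie is resolved as allowed (assumption).
  return bestAllow >= bestDisallow;
}

const rules = [
  { type: 'disallow', pattern: '/' },
  { type: 'allow', pattern: '/public/' },
  { type: 'disallow', pattern: '/public/drafts/' },
];

console.log(isAllowed(rules, '/public/page.html')); // true
console.log(isAllowed(rules, '/public/drafts/a'));  // false
console.log(isAllowed(rules, '/private'));          // false
```

Reordering the entries in `rules` does not change any of the three results, because only the length of the matching pattern is compared.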
When an `Allow` pattern ends in `/index.html` or `/index.htm` but does not match the requested URL, the parser applies a Google-specific fallback: it re-evaluates the pattern as the parent directory path anchored with `$`, so `/dir/index.html` is retried as `/dir/$`. The rule therefore also grants access to the exact directory URL (`/dir/`), but not to arbitrary paths beneath it or to `/dir/index.htm` (without the trailing `l`), which makes it more precise than a plain `Allow: /dir/` prefix match. This behaviour is inherited directly from the upstream C++ implementation (`robots.cc`) and is verified by the `GoogleOnly_IndexHTMLisDirectory` test.
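
As a rough illustration of the fallback described above, the rewrite step might look like the following. Both function names are hypothetical and not part of the library's API, and `matchesAnchored` deliberately ignores `*` wildcards to keep the sketch short.

```javascript
// Hypothetical helper: derive the fallback pattern for an Allow rule
// that ends in /index.html or /index.htm. Returns null when no fallback applies.
function indexHtmlFallback(allowPattern) {
  const slash = allowPattern.lastIndexOf('/');
  if (slash === -1) return null;
  const tail = allowPattern.slice(slash);
  if (tail !== '/index.html' && tail !== '/index.htm') return null;
  // Keep everything up to and including the last '/', then anchor with '$'.
  return allowPattern.slice(0, slash + 1) + '$';
}

// Hypothetical helper: exact match against a '$'-anchored pattern (no wildcard support).
function matchesAnchored(anchoredPattern, path) {
  return anchoredPattern.endsWith('$') && path === anchoredPattern.slice(0, -1);
}

const fallback = indexHtmlFallback('/dir/index.html');
console.log(fallback);                                     // /dir/$
console.log(matchesAnchored(fallback, '/dir/'));           // true
console.log(matchesAnchored(fallback, '/dir/other'));      // false
console.log(matchesAnchored(fallback, '/dir/index.htm'));  // false
console.log(indexHtmlFallback('/dir/page.html'));          // null
```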
The library is a standard ES module with no Node.js-specific APIs, so it works directly in the browser:

```html
<script type="module">
  import { RobotsMatcher } from './robots.js';
  const matcher = new RobotsMatcher();
  console.log(matcher.oneAgentAllowedByRobots('User-agent: *\nDisallow: /', 'MyBot', 'https://example.com/'));
</script>
```

Apache 2.0, the same license as the upstream google/robotstxt repository.