Monday, July 01, 2019
For 25 years, the Robots Exclusion Protocol (REP) was only a de-facto standard. This sometimes had frustrating implications. On one hand, for
webmasters, it meant uncertainty in corner cases, like when their text editor included BOM characters in
their robots.txt files. On the other hand, for crawler and tool developers, it also brought
uncertainty; for example, how should they deal with robots.txt files that are hundreds of
megabytes large?
Today, we announced that we're spearheading the effort
to make the REP an internet standard. While this is an important step, it means extra work for
developers who parse robots.txt files.
We're here to help: we open sourced the C++ library that our production systems use for parsing and matching rules in robots.txt
files. This library has been around for 20 years and contains pieces of code that were written
in the 90s. Since then, the library has evolved: we learned a lot about how webmasters write
robots.txt files and which corner cases we had to cover, and we added what we learned over the
years to the internet draft when it made sense.
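To give a sense of what parsing and matching looks like in code, here is a minimal sketch of calling the library, assuming the repository's robots.h header and its RobotsMatcher class; check the GitHub repository for the exact signatures and build setup (the library depends on Abseil).

```cpp
#include <iostream>
#include <string>

#include "robots.h"  // from the google/robotstxt repository

int main() {
  // A small robots.txt body; in practice this would be fetched for a site.
  const std::string robots_txt =
      "user-agent: FooBot\n"
      "disallow: /private/\n";

  // RobotsMatcher parses the rules and answers allow/disallow questions
  // for a given user agent and URL.
  googlebot::RobotsMatcher matcher;
  const bool allowed = matcher.OneAgentAllowedByRobots(
      robots_txt, "FooBot", "https://example.com/private/page.html");

  std::cout << (allowed ? "allowed" : "disallowed") << std::endl;
  return 0;
}
```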
We also included a testing tool in the open source package to help you test a few rules. Once
built, the usage is very straightforward:
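`robots_main <robots.txt content> <user_agent> <url>`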
If you want to check out the library, head over to our GitHub repository for the robots.txt parser. We'd love
to see what you can build using it! If you built something using the library, drop us a comment on Twitter, and if you have comments
or questions about the library, find us on GitHub.
[[["Easy to understand","easyToUnderstand","thumb-up"],["Solved my problem","solvedMyProblem","thumb-up"],["Other","otherUp","thumb-up"]],[["Missing the information I need","missingTheInformationINeed","thumb-down"],["Too complicated / too many steps","tooComplicatedTooManySteps","thumb-down"],["Out of date","outOfDate","thumb-down"],["Samples / code issue","samplesCodeIssue","thumb-down"],["Other","otherDown","thumb-down"]],[],[[["\u003cp\u003eThe Robots Exclusion Protocol (REP), used for controlling web crawler access, is becoming an internet standard after 25 years as a de-facto standard.\u003c/p\u003e\n"],["\u003cp\u003eGoogle open-sourced their C++ robots.txt parsing library to aid developers in implementing the standardized REP.\u003c/p\u003e\n"],["\u003cp\u003eThe open-sourced library incorporates 20 years of Google's experience and knowledge in handling robots.txt files and edge cases.\u003c/p\u003e\n"],["\u003cp\u003eA testing tool is included within the open-source package to facilitate easy verification of robots.txt rules.\u003c/p\u003e\n"],["\u003cp\u003eDevelopers are encouraged to utilize the library and share their creations or feedback with Google.\u003c/p\u003e\n"]]],["Google is leading efforts to formalize the Robots Exclusion Protocol (REP) as an internet standard, previously only a de-facto standard. They have open-sourced their C++ library, used for 20 years to parse and match rules in robots.txt files, to assist developers. This library now includes a testing tool, `robots_main`, for checking rules. Developers can engage with Google via GitHub and Twitter. The aim is to address past uncertainties.\n"],null,["# Google's robots.txt parser is now open source\n\nMonday, July 01, 2019\n\n\nFor 25 years, the [Robots Exclusion Protocol (REP)](https://www.robotstxt.org/norobots-rfc.txt)\nwas only a de-facto standard. This had frustrating implications sometimes. On one hand, for\nwebmasters, it meant uncertainty in corner cases, like when their text editor included\n[BOM](https://en.wikipedia.org/wiki/Byte_order_mark) characters in\ntheir robots.txt files. On the other hand, for crawler and tool developers, it also brought\nuncertainty; for example, how should they deal with robots.txt files that are hundreds of\nmegabytes large?\n\n\nToday, [we announced](/search/blog/2019/07/rep-id) that we're spearheading the effort\nto make the REP an internet standard. While this is an important step, it means extra work for\ndevelopers who parse robots.txt files.\n\n\nWe're here to help: we [open sourced](https://github.com/google/robotstxt)\nthe C++ library that our production systems use for parsing and matching rules in robots.txt\nfiles. This library has been around for 20 years and it contains pieces of code that were written\nin the 90's. Since then, the library evolved; we learned a lot about how webmasters write\nrobots.txt files and corner cases that we had to cover for, and added what we learned over the\nyears also to the internet draft when it made sense.\n\n\nWe also included a testing tool in the open source package to help you test a few rules. Once\nbuilt, the usage is very straightforward:\n\n\n`robots_main \u003crobots.txt content\u003e \u003cuser_agent\u003e \u003curl\u003e`\n\n\nIf you want to check out the library, head over to our GitHub repository for the\n[robots.txt parser](https://github.com/google/robotstxt). We'd love\nto see what you can build using it! 
If you built something using the library, drop us a comment on\n[Twitter](https://twitter.com/googlesearchc), and if you have comments\nor questions about the library, find us on\n[GitHub](https://github.com/google/robotstxt).\n\n\nPosted by [Edu Pereda](https://twitter.com/epere4),\n[Lode Vandevenne](https://github.com/lvandeve), and\n[Gary Illyes](https://garyillyes.com/+), Search Open Sourcing team"]]