GET, POST, and safely surfacing more of the web
Tuesday, November 01, 2011
As the web evolves, Google's crawling and indexing capabilities also need to progress. We improved our indexing of Flash, built a more robust infrastructure called Caffeine, and we even started crawling forms where it makes sense. Now, especially with the growing popularity of JavaScript and, with it, AJAX, we're finding more web pages requiring POST requests, either for the entire content of the page or because the pages are missing information and/or look completely broken without the resources returned from POST. For Google Search this is less than ideal, because when we're not properly discovering and indexing content, searchers may not have access to the most comprehensive and relevant results.
We generally advise using GET for fetching resources a page needs, and this is by far our preferred method of crawling. We've started experiments to rewrite POST requests to GET, and while this remains a valid strategy in some cases, often the contents returned by a web server for GET vs. POST are completely different. Additionally, there are legitimate reasons to use POST (for example, you can attach more data to a POST request than to a GET). So, while GET requests remain far more common, to surface more content on the web, Googlebot may now perform POST requests when we believe it's safe and appropriate.
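To see why blindly rewriting POST to GET can fail, consider a server that answers the two methods differently for the same URL. The following is only an illustrative sketch, not part of the original post: it assumes a hypothetical Node.js server with the Express framework installed, and reuses the hot-fudge-info.html path from the example further down.

const express = require('express');
const app = express();

// The same URL answers GET and POST with completely different content,
// so rewriting the POST to a GET would index the wrong thing.
app.get('/hot-fudge-info.html', function (req, res) {
  res.send('Open the sundae page to load this content.');
});

app.post('/hot-fudge-info.html', function (req, res) {
  res.send('<p>Hot fudge is a warm chocolate sauce served over ice cream.</p>');
});

app.listen(3000);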
We take precautions to avoid performing any task on a site that could result in executing an
unintended user action. Our POST requests are primarily for crawling resources that
a page requests automatically, mimicking what a typical user would see when they open the URL in
their browser. This will evolve over time as we find better heuristics, but that's our current
approach.
Let's run through a few POST request scenarios that demonstrate how we're improving our crawling and indexing to evolve with the web.
Examples of Googlebot's POST requests
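Crawling a page via a POST redirect: Some pages submit a form automatically as soon as they load, so their content is reachable only through POST. For example, a page like the following immediately POSTs to request.php when it's opened:
<html>
  <body onload="document.foo.submit();">
    <form name="foo" action="request.php" method="post">
      <input type="hidden" name="bar" value="234"/>
    </form>
  </body>
</html>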
Crawling a resource via a POST XMLHttpRequest: In this step-by-step example, we improve both the indexing of a page and its Instant Preview by following the automatic XMLHttpRequest generated as the page renders.
1. Google crawls the URL, yummy-sundae.html.
2. Google begins indexing yummy-sundae.html and, as part of this process, decides to attempt to render the page to better understand its content and/or generate the Instant Preview.
3. During the render, yummy-sundae.html automatically sends an XMLHttpRequest for a resource, hot-fudge-info.html, using the POST method.
<html>
  <head>
    <title>Yummy Sundae</title>
    <script src="jquery.js"></script>
  </head>
  <body>
    This page is about a yummy sundae.
    <div id="content"></div>
    <script>
      // Once the page is ready, request hot-fudge-info.html via POST
      // and inject the response into the empty div above.
      $(document).ready(function() {
        $.post('hot-fudge-info.html', function(data) {
          $('#content').html(data);
        });
      });
    </script>
  </body>
</html>
4. The URL requested through POST, hot-fudge-info.html, along with its data payload, is added to Googlebot's crawl queue.
5. Googlebot performs a POST request to crawl hot-fudge-info.html (sketched below).
6. Google now has an accurate representation of yummy-sundae.html for Instant Previews. In certain cases, we may also incorporate the contents of hot-fudge-info.html into yummy-sundae.html.
7. Google completes the indexing of yummy-sundae.html.
8. A user searches for "hot fudge sundae".
9. Google's algorithms can now better determine how yummy-sundae.html is relevant for this query, and we can properly display a snapshot of the page for Instant Previews.
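For reference, the automatic XMLHttpRequest above boils down to an HTTP request roughly like the following; the host name is made up for illustration, exact headers vary by client, and because no data is attached the body is empty:

POST /hot-fudge-info.html HTTP/1.1
Host: www.example.com
Content-Length: 0

Had the page passed data to $.post(), it would be sent form-encoded in the request body, which is the payload that step 4 refers to.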
Improving your site's crawlability and indexability
General advice for creating crawlable sites is found in our Help Center. For webmasters who want to help Google crawl and index their content and/or generate the Instant Preview, here are a few simple reminders:
Prefer GET for fetching resources, unless there's a specific reason to use POST.
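In the sundae example above, that simply means requesting the fragment with GET when the server doesn't require POST; a minimal sketch, assuming hot-fudge-info.html serves the same content either way:

$(document).ready(function() {
  // Same behavior as the earlier snippet, but the fragment is fetched
  // with GET, which is easier for crawlers to discover and fetch.
  $.get('hot-fudge-info.html', function(data) {
    $('#content').html(data);
  });
});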
Verify that we're allowed to crawl the resources needed to render your page. In the example above, if hot-fudge-info.html is disallowed by robots.txt, Googlebot won't fetch it. More subtly, if the JavaScript code that issues the XMLHttpRequest is located in an external .js file disallowed by robots.txt, we won't see the connection between yummy-sundae.html and hot-fudge-info.html, so even if the latter is not disallowed itself, that may not help us much. We've seen even more complicated chains of dependencies in the wild. To help Google better understand your site, it's almost always better to allow Googlebot to crawl all resources. You can test whether resources are blocked through Webmaster Tools, under Labs > Instant Previews.
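As an illustration, a robots.txt along these lines (the paths are hypothetical, echoing the example above) would cause exactly the problems just described:

User-agent: Googlebot
# Blocks the resource the page fetches via POST:
Disallow: /hot-fudge-info.html
# Or, if the XMLHttpRequest lived in an external script, blocking that
# script hides the dependency entirely:
Disallow: /scripts/

Removing such rules so Googlebot can fetch everything the page needs to render is the safer default.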
Make sure to return the same content to Googlebot as is returned to users' web browsers. Cloaking (sending different content to Googlebot than to users) is a violation of our Webmaster Guidelines because, among other things, it may cause us to provide a searcher with an irrelevant result: the content the user views in their browser may be a complete mismatch from what we crawled and indexed. We've seen numerous POST request examples where a webmaster non-maliciously cloaked (which is still a violation), and their cloaking, on even the smallest of changes, then caused JavaScript errors that prevented accurate indexing and completely defeated their reason for cloaking in the first place. In summary, if you want your site to be search-friendly, cloaking is an all-around sticky situation that's best avoided. To verify that you're not accidentally cloaking, you can use Instant Previews within Webmaster Tools, or try setting the User-Agent string in your browser to something like:
Mozilla/5.0 (compatible; Googlebot/2.1; +https://www.google.com/bot.html)
Your site shouldn't look any different after such a change. If you see a blank page, a
JavaScript error, or if parts of the page are missing or different, that means that something's
wrong.
Remember to include important content (that is, the content you'd like indexed) as text, visible
directly on the page and without requiring user action to display. Most search engines are
text-based and generally work best with text-based content. We're always improving our ability
to crawl and index content published in a variety of ways, but it remains a good practice to
use text for important information.
Controlling your content
If you'd like to prevent content from being crawled or indexed for Google Web Search, traditional robots.txt rules remain the best method. To prevent the Instant Preview for your page(s), please see our Instant Previews FAQ, which describes the Google Web Preview User-Agent and the nosnippet meta tag.
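For example, the nosnippet directive in a standard robots meta tag asks search engines not to show a snippet for the page; the FAQ above describes how this affected Instant Previews:

<meta name="robots" content="nosnippet">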
Moving forward
We'll continue striving to increase the comprehensiveness of our index so searchers can find more
relevant information. And we expect our crawling and indexing capability to improve and evolve
over time, just like the web itself. Please let us know if you have questions or concerns.
[[["Easy to understand","easyToUnderstand","thumb-up"],["Solved my problem","solvedMyProblem","thumb-up"],["Other","otherUp","thumb-up"]],[["Missing the information I need","missingTheInformationINeed","thumb-down"],["Too complicated / too many steps","tooComplicatedTooManySteps","thumb-down"],["Out of date","outOfDate","thumb-down"],["Samples / code issue","samplesCodeIssue","thumb-down"],["Other","otherDown","thumb-down"]],[],[[["\u003cp\u003eGooglebot may now perform \u003ccode\u003ePOST\u003c/code\u003e requests to crawl and index content that requires it, while still preferring \u003ccode\u003eGET\u003c/code\u003e requests as the primary crawling method.\u003c/p\u003e\n"],["\u003cp\u003eGooglebot's \u003ccode\u003ePOST\u003c/code\u003e requests are primarily aimed at accessing content that is requested automatically by a page, similar to a typical user's browser behavior.\u003c/p\u003e\n"],["\u003cp\u003eTo ensure optimal crawlability and indexability, websites should prefer \u003ccode\u003eGET\u003c/code\u003e requests, allow Googlebot to crawl all necessary resources, and avoid cloaking.\u003c/p\u003e\n"],["\u003cp\u003eContent that webmasters wish to exclude from Google's search results can be controlled using robots.txt rules and the \u003ccode\u003enosnippet\u003c/code\u003e meta tag.\u003c/p\u003e\n"],["\u003cp\u003eGoogle's crawling and indexing technologies will continue evolving to better understand and represent the changing nature of the web.\u003c/p\u003e\n"]]],["Google's web crawling has evolved to better index dynamic content. Googlebot now performs `POST` requests in certain cases, such as when pages rely on them to load content via `XMLHttpRequest` or redirects. They prefer `GET` requests and advise webmasters to use `GET` unless `POST` is necessary. Webmasters are encouraged to allow Googlebot to crawl all page resources and ensure content displayed to Googlebot matches that of users, and also include important content as text.\n"],null,["# GET, POST, and safely surfacing more of the web\n\nTuesday, November 01, 2011\n\n\nAs the web evolves, Google's crawling and indexing capabilities also need to progress. We\n[improved our indexing of Flash](/search/blog/2008/06/improved-flash-indexing), built\na more robust\n[infrastructure called Caffeine](/search/blog/2010/06/our-new-search-index-caffeine),\nand we even started\n[crawling forms](/search/blog/2008/04/crawling-through-html-forms) where it makes\nsense. Now, especially with the growing popularity of JavaScript and, with it, AJAX, we're\nfinding more web pages requiring `POST` requests---either for the entire content of\nthe page or because the pages are missing information and/or look completely broken without the\nresources returned from `POST`. For Google Search this is less than ideal, because when\nwe're not properly discovering and indexing content, searchers may not have access to the most\ncomprehensive and relevant results.\n\n\nWe generally advise to use\n[`GET`](https://www.google.com/search?q=GET+POST+HTTP)\nfor fetching resources a page needs, and this is by far our preferred method of crawling. We've\nstarted experiments to rewrite `POST` requests to `GET`, and while this\nremains a valid strategy in some cases, often the contents returned by a web server for\n`GET` vs. `POST` are completely different. Additionally, there are\nlegitimate reasons to use `POST` (for example, you can attach more\ndata to a `POST` request than a `GET`). 
So, while `GET` requests\nremain far more common, to surface more content on the web, Googlebot may now perform\n`POST` requests when we believe it's safe and appropriate.\n\n\nWe take precautions to avoid performing any task on a site that could result in executing an\nunintended user action. Our `POST` requests are primarily for crawling resources that\na page requests automatically, mimicking what a typical user would see when they open the URL in\ntheir browser. This will evolve over time as we find better heuristics, but that's our current\napproach.\n\n\nLet's run through a few `POST` request scenarios that demonstrate how we're improving\nour crawling and indexing to evolve with the web.\n\nExamples of Googlebot's `POST` requests\n---------------------------------------\n\n- *Crawling a page via a POST redirect* \n\n ```\n \u003chtml\u003e\n \u003cbody onload=\"document.foo.submit();\"\u003e\n \u003cform name=\"foo\" action=\"request.php\" method=\"post\"\n \u003cinput type=\"hidden\" name=\"bar\" value=\"234\"/\u003e\n \u003c/form\u003e\n \u003c/body\u003e\n \u003c/html\u003e\n ```\n- *Crawling a resource via a `POST` `XMLHttpRequest`* : In this step-by-step example, we improve both the indexing of a page and its Instant Preview by following the automatic `XMLHttpRequest` generated as the page renders.\n 1. Google crawls the URL, yummy-sundae.html.\n 2. Google begins indexing yummy-sundae.html and, as a part of this process, decides to attempt to render the page to better understand its content and/or generate the Instant Preview.\n 3. During the render, yummy-sundae.html automatically sends an XMLHttpRequest for a resource, hot-fudge-info.html, using the `POST` method. \n\n ```\n \u003chtml\u003e\n \u003chead\u003e\n \u003ctitle\u003eYummy Sundae\u003c/title\u003e\n \u003cscript src=\"jquery.js\"\u003e\u003c/script\u003e\n \u003c/head\u003e\n \u003cbody\u003e\n This page is about a yummy sundae.\n \u003cdiv id=\"content\"\u003e\u003c/div\u003e\n \u003cscript\u003e\n $(document).ready(function() {\n $.post('hot-fudge-info.html', function(data)\n {$('#content').html(data);});\n });\n \u003c/script\u003e\n \u003c/body\u003e\n \u003c/html\u003e\n ```\n 4. The URL requested through `POST`, hot-fudge-info.html, along with its data payload, is added to Googlebot's crawl queue.\n 5. Googlebot performs a `POST` request to crawl hot-fudge-info.html.\n 6. Google now has an accurate representation of yummy-sundae.html for Instant Previews. In certain cases, we may also incorporate the contents of hot-fudge-info.html into yummy-sundae.html.\n 7. Google completes the indexing of yummy-sundae.html.\n 8. User searches for \"hot fudge sundae\".\n 9. Google's algorithms can now better determine how yummy-sundae.html is relevant for this query, and we can properly display a snapshot of the page for Instant Previews.\n\nImproving your site's crawlability and indexability\n---------------------------------------------------\n\n\nGeneral advice for creating crawlable sites is found in our\n[Help Center](https://www.google.com/support/webmasters/bin/answer.py?answer=40349).\nFor webmasters who want to help Google crawl and index their content and/or generate the Instant\nPreview, here are a few simple reminders:\n\n- Prefer `GET` for fetching resources, unless there's a specific reason to use `POST`.\n- Verify that we're allowed to crawl the resources needed to render your page. 
In the example above, if hot-fudge-info.html is disallowed by [robots.txt](/search/docs/crawling-indexing/robots/intro), Googlebot won't fetch it. More subtly, if the JavaScript code that issues the `XMLHttpRequest` is located in an external `.js` file disallowed by robots.txt, we won't see the connection between yummy-sundae.html and hot-fudge-info.html, so even if the latter is not disallowed itself, that may not help us much. We've seen even more complicated chains of dependencies in the wild. To help Google better understand your site it's almost always better to allow Googlebot to crawl all resources. \n You can test whether resources are blocked through [Webmaster Tools](https://search.google.com/search-console) *Labs* \\\u003e [*Instant Previews*](/search/blog/2011/05/troubleshooting-instant-previews-in).\n- Make sure to return the same content to Googlebot as is returned to users' web browsers. [Cloaking](/search/docs/essentials/spam-policies#cloaking) (sending different content to Googlebot than to users) is a violation of our [Webmaster Guidelines](/search/docs/essentials) because, among other things, it may cause us to provide a searcher with an irrelevant result ---the content the user views in their browser may be a complete mismatch from what we crawled and indexed. We've seen numerous `POST` request examples where a webmaster non-maliciously cloaked (which is still a violation), and their cloaking---on even the smallest of changes---then caused JavaScript errors that prevented accurate indexing and completely defeated their reason for cloaking in the first place. Summarizing, if you want your site to be search-friendly, cloaking is an all-around sticky situation that's best to avoid. \n To verify that you're not accidentally cloaking, you can use [Instant Previews](/search/blog/2011/05/troubleshooting-instant-previews-in) within Webmaster Tools, or try setting the User-Agent string in your browser to something like: \n\n ```\n Mozilla/5.0 (compatible; Googlebot/2.1; +https://www.google.com/bot.html)\n ```\n Your site shouldn't look any different after such a change. If you see a blank page, a JavaScript error, or if parts of the page are missing or different, that means that something's wrong.\n- Remember to include important content (that is, the content you'd like indexed) as text, visible directly on the page and without requiring user-action to display. Most search engines are text-based and generally work best with text-based content. We're always improving our ability to crawl and index content published in a variety of ways, but it remains a good practice to use text for important information.\n\nControlling your content\n------------------------\n\n\nIf you'd like to prevent content from being crawled or indexed for Google Web Search, traditional\n[robots.txt rules](/search/docs/crawling-indexing/robots/robots_txt#syntax)\nremain the best method. To prevent the Instant Preview for your page(s), please see our\n[Instant Previews FAQ](https://sites.google.com/site/webmasterhelpforum/en/faq-instant-previews)\nwhich describes the `Google Web Preview` User-Agent and the `nosnippet` `meta` tag.\n\nMoving forward\n--------------\n\n\nWe'll continue striving to increase the comprehensiveness of our index so searchers can find more\nrelevant information. And we expect our crawling and indexing capability to improve and evolve\nover time, just like the web itself. 
Please let us know if you have questions or concerns.\n\n\nWritten by\n[Pawel Aleksander Fedorynski](https://plus.google.com/103690467358879664235/about),\nSoftware Engineer, Indexing Team, and\n[Maile Ohye](/search/blog/authors/maile-ohye),\nDeveloper Programs Tech Lead"]]