In the user-agent string, you'll see “Amazonbot” along with additional agent information. For example:
Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_1) AppleWebKit/600.2.5 (KHTML, like Gecko) Version/8.0.2 Safari/600.2.5 (Amazonbot/0.1; +https://developer.amazon.com/support/amazonbot)
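If you need to recognize this crawler in your own request handling, a substring check against the “Amazonbot” token is enough; the Python sketch below is a minimal illustration, and the is_amazonbot helper is a hypothetical name rather than any Amazon tooling. Note that the user-agent string can be spoofed, so pair this with the DNS verification described later on this page.

# Minimal sketch: recognize Amazonbot by its user-agent token.
# is_amazonbot is a hypothetical helper, not an official API.
def is_amazonbot(user_agent: str) -> bool:
    return "amazonbot" in user_agent.lower()

ua = ("Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_1) AppleWebKit/600.2.5 "
      "(KHTML, like Gecko) Version/8.0.2 Safari/600.2.5 "
      "(Amazonbot/0.1; +https://developer.amazon.com/support/amazonbot)")
print(is_amazonbot(ua))  # True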
Amazonbot respects the robots.txt protocol and honors its user-agent, allow, and disallow directives, which lets webmasters manage how the crawler accesses their site. However, Amazonbot does not currently support the crawl-delay directive or robots meta tags on HTML pages, such as “nofollow” and “noindex”. In the example below, Amazonbot will not crawl web pages under /do-not-crawl/ or /not-allowed/.
User-agent: Amazonbot # Amazon's user agent
Disallow: /do-not-crawl/ # disallow this directory
User-agent: * # any robot
Disallow: /not-allowed/ # disallow this directory
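If you want to sanity-check rules like these before publishing them, one option is Python's standard-library robots.txt parser; the sketch below is a minimal illustration, and the example.com URLs and the SomeOtherBot name are hypothetical.

# Minimal sketch: evaluate the robots.txt rules shown above.
from urllib.robotparser import RobotFileParser

rules = """\
User-agent: Amazonbot
Disallow: /do-not-crawl/

User-agent: *
Disallow: /not-allowed/
"""

parser = RobotFileParser()
parser.parse(rules.splitlines())

# The Amazonbot group blocks its directory...
print(parser.can_fetch("Amazonbot", "https://example.com/do-not-crawl/page"))  # False
# ...while paths no directive covers remain crawlable.
print(parser.can_fetch("Amazonbot", "https://example.com/products"))  # True
# Any other robot falls back to the wildcard group.
print(parser.can_fetch("SomeOtherBot", "https://example.com/not-allowed/page"))  # False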
Amazonbot attempts to read robots.txt at the host level (e.g., ‘test.amazon.com’); it does not look for the file at the domain level (e.g., ‘amazon.com’). If Amazonbot cannot fetch robots.txt because of IP or user-agent blocking, parsing errors, network timeouts, or a non-successful status code (such as a 3XX, 4XX, or 5XX), it will attempt to refetch robots.txt or use a cached copy from the last 30 days. If both approaches fail, Amazonbot will behave as if robots.txt does not exist and will crawl the site. When robots.txt is accessible, Amazonbot responds to changes in it within 24 hours.
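As a concrete illustration of the host-level lookup, the sketch below derives the robots.txt URL a crawler would fetch for a given page; robots_url and the page URL are hypothetical.

# Minimal sketch: robots.txt lives at the root of the exact host,
# not at the parent domain.
from urllib.parse import urlsplit, urlunsplit

def robots_url(page_url: str) -> str:
    parts = urlsplit(page_url)
    return urlunsplit((parts.scheme, parts.netloc, "/robots.txt", "", ""))

print(robots_url("https://test.amazon.com/some/page"))
# https://test.amazon.com/robots.txt -- amazon.com/robots.txt is not consulted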
Amazonbot supports the link-level rel=nofollow directive. Include it in your HTML like this to keep Amazonbot from following or crawling a particular link on your website:
<a href="signin.php" rel="nofollow">Sign in</a>
...
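To see how a directive-respecting crawler treats such links, the sketch below uses Python's standard-library HTML parser to collect only followable links, skipping any anchor whose rel attribute includes nofollow; the LinkCollector class and the second link are hypothetical.

# Minimal sketch: gather links a crawler may follow, honoring rel="nofollow".
from html.parser import HTMLParser

class LinkCollector(HTMLParser):
    def __init__(self):
        super().__init__()
        self.followable = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        attrs = dict(attrs)
        rel = (attrs.get("rel") or "").lower().split()
        if "nofollow" not in rel and "href" in attrs:
            self.followable.append(attrs["href"])

collector = LinkCollector()
collector.feed('<a href="signin.php" rel="nofollow">Sign in</a>'
               '<a href="/products">Products</a>')
print(collector.followable)  # ['/products']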
Amazonbot supports the noarchive directive, which you can use to signal to Amazon that you do not want your crawled data to be used for training large language models. You can provide this signal at the page level with a meta tag or across content types with an HTTP response header, as described below.
Page-Level Restrictions: To apply the 'noarchive' directive at the page level, add a <meta> tag in the <head> section of your HTML:
<meta name="robots" content="noarchive">
To specifically target Amazonbot, replace name="robots" with name="amazonbot":
<meta name="amazonbot" content="noarchive">
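If you want to confirm that your pages emit this signal, the sketch below scans a document's meta tags for a noarchive directive addressed to all robots or to Amazonbot specifically; NoArchiveChecker and the HTML snippet are hypothetical.

# Minimal sketch: detect a page-level noarchive signal.
from html.parser import HTMLParser

class NoArchiveChecker(HTMLParser):
    def __init__(self):
        super().__init__()
        self.noarchive = False

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attrs = {k: (v or "").lower() for k, v in attrs}
        if attrs.get("name") in ("robots", "amazonbot") and \
                "noarchive" in attrs.get("content", ""):
            self.noarchive = True

checker = NoArchiveChecker()
checker.feed('<head><meta name="amazonbot" content="noarchive"></head>')
print(checker.noarchive)  # True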
Header-Level Restrictions: For broader control across different content types (pages, images, videos, PDFs, etc.), you can use the X-Robots-Tag in the HTTP response headers.
HTTP/1.1 200 OK
Date: Tue, 15 Oct 2024 08:09:00 GMT
X-Robots-Tag: noarchive
To restrict Amazonbot specifically, add “amazonbot” to the tag:
HTTP/1.1 200 OK
Date: Tue, 15 Oct 2024 08:09:00 GMT
X-Robots-Tag: amazonbot: noarchive
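To check what your server actually sends, the sketch below issues a HEAD request and collects any X-Robots-Tag headers that carry a noarchive directive; noarchive_directives and the URL are hypothetical.

# Minimal sketch: read noarchive directives from X-Robots-Tag headers.
import urllib.request

def noarchive_directives(url: str) -> list:
    request = urllib.request.Request(url, method="HEAD")
    with urllib.request.urlopen(request) as response:
        # A response may carry several X-Robots-Tag headers.
        tags = response.headers.get_all("X-Robots-Tag") or []
    return [tag for tag in tags if "noarchive" in tag.lower()]

print(noarchive_directives("https://example.com/report.pdf"))
# e.g. ['noarchive'] or ['amazonbot: noarchive']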
To learn more about partnering with Amazon on Alexa and other virtual assistant use cases, feel free to contact us at pubpartnerships@amazon.com. If you are a content owner, please identify your domain(s), RSS feed links, and/or other references to your content within your message.
Verify that a crawler accessing your server is the official Amazonbot crawler by using reverse and forward DNS lookups. This helps you identify other bots or malicious agents that may access your site while claiming to be Amazonbot.
You can verify Amazonbot with command-line tools in three steps: run a reverse DNS lookup on the accessing IP address, confirm that the returned host name ends in crawl.amazonbot.amazon, then run a forward DNS lookup on that host name and confirm it resolves back to the same IP address. For example:
$ host 12.34.56.789
789.56.34.12.in-addr.arpa domain name pointer 12-34-56-789.crawl.amazonbot.amazon.
$ host 12-34-56-789.crawl.amazonbot.amazon
12-34-56-789.crawl.amazonbot.amazon has address 12.34.56.789
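The same reverse-then-forward check can be automated. The sketch below uses Python's standard library; is_official_amazonbot and the IP address are hypothetical placeholders.

# Minimal sketch: verify that a visitor IP really belongs to Amazonbot.
import socket

def is_official_amazonbot(ip: str) -> bool:
    try:
        # Reverse lookup: IP -> host name.
        hostname, _, _ = socket.gethostbyaddr(ip)
        hostname = hostname.rstrip(".")
        if not hostname.endswith(".crawl.amazonbot.amazon"):
            return False
        # Forward lookup: the host name must resolve back to the same IP.
        return ip in socket.gethostbyname_ex(hostname)[2]
    except (socket.herror, socket.gaierror):  # no PTR record or lookup failure
        return False

print(is_official_amazonbot("12.34.56.78"))  # hypothetical IP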
If you have questions or recommendations on what crawl rate to use, please contact us at amazonbot@amazon.com. If you are a content owner, please always include any relevant domain names in your message.