In the user-agent string, you'll see “Amazonbot” along with additional agent information. For example:
Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_1) AppleWebKit/600.2.5 (KHTML, like Gecko) Version/8.0.2 Safari/600.2.5 (Amazonbot/0.1; +https://developer.amazon.com/support/amazonbot)
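If you need to recognize this crawler in your own request handling, a substring check against the “Amazonbot” token is enough; the Python sketch below is a minimal illustration, and the is_amazonbot helper is a hypothetical name rather than any Amazon tooling. Note that the user-agent string can be spoofed, so pair this with the DNS verification described later on this page.

# Minimal sketch: recognize Amazonbot by its user-agent token.
# is_amazonbot is a hypothetical helper, not an official API.
def is_amazonbot(user_agent: str) -> bool:
    return "amazonbot" in user_agent.lower()

ua = ("Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_1) AppleWebKit/600.2.5 "
      "(KHTML, like Gecko) Version/8.0.2 Safari/600.2.5 "
      "(Amazonbot/0.1; +https://developer.amazon.com/support/amazonbot)")
print(is_amazonbot(ua))  # True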
Amazonbot respects the robots.txt protocol and honors its user-agent, allow, and disallow directives, which lets webmasters manage how the crawler accesses their site. However, Amazonbot does not currently support the crawl-delay directive or robots meta tags on HTML pages, such as “nofollow” and “noindex”. In the example below, Amazonbot will not crawl web pages under /do-not-crawl/ or /not-allowed/.
User-agent: Amazonbot # Amazon's user agent
Disallow: /do-not-crawl/ # disallow this directory
User-agent: * # any robot
Disallow: /not-allowed/ # disallow this directory
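If you want to sanity-check rules like these before publishing them, one option is Python's standard-library robots.txt parser; the sketch below is a minimal illustration, and the example.com URLs and the SomeOtherBot name are hypothetical.

# Minimal sketch: evaluate the robots.txt rules shown above.
from urllib.robotparser import RobotFileParser

rules = """\
User-agent: Amazonbot
Disallow: /do-not-crawl/

User-agent: *
Disallow: /not-allowed/
"""

parser = RobotFileParser()
parser.parse(rules.splitlines())

# The Amazonbot group blocks its directory...
print(parser.can_fetch("Amazonbot", "https://example.com/do-not-crawl/page"))  # False
# ...while paths no directive covers remain crawlable.
print(parser.can_fetch("Amazonbot", "https://example.com/products"))  # True
# Any other robot falls back to the wildcard group.
print(parser.can_fetch("SomeOtherBot", "https://example.com/not-allowed/page"))  # False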
Amazonbot attempts to read robots.txt at the host level (e.g., ‘test.amazon.com’); it does not look for the file at the domain level (e.g., ‘amazon.com’). If Amazonbot cannot fetch robots.txt because of IP or user-agent blocking, parsing errors, network timeouts, or a non-successful status code (such as a 3XX, 4XX, or 5XX), it will attempt to refetch robots.txt or use a cached copy from the last 30 days. If both approaches fail, Amazonbot will behave as if robots.txt does not exist and will crawl the site. When robots.txt is accessible, Amazonbot responds to changes in it within 24 hours.
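As a concrete illustration of the host-level lookup, the sketch below derives the robots.txt URL a crawler would fetch for a given page; robots_url and the page URL are hypothetical.

# Minimal sketch: robots.txt lives at the root of the exact host,
# not at the parent domain.
from urllib.parse import urlsplit, urlunsplit

def robots_url(page_url: str) -> str:
    parts = urlsplit(page_url)
    return urlunsplit((parts.scheme, parts.netloc, "/robots.txt", "", ""))

print(robots_url("https://test.amazon.com/some/page"))
# https://test.amazon.com/robots.txt -- amazon.com/robots.txt is not consulted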
Amazonbot supports the link-level rel=nofollow directive. Include it in your HTML like this to keep Amazonbot from following or crawling a particular link on your website:
<a href="signin.php" rel="nofollow">Sign in</a>
...
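To see how a directive-respecting crawler treats such links, the sketch below uses Python's standard-library HTML parser to collect only followable links, skipping any anchor whose rel attribute includes nofollow; the LinkCollector class and the second link are hypothetical.

# Minimal sketch: gather links a crawler may follow, honoring rel="nofollow".
from html.parser import HTMLParser

class LinkCollector(HTMLParser):
    def __init__(self):
        super().__init__()
        self.followable = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        attrs = dict(attrs)
        rel = (attrs.get("rel") or "").lower().split()
        if "nofollow" not in rel and "href" in attrs:
            self.followable.append(attrs["href"])

collector = LinkCollector()
collector.feed('<a href="signin.php" rel="nofollow">Sign in</a>'
               '<a href="/products">Products</a>')
print(collector.followable)  # ['/products']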
Amazonbot supports the noarchive directive, which you can use to signal to Amazon that you do not want your crawled data to be used for training large language models. You can provide this signal at the page level with a meta tag or across content types with an HTTP response header, as described below.
Page-Level Restrictions: To apply the 'noarchive' directive at the page level, add a <meta> tag in the <head> section of your HTML:
<meta name="robots" content="noarchive">
To specifically target Amazonbot, replace name="robots" with name="amazonbot":
<meta name="amazonbot" content="noarchive">
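If you want to confirm that your pages emit this signal, the sketch below scans a document's meta tags for a noarchive directive addressed to all robots or to Amazonbot specifically; NoArchiveChecker and the HTML snippet are hypothetical.

# Minimal sketch: detect a page-level noarchive signal.
from html.parser import HTMLParser

class NoArchiveChecker(HTMLParser):
    def __init__(self):
        super().__init__()
        self.noarchive = False

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attrs = {k: (v or "").lower() for k, v in attrs}
        if attrs.get("name") in ("robots", "amazonbot") and \
                "noarchive" in attrs.get("content", ""):
            self.noarchive = True

checker = NoArchiveChecker()
checker.feed('<head><meta name="amazonbot" content="noarchive"></head>')
print(checker.noarchive)  # True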
Header-Level Restrictions: For broader control across different content types (pages, images, videos, PDFs, etc.), you can use the X-Robots-Tag in the HTTP response headers.
HTTP/1.1 200 OK
Date: Tue, 15 Oct 2024 08:09:00 GMT
X-Robots-Tag: noarchive
To restrict Amazonbot specifically, add “amazonbot” to the tag:
HTTP/1.1 200 OK
Date: Tue, 15 Oct 2024 08:09:00 GMT
X-Robots-Tag: amazonbot: noarchive
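To check what your server actually sends, the sketch below issues a HEAD request and collects any X-Robots-Tag headers that carry a noarchive directive; noarchive_directives and the URL are hypothetical.

# Minimal sketch: read noarchive directives from X-Robots-Tag headers.
import urllib.request

def noarchive_directives(url: str) -> list:
    request = urllib.request.Request(url, method="HEAD")
    with urllib.request.urlopen(request) as response:
        # A response may carry several X-Robots-Tag headers.
        tags = response.headers.get_all("X-Robots-Tag") or []
    return [tag for tag in tags if "noarchive" in tag.lower()]

print(noarchive_directives("https://example.com/report.pdf"))
# e.g. ['noarchive'] or ['amazonbot: noarchive']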
To learn more about partnering with Amazon on Alexa and other virtual assistant use cases, feel free to contact us at pubpartnerships@amazon.com. If you are a content owner, please identify your domain(s), RSS feed links, and/or other references to your content within your message.
Verify that a crawler accessing your server is the official Amazonbot crawler by using reverse and forward DNS lookups. This helps you identify other bots or malicious agents that may access your site while claiming to be Amazonbot.
You can verify Amazonbot with command-line tools in three steps: run a reverse DNS lookup on the accessing IP address, confirm that the returned host name ends in crawl.amazonbot.amazon, then run a forward DNS lookup on that host name and confirm it resolves back to the same IP address. For example:
$ host 12.34.56.789
789.56.34.12.in-addr.arpa domain name pointer 12-34-56-789.crawl.amazonbot.amazon.
$ host 12-34-56-789.crawl.amazonbot.amazon
12-34-56-789.crawl.amazonbot.amazon has address 12.34.56.789
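The same reverse-then-forward check can be automated. The sketch below uses Python's standard library; is_official_amazonbot and the IP address are hypothetical placeholders.

# Minimal sketch: verify that a visitor IP really belongs to Amazonbot.
import socket

def is_official_amazonbot(ip: str) -> bool:
    try:
        # Reverse lookup: IP -> host name.
        hostname, _, _ = socket.gethostbyaddr(ip)
        hostname = hostname.rstrip(".")
        if not hostname.endswith(".crawl.amazonbot.amazon"):
            return False
        # Forward lookup: the host name must resolve back to the same IP.
        return ip in socket.gethostbyname_ex(hostname)[2]
    except (socket.herror, socket.gaierror):  # no PTR record or lookup failure
        return False

print(is_official_amazonbot("12.34.56.78"))  # hypothetical IP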
If you have questions or recommendations on what crawl rate to use, please contact us at amazonbot@amazon.com. If you are a content owner, please always include any relevant domain names in your message.