The Sitemaps protocol allows a webmaster to inform search engines about URLs on a website that are available for crawling. A Sitemap is an XML file that lists the URLs for a site. It allows webmasters to include additional information about each URL: when it was last updated, how often it changes, and how important it is in relation to other URLs on the site. This allows search engines to crawl the site more efficiently and to find URLs that may be isolated from the rest of the site's content. The Sitemaps protocol is a URL inclusion protocol and complements robots.txt, a URL exclusion protocol.
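The per-URL information described above can be sketched with Python's standard library. This is a minimal illustration, not a complete generator: the function name `build_sitemap`, the dictionary layout, and the example.com URLs are assumptions made for the example, and only the `loc`, `lastmod`, `changefreq`, and `priority` elements are handled.

```python
import xml.etree.ElementTree as ET

# Namespace used by Sitemap files in the sitemaps.org 0.9 schema.
SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9"

def build_sitemap(entries):
    """Return a Sitemap XML string for a list of URL entries (dicts)."""
    urlset = ET.Element("urlset", xmlns=SITEMAP_NS)
    for entry in entries:
        url = ET.SubElement(urlset, "url")
        # <loc> is required; the other three child elements are optional.
        for tag in ("loc", "lastmod", "changefreq", "priority"):
            if tag in entry:
                ET.SubElement(url, tag).text = str(entry[tag])
    return ET.tostring(urlset, encoding="unicode")

# Hypothetical entries: one with all four fields, one with only a location.
sitemap_xml = build_sitemap([
    {"loc": "http://www.example.com/", "lastmod": "2024-01-15",
     "changefreq": "monthly", "priority": 0.8},
    {"loc": "http://www.example.com/about"},
])
print(sitemap_xml)
```

Each `url` element carries one location; the optional fields are hints to the crawler rather than instructions it must follow.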
Sitemaps are particularly beneficial on websites where some areas are not available through the browsable interface.
The Sitemap XML protocol is also extended to provide a way of listing multiple Sitemaps in a 'Sitemap index' file. Because a single Sitemap is limited to 50 MiB (uncompressed) or 50,000 URLs, large sites must split their URLs across several Sitemaps and list those files in an index.
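As a rough sketch of how the 50,000-URL limit leads to multiple files, the following splits a URL list into chunks that each fit in one Sitemap. The helper name `split_into_sitemaps` and the 120,000-page site are hypothetical, and the sketch checks only the URL count, not the 50 MiB size limit.

```python
MAX_URLS_PER_SITEMAP = 50_000  # per-file URL limit stated above

def split_into_sitemaps(urls, limit=MAX_URLS_PER_SITEMAP):
    """Split a list of URLs into consecutive chunks of at most `limit` items."""
    return [urls[i:i + limit] for i in range(0, len(urls), limit)]

# Hypothetical site with 120,000 pages.
urls = [f"http://www.example.com/page/{n}" for n in range(120_000)]
chunks = split_into_sitemaps(urls)
print(len(chunks), [len(c) for c in chunks])  # 3 [50000, 50000, 20000]
```

Each chunk would be written to its own Sitemap file, and those files would then be listed in a Sitemap index.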
An example of a Sitemap index referencing one separate Sitemap follows.
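One way to produce such an index is sketched below with Python's standard library. The function name `build_sitemap_index` and the example.com file location are assumptions for illustration; the element names (`sitemapindex`, `sitemap`, `loc`, `lastmod`) follow the sitemaps.org 0.9 schema.

```python
import xml.etree.ElementTree as ET

# Namespace used by Sitemap index files in the sitemaps.org 0.9 schema.
SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9"

def build_sitemap_index(sitemaps):
    """Return a Sitemap index XML string from (location, lastmod) pairs."""
    index = ET.Element("sitemapindex", xmlns=SITEMAP_NS)
    for loc, lastmod in sitemaps:
        sitemap = ET.SubElement(index, "sitemap")
        ET.SubElement(sitemap, "loc").text = loc
        if lastmod is not None:  # <lastmod> is optional
            ET.SubElement(sitemap, "lastmod").text = lastmod
    return ET.tostring(index, encoding="unicode")

# Hypothetical index that references a single Sitemap file.
index_xml = build_sitemap_index([
    ("http://www.example.com/sitemap1.xml.gz", "2024-01-15"),
])
print(index_xml)
```

The printed document has a `sitemapindex` root containing one `sitemap` entry, which in turn holds the location and last-modified date of the referenced file.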
If Sitemaps are submitted directly to a search engine (pinged), the engine returns status information and any processing errors. The details of submission vary between search engines. The location of the sitemap can also be included in the robots.txt file by adding the following line:
Sitemap: <sitemap_location>
The <sitemap_location> should be the complete URL to the sitemap, including the scheme and host name.
This directive is independent of the user-agent line, so it can be placed anywhere in the file. If the website has several sitemaps, multiple "Sitemap:" records may be included in robots.txt, or the URL can simply point to the main sitemap index file.
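To illustrate that "Sitemap:" records are read independently of any user-agent group, here is a minimal sketch that collects them from a robots.txt body. The function name `find_sitemaps` and the example.com URLs are hypothetical, and matching the field name case-insensitively is an assumption of this sketch rather than something stated above.

```python
def find_sitemaps(robots_txt):
    """Return the URLs from every "Sitemap:" record in a robots.txt body."""
    locations = []
    for line in robots_txt.splitlines():
        # Split at the first colon only, so the colon in "http://" is kept.
        field, sep, value = line.partition(":")
        if sep and field.strip().lower() == "sitemap":
            locations.append(value.strip())
    return locations

# Hypothetical robots.txt with records before and after a user-agent group.
robots_txt = """\
Sitemap: http://www.example.com/sitemap-pages.xml
User-agent: *
Disallow: /private/
Sitemap: http://www.example.com/sitemap-news.xml
"""
print(find_sitemaps(robots_txt))
```

Both records are found regardless of where they sit relative to the `User-agent` line.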