A site map (or sitemap) is a list of pages of a web site.
There are three primary kinds of site map:
Site maps used during the planning of a Web site by its designers.
Human-visible listings, typically hierarchical, of the pages on a site.
Structured listings intended for web crawlers such as search engines.
Sitemaps may be addressed to users or to software. Many sites have user-visible sitemaps which present a systematic view, typically hierarchical, of the site. These are intended to help visitors find specific pages, and can also be used by crawlers. Alphabetically organized site maps, sometimes called site indexes, are a different approach.
They also act as a navigation aid by providing an overview of a site’s content at a single glance.
Google introduced the Sitemaps protocol so that web developers can publish lists of links from across their sites. The basic premise is that some sites have a large number of dynamic pages that are only available through the use of forms and user entries. Sitemap files contain URLs to these pages so that web crawlers can find them. Bing, Google, Yahoo and Ask now jointly support the Sitemaps protocol.
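As an illustration of what such a file contains, the following minimal sketch builds a Sitemap document with Python's standard library. The function name build_sitemap and the example URLs are hypothetical; the urlset, url, loc and lastmod elements and the namespace come from the Sitemaps protocol. A real site would typically generate the list of pages from its database or routing table.

```python
import xml.etree.ElementTree as ET

# Namespace defined by the Sitemaps protocol (sitemaps.org).
SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9"


def build_sitemap(pages):
    """Return a Sitemap XML string for (url, last_modified_or_None) pairs.

    build_sitemap is a hypothetical helper written for this example.
    """
    urlset = ET.Element("urlset", xmlns=SITEMAP_NS)
    for loc, lastmod in pages:
        url = ET.SubElement(urlset, "url")
        ET.SubElement(url, "loc").text = loc
        if lastmod:
            # lastmod is optional in the protocol; emit it only when known.
            ET.SubElement(url, "lastmod").text = lastmod
    return ET.tostring(urlset, encoding="unicode")


# Example pages (assumed for illustration), including a dynamic page
# that a crawler might not reach by following links.
pages = [
    ("https://www.example.com/", "2024-01-15"),
    ("https://www.example.com/catalog?item=12", None),
]
sitemap_xml = build_sitemap(pages)
print(sitemap_xml)
```

The resulting string would normally be saved as a file such as sitemap.xml and served from the site it describes.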
Because the major search engines use the same protocol, a single Sitemap gives all of them up-to-date page information. Sitemaps do not guarantee that all links will be crawled, and being crawled does not guarantee indexing. Google Webmaster Tools allows a website owner to submit a sitemap for Google to crawl; alternatively, the owner can point crawlers to the sitemap from the robots.txt file.
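The robots.txt route relies on a Sitemap line, which the Sitemaps protocol defines as independent of any user-agent group. The sketch below shows how a crawler written in Python might read that line with the standard urllib.robotparser module (the site_maps method requires Python 3.8 or later). The robots.txt content and the sitemap URL are assumptions made up for the example.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content: no crawl restrictions, one Sitemap line.
robots_txt = """\
User-agent: *
Disallow:

Sitemap: https://www.example.com/sitemap.xml
"""

parser = RobotFileParser()
# parse() takes the file's lines, so no network access is needed here.
parser.parse(robots_txt.splitlines())

# site_maps() returns the listed Sitemap URLs, or None if there are none.
sitemaps = parser.site_maps()
print(sitemaps)
```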
XML Sitemaps have replaced the older method of “submitting to search engines” by filling out a form on the search engine’s submission page. Now web developers submit a Sitemap directly, or wait for search engines to find it. Regularly submitting an updated sitemap when new pages are published may allow search engines to find and index those pages more quickly than they would by discovering the pages on their own.
Robots exclusion standard
The robots exclusion standard, also known as the robots exclusion protocol or simply robots.txt, is a standard used by websites to communicate with web crawlers and other web robots. The standard specifies how to inform the web robot about which areas of the website should not be processed or scanned. Robots are often used by search engines to categorize websites. Not all robots cooperate with the standard; email harvesters, spambots, malware and robots that scan for security vulnerabilities may even start with the portions of the website they have been told to stay out of. The standard can be used in conjunction with Sitemaps, a robot inclusion standard for websites.
When a site owner wishes to give instructions to web robots they place a text file called robots.txt in the root of the web site hierarchy (e.g. https://www.example.com/robots.txt). This text file contains the instructions in a specific format (see examples below). Robots that choose to follow the instructions try to fetch this file and read the instructions before fetching any other file from the website. If this file doesn’t exist, web robots assume that the website owner does not wish to place any limitations on crawling the entire site.
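A cooperating robot's check can be sketched with Python's standard urllib.robotparser module. The rules, the crawler names SomeCrawler and ExampleBot, and the URLs below are invented for illustration; a real crawler would fetch the file from the site (for example with set_url and read) rather than parse an inline string.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt: everyone is asked to avoid /private/,
# and a crawler calling itself ExampleBot is asked to avoid the whole site.
robots_txt = """\
User-agent: *
Disallow: /private/

User-agent: ExampleBot
Disallow: /
"""

rules = RobotFileParser()
rules.parse(robots_txt.splitlines())

# A well-behaved robot asks before fetching each URL.
public_ok = rules.can_fetch("SomeCrawler", "https://www.example.com/index.html")
private_ok = rules.can_fetch("SomeCrawler", "https://www.example.com/private/a.html")
bot_ok = rules.can_fetch("ExampleBot", "https://www.example.com/index.html")

print(public_ok)   # True: no rule covers /index.html for this crawler
print(private_ok)  # False: /private/ is disallowed for all robots
print(bot_ok)      # False: ExampleBot is disallowed everywhere
```

Nothing in the file enforces these answers; the check only has an effect when the robot chooses to perform it and to respect the result.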
A robots.txt file on a website will function as a request that specified robots ignore specified files or directories when crawling a site. This might be, for example, out of a preference for privacy from search engine results, or the belief that the content of the selected directories might be misleading or irrelevant to the categorization of the site as a whole, or out of a desire that an application only operates on certain data. Links to pages listed in robots.txt can still appear in search results if they are linked to from a page that is crawled.
A robots.txt file covers one origin. For websites with multiple subdomains, each subdomain must have its own robots.txt file. If example.com had a robots.txt file but a.example.com did not, the rules that would apply for example.com would not apply to a.example.com. In addition, each protocol and port needs its own robots.txt file; http://example.com/robots.txt does not apply to pages under http://example.com:8080/ or https://example.com/.
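The per-origin rule can be made concrete by computing which robots.txt file governs a given page. The helper robots_url below is a hypothetical name, and the sketch is deliberately simple: it keeps the scheme, host and port exactly as written and does not normalize default ports or strip credentials from the URL.

```python
from urllib.parse import urlsplit, urlunsplit


def robots_url(page_url):
    """Return the robots.txt URL for the origin of page_url.

    The origin is the scheme plus the network location (host and,
    if present, port); path, query and fragment are discarded.
    """
    parts = urlsplit(page_url)
    return urlunsplit((parts.scheme, parts.netloc, "/robots.txt", "", ""))


# Pages that look related but belong to four different origins.
examples = [
    "http://example.com/a/b.html",
    "http://example.com:8080/a/b.html",
    "https://example.com/a/b.html",
    "http://a.example.com/index.html",
]
for page in examples:
    print(page, "->", robots_url(page))
```

Each of the four pages maps to a different robots.txt URL, which is why rules published for one origin say nothing about the others.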
Major search engines that follow this standard include Ask, AOL, Baidu, DuckDuckGo, Google, Yahoo!, and Yandex. Bing is still not fully compatible with the standard, as it cannot inherit settings from the wildcard (*).
The volunteer group Archive Team explicitly ignores robots.txt for the most part, viewing it as an obsolete standard that hinders web archival efforts. According to project leader Jason Scott, “unchecked, and left alone, the robots.txt file ensures no mirroring or reference for items that may have general use and meaning beyond the website’s context.”
For some years, the Internet Archive did not crawl sites with robots.txt, but in April 2017 it announced that it would no longer honour directives in robots.txt files: “Over time we have observed that the robots.txt files that are geared toward search engine crawlers do not necessarily serve our archival purposes.” This was in response to entire domains being tagged with robots.txt when the content became obsolete.