Friday, July 12, 2013

XML Sitemaps

Google, Yahoo!, and Microsoft all support a protocol known as XML Sitemaps. Google first
announced it in 2005, and Yahoo! and Microsoft agreed to support the protocol in 2006.
Using the Sitemaps protocol, you can supply the search engines with a list of all the URLs you
would like them to crawl and index.
Adding a URL to a Sitemap file does not guarantee that the URL will be crawled or indexed.
However, it can result in pages that are not otherwise discovered or indexed by the search
engine getting crawled and indexed. In addition, Sitemaps appear to help pages that have been
relegated to Google’s supplemental index make their way into the main index.
The Sitemaps protocol is a complement to, not a replacement for, the search engines’ normal, link-based
crawl. The benefits of Sitemaps include the following:
• For the pages the search engines already know about through their regular spidering, they
use the metadata you supply, such as the last date the content was modified (lastmod
date) and the frequency at which the page is changed (changefreq), to improve how they
crawl your site.
• For the pages they don’t know about, they use the additional URLs you supply to increase
their crawl coverage.
• For URLs that may have duplicates, the engines can use the XML Sitemaps data to help
choose a canonical version.
• Verification/registration of XML Sitemaps may indicate positive trust/authority signals.
• The crawling/inclusion benefits of Sitemaps may have second-order positive effects, such
as improved rankings or greater internal link popularity.
The Google engineer who in online forums goes by GoogleGuy (a.k.a. Matt Cutts, the head of
Google’s webspam team) has explained Google Sitemaps in the following way:
Imagine if you have pages A, B, and C on your site. We find pages A and B through our normal
web crawl of your links. Then you build a Sitemap and list the pages B and C. Now there’s a
chance (but not a promise) that we’ll crawl page C. We won’t drop page A just because you
didn’t list it in your Sitemap. And just because you listed a page that we didn’t know about
doesn’t guarantee that we’ll crawl it. But if for some reason we didn’t see any links to C, or
maybe we knew about page C but the URL was rejected for having too many parameters or
some other reason, now there’s a chance that we’ll crawl that page C.
Sitemaps use a simple XML format that you can learn about at http://www.sitemaps.org. XML
Sitemaps are a useful and in some cases essential tool for your website. In particular, if you
have reason to believe that the site is not fully indexed, an XML Sitemap can help you increase
the number of indexed pages. As sites grow in size, the value of XML Sitemap files tends to
increase dramatically, as additional traffic flows to the newly included URLs.

Layout of an XML Sitemap
The first step in creating an XML Sitemap is to create an .xml Sitemap file in a
suitable format. Since creating an XML Sitemap requires a certain level of technical know-how,
it is wise to involve your development team in the Sitemap generation process
from the beginning. Figure 6-2 shows an example of some code from a Sitemap.
FIGURE 6-2. Sample XML Sitemap from Google.com
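As a rough illustration of the format, the following Python sketch builds a small Sitemap containing the loc, lastmod, and changefreq fields mentioned earlier. The function name build_sitemap, the entry dictionaries, and the example.com URLs are hypothetical and chosen only for this example; the namespace is the one published at sitemaps.org. It is a minimal sketch, not the output of any particular generator tool.

```python
import xml.etree.ElementTree as ET

# Namespace defined by the Sitemaps protocol (sitemaps.org).
SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9"


def build_sitemap(entries):
    """Return Sitemap XML as a string.

    `entries` is a list of dicts (a hypothetical structure for this sketch),
    each with a required 'loc' key and optional 'lastmod' and 'changefreq' keys.
    """
    urlset = ET.Element("urlset", xmlns=SITEMAP_NS)
    for entry in entries:
        url = ET.SubElement(urlset, "url")
        ET.SubElement(url, "loc").text = entry["loc"]
        # Optional metadata the engines may use to tune their crawl.
        for tag in ("lastmod", "changefreq"):
            if tag in entry:
                ET.SubElement(url, tag).text = entry[tag]
    body = ET.tostring(urlset, encoding="unicode")
    return '<?xml version="1.0" encoding="UTF-8"?>\n' + body


# Example URLs are placeholders, not real pages.
example_entries = [
    {"loc": "http://www.example.com/", "lastmod": "2013-07-12", "changefreq": "daily"},
    {"loc": "http://www.example.com/about?x=1&y=2"},
]
sitemap_xml = build_sitemap(example_entries)
print(sitemap_xml)
```

ElementTree escapes special characters such as & in URLs automatically, which the XML format requires.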
To create your XML Sitemap, you can use the following:
An XML Sitemap generator
This is a simple script that you can configure to automatically create Sitemaps, and
sometimes submit them as well. Sitemap generators can create these Sitemaps from a URL
list, access logs, or a directory path hosting static files corresponding to URLs. Here are
some examples of XML Sitemap generators:
• SourceForge.net’s google-sitemap_gen
• ROR Sitemap Generator
• XML-Sitemaps.com Sitemap Generator
• Sitemaps Pal
• XML Echo
Simple text
You can provide Google with a simple text file that contains one URL per line. However,
Google recommends that once you have a text Sitemap file for your site, you use the
Sitemap Generator to create a Sitemap from this text file using the Sitemaps protocol.
Syndication feed
Google accepts Really Simple Syndication (RSS) 2.0 and Atom 1.0 feeds. Note that the
feed may provide information on recent URLs only.
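For the simple text option, converting a one-URL-per-line list into the Sitemaps protocol format is mostly mechanical. The sketch below is an illustration of that idea only, not Google's Sitemap Generator; the function name text_lines_to_sitemap and the sample URLs are hypothetical. It assumes the input is an iterable of lines, such as an open text file, and it skips blank lines and repeated URLs.

```python
from xml.sax.saxutils import escape


def text_lines_to_sitemap(lines):
    """Convert an iterable of lines (one URL per line) into Sitemap XML.

    Blank lines and duplicate URLs are skipped; this is a design choice
    for the sketch, not a requirement stated by the protocol text above.
    """
    seen = set()
    parts = [
        '<?xml version="1.0" encoding="UTF-8"?>',
        '<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">',
    ]
    for line in lines:
        url = line.strip()
        if not url or url in seen:
            continue
        seen.add(url)
        # escape() converts &, <, and > into XML entities.
        parts.append("  <url><loc>%s</loc></url>" % escape(url))
    parts.append("</urlset>")
    return "\n".join(parts)


# Placeholder input standing in for the contents of a text Sitemap file.
sample_lines = [
    "http://www.example.com/\n",
    "\n",
    "http://www.example.com/page?a=1&b=2\n",
    "http://www.example.com/\n",
]
converted = text_lines_to_sitemap(sample_lines)
print(converted)
```

In practice the same function could be called with an open file object, for example `text_lines_to_sitemap(open("urls.txt"))`, where urls.txt is a hypothetical file name.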
