XML Sitemaps
Google, Yahoo!, and Microsoft all support a protocol known as XML
Sitemaps. Google first
announced it in 2005, and then Yahoo! and Microsoft agreed to
support the protocol in 2006.
Using the Sitemaps protocol, you can supply the search engines with
a list of all the URLs you
would like them to crawl and index.
Adding a URL to a Sitemap file does not guarantee that a URL will
be crawled or indexed.
However, it can result in pages that are not otherwise discovered
or indexed by the search
engine getting crawled and indexed. In addition, Sitemaps appear
to help pages that have been
relegated to Google’s supplemental index make their way into the
main index.
This program is a complement to, not a replacement for, the search
engines’ normal, link-based
crawl. The benefits of Sitemaps include the following:
• For the pages the search engines already know about through
their regular spidering, they
use the metadata you supply, such as the last date the content was
modified (lastmod date) and the frequency at which the page is
changed (changefreq), to improve how they crawl your site.
• For the pages they don’t know about, they use the additional
URLs you supply to increase
their crawl coverage.
• For URLs that may have duplicates, the engines can use the XML
Sitemaps data to help
choose a canonical version.
• Verification/registration of XML Sitemaps may indicate positive
trust/authority signals.
• The crawling/inclusion benefits of Sitemaps may have
second-order positive effects, such
as improved rankings or greater internal link popularity.
The Google engineer who goes by GoogleGuy in online forums (a.k.a.
Matt Cutts, the head of Google’s webspam team) has explained Google
Sitemaps in the following way:
Imagine if you have pages A, B, and C on your site. We find pages
A and B through our normal
web crawl of your links. Then you build a Sitemap and list the
pages B and C. Now there’s a
chance (but not a promise) that we’ll crawl page C. We won’t drop
page A just because you
didn’t list it in your Sitemap. And just because you listed a page
that we didn’t know about
doesn’t guarantee that we’ll crawl it. But if for some reason we
didn’t see any links to C, or
maybe we knew about page C but the URL was rejected for having too
many parameters or
some other reason, now there’s a chance that we’ll crawl that page
C.
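The A, B, and C example in this quote comes down to simple set arithmetic. The sketch below is only an illustration of that reasoning, not a description of how Google actually schedules its crawl; the variable names are hypothetical.

```python
# Pages found through the normal link-based crawl (from the quote).
found_by_crawl = {"A", "B"}

# Pages you listed in your Sitemap (from the quote).
listed_in_sitemap = {"B", "C"}

# Listing a page never removes one the engine already knows about,
# so the pool of crawl candidates is the union of both sets.
crawl_candidates = found_by_crawl | listed_in_sitemap

# Pages that only the Sitemap revealed. These now have a chance
# (but not a promise) of being crawled.
new_from_sitemap = listed_in_sitemap - found_by_crawl

print(sorted(crawl_candidates))  # ['A', 'B', 'C']
print(sorted(new_from_sitemap))  # ['C']
```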
Sitemaps use a simple XML format that you can learn about at http://www.sitemaps.org. XML
Sitemaps are a useful and in some cases essential tool for your
website. In particular, if you
have reason to believe that the site is not fully indexed, an XML
Sitemap can help you increase
the number of indexed pages. As sites grow in size, the value of
XML Sitemap files tends to increase dramatically, as additional
traffic flows to the newly included URLs.
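To give a feel for the format, here is a minimal sketch in Python that builds a small Sitemap containing the loc, lastmod, and changefreq elements mentioned earlier. The example.com URLs, the dates, and the build_sitemap helper are assumptions made up for illustration; consult http://www.sitemaps.org for the authoritative definition of the format.

```python
import xml.etree.ElementTree as ET

# Namespace defined by the Sitemaps protocol at sitemaps.org.
SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9"


def build_sitemap(entries):
    """Return a Sitemap XML string for a list of entry dicts.

    Each entry needs a "loc" key; "lastmod" and "changefreq" are optional.
    (Hypothetical helper, not part of any official tool.)
    """
    urlset = ET.Element("urlset", xmlns=SITEMAP_NS)
    for entry in entries:
        url = ET.SubElement(urlset, "url")
        for tag in ("loc", "lastmod", "changefreq"):
            if tag in entry:
                ET.SubElement(url, tag).text = entry[tag]
    return ET.tostring(urlset, encoding="unicode")


# Hypothetical URLs and metadata, for illustration only.
sitemap_xml = build_sitemap([
    {"loc": "http://www.example.com/",
     "lastmod": "2009-01-01",
     "changefreq": "monthly"},
    {"loc": "http://www.example.com/catalog?item=12&desc=vacation"},
])
print(sitemap_xml)
```

Note that the library escapes the ampersand in the second URL as &amp;amp; which the XML format requires.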
Layout of an XML Sitemap
The first step in the process of creating an XML Sitemap is to
create an .xml Sitemap file in a
suitable format. Since creating an XML Sitemap requires a certain
level of technical know-how,
it would be wise to involve your development team in the XML
Sitemap generation process from the beginning. Figure 6-2 shows an
example of some code from a Sitemap.
FIGURE 6-2. Sample XML Sitemap from Google.com
To create your XML Sitemap, you can use the following:
An XML Sitemap generator
This is a simple script that you can configure to automatically
create Sitemaps, and
sometimes submit them as well. Sitemap generators can create these
Sitemaps from a URL
list, access logs, or a directory path hosting static files
corresponding to URLs. Here are
some examples of XML Sitemap generators:
• SourceForge.net’s google-sitemap_gen
• ROR Sitemap Generator
• XML-Sitemaps.com Sitemap Generator
• Sitemaps Pal
• XML Echo
Simple text
You can provide Google with a simple text file that contains one URL
per line. However,
Google recommends that once you have a text Sitemap file for your
site, you use the
Sitemap Generator to create a Sitemap from this text file using
the Sitemaps protocol.
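If you want to see what such a conversion involves, the following is a rough sketch in Python that turns a one-URL-per-line text list into a Sitemaps-protocol file. It is not Google's Sitemap Generator; the function names and the example.com URLs are hypothetical, and it emits only the required loc element for each URL.

```python
from xml.sax.saxutils import escape


def urls_from_text(text):
    """Return the URLs in a one-per-line text list.

    Blank lines are skipped and duplicates are dropped, keeping
    the first occurrence of each URL in its original position.
    """
    seen = set()
    urls = []
    for line in text.splitlines():
        url = line.strip()
        if url and url not in seen:
            seen.add(url)
            urls.append(url)
    return urls


def text_to_sitemap(text):
    """Build a Sitemap XML string from a one-URL-per-line text list."""
    lines = [
        '<?xml version="1.0" encoding="UTF-8"?>',
        '<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">',
    ]
    for url in urls_from_text(text):
        # escape() converts &, < and > so the URL is valid inside XML.
        lines.append("  <url><loc>%s</loc></url>" % escape(url))
    lines.append("</urlset>")
    return "\n".join(lines) + "\n"


# Hypothetical text Sitemap: one URL per line, with a blank line
# and a duplicate to show how they are handled.
url_list = """http://www.example.com/
http://www.example.com/about

http://www.example.com/search?q=a&page=2
http://www.example.com/
"""
print(text_to_sitemap(url_list))
```

In practice you would read the list from your text Sitemap file and write the result to an .xml file on your server.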
Syndication feed
Google accepts Really Simple Syndication (RSS) 2.0 and Atom 1.0
feeds. Note that the feed may provide information on recent URLs
only.
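To check which URLs a feed actually exposes, you can list its item links. The sketch below assumes a plain RSS 2.0 feed with the usual channel/item/link layout; the sample feed and the urls_from_rss helper are made up for illustration.

```python
import xml.etree.ElementTree as ET


def urls_from_rss(rss_text):
    """Return the link URL of every item in an RSS 2.0 feed string."""
    root = ET.fromstring(rss_text)
    return [
        link.text.strip()
        for link in root.findall("./channel/item/link")
        if link.text
    ]


# Hypothetical feed. It carries only the two most recent posts,
# so older pages on the site would not be discovered through it.
sample_feed = """<rss version="2.0">
  <channel>
    <title>Example site</title>
    <link>http://www.example.com/</link>
    <description>Hypothetical feed</description>
    <item>
      <title>Newest post</title>
      <link>http://www.example.com/posts/3</link>
    </item>
    <item>
      <title>Older post</title>
      <link>http://www.example.com/posts/2</link>
    </item>
  </channel>
</rss>"""

print(urls_from_rss(sample_feed))
```

Comparing this list against your full set of URLs shows quickly how much of the site the feed leaves out.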