Header image (Atlantic Cod)
Ode is simple (Simple means that you know how it works).

Message

Hi and welcome to Ode. Glad you stopped by.

Important: Ode is very new. I'm adding content here just as quickly as I can. If something referred to in the weblog seems to be missing, please be a little patient.

I am working on it every day and will be until this site is completely up.

-Rob

News

Fri, 15 Feb 2008

Fresh Today: Sitemaps

Sitemaps are important, especially for a dynamic site (at least Google, Yahoo, and Microsoft seem to think so). Even though Ode uses a simple URL/addressing scheme, left on its own there are still some issues related to indexing that need to be cleaned up.

How does a search engine operation like Google find and index your site?

Here's a very short answer that ignores most of what's involved, including issues of relevance and rank:

The job of a search engine is typically described in three parts:

  1. The crawl
  2. The index
  3. The runtime query processor

Of the three, only the crawler is relevant to this discussion.

A crawler (also sometimes called a spider) is nothing but a software program which, when pointed at a page on the web, moves from link to link, recording all of the pages it finds so that they can later be indexed (maybe; not every page that is crawled gets indexed).

So essentially this is it:

  • The crawler makes a request, in a way that's very similar to what you do when you request a page in a web browser.
  • It hands the page returned off to be indexed
  • and it makes note of any links so that they can be requested and examined for links, and so on.

Links lead to links, lead to links etc.
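Here's a rough sketch of that loop in Python. It isn't how any real search engine's crawler is written, and it doesn't touch the network: the `SITE` dictionary is a made-up site that maps each page to the links found on it, and `crawl` and `fetch_links` are names I've invented for illustration.

```python
from collections import deque

# Hypothetical site: each page maps to the links that appear on it.
SITE = {
    "/": ["/about", "/2008/02"],
    "/about": ["/"],
    "/2008/02": ["/2008/02/sitemaps", "/"],
    "/2008/02/sitemaps": ["/2008/02"],
    "/orphan": ["/"],  # exists, but no other page links to it
}


def crawl(start, fetch_links):
    """Follow links from `start` and return the pages in the order found."""
    seen = {start}
    queue = deque([start])
    found = []
    while queue:
        page = queue.popleft()
        found.append(page)              # hand the page off to be indexed
        for link in fetch_links(page):  # make note of any links on it
            if link not in seen:        # request each new link only once
                seen.add(link)
                queue.append(link)
    return found


pages = crawl("/", lambda page: SITE.get(page, []))
print(pages)
```

Starting from the home page, the sketch reaches every page that something links to. The `/orphan` page is never found, because the only way in is by following a link.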

In theory, the crawler can follow links from your home page to every other page on your site. In practice this is rarely the case, especially with dynamic sites.

Some pages may not be referenced at all, or more likely, pages will be connected in complex ways, with the same page(s) referenced multiple times, often by different names, and at apparently different locations.

This is especially problematic for dynamic sites because it is usually not the case that there is a one-to-one correspondence between name, location, and page.

There are other issues as well, for example, dynamic sites may generate complex parameterized addresses that search engines simply avoid altogether.

This can lead to a couple of less than ideal outcomes:

  1. Your site may not be discovered at all, or only incompletely discovered.
  2. The site may be nearly fully indexed, but in a confusing way, so that search results which do lead to your site don't lead to the correct page, or break altogether.

Again, the nature of a dynamic site is that content which is in one place on a given request may be someplace else, or gone entirely, the next day, or even on the very next request.

A concrete example of this is the homepage for a weblog.

Typically a weblog is essentially a collection of short posts listed in reverse chronological order.

Each new post added to the site starts out at the top of the list, which usually places it at the top of the homepage.

Because a site may have hundreds, thousands, or even hundreds of thousands of posts, they can't all be displayed on one page (downloading and navigating that page would be exceedingly time consuming and tedious). So, only a relatively small number of posts are displayed on the homepage.

Each new post pushes the others down a place, until eventually they disappear from the homepage entirely (e.g. if the 50 most recent posts are displayed on the homepage then as soon as you reach 50 posts total, every new post you make will push one off the page).
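To make the push-down effect concrete, here's a small Python sketch. It is only an illustration, not how Ode builds its homepage: `HOMEPAGE_SIZE`, `homepage`, and `publish` are names I've made up, and the post titles are placeholders.

```python
from collections import deque

HOMEPAGE_SIZE = 50  # the figure used in the example above

# Newest post sits at the left (top of the page). Once the deque is full,
# adding at the left silently drops the oldest post from the right.
homepage = deque(maxlen=HOMEPAGE_SIZE)


def publish(post):
    """Put a new post at the top of the homepage."""
    homepage.appendleft(post)


# Publish 51 posts: one more than the homepage can hold.
for n in range(1, 52):
    publish(f"post-{n}")

print(homepage[0])            # the newest post, at the top
print(homepage[-1])           # the oldest post still on the homepage
print("post-1" in homepage)   # has the very first post been pushed off?
```

After the 51st post, the first one is no longer on the homepage, even though a search engine that indexed the page earlier may still think it is.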

If a search engine indexes your homepage, then someone who queries that engine may get a result that includes information that was once there but no longer is. That same person may click the link, arrive at your homepage, and be lost when the content they were looking for is nowhere to be seen.

Sitemaps are an attempt to avoid many of these problems.

They work like this:

You, as the maintainer, list all of the pages on your site in a single file (or a collection of files) in a format that is easy for a search engine's crawler to digest.

When the crawler finds your site, it reads the sitemap file(s) rather than, or in addition to, working through the links on the site directly.

Assuming the sitemap includes all of the pages on your site, and each page only once, you have made it easier for the crawler to find all of the pages you ultimately want included in the search engine's index.
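For a sense of what such a file looks like, here's a minimal Python sketch that builds one in the XML format described by the sitemaps.org protocol (a `urlset` element containing one `url`/`loc` entry per page). This is not the code Ode uses; `build_sitemap` is a name I've invented, and the example.com addresses are placeholders.

```python
import xml.etree.ElementTree as ET

# Namespace defined by the sitemaps.org protocol.
SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9"


def build_sitemap(urls):
    """Return sitemap XML as a string, listing each URL exactly once."""
    urlset = ET.Element("urlset", xmlns=SITEMAP_NS)
    for url in sorted(set(urls)):  # set() drops duplicates: each page only once
        entry = ET.SubElement(urlset, "url")
        ET.SubElement(entry, "loc").text = url
    return ET.tostring(urlset, encoding="unicode")


xml_text = build_sitemap([
    "http://example.com/",
    "http://example.com/2008/02/sitemaps",
    "http://example.com/",  # the same page listed twice; only one entry is kept
])
print(xml_text)
```

The protocol also allows optional fields per page, such as a last-modified date, which this sketch leaves out to keep things short.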

The trouble of course is in generating the sitemaps.

Each time you add or remove a page you need to update the file(s) to reflect that change.

There are several tools available online that will allow you to create a static sitemap once (or more than once by repeatedly running the tool), but this isn't a solution to the problem.

Maintaining a sitemap manually, even for very small sites, is tiresome at best and most likely unworkable.

With the help of the sitemapper addin, Ode takes care of sitemaps for you.

  • It generates the initial sitemap file and then (by default) updates it every 24 hours automatically.
  • If you should accidentally (or intentionally) delete the sitemap file, Ode will recreate it for you.
  • You're free to rebuild the sitemap more frequently if you like.
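The first two points boil down to one question: is the sitemap file missing, or older than the refresh interval? Here's a sketch of that check in Python. To be clear, this is my own illustration and not the sitemapper addin's actual code; `needs_rebuild` and `MAX_AGE_SECONDS` are hypothetical names, and using the file's modification time is an assumption about how such a check might be done.

```python
import os
import time

MAX_AGE_SECONDS = 24 * 60 * 60  # the 24-hour default mentioned above


def needs_rebuild(path, now=None, max_age=MAX_AGE_SECONDS):
    """Return True if the sitemap file is missing or at least max_age seconds old."""
    if not os.path.exists(path):
        return True  # deleted (accidentally or not): recreate it
    now = time.time() if now is None else now
    return now - os.path.getmtime(path) >= max_age
```

Passing a smaller `max_age` is one way to picture rebuilding the sitemap more frequently.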

Simple sitemaps. Now I feel better.