The Infamous Canonical URL Issue
Posted on January 18, 2007
Difficult as it may be to believe, as of January 2007 Google still cannot recognize that URLs which obviously lead to the same page are in fact the same page. So what’s a URL, and what’s the problem here?
URL (pronounced you-are-ell, or sometimes “earl” as in Duke of) stands for Uniform Resource Locator. It’s the technical name for the address of a particular web page. For example, the URL of this site’s home page is
http://www.tropicalwebworks.org, and the URL of this page is http://www.tropicalwebworks.org/2007/01/18/infamous-canonical-url/.
It’s common for a web page to be reachable at multiple URLs. If this site were not configured optimally, the home page might be reachable at both http://www.tropicalwebworks.org and http://tropicalwebworks.org (notice the missing “www.”). Normal people would logically think this is desirable: after all, you don’t want people to get a “server not found” error if they try to reach your site without including the www part.
But Google sees these as two completely separate URLs that just happen to contain exactly the same content. There are two problems with such a situation:
- First, the “strength” of that page, and its ability to turn up in the search engine results, is diluted. Some of the page’s strength is allotted to one version, and some to the other, and neither “page” performs as well as it would if all the strength were concentrated in one page.
- And second, Google attempts to filter out pages containing duplicate content, based on the reasonable logic that people don’t want to see multiple results in their searches for the exact same thing. Thus, since both of these “pages” contain the exact same content, one of them will suffer in searches due to the duplicate content filter.
- http://www.example.com
- http://example.com
- http://www.example.com/index.html
- http://example.com/index.html
http://www.example.com/subdirectory/, again leaving out the actual filename index.html.
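To make the idea concrete, here is a minimal Python sketch of how the URL variants listed above could be collapsed into a single form. It is only an illustration: the function name `canonicalize` is hypothetical, and it assumes that the www version is the preferred host and that `index.html` is the server's directory index filename. It also ignores edge cases such as URLs with embedded usernames.

```python
from urllib.parse import urlsplit, urlunsplit

# Assumption: "index.html" is the directory index filename on this server.
INDEX_FILENAMES = ("index.html",)


def canonicalize(url: str) -> str:
    """Map common variants of a URL onto one www, filename-free form."""
    parts = urlsplit(url)

    # Host names are case-insensitive; prefer the www version (assumption).
    host = parts.netloc.lower()
    if not host.startswith("www."):
        host = "www." + host

    # An empty path and "/" refer to the same page.
    path = parts.path or "/"

    # Drop a trailing directory index filename, keeping the final slash.
    for name in INDEX_FILENAMES:
        if path.endswith("/" + name):
            path = path[: -len(name)]
            break

    return urlunsplit((parts.scheme, host, path, parts.query, parts.fragment))


variants = [
    "http://www.example.com",
    "http://example.com",
    "http://www.example.com/index.html",
    "http://example.com/index.html",
]

# All four variants collapse to a single canonical form.
canonical_forms = {canonicalize(u) for u in variants}
print(canonical_forms)
```

The same function turns a link to a subdirectory's index.html into the bare directory URL with its trailing slash, which is the form worth using in your own links.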
On every web site I develop, I set up an appropriate 301 permanent redirect to the www version. It’s not something I charge extra for, or something that I tout to my clients as being anything special. It’s about a 20-second task to set up the 301 properly. And I never link to directory index pages by filename. I don’t know why some of the big companies aren’t aware of this issue, or, if they are aware, why they don’t care enough to do it properly. It raises the question: if they’re so ignorant of, or uncaring about, something that is so simple to do right, in how many other areas are they incompetent?
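This post doesn't show how the redirect is configured, and on most hosts it would be a line or two of web server configuration. Purely to illustrate what a 301 to the www version does, here is a small Python sketch written as a WSGI application. The host name `www.example.com`, the function names `app` and `simulate`, and the plain-text response are all placeholders assumed for the example, not anything taken from a real site.

```python
# Assumption: the preferred (canonical) host is the www version.
CANONICAL_HOST = "www.example.com"


def app(environ, start_response):
    """Answer requests on the canonical host; 301 everything else to it."""
    host = environ.get("HTTP_HOST", "")
    path = environ.get("PATH_INFO", "") or "/"
    query = environ.get("QUERY_STRING", "")

    if host != CANONICAL_HOST:
        # Preserve the path and query string so deep links still work.
        location = "http://" + CANONICAL_HOST + path
        if query:
            location += "?" + query
        start_response(
            "301 Moved Permanently",
            [("Location", location), ("Content-Length", "0")],
        )
        return [b""]

    body = b"Hello from the canonical host\n"
    start_response(
        "200 OK",
        [("Content-Type", "text/plain"), ("Content-Length", str(len(body)))],
    )
    return [body]


def simulate(host, path, query=""):
    """Call the app directly with a fake request and capture the response."""
    captured = {}

    def start_response(status, headers):
        captured["status"] = status
        captured["headers"] = dict(headers)

    app({"HTTP_HOST": host, "PATH_INFO": path, "QUERY_STRING": query},
        start_response)
    return captured


result = simulate("example.com", "/about/")
print(result["status"], result["headers"].get("Location"))
```

A request for the non-www host gets a permanent redirect to the same path on the www host, while a request that already uses the www host is served normally. Because the redirect is a 301 rather than a temporary one, it signals that the www address is the permanent home of the page.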
Tags: 301, canonical URL, duplicate content, Google