This week I’ve been working on fixing some duplicate content problems on our site coming from the Site Map module. In doing so, it’s become obvious that the design of Drupal’s taxonomy module makes it very easy to accidentally end up with Google indexing multiple copies of your term pages. This SEO problem isn’t just happening to Drupal rookies. Even top Drupal firms are screwing this up. Take a look at some examples from prominent Drupal company websites:
[Update: Most of these have now been fixed, so the counts you’ll see now are lower than the ones I state below. They were correct when I wrote this.]
Lullabot’s Podcast Taxonomy Page
There are 84 Google-indexed pages for only 10 pager pages for this term. PLUS, there are an additional 20 indexed pages at the aliased URL for this page.
Acquia's Drupal Planet Taxonomy Page
55 indexed pages for 26 pager pages, plus 37 pages indexed at the alias address.
It’s clear that Google is routinely picking up duplicates of taxonomy pages for Drupal sites. Why? Examining the results above, I see a few different reasons:
- Some of these sites are not using the Global Redirect module, which means that the internal taxonomy page path “taxonomy/term/##” returns the exact same page as the aliased address. It doesn’t redirect. It just shows the same page. That in itself wouldn’t be a problem if nobody ever linked to the internal path, because then Google would never pick it up. But it’s clear that linking mistakes have been made at some point, or some module has exposed the internal links to Google by using them on a page. Now they are in the Google index and they won’t come out on their own.
- The term depth argument for the taxonomy page (and taxonomy_term view) allows identical pages to be indexed at both “taxonomy/term/##” and “taxonomy/term/##/0”, and potentially also at “taxonomy/term/##/all”. (The “/all” link may result in different content if you have hierarchical tags. But, in most cases, it’s the same.) The Global Redirect module can take care of the “/0” for you. Handling the “/all” will take some more effort (see the solution below).
- The taxonomy pages actually allow anything to be put in the depth argument position. This leads to a problem where if you have an accidental relative link in a node on the page, you will create an entirely new set of indexed pages. For example, on page two of the Development Seed results above are URLs that look like this:
At some point in the past, there was probably a link that didn’t have an http:// at the beginning, so it was treated like a relative link. Google followed the link, and a whole new series of identical indexed pages was created by following the pager links on these pages.
Fixing This Problem
There are a few steps you can take on your Drupal site if you want to prevent these duplicate term pages from getting indexed or if you want to tell Google to remove already indexed pages from its results.
- If you haven’t, fix your .htaccess to redirect to a single domain as described in this previous post.
- Install the Global Redirect module. This will redirect the taxonomy page's internal path URL with and without the “/0” to your user-friendly alias URL.
- To fix the issue with arbitrary text being appended to the URL, you can add a Rewrite rule to your .htaccess to redirect URLs with additional arguments to the main URL. The one I have put on this site is:
RewriteRule ^taxonomy/term/([0-9]+)/(.*)$ /taxonomy/term/$1 [NC,L,QSA,R=301]
This rule won’t work for sites that need to use the “/all” address, but I’m sure it can be rewritten to support that. I’m not a RewriteRule expert, though, so if someone has an alternative, please post it in the comments and I’ll update the post.
- If you have hierarchical tags, and you want to make the default taxonomy page function like the “/all” page, you can enable the taxonomy_term view, remove the Term Depth argument, and then set your own depth on the Term ID argument.
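The .htaccess steps above could be combined along these lines. This is a rough sketch under a few assumptions, not the exact rules running on this site: www.example.com is a placeholder hostname, the canonical-host redirect is one common pattern rather than the one from the earlier post, and the RewriteCond that spares “/all” is an untested guess at how the rule might be extended. These lines would need to sit above Drupal’s own index.php rewrite rule.

```apache
<IfModule mod_rewrite.c>
  RewriteEngine on

  # Send every request to one hostname (www.example.com is a placeholder).
  RewriteCond %{HTTP_HOST} !^www\.example\.com$ [NC]
  RewriteRule ^ http://www.example.com%{REQUEST_URI} [L,R=301]

  # Redirect taxonomy/term/N/<anything> to taxonomy/term/N,
  # but skip taxonomy/term/N/all for sites that need that address.
  RewriteCond %{REQUEST_URI} !/taxonomy/term/[0-9]+/all/?$ [NC]
  RewriteRule ^taxonomy/term/([0-9]+)/(.*)$ /taxonomy/term/$1 [NC,L,QSA,R=301]
</IfModule>
```

Note that the taxonomy rule, with or without the extra condition, also catches any other path with a trailing segment. As far as I know, Drupal 6 core serves term feeds at taxonomy/term/N/0/feed, so those would be redirected too. If you rely on those feeds, the rule probably needs one more RewriteCond to exclude them.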
Views with Pagers - Should You Index All the Pages?
The last thing you might consider doing is telling Google not to index pager subpages beyond the first one. This is a preference issue. Personally, I think it’s better to have Google focus its results on a single page for each term. In my opinion, it’s better for SEO and better for users who click on result links to go to the first page for a term rather than somewhere in the middle. But, I can understand why some people might want all the pages indexed to make sure nothing is missed.
If you do want to remove your pager pages from the Google index, you need to add a NOINDEX,FOLLOW meta tag to all the pager pages, except for the first one. There are two ways you can do this:
- Change the setting for this in the Nodewords module - DO NOT DO THIS! While there is a configuration option designed for this purpose in Nodewords, it’s got a major bug (http://drupal.org/node/835172) in the stable release. Do not mess with this option, or you’ll likely end up getting NOINDEX on the exact pages you want Google to index. The thing is coded backwards or something. This issue's supposedly fixed in the -dev version if you want to try that.
- Follow the instructions on this page to modify your theme page template to add the required meta tag to pager pages. This isn’t the best long-term option, especially if you’re using a standard template, since you’ll lose this change in an upgrade. But, until Nodewords is fixed, it’s the best way to go.
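If editing the theme isn’t appealing, another possible route is to send the same directive as an HTTP response header from .htaccess, since Google also honors an X-Robots-Tag header. To be clear, this is a different technique from the template change in the linked instructions, and it is only a sketch. It assumes mod_rewrite and mod_headers are both enabled, that RewriteEngine is already on (as in Drupal’s stock .htaccess), and that pager pages are identified by Drupal’s “page” query parameter. PAGER_SUBPAGE is just a variable name made up for this example.

```apache
<IfModule mod_headers.c>
  # Flag requests whose "page" query parameter starts with a non-zero digit,
  # e.g. ?page=1 or ?page=12. The first page (no parameter, or page=0) is left alone.
  RewriteCond %{QUERY_STRING} (^|&)page=[1-9]
  RewriteRule ^ - [E=PAGER_SUBPAGE:1]

  # Ask crawlers not to index flagged pages, while still following their links.
  Header set X-Robots-Tag "noindex, follow" env=PAGER_SUBPAGE
</IfModule>
```

This does not cover the comma-separated page values Drupal uses when a page has more than one pager. Requesting a pager URL with curl -I should show the X-Robots-Tag header if it is working.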
I’ve implemented all of the techniques listed above on this site, and now we have a single indexable page for each taxonomy term. All the other variations 301 redirect to the user-friendly aliased URL. Hopefully, this will allow us to concentrate as much PageRank as possible on those pages for those terms. If you use the Taxonomy module on your Drupal website and are concerned about SEO, I recommend taking a look at how your site is being indexed. These fixes are pretty easy, but it's clear (and surprising) that hardly anyone in the Drupal community has noticed this issue. I'm adding these items to my pre-launch checklist for all future websites.