This morning’s panel on duplicate content included representatives from Google, Yahoo!, Ask, and Microsoft. Each panelist brought some interesting perspective on the problem of duplicate content, along with strategies to fix those problems.
There are several sources of duplicate content, including site structure, dynamic URLs, multiple domain names, and scraping (other sites stealing your content). Solving a scraping problem involves filing complaints and possibly legal action. But the other sources of duplicate content have technical solutions that are relatively straightforward to implement.
- Crawl your own site — this will help you identify any index barriers, including duplicate content.
- Force all URLs on your site to use the same hostname and domain name — use a 301 redirect in your .htaccess file.
- Prevent indexing of URL arguments that don’t change the content on the page — sort links, printer-friendly pages, RSS feeds, session IDs and tracking IDs all need to be excluded from the index. Use robots.txt with wildcards, Yahoo’s dynamic URL rewriter, or custom links based on search engine user agents. Google suggests using cookies to control site behavior instead of URL parameters, but this seems like a bad idea to me.
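The hostname canonicalization above can be done with a couple of mod_rewrite lines. This is a minimal sketch assuming Apache with mod_rewrite enabled and `www.example.com` as your chosen canonical host — substitute your own domain:

```apache
# Canonicalize the hostname: permanently (301) redirect any request
# arriving on a non-canonical host (example.com, old domains, etc.)
# to www.example.com, preserving the requested path.
RewriteEngine On
RewriteCond %{HTTP_HOST} !^www\.example\.com$ [NC]
RewriteRule ^(.*)$ http://www.example.com/$1 [R=301,L]
```

Because the redirect is a 301 (permanent), the engines consolidate the alternate hostnames onto the canonical one rather than indexing both.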
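For the robots.txt approach, the major engines support `*` wildcards in `Disallow` patterns as an extension to the original robots.txt convention. A sketch, assuming hypothetical parameter names like `sessionid` and `print` — match these to whatever your site actually uses:

```
User-agent: *
# Block URLs whose query string carries a session or tracking ID
Disallow: /*?*sessionid=
Disallow: /*?*trackingid=
# Block printer-friendly duplicates of regular pages
Disallow: /*?*print=
```

Note that wildcard matching is not part of the original robots.txt specification, so test these rules against each engine’s robots.txt checker before relying on them.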
While there’s no penalty for duplicate content as such, pages that share content will suffer from diluted value in the search indexes. By consolidating your URLs within the index, your remaining pages will be valued more highly and therefore perform better in search results.