Bill has a great post (again) on duplicate content, and looks at what conditions may cause a search engine not to list pages:
1. Product descriptions from manufacturers, publishers, and producers reproduced by a number of different distributors in large ecommerce sites
2. Alternative print-friendly versions of pages
3. Pages that reproduce syndicated RSS feeds through a server-side script
4. Canonicalization issues, where a search engine may see the same page as different pages with different URLs
5. Pages that serve session IDs to search engines, so that they try to crawl and index the same page under different URLs
6. Pages that pass multiple data variables through URLs, so that search engines crawl and index the same page under different URLs (see the URL sketch below)
7. Pages that share too many common elements, or whose elements are very similar from one page to another, including titles, meta descriptions, headings, navigation, and text that is shared globally
8. Copyright infringement
9. Use of the same or very similar pages on different subdomains or different country top level domains (TLDs)
10. Article syndication
11. Mirrored sites
What a great checklist! Bill goes into a lot more detail, so be sure to read his post.
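Items 4 through 6 are really the same problem seen from different angles: one page reachable through many URLs. Here's a rough Python sketch of the kind of URL normalization a site or a crawler can apply to collapse those variants; the parameter names being stripped (sessionid, sort and so on) are just examples I've picked, not anything from Bill's post.

```python
# Rough sketch: collapse session IDs and presentation parameters so that
# different URLs for the same page map to one canonical form.
# The parameter names below are illustrative examples only.
from urllib.parse import urlsplit, urlunsplit, parse_qsl, urlencode

IGNORED_PARAMS = {"sessionid", "sid", "phpsessid", "sort", "ref"}

def canonicalize(url: str) -> str:
    parts = urlsplit(url)
    scheme = parts.scheme.lower()
    netloc = parts.netloc.lower()
    path = parts.path.rstrip("/") or "/"
    # Keep only parameters that actually change the content of the page,
    # and sort them so parameter order can't create a "new" URL.
    kept = sorted((k, v) for k, v in parse_qsl(parts.query) if k not in IGNORED_PARAMS)
    return urlunsplit((scheme, netloc, path, urlencode(kept), ""))

urls = [
    "http://example.com/widgets/?sessionid=ABC123",
    "http://Example.com/widgets?sort=price&sid=XYZ",
    "http://example.com/widgets/",
]
print({canonicalize(u) for u in urls})  # all three collapse to the same URL
```

The same idea applies whether it's a search engine deciding that two URLs are the same page, or a site cleaning up its own internal linking so the engine never has to make that call.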
Bill also looks at some of the papers on duplicate content issues. I’m reading through one of the Microsoft papers, which is hard-going yet interesting, and it occurs to me that keyword-based SEO has a fundamental problem:
all things being equal, if you choose the same keyword phrase that a lot of other people are using, you may be more likely to be taken out by duplicate content filters
The probability that two unique pages contain the same text phrase at a higher-than-average density is low, so when it happens it should raise duplicate content flags: not necessarily because the pages are an exact match, but because they are too similar in content to be shown in the same SERP.
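To put a number on "too similar", here is a small sketch of the kind of overlap measure the near-duplicate literature tends to use: break each page's text into overlapping word shingles and compare the sets. The shingle size and the example pages are my own illustrative choices, not anything taken from the Microsoft paper.

```python
# Rough sketch of shingle-based similarity: pages are compared by the
# overlap of their word n-grams. Shingle size is an illustrative choice.

def shingles(text: str, size: int = 4) -> set:
    """Return the set of overlapping word n-grams ("shingles") in the text."""
    words = text.lower().split()
    return {" ".join(words[i:i + size]) for i in range(len(words) - size + 1)}

def jaccard(a: set, b: set) -> float:
    """Jaccard similarity: 0.0 = nothing shared, 1.0 = identical shingle sets."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)

page_a = "buy cheap blue widgets online cheap blue widgets with free shipping"
page_b = "buy cheap blue widgets online cheap blue widgets with free shipping today"
page_c = "a completely different article about something else entirely unrelated"

print(jaccard(shingles(page_a), shingles(page_b)))  # high: near-duplicate candidates
print(jaccard(shingles(page_a), shingles(page_c)))  # low: unrelated pages

# A crawler could flag pairs scoring above some threshold as near-duplicates
# and keep only one of them in the index.
```

The point isn't the exact numbers; it's that two pages built around the same dense keyword phrase will share a lot of shingles, and that overlap is exactly what these filters measure.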