Backlink Taxonomy

The incidents of backlink spam that we have been able to discover all fall into one of four categories:

Plain-text spam

This category includes wiki and blog-comment spam links that appear in plain text and are visible to all visitors of a site, regardless of their choice of browser. They are easily found by Google searches and are usually the result of website abandonment, lax authorship policies, and/or a lack of effective spam-prevention plugins. Fighting this kind of website contamination is a widely known problem, and a variety of solutions exist.

Cloak and Redirector

This category is for spam links that are not made visible to end users who visit the site, but instead are shown only to search engines. The code that injects the links also watches for incoming requests that are referred by search engines and redirects end users to an external site if specific search terms are present in the referring search URL.

If the cloak activates on the User-Agent header, these hacks are easy to verify; if it activates on known crawler IP ranges, only the "owner" of the site can verify them (via Google Webmaster Tools). The redirector part of these hacks does not differentiate between web clients, so it can be verified whenever the trigger search terms are known (which they usually are).
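The User-Agent check described above can be sketched roughly as follows: fetch the same page once as a browser and once as a crawler, then list the links that only the crawler was served. The two User-Agent strings and all function names here are illustrative assumptions, not taken from any observed hack, and the sketch only helps against cloaks that key on the User-Agent header, not on crawler IP ranges.

```python
from html.parser import HTMLParser
from urllib.request import Request, urlopen

# Illustrative User-Agent strings (assumptions); a real cloak may key on
# different substrings.
BROWSER_UA = "Mozilla/5.0 (X11; Linux x86_64) Firefox/115.0"
CRAWLER_UA = "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"


class LinkCollector(HTMLParser):
    """Collect the href value of every <a> tag in a page."""

    def __init__(self):
        super().__init__()
        self.links = set()

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.add(value)


def extract_links(html):
    collector = LinkCollector()
    collector.feed(html)
    return collector.links


def crawler_only_links(browser_html, crawler_html):
    """Return links present in the crawler's copy but not the browser's."""
    return extract_links(crawler_html) - extract_links(browser_html)


def fetch(url, user_agent, timeout=10):
    """Fetch url while presenting the given User-Agent header."""
    request = Request(url, headers={"User-Agent": user_agent})
    with urlopen(request, timeout=timeout) as response:
        return response.read().decode("utf-8", errors="replace")


def check_user_agent_cloak(url):
    """Fetch url as a browser and as a crawler and compare the links."""
    return crawler_only_links(fetch(url, BROWSER_UA), fetch(url, CRAWLER_UA))
```

A non-empty result is a reason to look closer rather than proof of a cloak, since dynamic pages can legitimately serve different links on consecutive requests.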

Redirector-only

This category is for spam scripts that act as a redirector but make no attempt to hide their presence (they are not cloaks). The redirector-only scripts that we have seen redirected even when no search-based Referer was present. Because they do not hide themselves, they can be discovered not only by triggering the redirect but also by anyone sufficiently familiar with the architecture of the website (i.e. the "what is that directory doing in my site? I didn't create that!" reaction).
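A redirector of either kind can be probed by requesting the suspect URL without following redirects and checking whether the response points at a different host. The following is a minimal sketch under the assumption that the redirect is an HTTP-level one (a 3xx status with a Location header); redirects done in JavaScript or via meta refresh would not be caught. The function names are hypothetical.

```python
import http.client
from urllib.parse import urljoin, urlsplit

REDIRECT_CODES = {301, 302, 303, 307, 308}


def is_offsite_redirect(page_url, status, location):
    """True if the response redirects to a different host than page_url."""
    if status not in REDIRECT_CODES or not location:
        return False
    target = urljoin(page_url, location)  # resolve relative Location values
    return urlsplit(target).hostname != urlsplit(page_url).hostname


def probe(page_url, referer=None, timeout=10):
    """Request page_url once, without following redirects.

    Returns (status, Location header or None).
    """
    parts = urlsplit(page_url)
    if parts.scheme == "https":
        conn = http.client.HTTPSConnection(parts.netloc, timeout=timeout)
    else:
        conn = http.client.HTTPConnection(parts.netloc, timeout=timeout)
    path = parts.path or "/"
    if parts.query:
        path += "?" + parts.query
    headers = {"Referer": referer} if referer else {}
    try:
        conn.request("GET", path, headers=headers)
        response = conn.getresponse()
        return response.status, response.getheader("Location")
    finally:
        conn.close()
```

For a redirector-only script, calling probe(url) with no Referer should be enough. For a cloak-and-redirector, pass a referer that looks like a search for the trigger terms, for example "https://www.google.com/search?q=viagra". Note that a plain hostname comparison also flags harmless redirects such as example.com to www.example.com.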

Cloak-only

This category is for cloaks that do not also act as a redirector. We have seen only one instance of this, and its purpose was to place cloaked links to additional cloak-redirectors on the same infected host. The goal was not to get the cloak-only links themselves into the Google index, but to pass those links on to Google for additional crawling and indexing. Detecting this category of hack can be much more difficult if the cloak is IP-based. A cloak that is also a redirector is easier to detect via the redirect; without the redirector, an IP-based cloak can only be verified through access to the files of the site, through Google Webmaster Tools, or through the presence of the cloaked links in the Google cache of the crawled page (if the cache is available).

Backlink discovery vectors

When searching for backlink spam, we have found references to the backlinks in the following sources:

Search engines

As mentioned elsewhere in this documentation, the end goal of backlink spammers is to get traffic from end user searches at various search engines. In order to be effective, the backlinks cannot be hidden from search engines. As long as you are able to guess the terms that the backlinks are advertising, you should be able to identify backlink-infested sites.

Twitter

In a recent backlink hack, advertisements for Viagra were tweeted, with links to the corresponding redirectors. Whether this was to capture Twitter searches for "viagra" (really?) or to seed search engines that crawl Twitter is unknown. Much like Google Alerts, Twitter provides an RSS feed for a given search (updated only when the feed is queried); automated querying of such feeds may provide early backlink discovery.
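Automated feed checks of this kind could look roughly like the sketch below, which takes the text of an RSS 2.0 feed and returns the items whose title or description mentions a suspect term. How the feed text is obtained is left to the caller: no feed URL is assumed here, and the function name and term list are illustrative only.

```python
import xml.etree.ElementTree as ET


def suspect_items(rss_text, terms):
    """Return (title, link) for each feed item that mentions a term.

    rss_text is the raw XML of an RSS 2.0 feed; terms is a list of
    strings, matched case-insensitively against title and description.
    """
    root = ET.fromstring(rss_text)
    lowered = [term.lower() for term in terms]
    hits = []
    for item in root.iter("item"):
        title = item.findtext("title", default="")
        link = item.findtext("link", default="")
        description = item.findtext("description", default="")
        haystack = (title + " " + description).lower()
        if any(term in haystack for term in lowered):
            hits.append((title, link))
    return hits
```

Run periodically, the returned links become candidates for the redirector and cloak checks described earlier.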

Source code

If you have access to the source files of the website (e.g. ssh access to the host and read permissions to the files/directories), you can use common tools, such as grep, to search the actual website files for either references to suspect terms or for common cloaking code fragments.
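Where grep is not available, or the scan needs to be scripted, the same search can be approximated in Python. The patterns below are assumptions chosen for illustration (a suspect term plus a few fragments that cloaking code might plausibly contain); they are not a list of signatures taken from real infections and should be replaced with whatever terms are relevant to the site.

```python
import os
import re

# Hypothetical patterns: one suspect term and some fragments that code
# inspecting the client or hiding its payload might contain.
SUSPECT_PATTERNS = [
    re.compile(rb"viagra", re.IGNORECASE),
    re.compile(rb"googlebot", re.IGNORECASE),
    re.compile(rb"HTTP_USER_AGENT"),
    re.compile(rb"HTTP_REFERER"),
    re.compile(rb"base64_decode\s*\("),
]


def scan_tree(root, patterns=SUSPECT_PATTERNS):
    """Yield (path, line_number, line) for each line matching any pattern."""
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            try:
                # Read as bytes so binary or oddly encoded files do not
                # raise decoding errors.
                with open(path, "rb") as handle:
                    for number, line in enumerate(handle, start=1):
                        if any(p.search(line) for p in patterns):
                            yield path, number, line.rstrip(b"\r\n")
            except OSError:
                continue  # unreadable file; skip it
```

Legitimate code also reads the User-Agent and Referer headers, so every hit needs a manual look before it is treated as evidence of a hack.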

Webserver access logs

If you have access to the log files of the website, periodic automated scans for suspect terms can reveal backlink spam installations, especially if Referer headers are logged along with the requests.
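A periodic log scan might be sketched as follows. It assumes the logs are in the common "combined" format (which includes the Referer and User-Agent fields) and that the referring search engine carries the search terms in a q or p query parameter. Both are assumptions about the setup, and the term list is illustrative.

```python
import re
from urllib.parse import parse_qs, urlsplit

# Combined log format: host ident user [time] "request" status bytes
# "referer" "user-agent"
COMBINED = re.compile(
    r'^(?P<host>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] "(?P<request>[^"]*)" '
    r'(?P<status>\d{3}) \S+ "(?P<referer>[^"]*)" "(?P<agent>[^"]*)"'
)

SUSPECT_TERMS = ("viagra",)        # illustrative; keep terms lowercase
SEARCH_PARAMS = ("q", "p")         # assumed search-term parameter names


def referer_terms(referer):
    """Return the search terms carried in a referring search URL, or ''."""
    query = parse_qs(urlsplit(referer).query)
    words = []
    for name in SEARCH_PARAMS:
        words.extend(query.get(name, []))
    return " ".join(words)


def suspect_requests(lines, terms=SUSPECT_TERMS):
    """Yield (host, request, referer) for log lines mentioning a term.

    A line matches if a term appears in the request line itself or in
    the search terms of its Referer.
    """
    for line in lines:
        match = COMBINED.match(line)
        if not match:
            continue  # not in combined format; ignore
        request = match.group("request").lower()
        searched = referer_terms(match.group("referer")).lower()
        if any(term in request or term in searched for term in terms):
            yield match.group("host"), match.group("request"), match.group("referer")
```

Passing an open log file as lines works directly, since the function only iterates over it. Requests that match on the Referer and were answered with a 3xx status are the most interesting ones to follow up on.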