Backlink spam is a problem

Georgia Tech OIT has discovered that a growing number of our campus websites contain backlink spam, often in the form of links to sites illegally selling pharmaceuticals. The presence of these backlinks violates the campus CNUSP, and the discovery, eradication and prevention of all backlink spam is now a priority with OIT.

From Wikipedia:

“Backlinks are incoming links to a website or web page. Inbound links were originally important (prior to the emergence of search engines) as a primary means of web navigation; today their significance lies in search engine optimization (SEO). The number of backlinks is one indication of the popularity or importance of that website or page (for example, this is used by Google to determine the PageRank of a webpage).”

Backlinks

Backlinks can be used to try to improve a webpage's ranking in Google search results: the more backlinks to a given webpage that use a given term as the link text, the higher that webpage ranks when a user searches Google for that term. In Figure 1, webpages B, C, and D all link to webpage A with the term "viagra" as the link text, so a Google search for "viagra" lists webpage A higher in the results than it would if webpages B, C, and D did not link to it.

The Google ranking of the webpages that contain the backlinks also factors into their usefulness to a spammer; given that web pages served from important educational institutions carry significant ranking weight, Georgia Tech is a desirable target for spammers.

OIT is observing an increasing number of spam backlinks to external (non-Georgia Tech) websites being served from Georgia Tech websites. In almost all cases, these links are to sites selling illegal pharmaceuticals (e.g. viagra, cialis) or similar black-market goods. The end goal of the spammers is to get their websites higher in Google search results by creating as many links to their sites as they can from Georgia Tech hosted websites.


Two types of backlinks

There are two kinds of backlinks being observed: cloaked and non-cloaked.

Cloaked backlinks are links that are visible in a webpage only when that webpage is crawled by a search engine such as Google; when the same page is viewed in a normal web browser, the links are not present in the page. This effectively hides the links from users and (importantly) from the admins of the site, so some cloaked ads remain active for weeks or months before they are discovered and removed.

Cloak Example

Non-cloaked backlinks are links that are visible on a webpage all the time. Both Google and end users see the links in a page when they visit. These are often used when cloaked backlink insertion is not possible and are less desirable from the spammers' viewpoint, as they should be discovered and removed quickly by the admins of the website.

Openad

Two methods of backlink insertion

We have seen two basic methods of backlink spam creation on our websites:

Compromise: Website compromise often leads to cloaked backlinks. In this case, the spammer takes advantage of a hole in a web application’s security to modify the code of a website to serve cloaked ads for several or even all pages of a site.

Open Registration: Some websites allow end users to post content, either without registering, or by registering for an account that is given posting privileges without having been vetted by the site’s admins. Spammers will take advantage of these sites to post backlink spam in such places as wiki pages, blog entries/comments, user profile pages, etc. With backlinks created in this manner, the website has not been compromised, per se, but has simply been used to create the links.

Finding backlink insertions

Because the goal of the spammers is to game Google, there is one website they won’t hide their backlinks from: Google. By using Google itself, you can find backlink spam being served from your website. There are two ways to “use” Google to search for and confirm backlinks in your site: guessing terms and crawling your site as Googlebot.

Guessing terms involves searching Google for pages from your website that contain certain common spam keywords. For example, you might search for pages from your site that contain the word “viagra.” Unless you are purposely serving pages that talk about viagra, any hits that come back from such a search are most likely spam backlinks to sites selling viagra. For example, to find any pages on the www.foo.gatech.edu website that contain the word “cialis,” you would search Google for:

+site:www.foo.gatech.edu +cialis

Note the +’s, as they are required for this to work correctly. If there are any results returned, then it means that when Google crawled that site it found pages that definitely contained the word “cialis.” In the case of a cloaked ad, clicking on such a link in Google usually results in going first to www.foo.gatech.edu and then being immediately redirected to an external commerce website.

Example Search

Guessing terms is the easiest way to find backlink spam, but may miss some cloaked ads if your guesses are not correct. Suggested terms to search for are:

viagra, cialis, rolex, penis, oxycontin, fioricet
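The suggested terms can be turned into one query per term with a short shell loop. This is only a minimal sketch: the function name build_spam_queries is made up for illustration, www.foo.gatech.edu is the placeholder host used above, and the query syntax simply mirrors the example search shown earlier.

```shell
#!/bin/sh
# Hypothetical helper (not an OIT tool): print one Google query per
# suggested spam term for the site given as the first argument.
build_spam_queries() {
    site="$1"
    for term in viagra cialis rolex penis oxycontin fioricet; do
        # Same "+site:... +term" form as the example search above.
        printf '+site:%s +%s\n' "$site" "$term"
    done
}

build_spam_queries www.foo.gatech.edu
```

Each printed line can be pasted into the Google search box; extend the term list with any other keywords you suspect.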

By searching Google's results, you may also get results for pages that are no longer serving the backlinks (perhaps because they have been deleted or modified since Google crawled them).

You can emulate clicking on a Google result via other tools, such as curl. The two commands below would look for either a cloaked or non-cloaked backlink using the term "viagra" being served from the page http://www.foo.gatech.edu/somepage.php:

curl -s http://www.foo.gatech.edu/somepage.php | grep -i viagra

curl -si -e 'http://www.google.com/#?q=viagra' http://www.foo.gatech.edu/somepage.php |
   egrep '^HTTP/[0-9.]+ 30[12]'

A non-empty result from either command suggests the presence of a spam backlink, although a redirect can also be legitimate (for example, from HTTP to HTTPS), so confirm where it leads.
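When the second command reports a redirect, it helps to see where the redirect points, since a cloaked ad sends the visitor to an external commerce site. The sketch below is one possible way to do that; the function name redirect_target and the example.com hosts are assumptions made for illustration. It reads response headers (as printed by curl -si) on standard input and prints the Location target only for a 301 or 302 response.

```shell
#!/bin/sh
# Hypothetical helper: read HTTP response headers on stdin and print the
# Location target if the status line is a 301 or 302 redirect.
redirect_target() {
    # Strip carriage returns, then inspect the status line and headers.
    tr -d '\r' | awk '
        NR == 1 { if ($1 !~ /^HTTP\// || ($2 != "301" && $2 != "302")) exit }
        /^$/ { exit }                                   # end of headers
        tolower($1) == "location:" { print $2; exit }
    '
}

# Canned headers stand in for a live response here (example.com is a placeholder).
printf 'HTTP/1.1 302 Found\r\nLocation: http://pills.example.com/\r\n\r\n' | redirect_target
```

Against a live page, the same function could be fed from the command above: curl -si -e 'http://www.google.com/#?q=viagra' http://www.foo.gatech.edu/somepage.php | redirect_target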

Crawling your site as Googlebot is another, more technically complicated method that can be used to find and/or confirm the presence of cloaked backlinks. By faking the User Agent header of an HTTP request, you can appear to a website as a Google crawler. So, you can compare what is returned by a query with a normal User Agent with what is returned by a query with a Googlebot User Agent. For example, using curl to request pages:

curl -s -A 'Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.9.2.7) Gecko/20100713 Firefox/3.6.7' \
   -o regular_crawl.html http://www.foo.gatech.edu/somepage.php

curl -s -A 'Googlebot/2.1 (+http://www.google.com/bot.html)' -o google_crawl.html \
   http://www.foo.gatech.edu/somepage.php

diff regular_crawl.html google_crawl.html

Unless the page has something unusual in it (such as a rotating image banner or a plaintext hit counter), the two results should be the same. If they differ, the page is serving different content to Google than to regular users, and nine times out of ten the difference is a spam backlink.
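If the diff output is long, it can help to list only the lines that appear in the Googlebot copy, since that is where cloaked links would show up. The following is a minimal sketch under that assumption; the function name googlebot_only_lines is hypothetical, and the two small files created here merely stand in for regular_crawl.html and google_crawl.html.

```shell
#!/bin/sh
# Hypothetical helper: print lines that diff reports only in the second
# file (the Googlebot copy). Arguments: regular copy, Googlebot copy.
googlebot_only_lines() {
    diff "$1" "$2" | sed -n 's/^> //p'
}

# Stand-in files for demonstration; example.com is a placeholder host.
tmp=$(mktemp -d)
printf '<p>Welcome</p>\n' > "$tmp/regular_crawl.html"
printf '<p>Welcome</p>\n<a href="http://pills.example.com/">viagra</a>\n' > "$tmp/google_crawl.html"
googlebot_only_lines "$tmp/regular_crawl.html" "$tmp/google_crawl.html"
rm -r "$tmp"
```

Lines that change on every request, such as a hit counter, will also appear in this listing, so the output still needs a human look.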

One disadvantage of this crawling method is that it will only find cloaked backlinks; non-cloaked backlinks appear the same to all requestors and thus would be missed by this technique. For this reason, you might want to combine the methods by additionally searching the Googlebot result:

egrep -i 'viagra|cialis|rolex|penis|oxycontin|fioricet' google_crawl.html

Any results from this would indicate a spam backlink, whether cloaked or not.
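To apply the combined check to several pages, the fetch and the keyword search can be wrapped in small functions. This is a sketch only: the names SPAM_TERMS, fetch_as_googlebot, and files_with_spam are invented for illustration, and the term list is simply the suggested list from earlier in this document.

```shell
#!/bin/sh
# Suggested spam terms from this document, as an extended regular expression.
SPAM_TERMS='viagra|cialis|rolex|penis|oxycontin|fioricet'

# Hypothetical helper: fetch one URL with the Googlebot User Agent shown
# above. Arguments: URL, output file.
fetch_as_googlebot() {
    curl -s -A 'Googlebot/2.1 (+http://www.google.com/bot.html)' -o "$2" "$1"
}

# Hypothetical helper: print the names of saved crawl files that contain
# any of the spam terms (case-insensitive).
files_with_spam() {
    grep -Eil "$SPAM_TERMS" "$@"
}
```

For example, after running fetch_as_googlebot http://www.foo.gatech.edu/somepage.php google_crawl.html for each page of interest, files_with_spam *.html lists the saved pages that need a closer look.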

Fighting backlink insertions

There are four aspects to battling backlink spam in your website: detection, clean-up, prevention, and monitoring.

Detection: As outlined in the section above, you should use Google (and possibly other search engines such as Bing) to search for backlink spam being served by your website(s).

Clean-up: You need to remove any offending content you find by editing or removing the source pages. This may be simple, in the case of a non-cloaked static HTML backlink, or more complicated, in the case of a web application compromise that modified or injected code to generate the backlinks. If you are affiliated with Georgia Tech, OIT can provide (limited) resources to help you clean your site of the backlinks, or at least figure out how to do so.

You will also need to remove Google’s cache of your offending pages. See Google documentation at http://goo.gl/g7bl for instructions.

Prevention: No matter what the method of backlink insertion, it is imperative that you keep your OS, web applications and any associated plugins/modules up to date. For example, a website served by an outdated version of the WordPress CMS is easy pickings for spammers wishing to take advantage of a known exploit to insert backlink spam.

If you run a website that allows for users to create and publish content, you must ensure (at least) four things:

Failure to do any of these steps will likely result in backlink spam showing up on your site.

Monitoring: You will want to do some periodic monitoring of Google search results in order to be alerted to the presence of backlink spam in your site. Google searches can be automated by Google Alerts and combined via Yahoo Pipes to create a single feed that monitors Google for new results containing backlink spam on your site.

What OIT is doing about backlink spam

OIT is in the process of creating a system that monitors all *.gatech.edu URLs in Google for common backlink spam and, once the findings are verified, alerts the owners of the affected site. This doesn't mean that you can relax and just let OIT handle this problem, though; you should be proactive and use the information provided above to check your website for backlink spam and, if necessary, clean it.

Also, providing an easily found and frequently checked contact email address on your site would be very helpful to us in getting word to you if we find backlink spam on your website.