Automating Alerts

We have seen in the Backlinks Overview that one way of finding backlink spam is Guessing Terms, i.e. searching Google for pages on your site containing suspect words or phrases, such as "viagra", "cialis", "rolex", etc. Normally, searching Google is a manual task: you must visit the website in a browser, enter the term(s) and get a set of results. Google provides a mechanism, Google Alerts, that allows you to have Google periodically conduct a search and then make the results available to you via email or a feed. This is the primary mechanism that Georgia Tech is using to discover backlink spam on our websites.

Setting up a basic Alert

If you have a Google account, you can create up to 1,000 separate alerts via the Google Alerts site. For each alert, you must choose:

For example, the form below would create an alert that would email adam.arrowood@oit.gatech.edu a list of results for pages containing the word "viagra" and links containing "gatech", doing so as soon as Google indexed the new information:

[Screenshot: Google Alerts creation form]

Below is an example of an email sent by a Google Alert set up to search for "oxycontin" on our websites:

[Screenshot: example Google Alert email]

Using inurl: vs. site:

In the example above, "inurl:gatech" was used, and not "site:gatech.edu", because we are interested in alerts for pages that are at a URL containing "gatech", or that contain in their content a URL containing "gatech". We want to be alerted even for suspect terms on external (non-gatech) pages because most backlink spam insertion is not an isolated event; backlink spam is almost always linked to from link farms, as illustrated below:

[Figure: link farms linking to backlink spam]

In the example email message in the previous section, the Alert was triggered when Google crawled a link farm page at http://jcj.mlds.10.eu/, which contained a link to http://sos.gatech.edu/node/1781. The page on the sos.gatech.edu website contained the actual link to the pharmaceutical site.

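Alerting on inurl: instead of site: is a double-edged sword: it catches link farm pages that point to compromised pages on your site, but it also matches external pages that you do not control, which produces more false positives.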

You may also notice that inurl:gatech was used, and not inurl:gatech.edu. This is because there is a limitation to Google's inurl: syntax; only one word may be searched for in the URLs of a page. Any space or word break (e.g. "." is a break) will end the search term; a search for inurl:gatech.edu is the same as doing a search for inurl:gatech edu. Such a search would return pages that contain the word "edu" and links containing the word "gatech".
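The word-break behaviour described above can be illustrated with a small sketch. It only mimics the effect on a query string; the function name effective_query and the rule that any non-alphanumeric character counts as a break are assumptions made for illustration, not a description of how Google parses queries internally.

```python
import re


def effective_query(query: str) -> str:
    """Rewrite each inurl: operand so that only its first word stays attached.

    Assumption: any non-alphanumeric character ends the inurl: term, so
    "inurl:gatech.edu" behaves like "inurl:gatech edu".
    """
    def split_operand(match):
        words = [w for w in re.split(r"[^A-Za-z0-9]+", match.group(1)) if w]
        return "inurl:" + " ".join(words)

    return re.sub(r"inurl:(\S+)", split_operand, query)


print(effective_query("inurl:gatech.edu viagra"))
```

Running a planned query through a check like this before creating the Alert makes it easier to spot an inurl: term that would silently be split into two words.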

Single vs. Multiple Alerts

Given that you are going to want to watch your websites for a number of suspicious terms (e.g. "viagra", "cialis", "rolex"), you will want to set up Google Alerts for each term. You may be tempted to create a combined Alert, such as:

+inurl:gatech (+viagra OR +cialis OR +rolex)

...but there is a downside to this: if you use multiple terms and receive an Alert notification, you may not know which term was the actual match, and that will affect your ability to automate verification of an Alert. Knowing exactly which term matched comes in handy not only in programmatically verifying backlinks, but also in communications with the person(s) in charge of a given website. For that reason, you should set up separate Alerts for each term:

[Screenshot: Google Alerts management page listing one Alert per term]
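When the list of suspect terms grows, it can help to generate the individual query strings rather than type them by hand. The following is a minimal sketch; the function name build_alert_queries is hypothetical, and the query format is an assumption modelled on the combined example above. The resulting strings would still be entered into the Google Alerts form manually.

```python
def build_alert_queries(url_word, terms):
    """Return one query string per suspect term, keyed by the term.

    Keeping the term as the key preserves the link between an Alert
    and the single term it was created for.
    """
    return {term: f"+inurl:{url_word} +{term}" for term in terms}


queries = build_alert_queries("gatech", ["viagra", "cialis", "rolex"])
for term, query in queries.items():
    print(f"{term}: {query}")
```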

ATOM feeds

If you choose to consume a Google Alert via a "feed", then Google automatically adds a subscription to an ATOM feed of the alert to your Google Account's Google Reader online feed reader. However, you are not limited to reading your Google Alert feeds in Google Reader. By accessing Google Reader's "Manage subscriptions" interface, you can find the source URL of the feed and use that URL in any ATOM-compatible feed reader. The feed URLs are of the format:

http://www.google.com/reader/public/atom/user/googleUserID/state/com.google/alerts/feedID

Despite the presence of a numerical user ID in the feed URL, the URLs allow for public, non-authenticated access.
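Because the feed is plain ATOM over HTTP, it can also be read by a script. Below is a minimal sketch using only the Python standard library. The function names and the sample entry are made up for illustration, and the exact elements present in a real Google Alert feed may differ; only the standard ATOM title, link and updated elements are assumed.

```python
import urllib.request
import xml.etree.ElementTree as ET

ATOM_NS = {"atom": "http://www.w3.org/2005/Atom"}

# Hypothetical sample entry, used so the parser can be tried without network access.
SAMPLE_FEED = """<?xml version="1.0" encoding="UTF-8"?>
<feed xmlns="http://www.w3.org/2005/Atom">
  <title>Sample alert feed</title>
  <entry>
    <title>Sample result</title>
    <link href="http://example.com/page"/>
    <updated>2012-01-02T03:04:05Z</updated>
  </entry>
</feed>"""


def fetch_feed(feed_url, timeout=30):
    """Download the raw feed document (no authentication is sent)."""
    with urllib.request.urlopen(feed_url, timeout=timeout) as response:
        return response.read()


def parse_entries(atom_xml):
    """Extract title, link and updated timestamp from each ATOM entry."""
    root = ET.fromstring(atom_xml)
    entries = []
    for entry in root.findall("atom:entry", ATOM_NS):
        link = entry.find("atom:link", ATOM_NS)
        entries.append({
            "title": entry.findtext("atom:title", default="", namespaces=ATOM_NS),
            "link": link.get("href", "") if link is not None else "",
            "updated": entry.findtext("atom:updated", default="", namespaces=ATOM_NS),
        })
    return entries


for item in parse_entries(SAMPLE_FEED):
    print(item["updated"], item["link"])
```

In real use, the result of fetch_feed(feed_url) would be passed to parse_entries in place of SAMPLE_FEED.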

Using Yahoo Pipes

After creating several separate Google Alerts, each for a different suspicious term, you may find that tracking each of them in a separate ATOM feed becomes tiresome. Yahoo Pipes is a web content aggregator that allows you to combine your multiple ATOM feeds into one stream, output as an RSS feed, a JSON object, or one of several other formats. Like Google Alerts, Yahoo Pipes is free to use and requires an existing Yahoo account.

Pipes are created via an online graphical editor, assembled via drag-and-drop of pre-constructed building blocks. To create a Pipe that combines several Google Alerts, start with a "Fetch Feed" module (found in the "Sources" group). Enter each of your separate Google Alert ATOM feed URLs. Next, connect the output (bottom) of the Fetch Feed module to the input of a "Sort" module (in the "Operators" group). Adjust the "Sort by" criteria to sort "item.pubDate" in ascending order. Finally, connect the output of the Sort module to the Pipe Output module:

[Screenshot: Pipe editor with Fetch Feed, Sort and Pipe Output modules connected]

Once you Save and Run the Pipe, you will be presented with a list of possible output formats, including various widget formats, RSS feed, JSON, PHP object, and email.
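The same fetch-and-sort combination can be approximated in a script. The sketch below is not a Yahoo Pipes feature; it assumes each feed has already been parsed into a list of dictionaries with an ISO 8601 "updated" field, and the function names and sample data are hypothetical. It also tags every item with the term of the Alert it came from, so the combined stream does not lose track of which term matched.

```python
from datetime import datetime


def parse_timestamp(value):
    """Parse an ISO 8601 timestamp; a trailing 'Z' is treated as UTC."""
    return datetime.fromisoformat(value.replace("Z", "+00:00"))


def merge_feeds(feeds):
    """Combine per-term entry lists into one list, oldest entry first.

    `feeds` maps a suspect term to the entries of that term's feed.
    """
    merged = []
    for term, entries in feeds.items():
        for entry in entries:
            merged.append({**entry, "term": term})
    return sorted(merged, key=lambda e: parse_timestamp(e["updated"]))


# Hypothetical sample data standing in for two parsed Alert feeds.
sample = {
    "viagra": [{"link": "http://example.com/a", "updated": "2012-03-01T10:00:00Z"}],
    "cialis": [{"link": "http://example.com/b", "updated": "2012-02-01T10:00:00Z"}],
}
for item in merge_feeds(sample):
    print(item["updated"], item["term"], item["link"])
```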

Further Automation

One large downside to Google Alerts is the large number of false positives that can be generated (often from link farms). By consuming the ATOM feeds (or Yahoo Pipe of combined feeds) programmatically, you can automate verification of possible backlinks. Using curl and a scripting language (such as PHP), you can write a script to be run periodically that does the following:
  1. Consume the next unread item from the source feed
  2. Retrieve the source of the page at the feed item's URL, filtering it for URLs specific to your organization (e.g. in our case *.gatech.edu)
  3. Take each of the filtered URLs, and try to verify the presence of either the suspect term or a cloak:
    1. Retrieve the URL, scanning for the suspect term searched for in the particular alert
    2. Retrieve the Google cache of the URL, scanning for the suspect term
    3. Retrieve the URL using a Googlebot User Agent header, looking for an HTTP Redirect, or if content is returned, scanning for the suspect term
    4. Retrieve the URL using a faked Google search URL for the Referer header, looking for an HTTP Redirect, or if content is returned, scanning for the suspect term
  4. If any positive results are found in the previous step, send the appropriate notice to your Information Security department
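The steps above can be sketched as follows. This is a minimal illustration in Python with the standard library rather than curl and PHP, and it is not the script used at Georgia Tech. All function names are hypothetical, the Google cache check and the notification step are left out, and the faked Referer URL format is an assumption.

```python
import http.client
import re
import urllib.parse
import urllib.request

# User-Agent string commonly published for Googlebot.
GOOGLEBOT_UA = "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"


def extract_org_urls(html, domain="gatech.edu"):
    """Return the URLs found in `html` whose host is `domain` or a subdomain of it."""
    urls = set()
    for candidate in re.findall(r"https?://[^\s\"'<>]+", html):
        try:
            host = urllib.parse.urlsplit(candidate).hostname or ""
        except ValueError:
            continue  # skip malformed URLs
        if host == domain or host.endswith("." + domain):
            urls.add(candidate)
    return sorted(urls)


def contains_term(text, term):
    """Case-insensitive whole-word search for the suspect term."""
    return re.search(r"\b" + re.escape(term) + r"\b", text, re.IGNORECASE) is not None


def fetch(url, headers=None, timeout=30):
    """Return (final_url, body); final_url differs from url after a redirect."""
    request = urllib.request.Request(url, headers=headers or {})
    with urllib.request.urlopen(request, timeout=timeout) as response:
        body = response.read().decode("utf-8", errors="replace")
        return response.geturl(), body


def verify_url(url, term):
    """Probe `url` plainly, as Googlebot, and with a faked Google Referer."""
    referer = "http://www.google.com/search?q=" + urllib.parse.quote_plus(term)
    probes = {
        "plain": {},
        "googlebot": {"User-Agent": GOOGLEBOT_UA},
        "google-referer": {"Referer": referer},
    }
    findings = []
    for name, headers in probes.items():
        try:
            final_url, body = fetch(url, headers)
        except (OSError, ValueError, http.client.HTTPException):
            continue  # unreachable or invalid URL: nothing to report
        if name != "plain" and final_url != url:
            findings.append(f"{name}: redirected to {final_url}")
        elif contains_term(body, term):
            findings.append(f"{name}: term found")
    return findings


page = '<a href="http://sos.gatech.edu/node/1781">x</a> <a href="http://example.com/gatech.edu">y</a>'
print(extract_org_urls(page))
```

A periodic job would call extract_org_urls on the page behind each feed item, pass every resulting URL to verify_url together with the term of the Alert that produced the item, and send a notice whenever the returned list is non-empty.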