As described in the Backlinks Overview, one way of finding backlink spam is Guessing Terms, i.e. searching Google for pages on your site containing suspect words or phrases, such as "viagra", "cialis", or "rolex". Normally, searching Google is a manual task: you must visit the website in a browser, enter the term(s), and review a set of results. Google provides a mechanism, Google Alerts, that allows you to have Google periodically conduct a search and then make the results available to you via email or a feed. This is the primary mechanism that Georgia Tech is using to discover backlink spam on our websites.
For example, the form below would create an alert that would email firstname.lastname@example.org a list of results for pages containing the word "viagra" and links containing "gatech", doing so as soon as Google indexed the new information:
Below is an example of an email sent by a Google Alert set up to search for "oxycontin" on our websites:
In the example above, "inurl:gatech" was used, and not "site:gatech.edu", because we want alerts both for pages whose own URL contains "gatech" and for pages whose content includes a URL containing "gatech". We want to be alerted even for suspect terms on external (non-gatech) pages because most backlink spam insertion is not an isolated event; backlink spam is almost always linked to from link farms, as illustrated below:
In the example email message in the previous section, the Alert was triggered when Google crawled a link farm page at http://jcj.mlds.10.eu/, which contained a link to http://sos.gatech.edu/node/1781. The page on the sos.gatech.edu website contained the actual link to the pharmaceutical site.
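The chain described above (link farm page, then a gatech page, then the pharmaceutical site) suggests a simple first check when an Alert arrives: pull the links out of the alerted page and see which ones point at a "gatech" URL. The following is a minimal sketch using only the Python standard library. The names `LinkCollector` and `links_containing` are hypothetical, and the sample HTML is invented for illustration; in practice the HTML would come from fetching the URL reported in the Alert.

```python
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Collects the href value of every <a> tag fed to the parser."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.hrefs.append(value)


def links_containing(html_text, word="gatech"):
    """Return the links in html_text whose URL contains `word` (case-insensitive)."""
    collector = LinkCollector()
    collector.feed(html_text)
    return [href for href in collector.hrefs if word in href.lower()]


# Invented sample standing in for a link farm page.
sample_page = (
    '<p><a href="http://sos.gatech.edu/node/1781">cheap pills</a> '
    '<a href="http://example.com/">unrelated</a></p>'
)

print(links_containing(sample_page))
```

Each URL returned this way is a candidate page on our own sites that may carry the actual spam link and should be inspected in turn.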
Alerting on inurl: instead of site: is a double-edged sword:
You may also notice that inurl:gatech was used, and not inurl:gatech.edu. This is because there is a limitation to Google's inurl: syntax; only one word may be searched for in the URLs of a page. Any space or word break (e.g. "." is a break) will end the search term; a search for inurl:gatech.edu is the same as doing a search for inurl:gatech edu. Such a search would return pages that contain the word "edu" and links containing the word "gatech".
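The word-break behavior described above can be modeled in a few lines. This is only an illustration of the rule as stated in this section, not Google's actual query parser; the function name `interpret_inurl` is hypothetical, and treating every non-alphanumeric character as a word break is an assumption.

```python
import re


def interpret_inurl(query):
    """Split an 'inurl:' query into (word matched in the URL, remaining plain terms).

    Assumption: any non-alphanumeric character acts as a word break,
    so only the first word after 'inurl:' applies to the URL.
    """
    prefix = "inurl:"
    if not query.startswith(prefix):
        raise ValueError("query must start with 'inurl:'")
    words = [w for w in re.split(r"[^A-Za-z0-9]+", query[len(prefix):]) if w]
    if not words:
        raise ValueError("no search word after 'inurl:'")
    return words[0], words[1:]


# Both forms are interpreted identically under this model.
print(interpret_inurl("inurl:gatech.edu"))
print(interpret_inurl("inurl:gatech edu"))
```

Under this model, both queries reduce to the URL word "gatech" plus the ordinary content term "edu", which is why inurl:gatech alone is the safer choice.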
Given that you are going to want to watch your websites for a number of suspicious terms (e.g. "viagra", "cialis", "rolex"), you will want to set up Google Alerts for each term. You may be tempted to create a combined Alert, such as:
+inurl:gatech (+viagra OR +cialis OR +rolex)
...but there is a downside to this: if you use multiple terms and receive an Alert notification, you may not know which term was the actual match, and that will affect your ability to automate verification of an Alert. Knowing exactly which term matched comes in handy not only in programmatically verifying backlinks, but also in communications with the person(s) in charge of a given website. For that reason, you should set up separate Alerts for each term:
If you choose to consume a Google Alert via a "feed", then Google automatically adds a subscription to an ATOM feed of the alert to your Google Account's Google Reader online feed reader. However, you are not limited to reading your Google Alert feeds in Google Reader. By accessing Google Reader's "Manage subscriptions" interface, you can find the source URL of the feed and use that URL in any ATOM-compatible feed reader. The feed URLs are of the format:
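When the list of suspect terms grows, the per-term queries can be generated rather than typed by hand. The sketch below assumes each single-term query follows the same syntax as the combined example above ("+inurl:gatech +term"); the function name `build_alert_queries` is hypothetical, and the resulting strings would still be pasted into the Google Alerts form manually.

```python
SUSPECT_TERMS = ["viagra", "cialis", "rolex"]


def build_alert_queries(terms, url_word="gatech"):
    """Return a mapping of term -> one Alert query for that term alone."""
    return {term: f"+inurl:{url_word} +{term}" for term in terms}


# Print one query per term, ready to be entered as separate Alerts.
for term, query in build_alert_queries(SUSPECT_TERMS).items():
    print(f"{term}: {query}")
```

Keeping the mapping keyed by term means that any later processing of an Alert already knows which term it belongs to.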
Despite the presence of a numerical user ID in the feed URL, the URLs allow for public, non-authenticated access.
After creating several separate Google Alerts, each for a different suspicious term, you may find that tracking each of them in a separate ATOM feed becomes tiresome. Yahoo Pipes is a web content aggregator that allows you to combine your multiple ATOM feeds into one stream, output as an RSS feed, a JSON object, or one of several other formats. As with Google Alerts, Yahoo Pipes is free to use and requires an existing Yahoo account.
Pipes are created via an online graphical editor, assembled via drag-and-drop of pre-constructed building blocks. To create a Pipe that combines several Google Alerts, start with a "Fetch Feed" module (found in the "Sources" group). Enter each of your separate Google Alert ATOM feed URLs. Next, connect the output (bottom) of the Fetch Feed module to the input of a "Sort" module (in the "Operators" group). Adjust the "Sort by" criteria to sort "item.pubDate" in ascending order. Finally, connect the output of the Sort to the Pipe Output module:
Once you Save and Run the Pipe, you will be presented with a list of possible output formats, including various widget formats, RSS feed, JSON, PHP object, and email.
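The same fetch-then-sort idea can also be sketched outside of Yahoo Pipes, for readers who prefer a script. The example below uses only the Python standard library and is not how Pipes works internally. It assumes each feed is a standard Atom document whose entries carry a `published` or `updated` timestamp with a time zone; the function names and the two inline sample feeds are hypothetical.

```python
import xml.etree.ElementTree as ET
from datetime import datetime

ATOM = "{http://www.w3.org/2005/Atom}"  # Atom XML namespace


def parse_atom_time(text):
    """Parse an Atom timestamp such as 2010-03-01T09:00:00Z."""
    return datetime.fromisoformat(text.strip().replace("Z", "+00:00"))


def merge_feeds(feed_documents):
    """Combine the entries of several Atom documents, oldest first."""
    entries = []
    for xml_text in feed_documents:
        root = ET.fromstring(xml_text)
        for entry in root.iter(ATOM + "entry"):
            stamp = (entry.findtext(ATOM + "published")
                     or entry.findtext(ATOM + "updated"))
            if not stamp:
                continue  # skip entries without a usable date
            link = entry.find(ATOM + "link")
            entries.append({
                "title": entry.findtext(ATOM + "title", default=""),
                "link": link.get("href") if link is not None else None,
                "date": parse_atom_time(stamp),
            })
    # Ascending order, mirroring the Sort module setting described above.
    return sorted(entries, key=lambda e: e["date"])


# Invented sample feeds standing in for two separate Alert feeds.
FEED_A = """<feed xmlns="http://www.w3.org/2005/Atom">
  <entry><title>viagra hit</title><link href="http://example.org/a"/>
  <updated>2010-03-02T10:00:00Z</updated></entry>
</feed>"""
FEED_B = """<feed xmlns="http://www.w3.org/2005/Atom">
  <entry><title>cialis hit</title><link href="http://example.org/b"/>
  <updated>2010-03-01T09:00:00Z</updated></entry>
</feed>"""

for item in merge_feeds([FEED_A, FEED_B]):
    print(item["date"].isoformat(), item["title"], item["link"])
```

For live feeds, the inline samples would be replaced by documents downloaded from each Alert's feed URL, for example with `urllib.request.urlopen(url).read()`.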