RoboCage README
What is RoboCage?
RoboCage is a Zope product that produces random text from a word dictionary. It mixes fake email addresses into that random text, as well as links back to itself (each with a different URL, though). Thus, it provides a cage-like facility to "catch" email-harvesting robots.
Why?
Some people say the internet is bad for a whole lot of reasons (pornography, insecurity, personal information etc.). It is probably just a mirror image of today's society.
It is evil, however, to write a robot program (probably in Perl or some other evil language) and collect email addresses for spam email. It is not possible to stop those robots but it is possible to feed them with false information. That's the idea of RoboCage.
How come?
The idea started at the EuroZope Conference in Berlin, July 12-13th 2001; during the following BBQ, to be exact. A bunch of people around Martijn Faassen were joking about all kinds of internet phenomena. Spam robots naturally had to come up in that discussion. A contest of anti-robot solutions was planned. RoboCage is my contestant.
How does it work?
The heart of RoboCage is the random text generator. It draws its data from a language file, which contains one word per line. Ispell dictionaries, for example, work well. Language files are stored in the "lang" subdirectory.
These kinds of dictionaries have one problem, however. One can draw any random word from them, but the resulting text would not "feel" like real language, because in real language some words (like "the" and "of" in English) appear more often than others.
On the Wortschatz website, the University of Leipzig in Germany provides lists of the most frequent words in four different languages (English, German, Dutch and French), each ordered by the words' frequency. RoboCage comes with those four word lists, each containing the 10,000 most frequent words of its language.
Depending on what type of input file you use, select the appropriate option in the management interface. RoboCage works right out of the box, so that only the language needs to be set. If you're using a different dictionary file (e.g. for a language other than the four provided) that is alphabetically sorted, you should change that option accordingly.
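As a rough illustration of the file format described above, the following Python sketch reads a language file with one word per line. The function names, the encoding default and the handling of ispell-style "/FLAGS" suffixes are assumptions made for this example, not RoboCage's actual code.

```python
def parse_words(lines):
    """Return the words found in an iterable of text lines, in file order."""
    words = []
    for line in lines:
        # Assumption: ispell-style entries may carry affix flags after a
        # slash ("walk/DGS"); keep only the word itself.
        word = line.strip().split("/", 1)[0]
        if word:  # skip blank lines
            words.append(word)
    return words


def load_words(path, encoding="utf-8"):
    """Read a language file; the encoding default is a guess, adjust as needed."""
    with open(path, encoding=encoding) as handle:
        return parse_words(handle)
```

The order of the lines is preserved, which matters when the file is sorted by word frequency.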
For a word list that has been ordered by frequency, each word is assigned a relative weight (relative meaning relative to the other words), so that the first few words appear more often than the last few. The Nth word in the list gets the relative weight 1/N: the first word has the weight 1/1, the second 1/2, the third 1/3 and so on. The sum of all weights corresponds approximately to the area underneath the graph of 1/x from 1 to Z, where Z is the total number of words. The area underneath 1/x is given by ln(x), the natural logarithm function, whose inverse function is e^x. Therefore, a random number R between 0 and ln(Z) corresponds to a certain word: the Nth word, where N is e^R rounded to a whole number.
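The mapping from R to N described above can be sketched in a few lines of Python. This is only an illustration of the idea, not the code RoboCage ships with: the function names are made up for this example, and rounding e^R down (rather than to the nearest number) is an assumption.

```python
import math
import random


def pick_weighted_index(word_count, rng=random):
    """Return a 0-based index into a frequency-ordered list of word_count words."""
    # R is drawn uniformly between 0 and ln(Z).
    r = rng.uniform(0.0, math.log(word_count))
    # N = e^R lies between 1 and Z; rounding down is an assumption.
    n = int(math.exp(r))
    # Guard against floating-point edge cases at either end of the range.
    n = min(max(n, 1), word_count)
    return n - 1


def random_text(words, length, rng=random):
    """Join `length` words drawn so that early (frequent) words dominate."""
    return " ".join(words[pick_weighted_index(len(words), rng)]
                    for _ in range(length))
```

With a list of 1000 words, the first word is chosen when R falls below ln(2), which is about one draw in ten, while the last word is almost never chosen.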
In case of an alphabetically ordered or an unordered list, all words have the same probability. Email addresses and links back to RoboCage itself are randomly spread throughout the page.
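A minimal Python sketch of the uniform case, with fake email addresses and self-links sprinkled in, might look like this. All names, the 5% rates, the address format and the URL layout are hypothetical and chosen only for illustration; they are not taken from RoboCage itself.

```python
import random


def fake_email(words, rng=random):
    # Hypothetical address format built from two random words.
    return "%s@%s.example" % (rng.choice(words), rng.choice(words))


def self_link(words, base_url, rng=random):
    # A link back to the same site, under a different random path each time.
    return '<a href="%s/%s">%s</a>' % (
        base_url, rng.choice(words), rng.choice(words))


def cage_page(words, length, base_url,
              email_rate=0.05, link_rate=0.05, rng=random):
    """Build a page body of `length` items: words, fake emails and self-links."""
    tokens = []
    for _ in range(length):
        roll = rng.random()
        if roll < email_rate:
            tokens.append(fake_email(words, rng))
        elif roll < email_rate + link_rate:
            tokens.append(self_link(words, base_url, rng))
        else:
            tokens.append(rng.choice(words))  # every word equally likely
    return " ".join(tokens)
```

Because every link points back into the cage under a new URL, a robot that follows them keeps receiving fresh pages of generated text and fake addresses.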
Credits
- Thanks to Martijn and the other guys from the EuroZope BBQ. It was fun!
- Thanks to some guys at #[email protected] for helping me out with the random text generator.
- The actual credit goes to JohnPC, I guess, his "bla" being the mother of all robot catchers.
License
RoboCage is distributed under the terms of the GNU GPL.