This package contains a stripped-down version of the SpamBayes classifier, with the following changes:
- The classifier and tokenizer code has been kept. All other code has been removed.
- The tokenizer has been stripped down and simplified. In particular, all code designed specifically for email parsing has been removed.
- The ClassifierDb class has been reduced to a simple dict subclass. The custom pickling code has been removed, as have all database backends.
- The remaining code has been updated and made compatible with Python 3.
- An orthogonal sparse bigram (OSB) transformation has been added.
- Unicode handling has been improved.
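To illustrate the OSB transformation mentioned in the list above, here is a minimal sketch of the general technique. It is not sbclassifier's own code: the function name `osb_pairs`, the `window` parameter, and the tuple-shaped features are assumptions made for this example. The idea is to pair each token with each of the few tokens that follow it, recording how many tokens were skipped in between, so that short phrases become features in their own right.

```python
def osb_pairs(tokens, window=5):
    """Yield (first, skipped, second) features for a token sequence.

    Hypothetical helper: each token is paired with up to `window - 1`
    tokens that follow it; `skipped` counts the tokens in between.
    """
    tokens = list(tokens)
    for i, first in enumerate(tokens):
        for gap in range(1, window):
            j = i + gap
            if j >= len(tokens):
                break  # ran off the end of the message
            yield (first, gap - 1, tokens[j])


if __name__ == "__main__":
    for feature in osb_pairs("buy cheap traffic now".split(), window=3):
        print(feature)
```

Some OSB descriptions pair each token with the tokens that precede it instead; both conventions generate the same set of pairs, and the feature format the package actually uses may differ from the tuples shown here.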
I use sbclassifier to protect websites against contact form spam.
Even with a training set of only a handful of spam and non-spam messages, it is already useful. Once the training set grows beyond about 20 messages of each type, I am happy to let it filter out the most obvious spam.
The above script will print out:
0.902 [('*H*', 0.104), ('*S*', 0.908), ('can', 0.155), ('for', 0.845), ('service', 0.845), ('traffic', 0.845), ('and', 0.908)]
sbclassifier assigns a 90% probability to this unknown message being spam. It can also produce a sequence of (word, probability) pairs that reveals the tokens that were important in this calculation.
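The `*H*` and `*S*` entries in the output above are the combined ham and spam evidence, and the final score is consistent with `(S - H + 1) / 2`, the chi-squared combining scheme used by SpamBayes. The sketch below reconstructs that calculation from the rounded token probabilities printed above. It is an illustration of the technique, not the package's implementation: the names `chi2q` and `combine` are chosen for this example, and the underflow handling needed for long messages is left out.

```python
import math


def chi2q(x2, dof):
    """Chi-squared survival function, valid for an even number of degrees of freedom."""
    assert dof % 2 == 0
    m = x2 / 2.0
    term = total = math.exp(-m)
    for i in range(1, dof // 2):
        term *= m / i
        total += term
    return min(total, 1.0)


def combine(probs):
    """Return (H, S, score) for a list of per-token spam probabilities."""
    n = len(probs)
    ln_ham = sum(math.log(p) for p in probs)         # evidence against ham
    ln_spam = sum(math.log(1.0 - p) for p in probs)  # evidence against spam
    s = 1.0 - chi2q(-2.0 * ln_spam, 2 * n)
    h = 1.0 - chi2q(-2.0 * ln_ham, 2 * n)
    return h, s, (s - h + 1.0) / 2.0


# Token probabilities copied from the example output (rounded to 3 digits).
clues = [0.155, 0.845, 0.845, 0.845, 0.908]
h, s, score = combine(clues)
print(round(h, 3), round(s, 3), round(score, 3))
```

Because the inputs are rounded, the printed values should land close to the `*H*`, `*S*` and overall score shown above rather than match them digit for digit.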
The SpamBayes source repository contains a wealth of information on how and why the classifier works the way it does, as does the SpamBayes wiki.
Copyright (C) 2002-2013 Python Software Foundation; All Rights Reserved
The Python Software Foundation (PSF) holds copyright on all material in this project. You may use it under the terms of the PSF license; see LICENSE.txt.