
HOWTO SpamAssassin Site-Wide Training For All Mailboxes on Postfix Mail Server

UPDATE: Avoid RedwoodVirtual for Linux hosting right now. My server has been out 48 hours and they have been nearly unreachable. See my post for more info.

This is a follow-up post to improve SpamAssassin training from my earlier HowTo on setting up a Debian Postfix mail server. My earlier posts on setting up a Debian mail server are here and here.

Since I have a number of people with mailboxes on my mail server, I wanted to set up the spam filtering so that training on one account (mine) benefited everyone on the system.

The changes are working quite well. Once again, I got assistance from Kellan, who pointed me at this site for information on what is called site-wide Bayesian filtering for SpamAssassin.

Bayesian filtering is the statistical learning technique SpamAssassin uses to learn to identify spam. Because scum-spammers regularly change their outbound spam to avoid SpamAssassin, you must regularly train your system to keep up with their latest techniques.

First, I changed the local.cf file in /etc/mail/spamassassin:

# This is the right place to customize your installation of SpamAssassin.
#
# See 'perldoc Mail::SpamAssassin::Conf' for details of what can be
# tweaked.
#
###
#
# rewrite_subject 0
# report_safe 1
# trusted_networks 212.17.35.
bayes_path /etc/mail/spamassassin/bayes
required_hits 3.5

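bayes_path tells SpamAssassin where to keep the knowledge it gains from me about spam. The value is a path plus a filename prefix, so with this setting the Bayes database files live in /etc/mail/spamassassin with names starting with bayes. If you don't have this, SpamAssassin keeps a separate dictionary for each user (by default under ~/.spamassassin). Then, every user has to train SpamAssassin individually.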

I also changed the permissions on these files to make sure other users would be able to read them for spam detection. I'm not sure if this was necessary:
> cd /etc/mail/spamassassin
> chmod -R 775 b*
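
If you want to check the permissions instead of guessing, here is a minimal sketch. It assumes the Bayes files share the prefix given in bayes_path and that "readable by other users" means the world-read bit is set. The helper name unreadable_bayes_files is made up for illustration and is not part of SpamAssassin.

```shell
#!/bin/sh
# Sketch: list Bayes database files that other users cannot read.
# unreadable_bayes_files is a hypothetical helper; its argument mirrors
# the bayes_path value from local.cf (a directory plus filename prefix).
unreadable_bayes_files() {
    prefix=$1
    dir=$(dirname "$prefix")
    base=$(basename "$prefix")
    # -perm -004 matches files whose "other: read" bit is set; "!" negates
    # it, so only files that other users cannot read are printed.
    find "$dir" -maxdepth 1 -type f -name "${base}*" ! -perm -004
}
```

Running unreadable_bayes_files /etc/mail/spamassassin/bayes should print nothing once the chmod above has been applied. Any path it does print is a file other users cannot read.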

required_hits is a threshold (newer SpamAssassin releases call this setting required_score). The lower the number, the stricter the spam routing becomes. The lower you go, the more likely you'll have valid messages misrouted as spam to your junkmail folder. 3.5 seems to work well for me. 5.0 is the default.
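
To make the threshold concrete, here is a small sketch that compares a few example scores against 3.5 and against the default of 5.0. It assumes a message counts as spam when its score is greater than or equal to the threshold. The scores are invented for illustration, and exceeds_threshold is a hypothetical helper, not a SpamAssassin command.

```shell
#!/bin/sh
# Sketch: apply a required_hits-style threshold to a message score.
# exceeds_threshold is a hypothetical helper, not part of SpamAssassin.
exceeds_threshold() {
    score=$1
    threshold=$2
    # awk does the decimal comparison; exit status 0 means "treated as spam".
    awk -v s="$score" -v t="$threshold" 'BEGIN { exit !(s + 0 >= t + 0) }'
}

# Made-up scores, checked against my setting (3.5) and the default (5.0).
for score in 3.2 3.5 4.1 5.0; do
    for threshold in 3.5 5.0; do
        if exceeds_threshold "$score" "$threshold"; then
            verdict=spam
        else
            verdict=ham
        fi
        echo "score=$score threshold=$threshold -> $verdict"
    done
done
```

A message scoring 4.1 is routed as spam at 3.5 but passes at 5.0. That is the trade-off: the lower threshold catches more spam and misroutes more borderline mail.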

I restarted SpamAssassin to switch to the new settings:
> sudo /etc/init.d/spamassassin restart

I use Mozilla Thunderbird for reading email. I've set up Thunderbird to route suspected spam to a Junkmail folder: Thunderbird understands the headers SpamAssassin adds to my mail and routes suspected spam accordingly. However, valid messages that just happen to resemble spam may sometimes be moved to the Junkmail folder. And some spam may be missed and left in my Inbox - this happens whenever scum-spammers change the structure of their messages.
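
For the curious, here is a rough sketch of how a script could check the SpamAssassin verdict in a message's headers. It assumes tagged spam carries an "X-Spam-Flag: YES" header. Which header your mail client actually reads is a separate matter and may differ. message_is_flagged is a hypothetical helper name.

```shell
#!/bin/sh
# Sketch: succeed if the message on stdin carries "X-Spam-Flag: YES".
# message_is_flagged is a hypothetical helper, not part of SpamAssassin.
message_is_flagged() {
    awk '
        # The header block ends at the first blank line; stop reading there.
        /^$/ { exit }
        tolower($0) ~ /^x-spam-flag:[ \t]*yes/ { found = 1; exit }
        # END still runs after "exit" above and sets the final status.
        END { exit !found }'
}
```

For example, message_is_flagged < /path/to/one/message succeeds only if that message was tagged as spam. The path is a placeholder.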

Next, I needed to pre-train my dictionary with a large collection of valid messages. Review your inbox - and get rid of all the spam in it. Or, create a new folder and move only non-spam to it. We're going to use this directory to train SpamAssassin once with a large set of your spam-free/valid messages.

> sudo sa-learn --ham -C /etc/mail/spamassassin --showdots --dir /home/myusername/Maildir/.Inbox/cur

Next, I created two new folders in Thunderbird:

VerifiedSpam
VerifiedSpamMisses

At this point, if you have a large collection of known Spam messages, place them in VerifiedSpam and train SpamAssassin once against it:

> sudo sa-learn --spam -C /etc/mail/spamassassin --showdots --dir /home/myusername/Maildir/.VerifiedSpam/cur

That helps set up SpamAssassin to recognize known Spam. Then, on a regular basis, you'll need to keep up with the following processes.
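
To see how much SpamAssassin has learned so far, sa-learn can dump its counters with --dump magic. SpamAssassin does not use the Bayes results until it has learned a minimum number of both spam and ham messages (200 of each by default). The sketch below summarizes that dump. It assumes the counters appear on lines ending in "non-token data: nspam" and "non-token data: nham" with the count in the third column, and bayes_ready is a hypothetical helper name.

```shell
#!/bin/sh
# Sketch: summarize "sa-learn --dump magic" output read from stdin.
# bayes_ready is a hypothetical helper. The assumed line layout is
#   <prob> <spam> <count> <atime>  non-token data: <name>
# so the message count is taken from the third field.
bayes_ready() {
    # Optional first argument: minimum messages of each kind (default 200).
    awk -v min="${1:-200}" '
        /non-token data: nspam$/ { nspam = $3 }
        /non-token data: nham$/  { nham = $3 }
        END {
            printf "nspam=%d nham=%d\n", nspam, nham
            # Exit status 0 only when both counters reach the minimum.
            exit !(nspam >= min && nham >= min)
        }'
}
```

Usage would look like this:
> sudo sa-learn -C /etc/mail/spamassassin --dump magic | bayes_ready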

I've set up Thunderbird to move any messages I manually mark as spam - perhaps spam that it and SpamAssassin missed - to the VerifiedSpam folder.

Every few days, I review my Junkmail folder and move all correctly identified spam to the VerifiedSpam directory. And, I move any valid messages that were incorrectly routed to the Junkmail folder to the VerifiedSpamMisses folder. It's important for SpamAssassin to learn from its mistakes as well as its successes.

Make sure to do the above tasks carefully. If you make mistakes, you'll weaken SpamAssassin's effectiveness - essentially training it backwards.

Now, you're ready to set up a cron job that trains SpamAssassin daily on your verified spam as well as on the valid messages it misidentified as spam.

In /etc/cron.d, I've created a spamtrain file:
> touch /etc/cron.d/spamtrain

MAILTO=me@myaddress.org
#This tells sa-learn to learn from spam in my VerifiedSpam folder
20 1 * * * root sa-learn --spam -C /etc/mail/spamassassin --showdots --dir /home/myusername/Maildir/.VerifiedSpam/cur/
#This tells sa-learn to learn the non-spam it mistook for spam from my VerifiedSpamMisses folder
35 1 * * * root sa-learn --ham -C /etc/mail/spamassassin --showdots --dir /home/myusername/Maildir/.VerifiedSpamMisses/cur

Now, every day at 1:20 AM and 1:35 AM, SpamAssassin improves its spam detection by learning from the spam I manually mark as well as the mistakes it made which I also manually marked.
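
If you would rather have cron mail you only when training fails, the two commands can be wrapped in a small script. This is a sketch under a few assumptions: train_folder is a hypothetical helper, the SA_LEARN variable exists only so the command can be swapped out for a dry run, and --showdots is dropped so a successful run prints nothing.

```shell
#!/bin/sh
# Sketch of a cron wrapper around sa-learn.
# train_folder is a hypothetical helper. SA_LEARN can be overridden for a
# dry run and defaults to the real sa-learn command.
SA_LEARN=${SA_LEARN:-sa-learn}
CONF_DIR=/etc/mail/spamassassin

# train_folder <spam|ham> <maildir folder>
train_folder() {
    kind=$1
    folder=$2
    # A folder that does not exist yet is not an error; there is nothing to learn.
    [ -d "$folder" ] || return 0
    # Stay quiet on success so cron only sends mail when something fails.
    if ! output=$("$SA_LEARN" "--$kind" -C "$CONF_DIR" --dir "$folder" 2>&1); then
        echo "sa-learn --$kind failed for $folder:" >&2
        echo "$output" >&2
        return 1
    fi
}

# Example calls, matching the folders used in this post:
# train_folder spam /home/myusername/Maildir/.VerifiedSpam/cur
# train_folder ham  /home/myusername/Maildir/.VerifiedSpamMisses/cur
```

With the example calls uncommented, a single cron line pointing at this script would replace the two sa-learn lines above.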

By the way, Ham = Non-spam messages.

The -C /etc/mail/spamassassin part tells sa-learn to read its configuration from that specific directory, which is where local.cf lives. The bayes_path in local.cf then tells SpamAssassin to use the shared dictionary there for every mailbox user on your server.

Now, I do the work of training SpamAssassin for my account - and all the users on my server benefit from a more intelligent Spam detection dictionary.

Enjoy. Please post any comments below.

LinuxToday also posted this story about WhiteListing email using Postfix and MySQL to protect children from pornographic spam.

Comments

Jean-Marc Liotier

Here is how I set up a similar solution. It is working for my users.

http://www.ruwenzori.net/teach-sa/teach-sa.html

gianluca

Hi
For site-wide SpamAssassin training, I suggest creating a new account to which every user sends their spam mail.
Your solution puts the mail into a specific Maildir for *only* your user,
so covering more mail addresses means adding another line to the crontab for each one.
It is also possible to run sa-learn from procmailrc.

Bye Bye
Gianluca

gimili

How about a script to make it learn from all Maildirs as follows:

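#!/bin/bash

# Train the site-wide Bayes database from every user's Maildir under /home.
DNAME="/home"
for MAILDIR in "$DNAME"/*/Maildir
do
    # Skip users without a Maildir (the glob stays literal when nothing matches).
    [ -d "$MAILDIR" ] || continue
    # Print the user name, i.e. the directory that contains Maildir.
    USERDIR=${MAILDIR%/Maildir}
    echo "${USERDIR##*/}"
    # Learn only from the folders this user actually has.
    if [ -d "$MAILDIR/.JunkmailVerified/cur" ]
    then
        sa-learn --spam -C /etc/mail/spamassassin --showdots --dir "$MAILDIR/.JunkmailVerified/cur/"
    fi
    if [ -d "$MAILDIR/.JunkmailMistake/cur" ]
    then
        sa-learn --ham -C /etc/mail/spamassassin --showdots --dir "$MAILDIR/.JunkmailMistake/cur/"
    fi
done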

dan

I think the best way is to have a courier shared directory for ham and spam. Then users simply move stuff to the appropriate folders. I set the permissions so they cannot see others' spam/ham. I have a cron job that cleans the spam folder. I don't clean the ham folder in case someone moves one there by mistake.

Veit Nachtmann

I believe I've got a more sophisticated, therefore cooler approach :D
Executed by Cron, therefore I (root) get mail if something goes wrong...:

#!/bin/bash

LOG=/root/spamlearn.log

# Learn spam, then delete each message that was learned successfully.
# -print0 with read -d '' keeps file names with unusual characters intact.
find /home -type f -path '*/Maildir/.Learn-SPAM/*/*' -print0 |
while IFS= read -r -d '' filename
do
    if sa-learn --debug --spam "$filename" > "$LOG" 2>&1
    then
        rm "$filename"
    else
        echo "Error! Could not read $filename!"
        cat "$LOG"
    fi
done

# Learn ham, strip the SpamAssassin markup and put the message back in the Inbox.
find /home -type f -path '*/Maildir/.Learn-HAM/*/*' -print0 |
while IFS= read -r -d '' filename
do
    # Everything before /Maildir is the user's home directory.
    user=${filename%%/Maildir*}
    mailname=$(basename "$filename")
    target="$user/Maildir/new/$mailname"
    sa-learn --ham "$filename" > "$LOG" 2>&1
    spamassassin -d < "$filename" > "$target"
    chmod --reference="$filename" "$target"
    chown --reference="$filename" "$target"
    # -s: only remove the original if the restored copy exists and is not empty.
    if [ -s "$target" ]
    then
        rm "$filename"
    else
        echo "Error! Could not move HAM back to Inbox!"
        echo "Printing logs..."
        cat "$LOG"
    fi
done
