SpamCorral Documentation

2. Spam Notifier

2.1 Description

If you use a sendmail virus/spam filter, such as MailCorral (described elsewhere on this site), the filter should redirect all received spam to the spam corral, instead of delivering it to the spammer's intended victim. This method of never actually delivering a single piece of spam is a good way of dealing with it but one that could go astray when the filter stops a piece of spam which is actually valuable (stranger things have happened). It is even possible that a non-spam message might be misidentified as spam and stopped by mistake.

How do we resolve this dilemma? In the end, the decision about whether a piece of spam is valuable or not is best left up to the intended recipient. After all, they are really the only ones who know whether they actually want to see the spammer's message or not. But, if the recipient is shown every piece of spam and asked to make a decision about whether they want to see it or not, the solution is no better than the problem. The compromise solution is to only ask them once or twice a day, in a single message, and to provide a summary that contains sufficient information to allow them to decide, quickly and easily, whether they want to see the spam or not.

The SpamCorral notification program (SpamNotify.pl) can be run periodically by a cron job, whereupon it will send email notification messages to all of the recipients of spam. It summarizes each of the messages received since the last time it was run, giving the sender's address, the subject, the delivery date/time and the associated spam statistics (as a percentage, with 100% being the threshold for classification of a message as spam). This provides, on a timely basis, yet without being annoying, enough information for a person to decide whether they would like to see any of the spam they have received. Normally, the answer is no, in which case, they need do nothing.

2.2 How it Works

SpamNotify.pl scans through the email spam corral, searching for messages to send notifications for. Normally, it will find all of the messages that have arrived since the last time it was run (i.e. those later than the timestamp of the last run). However, you may alter this behaviour by using one or both of its optional patterns to look only for specific messages.

If you do elect to use a pattern, the full flexibility of regular expression searching is available. The patterns given are evaluated using the Perl eval function. Consequently, if you can describe what you are looking for with a regular expression, you can find it in the spam corral.

If this is not sufficient, the even fuller flexibility of a block of code is available for the more difficult problems. If you can code your search criteria into a block of Perl code, you can apply them to messages in the spam corral.

For each spam message that matches the search criteria (either later than the last timestamp or one or both of the patterns), a notification message is sent to the recipient, indicating that there is spam waiting for them in the corral. A brief description of each spam message is given in the notification, so as to give the recipient a chance to decide whether they want to see it or not. All of the matched spam messages for a single user are sent in one notification.
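The default selection and grouping described above can be sketched as follows. This is only an illustration in Python, not the code in SpamNotify.pl: the tuple layout, the addresses and the function name build_notifications are all made up for the example, and the timestamps are plain integers.

```python
from collections import defaultdict

# Hypothetical corral entries: (arrival time, recipient, sender, subject, score %).
corral = [
    (1000, "alice@example.com", "deals@example.net", "Cheap loans", 140),
    (2000, "bob@example.com", "promo@example.net", "You won", 210),
    (3000, "alice@example.com", "news@example.org", "Act now", 105),
]

def build_notifications(corral, last_run):
    """Group every message that arrived after last_run under its recipient."""
    notifications = defaultdict(list)
    for arrived, recipient, sender, subject, score in corral:
        if arrived > last_run:
            # One summary entry per spam message, all in the same notification.
            notifications[recipient].append((sender, subject, score))
    return dict(notifications)

pending = build_notifications(corral, last_run=1500)
for recipient, items in pending.items():
    print(recipient, len(items))
```

Each key in the result corresponds to one notification message, however many spam messages that recipient has waiting.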

2.2.1 Input

The input to this script is the spam corral, which is expected to be populated by the MailCorral sendmail milter "sendmailfilter.c". Essentially, each message in the corral is stored exactly as it was received, except for a single line of statistical information in a special header at the very front of the message.

If you wish to use this program on mail messages corralled by some other mail filter, you should easily be able to alter the portion of the code that looks for the statistical information in the header.

2.2.2 Output

The output will be one notification mail message sent to each individual recipient of spam with messages that match the search criteria. In the notification message, there will be an explanatory paragraph or two plus one paragraph for each spam message. In these paragraphs, the sender, recipient, subject, spam statistics and any annotation from a code block will appear.

The recipient of the notification will be able to use the information therein to release any spam that they particularly care about for delivery to themselves.

2.3 Command Line Parameters

SpamNotify.pl is invoked by the following command line:

SpamNotify.pl [--Config=configfile] [--Corral=/corral/dir] [--Debug] [--Partition[=interval]]
          [--SearchSpan=days] <uid_pattern> <search_pattern>

The meaning and usage of each of these parameters is:

--Config The name of a file containing program configuration information. This file is interpreted via the eval function so that it may contain variable assignments, etc. All of the configuration variables described in the "Configure the Message Remailer" section (e.g. $NOTIFYMSG) may be set in this file. The default file is a file named "SpamNotify.cf" in the same directory as this program. If the file doesn't exist, the internal defaults are used instead.

To use this file, set the variables as you would in Perl code. For example:

$SPAMROBOT = "myrobot";

--Corral The name of the directory where all of the incoming spam has been corralled by the sendmail spam filter. If not specified, the default directory is "/var/spool/MailCorral".

--Debug Turn on debugging. Dumps possibly helpful trace information to stdout if enabled.

--Partition Turn on corral partitioning. If no value is specified, the corral is partitioned in 24 hour intervals. If a value is specified, it must be a number from 1 to 24, giving the interval in hours into which the corral should be partitioned. It is probably best to choose an interval that divides evenly into 24 (i.e. 1, 2, 3, 4, 6, 8, 12 or 24).

On many systems, the amount of spam kept in the corral can approach epic proportions. For example, on a system with 5000 users, that keeps its viruses/spam for 10 days, where the users receive an average of 10 virus/spam messages per day, the corral would contain 500,000 messages. This number of messages can tax any file system and prove burdensome to process.

By partitioning the corral into intervals, the total number of messages in a single corral directory can be reduced significantly (e.g. in the above example with a partition interval of 24, the average directory would only contain 50,000 messages).
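The arithmetic behind these figures can be checked with a short sketch. It is written in Python purely for illustration and assumes that spam arrives evenly and that each interval gets its own directory; the variable and function names are invented for the example.

```python
# Figures taken from the example above.
users = 5000
messages_per_user_per_day = 10
retention_days = 10

total_messages = users * messages_per_user_per_day * retention_days

def messages_per_partition(total, retention_days, interval_hours):
    """Average messages per partition directory, assuming even arrival."""
    partitions = retention_days * 24 // interval_hours
    return total // partitions

# Partition intervals (in hours) that divide evenly into a 24 hour day.
even_intervals = [h for h in range(1, 25) if 24 % h == 0]

print(total_messages)                                              # 500000
print(messages_per_partition(total_messages, retention_days, 24))  # 50000
print(messages_per_partition(total_messages, retention_days, 6))   # 12500
print(even_intervals)              # [1, 2, 3, 4, 6, 8, 12, 24]
```

Under the same assumptions, a shorter interval such as 6 hours brings the average directory down to 12,500 messages.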

Processing of corralled messages is also faster, since the messages in a partition that don't apply to the time period being scanned are bypassed altogether.

Note that partitioning can be turned on/off at any time but, once messages have been partitioned, it will take some time for them to work their way out of the system. If you do turn on partitioning, be sure that you have a version of filterclean from MailCorral version 1.0.16 or later, otherwise the partitions won't get cleaned up.

--SearchSpan When searching for a particular userid or pattern, this parameter tells the notifier how many days back from today it should search. The default is 30 days.

<uid_pattern> A userid pattern to use in searching the corralled spam for messages to send notifications for. Normally, all messages received after the last timestamp are processed. Supplying this pattern alters that behaviour. Usually, it is used for testing purposes.

A regular expression is used, which is matched against the userid that the spam was sent to. It is evaluated by the Perl eval function. The leading and trailing forward slash must not be supplied by you. These slashes will be supplied by this program when it makes up the expression to be evaluated. Also, you need not worry about case-sensitivity because case-insensitive mode is forced.

Note that special characters must be escaped using the backslash character. Unfortunately, this is also the shell escape character so the backslash must be doubled-up on the command line, if unquoted. For example, to get the pattern shown, use:

myhost\\.com  command line
myhost\.com   yields pattern
myhost.com    searches for
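
The effect of the doubled backslash can be tried out without running the notifier. The sketch below uses Python's shlex module to mimic how a POSIX shell splits an unquoted command line and uses re.IGNORECASE as a stand-in for the forced case-insensitive match; it is an approximation of the behaviour described here, not the program's own code, and the addresses are made up.

```python
import re
import shlex

def pattern_seen_by_program(command_line):
    """Return the first argument as a POSIX shell would pass it on."""
    return shlex.split(command_line)[1]

def uid_matches(pattern, address):
    """Case-insensitive regular expression match against the userid."""
    return re.search(pattern, address, re.IGNORECASE) is not None

doubled = pattern_seen_by_program(r"SpamNotify.pl myhost\\.com")
single = pattern_seen_by_program(r"SpamNotify.pl myhost\.com")

print(doubled)  # myhost\.com  (the dot is still escaped)
print(single)   # myhost.com   (the shell ate the backslash)

print(uid_matches(doubled, "joe@MyHost.com"))   # True
print(uid_matches(doubled, "joe@myhostXcom"))   # False
print(uid_matches(single, "joe@myhostXcom"))    # True: "." matches any character
```

Quoting the pattern on the command line (e.g. in single quotes) avoids the need to double the backslash.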

Also note that, if the corral has been partitioned by the "--Partition" option, the search will only extend backwards from the current time for 30 days.

<search_pattern> A pattern to search the contents of each of the spam messages for. Any message which contains this pattern will have a notification sent for it. Normally, all messages received after the last timestamp are processed. Supplying this pattern alters that behaviour. Usually, it is used for testing purposes.

Note that, if you use this pattern, the <uid_pattern> must also be supplied. A match-all pattern of ".*" may be used if you don't care about matching user names.

Also note that, if the corral has been partitioned by the "--Partition" option, the search will only extend backwards from the current time for 30 days.

The pattern can be either a regular expression or a block of Perl code that returns true/false. Whichever is used, it will be evaluated by the Perl eval function. If a regular expression is used, the same rules that apply to <uid_pattern> (above) apply here.

If you wish to supply a block of code to be evaluated, it must begin and end with "{" and "}". If it does not, it will be treated as a regular expression and evaluated as such.

The last statement in the block of code must evaluate to a boolean value. This value must be true if you wish the search to match the message in question and false if you do not.
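
The rule for telling a block of code from a regular expression can be sketched as below. This is a Python illustration of the rule as stated above, not the program's own logic; in particular, it assumes the braces must be the very first and last characters of the argument, and the function name classify_pattern is invented. The sample strings are Perl, since that is what the notifier would be given.

```python
def classify_pattern(search_pattern):
    """Return 'code' for a {...} block, otherwise 'regex'."""
    if search_pattern.startswith("{") and search_pattern.endswith("}"):
        return "code"
    return "regex"

# A block whose last statement is a boolean test on the subject line.
print(classify_pattern("{ $Subject =~ /mortgage/i }"))  # code

# A bare regular expression, matched against the whole message.
print(classify_pattern("mortgage"))                     # regex

# A missing closing brace means it is treated as a regular expression.
print(classify_pattern("{ $Subject =~ /mortgage/i"))    # regex
```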

The block of code has the following pre-defined variables available to it:

$Message - the buffer holding the entire message, including the headers, as it was sent. Note that, if you don't use code, the pattern will be matching against this variable.

$Headers - only the headers portion of the message.

$Body - the body portion of the message, minus any headers.

$From - the sender's from address.

$To - the recipient's to address.

$Subject - the subject header line.

$Date - the message's date header line. The format, according to the SMTP RFCs, is:

Day, dd Mon yyyy hh:mm:ss +/-gggg

Where "Day" is the name of the day of the week and "dd Mon yyyy" is the day of the month, month name and year. "hh:mm:ss" is the time of day and "+/-gggg" is the offset in hours and minutes from GMT.
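
To see how a date line in this format breaks down, the sketch below parses a made-up sample with Python's standard email.utils module. It is only meant to illustrate the layout of the header; a block of code passed to the notifier would have to pick $Date apart in Perl.

```python
from datetime import timedelta
from email.utils import parsedate_to_datetime

# A made-up date header in the layout described above.
sample = "Tue, 04 Mar 2003 14:05:09 -0500"

parsed = parsedate_to_datetime(sample)

# Offset from GMT, expressed in hours.
offset_hours = parsed.utcoffset() / timedelta(hours=1)

print(parsed.isoformat())   # 2003-03-04T14:05:09-05:00
print(parsed.strftime("%A"))  # Tuesday
print(offset_hours)         # -5.0
```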

$ReplyTo - the reply-to header line. Probably pretty similar to $From.

$ContentType - the content-type header line. This follows the MIME rules in the RFCs.

$Importance - the importance header line.

$MatchCount - a running count of the number of messages matched so far (does not include the current message).

$TotalCount - a running count of the number of messages processed so far (includes the current message).

$Remarks - may be set to any value by the block of code. This variable will be appended to the notification generated for any matching message. It can be used to annotate matched messages or to send information to the users.
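
The difference between $MatchCount and $TotalCount is easiest to see with a small simulation. The sketch below is Python and purely illustrative: it assumes the counters are updated in the order their descriptions imply (the total before the block runs, the match count after), and the function name scan and the sample messages are invented.

```python
def scan(messages, matches):
    """Record the (match_count, total_count) pair visible for each message."""
    match_count = 0
    total_count = 0
    seen = []
    for message in messages:
        total_count += 1                         # includes the current message
        seen.append((match_count, total_count))  # what the block would see
        if matches(message):
            match_count += 1                     # counted only after the test
    return seen

counters = scan(["spam a", "ham", "spam b"], lambda m: m.startswith("spam"))
print(counters)  # [(0, 1), (1, 2), (1, 3)]
```

Under that assumption, a block that wanted to stop matching after the first ten hits would test $MatchCount against 10, since it has not yet been incremented for the current message.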