MAILCORRAL FILTERED ITEMS

Viruses

MailCorral tries to disable or delete all harmful items found in the email messages it processes, whether they be attachments or objects embedded directly in the messages. The disabled items, such as viruses and other malicious bits of executable code, can often damage or destroy any system that receives the messages. Disabling or deleting these items prevents them from having their otherwise disastrous effect.

The internal, rules-based virus checking code looks for the following items and handles them as noted (for a complete list of all items handled, see the filter tables in the program code):

If an optional virus arbitration daemon (arbitron), such as ClamAV, is to be invoked through MailCorral's built-in interface, MailCorral sends each component of the received message that is important from the standpoint of virus arbitration to the virus arbitron via a pipe. The arbitron renders a decision as to whether the component contains a virus or not. One or more arbitrons may be asked to render a decision on a component before a final decision is reached. The results of each determination can be evaluated using a logical expression that returns a go or no-go decision. Depending on how the expression is written, the internal rules may be ignored, a certain arbitron's results can be given more weight, etc.

Based on the outcome of expression evaluation, the message component can be kept, deleted or have its name mangled (so it cannot accidentally be opened and executed). A warning or error message can be inserted into the message body and the message can be delivered or held for specific release.
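
As a rough illustration of how the results of several checkers might be combined by a logical expression, here is a minimal Python sketch. The function name, the verdict names ("internal", "clamav") and the and/or/not syntax are assumptions made for this example only; they are not MailCorral's actual expression language.

```python
import ast

def evaluate_verdicts(expression, verdicts):
    """Evaluate a boolean expression over named verdicts.

    verdicts maps a checker name to True when that checker flagged the
    component. The return value is True for "no-go" (treat as infected).
    """
    tree = ast.parse(expression, mode="eval")

    def walk(node):
        if isinstance(node, ast.Expression):
            return walk(node.body)
        if isinstance(node, ast.BoolOp):
            values = [walk(v) for v in node.values]
            return all(values) if isinstance(node.op, ast.And) else any(values)
        if isinstance(node, ast.UnaryOp) and isinstance(node.op, ast.Not):
            return not walk(node.operand)
        if isinstance(node, ast.Name):
            return bool(verdicts[node.id])
        # Anything else (calls, numbers, attribute access) is rejected.
        raise ValueError("unsupported expression element")

    return walk(tree)

# Hypothetical results for one message component.
verdicts = {"internal": True, "clamav": False}
ignore_internal = evaluate_verdicts("clamav", verdicts)
either_checker = evaluate_verdicts("internal or clamav", verdicts)
```

Writing the expression as just "clamav" shows how the internal rules could be ignored, while "internal or clamav" rejects the component if either checker objects.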

Please bear in mind that, while every attempt was made to create code that would nullify all of the malicious items known at the time that MailCorral was written, virus writers and their ilk are very creative. As the (dare we say) art of virus writing progresses, new and improved viruses may find ways to get past MailCorral. However, just as sex without a condom is more dangerous than with, email without a filter is more dangerous than with. MailCorral is sure better than nothing (some testimonial, huh). We have been using it ourselves for over ten years and, so far, nothing has squeaked through.

Whitelists And Blacklists

Spam classification can be a costly process if all of the checks for spam are applied to a message. MailCorral supports an early determination method that relies on whitelists and blacklists. It can get its lists from a number of places, including the user's local configuration file, the global configuration file and even the configuration files of spam arbitrons such as SpamAssassin.

Before MailCorral proceeds with any other type of spam classification, it checks to see whether a message sender and/or recipient is on one of the lists and, if so, it either always delivers the message or always treats it as spam (depending on which of the lists registered a hit).
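
The early determination step can be pictured with the following Python sketch. The function name and return values are hypothetical, and the choice to let the whitelist win when an address appears on both lists is an assumption of this example, not something stated above.

```python
def early_disposition(sender, recipients, whitelist, blacklist):
    """Return "deliver", "spam", or None when neither list registers a hit.

    whitelist and blacklist are sets of lower-case email addresses.
    """
    addresses = {address.lower() for address in [sender, *recipients]}
    if addresses & whitelist:   # assumption: whitelist is checked first
        return "deliver"
    if addresses & blacklist:
        return "spam"
    return None                 # fall through to the slower spam checks
```

Only when the result is None would the statistical tests and spam arbitrons need to run.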

Statistical Spam Determination

Prior to invoking any spam arbitron, statistical information about each message is examined to look for spam-like qualities that indicate the message is spam. If a determination can be made about a message without invoking any spam arbitron, this is done to speed up spam processing, since calling spam arbitrons is generally slow.

The statistical tests that are applied to messages to render a spam determination are fairly simple and, hence, quite rapid. They also yield a very high true positive rate when applied to messages sent via email. They rely on the fact that normal email messages do not employ many of the tricks used by spammers to circumvent spam filtering. By employing these tricks to get around content-based filters, the spammers are essentially flagging their email messages as spam. Here is a description of the tests applied (all tests are applicable to HTML messages only):

Content-based spam identifiers (e.g. Bayesian) look at the words in a message to identify whether it is spam or not. Spammers insert HTML comments into the middle of words (e.g. Via<!--junk-->gra) to break the words up so that content-based identifiers will not see the trigger words that indicate spam. Regular email messages never insert comments into the middle of words so a ratio of the number of embedded comments to words gives a good indication of spamishness. Even a very small ratio is an excellent indicator.
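
A minimal Python sketch of the embedded-comment test follows. It is only an approximation under stated assumptions: the regular expressions, the function name and the definition of a "word" (whitespace-separated text left after tags are stripped) are choices made for this example, not MailCorral's actual C code.

```python
import re

COMMENT = re.compile(r"<!--.*?-->", re.DOTALL)
TAG = re.compile(r"<[^>]+>")

def embedded_comment_ratio(html):
    """Ratio of comments that split a word to the number of words."""
    embedded = 0
    for match in COMMENT.finditer(html):
        # A comment splits a word when letters or digits touch both ends.
        before = html[match.start() - 1] if match.start() > 0 else " "
        after = html[match.end()] if match.end() < len(html) else " "
        if before.isalnum() and after.isalnum():
            embedded += 1
    # Count words after removing comments and then the remaining tags.
    text = TAG.sub(" ", COMMENT.sub("", html))
    words = len(text.split())
    return embedded / words if words else 0.0
```

For the input `<p>Buy Via<!--junk-->gra now</p>` this finds one embedded comment among three words, while an ordinary comment that sits between words contributes nothing.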

Spammers use tables extensively, even going to the point of aligning single characters in table columns to create words that are not detected by content-based spam identifiers. Regular email messages do not use tables to the extent that spammers do, so a high ratio of table tags to regular words can indicate spam.

Image spam presents a message through an image that is loaded from a Web server, thereby resulting in a message with zero or almost no text content, hence nothing is available for filtering. To a lesser extent, text-based spam can embed images to present a more visual message, since visual messages are apparently more appealing. Regular email, although it often includes images, seldom has these images embedded inline and does not include inline images in high proportions to text. Hence, a high ratio of inline images to regular text is a good predictor of spam.

The strength of a good spam campaign can be automatically judged by embedding links to feedback Web sites in such a manner that, when an email message is opened, the Web site is accessed and the recipient's identity is transmitted. In this way, a spammer can know who has received their spam and opened it. Links of this nature are embedded into a message as innocuous images. However, real images seldom include email addresses or parameters containing identifiers. Thus, any images that include such information should receive a high spam score.

Similarly, links to Web pages that include email addresses as parameters (not "mailto:" type links but links to a CGI script or Web page) probably indicate a spammer attempting to provide feedback to themselves with the recipient's email address. Consequently, these types of links are assumed to be indicative of spam.
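
The two feedback-link tests above (images and Web page links that carry the recipient's address) could be approximated as in the Python sketch below. The class name and the email pattern are assumptions made for illustration, and the sketch only looks for email addresses; it does not try to recognize other kinds of identifiers.

```python
import re
from html.parser import HTMLParser
from urllib.parse import parse_qsl, unquote, urlsplit

# Deliberately loose pattern for something that looks like an address.
EMAIL = re.compile(r"[^@\s/?&=]+@[^@\s/?&=]+\.[A-Za-z]{2,}")

class FeedbackLinkFinder(HTMLParser):
    """Collect <a href> and <img src> URLs that carry an email address."""

    def __init__(self):
        super().__init__()
        self.hits = []

    def handle_starttag(self, tag, attrs):
        attr = {"a": "href", "img": "src"}.get(tag)
        if attr is None:
            return
        url = dict(attrs).get(attr) or ""
        parts = urlsplit(url)
        if parts.scheme not in ("http", "https"):
            return  # "mailto:" links and the like are not counted
        values = [value for _, value in parse_qsl(parts.query)]
        if any(EMAIL.search(v) for v in values) or EMAIL.search(unquote(parts.path)):
            self.hits.append((tag, url))

finder = FeedbackLinkFinder()
finder.feed(
    '<a href="mailto:bob@example.com">write</a>'
    '<img src="http://t.example/p.gif?u=bob%40example.com">'
    '<a href="http://shop.example/item?id=5">shop</a>'
)
```

After feeding this sample, only the image URL ends up in `finder.hits`; the "mailto:" link and the ordinary shop link are left alone.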

Perhaps there are legitimate reasons to encode plain text and HTML as if they were binary data (e.g. alternate character sets) but the most common use of this technique has been to obscure spam and viruses from detection by scanners. This being the case, any text or HTML MIME entities that are encoded as Base64 are given a high statistical spam score which, while not high enough to win a message a spam label all on its own, is high enough to ensure that any other indiscretions will push it over.
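
Spotting Base64-encoded text entities is straightforward with a MIME parser. The sketch below uses Python's standard email package purely as an illustration; the function name is hypothetical and this is not how MailCorral itself is implemented.

```python
from email import message_from_bytes, policy

def base64_text_parts(raw_bytes):
    """Return the content types of text entities that are Base64 encoded."""
    msg = message_from_bytes(raw_bytes, policy=policy.default)
    found = []
    for part in msg.walk():
        if part.get_content_maintype() != "text":
            continue  # skips multipart containers and binary attachments
        encoding = str(part.get("Content-Transfer-Encoding", "")).strip().lower()
        if encoding == "base64":
            found.append(part.get_content_type())
    return found

sample = (
    b"MIME-Version: 1.0\r\n"
    b"Content-Type: text/html; charset=utf-8\r\n"
    b"Content-Transfer-Encoding: base64\r\n"
    b"\r\n"
    b"PGI+aGk8L2I+\r\n"
)
flagged = base64_text_parts(sample)
```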

Note that the final statistical score for a message is the sum of all of the statistical tests that are enabled. These tests are employed because they often provide a rapid determination of spam and can work very well, in many instances. Mind you, not everyone is bound to agree with these statistical definitions of spam so any and all of the tests can be disabled or adjusted to fit individual preferences.
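
The summing of enabled tests can be sketched as follows. All of the names, scores and the threshold here are invented for the example; the real values are whatever the configuration specifies.

```python
from dataclasses import dataclass

@dataclass
class StatTest:
    name: str
    score: float          # contribution when the test fires (hypothetical)
    enabled: bool = True  # disabled tests never contribute

def statistical_score(tests, fired):
    """Sum the scores of all enabled tests whose names are in fired."""
    return sum(t.score for t in tests if t.enabled and t.name in fired)

SPAM_THRESHOLD = 5  # hypothetical cut-off

tests = [
    StatTest("base64_text", 4),
    StatTest("embedded_comments", 3),
    StatTest("table_ratio", 2, enabled=False),
]
alone = statistical_score(tests, {"base64_text"})
combined = statistical_score(tests, {"base64_text", "embedded_comments"})
```

With these made-up numbers, Base64 text on its own scores 4 and stays under the threshold, but any second indiscretion pushes the total to 7 and over it.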

Spam Classification

Spam classification is carried out in cooperation with a spam arbitration daemon (arbitron), such as SpamAssassin. MailCorral prepares a representative message that contains all of the components of each received message that are important from the standpoint of spam arbitration. It then sends a request for spam determination, containing the representative message, to the arbitron via a pipe, and the arbitron renders a decision as to whether the message is spam or not. One or more arbitrons may be asked to render a decision on a message before a final decision is reached.

The received message is processed to create the representative message by first removing any attachments and inline components that are not directly viewable by typical mail readers (e.g. anything other than text/plain, text/html, etc.). This is done for performance reasons, since large attachments do not contribute measurably to the spam determination process but they would normally require transmission to the arbitron. Then, any MIME entities that are encoded are decoded so that the arbitron is dealing only with plain text. This reduces the amount of data that must be piped to it and the decoding is done via a fast C routine.
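
To make the two steps concrete (drop what a mail reader would not display, then decode what is left), here is a small Python sketch using the standard email package. It is an illustration only: MailCorral does this work in C, and the function name and the set of "viewable" types are assumptions of the example.

```python
from email import message_from_bytes, policy
from email.message import EmailMessage

VIEWABLE = {"text/plain", "text/html"}  # assumed set of viewable types

def representative_parts(raw_bytes):
    """Return the decoded text of each directly viewable MIME entity."""
    msg = message_from_bytes(raw_bytes, policy=policy.default)
    parts = []
    for part in msg.walk():
        if part.is_multipart():
            continue
        if part.get_content_type() not in VIEWABLE:
            continue  # binary attachments are left out entirely
        if part.get_content_disposition() == "attachment":
            continue
        # get_content() undoes Base64 or quoted-printable encoding.
        parts.append(part.get_content())
    return parts

# Build a sample message: Base64-encoded text plus a binary attachment.
demo = EmailMessage()
demo["Subject"] = "sample"
demo.set_content("hello body\n", cte="base64")
demo.add_attachment(b"\x00\x01\x02", maintype="application",
                    subtype="octet-stream", filename="blob.bin")
parts = representative_parts(demo.as_bytes())
```

Only the decoded text body survives; the binary attachment never reaches the arbitron.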

Upon receipt of the arbitron's decision, the message is disposed of according to the disposition options. If the trash option is chosen, the message is dumped and that's that. If the immediate delivery option is chosen, the spam is tagged with a subject prefix of "[SPAM]:" and sent on its merry way, after any virus filtering, etc. is done. If the corral option is chosen, the spam is filtered, prepared for delivery and then sidelined in a corral directory where a program such as SpamCorral can find it and send out receipt notifications.

Meanwhile, sendmail is instructed to reply to the sender or not, depending on the value of a parameter. A value of zero causes no replies to be sent to the sender. A value greater than zero causes no more than that many replies to be sent to the sender in a predetermined period. After that threshold is reached no more replies are sent until the end of the period passes. If a reply is sent, it has a major and minor result code of "550" and "5.7.1" plus a suitable message text that says the message is spam. Perhaps it is a bit naive to expect that spammers will do anything about this kind of delivery notification so this feature is off by default.
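
The reply-limiting rule can be sketched in Python as below. The class name, the fixed-window bookkeeping and the decision to count replies separately for each sender are assumptions made for this example; the paragraph above does not spell out those details.

```python
import time

class ReplyLimiter:
    """Allow at most max_replies replies per sender in each period."""

    def __init__(self, max_replies, period_seconds, clock=time.monotonic):
        self.max_replies = max_replies
        self.period = period_seconds
        self.clock = clock
        self.windows = {}  # sender -> (window start time, replies sent)

    def should_reply(self, sender):
        if self.max_replies <= 0:
            return False  # a value of zero means never reply
        now = self.clock()
        start, count = self.windows.get(sender, (now, 0))
        if now - start >= self.period:
            start, count = now, 0  # the period has passed; start over
        if count >= self.max_replies:
            self.windows[sender] = (start, count)
            return False
        self.windows[sender] = (start, count + 1)
        return True
```

Passing the clock in as a parameter is just a convenience so the behavior can be checked without waiting for a real period to elapse.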

Archiving

There are two archiving possibilities supported by MailCorral. In either case, an exact copy of the original message (before it is filtered) is sent to the archiver.

In the first case, primarily meant to support third-party archivers that can accept messages to be archived through regular email delivery channels, a copy of the message to be archived is sent to the email address given for the archiver (presumably it is some kind of mail handling robot that can accept a message and archive it). Using this technique the archiver can be located anywhere that is reachable via email.

Using regular email channels to deliver messages to the archiver has its advantages but one big disadvantage is that, should something go wrong with the archiver or the network connection to it, the messages being archived will be bounced. Since very little alteration was done to the message being archived (essentially, all that is changed is that a new "To:" address is added for the archiver), bouncebacks will be directed back to the original sender. Frequently, this comes as a huge surprise to them, since they have no knowledge of the archive process or the archiver.

MailCorral handles this problem by looking for bouncebacks of messages marked for delivery to the archiver, while it is carrying out its other message filtering tasks. If it finds any, it will silently discard them, since the original sender should not ever see them.

The second archiving method is meant to support local archiving of messages on a very fast basis. MailCorral usually has most or all of a filtered message in memory as it processes it. For MailCorral to write the message directly to a local directory tree is easy and fast. Thus, if local archiving is your choice, letting MailCorral take care of it for you is a good idea. Optionally, MailCorral can also maintain an index of archived messages in a MySQL database or you may simply have it write the messages to the archive directory and post-index them with some program of your own design.

Validation

MailCorral is designed to pass the validation suite that BSM Development offers elsewhere on this Web site. If you think you've found a virus that isn't caught by this filter, please send it to us and we'll update the validation suite as well as the filter.