Re: The Complexity of a Simple Prefix

The main subject of this blog post is the “Re:” prefix used in the “Subject:” header line of most email replies, but the implicit subject is how MailMate handles the huge gap between theory and practice when dealing with many email headers. The story of the “Re:” prefix is in many ways a typical email header story. What may seem like a simple problem quickly becomes very complicated when 40 years of email history has made its mark. The Devil is in the details.

History

The first attempt to standardize email headers can be seen in RFC 561. This is September 1973, the document is 2 pages long, and not surprisingly it does not mention the “Re:” prefix. Note that this does not mean that the “Re:” prefix was not in use at the time. Many RFCs have been written after the fact and simply document what is already the prevailing behavior of implementations. This is emphasized by the first use of “Re:” in an RFC in 1977 where it is only present as part of an example of an email reply. This example lives on in RFC 733 (1977) and the classic RFC 822 (1982). It takes almost 20 years before RFC 822 is updated, but for the first time the “Re:” prefix is explicitly described in RFC 2822:

When used in a reply, the field body MAY start with the string “Re: ” (from the Latin “res”, in the matter of) followed by the contents of the “Subject:” field body of the original message.

It even includes an explanation of what “Re:” means, although I’m with this guy on that one. Still, the Latin interpretation serves a purpose, as we’ll see further below. For the sake of completeness, the meaning of “Re:” was changed (corrected?) to be from the Latin “in re” in RFC 5322.

Ok, that was a nice bit of history and it actually started before I was born, which means I don’t have to feel so old today. A similar history exists for many of the other standard email headers, but in the special case of the “Re:” prefix you might ask: Why standardize it at all? I hope that is going to be evident at the end of this blog post.

Theory

Originally, the “Re:” prefix served the simple purpose of making it easy to see when a message was a reply to another message. The use of it might even pre-date the introduction of the “In-Reply-To” header which could be seen as an alternative solution. In retrospect, it should have been standardized with the following rules:

Add “Re:” to the subject of replies and replies only.
Do not add another prefix when replying to a reply.

As we’ll see further below, the following clarification would also have been nice: Do not use any localized variants of “Re:”.
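To make the two rules concrete, here is a minimal sketch in Python. The function name `reply_subject` is made up for this illustration, and the code only shows the idealized rules, not what any real email client does.

```python
def reply_subject(original: str) -> str:
    """Return the subject of a reply under the two idealized rules."""
    subject = original.strip()
    # Rule 2: replying to a reply keeps the single existing prefix.
    if subject[:3].lower() == "re:":
        return "Re: " + subject[3:].lstrip()
    # Rule 1: replying to an original message adds exactly one prefix.
    return "Re: " + subject


print(reply_subject("Lunch?"))      # first reply
print(reply_subject("Re: Lunch?"))  # reply to a reply
print(reply_subject("Sv: Lunch?"))  # a localized prefix is not recognized
```

Note that the last call stacks a “Re:” on top of “Sv:”, which is exactly why the extra clarification about localized variants would have been useful.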

Practice

It might not have helped if the reply prefix had been standardized, but the lack of standardization certainly did not help. Variations of the prefix were invented and they resulted in various problems. I’ll describe three prefix-related problems below.

The first problem is how to handle a reply to a reply. The naive solution is to just add another “Re:” prefix, but this is ugly and it can quickly increase the length of a subject line. A better solution is to only add the prefix if one does not already exist. Some email clients decided to do better than that and they introduced a counter in the prefix:

Re[4]: This correspondence now involves 4 messages.

That is not a bad idea, but since not all email clients supported it, a reply would often look like this:

Re: Re[4]: This correspondence now involves 4+1 messages.

The “smart” email client might be able to merge the prefixes when replying and therefore the above could be an acceptable side effect until all email clients supported the counters, but this now seems unlikely to ever happen. It does not help that this has never been standardized.
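As a rough illustration of what such merging could look like, here is a small Python sketch. It assumes that a counter “Re[n]:” stands for n replies and that a plain “Re:” stands for one. Since counters were never standardized, both the function name `merge_reply_prefixes` and this interpretation are assumptions.

```python
import re

# One reply prefix with an optional counter, e.g. "Re:" or "Re[4]:".
RE_PREFIX = re.compile(r"\s*re(?:\[(\d+)\])?:", re.IGNORECASE)


def merge_reply_prefixes(subject: str) -> str:
    """Collapse leading "Re:"/"Re[n]:" prefixes into one counted prefix."""
    count = 0
    pos = 0
    while (m := RE_PREFIX.match(subject, pos)):
        # Assumption: "Re[n]:" counts as n replies, a plain "Re:" as one.
        count += int(m.group(1)) if m.group(1) else 1
        pos = m.end()
    body = subject[pos:].strip()
    if count == 0:
        return body
    if count == 1:
        return f"Re: {body}"
    return f"Re[{count}]: {body}"


print(merge_reply_prefixes("Re: Re[4]: This correspondence now involves 4+1 messages."))
```

A client that does not know about counters would still prepend its own “Re:”, so the merge has to be repeated on every reply.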

The second problem is localization. Some email clients (none mentioned, none forgotten) insert a localized prefix instead of “Re:”. For example, in Danish this could be “Sv:” as an abbreviation of “Svar” (Reply). The result is predictable:

Re: Sv: Re: Sv: I don't think we are using the same email clients...

This is probably why it was decided that “Re:” is a Latin abbreviation.

The third problem is mailing list subject prefixes. For example, the MailMate mailing list prefixes subject lines with “[MlMt]” (I’ll use “blob” to refer to such a prefix, as done in RFC 5256). This can confuse both email clients and mailing list software, particularly when combined with the other problems described. The result could be a subject line like this:

Re: Sv: Re: [MlMt] Re: This is starting to get silly...

Things get even more complicated when also considering forwarding-related prefixes and other semi-standard conventions. Now, the question is, what can MailMate do about this?

The heuristic solution

No perfect solution exists for this problem. The following is an excerpt from RFC 5256 where the suggested solution is to simply ignore most of the variations described:

Translations of the “re” or “fw”/“fwd” tokens are not specified for removal in the base subject extraction process. An attempt to add such translated tokens would result in a geometrically complex, and ultimately unimplementable, task.

I agree with the sentiment of this, but MailMate tries to be a little bit more pragmatic. The heuristic solution in MailMate is simple: A regular expression is used to identify the prefix of a subject line. This regular expression and many others can be found in the specifiers.plist file within the MailMate application bundle. The essence of the regular expression is as follows:

((?:\s*\p{Alpha}{2,3}(?:\[\d+\])?[:：])+)

Update October 6th, 2016: Too many false positives mean that this is going to be replaced with an expression including a specific list of known prefixes. It’s also going to be configurable.

This is geek language for a (possibly repeated) sequence of two or three Unicode letters followed by an optional counter and a mandatory colon. The colon can actually be one of two kinds: the usual one and a so-called full-width colon. The latter is used in the Chinese prefix “回信：” and possibly by other languages as well.
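For those who want to experiment, here is a hedged Python approximation of the expression. Python's built-in `re` module does not support `\p{Alpha}`, so the sketch substitutes `[^\W\d_]`, which matches Unicode letters and is close but not identical. Anchoring the match at the start of the subject and the helper name `split_prefix` are also assumptions made for this example.

```python
import re

# Approximation of the expression above: [^\W\d_] stands in for \p{Alpha}.
PREFIX = re.compile(r"((?:\s*[^\W\d_]{2,3}(?:\[\d+\])?[:：])+)")


def split_prefix(subject: str) -> tuple[str, str]:
    """Split a subject line into (prefix, remainder)."""
    m = PREFIX.match(subject)  # assumption: only a leading prefix counts
    if m is None:
        return "", subject.strip()
    return m.group(1).strip(), subject[m.end():].strip()


for s in ["Re: Sv: Re: Hello", "Re[4]: Hello", "回信：你好", "PS: see you", "Hello"]:
    print(split_prefix(s))
```

The “PS:” case is an example of the kind of false positive that a very liberal expression like this one produces.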

Why MailMate needs to know the prefix

MailMate is quite liberal with respect to what it identifies as a prefix. It is therefore also important that the prefix is handled with care. As we’ll see below it is mostly used for improving the display and sorting of subject lines. An exception is the generation of a subject line for a reply for which MailMate tries to never add more than a single “Re:”. This strategy adheres to Postel’s law: “Be liberal in what you accept, and conservative in what you send”.

The identification of the prefix, the blob, and the body of a subject line is used for the following purposes in MailMate:

Display all subject lines with the same order of elements and without extraneous whitespace between these elements. The blob and, in particular, the order of the prefix and the blob are not standardized and therefore vary a lot.
Proper sorting of messages by subject.
Only compare the body of a subject when trying to determine whether or not the subject has changed between a message and its reply (MailMate warns about such a change before sending a reply).
Only prefix a subject line with a single “Re:” when generating a reply.
Allow searches for messages with a particular subject blob or subject body.
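The following Python sketch shows one way a subject line could be split into the three parts used in the list above, and how the body alone could be compared to detect a changed subject. This is not MailMate's actual implementation. The names `parse_subject` and `subject_changed`, the blob pattern, and the exact string comparison are all assumptions for illustration.

```python
import re
from typing import NamedTuple

# Liberal prefix pattern (letters approximated with [^\W\d_]) and a
# hypothetical pattern for a bracketed mailing list blob.
PREFIX = re.compile(r"\s*[^\W\d_]{2,3}(?:\[\d+\])?[:：]")
BLOB = re.compile(r"\s*\[([^\[\]]+)\]")


class Subject(NamedTuple):
    prefix: str
    blob: str
    body: str


def parse_subject(subject: str) -> Subject:
    """Strip leading prefixes and blobs in whatever order they appear."""
    prefixes, blobs, pos = [], [], 0
    while True:
        if (m := PREFIX.match(subject, pos)):
            prefixes.append(m.group().strip())
        elif (m := BLOB.match(subject, pos)):
            blobs.append(m.group(1))
        else:
            break
        pos = m.end()
    return Subject(" ".join(prefixes), " ".join(blobs), subject[pos:].strip())


def subject_changed(original: str, reply: str) -> bool:
    """Compare only the bodies, ignoring prefixes and blobs."""
    return parse_subject(original).body != parse_subject(reply).body


print(parse_subject("Re: Sv: Re: [MlMt] Re: This is starting to get silly..."))
```

Because prefixes and blobs are collected separately, the order in which they appeared in the original subject line no longer matters for display, sorting, or comparison.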

The behavior of MailMate is far from perfect, but the solution to this and similar header parsing problems is very general and flexible. For example, the sorting of the messages outline by subject is handled by defining the following sortKey for the subject column of the messages outline (outlineColumns.plist):

sortKey = "subject.blob,subject.body,subject.prefix";
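In Python terms, such a sort key roughly corresponds to sorting by a tuple of the three parts in the listed order. The sketch below uses hand-written, already parsed subjects and plain string comparison. How MailMate actually compares the individual parts (for example with respect to case) is not covered here, and the `Subject` type is hypothetical.

```python
from typing import NamedTuple


class Subject(NamedTuple):
    prefix: str
    blob: str
    body: str


def sort_key(s: Subject) -> tuple[str, str, str]:
    # Mirrors "subject.blob,subject.body,subject.prefix".
    return (s.blob, s.body, s.prefix)


messages = [
    Subject("Re:", "MlMt", "Sorting"),
    Subject("", "", "Invoice"),
    Subject("", "MlMt", "Sorting"),
    Subject("Sv: Re:", "MlMt", "Sorting"),
    Subject("Re:", "", "Invoice"),
]
ordered = sorted(messages, key=sort_key)
for s in ordered:
    print(s)
```

The effect is that messages from the same mailing list stay together, and a message and its replies end up next to each other regardless of how many prefixes the replies have accumulated.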

And the display of the subject line in the messages outline is handled by the following format string (outlineColumns.plist):

formatString = "${subject.prefix:+${subject.prefix} }${subject.blob:+[${subject.blob}] }${subject.body}";
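Read like a shell-style expansion, `${name:+text}` inserts `text` only when `name` is non-empty. Assuming that interpretation is right, the format string behaves like the following Python sketch (the function name `format_subject` is made up for the example):

```python
def format_subject(prefix: str, blob: str, body: str) -> str:
    """Rebuild a subject line in a fixed order: prefix, blob, body."""
    parts = []
    if prefix:  # ${subject.prefix:+${subject.prefix} }
        parts.append(prefix + " ")
    if blob:    # ${subject.blob:+[${subject.blob}] }
        parts.append("[" + blob + "] ")
    parts.append(body)  # ${subject.body}
    return "".join(parts)


print(format_subject("Re:", "MlMt", "Hello"))
print(format_subject("", "MlMt", "Hello"))
print(format_subject("", "", "Hello"))
```

The conditional parts are what keep the display free of stray brackets and extra whitespace when a subject has no prefix or no blob.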

If you don’t like that (and you feel adventurous) then you can create your own columns or override the existing ones using low-level customizations.

This post became much longer than I anticipated and it didn’t even cover all the details. Now, please don’t get me started on the intricacies of the address-related email headers…