03-Email Format and structure of e-mail

The format and structure of e-mail messages are crucial for several reasons. To properly handle an e-mail message, it’s essential to understand its structure and identify all its components, including message data (e.g., sender, recipients), delivery information (e.g., e-mail servers involved, dates sent and received), message text, and attachments.

Understanding these elements is important for various processes involving e-mail, from ensuring accurate delivery to interpreting and managing message content effectively.

Firstly, to archive a message, it is essential to determine its structure and identify all the elements that comprise it, including:

Message data: Information such as the sender, recipients, etc.
Delivery information: Details about the email servers that handled the message, the date it was sent, the date it was retrieved, etc.
Message text: The content of the email.
Attachments: Any files attached to the email.

Next, these elements should be extracted from the message to help decide, through a delicate and complex process, whether the message should be archived and how it should be classified.

Finally, a decision must be made on the format in which the message and/or its components should be preserved.

Message Structure

An Internet email message consists of two main sections:

Header: A sequence of lines at the beginning of the message, generated by the sender’s email client and the email servers involved in the delivery process.
Body: The rest of the message, containing the message text, either in plain ASCII or with non-ASCII characters, and possibly binary data encoded as plain ASCII text.

In the simplest case, as defined in RFC 822, the message body contains only plain ASCII characters. These messages are straightforward to handle, can be archived in their native format, and can be read again without any need for decoding.

However, most messages today use extended ASCII or Unicode characters, include attachments, or are in HTML format. In these cases, the message must be in MIME format. Therefore, the following sections focus on the structure of MIME messages.

Message Header

The message header is a sequence of lines, called header lines or simply headers, produced by the sender’s email client and the email servers along the delivery path. The header ends with a blank line, after which the message body begins.
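The split between header and body at the first blank line can be seen with a minimal sketch using Python's standard email package. The addresses and message text below are hypothetical, chosen only for illustration.

```python
from email import message_from_string

# Hypothetical raw message: header lines, one blank line, then the body.
RAW = (
    "From: alice@example.org\n"
    "To: bob@example.org\n"
    "Subject: Status report\n"
    "\n"
    "The report is ready.\n"
)

def split_message(raw: str):
    """Return (list of (name, value) header pairs, body text)."""
    msg = message_from_string(raw)
    return msg.items(), msg.get_payload()

headers, body = split_message(RAW)
for name, value in headers:
    print(f"{name}: {value}")
print("Body:", body.strip())
```

Everything before the blank line is returned as header pairs; everything after it is the body.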

Only a small portion of the information in the message header is displayed by email clients. This is reasonable, as there is a wide variety of headers, many of which are optional, and most users would be confused by too much detail. However, email clients typically allow users to inspect the complete header if they wish to investigate the message’s origin and delivery process.

The most common headers are shown in Table 1. These can be divided into five main categories based on the email management processes to which the data refer:

Identity: These headers specify the sender and recipients of the message and add additional details. For instance, the message is usually assigned a unique Message-ID by the sender’s email server, which can be used to reference the message in other communications. Additionally, a Return-Path can be specified, which is different from the sender’s address, to receive bounce messages. The Sender header allows specifying the person or automated agent that is actually sending the message on behalf of the official sender, as listed in the From header.
Delivery: These headers contain details about the delivery process. A Received record is added each time the message is handled by a server along the delivery path, starting with the sender’s email server and ending with the recipient’s server. A timestamp is associated with each step, specifying the local date and time the message arrived at the receiving server, expressed in the standard date format together with its offset from GMT. Additional headers specify whether the sender requested a receipt and to which address it should be sent. It is important to note that different email clients may handle receipt information differently, so the absence of a return receipt should not be taken as definitive proof that the message was not delivered or read.
Thread: These headers are used in messages sent in reply to other messages or forwarded messages, forming a thread. Some of the header information from the original message initiating the thread is included in the new message, notably the message identifier. Headers referring to threads are particularly important in email archiving as they allow for the extraction of metadata connecting a message to other messages.
MIME: These headers specify the structure of the message body and the MIME version, which remains 1.0 despite the evolution of the standard. The Content-Type header specifies whether the message contains one or several parts, and if it contains multiple parts, a boundary is specified to separate them. If the message contains a single part, the Content-Type and Content-Transfer-Encoding are directly specified in the header.
Miscellaneous: Additional headers may be added, referring to security applications, spam filtering, and other email management processes.
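As a rough illustration of how delivery and thread metadata might be extracted for archiving, the sketch below reads the Received, In-Reply-To, and References headers with Python's standard email package. The server names, message identifiers, and the helper names delivery_path and thread_ids are assumptions made for this example; it also assumes each Received header ends with a semicolon followed by the timestamp.

```python
from email import message_from_string
from email.utils import parsedate_to_datetime

# Hypothetical reply message that passed through two servers.
RAW = (
    "Received: from mx.example.net by mail.example.org with ESMTP;"
    " Mon, 31 May 2021 09:17:30 +0200\n"
    "Received: from client.example.net by mx.example.net with ESMTP;"
    " Mon, 31 May 2021 09:17:27 +0200\n"
    "Message-ID: <reply-2@example.net>\n"
    "In-Reply-To: <original-1@example.org>\n"
    "References: <original-1@example.org>\n"
    "From: jane@example.net\n"
    "Subject: Re: Meeting Notes\n"
    "\n"
    "Thanks.\n"
)

def delivery_path(msg):
    """Return the Received timestamps in delivery order (oldest first)."""
    hops = []
    for value in msg.get_all("Received", []):
        # The timestamp follows the last semicolon of a Received header.
        _, _, stamp = value.rpartition(";")
        hops.append(parsedate_to_datetime(stamp.strip()))
    # Each server prepends its own line, so the headers are newest first.
    return list(reversed(hops))

def thread_ids(msg):
    """Return the identifiers of the messages this message refers to."""
    ids = msg.get("References", "").split()
    parent = (msg.get("In-Reply-To") or "").strip()
    if parent and parent not in ids:
        ids.append(parent)
    return ids

msg = message_from_string(RAW)
for hop in delivery_path(msg):
    print(hop.isoformat())
print(thread_ids(msg))
```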

Table 1 – Common Headers (A = Always Present, F = Frequent, O = Optional)

Category	Header	Description	Origin	Present
Identity	Date:	Date/time sent	Sender client	A
	From:	Address of sender	Sender client	A
	Sender:	Address of sender’s assistant	Sender client	O
	Organization:	Organization of author	Sender client	O
	To:	Address of recipients (may be a list)	Sender client	O
	Cc:	Address of recipients in carbon copy	Sender client	F
	Bcc:	Address of recipients in blind carbon copy	Sender client	F
	Subject:	Message summary	Sender client	A
	Message-ID:	Unique identifier assigned by the sender	Sender server	F
	Return-Path:	Address for ‘bounce messages’	Sender client	O
Delivery	User-Agent:	Sender email client software	Sender client	A
	Delivered-To:	Recipient mailbox (may be a list)	Recipient server	A
	Received:	One for each step in the delivery path	Server	A
		from: Server which sent the message	Server	A
		by: Server which received the message	Server	A
		with: Server ESMTP identifier	Server	A
		date: Date/time received	Server	A
	Return-Receipt-To:	Address to send a read receipt	Sender client	O
	Disposition-Notification-To:	Address to send a read receipt	Sender client	O
Thread	In-Reply-To:	Message ID to which the message replies	Sender client	O
	References:	Message ID to which the message refers	Sender client	O
	Resent-From:	Address of sender forwarding the message	Sender client	O
	Resent-To:	Address of the recipient of the forwarded message	Sender client	O
	Resent-Subject:	Subject of the forwarding message	Sender client	O
MIME	MIME-Version:	Always 1.0	Sender client	A
	Content-Type:	Specifies content and structure of the body	Sender client	O
		boundary: Separator in multipart messages	Sender client	O
	Content-Transfer-Encoding:	Encoding scheme	Sender client	A

Message Body

A message in MIME format may contain one or several parts.

Single-Part Messages

A single-part message is a plain text message with no attachments. The corresponding Content-Type in the header is text/plain, with a charset parameter that specifies the character encoding. For messages containing only plain ASCII characters, the Content-Transfer-Encoding is 7bit. If the character set is other than plain ASCII, a different encoding is used, often quoted-printable, which represents most printable ASCII characters directly and encodes every other byte as three plain ASCII characters: an equals sign followed by two hexadecimal digits. An accented character in ISO 8859 (extended ASCII) therefore takes one such triplet, while a Unicode character in UTF-8 takes one triplet per byte. Although this and other encodings are common, many users have seen misinterpreted characters, particularly those with diacritic marks, when an email client decodes a message with the wrong character set.
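The effect of quoted-printable encoding can be illustrated with a short sketch based on Python's standard quopri module. The sample word and the helper name to_quoted_printable are chosen for illustration only.

```python
import quopri

def to_quoted_printable(text: str, charset: str) -> bytes:
    """Encode text in the given charset, then as quoted-printable."""
    return quopri.encodestring(text.encode(charset))

latin1 = to_quoted_printable("café", "iso-8859-1")
utf8 = to_quoted_printable("café", "utf-8")

print(latin1)  # the accented letter is one byte in ISO 8859-1: one =XX triplet
print(utf8)    # the accented letter is two bytes in UTF-8: two =XX triplets

# Decoding with the matching charset restores the original text.
print(quopri.decodestring(latin1).decode("iso-8859-1"))
```

Decoding the same bytes with the wrong character set is one way the misinterpreted diacritics mentioned above can arise.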

A similar encoding scheme, called Encoded-Word, is used for textual header information in character sets other than plain ASCII. The structure of a single-part message is represented in Figure 4. This message uses ISO 8859-1 (Western Europe) encoding and contains accented characters in both the Subject header and the text.

Example: Single-Part Message Structure

Date: Fri, 28 May 2021 16:39:57 +0200
From: "John Doe" <[email protected]>
Subject: =?iso-8859-1?Q?Meeting_with_Mr._Smith?=
MIME-Version: 1.0
Content-Type: text/plain; charset="iso-8859-1"
Content-Transfer-Encoding: quoted-printable

Hello Mr. Smith,

Please find attached the minutes of the meeting.

Best regards,
John Doe
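A single-part message of this kind could be decoded along the following lines. This is a sketch using Python's standard email package with its modern policy; the message shown is a hypothetical one containing accented characters in both the Subject header and the text, not the example above.

```python
from email import message_from_bytes, policy

# Hypothetical single-part message with accented characters
# in the Subject header (Encoded-Word) and in the body (quoted-printable).
RAW = (
    b"From: sender@example.org\r\n"
    b"Subject: =?iso-8859-1?Q?R=E9sum=E9_of_the_meeting?=\r\n"
    b"MIME-Version: 1.0\r\n"
    b'Content-Type: text/plain; charset="iso-8859-1"\r\n'
    b"Content-Transfer-Encoding: quoted-printable\r\n"
    b"\r\n"
    b"The caf=E9 is booked for noon.\r\n"
)

msg = message_from_bytes(RAW, policy=policy.default)

# With policy.default the Encoded-Word in the header is decoded on access.
subject = str(msg["Subject"])

# get_content() undoes the transfer encoding and applies the charset.
text = msg.get_content().strip()

print(subject)
print(text)
```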

Multipart Messages

A multipart MIME message is used to combine several parts into a single message. Each part can have a different content type and/or encoding scheme. For example, a message with an attached image or file requires a multipart structure. Multipart messages are useful for combining different parts, such as text and HTML formats, or adding file attachments.

Figure 5 – Structure of a Multipart Message

multipart/mixed: This subtype is used to combine different types of content into a single message, such as text with an attached image or file.
multipart/alternative: This subtype contains multiple versions of the message body, for instance, plain text and HTML versions. This allows the recipient’s e-mail client to select the best format for display.
multipart/digest: This subtype is similar to multipart/mixed, but the default Content-Type value for a body part is changed from text/plain to message/rfc822. This media type indicates that the body contains an encapsulated message, which follows the syntax of an RFC 822 message. The multipart/digest type is often used for sending collections of messages in a single email, such as in e-mail forwarding.
multipart/related: This subtype provides a way to represent compound objects consisting of several interrelated parts. For example, an HTML message with embedded images would use this subtype, where the HTML document is the root part, and the images are referenced from it.
multipart/report: This subtype is used for electronic mail reports of any kind, generally for message delivery reports. It usually consists of two parts, with an optional third part. The first part contains a human-readable message describing the condition that caused the report to be generated. The second part is machine-parsable and contains an account of the reported message handling event. The optional third part may include the original message or part of it, to assist in diagnosing problems.
multipart/signed: This subtype is used to send digitally signed messages. It consists of two parts: a body part (the actual message) and a signature part. The digital signature authenticates the entire content of the first part. Multiple signature types are possible, though there is still a lack of standardization. Signed messages can also be sent using the multipart/mixed schema.
multipart/encrypted: This subtype is used to send encrypted messages. It has two parts: the first part contains information needed to decrypt the second part, which is the encrypted message. Similar to signed messages, there are different implementations specified in the Content-Type of the first part, and there is still a lack of standardization.

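Each part in a multipart message is separated by a boundary string specified in the Content-Type header of the message. In the body, each delimiter line consists of two hyphens followed by the boundary string, and the final delimiter ends with two additional hyphens. Each part declares its own Content-Transfer-Encoding, such as 7bit, quoted-printable, or base64; the multipart entity itself may only be labelled 7bit, 8bit, or binary.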

Example: Multipart Message Structure

Date: Mon, 31 May 2021 09:17:26 +0200
From: "Jane Smith" <[email protected]>
To: "John Doe" <[email protected]>
Subject: Meeting Notes
MIME-Version: 1.0
Content-Type: multipart/mixed; boundary="boundary1"

--boundary1
Content-Type: text/plain; charset="us-ascii"
Content-Transfer-Encoding: 7bit

Hi John,

Please find the meeting notes attached.

Best regards,
Jane

--boundary1
Content-Type: application/pdf; name="meeting_notes.pdf"
Content-Transfer-Encoding: base64
Content-Disposition: attachment; filename="meeting_notes.pdf"

JVBERi0xLjQKJcTl8uXrp/Og0MTGCjMgMCBvYmoKMSAwIG9iago8PCAvVHlwZSAvRXh0R3N0IC9TdWI …

--boundary1--

Note: The binary content is encoded in Base64 to ensure safe transmission over the network.
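A message with this shape can be built and inspected with Python's standard email package, as in the sketch below. The addresses, the placeholder attachment bytes, and the helper names build_message and list_parts are assumptions made for illustration; the library chooses the boundary string itself when the message is serialized.

```python
from email.message import EmailMessage

def build_message() -> EmailMessage:
    """Build a text message with one attachment (becomes multipart/mixed)."""
    msg = EmailMessage()
    msg["From"] = "jane@example.org"
    msg["To"] = "john@example.org"
    msg["Subject"] = "Meeting Notes"
    msg.set_content("Please find the meeting notes attached.\n")
    # Placeholder bytes stand in for a real PDF file.
    msg.add_attachment(b"%PDF-1.4 placeholder", maintype="application",
                       subtype="pdf", filename="meeting_notes.pdf")
    return msg

def list_parts(msg):
    """Return (content type, transfer encoding, filename) for each leaf part."""
    rows = []
    for part in msg.walk():
        if part.is_multipart():
            continue  # skip the container, keep only the actual parts
        rows.append((part.get_content_type(),
                     str(part["Content-Transfer-Encoding"]),
                     part.get_filename()))
    return rows

message = build_message()
print(message.get_content_type())
for row in list_parts(message):
    print(row)
```

Walking the parts in this way is one possible starting point for extracting the text and the attachments separately before archiving.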

MIME Media Types

A MIME media type is an identifier used in a Content-Type header to specify the nature of the data in the body of a MIME entity, whether it is the body of a single-part message or a part of a multipart message. MIME media types are often referred to as Internet media types because they are also used in other Internet protocols, such as HTTP. Their purpose is to enable the correct interpretation of the message content by specifying the file format of its body and attachments.

The MIME media type mechanism is defined in RFC 2046 and is designed to be extensible, as the set of media types is expected to grow significantly over time. To ensure that Internet media types are developed in an orderly, well-specified, and public manner, a registration process has been devised, managed by the Internet Assigned Numbers Authority (IANA).

Media types are two-level identifiers, specifying a top-level type and a subtype, with optional additional parameters. RFC 2046 defines seven top-level media types. Five of them are discrete data types, specifying the format of a single file, and the remaining two are composite data types, specifying the structure of a MIME body composed of multiple parts.

The five top-level discrete media types are:

text: Used for textual information. The subtype text/plain indicates plain text with no formatting and is intended to be displayed directly without special software, aside from supporting the character set specified by a charset parameter. For example: Content-Type: text/plain; charset=iso-8859-1

This indicates text encoded in the ISO/IEC 8859-1 character set, commonly referred to as Latin-1 (Western European). Other subtypes include text/html for HTML files, text/xml for XML files, and text/css for CSS (Cascading Style Sheet) files.
image: Used for image data, i.e., any information that requires a graphical display device to be rendered. Registered subtypes include widely used image types such as gif, tiff, jpeg, and png.
audio: Used for audio data, i.e., any information that requires an audio device, such as a speaker, to be rendered. The general subtype is audio/mpeg, which refers to MP3 or MPEG audio. Other audio data subtypes refer to proprietary formats, such as audio/x-ms-wma for Windows Media Audio or audio/x-wav for Waveform Audio File Format (WAV).
video: Used for time-varying picture images, possibly with color and coordinated sound. Standard (IANA-registered) subtypes include video/mpeg for MPEG-1 video with multiplexed audio, video/mp4 for MP4 video, and video/quicktime for QuickTime video. Other subtypes refer to proprietary formats, such as video/x-ms-wmv for Windows Media Video.
application: Used for data that does not fit into any of the other media types. This type of data needs to be processed by an application program to be rendered. There is a very large variety of application subtypes, with IANA having registered about 700 subtypes, most of which are vendor-specific, with identifiers beginning with vnd. For example, the application/vnd.ms-excel subtype is used for Microsoft Excel files. Due to the enormous variety, it is impossible to enumerate even a small set of relevant application subtypes.
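The two-level structure of a media type, with its optional parameters, can be taken apart as in the following sketch, which relies on Python's standard email package to parse a Content-Type value. The helper name describe_media_type and the "other" fallback label are assumptions for this example; the discrete and composite sets follow the seven top-level types of RFC 2046 mentioned above.

```python
from email.message import Message

# The seven top-level types of RFC 2046, grouped as described in the text.
DISCRETE = {"text", "image", "audio", "video", "application"}
COMPOSITE = {"multipart", "message"}

def describe_media_type(content_type: str):
    """Return (top-level type, subtype, parameters, kind) for a Content-Type value."""
    msg = Message()
    msg["Content-Type"] = content_type
    top = msg.get_content_maintype()
    sub = msg.get_content_subtype()
    # get_params() lists the media type first, then the parameters.
    params = dict(msg.get_params()[1:])
    if top in DISCRETE:
        kind = "discrete"
    elif top in COMPOSITE:
        kind = "composite"
    else:
        kind = "other"
    return top, sub, params, kind

print(describe_media_type("text/plain; charset=iso-8859-1"))
print(describe_media_type('multipart/mixed; boundary="boundary1"'))
```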

Media Types and Dynamic Contents

The situation with media types is more complex than it might appear. Besides the IANA-registered media types, many subtypes are widely used and handled by most e-mail clients but are not yet registered with IANA. For instance:

Content-Type: application/msword; name="sample.doc"
Content-Description: sample.doc
Content-Disposition: attachment; filename="sample.doc"; size=99328; creation-date="Tue, 05 Aug 2008 10:08:40 GMT"; modification-date="Tue, 05 Aug 2008 10:08:40 GMT"
Content-Transfer-Encoding: base64

This indicates a Microsoft Word attachment, a common occurrence. Moreover, the Content-Type definition is often completed by several parameters specifying object metadata and encoding, and it is not always evident where to find the related documentation.

Dealing with media types poses several challenges when preserving and archiving e-mail, as we will discuss in more detail in section 5. The media type paradigm was designed to give e-mail users flexibility in attaching files to messages and in defining new types according to their needs. E-mail clients are not expected to handle all media types; if they cannot process a specific data type, they simply classify it as an “unknown application.”

In contrast, the archival preservation process requires the ability to render any part of an archived message at any time in the future. Therefore, it is essential to ensure that:

All media types appearing in archived messages are registered in the archives, along with the necessary information to handle them, even if they are not registered with IANA.
An application is available for each media type registered in the archives.
A converted copy of the attachment is preserved in a format that guarantees it can be rendered at a later time.
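The first two requirements could be checked at ingest time along the lines of the sketch below. The REGISTRY dictionary, its entries, and the helper names are purely hypothetical stand-ins for whatever registry an archive actually maintains; the sample message and its placeholder attachments are likewise invented for illustration.

```python
from email.message import EmailMessage

# Hypothetical archive registry: media type -> application able to render it.
REGISTRY = {
    "text/plain": "any text viewer",
    "application/pdf": "PDF reader",
}

def media_type_inventory(msg):
    """Return the sorted set of media types of all non-container parts."""
    return sorted({part.get_content_type()
                   for part in msg.walk()
                   if not part.is_multipart()})

def unregistered_types(msg, registry):
    """Return the media types in the message that the registry does not know."""
    return [t for t in media_type_inventory(msg) if t not in registry]

# Hypothetical message with two attachments (placeholder bytes, not real files).
sample = EmailMessage()
sample.set_content("See attachments.\n")
sample.add_attachment(b"placeholder", maintype="application",
                      subtype="pdf", filename="a.pdf")
sample.add_attachment(b"placeholder", maintype="application",
                      subtype="msword", filename="b.doc")

print(media_type_inventory(sample))
print(unregistered_types(sample, REGISTRY))
```

A non-empty result would signal that a media type needs to be registered, and a rendering application identified, before the message is accepted into the archive.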

Finally, issues arise from dynamic information that may be contained in a message. A common case involves external references (e.g., web links) or context-dependent information (e.g., date and time) in attached documents. Such messages are not self-contained and may not be properly rendered at a later time (or even at the time of arrival!). Therefore, when archiving these messages, appropriate policies should be established to either prevent dynamic content or “freeze” all dynamic references at arrival or archival time.
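As one possible way to flag messages that are not self-contained, the sketch below scans an HTML part for references to external web resources using Python's standard html.parser module. The class name ExternalRefFinder, the helper external_references, and the sample HTML are assumptions for illustration; the check only looks at href and src attributes with http or https URLs and would miss other kinds of dynamic content.

```python
from html.parser import HTMLParser

class ExternalRefFinder(HTMLParser):
    """Collect (tag, URL) pairs for href/src attributes pointing to the web."""

    def __init__(self):
        super().__init__()
        self.refs = []

    def handle_starttag(self, tag, attrs):
        for name, value in attrs:
            if (name in ("href", "src") and value
                    and value.startswith(("http://", "https://"))):
                self.refs.append((tag, value))

def external_references(html: str):
    """Return the external references found in an HTML document."""
    finder = ExternalRefFinder()
    finder.feed(html)
    return finder.refs

# Hypothetical HTML body: one web link, one embedded image, one remote image.
SAMPLE = (
    '<p>See the <a href="https://example.org/agenda">agenda</a>.</p>'
    '<img src="cid:logo">'
    '<img src="http://example.org/banner.png">'
)

for tag, url in external_references(SAMPLE):
    print(tag, url)
```

The cid: reference points to a part inside the same message and is therefore not reported, whereas the two web URLs depend on resources outside the message.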

Archiving and Preserving Email Messages

When archiving and preserving email messages, it’s crucial to maintain the structure and format of the original message, including the message header, body, and any attachments. This also involves preserving metadata, such as the sender, recipients, and delivery information, to ensure the authenticity and integrity of the archived message.