How Email Works
Email is a store-and-forward method of exchanging messages on the Internet. This means a message sent by a user goes through an asynchronous process of delivery, typically involving a series of steps. In each step, the message is stored by an intermediate server on the network to be forwarded at a later time until it finally reaches its destination. The timing of delivery depends on the availability of network connections.
Figure 1 illustrates the delivery process, which involves a sender, Alice, and a recipient, Bob. Both Alice and Bob use specific applications called email clients, which run on their PCs to send and receive emails. These clients do not communicate directly but connect to email servers, which are specialized applications operated by Alice’s and Bob’s organizations or ISPs that manage the delivery process.
Figure 1 – Basic Email Infrastructure
The email delivery process involves the following steps:
- Alice composes the message using her email client.
- The message is formatted by Alice’s email client in a specific Internet email format and then sent to her local email server.
- Alice’s email server locates the address of Bob’s email server using the Domain Name System (DNS), the distributed directory of the Internet.
- The two email servers exchange the message, which may pass through a series of intermediate servers on the network, until it is finally stored in Bob’s personal mailbox on Bob’s email server.
- The message remains in Bob’s mailbox until he reads or downloads it using his email client.
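The steps above can be sketched as a small in-memory simulation. This is only an illustration under stated assumptions: the `MX_TABLE` dictionary stands in for a real DNS lookup, `MAILBOXES` stands in for server-side storage, and all addresses, host names, and function names are hypothetical.

```python
from email.message import EmailMessage
from email.utils import parseaddr

# Hypothetical stand-in for DNS: recipient domain -> name of its mail server.
MX_TABLE = {"example.net": "mail.example.net"}

# Hypothetical server-side storage: server name -> {address: [stored messages]}.
MAILBOXES = {"mail.example.net": {"bob@example.net": []}}


def compose(sender, recipient, subject, text):
    """The client builds the message in Internet email format."""
    msg = EmailMessage()
    msg["From"] = sender
    msg["To"] = recipient
    msg["Subject"] = subject
    msg.set_content(text)
    return msg


def deliver(msg):
    """Find the recipient's server and store the message in the mailbox."""
    _, address = parseaddr(msg["To"])
    domain = address.rpartition("@")[2]
    server = MX_TABLE[domain]  # a real server would query DNS here
    MAILBOXES[server][address].append(msg.as_string())
    return server


def read_mailbox(server, address):
    """The recipient's client retrieves what the server has stored."""
    return list(MAILBOXES[server][address])


message = compose("alice@example.org", "bob@example.net",
                  "Lunch?", "Noon at the usual place.")
server = deliver(message)
inbox = read_mailbox(server, "bob@example.net")
print(f"{len(inbox)} message(s) waiting on {server}")
```

The sketch collapses the chain of intermediate servers into a single hop; a real delivery would repeat the store-and-forward step at each server along the way.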
The procedure is quite similar to the process Alice and Bob follow when exchanging letters. Local post offices play a role similar to that of local email servers, and letter delivery may go through additional post offices (intermediate servers). In both cases, delivery time and even delivery itself are not guaranteed.
The Internet is a best-effort network, meaning the message, like any other information crossing the network, must pass through several servers run by independent organizations that make no commitment to service availability or quality. Therefore, delivery time cannot be predicted, and the message may even get lost along the way.
However, as we will discuss later in more detail, all clients and servers involved in the delivery process follow a set of strict rules (protocols). This allows all relevant events to be traced: each server that handles the message records detailed trace information in the message header. Additionally, if delivery fails, the server may attempt delivery again, and the sender may request delivery reports and receipts to confirm that the message has been delivered and/or read by the recipient.
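One concrete form of this tracing is the Received header field, which each server adds at the top of the message as it handles it. The following sketch reads those fields back to reconstruct the path. The sample message, host names, IDs, and dates are invented for illustration, and the simple word-based parsing assumes the common "from ... by ..." layout rather than handling every variant.

```python
from email import message_from_string

# Invented sample message; host names, IDs, and dates are illustrative only.
RAW = """\
Received: from relay.example.com by mail.example.net with ESMTP id B2;
 Mon, 6 Mar 2006 10:00:07 +0000
Received: from mail.example.org by relay.example.com with ESMTP id A1;
 Mon, 6 Mar 2006 10:00:03 +0000
From: alice@example.org
To: bob@example.net
Subject: Lunch?

Noon at the usual place.
"""


def trace_hops(raw_message):
    """Return (from_host, by_host) pairs in the order the message travelled."""
    msg = message_from_string(raw_message)
    hops = []
    # Each server adds its field at the top, so reverse to get travel order.
    for field in reversed(msg.get_all("Received", [])):
        words = field.split()
        from_host = words[words.index("from") + 1]
        by_host = words[words.index("by") + 1]
        hops.append((from_host, by_host))
    return hops


for from_host, by_host in trace_hops(RAW):
    print(f"{from_host} -> {by_host}")
```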
End-User Access to Email
End users can access the email system in several ways:
- Email Client: This method corresponds to the basic process discussed in the previous section, where the user runs a special application on their PC designed to interact with the email server. Email clients can be proprietary or open-source software, and a wide variety of them are available. Besides the basic functions of sending and retrieving messages from the email server, which are performed according to standard interaction protocols that ensure interoperability, they usually offer user-friendly interfaces and additional functions to classify and store messages, manage directories, and more. In this setup, messages are typically downloaded and stored on the user’s PC, which may not be convenient for users who need to access their mail from multiple devices.
- Webmail: This is the most common way users access email from their home PC, through a service offered by their ISPs or third-party organizations like Hotmail or Gmail. In this setup (see Figure 2), the client application running on the end user’s PC is a web browser (e.g., Internet Explorer or Mozilla Firefox), which connects to a web server running a special webmail application. The web server acts as an intermediary and manages the connection with the email server. Additionally, messages are not downloaded to the user’s PC but are managed and stored on the server side. This provides a significant advantage for users who need to access their mail from multiple devices.
- Integrated Systems: This is the typical solution used by most corporations and large organizations. It integrates email access into a broader ‘collaborative’ environment that includes additional functions such as direct messaging, calendaring, contacts, and tasks, as well as support for mobile and web-based access to information. It also manages message storage on a central server. Popular products of this kind include Microsoft Exchange and IBM Lotus Domino. Users run proprietary client applications (e.g., Microsoft Outlook or Lotus Notes) on their PCs that connect to the corporate server, which in turn connects to the email server (see Figure 3). To assist mobile users, these systems often include an optional web interface, functionally equivalent to webmail, which allows access through a web browser. However, the primary interface is typically the proprietary one used on the organization’s intranet. Although this setup is specific and includes proprietary elements, it is essential to consider because it represents a significant portion of the market, especially for email archiving in corporations and large organizations.
Figure 2 – Webmail
Figure 3 – Corporate Mail with Integrated System
Interoperability of Email Systems
As discussed in previous sections, exchanging a message involves interaction among several agents (email clients and servers), which are generally heterogeneous systems based on different hardware and software platforms. Additionally, these systems are independently designed and implemented by different parties, potentially without any direct coordination.
One of the main challenges in the Internet email system is ensuring interoperability, i.e., correct and reliable communication among these heterogeneous systems. Interoperability is based on two main elements:
- Communication Protocols: These are sets of rules governing communication between agents, ensuring that agents can reliably and correctly interact using a common language and standard procedures.
- Message Format: This is a set of formal definitions specifying the structure of the message and how the message and its attachments are encoded, ensuring correct interpretation by different email clients and guaranteeing that the content of the message is correctly rendered to its recipient.
Another requirement is that interoperability must also be guaranteed over time. This means that when the definitions of protocols and message formats evolve, they should maintain backward compatibility, i.e., new rules should still be compatible with old ones. For example, a message formatted according to an older version of the message format standard should be presented correctly by an email client compliant with the new version. Unfortunately, this is not always the case, and it is a major concern in email archiving, where ensuring that archived messages remain readable over time, even as standards evolve, is crucial.
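As a small illustration of backward compatibility, the sketch below feeds a message in the original plain format (no MIME fields at all) to a current parser, here Python's standard `email` package. The sample message is invented; the point is only that a newer implementation falls back to the older defaults when the newer header fields are absent.

```python
from email import message_from_string
from email.policy import default

# A message in the original plain format: no MIME-Version, no Content-Type.
OLD_STYLE = (
    "From: alice@example.org\n"
    "To: bob@example.net\n"
    "Subject: Meeting notes\n"
    "\n"
    "See you on Friday.\n"
)

msg = message_from_string(OLD_STYLE, policy=default)

# With no Content-Type field, the parser falls back to plain US-ASCII text.
content_type = msg.get_content_type()
charset = msg.get_content_charset("us-ascii")
body = msg.get_content()
print(content_type, charset, repr(body))
```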
Internet Standards
The standardization process of the Internet is somewhat different from the usual ISO/IEC track, so it is worth explaining how these standards are developed and allowed to evolve.
Internet standards are developed and promoted by the Internet Engineering Task Force (IETF), which cooperates closely with major international standards bodies like ISO/IEC and the World Wide Web Consortium (W3C), the main international standards organization for the World Wide Web.
The standardization process, which dates back to the early days of the ARPAnet project, is highly cooperative and based on special documents called Requests for Comments (RFCs). RFCs are documents, mostly proposals for standards, published by the IETF and posted on the network as a ‘request for comments.’ Each RFC is assigned a unique number and, once published, is never rescinded or modified. If amendments are needed, a new RFC is issued with a different number, superseding the old one.
As stated in RFC 1796, which discusses the standardization process, “Not all RFCs are standards.” Some are just memoranda, remarks that people wish to share, research papers, or preliminary proposals on any matter concerning the Internet and Internet-based systems. The IETF assigns a status to each RFC.
‘Mature’ RFCs are rated Standards Track and are further divided into Proposed Standard, Draft Standard, and Internet Standard. Internet Standards (STD) each refer to an RFC (or a set of RFCs) and are given a unique number. Unlike the RFC number, the STD number does not change when the standard evolves; it simply refers to the new RFC that supersedes the original one.
Standardization of Email Transmission
Server-to-server and client-to-server interoperability are ensured by SMTP (Simple Mail Transfer Protocol), which is Internet Standard STD 10. SMTP dates back to August 1982 and is based on RFC 821. However, the protocol currently used by the majority of email applications is known as ESMTP (Extended SMTP) and is defined in RFC 2821, published in April 2001.
However, formally, the status of RFC 2821 is still a Proposed Standard, and the official standard is still the one defined by RFC 821. This situation of ‘going ahead of the official standard’ is typical of the Internet world, and it is of no use to argue whether it is right or wrong; we must simply cope with it.
SMTP specifies how the email client interacts with the email server to deliver the message and how email servers (often called SMTP servers) interact with each other to ensure the message passes through several agents and finally reaches its destination. The use of the SMTP protocol in the message delivery process is clearly shown in Figures 1 and 2.
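To make the interaction more concrete, the sketch below lists the commands an SMTP client sends to transfer one message. It is a simplified illustration: the server's replies are omitted, the host name and addresses are hypothetical, and the function name is an assumption of this example. It also shows "dot-stuffing", the rule that a message line starting with a dot gets an extra dot so that it cannot be confused with the lone dot that ends the message.

```python
def smtp_client_commands(client_host, sender, recipient, message_text):
    """Return the lines an SMTP client sends for one message, in order."""
    commands = [
        f"EHLO {client_host}",     # ESMTP greeting (plain SMTP uses HELO)
        f"MAIL FROM:<{sender}>",   # envelope sender
        f"RCPT TO:<{recipient}>",  # envelope recipient
        "DATA",                    # announces that the message content follows
    ]
    for line in message_text.splitlines():
        # Dot-stuffing: double a leading dot so the line is not mistaken
        # for the end-of-data marker.
        commands.append("." + line if line.startswith(".") else line)
    commands.append(".")           # a lone dot ends the message
    commands.append("QUIT")
    return commands


message_text = "Subject: Lunch?\n\nNoon at the usual place.\n.signature line"
for line in smtp_client_commands("pc.example.org", "alice@example.org",
                                 "bob@example.net", message_text):
    print("C:", line)
```

In practice an application would not build these lines by hand; in Python, for instance, the standard `smtplib` module carries out this exchange, including reading the server's replies.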
Regarding the problem of email archiving, this standard is important because it defines the basic format of messages that can be handled by SMTP servers and go through the delivery process. This is a very basic format, supporting only simple text messages in plain ASCII (also called 7-bit ASCII or US-ASCII) characters, which are sufficient only for English and a few other languages. This limitation is overcome by defining a special way to encode richer content in plain ASCII characters, allowing the use of a more general set of characters in the message text, and including formatted text and multimedia content in email messages, as we will discuss in section 2.7.
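The idea of carrying richer content inside plain ASCII can be illustrated with the two encodings commonly used for this purpose, quoted-printable and Base64. The sketch below is a minimal example using Python's standard library; the sample text is arbitrary.

```python
import base64
import quopri

text = "Caffè e città"  # Italian text with non-ASCII letters
utf8_bytes = text.encode("utf-8")

# Quoted-printable keeps ASCII letters readable and escapes other bytes as =XX.
qp = quopri.encodestring(utf8_bytes)

# Base64 maps arbitrary bytes onto a 64-character ASCII alphabet.
b64 = base64.b64encode(utf8_bytes)

print(qp.decode("ascii"))
print(b64.decode("ascii"))

# Both encodings are reversible, so the original text can be recovered.
decoded_qp = quopri.decodestring(qp).decode("utf-8")
decoded_b64 = base64.b64decode(b64).decode("utf-8")
```

Quoted-printable suits text that is mostly ASCII, since it stays largely readable, while Base64 suits binary data such as images.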
Standardization of Client-Server Communication
Email clients can retrieve email from servers in several ways, supported by both standard and proprietary protocols. This is relevant to email archiving because the process of storing email messages must deal with how they are downloaded and handled by different client applications, which may affect the process and determine the format of archived messages.
POP3
POP3 (Post Office Protocol version 3) is the protocol most commonly used by email clients to retrieve messages from servers. The official Internet Standard is defined in STD 53 and is based on RFC 1939, published in May 1996. This protocol is limited in scope and allows for the download of messages only. It does not include the management of mail folders on the server side (e.g., Inbox, Sent, Drafts) or any other advanced features like server-based search or access to metadata. This is a severe limitation, especially when dealing with multiple clients, such as a PC and a smartphone, where folders should be synchronized.
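The download-only nature of POP3 shows in how little a client needs to do. The sketch below retrieves every message from a mailbox using the `stat` and `retr` calls offered by Python's standard `poplib.POP3` class. So that it runs without a server, it uses a small in-memory stand-in, `FakePOP3`, which is purely hypothetical and mimics only those two calls.

```python
def download_all(conn):
    """Download every message from a POP3 mailbox as raw bytes.

    `conn` is expected to behave like a logged-in poplib.POP3 object.
    """
    count, _mailbox_size = conn.stat()  # number of messages, total size
    messages = []
    for number in range(1, count + 1):  # POP3 numbers messages from 1
        _response, lines, _octets = conn.retr(number)
        messages.append(b"\r\n".join(lines))
    return messages


class FakePOP3:
    """Hypothetical in-memory stand-in so the sketch runs offline."""

    def __init__(self, messages):
        self._messages = messages

    def stat(self):
        return len(self._messages), sum(len(m) for m in self._messages)

    def retr(self, number):
        raw = self._messages[number - 1]
        return b"+OK", raw.split(b"\r\n"), len(raw)


demo = FakePOP3([b"Subject: one\r\n\r\nFirst", b"Subject: two\r\n\r\nSecond"])
print(len(download_all(demo)), "messages downloaded")
```

Against a real server, `conn` would instead be created with `poplib.POP3_SSL(host)` and authenticated with `user()` and `pass_()`. Note that nothing in the exchange refers to folders or search, which matches the limitations described above.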
IMAP4
IMAP4 (Internet Message Access Protocol version 4) is the most advanced and feature-rich protocol. The revision in general use, IMAP4rev1, is defined by RFC 3501, published in March 2003. Unlike POP3, it has no STD number, because RFC 3501 has the status of Proposed Standard rather than Internet Standard. IMAP4 supports advanced folder management, server-based search, access to metadata, and offline operations. This makes it much better suited for use with multiple clients. However, it is more complex and demanding in terms of computing and network resources.
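By contrast with a plain download, an IMAP client can select a folder and ask the server to search it. The sketch below lists the Subject header of each unread message using the `select`, `search`, and `fetch` calls of Python's standard `imaplib.IMAP4` class. The `FakeIMAP4` class is a hypothetical in-memory stand-in that mimics only the shape of those replies, so the example runs without a server.

```python
def unseen_subject_headers(conn, folder="INBOX"):
    """Return the Subject header of each unread message in a folder.

    `conn` is expected to behave like a logged-in imaplib.IMAP4 object.
    """
    conn.select(folder, readonly=True)           # open without changing flags
    _status, data = conn.search(None, "UNSEEN")  # the server does the search
    headers = []
    for number in data[0].split():
        _status, parts = conn.fetch(
            number, "(BODY.PEEK[HEADER.FIELDS (SUBJECT)])")
        headers.append(parts[0][1].decode("ascii", "replace").strip())
    return headers


class FakeIMAP4:
    """Hypothetical in-memory stand-in so the sketch runs offline."""

    def __init__(self, subjects):
        self._subjects = subjects  # message number (bytes) -> subject (bytes)

    def select(self, mailbox="INBOX", readonly=False):
        return "OK", [str(len(self._subjects)).encode("ascii")]

    def search(self, charset, *criteria):
        return "OK", [b" ".join(self._subjects)]

    def fetch(self, message_set, message_parts):
        header = b"Subject: " + self._subjects[message_set] + b"\r\n\r\n"
        return "OK", [(message_set + b" (BODY[HEADER.FIELDS (SUBJECT)]", header), b")"]


demo = FakeIMAP4({b"1": b"Lunch?", b"2": b"Minutes"})
print(unseen_subject_headers(demo))
```

Because the folder is opened read-only and the messages stay on the server, several clients can run the same query and see a consistent mailbox.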
Webmail Protocols
The Webmail interface uses a web browser as the client application and a Web server (or a special Webmail server) as an intermediary that connects to the email server. The protocols used by the browser to communicate with the Web server are HTTP (Hypertext Transfer Protocol) and HTTPS (Secure HTTP). These protocols are not email-specific: HTTP/1.1 is defined by RFC 2616 (June 1999) and HTTP over TLS by RFC 2818 (May 2000). Neither of these RFCs carries an STD number.
The protocols used by the Web server to communicate with the email server are generally SMTP and IMAP, already discussed in previous sections.
This setup is highly relevant to email archiving, especially when it comes to ensuring that the archived message’s format includes all the information and content needed to faithfully reconstruct the message as seen by the user when accessing it via Webmail.
Standardization of Message Format
The Internet Standard format for email messages was originally defined by RFC 822 (August 1982), which is Internet Standard STD 11, and later revised by RFC 2822 (April 2001), which specifies the format of the email header and body. As with SMTP, the format defined by RFC 2822 is the one followed in practice, although formally STD 11 still refers to RFC 822, and it has been further refined by several other RFCs. The standard email format supports only plain text messages in US-ASCII encoding, which is a major limitation for modern email communication.
This limitation is overcome by the Multipurpose Internet Mail Extensions (MIME), defined by a set of five RFCs (RFC 2045 to RFC 2049, published in November 1996). MIME allows for the use of various character sets and multimedia content (e.g., images, sound, video) in email messages. It also supports the encoding of binary content in 7-bit ASCII, which is essential for the correct transmission of non-ASCII content in email messages. MIME is fundamental to modern email communication and is supported by almost all email clients and servers.
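The following sketch shows what MIME makes possible in practice, using Python's standard `email` package: a message with non-ASCII text and a binary attachment that is nevertheless serialized entirely in 7-bit ASCII. The addresses, file name, and attachment bytes are placeholders chosen for this example.

```python
from email import message_from_bytes
from email.message import EmailMessage
from email.policy import default

msg = EmailMessage()
msg["From"] = "alice@example.org"
msg["To"] = "bob@example.net"
msg["Subject"] = "Photo"

# Non-ASCII text: declared as UTF-8 and carried as quoted-printable.
msg.set_content("Grüße aus Zürich", charset="utf-8", cte="quoted-printable")

# Binary attachment (placeholder bytes, not a real image): carried as Base64.
payload = b"\x89PNG\r\n\x1a\n\x00\xff"
msg.add_attachment(payload, maintype="image", subtype="png",
                   filename="photo.png")

raw = msg.as_bytes()
print(msg.get_content_type())           # the message now has several parts
print(all(byte < 128 for byte in raw))  # checks that only ASCII is sent

# A receiving client decodes the parts back to the original content.
received = message_from_bytes(raw, policy=default)
text = received.get_body(preferencelist=("plain",)).get_content()
attachment = next(received.iter_attachments())
```

Both encodings are chosen explicitly here to make the 7-bit result visible; left to its defaults, the library may pick a different transfer encoding for the text part.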
Conclusion
The email system is an essential part of modern communication, involving a complex and well-coordinated process of message exchange between various agents (clients and servers) over the Internet. The system is based on a set of standard protocols and message formats that ensure interoperability among heterogeneous systems and reliable communication across the network. These standards are defined and promoted by the IETF through a cooperative and evolving process of RFC publication and review.
The email system supports various access methods for end users, including traditional email clients, Webmail interfaces, and integrated corporate systems. Each method has its own strengths and weaknesses, but all rely on the same underlying protocols and message formats. The system’s success and widespread adoption are due to the interoperability guaranteed by these standards, which allow for seamless communication between different systems, platforms, and applications.
Understanding the standardization of email transmission, client-server communication, and message format is crucial for ensuring the long-term usability and accessibility of archived email messages. By following these standards, organizations can ensure that their archived email messages remain readable and accessible, even as technology evolves.