Mail Archive Application

Last modified by Vincent Massol on 2024/02/26 17:53

Description

This is a DRAFT description of a possible Mail Archive application.
Screenshots are not final, but aim to give a better idea of what to expect.

Design

Overview

The application will be divided into:

  • xwiki-contrib-mail (jar) : provides APIs in charge of:
    • connection and connection check to email servers
    • configurable dumping of emails from email servers
    • methods to parse email headers and/or bodies
    • methods to persist emails in a local store
  • xwiki-contrib-mailarchive-api (jar) : in charge of
    • loading and parsing emails from configured servers (using xwiki-contrib-mail)
    • persisting them in wiki pages/objects
    • managing categories (mailing-lists, mail types)
    • threading mails from the archive or from a specific topic
  • xwiki-contrib-mailarchive-ui (xar) : UI for operating and navigating the mail archive
  • xwiki-contrib-mailarchive-admin-ui (xar) : all the UI bits to administer / configure the mail archive

Note: xwiki-contrib-mail should be unaware of XWiki so it can be used as a generic component.

The whole application UI is split into 4 spaces:

  • MailArchiveCode : all utility pages / macros / classes
  • MailArchivePrefs : all preferences objects are stored in this space. This makes it easy to reimport the application without overwriting existing preferences, or to reset all preferences at once.
  • MailArchiveItems : all mail and topic objects are stored in this space.
  • MailArchive : all entry pages presented to user (mainly MailArchive.WebHome as main entry point, + administration pages, + statistics, ...)

Items can be:
- Simple mails (MailArchiveCode.MailClass)
- Topics (used to group emails of a topic, MailArchiveCode.MailTopicClass)
- Attached emails (MailArchiveCode.MailClass with a different value for type)

The following diagram shows the structure of the application (the XClasses defined) :

ma-xclass-uml-diagram.png

xwiki-contrib-mail

Script service:

    List<Message> fetch(String hostname, int port, String protocol, String folder, String username, String password,
        Properties additionalProperties, boolean onlyUnread);

    int check(String hostname, int port, String protocol, String folder, String username, String password,
        Properties additionalProperties, boolean onlyUnread);

    String parseAddressHeader(String header);

public class MailItem
{
    private Date date;

    private String subject;

    private String topic;

    private String from;

    private String to;

    private String cc;

    private String sender;

    private String topicId;

    private String messageId;

    private String replyToId;

    private String refs;

    private Locale locale;

    private Object bodypart;

    private String contentType;

    private String sensitivity;

    private String type;

    private String wikiuser;

    private boolean isFirstInTopic;

    private String importance;

    private Message originalMessage;

    ...
}

public class MailContent
{
   public boolean isEncrypted();

   public boolean isSigned();

   public String getText();

   public String getHtml();

   public ArrayList<MimeBodyPart> getAttachments();

   public HashMap<String, MailAttachment> getWikiAttachments();

   public void append(MailContent mailContent);

   public List<Message> getAttachedMails();

}

xwiki-contrib-mailarchive-api

Script service:

    /**
     * Checks a server/store account, using account described in specified preference page.
     *
     * @param serverPrefsDoc
     * @return
     */

   int check(String serverPrefsDoc);

   /**
     * Creates a new default loading session.
     *
     * @return
     */

   LoadingSession session();

   /**
     * Creates a loading session, based on configuration stored in specified wiki page.
     *
     * @param sessionPrefsDoc
     * @return
     */

   LoadingSession session(String sessionPrefsDoc);

   /**
     * Creates a loading session from an XObject.
     *
     * @param sessionObject
     * @return
     */

   LoadingSession session(BaseObject sessionObject);

   /**
     * Triggers a synchronous Loading Session.
     *
     * @param session
     * @return
     */

   int load(LoadingSession session);

   /**
     * Triggers an asynchronous loading session.
     *
     * @param session
     * @return
     */

   Job startLoadingJob(LoadingSession session);

   /**
     * Returns the current job, if any.
     *
     * @return
     */

   Job getCurrentJob();

   /**
     * Threads messages related to a topic, given its topic ID.<br/>
     *
     * @param topicid A topic ID, as can be found in a TopicClass object instance in "topicId" field.
     * @return An array of threaded messages.
     */

   ArrayList<ThreadMessageBean> thread(String topicid);

   /**
     * Threads all messages in the mail archive.<br/>
     * The result is an array, and not a recursive structure, in order to facilitate display of the thread.<br/>
     * For each message, property "level" provides the current level in the thread hierarchy,<br/>
     * and "index" provides the sequence number in the whole thread stack.<br/>
     * For example for the following thread:<br/>
     * - "I have a question" (level:0, index:0)<br/>
     * -- "Re: I have a question" (level:1, index:1)<br/>
     * --- "Re: I have a question" (level:2, index:2)<br/>
     * -- "Re: I have a question" (level:1, index:3)<br/>
     * This allows to easily sort by thread, and display thread hierarchy.
     *
     * @return An array of threaded messages.
     */

   ArrayList<ThreadMessageBean> thread();

   /**
     * Parses a user internet address.<br/>
     * Specified in {@link org.xwiki.contrib.mailarchive.internal.utils.IMailUtils#parseUser(String, boolean)}
     *
     * @param internetAddress
     * @return A "mail archive" user.
     */

   IMAUser parseUser(String internetAddress);

   /**
     * Returns the timeline generator component. Interface specified in
     * {@link org.xwiki.contrib.mailarchive.internal.timeline.ITimeLineGenerator}
     *
     * @return
     */

   ITimeLineGenerator getTimeline();

   /**
     * Retrieves the content of an email formatted for display.
     *
     * @param mailPage The wiki page containing the Mail XObject.
     * @param cut If email history should be "cut" out of the content.
     * @return
     */

   DecodedMailContent getDecodedMailText(String mailPage, boolean cut);

   /**
     * Returns the mail archive configuration object.<br/>
     * Specified in {@link org.xwiki.contrib.mailarchive.IMailArchive#getConfiguration()}
     *
     * @return The configuration object, or null if a problem occurred.
     */

   IMailArchiveConfiguration getConfig();

The ThreadMessageBean is essentially:

class ThreadMessageBean {

    int level;
    int index;
    String subject;
    String date;
    String user;
    // ... other relevant message fields for display
}
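
The flat list with "level" and "index" described for thread() can be obtained from a thread tree with a depth-first walk. The following is only a minimal sketch: ThreadNode, FlatMessage and ThreadFlattener are hypothetical names made up for this illustration, not classes of the application.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical tree node; the real threading code may use a different structure. */
class ThreadNode {
    final String subject;
    final List<ThreadNode> replies = new ArrayList<>();

    ThreadNode(String subject) {
        this.subject = subject;
    }

    /** Adds a reply and returns this node, so calls can be chained. */
    ThreadNode reply(ThreadNode child) {
        replies.add(child);
        return this;
    }
}

/** Hypothetical flat row carrying the "level" and "index" properties. */
class FlatMessage {
    final String subject;
    final int level;
    final int index;

    FlatMessage(String subject, int level, int index) {
        this.subject = subject;
        this.level = level;
        this.index = index;
    }
}

class ThreadFlattener {
    /** Depth-first, pre-order walk: each message comes before its replies. */
    static List<FlatMessage> flatten(ThreadNode root) {
        List<FlatMessage> out = new ArrayList<>();
        walk(root, 0, out);
        return out;
    }

    private static void walk(ThreadNode node, int level, List<FlatMessage> out) {
        // The index is the position in the flattened list.
        out.add(new FlatMessage(node.subject, level, out.size()));
        for (ThreadNode child : node.replies) {
            walk(child, level + 1, out);
        }
    }

    public static void main(String[] args) {
        ThreadNode root = new ThreadNode("I have a question")
            .reply(new ThreadNode("Re: I have a question")
                .reply(new ThreadNode("Re: I have a question")))
            .reply(new ThreadNode("Re: I have a question"));
        for (FlatMessage m : flatten(root)) {
            System.out.println(m.subject + " (level:" + m.level + ", index:" + m.index + ")");
        }
    }
}
```

Sorting the rows by index restores the thread order, and the level can drive the indentation in a velocity template.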

The general algorithm of a mail loading session:

  • For each mail with the unread flag
    • Parse the mail headers
    • Try to link new mail with existing topic
      • If no topic exists, create a new one from class MailArchiveCode.MailTopicClass
      • If topic already exists, update it (lastupdatedate and author field as needed)
      • If topic already exists but subjects are too different, create a new one
    • If this mail already exists, update it (try to relate it to an existing topic)
    • If mail does not already exist, create new mail from class MailArchiveCode.MailClass
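
The decisions taken in the steps above can be sketched as two small functions. This is only an illustration under stated assumptions: the class and enum names are hypothetical, and "subjects are too different" is approximated here by comparing subjects after stripping reply/forward prefixes, which is not necessarily the comparison the application uses.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Locale;
import java.util.Set;

/** Hypothetical helper, for illustration only. */
class LoadingDecisions {
    enum TopicAction { CREATE, UPDATE, CREATE_SEPARATE }

    enum MailAction { CREATE, UPDATE }

    /** Assumption: subjects are compared without "Re:" / "Fwd:" prefixes, ignoring case. */
    static String normalize(String subject) {
        return subject.replaceAll("(?i)^\\s*((re|fwd?)\\s*:\\s*)+", "")
            .trim().toLowerCase(Locale.ROOT);
    }

    /** existingTopicSubject is null when no topic could be found for the mail. */
    static TopicAction forTopic(String existingTopicSubject, String mailSubject) {
        if (existingTopicSubject == null) {
            return TopicAction.CREATE;
        }
        return normalize(existingTopicSubject).equals(normalize(mailSubject))
            ? TopicAction.UPDATE
            : TopicAction.CREATE_SEPARATE;
    }

    /** A mail whose Message-ID is already archived is updated, otherwise created. */
    static MailAction forMail(Set<String> knownMessageIds, String messageId) {
        return knownMessageIds.contains(messageId) ? MailAction.UPDATE : MailAction.CREATE;
    }

    public static void main(String[] args) {
        Set<String> known = new HashSet<>(Arrays.asList("<1@example.org>"));
        System.out.println(forTopic(null, "I have a question"));
        System.out.println(forTopic("I have a question", "Re: I have a question"));
        System.out.println(forMail(known, "<2@example.org>"));
    }
}
```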

In addition, each included email (content type "message/rfc822") is also loaded into the archive, except that in this case it is linked not to a topic but to the email that contains it. If a mail is attached to several mails, it is loaded only once.

Note: we consider "Message-ID" to be truly unique, as defined by the relevant RFCs, so if different messages share the same Message-ID value, only the first one encountered will be loaded.
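
The "first one encountered wins" rule can be illustrated with a small in-memory index. This is a sketch only: MessageIdIndex is a hypothetical class, and the map stands in for the wiki pages that the application actually creates.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Hypothetical in-memory index keyed by Message-ID, for illustration only. */
class MessageIdIndex {
    private final Map<String, String> subjectsById = new LinkedHashMap<>();

    /** Returns true if the message was stored, false if its Message-ID was already known. */
    boolean addIfAbsent(String messageId, String subject) {
        // putIfAbsent keeps the existing entry, so later duplicates are ignored.
        return subjectsById.putIfAbsent(messageId, subject) == null;
    }

    String subjectOf(String messageId) {
        return subjectsById.get(messageId);
    }

    int size() {
        return subjectsById.size();
    }

    public static void main(String[] args) {
        MessageIdIndex index = new MessageIdIndex();
        System.out.println(index.addIfAbsent("<1@example.org>", "Original"));
        System.out.println(index.addIfAbsent("<1@example.org>", "Same id, attached elsewhere"));
        System.out.println(index.subjectOf("<1@example.org>"));
    }
}
```

The same check covers a mail attached to several container mails: only the first occurrence is stored.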

Note: the threading algorithm is not run at loading time. Only attachment to a topic is done at that point, and the needed headers (reply-id, topic-id, references ...) are stored along with the other mail information.
The threading algorithm is implemented in the Sheet that displays a Topic (MailArchiveCode.MailTopicSheet), and is essentially an implementation (with very few changes) of the Grendel app algorithm, which is MPL licensed. This algorithm is far more robust as it also takes into account the References header, which most of the time is your last chance to relate mails together. The output of the threading is a tree structure that is flattened, so it can be managed easily from velocity templates.
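
To illustrate why the References header matters, here is a minimal sketch of how a parent can be picked for a mail. It is a simplification written for this page, not the Grendel algorithm itself: ParentResolver is a hypothetical class, and the rule used (last id of References, falling back to the first id of In-Reply-To) is an assumption about a reasonable default.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical helper, for illustration only. */
class ParentResolver {
    /** Message ids look like <something@host> inside the header value. */
    private static final Pattern MESSAGE_ID = Pattern.compile("<[^<>\\s]+>");

    static List<String> extractIds(String headerValue) {
        List<String> ids = new ArrayList<>();
        if (headerValue == null) {
            return ids;
        }
        Matcher matcher = MESSAGE_ID.matcher(headerValue);
        while (matcher.find()) {
            ids.add(matcher.group());
        }
        return ids;
    }

    /** Returns the presumed parent Message-ID, or null when the mail starts a thread. */
    static String parentOf(String references, String inReplyTo) {
        List<String> refs = extractIds(references);
        if (!refs.isEmpty()) {
            // References lists ancestors from oldest to most recent.
            return refs.get(refs.size() - 1);
        }
        List<String> replies = extractIds(inReplyTo);
        return replies.isEmpty() ? null : replies.get(0);
    }

    public static void main(String[] args) {
        System.out.println(parentOf("<a@example.org> <b@example.org>", null));
        System.out.println(parentOf(null, "<a@example.org>"));
        System.out.println(parentOf(null, null));
    }
}
```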

Alternative algorithm that I might implement :

  • thread the whole archive and keep the root nodes in a list
  • load and parse mails from the server
  • persist all mails in wiki pages (not topics)
  • thread the whole archive and extract the root nodes
  • compare the new state of the archive with the previous state :
    • For new topics : create the topics wiki page/object
    • For topics that disappeared : remove the topics wiki page
    • For topics for which new mails were loaded : update the topic wiki page/object

The problem to solve is that the two threading algorithms do not currently make the same assumptions, and so would not produce exactly the same list of topics.
Other argument: the current algorithm works pretty well, so reworking it might not be needed.
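
The comparison step of this alternative can be sketched with plain set operations on the root nodes. This is an illustration only: TopicDiff is a hypothetical class, and topics are reduced here to string identifiers.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;
import java.util.TreeSet;

/** Hypothetical result of comparing two states of the archive, for illustration only. */
class TopicDiff {
    final Set<String> created = new TreeSet<>();
    final Set<String> removed = new TreeSet<>();
    final Set<String> updated = new TreeSet<>();

    static TopicDiff compare(Set<String> before, Set<String> after, Set<String> rootsWithNewMails) {
        TopicDiff diff = new TopicDiff();
        // New topics: roots present only after loading.
        diff.created.addAll(after);
        diff.created.removeAll(before);
        // Disappeared topics: roots present only before loading.
        diff.removed.addAll(before);
        diff.removed.removeAll(after);
        // Updated topics: roots present in both states that received new mails.
        diff.updated.addAll(rootsWithNewMails);
        diff.updated.retainAll(before);
        diff.updated.retainAll(after);
        return diff;
    }

    public static void main(String[] args) {
        Set<String> before = new HashSet<>(Arrays.asList("A", "B"));
        Set<String> after = new HashSet<>(Arrays.asList("B", "C"));
        Set<String> touched = new HashSet<>(Arrays.asList("B", "C"));
        TopicDiff diff = compare(before, after, touched);
        System.out.println("created=" + diff.created + " removed=" + diff.removed
            + " updated=" + diff.updated);
    }
}
```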

What is stored for a mail :
- main identification and informative headers (msg-id, reply-id, topic-id, references, subject ...)
- body in text format
- body in html format (zipped)
- attachments (as page attachments) : all files, images, ...
- attached emails (stored as other mails, but without Topic)
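
As an illustration of the "zipped" html body, here is a minimal sketch using the JDK's GZIP streams. The choice of GZIP is an assumption (the exact compression format used by the application is not stated here), and HtmlBodyCodec is a hypothetical class name.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;

/** Hypothetical codec for the html body, for illustration only. */
class HtmlBodyCodec {
    static byte[] zip(String html) {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (GZIPOutputStream gzip = new GZIPOutputStream(bytes)) {
            gzip.write(html.getBytes(StandardCharsets.UTF_8));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        // The stream is closed at this point, so the GZIP trailer has been written.
        return bytes.toByteArray();
    }

    static String unzip(byte[] zipped) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        try (GZIPInputStream gzip = new GZIPInputStream(new ByteArrayInputStream(zipped))) {
            byte[] buffer = new byte[4096];
            int read;
            while ((read = gzip.read(buffer)) != -1) {
                out.write(buffer, 0, read);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return new String(out.toByteArray(), StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        String html = "<html><body><p>Hello</p></body></html>";
        byte[] zipped = zip(html);
        System.out.println(unzip(zipped).equals(html));
    }
}
```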

Improvements to be done:
- parse not only text/html alternates but the full body parts, manage appropriately multipart/* content types
- parse calendar parts

Technical brainstorming

Reply Feature

The need is to connect to an outgoing SMTP server.
We will use the one configured in XWiki Preferences.

Other properties might/will be used:

  • SMTP Username: will retrieve values in this order of priority: property from MailArchiveCode.UserClass for logged user, user ID from logged user XWiki profile, property from XWikiPreferences
  • "From" address: will retrieve values in this order of priority: property from MailArchiveCode.UserClass for logged user, user email from logged user XWiki profile, admin email from XWikiPreferences
  • SMTP password: interactively entered by logged user after click on "Reply" button
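
The "order of priority" lookups above amount to taking the first configured value from an ordered list of sources. A minimal sketch, assuming that an unset property shows up as null or as a blank string (FirstConfigured is a hypothetical helper, and the values in main are made up):

```java
/** Hypothetical helper, for illustration only. */
class FirstConfigured {
    /** Returns the first candidate that is neither null nor blank, or null if there is none. */
    static String firstConfigured(String... candidatesInPriorityOrder) {
        for (String candidate : candidatesInPriorityOrder) {
            if (candidate != null && !candidate.trim().isEmpty()) {
                return candidate;
            }
        }
        return null;
    }

    public static void main(String[] args) {
        String fromUserClass = null;                // no MailArchiveCode.UserClass value
        String fromProfile = "alice@example.org";   // email from the user profile
        String fromPreferences = "admin@example.org";
        System.out.println(firstConfigured(fromUserClass, fromProfile, fromPreferences));
    }
}
```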

Note: Mail Archive will be able (as an option) to match wiki profiles by comparing the "From" email address against the XWiki profile "email" property value. If a MailArchiveCode.UserClass object exists for a profile and has a matching email address, it takes priority (this requires reworking parseUser()).
Note2: MailArchiveCode.UserClass object will be added to XWiki user profile.

Open question: we could imagine differentiating "From" email address (and/or SMTP username) depending on targeted Mailing-List, if user registered to different mailing-lists using different emails, but this leads to many use-cases (sending to multiple lists, ...), so have to think about it more.

Considerations on multi-wikis

Currently, the application is not aware of multi-wikis, doesn't care, and actually doesn't work if installed in a sub-wiki (because of some easy-to-fix but blocking issues).

As a target, I'd like to manage the following:
- Monolithic mode (current): install the full application (i.e., xwiki-contrib-mailarchive-admin-ui) in the main wiki or in any sub-wiki, configure it fully and have fun.
- Workspace mode: install the full application (xwiki-contrib-mailarchive-admin-ui) in the main wiki, configure mailing-list groups if needed to target specific sub-wikis for particular mailing-lists, and install the end-user UI application (xwiki-contrib-mailarchive-ui) in those sub-wikis. Administration, configuration and scheduled jobs then run on the main wiki, while created pages, navigation, statistics etc. are accessed from the sub-wikis. Note: this mode does not contradict the first one, i.e. you can also install the full app in another sub-wiki if you need complete independence.
- Farm mode: well, in fact I think it is the monolithic mode, as described above.
- Mixed mode: this "ideal" mode could implement some kind of configuration inheritance, i.e. it would allow defining some globally shared configuration (e.g., a mail type "Announcement" that could concern all mailing-lists), and also inheriting/overriding it with sub-wiki configuration (e.g., a mail type "XWiki Release" for mailing-list group "XWiki"). This is highly theoretical for now ...

CONS/PROS of workspace mode:

  • PROS
    • Avoids duplication of pages (very minor PRO)
    • Avoids duplication of some configurations that could be shared (for example, email types can be defined globally and reused for all defined mailing-lists/groups). That's the main advantage in fact.
  • CONS
    • Not possible to instantiate XObject if XClass is not in same wiki

Currently I implemented the separation of -admin-ui and -ui, but this is painful to maintain. It may be simpler to manage the main-wiki and sub-wiki specific options in a single -ui instead.

See discussion on this topic : Considerations on application development in multi-wiki mode.

Server Password storage

There is currently a vulnerability in the app: email account passwords are stored with the "Password" type, but end up in clear text in the database.
They are of course not displayed in logs but, in short, they can show up in the status.xml of mail archive jobs.

A mid-term solution would be to encrypt passwords in the database and decrypt them only when used, so that there is no risk of displaying them in clear at any point.
A short-term workaround may be to simply take care not to log them in the status.xml of mail archive jobs.
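
As an illustration of the mid-term idea, here is a minimal sketch based on the JDK's AES-GCM support. Everything in it is an assumption made for the example: PasswordCipher is a hypothetical class, the stored format (Base64 of IV followed by ciphertext) is arbitrary, and the question of where the key itself is kept is deliberately left out.

```java
import java.nio.charset.StandardCharsets;
import java.security.GeneralSecurityException;
import java.security.SecureRandom;
import java.util.Base64;
import javax.crypto.Cipher;
import javax.crypto.KeyGenerator;
import javax.crypto.SecretKey;
import javax.crypto.spec.GCMParameterSpec;

/** Hypothetical helper, for illustration only; key storage is out of scope. */
class PasswordCipher {
    private static final int IV_BYTES = 12;
    private static final int TAG_BITS = 128;

    private final SecretKey key;
    private final SecureRandom random = new SecureRandom();

    PasswordCipher(SecretKey key) {
        this.key = key;
    }

    static SecretKey newKey() {
        try {
            KeyGenerator generator = KeyGenerator.getInstance("AES");
            generator.init(128);
            return generator.generateKey();
        } catch (GeneralSecurityException e) {
            throw new IllegalStateException(e);
        }
    }

    /** Returns Base64(IV + ciphertext), suitable for storing instead of the clear password. */
    String encrypt(String clearPassword) {
        try {
            byte[] iv = new byte[IV_BYTES];
            random.nextBytes(iv); // a fresh IV for every encryption
            Cipher cipher = Cipher.getInstance("AES/GCM/NoPadding");
            cipher.init(Cipher.ENCRYPT_MODE, key, new GCMParameterSpec(TAG_BITS, iv));
            byte[] encrypted = cipher.doFinal(clearPassword.getBytes(StandardCharsets.UTF_8));
            byte[] stored = new byte[iv.length + encrypted.length];
            System.arraycopy(iv, 0, stored, 0, iv.length);
            System.arraycopy(encrypted, 0, stored, iv.length, encrypted.length);
            return Base64.getEncoder().encodeToString(stored);
        } catch (GeneralSecurityException e) {
            throw new IllegalStateException(e);
        }
    }

    /** Decrypts a value produced by encrypt(); to be called only when connecting. */
    String decrypt(String storedValue) {
        try {
            byte[] stored = Base64.getDecoder().decode(storedValue);
            Cipher cipher = Cipher.getInstance("AES/GCM/NoPadding");
            cipher.init(Cipher.DECRYPT_MODE, key,
                new GCMParameterSpec(TAG_BITS, stored, 0, IV_BYTES));
            byte[] clear = cipher.doFinal(stored, IV_BYTES, stored.length - IV_BYTES);
            return new String(clear, StandardCharsets.UTF_8);
        } catch (GeneralSecurityException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        PasswordCipher cipher = new PasswordCipher(newKey());
        String stored = cipher.encrypt("s3cret");
        System.out.println(cipher.decrypt(stored).equals("s3cret"));
    }
}
```

With such a scheme, only the encrypted value would be present in objects that may be serialized to a job status, and the clear password would exist only for the duration of the connection.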

See discussion on this topic : A problem of secret with jobs, and JIRA Issue XMAILARCH-40.

Moderation

Features

  • Assign one or more "moderators" to a mailing-list (or a group of mailing-lists ?)
  • Do not make a Topic "public" unless a "moderator" from one of the related mailing-lists / mailing-list groups manually authorizes it
  • Optionally:
    • EITHER do not make a Mail "public" unless a "moderator" from one of the related mailing-lists / mailing-list groups manually authorizes it
    • OR Automatically make all mails "public" without moderation if related "Topic" is "public" already ?
  • Automatically "REJECT" (or move to "moderation" "space", or to "SPAM SUSPECTS" "space") emails that contain particular keywords. Note: could reuse the "Types" concept here maybe, with a built-in type "SPAM" that would automatically check some keywords against subject/body content ?? Note 2 : how to provide such list of keywords ? (references list somewhere ?). Link to language ? XWiki feature ?
  • Reuse "XWiki moderation" ? (does it exist apart from users creation ?)

General note: all this is not really needed when the mailing-list is a "real" mailing-list, since such a tool would usually implement spam detection, moderation, etc. in the first place.
It would be more useful when your mailing-list is a mere "distribution list" and no mailing-list archive already exists in a known tool (like markmail or similar).

Progress

  • checkMails() method moved to Java component with scriptservice
  • loadMails(maxNb) implementation is done but might be rewritten soon (see alternate algo above)
  • Topics table-view has been rewritten to use a livetable instead of html sorted table
  • Now take into account references field for threading and at loading time to attach to a topic - now it can thread gmail messages !
  • wiki pages (mails / topics) creation code
  • threading algorithm rewritten 
  • test the threading/loading algorithms with a gmail server (some manual tests done)
  • removed some obsolete dependencies : mktree, section/column macros replaced by container macro
  • rewritten parsing of mail content and moved it to xwiki-contrib-mail. Now all elements (text/html/attachments) are loaded in 1 pass. To be tested.

Remaining work

For first release

  • Move the following items to a Java component :
    • parsing of body and attachments code
    • timeline generation code (done, to be tested) - not sure about that ... it generates json + inner html, and generating html from java is not really good (rework in progress)
  • Rewrite the views to use this java component (to be done, + timeline integration)
  • Remove most groovy possible from the wiki pages (in progress)
  • Improve the UI main views : home page and topics view (forum-like) and statistics, to be closer to what can be expected from this kind of mail archive :
    • easy ratings
    • minimum clicks to view a mail
    • look & feel
    • use stylesheet and javascript extensions and remove most "inline" css or javascript possible (in progress)
  • applying configured regexps for mail types still does not work at all for now ... (fixed, uts added) - reopened, as it doesn't work anymore for unknown reason
  • Add possibility to store the loaded mails to the FileSystem as a backup (in xwiki permanent data directory /mailarchive/store) (done)
  • Add possibility to reload emails from the FileSystem store (in progress: view original mail from UI (gmail feature "view original"), retrieving it from the store)
  • unit tests (almost nothing for now :( )
  • extract the xwiki-contrib-mail from xwiki-contrib-mailarchive-api, and have the mailarchive use it (done)
  • cleanup / refactoring (mostly done)
  • generate the extension .xar (done), .jar (done), and list dependencies
  • publish the code to git-contrib (done)
  • make the algorithms more robust to low-quality mail servers (be robust to absent or incoherent values in basic mail headers). This is in fact a must, because a "real" mailing-list is full of exceptions and inconsistencies ... There is no point in publishing the app if it can't manage most real-life cases (greatly improved now by using the references header and a robust threading algorithm)
  • extract new java component(s) for mail parsing / wiki page creation (future or for first release ? better now ...)
  • Get rid of mstor "fake" module (added because of conflicting dependencies)
  • Fix issue of topics created with 0 emails
  • Rework "loading" UI and integration of Scheduler jobs (in progress)
    • Fill session parameters (done)
    • Save/update/delete session with id
    • Trigger loading session from id or from BaseObject (done)
    • Stop running loading session
    • Pause/restart loading session (optional)
    • View currently running loading session
    • Schedule job with loading session id (make it a macro)

For the future ...

  • add "reply" feature (or for first release ??)
    • in UI
    • add information to a wiki user profile in UI : the email address in the list (that could be different than the email address in the wiki profile). Manage one by list, with option "use same email address for all lists". Integrate this in wiki profile as a new section, or custom class in the mail archive ?
  • Timeline : dynamically apply icons (types) and colors (lists) to the timeline - all this is hard-coded for now. Make the timeline show-up whatever option (bundled or online).
  • manage calendar mails / publish them to wiki calendar ?
  • store distinction between text/html body parts instead of only concatenating them (to be able to remove replies history more easily for example) - great rework of how the Message Parts are parsed ... RFC jungle on the way. For now all text and html parts are merely concatenated, and attachments saved apart as xwiki attachments. (done)
  • add some powerful and useful "Operations" features for the mail archive administrator :
    • split topics
    • merge topics
    • purge (with filters) (draft page done)
    • manage topics states life-cycle (not accepted / accepted / closed)
    • validation of incoming emails : manually and/or automatically (keywords ??? coherence of parts ...) to avoid spamming
  • Add possibility to load/reload a mail from a backup on FileSystem (important because it can be a way to migrate mails from a previous version of the mail archive, for instance if a new field is taken into account, during reload it would be dumped and updated in the mail pages)
  • Add possibility to import mails from a .pst folder (other formats ?)

Screens

Administration part

The administration console is split into several tabs.

Summary

The "Summary" tab gives some metrics, and also shows some warnings about the remaining steps of installation process ...

Admin_Summary.png

Global Parameters

The global parameters (default values will change) :

Admin_GlobalParameters.PNG

Enterprise parameters

The "Enterprise features" target more advanced features, and are subject to change (for now it's only LDAP, and I don't know if it will be in the first version) :

Admin_EnterpriseFeatures.png

As you can see, a new feature "Use local store" has been added. In this case, all loaded mails are backed up on the filesystem, under the XWiki permanent data directory (under mailarchive/storage).

The objective is that once backed up, these mails can be "reloaded" later, in case of deletion or after a migration impacting the structure of mail objects. It could also be used to migrate from one Mail Archive instance to another. The provider used is "mstor", with a format close to Thunderbird's, so maybe it could also bring some sort of Thunderbird import option.

Servers

The following should be used to add server connections, and specify folders to "consume" emails from :

Admin_Servers.png

When clicking on "host name" link, the specific server page is opened, where you can also test the connection :

Servers_View.png

(the "status" is also displayed in livetable above, though it will be changed to a human-readable connection state)

When clicking on the "Add" button on the Admin Servers tab, a pop-up provides the fields in edit mode. After saving, the user is taken back to the Servers tab.

Mailing-lists

Mailing-lists and mail types use the same type of tab :

Admin_Lists.png

The List-ID, if specified, links this mailing-list to a real email list (as described in RFC 2919), and may later be used to provide specific links (subscribe, unsubscribe, post, help ...) retrieved from email headers.

As you can see, servers and mailing-lists are not linked in any way: we defined 2 server connections (in fact 2 folders on the same mail account), but they contain 3 different mailing-lists. This could be a problem if you are registered to several lists, as mails sent to more than one list might be duplicated. In fact it could be an acceptable constraint of this app.

Mail Types

Admin_Types.png

As an added feature to improve readability and coherence, user can assign a color to a mailing-list, and an icon to a mail type. The colors and icons are also used in the timeline view, and in the statistics (and wherever possible).

Advanced parameters

A new Advanced Parameters tab has been added. Most of the time users should not have to bother with it; it is there mostly for compatibility with existing instances of the mail archive (here on my server :) ) :

Admin_AdvancedParameters.png

Notes :

  • for types I'm not sure the regexp will exactly look like this ...
  • The "Mail" type will certainly be "built-in" as it's the default type of any mail. For other types there will certainly be some kind of "priority" to be applied, as a mail can't be of several types (or should it be ... ?)
  • Types patterns are stored in a text-area, with each pattern occupying two lines : first line is the list of fields (separated by ','), and second line is the regexp. The Sheet for MailType will allow to enter the patterns easily.
  • First release of this app will be tested to work against exchange and gmail mail servers. It should work with other servers as long as they manage properly basic mail headers. To work for every server some robustness improvements against basic headers should be done.
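
The two-lines-per-pattern storage described in the notes above could be read back as follows. This is a sketch under assumptions: TypePattern and TypePatternParser are hypothetical names, the text-area is assumed to contain no blank lines, and a pattern is assumed to match when its regexp is found in at least one of the listed fields.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.regex.Pattern;

/** Hypothetical parsed pattern: a list of mail fields and the regexp to apply to them. */
class TypePattern {
    final List<String> fields;
    final Pattern regexp;

    TypePattern(List<String> fields, Pattern regexp) {
        this.fields = fields;
        this.regexp = regexp;
    }

    /** Assumption: one matching field is enough. */
    boolean matches(Map<String, String> mailFields) {
        for (String field : fields) {
            String value = mailFields.get(field);
            if (value != null && regexp.matcher(value).find()) {
                return true;
            }
        }
        return false;
    }
}

class TypePatternParser {
    /** First line of each pair: fields separated by ','. Second line: the regexp. */
    static List<TypePattern> parse(String textArea) {
        String[] lines = textArea.split("\\r?\\n");
        List<TypePattern> patterns = new ArrayList<>();
        for (int i = 0; i + 1 < lines.length; i += 2) {
            List<String> fields = new ArrayList<>();
            for (String field : lines[i].split(",")) {
                if (!field.trim().isEmpty()) {
                    fields.add(field.trim());
                }
            }
            patterns.add(new TypePattern(fields, Pattern.compile(lines[i + 1])));
        }
        return patterns;
    }

    public static void main(String[] args) {
        List<TypePattern> patterns = parse("subject\n^\\[ANN\\]\nfrom,sender\njira@");
        Map<String, String> mail = new HashMap<>();
        mail.put("subject", "[ANN] New release");
        System.out.println(patterns.get(0).matches(mail));
    }
}
```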

Navigation

Home page

Here is a preview of the home page :

Home_Livetable.png

Notice that the tag cloud is used to filter mailing-lists and types (very practical). It's a standard xwiki livetable with a customized macro for computing number of messages for a topic, and showing avatar of user from "From" header. Also, the ratings column should not be displayed if the ratings macro & app are missing, or if ratings option is unchecked in global prefs.

Clicking on a topic opens it, clicking on a user link shows some stats about user (maybe this could be configurable through an option, if you don't want to show stats).

Topics

A topic viewed in "forum" view, here with "Mint" theme :

Topic_ForumView.png

When a "From" email address is matched with a wiki profile (and/or an LDAP account), and if proper options are activated, the link to the profile is added for this user, and the avatar of this person (if any) is shown, as displayed here for Vincent.

The "List" view is another typical livetable view; the "threads" view is not shown here as it's still unfinished. All in all, what is missing from the forum view is the possibility to sort messages differently (by default they are sorted by thread order), and the ratings (if activated and if the needed extensions are present).

Statistics

Some stats about the archive, a specific list, a specific user, and combinations, will be available. For now it's far from final but here is a little preview :

Stats_Lists.png

Stats_Time.png

As you can see, the colors defined for the lists are used in the stats.

Note : the implementation using the XWiki {{chart}} macro is fairly far along, but as I was not very satisfied with the result, I tried to implement the stats using Highcharts.JS with the prototype adapter.

As you can see the result is quite nice, though it lacks the animations you get with highcharts + jquery. For now it's a big spaghetti mess of velocity, javascript and SQL, but maybe a new extension could be created from it in the end, something like a {{highchart}} macro. Ideally it could have the same options & syntax as the {{chart}} macro.

PROS :

  • nice look&feel
  • dynamic (tooltips ...)
  • better control (colors, legends, positioning, ...)

CONS:

  • the charts are not images, so they cannot be exported to pdf or other formats

There are also some statistics about a specific user:

Stats_Users.png

Spreading the archive

Another view ("remote integration") provides users the possibility to easily integrate information from the archive in other sites or applications.

RemoteIntegration.png

This view lets you choose some filters (lists, types, ...), the wanted format (html/xml/json/rss), and where you want to put it (html page or xwiki page).
Then it will generate the content block that you just have to paste in your wiki/html page. For XML/JSON/RSS it merely provides a URL that you can use elsewhere as feed provider.

Dependencies

(O)ptional vs (M)andatory dependencies list :

  • Tabs Macro (M)
  • Add excel export to livetable macro (O)
  • Simile Timeline (O)
  • Ratings application + plugin (O)

Licenses & third-parties

  • Grendel threading algorithm (MPL)
  • Java lib-pst (Apache License 2)
    • To read .pst outlook format for .pst import
  • mstor and/or javamaildir for backups on filesystem (javamail Store providers) and the dependencies that come with them ... (ehcache, backport-edu-concurrent)
