messy

A tool suite for electronic messages.

Features

Input Formats

Messy recursively reads archive and container formats and parses several types of messages. Typically, one or more messages are stored in a file using a container format. One or more of those container files are then stored within an archive.
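The layered reading described above can be illustrated with a minimal sketch that uses only the JDK. It handles just one layer type (gzip) and keeps unwrapping while the data still starts with the gzip signature bytes `1f 8b`. The class `GzipLayers` and its methods are hypothetical names chosen for this illustration and are not taken from the messy code base, which may work quite differently.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;

/** Hypothetical illustration, not part of messy: peels off nested gzip layers. */
final class GzipLayers {
  private GzipLayers() {
  }

  /** Decompresses repeatedly while the data starts with the gzip signature. */
  static byte[] unwrap(byte[] data) {
    try {
      while (data.length >= 2 && (data[0] & 0xff) == 0x1f && (data[1] & 0xff) == 0x8b) {
        try (GZIPInputStream in = new GZIPInputStream(new ByteArrayInputStream(data))) {
          ByteArrayOutputStream out = new ByteArrayOutputStream();
          byte[] buffer = new byte[8192];
          int n;
          while ((n = in.read(buffer)) != -1) {
            out.write(buffer, 0, n);
          }
          data = out.toByteArray();
        }
      }
      return data;
    } catch (IOException e) {
      // Payload that merely looks like gzip ends up here.
      throw new UncheckedIOException(e);
    }
  }

  /** Helper for trying out the sketch: compresses data with gzip. */
  static byte[] gzip(byte[] data) {
    try {
      ByteArrayOutputStream out = new ByteArrayOutputStream();
      try (GZIPOutputStream gz = new GZIPOutputStream(out)) {
        gz.write(data);
      }
      return out.toByteArray();
    } catch (IOException e) {
      throw new UncheckedIOException(e);
    }
  }
}
```

Calling `GzipLayers.unwrap(GzipLayers.gzip(GzipLayers.gzip(bytes)))` returns the original bytes. A real reader would stream instead of buffering whole files in memory and would dispatch to other formats at each level.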

Archive Formats

These are general-purpose archive formats, not specific to messages.

Single compressed files
- Gzip (.gz)
- Bzip2 (.bz2)
- Compress (.Z)
Multiple files stored without compression
- Tar (.tar)
Multiple files stored with compression
- Zip (.zip)
- 7-Zip (.7z)
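
As an illustration of how the archive formats listed above can be told apart, here is a minimal sketch that checks the well-known signature bytes of each format. The class `ArchiveSniffer` is a hypothetical name for this example. Whether messy identifies archives by signature, by file extension or by both is not stated here, so treat this as one possible approach only.

```java
/** Hypothetical illustration, not part of messy: guesses an archive format from leading bytes. */
final class ArchiveSniffer {
  private ArchiveSniffer() {
  }

  /** Returns true if data contains the given unsigned byte values at offset. */
  static boolean hasBytesAt(byte[] data, int offset, int... signature) {
    if (data.length < offset + signature.length) {
      return false;
    }
    for (int i = 0; i < signature.length; i++) {
      if ((data[offset + i] & 0xff) != signature[i]) {
        return false;
      }
    }
    return true;
  }

  /** Expects at least the first 262 bytes of a file to be able to recognize tar. */
  static String detect(byte[] head) {
    if (hasBytesAt(head, 0, 0x1f, 0x8b)) {
      return "gzip";
    }
    if (hasBytesAt(head, 0, 'B', 'Z', 'h')) {
      return "bzip2";
    }
    if (hasBytesAt(head, 0, 0x1f, 0x9d)) {
      return "compress";
    }
    // Local file header; an empty zip archive starts differently and is not covered.
    if (hasBytesAt(head, 0, 'P', 'K', 0x03, 0x04)) {
      return "zip";
    }
    if (hasBytesAt(head, 0, '7', 'z', 0xbc, 0xaf, 0x27, 0x1c)) {
      return "7z";
    }
    // POSIX and GNU tar store "ustar" at offset 257; very old tar files have no signature.
    if (hasBytesAt(head, 257, 'u', 's', 't', 'a', 'r')) {
      return "tar";
    }
    return "unknown";
  }
}
```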

Container Formats

Mbox files.
- Supports various subtypes.
Newline-delimited JSON (ndjson) files.
Hamster data.dat files.
Single-message files.
- File extensions .eml and .msg.
- Newsspool messages, no file extension, name is an integer number.
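
To make the idea of a container format concrete, the following sketch splits a simple mbox file into its messages. It assumes one common convention only: a message starts at a line beginning with `From ` that is either the first line or follows an empty line, and body lines beginning with `From ` have been escaped as `>From `. The class `MboxSplitter` is hypothetical, and the subtypes messy supports may need different rules.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical illustration, not part of messy: splits mbox text into messages. */
final class MboxSplitter {
  private MboxSplitter() {
  }

  /** Returns each message without its "From " separator line. */
  static List<String> split(String mbox) {
    List<String> messages = new ArrayList<>();
    StringBuilder current = null;
    boolean previousBlank = true; // the start of the file counts as a boundary
    for (String line : mbox.split("\r?\n")) {
      if (previousBlank && line.startsWith("From ")) {
        if (current != null) {
          messages.add(current.toString());
        }
        current = new StringBuilder();
      } else if (current != null) {
        current.append(line).append('\n');
      }
      previousBlank = line.isEmpty();
    }
    if (current != null) {
      messages.add(current.toString());
    }
    return messages;
  }
}
```

The sketch holds the whole file in a string for brevity. For archives of realistic size, a streaming reader would be the more sensible choice.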

Message Formats

Internet Message Format (IMF) used with email and Usenet messages.
A News messages, a Usenet message format from the 1980s.
JSON tweets distributed as a directory tree of files, each compressed with bzip2, with the directory tree then packed into a single tar archive file.
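
For the Internet Message Format, the header section consists of `name: value` fields, may continue a field on following lines that start with a space or tab, and ends at the first empty line. The sketch below parses that section under simplifying assumptions: repeated fields overwrite each other, and folded lines are joined with a single space. The class `ImfHeaders` is hypothetical. The technology stack lists Jakarta Mail for MIME support, so this is not meant to reflect how messy parses messages.

```java
import java.util.LinkedHashMap;
import java.util.Locale;
import java.util.Map;

/** Hypothetical illustration, not part of messy: reads IMF header fields. */
final class ImfHeaders {
  private ImfHeaders() {
  }

  /** Returns header fields keyed by lower-case field name, in order of appearance. */
  static Map<String, String> parse(String message) {
    Map<String, String> headers = new LinkedHashMap<>();
    String lastName = null;
    for (String line : message.split("\r?\n")) {
      if (line.isEmpty()) {
        break; // an empty line separates the header section from the body
      }
      char first = line.charAt(0);
      if ((first == ' ' || first == '\t') && lastName != null) {
        // Folded continuation line: append to the previous field (simplified unfolding).
        headers.put(lastName, headers.get(lastName) + " " + line.trim());
        continue;
      }
      int colon = line.indexOf(':');
      if (colon <= 0) {
        continue; // skip malformed lines
      }
      lastName = line.substring(0, colon).trim().toLowerCase(Locale.ROOT);
      headers.put(lastName, line.substring(colon + 1).trim());
    }
    return headers;
  }
}
```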

Storage

Upload messages to an Elasticsearch instance.

Status

Created November 8th, 2020. As of 2021, a one-person hobby project. Command-line application msgcli can be used to explore message archives, converting messages to JSON and printing them to standard output.

Goals

Human Goals

Help users sort through, triage, clean up and consolidate their messages as a basis for discovery, backup and archival.
Enable digital preservation of public messages as a part of computing history.
Simplify bulk exchange of messages between interested parties.

Technological Goals

Parse electronic messages of various types.
Support different file formats.
Read messages from servers with different protocols.
Handle extraction of attachments and references to external information.
Create a message database with full text search and reporting.
Analyze messages to allow more fine-grained search and to separate public from private ones.

Command-Line Application

The command-line application msgcli reads messages from standard input or files, converts them, and either prints a summary of each message to standard output or uploads it to Elasticsearch.

Clone the git repository and install msgcli locally:

$ javac -version
# ... should print version 1.8 or higher
$ cd ~
$ git clone https://github.com/marco-schmidt/messy.git
...
$ cd messy
$ ./gradlew :msgcli:install
...
$ alias m='/path/to/homedir/messy/msgcli/build/install/msgcli/bin/msgcli'
$ m ../test.mbox
...

The application can now be used with m.

This makes msgcli upload the content of a Twitter stream tar file to a local Elasticsearch instance listening on port 9200:

$ export MESSY_OUTPUT_FORMAT=ELASTIC
$ m /path/to/twitter-stream-2017-07-01.tar
{"@timestamp":"2021-12-04T16:49:56.631+01:00","message":"Connected to Elastic server 'localhost:9200'.","logger_name":"messy.msgsearch.elastic.ElasticOutputProcessor","thread_name":"main","level":"INFO","level_value":20000,"server_type":"Elastic","host":"localhost","port":9200,"app_name":"msgcli"}
{"@timestamp":"2021-12-04T16:49:56.655+01:00","message":"Opening file '/path/to/twitter-stream-2017-07-01.tar' (35864390 bytes).","logger_name":"messy.msgcli.app.InputProcessor","thread_name":"main","level":"INFO","level_value":20000,"file_name":"/path/to/twitter-stream-2017-07-01.tar","file_size":35864390,"app_name":"msgcli"}
...

This uses the Unix tool find to create a list of mbox files and pipes their names to msgcli, which prints two properties of each message as tab-separated values to standard output:

$ export MESSY_OUTPUT_FORMAT=TSV
$ export MESSY_OUTPUT_ITEMS=AUTHOR_ID,AUTHOR_NAME
$ find /mnt/hdd2/archive/usenet -type f -name '*.mbox'|m -@
...
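
As a rough idea of what producing one tab-separated line involves, the sketch below joins field values with tabs. It replaces tabs and line breaks inside a value with spaces so that a value cannot break the row layout. That replacement rule and the class name `TsvLine` are assumptions made for this example; how msgcli actually treats such characters is not documented here.

```java
/** Hypothetical illustration, not part of messy: formats values as one TSV line. */
final class TsvLine {
  private TsvLine() {
  }

  /** Joins values with tabs; null becomes an empty field. */
  static String format(String... values) {
    StringBuilder sb = new StringBuilder();
    for (int i = 0; i < values.length; i++) {
      if (i > 0) {
        sb.append('\t');
      }
      String value = values[i] == null ? "" : values[i];
      // Assumption: separators inside a value are replaced with spaces.
      sb.append(value.replace('\t', ' ').replace('\r', ' ').replace('\n', ' '));
    }
    return sb.toString();
  }
}
```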

Known Limitations

7-Zip archives can only be opened as standalone files, not when they are nested inside other archives.
Hamster message data files have no magic bytes file signature to properly identify them. Their file name data.dat is therefore used to detect them.
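
Detection by file name, as described for Hamster files above and for single-message files under Container Formats, could look like the following sketch. The class `NameBasedDetector`, its return values and the case-insensitive comparison are assumptions for illustration. The method expects a bare file name without any directory part.

```java
import java.util.Locale;

/** Hypothetical illustration, not part of messy: guesses a container type from a file name. */
final class NameBasedDetector {
  private NameBasedDetector() {
  }

  static String detect(String fileName) {
    // Assumption: names are compared without regard to case.
    String name = fileName.toLowerCase(Locale.ROOT);
    if (name.equals("data.dat")) {
      return "hamster";
    }
    if (name.endsWith(".eml") || name.endsWith(".msg")) {
      return "single-message";
    }
    // Newsspool messages have no extension and an integer number as their name.
    if (!name.isEmpty() && name.chars().allMatch(c -> c >= '0' && c <= '9')) {
      return "newsspool";
    }
    return "unknown";
  }
}
```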

Technology Stack

Written in Java 8, using Adoptium (but any JDK version 8 or higher should do).
Built with Gradle as a multi-project build, using the Gradle wrapper.
Hosted in a public git repository at GitHub.
Continuous integration with GitHub Workflow Java CI.
Dependencies:
- JUnit for unit tests,
- archive I/O from Apache Commons Compress,
- MIME support from Jakarta Mail,
- logging with SLF4J and Logback,
- Lucene for full-text search.
Static code analysis with
- gradle plugins SpotBugs, checkstyle and Forbidden API Checker and
- service Codacy.
Project comes with an Eclipse configuration file and gradle is configured to generate a workspace for Eclipse. Any other Java IDE will probably also work.
Code formatting and license headers with the gradle spotless plugin. Eclipse also formats automatically on save if the provided configuration file is used (see Development Setup for the gradle Eclipse workspace setup).
Vulnerability analysis:
- Gradle plugin dependencyCheck. It compares direct and transitive dependencies to CVE entries in the National Vulnerability Database (NVD).
- GitHub workflow service CodeQL.
API documentation with javadoc.
Code coverage reporting with jacoco and codecov.io.
Check for new versions of dependencies with gradle plugin versions.
Create reports of dependencies and their licenses and check the licenses against a positive list.

Development Setup

Install JDK 8 or higher on the system.
Set the environment variable JAVA_HOME to the JDK installation path and include its bin subdirectory in the PATH variable. Run javac -version, and possibly which java, to make sure that the right Java compiler and virtual machine are now available.
Clone the messy git repository.
Navigate to the cloned working copy and run ./gradlew check as an initial toolchain check.
Install the Eclipse IDE, run ./gradlew eclipse in the cloned working copy, then open Eclipse and import the projects msg*.
