A tool suite for electronic messages.
Messy recursively reads archive and container formats and parses several types of messages. Typically, one or more messages are stored in a file using a container format. One or more of those container files are then stored within an archive.
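The layering described above can be illustrated with a small recursive walker. This is only a sketch and not messy's own code: messy relies on Apache Commons Compress for archive I/O, while this example is limited to the gzip and zip support in the JDK so that it stays self-contained. The class name NestedReader, the method leaves and the "!" separator in entry names are hypothetical choices made for this illustration.

```java
import java.io.BufferedInputStream;
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;
import java.util.zip.ZipEntry;
import java.util.zip.ZipInputStream;
import java.util.zip.ZipOutputStream;

/** Hypothetical sketch, not messy's own code: walks nested gzip and zip layers. */
final class NestedReader {
  /** Returns the names of all innermost (non-archive) entries below the stream. */
  static List<String> leaves(String name, InputStream raw) throws IOException {
    List<String> found = new ArrayList<>();
    walk(name, raw, found);
    return found;
  }

  private static void walk(String name, InputStream raw, List<String> found) throws IOException {
    // Buffer the stream so the first bytes can be inspected and then re-read.
    BufferedInputStream in = new BufferedInputStream(raw);
    byte[] head = new byte[4];
    in.mark(head.length);
    int n = 0;
    while (n < head.length) {
      int r = in.read(head, n, head.length - n);
      if (r < 0) {
        break;
      }
      n += r;
    }
    in.reset();

    if (n >= 2 && (head[0] & 0xff) == 0x1f && (head[1] & 0xff) == 0x8b) {
      // gzip wraps exactly one stream: unwrap it and inspect the result again.
      String inner = name.endsWith(".gz") ? name.substring(0, name.length() - 3) : name;
      walk(inner, new GZIPInputStream(in), found);
    } else if (n == 4 && head[0] == 'P' && head[1] == 'K' && head[2] == 3 && head[3] == 4) {
      // zip holds several entries: recurse into each one in turn.
      ZipInputStream zip = new ZipInputStream(in);
      for (ZipEntry entry; (entry = zip.getNextEntry()) != null;) {
        if (!entry.isDirectory()) {
          walk(name + "!" + entry.getName(), zip, found);
        }
      }
    } else {
      // Anything else is treated as a leaf, e.g. a container or message file.
      found.add(name);
    }
  }

  public static void main(String[] args) throws IOException {
    // Build a small zip in memory holding a plain file and a gzip-compressed file.
    ByteArrayOutputStream gz = new ByteArrayOutputStream();
    try (GZIPOutputStream out = new GZIPOutputStream(gz)) {
      out.write("Subject: two\n\nbody\n".getBytes(StandardCharsets.US_ASCII));
    }
    ByteArrayOutputStream zipped = new ByteArrayOutputStream();
    try (ZipOutputStream out = new ZipOutputStream(zipped)) {
      out.putNextEntry(new ZipEntry("1.eml"));
      out.write("Subject: one\n\nbody\n".getBytes(StandardCharsets.US_ASCII));
      out.closeEntry();
      out.putNextEntry(new ZipEntry("2.eml.gz"));
      out.write(gz.toByteArray());
      out.closeEntry();
    }
    // Prints [demo.zip!1.eml, demo.zip!2.eml]
    System.out.println(leaves("demo.zip", new ByteArrayInputStream(zipped.toByteArray())));
  }
}
```

The walker never looks at file extensions to decide what to unwrap. It checks the leading bytes of each stream, so a compressed file inside an archive inside another compressed file is handled by the same three branches.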
These are general-purpose archive formats, not specific to messages.
- Single compressed files
  - Gzip (.gz)
  - Bzip2 (.bz2)
  - Compress (.Z)
- Multiple files stored without compression
  - Tar (.tar)
- Multiple files stored with compression
  - Zip (.zip)
  - 7-Zip (.7z)
- Mbox files.
  - Supports various subtypes.
- Newline-delimited JSON (ndjson) files.
- Hamster data.dat files.
- Single-message files.
  - File extensions .eml and .msg.
  - Newsspool messages, no file extension, name is an integer number.
- Internet Message Format (IMF) used with email and Usenet messages.
- A News messages, a 1980s format for Usenet messages.
- JSON tweets distributed as a directory tree of files, each compressed with bzip2; the directory tree is then packed into a single tar archive file.
- Upload messages to an Elasticsearch instance.
Created November 8th, 2020. As of 2021, a one-person hobby project. Command-line application msgcli can be used to explore message archives, converting messages to JSON and printing them to standard output.
- Help users sort through, triage, clean up and consolidate their messages as a basis for discovery, backup and archival.
- Enable digital preservation of public messages as a part of computing history.
- Simplify bulk exchange of messages between interested parties.
- Parse electronic messages of various types.
- Support different file formats.
- Read messages from servers with different protocols.
- Handle extraction of attachments and references to external information.
- Create a message database with full text search and reporting.
- Analyze messages to allow more fine-grained search, separate public from private ones.
Command-line application msgcli reads messages from standard input or files, converts them and prints a summary of each message to standard output or uploads it to Elasticsearch.
Clone the git repository and install msgcli locally:
$ javac -version
# ... should print version 1.8 or higher
$ cd ~
$ git clone https://github.com/marco-schmidt/messy.git
...
$ cd messy
$ ./gradlew :msgcli:install
...
$ alias m='/path/to/homedir/messy/msgcli/build/install/msgcli/bin/msgcli'
$ m ../test.mbox
...
The application can now be used with m.
This makes msgcli upload the content of a Twitter stream tar file to an Elasticsearch instance running locally and listening on port 9200:
$ export MESSY_OUTPUT_FORMAT=ELASTIC
$ m /path/to/twitter-stream-2017-07-01.tar
{"@timestamp":"2021-12-04T16:49:56.631+01:00","message":"Connected to Elastic server 'localhost:9200'.","logger_name":"messy.msgsearch.elastic.ElasticOutputProcessor","thread_name":"main","level":"INFO","level_value":20000,"server_type":"Elastic","host":"localhost","port":9200,"app_name":"msgcli"}
{"@timestamp":"2021-12-04T16:49:56.655+01:00","message":"Opening file '/path/to/twitter-stream-2017-07-01.tar' (35864390 bytes).","logger_name":"messy.msgcli.app.InputProcessor","thread_name":"main","level":"INFO","level_value":20000,"file_name":"/path/to/twitter-stream-2017-07-01.tar","file_size":35864390,"app_name":"msgcli"}
...
This uses the Unix tool find to create a list of mbox files and pipe it to msgcli, which prints two properties as tab-separated values to standard output:
$ export MESSY_OUTPUT_FORMAT=TSV
$ export MESSY_OUTPUT_ITEMS=AUTHOR_ID,AUTHOR_NAME
$ find /mnt/hdd2/archive/usenet -type f -name '*.mbox'|m -@
...
- 7-Zip archives can only be opened as standalone files, not when they are nested inside other archives.
- Hamster message data files have no magic bytes file signature to properly identify them. Their file name data.dat is therefore used to detect them.
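
The difference between signature-based and name-based detection can be sketched as follows. This is an illustration under stated assumptions, not messy's actual detection code: the class FormatSniffer, its Format enum and the order of the checks are hypothetical. The byte signatures are the well-known leading bytes of gzip, bzip2, compress, zip and 7-Zip files. The fallback on the file name data.dat follows the limitation described above.

```java
/** Hypothetical sketch, not messy's actual detection code. */
final class FormatSniffer {
  enum Format { GZIP, BZIP2, COMPRESS, ZIP, SEVEN_ZIP, HAMSTER, UNKNOWN }

  /** True if data begins with the given unsigned byte values. */
  private static boolean startsWith(byte[] data, int... signature) {
    if (data.length < signature.length) {
      return false;
    }
    for (int i = 0; i < signature.length; i++) {
      if ((data[i] & 0xff) != signature[i]) {
        return false;
      }
    }
    return true;
  }

  /**
   * Guesses a format from the first bytes of a file and, only if no
   * signature matches, from its name (assumed order of checks).
   */
  static Format detect(String fileName, byte[] head) {
    if (startsWith(head, 0x1f, 0x8b)) {
      return Format.GZIP;
    }
    if (startsWith(head, 'B', 'Z', 'h')) {
      return Format.BZIP2;
    }
    if (startsWith(head, 0x1f, 0x9d)) {
      return Format.COMPRESS;
    }
    if (startsWith(head, 'P', 'K', 0x03, 0x04)) {
      return Format.ZIP;
    }
    if (startsWith(head, '7', 'z', 0xbc, 0xaf, 0x27, 0x1c)) {
      return Format.SEVEN_ZIP;
    }
    // Hamster files carry no signature, so only the file name can identify them.
    int slash = Math.max(fileName.lastIndexOf('/'), fileName.lastIndexOf('\\'));
    if (fileName.substring(slash + 1).equals("data.dat")) {
      return Format.HAMSTER;
    }
    return Format.UNKNOWN;
  }

  public static void main(String[] args) {
    // Prints GZIP
    System.out.println(detect("mail.gz", new byte[] {0x1f, (byte) 0x8b, 8}));
    // Prints HAMSTER (the directory is a made-up example path)
    System.out.println(detect("/some/dir/data.dat", new byte[0]));
  }
}
```

A name-based rule is weaker than a signature: any unrelated file called data.dat would be classified as Hamster data in this sketch, and a renamed Hamster file would be missed.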
- Written in Java 8, using Adoptium (but any JDK version 8 or higher should do).
- Build tool gradle, as a multi-project build with the gradle wrapper.
- Hosted in a public git repository at GitHub.
- Continuous integration with GitHub Workflow Java CI.
- Dependencies:
  - JUnit for unit tests,
  - archive I/O from Apache Commons Compress,
  - MIME support from Jakarta Mail,
  - logging with SLF4J and Logback,
  - Lucene for full-text search.
- Static code analysis with
  - gradle plugins SpotBugs, checkstyle and Forbidden API Checker and
  - service Codacy.
- Project comes with an Eclipse configuration file and gradle is configured to generate a workspace for Eclipse. Any other Java IDE will probably also work.
- Code formatting and license headers with the gradle spotless plugin. Code is also formatted automatically when saving in Eclipse (if the provided configuration file is used, see below for gradle Eclipse workspace setup).
- Vulnerability analysis:
  - Gradle plugin dependencyCheck. It compares direct and transitive dependencies to CVE entries in the National Vulnerability Database (NVD).
  - GitHub workflow service CodeQL.
- API documentation with javadoc.
- Code coverage reporting with jacoco and codecov.io.
- Check for new versions of dependencies with gradle plugin versions.
- Create reports of dependencies and their licenses and check the licenses against a positive list.
- Install JDK 8 or higher on the system.
- Set environment variable JAVA_HOME to the JDK installation path and include its bin subdirectory in the PATH variable. Run javac -version and possibly which java to make sure that the right Java compiler and virtual machine are now available.
- Clone the messy git repository.
- Navigate to the cloned working copy and run ./gradlew check as an initial toolchain check.
- Install Eclipse IDE, run ./gradlew eclipse in the cloned working copy, open Eclipse and import projects msg*.