Add Gmail takeout mbox import (v2) #8

Open: 21 commits to be merged into base branch master

maxhawkins

WIP

This PR builds on #5 to continue implementing gmail import support.

Building on @UtahDave's work, these commits add a few performance and bug fixes:

  • Decreased memory overhead for import by manually parsing mbox headers.
  • Fixed error where some messages in the mbox would yield a row with NULL in all columns.

I will send more commits to fix any errors I encounter as I run the importer on my personal takeout data.

UtahDave and others added 12 commits February 22, 2021 12:56
Parsing the mbox file manually instead of using Python's built-in
parser allows us to process large files without loading them into
memory all at once.
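
A minimal sketch of the streaming idea in this commit, not the PR's actual code: it assumes messages are delimited by lines starting with `From ` and ignores escaping rules such as `>From ` inside bodies. The name `iter_mbox` is hypothetical.

```python
import io
from typing import BinaryIO, Iterator, Tuple


def iter_mbox(fileobj: BinaryIO) -> Iterator[Tuple[bytes, bytes]]:
    """Yield (from_line, raw_message) pairs, holding one message in memory at a time."""
    from_line = None
    lines = []
    for line in fileobj:
        if line.startswith(b"From "):
            # A new "From " separator closes the previous message, if any.
            if from_line is not None:
                yield from_line, b"".join(lines)
            from_line = line.rstrip(b"\r\n")
            lines = []
        elif from_line is not None:
            lines.append(line)
    # Flush the final message once the file is exhausted.
    if from_line is not None:
        yield from_line, b"".join(lines)


# Usage with an in-memory file; a real archive would be opened with open(path, "rb").
sample = io.BytesIO(
    b"From a@example.com Mon Feb 22 12:56:00 2021\n"
    b"Subject: one\n\nhello\n\n"
    b"From b@example.com Tue Feb 23 08:00:00 2021\n"
    b"Subject: two\n\nworld\n"
)
messages = list(iter_mbox(sample))
```

Because the generator only buffers the current message, peak memory tracks the largest single message rather than the size of the archive.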
This fixes a regression introduced by the previous commit where messages no longer fetch the date from the mbox 'From ' line. For messages without a Date header this means we lose information about the delivery date.
Some messages (like gchat logs) don't have a Message-Id and therefore don't save properly. This commit falls back to the Gmail X-GM-THRID header if the Message-Id is missing.
The function email.utils.parsedate_tz expects a str, but we were passing bytes. Casting to str fixes an exception in messages where the Date header is missing and the delivery time must be inferred from the mbox header.
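
A rough sketch of the bytes-versus-str fix described in this commit. The helper name `parse_date` is hypothetical, and decoding as ASCII is an assumption here; the PR's exact conversion is not shown in this thread.

```python
from datetime import datetime, timezone
from email.utils import mktime_tz, parsedate_tz
from typing import Optional, Union


def parse_date(value: Union[bytes, str, None]) -> Optional[datetime]:
    """Parse a date taken from a Date header or an mbox 'From ' line."""
    if value is None:
        return None
    if isinstance(value, bytes):
        # parsedate_tz expects str, so decode bytes before calling it.
        value = value.decode("ascii", errors="replace")
    parsed = parsedate_tz(value)
    if parsed is None:
        # parsedate_tz returns None when it cannot make sense of the input.
        return None
    return datetime.fromtimestamp(mktime_tz(parsed), tz=timezone.utc)
```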
Some messages (like chats) don't have a Message-Id MIME header, so the message is saved without a primary key.

A previous commit used the thread id in this situation, but the same thread id can be shared by multiple messages. This id, which is the message id used by the Gmail API, should be unique across all messages.
The docs note: "The policy keyword should always be specified; The default will change to email.policy.default in a future version of Python."
This shouldn't happen in RFC-abiding messages, but raw Unicode or other non-ASCII content will cause the header parser to return a Header object rather than a str. Improve handling of this case and add a simple unit test.
If the string is invalid, the undecoded string is returned instead.
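
As an illustration of that fallback, here is a small sketch using the stdlib `email.header` helpers. The function name `decode_rfc2047` is hypothetical and the set of exceptions caught is an assumption, not taken from the PR.

```python
from email.errors import HeaderParseError
from email.header import decode_header, make_header


def decode_rfc2047(value) -> str:
    """Decode an RFC 2047 encoded header; return the input text unchanged on failure."""
    # The header parser may hand back a Header object rather than a str.
    text = str(value)
    try:
        return str(make_header(decode_header(text)))
    except (HeaderParseError, LookupError, UnicodeError, ValueError):
        # Unknown charset or malformed encoding: keep the undecoded string.
        return text
```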
@maxhawkins
Author

Just added two more fixes:

  • Added parsing for RFC 2047 encoded Unicode headers.
  • Body is now stored as TEXT rather than a BLOB, regardless of the order in which the messages are parsed.

I was able to run this on my Takeout export and everything seems to work fine. @simonw let me know if this looks good to merge.

@maxhawkins changed the title from "WIP: Add Gmail takeout mbox import (v2)" to "Add Gmail takeout mbox import (v2)" on Aug 7, 2021
In some instances tables would be created with the wrong column types if the initial records had unexpected types. This fixes the issue by explicitly creating the table and specifying types.
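
A sketch of the idea using the stdlib `sqlite3` module. The table name, column names, and helper names below are hypothetical, and the PR may well use a different library; the point is only that declaring the schema up front removes the dependence on whatever types the first records happen to have.

```python
import sqlite3


def create_messages_table(conn: sqlite3.Connection) -> None:
    """Create the table with explicit column types before any rows are inserted."""
    conn.execute(
        """
        CREATE TABLE IF NOT EXISTS messages (
            id TEXT PRIMARY KEY,
            subject TEXT,
            date TEXT,
            body TEXT
        )
        """
    )


def column_types(conn: sqlite3.Connection, table: str) -> dict:
    """Return a {column name: declared type} mapping for a table."""
    # PRAGMA table_info rows are (cid, name, type, notnull, dflt_value, pk).
    return {row[1]: row[2] for row in conn.execute(f"PRAGMA table_info({table})")}
```

With the schema fixed in advance, a first record whose body is NULL or bytes no longer decides the declared type of the `body` column.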
Using this newer email parsing code enables parsing of attachments and easier parsing of html emails in the future.
This may be more robust than the tree-walking method we were using earlier, and will enable parsing of html email contents in a future commit.
(Only if no text/plain alternative exists)
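
A hedged sketch of how the newer parsing API can pick a body, preferring text/plain and falling back to text/html. The PR uses BeautifulSoup for the HTML step; to stay self-contained this sketch swaps in the stdlib `html.parser` instead, and the names `extract_body` and `html_to_text` are hypothetical.

```python
import email
from email import policy
from html.parser import HTMLParser


class _TextExtractor(HTMLParser):
    """Collect text nodes from HTML (a simple stand-in for BeautifulSoup)."""

    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_data(self, data):
        self.chunks.append(data)


def html_to_text(html: str) -> str:
    parser = _TextExtractor()
    parser.feed(html)
    parser.close()
    # Collapse runs of whitespace left behind by the markup.
    return " ".join("".join(parser.chunks).split())


def extract_body(raw: bytes) -> str:
    # Passing policy explicitly yields an EmailMessage, which provides get_body().
    msg = email.message_from_bytes(raw, policy=policy.default)
    part = msg.get_body(preferencelist=("plain", "html"))
    if part is None:
        return ""
    content = part.get_content()
    if part.get_content_type() == "text/html":
        # Only reached when no text/plain alternative exists.
        return html_to_text(content)
    return content
```

Note that this simple extractor also keeps the contents of `script` and `style` elements, which a production version would likely want to drop.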
@maxhawkins
Author

I added parsing of text/html emails using BeautifulSoup.

Around half of the emails in my archive don't include a text/plain payload, so adding HTML parsing makes a good chunk of them searchable.

@Btibert3

@maxhawkins how hard would it be to add a column to the table that includes the HTML version of the email, if it exists? I just tried your PR branch on a very small mbox file, and it worked great. My use case is a research project and I need to access more than just the plain-text body.

@maxhawkins
Author

Shouldn't be hard. The easiest way is probably to remove the if body.content_type == "text/html" clause from utils.py:254 and just return content directly without parsing.

@iloveitaly

@maxhawkins curious why you didn't use the stdlib mailbox to parse the mbox files?

@maxhawkins
Author

Mailbox parses the entire mbox into memory. Using the lower-level library lets us stream the emails in one at a time, which supports larger archives. Both libraries are in the stdlib.

@iloveitaly

Makes sense, thanks for explaining!

4 participants