- A project to download, process, and visualize an email corpus from the Sakai open source project, covering 2004-2011
- Analyzing an email archive from gmane and visualizing the data
We will be spidering this link:
Here is a sample run of gmane.py getting the last five messages of the Sakai developer list:
How many messages:10
http://mbox.dr-chuck.net/sakai.devel/1/2 2662
[email protected] 2005-12-08T23:34:30-06:00 call for participation: developers documentation
http://mbox.dr-chuck.net/sakai.devel/2/3 2434
[email protected] 2005-12-09T00:58:01-05:00 report from the austin conference: sakai developers break into song
http://mbox.dr-chuck.net/sakai.devel/3/4 3055
[email protected] 2005-12-09T09:01:49-07:00 cas and sakai 1.5
http://mbox.dr-chuck.net/sakai.devel/4/5 11721
[email protected] 2005-12-09T09:43:12-05:00 re: lms/vle rants/comments
http://mbox.dr-chuck.net/sakai.devel/5/6 9443
[email protected] 2005-12-09T13:32:29+00:00 re: lms/vle rants/comments
Does not start with From
Note: The program scans content.sqlite from 1 up to the first message number not already spidered and starts spidering at that message. It continues until it has spidered the requested number of messages or reaches a page that does not appear to be a properly formatted message (as in the "Does not start with From" line above).
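The restart behaviour described in the note can be sketched as follows. This is a minimal illustration, not the code in gmane.py: the table name `Messages`, the column `id`, and the helper names `first_unspidered` and `looks_like_message` are assumptions made for the example, and the "From " check is only inferred from the "Does not start with From" line in the sample run.

```python
import sqlite3


def first_unspidered(conn):
    """Return the lowest message number, counting from 1, that is not stored yet."""
    expected = 1
    for (msg_id,) in conn.execute("SELECT id FROM Messages ORDER BY id"):
        if msg_id != expected:
            break  # found a gap: this number has not been spidered
        expected += 1
    return expected


def looks_like_message(text):
    """Assumed format check: a properly formatted message begins with 'From '."""
    return text.startswith("From ")


# Hypothetical database in which messages 1, 2, 3 and 5 were already retrieved.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE Messages (id INTEGER PRIMARY KEY, body TEXT)")
conn.executemany(
    "INSERT INTO Messages (id, body) VALUES (?, ?)",
    [(1, "From a"), (2, "From b"), (3, "From c"), (5, "From e")],
)

start = first_unspidered(conn)
print("Spidering would resume at message", start)
```

With an empty table the function returns 1, so a fresh run starts from the first message.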
gmodel.py reads the raw data from content.sqlite and produces a cleaned-up, well-modeled version of the data in the file index.sqlite. The file index.sqlite will be much smaller (often 10X smaller) than content.sqlite because gmodel.py also compresses the header and body text.
Running gmodel.py works as follows:
Loaded allsenders 1588 and mapping 28 dns mapping 1
1 2005-12-08T23:34:30-06:00 [email protected]
251 2005-12-22T10:03:20-08:00 [email protected]
501 2006-01-12T11:17:34-05:00 [email protected]
751 2006-01-24T11:13:28-08:00 [email protected]
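To illustrate why compressing the header and body text shrinks the database, here is a small sketch. It assumes `zlib` as the compression library and a hypothetical `Messages` table with `headers` and `body` columns; the README does not state which library or schema gmodel.py uses, and the sample text is made up.

```python
import sqlite3
import zlib


def compress_text(text):
    """Encode text as UTF-8 and compress it for storage as a BLOB."""
    return zlib.compress(text.encode("utf-8"))


def decompress_text(blob):
    """Reverse compress_text: decompress the BLOB and decode it back to text."""
    return zlib.decompress(blob).decode("utf-8")


conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE Messages (id INTEGER PRIMARY KEY, headers BLOB, body BLOB)"
)

# Made-up message; quoted replies in mail threads are similarly repetitive.
headers = "From: someone@example.com\nSubject: re: lms/vle rants/comments\n"
body = "Quoted reply text tends to repeat itself. " * 50

conn.execute(
    "INSERT INTO Messages (id, headers, body) VALUES (?, ?, ?)",
    (1, compress_text(headers), compress_text(body)),
)

stored_headers, stored_body = conn.execute(
    "SELECT headers, body FROM Messages WHERE id = 1"
).fetchone()
restored = decompress_text(stored_body)

print("original bytes:", len(body.encode("utf-8")))
print("stored bytes:", len(stored_body))
```

Any program that reads such a database has to decompress each field before using it, which is the trade-off for the smaller file.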
The first, simplest data analysis answers two questions: "who does the most?" and "which organization does the most?" This is done using gbasic.py:
How many to dump? 5
Loaded messages= 51330 subjects= 25033 senders= 1584
Top 5 Email list participants
[email protected] 2657
[email protected] 1742
[email protected] 1591
[email protected] 1304
[email protected] 1184
Top 5 Email list organizations
gmail.com 7339
umich.edu 6243
uct.ac.za 2451
indiana.edu 2258
unicon.net 2055
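The two rankings above amount to counting messages per sender and per sender domain. A minimal sketch of that idea, assuming the sender addresses are already available as a list of strings; the helper names and the sample addresses are hypothetical and are not taken from gbasic.py:

```python
from collections import Counter


def top_participants(senders, how_many):
    """Rank individual addresses by number of messages sent."""
    return Counter(senders).most_common(how_many)


def top_organizations(senders, how_many):
    """Rank organizations, taken here to be the part of the address after '@'."""
    orgs = (sender.split("@")[-1].lower() for sender in senders)
    return Counter(orgs).most_common(how_many)


# Hypothetical sample: one entry per message sent.
senders = [
    "ann@example.edu",
    "bob@example.org",
    "ann@example.edu",
    "cat@example.edu",
    "bob@example.org",
    "ann@example.edu",
]

for address, count in top_participants(senders, 2):
    print(address, count)

for org, count in top_organizations(senders, 2):
    print(org, count)
```

Treating the domain as the organization explains why a general mail provider such as gmail.com can top the organization list even though it is not a single institution.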
gword.py produces a simple visualization of word frequency in the subject lines:
Range of counts: 33229 129
Output written to gword.js
This produces the file gword.js, which you can visualize using the file *gword.htm*.
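The word-frequency step can be sketched roughly as below: count the words in the subject lines, keep the most frequent ones, and scale each count between the highest and lowest kept values (the "Range of counts" in the run above) to get a display size. The minimum word length, the size range, the helper names, and the shape of the generated JavaScript are all assumptions for illustration; the actual format of gword.js is not shown in this README.

```python
import json
import string
from collections import Counter

# Replace punctuation and digits with spaces so "lms/vle" splits into two words.
_STRIP = string.punctuation + string.digits
_TABLE = str.maketrans(_STRIP, " " * len(_STRIP))


def word_counts(subjects, min_length=4):
    """Count words of at least min_length characters across all subject lines."""
    counts = Counter()
    for subject in subjects:
        for word in subject.lower().translate(_TABLE).split():
            if len(word) >= min_length:
                counts[word] += 1
    return counts


def scale_sizes(counts, how_many=100, smallest=20, largest=100):
    """Map the top counts linearly onto a font-size range."""
    top = counts.most_common(how_many)
    if not top:
        return []
    highest, lowest = top[0][1], top[-1][1]
    span = max(highest - lowest, 1)  # avoid dividing by zero when all counts match
    return [
        {"text": word, "size": smallest + (count - lowest) * (largest - smallest) // span}
        for word, count in top
    ]


# Subject lines borrowed from the sample spider run earlier in this README.
subjects = [
    "re: lms/vle rants/comments",
    "re: lms/vle rants/comments",
    "cas and sakai 1.5",
    "call for participation: developers documentation",
]

counts = word_counts(subjects)
sizes = scale_sizes(counts, how_many=3)

# Hypothetical output format: a JavaScript variable holding the word list.
js_text = "gword = " + json.dumps(sizes) + ";"
print(js_text)
```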
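A second visualization comes from gline.py, which reports the top organizations and writes gline.js:
Loaded messages= 51330 subjects= 25033 senders= 1584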
Top 10 Oranizations
>['gmail.com', 'umich.edu', 'uct.ac.za', 'indiana.edu', 'unicon.net', 'tfd.co.uk', 'berkeley.edu', 'longsight.com', 'stanford.edu', 'ox.ac.uk']
Output written to gline.js
Note: This work is a project for Capstone: Retrieving, Processing, and Visualizing Data with Python, part of the Python for Everybody specialization on Coursera.