Spidering_and_Modeling_Email_Data.PY4E

A project to download, process, and visualize an email corpus from the Sakai open source project covering 2004-2011.
Analyzing an email archive from Gmane and visualizing the data.

Step 1

Spider the archive and create a database to store the retrieved messages.

We will be spidering this link:

http://mbox.dr-chuck.net/

Here is a sample run of gmane.py retrieving five messages from the Sakai developer list:

How many messages:10
http://mbox.dr-chuck.net/sakai.devel/1/2 2662
    [email protected] 2005-12-08T23:34:30-06:00 call for participation: developers documentation
http://mbox.dr-chuck.net/sakai.devel/2/3 2434
    [email protected] 2005-12-09T00:58:01-05:00 report from the austin conference:  sakai developers break into song
http://mbox.dr-chuck.net/sakai.devel/3/4 3055
    [email protected] 2005-12-09T09:01:49-07:00 cas and sakai 1.5
http://mbox.dr-chuck.net/sakai.devel/4/5 11721
    [email protected] 2005-12-09T09:43:12-05:00 re: lms/vle rants/comments
http://mbox.dr-chuck.net/sakai.devel/5/6 9443
    [email protected] 2005-12-09T13:32:29+00:00 re: lms/vle rants/comments
Does not start with From

Note: The program scans content.sqlite from message 1 up to the first message number not already spidered and starts spidering at that message. It continues until it has spidered the requested number of messages or it reaches a page that does not appear to be a properly formatted message.
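The resume logic described in the note can be sketched as follows. This is a minimal illustration, not the code in gmane.py: the table name `Messages`, its columns, and the helper names are assumptions, and an in-memory database stands in for content.sqlite. The URL pattern and the "starts with From" check are inferred from the sample run above.

```python
import sqlite3

# Base URL taken from the sample run above.
BASE_URL = "http://mbox.dr-chuck.net/sakai.devel/"


def first_unspidered(conn):
    """Return the lowest message id, counting up from 1, that is not yet stored."""
    cur = conn.cursor()
    start = 1
    while True:
        cur.execute("SELECT id FROM Messages WHERE id = ?", (start,))
        if cur.fetchone() is None:
            return start
        start += 1


def message_url(n):
    """Build the URL for message n, following the n/(n+1) pattern in the output."""
    return f"{BASE_URL}{n}/{n + 1}"


def looks_like_message(text):
    """Treat a retrieved page as a message only if it starts with 'From '."""
    return text.startswith("From ")


# Hypothetical demo data: an in-memory database standing in for content.sqlite.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE Messages (id INTEGER PRIMARY KEY, email TEXT)")
conn.executemany(
    "INSERT INTO Messages (id, email) VALUES (?, ?)",
    [(1, "a@example.org"), (2, "b@example.org"), (3, "c@example.org")],
)

start = first_unspidered(conn)
print(start, message_url(start))
```

Because the starting point is derived from what is already stored, the spider can be stopped and restarted without downloading the same messages again.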

Step 2

The second step is to run the program gmodel.py.

gmodel.py reads the raw data from content.sqlite and produces a cleaned-up, well-modeled version of the data in the file index.sqlite. The file index.sqlite will be much smaller (often 10X smaller) than content.sqlite because it also compresses the header and body text.
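To illustrate why compressing header and body text shrinks the database, here is a small sketch using the standard-library `zlib` module. That gmodel.py uses `zlib` is an assumption; the text above only says the data is compressed. The sample body is made up.

```python
import zlib

# Hypothetical message body; repeated text compresses well, as quoted email often does.
body = "re: lms/vle rants/comments\n" * 200

raw = body.encode("utf-8")
compressed = zlib.compress(raw)

# The compressed bytes could be stored in a BLOB column and restored on demand.
restored = zlib.decompress(compressed).decode("utf-8")

print(len(raw), len(compressed))
```

The actual ratio depends on the text; the 10X figure quoted above is the author's observation for this corpus, not something this sketch reproduces.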

A sample run of gmodel.py looks like this:

Loaded allsenders 1588 and mapping 28 dns mapping 1
1 2005-12-08T23:34:30-06:00 [email protected]
251 2005-12-22T10:03:20-08:00 [email protected]
501 2006-01-12T11:17:34-05:00 [email protected]
751 2006-01-24T11:13:28-08:00 [email protected]

Step 3

Running gbasic.py

The first and simplest data analysis answers two questions: "who does the most?" and "which organization does the most?" This is done using gbasic.py:

How many to dump? 5
Loaded messages= 51330 subjects= 25033 senders= 1584

Top 5 Email list participants
[email protected] 2657
[email protected] 1742
[email protected] 1591
[email protected] 1304
[email protected] 1184

Top 5 Email list organizations
gmail.com 7339
umich.edu 6243
uct.ac.za 2451
indiana.edu 2258
unicon.net 2055
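The two rankings above amount to counting messages per sender and per email domain. A minimal sketch of that idea, assuming the sender of each message is available as a plain address string (the addresses and the helper names `organization` and `top` are hypothetical, not taken from gbasic.py):

```python
from collections import Counter

# Hypothetical sender addresses, one entry per message.
senders = [
    "alice@example.edu",
    "bob@example.edu",
    "alice@example.edu",
    "carol@example.org",
    "dave@example.org",
    "erin@example.com",
    "alice@example.edu",
]


def organization(email):
    """Use the part after the last '@' as the organization name."""
    return email.rsplit("@", 1)[-1].lower()


sender_counts = Counter(senders)
org_counts = Counter(organization(s) for s in senders)


def top(counts, n):
    """Return the n most frequent entries as (name, count) pairs."""
    return counts.most_common(n)


print(top(sender_counts, 2))
print(top(org_counts, 2))
```

Treating the whole domain as the organization is a simplification; it is why a general mail provider such as gmail.com can top the organization list.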

Step 4: Visualizations

The first visualization is produced by running gword.py.

gword.py produces a simple visualization of word frequency in the subject lines:

Range of counts: 33229 129
Output written to gword.js

This produces the file gword.js, which you can visualize using the file gword.htm.
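The word-frequency step can be sketched as: count words in the subject lines, keep the most frequent ones, and scale each count to a font size between the lowest and highest count (the "Range of counts" line in the output). Everything specific here is an assumption for illustration: the minimum word length of 4, the size range of 20 to 80, and the `gword = [...]` layout of the JavaScript output are not confirmed by the text above.

```python
from collections import Counter
import json
import string

# Replace punctuation and digits with spaces so "lms/vle" splits into two words.
_CLEAN = str.maketrans(
    string.punctuation + string.digits,
    " " * len(string.punctuation + string.digits),
)


def word_counts(subjects, min_length=4):
    """Count lowercase words of at least min_length characters (assumed cutoff)."""
    counts = Counter()
    for subject in subjects:
        for word in subject.translate(_CLEAN).lower().split():
            if len(word) >= min_length:
                counts[word] += 1
    return counts


def scale_sizes(counts, limit=100, smallest=20, biggest=80):
    """Map the top `limit` counts linearly onto font sizes smallest..biggest."""
    top = counts.most_common(limit)
    if not top:
        return []
    highest, lowest = top[0][1], top[-1][1]
    span = max(highest - lowest, 1)  # avoid dividing by zero when all counts match
    return [
        {"text": w, "size": int(smallest + (c - lowest) / span * (biggest - smallest))}
        for w, c in top
    ]


def to_js(words):
    """Render the word list as a JavaScript assignment (assumed file layout)."""
    return "gword = " + json.dumps(words) + ";\n"


subjects = [
    "re: lms/vle rants/comments",
    "re: lms/vle rants/comments",
    "cas and sakai 1.5",
]
counts = word_counts(subjects)
words = scale_sizes(counts)
print(to_js(words))
```

Writing the result as a `.js` file rather than JSON lets a static HTML page load it with a plain script tag, without a web server.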

The second visualization is produced by running gline.py:

Loaded messages= 51330 subjects= 25033 senders= 1584
Top 10 Oranizations
>['gmail.com', 'umich.edu', 'uct.ac.za', 'indiana.edu', 'unicon.net', 'tfd.co.uk', 'berkeley.edu', 'longsight.com', 'stanford.edu', 'ox.ac.uk']
Output written to gline.js
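The output shows gline.py selecting the ten most active organizations before writing gline.js. A line chart of that kind would presumably need message counts per organization per time period; the sketch below assumes monthly buckets derived from the ISO timestamps seen in the earlier output. The sample messages, the bucket size, and the row layout are all assumptions, not a description of gline.py itself.

```python
from collections import Counter

# Hypothetical (timestamp, sender) pairs in the timestamp format shown earlier.
messages = [
    ("2005-12-08T23:34:30-06:00", "a@example.edu"),
    ("2005-12-09T00:58:01-05:00", "b@example.edu"),
    ("2005-12-09T09:01:49-07:00", "c@example.org"),
    ("2006-01-12T11:17:34-05:00", "a@example.edu"),
    ("2006-01-24T11:13:28-08:00", "d@example.org"),
    ("2006-01-25T08:00:00-05:00", "e@example.com"),
]


def organization(email):
    """Use the part after the last '@' as the organization name."""
    return email.rsplit("@", 1)[-1].lower()


# Pick the most active organizations (top 2 here; the real run uses 10).
org_totals = Counter(organization(email) for _, email in messages)
top_orgs = [org for org, _ in org_totals.most_common(2)]

# Count messages per (month, organization); the month is the first 7 characters.
monthly = Counter()
for sent_at, email in messages:
    org = organization(email)
    if org in top_orgs:
        monthly[(sent_at[:7], org)] += 1

# One row per month, one column per top organization, ready for a line chart.
months = sorted({month for month, _ in monthly})
rows = [[month] + [monthly[(month, org)] for org in top_orgs] for month in months]

print(["Month"] + top_orgs)
for row in rows:
    print(row)
```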

Note: The work shown above is a project from Python for Everybody, Capstone: Retrieving, Processing, and Visualizing Data with Python, on Coursera.

Brijesh403/Spidering_and_Modeling_Email_Data.PY4E
