enricospataro/enron

Processing Enron email dataset with Spark

Assumptions:

  • Only the first XML file found in an unzipped directory is processed.
  • Within the unzipped directory, only the text_000 folder, which holds the text files (email bodies), is processed.
  • Zip files are extracted into the directory /enron_extract.
  • Only the edrm-enron-v2 folder is processed.
  • There are three types of email addresses in the To and CC fields: LDAP addresses, normal email addresses and full contact names:
      • LDAP addresses are processed with the pattern [RECIPIENTS/CN=]recipient[≶]
      • Normal email addresses are processed with the pattern []@[]
      • Full contact names (e.g. Surname, Name) are not processed.
  • The email bodies are assumed to be well formatted, so splitting on whitespace gives an accurate word count.
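
The address handling and word counting described above can be sketched in plain Scala, without Spark. This is only an illustration under stated assumptions, not the code in test.scala: the object name EnronSketch, the function names and both regular expressions are hypothetical. In particular, the LDAP regex assumes recipients look like /O=ENRON/OU=NA/CN=RECIPIENTS/CN=JSMITH>, and the email regex is a simplified local-part@domain match.

```scala
object EnronSketch {
  // ASSUMPTION: an LDAP-style recipient ends in "RECIPIENTS/CN=<name>>",
  // e.g. "/O=ENRON/OU=NA/CN=RECIPIENTS/CN=JSMITH>". The name is captured.
  private val Ldap = """(?i)RECIPIENTS/CN=([^>]+)>""".r

  // ASSUMPTION: a simplified pattern for normal addresses (local part, "@", domain).
  private val Email = """[\w.+-]+@[\w-]+(?:\.[\w-]+)+""".r

  /** Extracts recipients from a raw To/CC value.
    * Text matching neither pattern (such as "Surname, Name") is ignored.
    */
  def recipients(field: String): Seq[String] = {
    val ldap = Ldap.findAllMatchIn(field).map(_.group(1)).toList
    val mail = Email.findAllIn(field).toList
    ldap ++ mail
  }

  /** Counts words by splitting the body on runs of whitespace. */
  def wordCount(body: String): Int =
    body.trim.split("\\s+").count(_.nonEmpty)
}
```

For example, EnronSketch.recipients("</O=ENRON/OU=NA/CN=RECIPIENTS/CN=JSMITH>, jane.doe@enron.com, Lay, Kenneth") yields List("JSMITH", "jane.doe@enron.com"), and EnronSketch.wordCount("hello  world\nfoo") yields 3. In a Spark job, functions like these would typically be applied inside map or flatMap over the parsed emails.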

Deployment instructions:

  • Provision an AWS Ubuntu Server 14.04 instance (t2.large).
  • Add the Enron snapshot as a 250 GB storage volume.
  • Mount the Enron volume into an enron folder in root ('sudo mkdir enron' and then 'sudo mount /dev/xvdb enron').
  • Install Scala into /home/ubuntu/scala and Spark into /home/ubuntu/spark.
  • Copy test.scala to /home/ubuntu and run it to start full processing.
