
Chapter 2. Data

Email

We will use your own email inbox as the dataset for the application we develop, in order to make the examples relevant. Downloading your Gmail inbox and using it in the examples immediately poses a 'big' or 'medium' data problem: processing the data on your local machine is not always feasible. Working with data too large to fit in RAM requires scalable tools, which makes it a helpful learning device.

Email is a fundamental part of the internet. More than that, it is foundational: it forms the basis of authentication for the web and social networks. In addition to being abundant and well understood, email is complex, rich in signal, and yields interesting information when mined.

Working with Raw Data

Raw Email

Email's format is rigorously defined in IETF RFC 5322 (a Request for Comments published by the Internet Engineering Task Force). To view a raw email in Gmail, select a message and then select the 'Show original' option in the top-right drop-down menu.

Figure 2.1. Gmail Show original

Gmail view original email button

A raw email looks like this:

From: Russell Jurney <russell.jurney@gmail.com>
Mime-Version: 1.0 (1.0)
Date: Mon, 28 Nov 2011 14:57:38 -0800
Delivered-To: russell.jurney@gmail.com
Message-ID: <4484555894252760987@unknownmsgid>
Subject: Re: Lawn
To: ****** Jurney <******@hotmail.com>
Content-Type: text/plain; charset=ISO-8859-1

Dad, get a sack of Rye grass seed and plant it over there now.  It
will build up a nice turf over the winter, then die off when it warms
up.  Making for good topsoil you can plant regular grass in.

Will keep the weeds from taking over.

Russell Jurney datasyndrome.com

This is called semi-structured data.
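The headers in a raw email are the 'tags or other markers' that make it semi-structured, and they can be read without any schema. As a minimal sketch, assuming Python and its standard-library email package (neither is prescribed by this chapter, and the function name parse_email is our own), we might pull a few fields out of a shortened copy of the message above:

```python
from email import message_from_string
from email.utils import parseaddr, parsedate_to_datetime

# A shortened copy of the raw email shown above (body truncated for brevity).
RAW_EMAIL = """\
From: Russell Jurney <russell.jurney@gmail.com>
Mime-Version: 1.0 (1.0)
Date: Mon, 28 Nov 2011 14:57:38 -0800
Delivered-To: russell.jurney@gmail.com
Message-ID: <4484555894252760987@unknownmsgid>
Subject: Re: Lawn
To: ****** Jurney <******@hotmail.com>
Content-Type: text/plain; charset=ISO-8859-1

Dad, get a sack of Rye grass seed and plant it over there now.
"""


def parse_email(raw):
    """Extract a few header fields and the body from a raw RFC 5322 message."""
    msg = message_from_string(raw)
    # parseaddr splits 'Name <address>' into a (name, address) pair.
    _sender_name, sender_address = parseaddr(msg["From"])
    return {
        "from": sender_address,
        "subject": msg["Subject"],
        "date": parsedate_to_datetime(msg["Date"]),
        "body": msg.get_payload(),
    }


parsed = parse_email(RAW_EMAIL)
print(parsed["from"], "|", parsed["subject"], "|", parsed["date"].isoformat())
```

Everything after the first blank line is treated as the body, which remains free text: the structure stops where the headers end.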

Structured vs. Semi-Structured Data

Wikipedia defines semi-structured data as:

 

Semi-structured data is a form of structured data that does not conform with the formal structure of tables and data models associated with relational databases but nonetheless contains tags or other markers to separate semantic elements and enforce hierarchies of records and fields within the data.

 
 --Wikipedia, Semi-structured data

This is in contrast to structured data, which breaks data up into rigorously defined schemas before analytics begin, for more efficient querying thereafter. A structured view of email is demonstrated in the Berkeley Enron dataset by Andrew Fiore and Jeff Heer:

Figure 2.2. Enron email schema

Enron email schema

SQL and NoSQL

SQL

To query such a structured schema, we typically use declarative programming languages like SQL. In declarative programming, we specify the desired output rather than a set of operations on our data. A SQL query against a relational email dataset, such as the Enron schema above, to retrieve a single email looks like this:

select m.smtpid as id, 
    m.messagedt as date, 
    s.email as sender,
    (select group_concat(CONCAT(r.reciptype, ':', p.email) SEPARATOR ' ') 
        from recipients r 
        join people p 
            on r.personid=p.personid 
        where r.messageid = 511) as to_cc_bcc,
    m.subject as subject, 
    SUBSTR(b.body, 1, 200) as body
    from messages m 
    join people s
        on m.senderid=s.personid
    join bodies b 
        on m.messageid=b.messageid 
    where m.messageid=511;

| <25772535.1075839951307.JavaMail.evans@thyme> | 2002-02-02 12:56:33 | pete.davis@enron.com | to:pete.davis@enron.com cc:albert.meyers@enron.com cc:bill.williams@enron.com cc:craig.dean@enron.com cc:geir.solberg@enron.com cc:john.anderson@enron.com cc:mark.guzman@enron.com cc:michael.mier@enron.com cc:pete.davis@enron.com cc:ryan.slinger@enron.com bcc:albert.meyers@enron.com bcc:bill.williams@enron.com bcc:craig.dean@enron.com bcc:geir.solberg@enron.com bcc:john.anderson@enron.com bcc:mark.guzman@enron.com bcc:michael.mier@enron.com bcc:pete.davis@enron.com bcc:ryan.slinger@enron.com | Schedule Crawler: HourAhead Failure | Start Date: 2/2/02; HourAhead hour: 11;  HourAhead schedule download failed. Manual intervention required. |

Note how complex this query is to retrieve a basic record. We join three tables, and use a sub-query, the special MySQL function GROUP_CONCAT as well as CONCAT and SUBSTR. Relational data almost discourages us from viewing data in its original form by requiring us to think in terms of the relational schema and not the data itself in its original, de-normalized form.

Relational structure does have benefits. We can see what time users send emails very easily with a simple select/group by/order query:

select senderid, 
    hour(messagedt) as m_hour, 
    count(*) 
    from messages 
    where senderid=1 
    group by 
        senderid, 
        m_hour 
    order by 
        senderid, 
        m_hour;

+----------+--------+----------+
| senderid | m_hour | count(*) |
+----------+--------+----------+
|        1 |      0 |        4 |
|        1 |      1 |        3 |
|        1 |      3 |        2 |
|        1 |      5 |        1 |
|        1 |      8 |        3 |
|        1 |      9 |        1 |
|        1 |     10 |        5 |
|        1 |     11 |        2 |
|        1 |     12 |        2 |
|        1 |     14 |        1 |
|        1 |     15 |        5 |
|        1 |     16 |        4 |
|        1 |     17 |        1 |
|        1 |     19 |        1 |
|        1 |     20 |        1 |
|        1 |     21 |        1 |
|        1 |     22 |        1 |
|        1 |     23 |        1 |
+----------+--------+----------+
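To experiment with this kind of aggregate query without a MySQL server, a minimal sketch using Python's built-in sqlite3 module might look like the following. The rows are hypothetical, the messages table is reduced to three columns, and because SQLite has no hour() function we assume strftime('%H', ...) as a stand-in.

```python
import sqlite3

# In-memory database with a reduced, hypothetical version of the messages table.
conn = sqlite3.connect(":memory:")
conn.execute(
    "create table messages (messageid integer, senderid integer, messagedt text)"
)
rows = [
    (1, 1, "2002-02-02 00:15:00"),
    (2, 1, "2002-02-02 00:45:00"),
    (3, 1, "2002-02-03 10:05:00"),
    (4, 2, "2002-02-03 10:30:00"),
]
conn.executemany("insert into messages values (?, ?, ?)", rows)

# Same shape as the MySQL query above; strftime('%H') replaces hour().
QUERY = """
select senderid,
    cast(strftime('%H', messagedt) as integer) as m_hour,
    count(*)
    from messages
    where senderid = ?
    group by senderid, m_hour
    order by senderid, m_hour
"""

hourly_counts = conn.execute(QUERY, (1,)).fetchall()
for row in hourly_counts:
    print(row)
```

Each tuple is one row of the result: the sender, the hour of day, and the number of messages sent in that hour.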

This kind of declarative programming is ideally suited to consuming and querying structured data in aggregate to produce simple charts and figures. When we know what we want, we can efficiently tell the SQL engine what that is, and it will compute the relations for us. We don't have to worry about the details of the query's execution.

Relational databases split data up into tables according to its structure, and pre-compute indexes for operating between these tables. Indexes enable these systems to be responsive on a single computer. Declarative programming is used to query this structure.

NoSQL

In contrast, when building analytics applications, we often don't know the query we want to run. Much experimentation and iteration is required to arrive at the solution to any given problem. Data is often unavailable in a relational format. Data in the wild is not normalized; it is fuzzy and dirty. Extracting structure is a lengthy process that we perform iteratively as we process data for different features.

For these reasons, in Agile Data we primarily employ imperative languages against distributed systems. Imperative languages like Pig Latin describe steps to manipulate data in pipelines. Rather than pre-compute indexes against structure we don't yet have, we use many processing cores in parallel to read individual records. Hadoop and work queues make this possible.

In addition to mapping well to technologies like Hadoop, which enables us to easily scale our processing, imperative languages put the focus of our tools where most of the work in building analytics applications is: in one or two hard-won, key steps where we do clever things that deliver most of the value of our application.

Compared to writing SQL queries, arriving at these clever operations is a lengthy and often exhausting process, as we employ techniques from statistics, machine learning and social network analysis. Thus, imperative programming fits the task.

Conclusion

To summarize, when schemas are rigorous and SQL is our lone tool, our perspective comes to be dominated by tools optimized for consuming data rather than mining it. Our ability to connect intuitively with the data is inhibited. Working with semi-structured data, on the other hand, enables us to focus on the data directly, manipulating it iteratively to extract value and to transform it into a product form. In Agile Data, we embrace NoSQL for what it enables us to do.

Serialization

Although we can work with semi-structured data as pure text, it is still helpful to impose some kind of structure on the raw records using a schema, and to evolve that schema as further structure is extracted. Serialization systems give us this functionality. Available serialization systems include Thrift, Protobuf and Avro.

Thrift - http://thrift.apache.org/
Protobuf - http://code.google.com/p/protobuf/
Avro - http://avro.apache.org/

Although it is the least mature of these options, we choose Avro. Avro allows complex data structures, it includes a schema with each file, and it has support in Apache Pig. Installing Avro is easy, and it requires no external service to run.

We'll define a single, simple Avro schema for an email document as defined in RFC 5322. It is well and good to define a schema up front, but in practice much processing will be required to extract all the entities in that schema. And so our initial schema might look very simple, like this:

  {
      "type":"record",
      "name":"RawEmail",
      "fields":
      [
          {
              "name":"thread_id",
              "type":["string", "null"],
              "doc":""
          },
          {
              "name":"raw_email",
              "type": ["string", "null"]
          }
      ]
  }
      

We might extract only a thread_id as a unique identifier, and then store the entire raw email string in a field on its own. If a unique identifier is not easy to extract from raw records, we can generate a UUID (Universally Unique IDentifier) and add it as a field.
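As a minimal sketch of that fallback, assuming Python and only its standard library (the helper names make_raw_record and field_names are ours, and no Avro library is involved here), we can wrap a raw email in a record shaped like the RawEmail schema and generate a UUID when no identifier is available:

```python
import json
import uuid

# The RawEmail schema from above, loaded as plain JSON.
RAW_EMAIL_SCHEMA = json.loads("""
{
    "type": "record",
    "name": "RawEmail",
    "fields": [
        {"name": "thread_id", "type": ["string", "null"], "doc": ""},
        {"name": "raw_email", "type": ["string", "null"]}
    ]
}
""")


def make_raw_record(raw_email, thread_id=None):
    """Wrap a raw email string in a dict shaped like the RawEmail schema."""
    if thread_id is None:
        # No identifier could be extracted, so generate a random UUID instead.
        thread_id = str(uuid.uuid4())
    return {"thread_id": thread_id, "raw_email": raw_email}


def field_names(schema):
    """List the field names an Avro record schema declares."""
    return [field["name"] for field in schema["fields"]]


record = make_raw_record("Subject: Re: Lawn\n\nDad, get a sack of Rye grass seed...")
# The record carries exactly the fields the schema declares.
assert sorted(record) == sorted(field_names(RAW_EMAIL_SCHEMA))
print(record["thread_id"])
```

A dictionary of this shape is what an Avro writer would typically be handed; the actual serialization step is left out of this sketch.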

Our job as we process data then is to add fields to our schema as we extract them, all the while retaining the raw data in its own field if we can. We can always go back to the mother sauce.
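One way to picture this is the sketch below, which assumes Python and plain dictionaries rather than any particular Avro tooling. The helpers add_field and extract_subject are hypothetical names: the first appends a field to a copy of the schema, the second fills that field from the raw email while leaving the raw data untouched.

```python
import copy
from email import message_from_string

BASE_SCHEMA = {
    "type": "record",
    "name": "RawEmail",
    "fields": [
        {"name": "thread_id", "type": ["string", "null"], "doc": ""},
        {"name": "raw_email", "type": ["string", "null"]},
    ],
}


def add_field(schema, name, avro_type=("string", "null")):
    """Return a copy of the schema with one more nullable field appended."""
    evolved = copy.deepcopy(schema)
    evolved["fields"].append({"name": name, "type": list(avro_type)})
    return evolved


def extract_subject(record):
    """Add a subject field parsed from raw_email, keeping raw_email intact."""
    enriched = dict(record)
    enriched["subject"] = message_from_string(record["raw_email"])["Subject"]
    return enriched


record = {"thread_id": "t-1", "raw_email": "Subject: Re: Lawn\n\nDad, get a sack..."}
evolved_schema = add_field(BASE_SCHEMA, "subject")
enriched = extract_subject(record)
print([field["name"] for field in evolved_schema["fields"]], enriched["subject"])
```

Because raw_email survives every step, any later extraction can start again from the original message.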

Extracting and Exposing Features in Evolving Schemas

As Pete Warden notes in his talk, Embracing the Chaos of Data, most freely available data is crude and unstructured. It is the availability of huge volumes of such ugly data, and not carefully cleaned and normalized tables, that makes it 'big data.' Therein lies the opportunity: mining crude data into refined information, and using that information to drive new kinds of actions.

Features extracted from unstructured data only get cleaned in the harsh light of day, as users consume them and complain. If you can't ship your features as you extract them, you're in a state of free-fall. The hardest part of building data products is pegging entity and feature extraction to products smaller than your ultimate vision. This is why schemas must start as blobs of unstructured text and evolve into structured data only as features are extracted.

Features must be exposed in some product form as they are created, or they will never achieve a product-ready state. Derived data that lives in the basement of your product is unlikely to shape up. It is better to create entity pages to bring entities up to a 'consumer-grade' form, to incrementally improve these entities and to progressively combine them than to try to expose myriad derived data in a grand vision from the get-go.

Mining data into well-structured information, then using that information to expose new facts and make predictions that enable actions, offers enormous potential for value creation. But data is brutal and unforgiving, and failing to mind its true nature will dash the dreams of the most ambitious product manager.

As we'll see throughout the book, schemas evolve and improve, and so do the features that expose them. When they evolve concurrently, we are truly agile.

Data Pipelines

We'll be working with semi-structured data in data pipelines to extract and display different features of the data. The advantage of working with data in this way is that we don't invest time in extracting structure unless it is of interest and use to us. Thus, in the spirit of KISS (Keep It Simple, Stupid!) and YAGNI (You Ain't Gonna Need It), we defer this overhead until the time of need. Our toolset helps make this more efficient, as we'll see in Chapter 3.

A data pipeline to calculate the number of emails sent between two email addresses looks like this:

Figure 2.3. Example dataflow to count the number of emails between two email addresses

Load -> Filter -> Transform -> Flatten -> Group -> Transform -> Store

While this dataflow may look complex now if you're used to SQL, you'll quickly get used to working this way, and such a simple flow will become second nature.
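To make the stages of the dataflow in Figure 2.3 concrete, here is a minimal single-machine sketch in Python. It is an illustration under stated assumptions rather than the toolset used later in the book: the messages are hypothetical and already parsed, each function name is our own, and the Store step simply returns sorted rows instead of writing them anywhere.

```python
from collections import Counter

# Hypothetical, already-parsed messages standing in for real input.
MESSAGES = [
    {"from": "a@example.com", "to": ["b@example.com", "c@example.com"]},
    {"from": "a@example.com", "to": ["b@example.com"]},
    {"from": "b@example.com", "to": ["a@example.com"]},
    {"from": None, "to": ["a@example.com"]},  # malformed: no sender
]


def load():
    """Load: read records one at a time."""
    return iter(MESSAGES)


def filter_valid(messages):
    """Filter: drop records missing a sender or recipients."""
    return (m for m in messages if m["from"] and m["to"])


def project(messages):
    """Transform: keep only the fields this flow needs."""
    return ((m["from"], m["to"]) for m in messages)


def flatten(pairs):
    """Flatten: emit one (sender, recipient) pair per recipient."""
    for sender, recipients in pairs:
        for recipient in recipients:
            yield (sender, recipient)


def group_and_count(pairs):
    """Group, then Transform: gather identical pairs and reduce each to a count."""
    return Counter(pairs)


def store(counts):
    """Store: here we just return sorted rows instead of writing them out."""
    return sorted((sender, to, n) for (sender, to), n in counts.items())


result = store(group_and_count(flatten(project(filter_valid(load())))))
for row in result:
    print(row)
```

Each stage consumes the output of the one before it, so stages can be added, removed or reordered as we learn more about the data.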

Data Structures as Perspectives

To start, it is helpful to highlight different ways of looking at email. In Agile Data, we employ different perspectives on data like different lenses, to inspect and mine data in different ways. It is easy to get stuck thinking about data in the one or two ways that you find productive. Here we list the different perspectives on email data we'll be using throughout the book.

Social Networks

A social network is a group of persons (egos) and the connections or links between them. These connections may be directed, as in 'Bob knows Sara,' or undirected, as in 'Bob and Sara are friends.' Connections may also have a strength, or weight, on a scale from 0 to 1: 'Bob knows Sara well' (0.5) or 'Bob and Sara are married' (1.0).

The sender and recipients of an email, found in the from, to, cc and bcc fields, can be used to create a social network. For instance, the email above defines two entities, russell.jurney@gmail.com and ******@hotmail.com.

        From: Russell Jurney <russell.jurney@gmail.com>
        To: ******* Jurney <******@hotmail.com>
      

The message itself implies a link between them. We can represent this as a simple social network.

Figure 2.4. Social network dyad

Different kinds of dyads
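As a rough sketch of how such a network could be assembled, the Python below counts one directed link from the sender to each recipient of a message. The addresses are hypothetical placeholders, and scaling counts by the strongest link is only one assumed way to obtain a weight between 0 and 1.

```python
from collections import defaultdict

# Hypothetical parsed emails; the addresses are placeholders.
EMAILS = [
    {"from": "bob@example.com", "to": ["sara@example.com"]},
    {"from": "bob@example.com", "to": ["sara@example.com"], "cc": ["ann@example.com"]},
    {"from": "sara@example.com", "to": ["bob@example.com"]},
]


def build_network(emails):
    """Count directed sender -> recipient links across to, cc and bcc."""
    edges = defaultdict(int)
    for email in emails:
        for field in ("to", "cc", "bcc"):
            for recipient in email.get(field, []):
                edges[(email["from"], recipient)] += 1
    return dict(edges)


def normalize(edges):
    """Scale counts into 0-1 weights relative to the strongest link."""
    strongest = max(edges.values())
    return {pair: count / strongest for pair, count in edges.items()}


edges = build_network(EMAILS)
weights = normalize(edges)
for (sender, recipient), weight in sorted(weights.items()):
    print(sender, "->", recipient, weight)
```

Treating (bob, sara) and (sara, bob) as separate keys keeps the network directed; merging the two would give the undirected form.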

A more complex social network might look like this:

Figure 2.5. Social network

A social network with seven egos

A social network of some 200 megabytes of emails from Enron looks like this:

Figure 2.6. Enron corpus viewer, by Jeffrey Heer

Enron corpus viewer, a large force directed layout of email egos

Social network analysis, or SNA, is the scientific study and analysis of social networks. By modeling our inbox as a social network, we can draw on the methods of social network analysis to reach a deeper understanding of the data, and of our interpersonal network.

Time Series

A time series is a sequence of data points ordered by a timestamp recorded with each value. Time series allow us to see changes and trends in data over time. All emails have timestamps, and so we can represent a series of emails as a time series.

        Date: Mon, 28 Nov 2011 14:57:38 -0800
      

Looking at several other emails, we can plot the raw data in a time series:

Figure 2.7. Raw time series

Raw time series of email timestamps

Since we aren't looking at another value associated with the time series, we can see the data more clearly by bucketing the data by day. This will tell us how many emails were sent between these two addresses per day.

Figure 2.8. Grouped time series

Time series of email frequency by day
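The bucketing step can be sketched in a few lines of Python, assuming the standard-library parser for RFC 5322 dates. Only the first Date header comes from the email above; the other two are hypothetical, and emails_per_day is our own name.

```python
from collections import Counter
from email.utils import parsedate_to_datetime

DATE_HEADERS = [
    "Mon, 28 Nov 2011 14:57:38 -0800",  # from the email above
    "Mon, 28 Nov 2011 09:12:05 -0800",  # hypothetical
    "Tue, 29 Nov 2011 18:30:00 -0800",  # hypothetical
]


def emails_per_day(date_headers):
    """Bucket Date headers by calendar day and count the emails in each bucket."""
    # .date() uses the day in the sender's own UTC offset.
    days = Counter(parsedate_to_datetime(header).date() for header in date_headers)
    return sorted(days.items())


daily_counts = emails_per_day(DATE_HEADERS)
for day, count in daily_counts:
    print(day.isoformat(), count)
```

Bucketing by hour of day instead of by date would give the kind of view that hints at a correspondent's work schedule.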

Time series analysis might tell us when we most often receive email from a particular person or even what their work schedule is.

Natural Language

The meat of an email is the text message it contains. Despite the addition of MIME for multimedia attachments, email is still primarily text.

        Subject: Re: Lawn
        Content-Type: text/plain; charset=ISO-8859-1

        Dad, get a sack of Rye grass seed and plant it over there now.  It
        will build up a nice turf over the winter, then die off when it warms
        up.  Making for good topsoil you can plant regular grass in.

        Will keep the weeds from taking over.

        Russell Jurney
        twitter.com/rjurney
        russell.jurney@gmail.com
        datasyndrome.com
      

We might analyze the body of the email by counting the frequency of each word within it. Once we remove common 'stop words' (very frequent words such as 'the', 'it' and 'a' that carry little meaning on their own), this looks like:

Figure 2.9. Email body word frequency

The most frequent words in the email body are plant and grass.

We might use this word frequency to infer that the topics of the email are 'plant' and 'grass', as these are the most frequent words in the email. Processing natural language in this way helps us to extract properties from semi-structured data to make it more structured. This enables us to incorporate these structured properties into our analysis.
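A minimal sketch of this count in Python follows. The stop word list is a small hand-picked assumption chosen for this one message, not a standard list, and the tokenizer simply keeps runs of letters.

```python
import re
from collections import Counter

# A small, hand-picked stop word list (an assumption for this example only).
STOP_WORDS = {
    "a", "and", "can", "for", "from", "get", "in", "it", "now", "of", "off",
    "over", "the", "then", "there", "up", "when", "will", "you",
}

BODY = """\
Dad, get a sack of Rye grass seed and plant it over there now.  It
will build up a nice turf over the winter, then die off when it warms
up.  Making for good topsoil you can plant regular grass in.

Will keep the weeds from taking over.
"""


def word_frequency(text, stop_words=STOP_WORDS):
    """Count lower-cased words in text, skipping stop words."""
    words = re.findall(r"[a-z]+", text.lower())
    return Counter(word for word in words if word not in stop_words)


top_words = word_frequency(BODY).most_common(2)
print(top_words)
```

With this list, 'grass' and 'plant' each occur twice and every other remaining word occurs once.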

A fun way to show word frequency is via a wordle:

Figure 2.10. Email body wordle

The most frequent words in the email body are plant and grass.

Probability Theory

In probability theory we model seemingly random processes by using inputs to create probability distributions. We can then employ these probability distributions to make suggestions and to classify inputs.

We can use probability distributions to make predictions. For instance, we might create a probability distribution for our sent emails. Given that our email address appears in the from field, and another email address appears in the to field, what is the chance that yet another email address will appear?

In this case, our raw data is the to, from, cc and bcc fields from each email:

        From: Russell Jurney <russell.jurney@gmail.com>
        To: ****** Jurney <******@hotmail.com>
        Cc: **** Jurney <****@hotmail.com>
        Bcc: Russell Jurney <russell.jurney@gmail.com>
      

First, we count the pairs of from email addresses with each other field:

Table 2.1. Totals for to, from pairs

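+--------------------------+--------------------------+-------+
| From                     | To                       | Count |
+--------------------------+--------------------------+-------+
| russell.jurney@gmail.com | ****.jurney@gmail.com    |    10 |
| russell.jurney@gmail.com | toolsreq@oreilly.com     |    10 |
| russell.jurney@gmail.com | yoga*****@gmail.com      |    11 |
| russell.jurney@gmail.com | user@pig.apache.org      |    14 |
| russell.jurney@gmail.com | *****@hotmail.com        |    15 |
| russell.jurney@gmail.com | mikel@oreilly.com        |    28 |
| russell.jurney@gmail.com | russell.jurney@gmail.com |    44 |
+--------------------------+--------------------------+-------+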

Dividing these values by the total number of emails sent from our address gives us a probability distribution characterizing the odds that any given email from our email address will be to any given address.

Table 2.2. P(to|from) - Probability of to, given from

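+--------------------------+--------------------------+-------------+
| From                     | To                       | Probability |
+--------------------------+--------------------------+-------------+
| russell.jurney@gmail.com | ****.jurney@gmail.com    |      0.0359 |
| russell.jurney@gmail.com | toolsreq@oreilly.com     |      0.0359 |
| russell.jurney@gmail.com | yoga*****@gmail.com      |      0.0395 |
| russell.jurney@gmail.com | user@pig.apache.org      |      0.0503 |
| russell.jurney@gmail.com | *****@hotmail.com        |      0.0539 |
| russell.jurney@gmail.com | mikel@oreilly.com        |      0.1007 |
| russell.jurney@gmail.com | russell.jurney@gmail.com |      0.1582 |
+--------------------------+--------------------------+-------------+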
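The division itself is a one-liner. The sketch below uses the counts from Table 2.1 and assumes a total of 278 sent emails; that total is not stated in the text and is only our inference from the ratio of counts to probabilities in the two tables, so treat it as an assumption.

```python
# Counts from Table 2.1, keyed by (from, to).
PAIR_COUNTS = {
    ("russell.jurney@gmail.com", "****.jurney@gmail.com"): 10,
    ("russell.jurney@gmail.com", "toolsreq@oreilly.com"): 10,
    ("russell.jurney@gmail.com", "yoga*****@gmail.com"): 11,
    ("russell.jurney@gmail.com", "user@pig.apache.org"): 14,
    ("russell.jurney@gmail.com", "*****@hotmail.com"): 15,
    ("russell.jurney@gmail.com", "mikel@oreilly.com"): 28,
    ("russell.jurney@gmail.com", "russell.jurney@gmail.com"): 44,
}

# Assumption: total emails sent from this address, inferred from the tables.
TOTAL_SENT = 278


def conditional_probabilities(pair_counts, total_sent):
    """Estimate P(to | from) as count(from, to) / count(from)."""
    return {pair: count / total_sent for pair, count in pair_counts.items()}


p_to_given_from = conditional_probabilities(PAIR_COUNTS, TOTAL_SENT)
for (sender, to), probability in p_to_given_from.items():
    print(sender, to, round(probability, 4))
```

The listed probabilities do not sum to one because Table 2.1 shows only the most frequent recipients.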

Finally, we list the probabilities of a pair of recipients co-occurring, given that the first address appears in an email.

Table 2.3. P(cc|from ∩ to) - Probability of cc, given from & to

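+--------------------------+----------------------+-----------------------+-------------+
| From                     | To                   | Cc                    | Probability |
+--------------------------+----------------------+-----------------------+-------------+
| russell.jurney@gmail.com | mikel@oreilly.com    | toolsreq@oreilly.com  |      0.0357 |
| russell.jurney@gmail.com | mikel@oreilly.com    | meghan@oreilly.com    |        0.25 |
| russell.jurney@gmail.com | mikel@oreilly.com    | mstallone@oreilly.com |        0.25 |
| russell.jurney@gmail.com | toolsreq@oreilly.com | meghan@oreilly.com    |         0.1 |
| russell.jurney@gmail.com | toolsreq@oreilly.com | mikel@oreilly.com     |         0.2 |
+--------------------------+----------------------+-----------------------+-------------+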

We can then use this data to show who else is likely to appear in an email, given a single address. This data can be used to drive features like Gmail's suggested recipients feature.
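A naive lookup over Table 2.3 is enough to sketch the idea. The function suggest_recipients below is hypothetical and is not a description of how Gmail works: it ranks candidate cc addresses by probability, breaking ties alphabetically, and returns nothing for pairs the table has never seen.

```python
# P(cc | from, to) from Table 2.3, keyed by (from, to).
CC_GIVEN_FROM_TO = {
    ("russell.jurney@gmail.com", "mikel@oreilly.com"): {
        "toolsreq@oreilly.com": 0.0357,
        "meghan@oreilly.com": 0.25,
        "mstallone@oreilly.com": 0.25,
    },
    ("russell.jurney@gmail.com", "toolsreq@oreilly.com"): {
        "meghan@oreilly.com": 0.1,
        "mikel@oreilly.com": 0.2,
    },
}


def suggest_recipients(sender, recipient, table=CC_GIVEN_FROM_TO, limit=2):
    """Return the most probable cc addresses for a (sender, recipient) pair."""
    candidates = table.get((sender, recipient), {})
    # Highest probability first; ties are broken alphabetically by address.
    ranked = sorted(candidates.items(), key=lambda item: (-item[1], item[0]))
    return [address for address, _ in ranked[:limit]]


suggestions = suggest_recipients("russell.jurney@gmail.com", "toolsreq@oreilly.com")
print(suggestions)
```

The empty result for unseen pairs is exactly the gap that a smarter estimate has to fill.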

Figure 2.11. Gmail suggested recipients

Given one recipient, Gmail suggests others.

We'll see later how we can use Bayesian inference to make reasonable suggestions for recipients, even when Table 2.3 is incomplete.

Conclusion

As we've seen, viewing semi-structured data according to different algorithms, structures and perspectives informs feature development more than normalizing and viewing it in structured tables does. We'll be using the perspectives defined above to create features throughout the book, as we climb the data-value stack.
