9781449326265

Chapter 10. Making Predictions

Figure 10.1. Level 4: Making Predictions

Working with Sparse Data

As we saw with the time series we fixed in the last chapter, sparsity is a challenge when working with thin slices of data, even big data. Big data makes analysis easier by letting us use simpler algorithms: we don't have to make statistical inferences just to see an entity in our data. However, when we slice our data, even large datasets become sparse. What do we do when a particular view on our data is sparse and "has holes in it"?

We treat it like a sample. Let's smooth the data behind our emails-sent chart to get better charts and denser data. The simplest way to smooth data is with a moving average. An excellent introduction to smoothing data in numpy/scipy with different kernels is available in the SciPy Cookbook entry on SignalSmooth: http://www.scipy.org/Cookbook/SignalSmooth.

First we'll need to install numpy, Python's excellent numeric library. Along with scipy, nltk, and matplotlib, numpy is what makes Python such a great platform for working with data. Numpy will install with either easy_install or pip:

easy_install numpy
pip install numpy

If you have trouble installing numpy on Mac OS X, try the MacPorts port for numpy:

sudo port install py27-numpy

SciPy Superpack

The SciPy Superpack is a good way to get numpy/scipy/matplotlib/sklearn working on Mac OS X, where building these packages can be tricky.
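
With numpy installed, the moving average mentioned earlier can be sketched in a few lines. This is a minimal illustration rather than the code our application uses: the function name moving_average and the sample hourly counts are made up for the example.

```python
import numpy as np

def moving_average(values, window_len=3):
    """Smooth a 1-D sequence with an unweighted moving average."""
    values = np.asarray(values, dtype=float)
    if window_len < 1 or window_len > len(values):
        raise ValueError("window_len must be between 1 and len(values)")
    # A flat kernel whose weights sum to 1 averages each window equally.
    kernel = np.ones(window_len) / window_len
    # mode='valid' keeps only positions where the window fully overlaps
    # the data, so the result has len(values) - window_len + 1 points.
    return np.convolve(values, kernel, mode='valid')

# Hypothetical sparse hourly counts with holes (zeros) in them.
hourly_totals = [0, 3, 0, 0, 6, 0, 3, 0]
smoothed = moving_average(hourly_totals, window_len=3)
```

Each output point is the mean of three neighboring inputs, so isolated spikes are spread across the holes around them. The windowed kernels discussed next work the same way, but weight the middle of the window more heavily than the edges.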

Let's make a Smoother class for our Flask app to call:

import numpy as np

class Smoother():
  """Takes an array of objects as input, and the data key of the object for access."""
  
  def __init__(self, raw_data, data_key):
    self.raw_data = raw_data
    self.data = self.to_array(raw_data, data_key)
  
  def to_array(self, in_data, data_key):
    """Given an array of objects with values, return a numpy array of values."""
    data_array = list()
    for datum in in_data:
      data_array.append(datum[data_key])
    return np.array(data_array)
  
  def smooth(self, window_len=10, window='blackman'):
    """Smoothing method from SciPy SignalSmooth Cookbook: http://www.scipy.org/Cookbook/SignalSmooth"""
    x = self.data
    # Pad both ends with reflected copies of the signal to reduce edge effects
    s = np.r_[2*x[0]-x[window_len:1:-1], x, 2*x[-1]-x[-1:-window_len:-1]]
    # Look up the numpy window function by name, e.g. np.blackman
    w = getattr(np, window)(window_len)
    y = np.convolve(w/w.sum(), s, mode='same')
    # Trim the padding so the result is the same length as the input
    self.smoothed = y[window_len-1:-window_len+1]
  
  def to_objects(self):
    """Return the smoothed values as a list of objects keyed by hour of day."""
    objects = list()
    hours = [ '%02d' % i for i in range(24) ]
    for idx, val in enumerate(hours):
      objects.append({"sent_hour": val, "total": round(self.smoothed[idx], 0)})
    return objects

Then call it from our Flask app:

@app.route("/address/<string:email_address>")
def address(email_address):
  sent_dist = db.sent_dist.find_one({'email': email_address})
  smitty = Smoother(sent_dist['sent_dist'], 'total')
  smitty.smooth()
  smoothed_dist = smitty.to_objects()
  chart_json = json.dumps(smoothed_dist)
  top_friends = db.top_friends.find_one({'email': email_address})['top_20'][0:5]
  return render_template('partials/address.html', email_address=email_address,
                                                  sent_dist={"sent_dist": smoothed_dist}, 
                                                  chart_json=chart_json, 
                                                  top_friends=top_friends)

Figure 10.2. Before and After Smoothing

Which Smoother?

Our choice of the Blackman window to smooth our emails-sent distribution is arbitrary. In practice, we might sample our densest distributions and smooth these samples with different kernels (such as blackman). Then we might calculate a distance between each smoothed sample and the actual distribution to find the kernel that works best.

For the moment, we will 'wing it.' We can refine any step later, but if we focus on accuracy before we plumb a feature, we are violating YAGNI, or "You Ain't Gonna Need It." After all, we might not use this data at all if our data surprises us and we notice something more interesting.
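
The kernel comparison described above could be sketched as follows. Everything here is an assumption made for illustration: the dense hourly distribution is invented, the sparse sample is drawn with a multinomial, and Euclidean distance between normalized distributions stands in for whatever distance we might actually prefer. The smooth function repeats the cookbook method used in our Smoother class.

```python
import numpy as np

# Window functions that numpy provides by these names.
KERNELS = ['hanning', 'hamming', 'bartlett', 'blackman']

def smooth(x, window_len=10, window='blackman'):
    """Edge-padded windowed convolution, as in the Smoother class."""
    x = np.asarray(x, dtype=float)
    s = np.r_[2*x[0] - x[window_len:1:-1], x, 2*x[-1] - x[-1:-window_len:-1]]
    w = getattr(np, window)(window_len)
    y = np.convolve(w / w.sum(), s, mode='same')
    return y[window_len-1:-window_len+1]

def normalize(values):
    """Clip negatives (padding can overshoot) and scale to sum to 1."""
    values = np.clip(np.asarray(values, dtype=float), 0, None)
    total = values.sum()
    return values / total if total > 0 else values

def kernel_distances(dense, sample_size=50, seed=0):
    """Distance from each smoothed sparse sample to the dense distribution."""
    rng = np.random.RandomState(seed)
    target = normalize(dense)
    # Simulate a sparse view by drawing only a few emails from the dense one.
    sample = rng.multinomial(sample_size, target)
    return dict(
        (kernel, float(np.linalg.norm(normalize(smooth(sample, window=kernel)) - target)))
        for kernel in KERNELS
    )

# Hypothetical dense emails-sent-by-hour counts (24 values, one per hour).
dense_hourly = [2, 1, 0, 0, 1, 3, 8, 15, 30, 42, 50, 47,
                40, 44, 48, 45, 38, 30, 22, 18, 12, 9, 6, 4]
distances = kernel_distances(dense_hourly)
best_kernel = min(distances, key=distances.get)
```

A single sample says little, so a real comparison would average these distances over many samples and many dense distributions before settling on a kernel.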

Having smoothed and displayed the data, we are left with a question: what does smoothing do to the user experience? Do we need to say that it is smoothed? Should we draw a regression line instead? What are we trying to achieve with this data?

Predicting Response Rates to Emails

When I click around in our application and look at the charts showing how often someone emails by hour of the day, I start to wonder if we can infer from this data when someone is most likely to reply. This is why we create charts and reports in the first place: to guide us as we climb the data-value stack.

In this chapter, we will predict whether a recipient will respond to a given email using some of the entities we've extracted from our inbox. In the next chapter, we'll use this inference to enable a new kind of action.

We're going to walk from simple frequencies to real insight one table at a time, just as we did in Chapter 2. This time, we'll show you the code to accompany the logic.

We begin by calculating a simple overall sent count between pairs of email addresses.

/* Get rid of emails with reply_to, as they confuse everything in mailing lists. */
avro_emails = load '/me/tmp/thu_emails' using AvroStorage();
clean_emails = filter avro_emails by (froms is not null) and (reply_tos is null);

/* Treat emails without in_reply_to as sent emails */
trimmed_emails = foreach clean_emails generate froms, tos, message_id;
sent_mails = foreach trimmed_emails generate flatten(froms.address) as from, 
                                             flatten(tos.address) as to, 
                                             message_id;
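/* Count the emails sent between each from/to pair */
sent_counts = foreach (group sent_mails by (from, to)) generate flatten(group) as (from, to),
                                                               COUNT_STAR(sent_mails) as total;
store sent_counts into '/tmp/sent_counts';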

Global sent counts between pairs of email addresses are easy enough to calculate, as this is roughly equivalent to a SQL group by: we use the flatten command to project all unique pairs of from/to in each email (remember: emails can have more than one to), along with the message_id of the email.
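
For readers less familiar with Pig, the same flatten-then-count logic can be sketched in plain Python. The example emails and addresses below are hypothetical, and the record shape (lists of address dictionaries under froms and tos) is only an approximation of our Avro records.

```python
from collections import Counter
from itertools import product

# Hypothetical emails; the second "to" on m1 shows why we need flatten.
emails = [
    {"message_id": "m1",
     "froms": [{"address": "a@example.com"}],
     "tos": [{"address": "b@example.com"}, {"address": "c@example.com"}]},
    {"message_id": "m2",
     "froms": [{"address": "a@example.com"}],
     "tos": [{"address": "b@example.com"}]},
]

def sent_pairs(email):
    """Yield one (from, to, message_id) row per from/to combination."""
    froms = [f["address"] for f in email["froms"]]
    tos = [t["address"] for t in email["tos"]]
    # Flattening two bags in one projection gives their cross product.
    for from_addr, to_addr in product(froms, tos):
        yield from_addr, to_addr, email["message_id"]

# Group by (from, to) and count, like a SQL GROUP BY with COUNT(*).
sent_counts = Counter(
    (from_addr, to_addr)
    for email in emails
    for from_addr, to_addr, _ in sent_pairs(email)
)
```

Here sent_counts maps each (from, to) pair to the number of emails sent, which is the shape of Table 10.1.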

Figure 10.3. Calculating Sent Counts

Table 10.1. Sent Counts - Simple Frequencies

From                        To                       Total
russell.jurney@gmail.com    ****@hotmail.com         237
russell.jurney@gmail.com    jurney@gmail.com         122
russell.jurney@gmail.com    ****.jurney@gmail.com    273

The next step is a little more complex. We need to separate replies. Since we will be using overall sent counts as the denominator in determining our reply ratios, we need to remove all mailing list emails from the analysis, as calculating the sent counts for the entire lurking population of a mailing list is daunting, to say the least.

Our calculation is the same as for total emails, except we filter so that all emails have a non-null in_reply_to, and we project in_reply_to with our email pairs instead of message_id.

/* Remove emails with reply_tos, as they are mailing lists which have incalculable total sent_counts */
avro_emails2 = load '/me/tmp/thu_emails' using AvroStorage();
replies = filter avro_emails2 by (froms is not null) and (reply_tos is null) and (in_reply_to is not null);
replies = foreach replies generate flatten(froms.address) as from,
                                   flatten(tos.address) as to,
                                   in_reply_to;
replies = filter replies by in_reply_to != 'None';
store replies into '/tmp/replies';

Self joins in Pig

Note that we have to load the emails twice to effect a self join. As of Pig 0.10, Pig can't join a relation to itself.

We are now prepared to join the sent messages with the replies to see each email and whether it was replied to at all.

/* Now join a copy of the emails by message id to the in_reply_to of our emails */
with_reply = join sent_mails by message_id, replies by in_reply_to;

/* Filter out mailing lists - only direct replies where from/to match up */
direct_replies = filter with_reply by (sent_mails::from == replies::to) and (sent_mails::to == replies::from);
store direct_replies into '/tmp/direct_replies';

Figure 10.4. Self-Join of Emails with Replies

The data at this point looks like this. Notice how we've used a JOIN (in this case a self-JOIN) to filter our data, which is a pattern in dataflow programming with Pig.

from  to  message_id  from  to  in_reply_to
russell.jurney@gmail.com	kate.jurney@gmail.com	CANSvDjrAR+ZHnxES3hBUZV+wJY_0ZbhzH0wjJEmiCTzBQGH1OQ@mail.gmail.com
  kate.jurney@gmail.com	russell.jurney@gmail.com	CANSvDjrAR+ZHnxES3hBUZV+wJY_0ZbhzH0wjJEmiCTzBQGH1OQ@mail.gmail.com
russell.jurney@gmail.com	kate.jurney@gmail.com	CANSvDjrLdcnk7_bPk-pLK2dSDF9Hw_6YScespnEnrnAEY8hocw@mail.gmail.com
  kate.jurney@gmail.com	russell.jurney@gmail.com	CANSvDjrLdcnk7_bPk-pLK2dSDF9Hw_6YScespnEnrnAEY8hocw@mail.gmail.com
russell.jurney@gmail.com	kate.jurney@gmail.com	CANSvDjrXO0pOC53j7B=sm4TMyTUVpG_GWxT-cUi=MtrGDDcs1Q@mail.gmail.com
  kate.jurney@gmail.com	russell.jurney@gmail.com	CANSvDjrXO0pOC53j7B=sm4TMyTUVpG_GWxT-cUi=MtrGDDcs1Q@mail.gmail.com
russell.jurney@gmail.com	kate.jurney@gmail.com	CANSvDjrbuxc4ik3PPAy9OcRf3au9ww3ivkFKv8rwwdEsqvAAMw@mail.gmail.com
  kate.jurney@gmail.com	russell.jurney@gmail.com	CANSvDjrbuxc4ik3PPAy9OcRf3au9ww3ivkFKv8rwwdEsqvAAMw@mail.gmail.com
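
The join-as-filter pattern can also be sketched in plain Python. This is only an illustration under assumed data: the tuples and addresses are invented, and the function name direct_replies mirrors the Pig relation for readability.

```python
from collections import defaultdict

# Hypothetical rows: (from, to, message_id) for sent mail,
# and (from, to, in_reply_to) for replies.
sent_mails = [
    ("a@example.com", "b@example.com", "m1"),
    ("a@example.com", "list@example.com", "m2"),
    ("a@example.com", "c@example.com", "m3"),
]
replies = [
    ("b@example.com", "a@example.com", "m1"),     # direct reply to m1
    ("d@example.com", "list@example.com", "m2"),  # list traffic, not direct
]

def direct_replies(sent_mails, replies):
    """Inner join on message_id = in_reply_to, keeping only direct replies."""
    # Index replies by the message they answer, so the join is a lookup.
    by_parent = defaultdict(list)
    for r_from, r_to, in_reply_to in replies:
        by_parent[in_reply_to].append((r_from, r_to))

    matched = []
    for s_from, s_to, message_id in sent_mails:
        # Sent mail with no reply finds nothing here: the join filters it out.
        for r_from, r_to in by_parent.get(message_id, []):
            # Keep only replies that go back from the recipient to the sender.
            if s_from == r_to and s_to == r_from:
                matched.append((s_from, s_to, message_id))
    return matched

replied_to = direct_replies(sent_mails, replies)
```

Message m3 drops out because nothing replies to it, and m2 drops out because its reply does not travel back along the same from/to pair.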

Since we have duplicate fields after the join, we can drop them:

direct_replies = foreach direct_replies generate sent_mails::from as from, sent_mails::to as to, sent_mails::message_id as message_id;

The semantics of our data are now, 'the message from A to B with ID C was replied to, from B to A.'

from  to  message_id
russell.jurney@gmail.com	kate.jurney@gmail.com	CANSvDjrAR+ZHnxES3hBUZV+wJY_0ZbhzH0wjJEmiCTzBQGH1OQ@mail.gmail.com
russell.jurney@gmail.com	kate.jurney@gmail.com	CANSvDjrLdcnk7_bPk-pLK2dSDF9Hw_6YScespnEnrnAEY8hocw@mail.gmail.com
russell.jurney@gmail.com	kate.jurney@gmail.com	CANSvDjrXO0pOC53j7B=sm4TMyTUVpG_GWxT-cUi=MtrGDDcs1Q@mail.gmail.com
russell.jurney@gmail.com	kate.jurney@gmail.com	CANSvDjrbuxc4ik3PPAy9OcRf3au9ww3ivkFKv8rwwdEsqvAAMw@mail.gmail.com

Now we're ready to calculate reply counts between pairs of email addresses.

reply_counts = foreach (group direct_replies by (from, to)) generate flatten(group) as (from, to), 
                                                                    COUNT_STAR(direct_replies) as total;
store reply_counts into '/tmp/reply_counts';

Table 10.2. Reply Counts

From                        To                       Total Replies
russell.jurney@gmail.com    ****@hotmail.com         60
russell.jurney@gmail.com    jurney@gmail.com         31
russell.jurney@gmail.com    ****.jurney@gmail.com    36

Having calculated total emails sent between email addresses, as well as the number of replies, we can calculate reply ratios: how often one email address replies to another.

sent_replies = join sent_counts by (from, to), reply_counts by (from, to);
reply_ratios = foreach sent_replies generate sent_counts::from as from, 
                                             sent_counts::to as to, 
                                             (float)reply_counts::total/(float)sent_counts::total as ratio;
reply_ratios = foreach reply_ratios generate from, to, (ratio > 1.0 ? 1.0 : ratio) as ratio;

Figure 10.5. Calculating Reply Ratios

Table 10.3. P(response|email)

From                        To                       P(response|email)
russell.jurney@gmail.com    ****@hotmail.com         0.25316456
russell.jurney@gmail.com    jurney@gmail.com         0.25409836
russell.jurney@gmail.com    ****.jurney@gmail.com    0.13186814

What this means is that given an email from russell.jurney@gmail.com to ****@hotmail.com we can expect 0.25 replies. Another way of saying this is that there is a reply about 25% of the time.
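
To check the arithmetic, here is a small Python sketch that recomputes the ratios from the totals in Tables 10.1 and 10.2. The dictionary layout and the reply_ratio helper are assumptions made for the example; the cap at 1.0 follows the Pig script.

```python
# Totals copied from Tables 10.1 and 10.2. All mail is from
# russell.jurney@gmail.com, so rows are keyed by the (partly redacted) recipient.
sent_counts = {
    "****@hotmail.com": 237,
    "jurney@gmail.com": 122,
    "****.jurney@gmail.com": 273,
}
reply_counts = {
    "****@hotmail.com": 60,
    "jurney@gmail.com": 31,
    "****.jurney@gmail.com": 36,
}

def reply_ratio(replies, sent):
    """Replies divided by emails sent, capped at 1.0."""
    return min(float(replies) / sent, 1.0)

# Inner join on recipient: only pairs present in both tables get a ratio.
reply_ratios = dict(
    (to, reply_ratio(reply_counts[to], sent_counts[to]))
    for to in sent_counts
    if to in reply_counts
)
```

The results agree with Table 10.3 to the precision shown there: 60 replies out of 237 emails sent is a ratio of about 0.25.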

Finally we publish this data to MongoDB and verify that it arrived.

store reply_ratios into 'mongodb://localhost/agile_data.reply_ratios' using MongoStorage();

> db.reply_ratios.findOne({"from": "russell.jurney@gmail.com", "to": "kate.jurney@gmail.com"})
{
  "_id" : ObjectId("5010f7df0364e16aa73da639"),
  "from" : "russell.jurney@gmail.com",
  "to" : "kate.jurney@gmail.com",
  "ratio" : 0.1318681389093399
}

Now let's add this feature to our email address page.

Personalization

Up to now we've made no assumptions about the user of our application. That is about to change. In this section we will assume the user is me: russell.jurney@gmail.com. In practice, we would authenticate and log in a user, and then import and present data from their perspective via a unique session. To simplify the examples, we'll just assume the user is me.

Insert a fetch for reply_ratio into our Flask app:

  reply_ratio = db.reply_ratios.find_one({'from': 'russell.jurney@gmail.com', 'to': email_address})
  return render_template('partials/address.html', reply_ratio=reply_ratio, ...

And edit our template for the address page to display the value. Note that we've skipped displaying each step in our calculation. As you become more comfortable with your dataset, you can chunk in larger batches. Still, it is a good idea to publish frequently to keep everyone on the team on the same page, and to give everyone access to building blocks to create new features from.

<div class="span6" style="margin-top: 25px">
  <h3>Probability of Reply</h3>
  <p style="white-space:nowrap;">{{ reply_ratio['from']}} &nbsp;->&nbsp; {{ reply_ratio['to']}}: &nbsp; {{ reply_ratio['ratio']|round(2) }}</p>
</div>

Figure 10.6. Displaying Reply Ratio

Conclusion

Let's reflect on what we've done. We've taken what we know about the past to predict the future. We now know the odds that an email we are sending will be replied to. This can guide us in whom we email. After all, if we are expecting a response, we might not bother to email someone who doesn't reply!

In the next chapter we'll drill down into this prediction to drive a new action that can take advantage of it.

Site last updated on: July 26, 2012 at 05:06:45 AM PDT