Chapter 10. Making Predictions
Working with Sparse Data
As we saw with the time series we fixed in the last chapter, sparsity is a challenge when working with thin slices of data, even big data. Big data makes analysis easier by allowing simpler algorithms: we don't have to make statistical inferences just to see an entity in our data. However, when we slice our data, even large datasets become sparse. What do we do when a particular view on our data is sparse and 'has holes in it'?
We treat it like a sample. Let's smooth the data from our emails-sent chart to get better charts and denser data. The simplest way to smooth data is a moving average. An excellent introduction to smoothing data in numpy/scipy with different kernels is available in the Python Scipy Cookbook entry on SignalSmooth: http://www.scipy.org/Cookbook/SignalSmooth.
First we'll need to install numpy, Python's excellent numeric library. Along with scipy, nltk, and matplotlib, numpy is what makes Python such a great platform for working with data. Numpy will install with easy_install or pip:
easy_install numpy
pip install numpy
If you have trouble installing numpy on Mac OS X, try the MacPorts port of numpy:
sudo port install py27-numpy
SciPy Superpack
The SciPy Superpack is a good way to get numpy/scipy/matplotlib/sklearn working on Mac OS X, where building these packages can be tricky.
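Before building the class, it may help to see the arithmetic behind the moving average mentioned earlier. The following is a minimal sketch in plain Python, not code from the application: the function name `moving_average` and the sample counts are made up for illustration, and the window simply shrinks at the edges instead of padding the data.

```python
def moving_average(values, window_len=3):
    """Return a centered moving average of a list of numbers.

    Near the edges the window shrinks to whatever neighbors exist,
    so the output has the same length as the input.
    """
    half = window_len // 2
    smoothed = []
    for i in range(len(values)):
        lo = max(0, i - half)
        hi = min(len(values), i + half + 1)
        window = values[lo:hi]
        smoothed.append(sum(window) / float(len(window)))
    return smoothed


# Hypothetical sparse hourly counts: the zeros are the "holes".
hourly_totals = [0, 3, 0, 6, 0, 3]
smoothed_totals = moving_average(hourly_totals)
```

Each hole is filled with a value borrowed from its neighbors. A weighted window such as Blackman follows the same idea, but gives nearby points more weight than distant ones.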
Let's make a Smoother class for our Flask app to call:
import numpy as np

class Smoother():
    """Takes an array of objects as input, and the data key of the object for access."""

    def __init__(self, raw_data, data_key):
        self.raw_data = raw_data
        self.data = self.to_array(raw_data, data_key)

    def to_array(self, in_data, data_key):
        """Given an array of objects with values, return a numpy array of values."""
        data_array = list()
        for datum in in_data:
            data_array.append(datum[data_key])
        return np.array(data_array)

    def smooth(self, window_len=10, window='blackman'):
        """Smoothing method from SciPy SignalSmooth Cookbook: http://www.scipy.org/Cookbook/SignalSmooth"""
        x = self.data
        # Pad both ends with the signal reflected about its end points
        s = np.r_[2*x[0]-x[window_len:1:-1], x, 2*x[-1]-x[-1:-window_len:-1]]
        # Look up the window function by name, e.g. np.blackman
        w = getattr(np, window)(window_len)
        y = np.convolve(w/w.sum(), s, mode='same')
        # Trim the padding so the result is as long as the input
        self.smoothed = y[window_len-1:-window_len+1]

    def to_objects(self):
        """Convert the smoothed values back into one object per hour of the day."""
        objects = list()
        hours = [ '%02d' % i for i in range(24) ]
        for idx, val in enumerate(hours):
            objects.append({"sent_hour": val, "total": round(self.smoothed[idx], 0)})
        return objects

Then call it from our Flask app:
@app.route("/address/<string:email_address>")
def address(email_address):
    sent_dist = db.sent_dist.find_one({'email': email_address})
    smitty = Smoother(sent_dist['sent_dist'], 'total')
    smitty.smooth()
    smoothed_dist = smitty.to_objects()
    chart_json = json.dumps(smoothed_dist)
    top_friends = db.top_friends.find_one({'email': email_address})['top_20'][0:5]
    return render_template('partials/address.html', email_address=email_address,
                                                    sent_dist={"sent_dist": smoothed_dist},
                                                    chart_json=chart_json,
                                                    top_friends=top_friends)

Which Smoother?
Our choice of the Blackman window to smooth our emails-sent distribution is arbitrary. In practice, we might sample our densest distributions and smooth these samples with different kernels (such as Blackman). Then we might calculate a distance between our smoothed samples and the actual distributions to find the kernel that works best.
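As a rough illustration of that kernel comparison, here is a minimal sketch, assuming numpy is installed. It is not part of the application: `rank_kernels`, the 20% thinning rate, the L1 distance, and the hourly counts are all assumptions made for the example, and `smooth` repeats the pad-convolve-trim steps from the Smoother class as a plain function.

```python
import numpy as np


def smooth(x, window_len=10, window='blackman'):
    """Same pad-convolve-trim steps as Smoother.smooth, as a plain function."""
    s = np.r_[2*x[0]-x[window_len:1:-1], x, 2*x[-1]-x[-1:-window_len:-1]]
    w = getattr(np, window)(window_len)
    y = np.convolve(w/w.sum(), s, mode='same')
    return y[window_len-1:-window_len+1]


def rank_kernels(dense_counts, keep=0.2, seed=0,
                 kernels=('hanning', 'hamming', 'bartlett', 'blackman')):
    """Thin a dense distribution, smooth it with each kernel, and rank the
    kernels by how close the smoothed sample lands to the original shape."""
    dense = np.asarray(dense_counts, dtype=float)
    target = dense / dense.sum()

    # Simulate sparsity: keep each email with probability `keep`.
    rng = np.random.RandomState(seed)
    sample = rng.binomial(dense.astype(int), keep).astype(float)

    scores = []
    for kernel in kernels:
        # Padding can push values below zero at the edges; clip them.
        smoothed = np.clip(smooth(sample, window=kernel), 0, None)
        estimate = smoothed / smoothed.sum()
        # L1 distance between the two normalized distributions.
        scores.append((kernel, float(np.abs(estimate - target).sum())))
    return sorted(scores, key=lambda pair: pair[1])


# Hypothetical emails-sent-by-hour counts for one very active address.
dense_counts = [2, 1, 0, 0, 1, 3, 10, 25, 60, 90, 110, 120,
                100, 95, 105, 115, 90, 70, 50, 40, 30, 20, 10, 5]
ranking = rank_kernels(dense_counts)
```

The first entry of `ranking` is the kernel with the smallest distance for this one sample. A real comparison would repeat this over many addresses and random seeds before trusting the result.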
For the moment, we will 'wing it.' We can refine any step later, but if we focus on accuracy before we have plumbed a feature end to end, we are violating YAGNI, or "You Ain't Gonna Need It." After all, we might not use this data at all if our data surprises us and we notice something more interesting.
Having smoothed and displayed the data, we are left with a question: what does smoothing do to the user experience? Do we need to say that it is smoothed? Should we draw a regression line instead? What are we trying to achieve with this data?
Predicting Response Rates to Emails
When I click around in our application and look at the charts showing how often someone emails by hour of the day, I start to wonder if we can infer from this data when someone is most likely to reply. This is why we create charts and reports in the first place - to guide us as we climb the data-value stack.
In this chapter, we will predict whether a recipient will respond to a given email using some of the entities we've extracted from our inbox. In the next chapter, we'll use this inference to enable a new kind of action.
We're going to walk from simple frequencies to real insight one table at a time, just as we did in Chapter 2. This time, we'll show you the code to accompany the logic.
We begin by calculating a simple overall sent count between pairs of email addresses.
/* Get rid of emails with reply_to, as they confuse everything in mailing lists. */
avro_emails = load '/me/tmp/thu_emails' using AvroStorage();
clean_emails = filter avro_emails by (froms is not null) and (reply_tos is null);

/* Treat every remaining email as a sent email: project each from/to pair with its message_id */
trimmed_emails = foreach clean_emails generate froms, tos, message_id;
sent_mails = foreach trimmed_emails generate flatten(froms.address) as from,
                                             flatten(tos.address) as to,
                                             message_id;

/* Count the emails sent between each pair of addresses */
sent_counts = foreach (group sent_mails by (from, to)) generate flatten(group) as (from, to),
                                                                COUNT_STAR(sent_mails) as total;
store sent_counts into '/tmp/sent_counts';

Global sent counts between pairs of email addresses are easy enough to calculate, as this is roughly equivalent to a SQL group by. We use the flatten command to project all unique pairs of from/to in each email (remember: emails can have more than one to), along with the message_id of the email, and then group by the pair and count.
Table 10.1. Sent Counts - Simple Frequencies
| From | To | Total |
|---|---|---|
| russell.jurney@gmail.com | ****@hotmail.com | 237 |
| russell.jurney@gmail.com | jurney@gmail.com | 122 |
| russell.jurney@gmail.com | ****.jurney@gmail.com | 273 |
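For readers less familiar with Pig, the flatten-and-count step behind Table 10.1 can be sketched in plain Python. This is only an illustration: the three emails are made up, and the records are simplified so that `froms` and `tos` are lists of address strings, where the real Avro records hold nested objects.

```python
from collections import Counter
from itertools import product

# Made-up emails; the real records carry more fields.
emails = [
    {"message_id": "m1", "froms": ["a@example.com"],
     "tos": ["b@example.com", "c@example.com"]},
    {"message_id": "m2", "froms": ["a@example.com"], "tos": ["b@example.com"]},
    {"message_id": "m3", "froms": ["b@example.com"], "tos": ["a@example.com"]},
]


def sent_mails(emails):
    """Like the two flattens in Pig: one row per from/to pair per email."""
    for email in emails:
        for sender, recipient in product(email["froms"], email["tos"]):
            yield sender, recipient, email["message_id"]


# Like grouping by (from, to) and applying COUNT_STAR.
sent_counts = Counter((sender, recipient)
                      for sender, recipient, _ in sent_mails(emails))
```

Message m1 has two recipients, so it contributes one row to each of two pairs. This is why a single email can raise more than one sent count.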
The next step is a little more complex. We need to separate out replies. Since we will be using overall sent counts as the denominator in determining our reply ratios, we need to remove all mailing list emails from the analysis, as calculating the sent counts for the entire lurking population of a mailing list is daunting, to say the least.
Our calculation is the same as for total emails, except we filter so that all emails have a non-null in_reply_to, and we project in_reply_to with our email pairs instead of message_id.
/* Remove emails with reply_tos, as they are mailing lists which have incalculable total sent_counts */
avro_emails2 = load '/me/tmp/thu_emails' using AvroStorage();
replies = filter avro_emails2 by (froms is not null) and (reply_tos is null) and (in_reply_to is not null);
replies = foreach replies generate flatten(froms.address) as from,
                                   flatten(tos.address) as to,
                                   in_reply_to;
replies = filter replies by in_reply_to != 'None';
store replies into '/tmp/replies';

Self joins in Pig
Note that we have to load the emails twice to effect a self join. As of Pig 0.10, Pig can't join a relation to itself.
We are now prepared to join the sent messages with the replies to see each email and whether it was replied to at all.
/* Now join a copy of the emails by message id to the in_reply_to of our emails */
with_reply = join sent_mails by message_id, replies by in_reply_to;

/* Filter out mailing lists - only direct replies where from/to match up */
direct_replies = filter with_reply by (sent_mails::from == replies::to) and (sent_mails::to == replies::from);
store direct_replies into '/tmp/direct_replies';
The data at this point looks like this. Notice how we've used a JOIN (in this case a self-JOIN) to filter our data, which is a pattern in dataflow programming with Pig.
| sent_mails::from | sent_mails::to | message_id | replies::from | replies::to | in_reply_to |
|---|---|---|---|---|---|
| russell.jurney@gmail.com | kate.jurney@gmail.com | CANSvDjrAR+ZHnxES3hBUZV+wJY_0ZbhzH0wjJEmiCTzBQGH1OQ@mail.gmail.com | kate.jurney@gmail.com | russell.jurney@gmail.com | CANSvDjrAR+ZHnxES3hBUZV+wJY_0ZbhzH0wjJEmiCTzBQGH1OQ@mail.gmail.com |
| russell.jurney@gmail.com | kate.jurney@gmail.com | CANSvDjrLdcnk7_bPk-pLK2dSDF9Hw_6YScespnEnrnAEY8hocw@mail.gmail.com | kate.jurney@gmail.com | russell.jurney@gmail.com | CANSvDjrLdcnk7_bPk-pLK2dSDF9Hw_6YScespnEnrnAEY8hocw@mail.gmail.com |
| russell.jurney@gmail.com | kate.jurney@gmail.com | CANSvDjrXO0pOC53j7B=sm4TMyTUVpG_GWxT-cUi=MtrGDDcs1Q@mail.gmail.com | kate.jurney@gmail.com | russell.jurney@gmail.com | CANSvDjrXO0pOC53j7B=sm4TMyTUVpG_GWxT-cUi=MtrGDDcs1Q@mail.gmail.com |
| russell.jurney@gmail.com | kate.jurney@gmail.com | CANSvDjrbuxc4ik3PPAy9OcRf3au9ww3ivkFKv8rwwdEsqvAAMw@mail.gmail.com | kate.jurney@gmail.com | russell.jurney@gmail.com | CANSvDjrbuxc4ik3PPAy9OcRf3au9ww3ivkFKv8rwwdEsqvAAMw@mail.gmail.com |
Since we have duplicate fields after the join, we can drop them:
direct_replies = foreach direct_replies generate sent_mails::from as from,
                                                 sent_mails::to as to,
                                                 sent_mails::message_id as message_id;
The semantics of our data are now, 'the message from A to B with ID C was replied to, from B to A.'
| from | to | message_id |
|---|---|---|
| russell.jurney@gmail.com | kate.jurney@gmail.com | CANSvDjrAR+ZHnxES3hBUZV+wJY_0ZbhzH0wjJEmiCTzBQGH1OQ@mail.gmail.com |
| russell.jurney@gmail.com | kate.jurney@gmail.com | CANSvDjrLdcnk7_bPk-pLK2dSDF9Hw_6YScespnEnrnAEY8hocw@mail.gmail.com |
| russell.jurney@gmail.com | kate.jurney@gmail.com | CANSvDjrXO0pOC53j7B=sm4TMyTUVpG_GWxT-cUi=MtrGDDcs1Q@mail.gmail.com |
| russell.jurney@gmail.com | kate.jurney@gmail.com | CANSvDjrbuxc4ik3PPAy9OcRf3au9ww3ivkFKv8rwwdEsqvAAMw@mail.gmail.com |
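The join-then-filter pattern used to find direct replies can be sketched in plain Python as follows. The rows and addresses are invented for the example, and the function name `direct_replies` simply mirrors the Pig relation; this is an illustration of the logic, not a replacement for the Pig script.

```python
# Made-up rows in the shape of sent_mails: (from, to, message_id)
sent_mails = [
    ("a@example.com", "b@example.com", "m1"),
    ("a@example.com", "list@example.com", "m2"),
    ("a@example.com", "c@example.com", "m3"),
]

# Made-up rows in the shape of replies: (from, to, in_reply_to)
replies = [
    ("b@example.com", "a@example.com", "m1"),     # direct reply to m1
    ("d@example.com", "list@example.com", "m2"),  # list traffic, not direct
]


def direct_replies(sent_mails, replies):
    """Inner join on message_id == in_reply_to, keeping only pairs
    where the reply goes back from the recipient to the sender."""
    replies_by_parent = {}
    for reply_from, reply_to, in_reply_to in replies:
        replies_by_parent.setdefault(in_reply_to, []).append((reply_from, reply_to))

    matched = []
    for sent_from, sent_to, message_id in sent_mails:
        for reply_from, reply_to in replies_by_parent.get(message_id, []):
            if sent_from == reply_to and sent_to == reply_from:
                matched.append((sent_from, sent_to, message_id))
    return matched


replied_to = direct_replies(sent_mails, replies)
```

Message m3 drops out because the join finds no reply for it, and m2 drops out because the reply did not come back from the original recipient to the original sender.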
Now we're ready to calculate reply counts between pairs of email addresses.
reply_counts = foreach (group direct_replies by (from, to)) generate flatten(group) as (from, to),
                                                                     COUNT_STAR(direct_replies) as total;
store reply_counts into '/tmp/reply_counts';

Table 10.2. Reply Counts
| From | To | Total Replies |
|---|---|---|
| russell.jurney@gmail.com | ****@hotmail.com | 60 |
| russell.jurney@gmail.com | jurney@gmail.com | 31 |
| russell.jurney@gmail.com | ****.jurney@gmail.com | 36 |
Having calculated total emails sent between email addresses, as well as the number of replies, we can calculate reply ratios: how often one email address replies to another.
sent_replies = join sent_counts by (from, to), reply_counts by (from, to);
reply_ratios = foreach sent_replies generate sent_counts::from as from,
                                             sent_counts::to as to,
                                             (float)reply_counts::total/(float)sent_counts::total as ratio;
reply_ratios = foreach reply_ratios generate from, to, (ratio > 1.0 ? 1.0 : ratio) as ratio;

Table 10.3. P(response|email)
| From | To | P(response\|email) |
|---|---|---|
| russell.jurney@gmail.com | ****@hotmail.com | 0.25316456 |
| russell.jurney@gmail.com | jurney@gmail.com | 0.25409836 |
| russell.jurney@gmail.com | ****.jurney@gmail.com | 0.13186814 |
This means that for each email from russell.jurney@gmail.com to ****@hotmail.com we can expect 0.25 replies on average. Put another way, such an email gets a reply about 25% of the time.
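The arithmetic behind Table 10.3 can be checked with a short Python sketch. The counts are copied from Tables 10.1 and 10.2 and keyed by the masked recipient address as printed there; the helper name `reply_ratio` is made up for this example.

```python
# Totals copied from Tables 10.1 and 10.2, keyed by recipient.
sent_counts = {
    "****@hotmail.com": 237,
    "jurney@gmail.com": 122,
    "****.jurney@gmail.com": 273,
}
reply_counts = {
    "****@hotmail.com": 60,
    "jurney@gmail.com": 31,
    "****.jurney@gmail.com": 36,
}


def reply_ratio(replies, sent):
    """P(response|email), capped at 1.0 as in the Pig script."""
    return min(1.0, float(replies) / float(sent))


reply_ratios = {to: reply_ratio(reply_counts[to], sent_counts[to])
                for to in sent_counts}
```

The cap only matters when a pair has more counted replies than sent emails, which can happen, for example, when one message draws several replies.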
Finally we publish this data to MongoDB and verify that it arrived.
store reply_ratios into 'mongodb://localhost/agile_data.reply_ratios' using MongoStorage();
> db.reply_ratios.findOne({"from": "russell.jurney@gmail.com", "to": "kate.jurney@gmail.com"})
{
    "_id" : ObjectId("5010f7df0364e16aa73da639"),
    "from" : "russell.jurney@gmail.com",
    "to" : "kate.jurney@gmail.com",
    "ratio" : 0.1318681389093399
}

Now let's add this feature to our email address page.
Personalization
Up to now we've made no assumptions about the user of our application. That is about to change. In this section we will assume the user is me - russell.jurney@gmail.com. In practice, we would authenticate and log in a user, and then import and present data from their perspective via a unique session. To simplify the examples, we'll just assume the user is me.
Insert a fetch for reply_ratio into our Flask app:
reply_ratio = db.reply_ratios.find_one({'from': 'russell.jurney@gmail.com', 'to': email_address})
return render_template('partials/address.html', reply_ratio=reply_ratio, ...

And edit our template for the address page to display the value. Note that we've skipped displaying each step in our calculation. As you become more comfortable with your dataset, you can work in larger chunks. Still, it is a good idea to publish frequently to keep everyone on the team on the same page, and to give everyone access to building blocks from which to create new features.
<div class="span6" style="margin-top: 25px">
  <h3>Probability of Reply</h3>
  <p style="white-space:nowrap;">{{ reply_ratio['from']}} -> {{ reply_ratio['to']}}: {{ reply_ratio['ratio']|round(2) }}</p>
</div>

Conclusion
Let's reflect on what we've done. We've taken what we know about the past to predict the future. We now have an estimate of the probability that an email we send will be replied to. This can guide us in whom we email - after all, if we are expecting a response, we might not bother to email someone who doesn't reply!
In the next chapter we'll drill down into this prediction to drive a new action that can take advantage of it.










