Chapter 9. Exploring Data with Reports
Introduction
As charts become interactive, static pages become dynamic, and our data becomes explorable, we are at the reports stage of the data value pyramid.
Building Reports with Multiple Charts
To build a report, we need to compose multiple views on the same entity. The charts we made last chapter will serve us well as we increase interactivity to create reports. Let's create an email address entity page and add a tag cloud of related email addresses to give us something of a report.
avro_emails = load '/me/tmp/thu_emails' using AvroStorage();
avro_emails = filter avro_emails by (froms is not null);
/* We need to insert reply_to as a valid from, or email addresses will be missing from our index */
split avro_emails into has_reply_to if (reply_tos is not null), just_froms if (reply_tos is null);
/* Count both the from and reply_to as valid froms if there is a reply_tos field */
reply_tos = foreach has_reply_to generate reply_tos as froms, tos, ccs, bccs;
reply_to_froms = foreach has_reply_to generate froms, tos, ccs, bccs;
/* Treat emails without reply_to as normal */
just_froms = foreach just_froms generate froms, tos, ccs, bccs;
/* Now union them all and we have our dataset to compute on */
emails = union reply_tos, reply_to_froms, just_froms;
/* Now pair up our froms/reply_tos with all recipient types,
and union them to get a sender/recipient connection list. */
tos = foreach emails generate flatten(froms.address) as from, flatten(tos.address) as to;
ccs = foreach emails generate flatten(froms.address) as from, flatten(ccs.address) as to;
bccs = foreach emails generate flatten(froms.address) as from, flatten(bccs.address) as to;
pairs = union tos, ccs, bccs;
counts = foreach (group pairs by (from, to)) generate flatten(group) as (from, to), COUNT(pairs) as total;
top_pairs = foreach (group counts by from) {
  filtered = filter counts by (to is not null);
  sorted = order filtered by total desc;
  top_20 = limit sorted 20;
  generate group as email, top_20.(to) as top_20;
}
store top_pairs into 'mongodb://localhost/agile_data.top_friends' using MongoStorage();
Our Flask controller combines several stubs we've already created along with top friends:
@app.route("/address/<string:email_address>")
def address(email_address):
    sent_dist = db.sent_dist.find_one({'email': email_address})
    chart_json = json.dumps(sent_dist['sent_dist']) # make json for d3.js
    top_friends = db.top_friends.find_one({'email': email_address})['top_20'][0:6]
    return render_template('partials/address.html', email_address=email_address,
                           sent_dist=sent_dist,
                           chart_json=chart_json,
                           top_friends=top_friends)
Our template code adds a space for related contacts:
<table class="table table-striped table-bordered table-condensed" style="width: 140px">
<thead>
<tr>
<th style="width: 50px">Hour</th>
<th style="width: 100px">Emails Sent</th>
</tr>
</thead>
<tbody>
{% for d in sent_dist['sent_dist'] %}
<tr>
<td>{{ d['sent_hour'] }}</td>
<td>{{ d['total'] }}</td>
</tr>
{% endfor %}
</tbody>
</table>
</div>
<div class="span7" style="margin-top: 0px">
<h3>Related Email Addresses</h3>
<div style="margin-top: 10px">
{% for d in top_friends %}
<input class="btn btn-primary" style="margin: 2px; margin-top: 6px;" type="button" value="{{ d['to'] }}">
{% endfor %}
</div>
</div>
<div class="span6" style="margin-top: 65px;">
<h3>Emails Sent per Hour</h3>
<div id="d3div" style="color: white;"></div>
</div>
Linking Records
Having created a report, adding interactivity is easy. Let's begin by linking between email address entities. Update our template to add an onclick event for our email address buttons.
{% for d in top_friends %}
<input class="btn btn-primary" style="margin: 2px; margin-top: 6px;" type="button" value="{{ d['to'] }}" onclick="document.location='/address/{{ d['to'] }}'">
{% endfor %}
We can now explore email addresses and their time habits endlessly! Big deal, right? Maybe not, but it is a good start. Let's extend this by making the email addresses in emails clickable. We need only extend our macros to add links when displaying email addresses.
{% macro format_email(address, real_name) -%}
<a href="/address/{{ address }}">"{{ real_name }}" &lt;{{ address }}&gt;</a>
{% endmacro -%}
<!-- Display a list of emails across a span, as in to/from/cc/bcc/reply_to -->
{% macro display_emails(key, name) -%}
{% set plural_key = key + 's' -%}
{% if email[plural_key] -%}
<div class="row">
{{ display_label(name) }}
{% for email in email[plural_key] -%}
{% set display_name = format_email(email.address, email.real_name) -%}
<div class="span" style="display: inline-block">{{ display_name }}</div>
{% endfor -%}
</div>
{% endif -%}
{% endmacro -%}
We can now explore email addresses, their properties, and their relationships as we view emails. This kind of 'pivot' offers insight and is a simple form of recommendation. But we've got a bug: for sparse entries, we are skipping hours in our table and chart.
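To make the bug concrete, here is a minimal Python sketch that lists which hours are absent from a sparse distribution. The sample records and the helper name missing_hours are made up for illustration; only the sent_hour and total field names come from our data.

```python
# Hypothetical sparse distribution: this address only sent mail in three hours.
sample_dist = [
    {'sent_hour': '08', 'total': 3},
    {'sent_hour': '09', 'total': 7},
    {'sent_hour': '17', 'total': 1},
]

def missing_hours(sent_dist):
    """Return the zero-padded hours ('00' to '23') with no record in sent_dist."""
    present = set(d['sent_hour'] for d in sent_dist)
    return ['%02d' % h for h in range(24) if '%02d' % h not in present]

gaps = missing_hours(sample_dist)
print(len(gaps))  # 21 hours have no row, so the table and chart skip them
```

Any address with fewer than 24 distinct sent hours will show this kind of gap.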
We can fix this bug in five places: in JavaScript, in the template, in the controller, in the database, or in Pig. Let's look at where we can fix it and where it makes the most sense. Fixing the data in JavaScript looks like this:
// Get "00" - "23"
function makeHourRange(num) {
  return num < 10 ? "0" + num.toString() : num.toString();
}
function fillBlanks(rawData) {
  var hourRange = d3.range(0, 24);
  var ourData = [];
  // Visit every hour of the day, reusing the matching record if one exists
  for (var i = 0; i < hourRange.length; i++) {
    var hourString = makeHourRange(hourRange[i]);
    var match = null;
    for (var j = 0; j < rawData.length; j++) {
      if (rawData[j]['sent_hour'] == hourString) {
        match = rawData[j];
        break;
      }
    }
    // Fall back to a zero total for hours with no sent emails
    ourData.push(match !== null ? match : {'sent_hour': hourString, 'total': 0});
  }
  return ourData;
}
var rawData = {{ chart_json|safe }};
var filledData = fillBlanks(rawData);
Fixing this in the template language is possible, but embedding this kind of logic there is to be discouraged. Jinja2 isn't for data processing; we can do that elsewhere. We can fix it in our Python controller (or abstract it into a model class) by reformatting the data on each request.
def fill_in_blanks(in_data):
    out_data = list()
    hours = [ '%02d' % i for i in range(24) ]
    for hour in hours:
        entry = [x for x in in_data if x['sent_hour'] == hour]
        if entry:
            out_data.append(entry[0])
        else:
            out_data.append({'sent_hour': hour, 'total': 0})
    return out_data
Changing one line in our controller gets our empty values filled in.
def address(email_address):
    # ... sent_dist lookup as before ...
    chart_json = json.dumps(fill_in_blanks(sent_dist['sent_dist']))
We can see that when it comes to data, Python has teeth. List comprehensions make this implementation fairly succinct.
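The list comprehension in fill_in_blanks rescans in_data once per hour. As a hedged alternative, not the version used in our controller, the same result can be reached by indexing the records by hour first. The name fill_in_blanks_indexed and the sample input are assumptions for illustration.

```python
def fill_in_blanks_indexed(in_data):
    """Return 24 hourly records, adding a zero total for any missing hour."""
    # Index the existing records by their zero-padded hour string.
    by_hour = dict((d['sent_hour'], d) for d in in_data)
    hours = ['%02d' % i for i in range(24)]
    return [by_hour.get(h, {'sent_hour': h, 'total': 0}) for h in hours]

# Hypothetical sparse input with only two hours present.
sparse = [{'sent_hour': '09', 'total': 7}, {'sent_hour': '17', 'total': 1}]
filled = fill_in_blanks_indexed(sparse)
print(len(filled))          # 24
print(filled[9]['total'])   # 7
print(filled[0]['total'])   # 0
```

With only 24 hours the difference in speed is negligible, so this is a matter of taste rather than a needed optimization.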
The problem here is that on every request we are reformatting data that we formatted ourselves in our Pig script. Why not simply get the format right the first time? Consistency between model and view creates clarity for everyone.
Mongo can use the same JavaScript we used in the web page to fill in empty values in a query. Beautiful, right? There is one exception: we must create our own range() function, as d3.js is not available to MongoDB.
// Thanks to http://stackoverflow.com/questions/8273047/javascript-function-similar-to-python-range
function range(start, stop, step) {
  if (typeof stop == 'undefined') {
    // one param defined
    stop = start;
    start = 0;
  }
  if (typeof step == 'undefined') {
    step = 1;
  }
  if ((step > 0 && start >= stop) || (step < 0 && start <= stop)) {
    return [];
  }
  var result = [];
  for (var i = start; step > 0 ? i < stop : i > stop; i += step) {
    result.push(i);
  }
  return result;
}
// Get "00" - "23"
function makeHourRange(num) {
  return num < 10 ? "0" + num.toString() : num.toString();
}
function fillBlanks(rawData) {
  var hourRange = range(0, 24);
  var ourData = [];
  // Visit every hour of the day, reusing the matching record if one exists
  for (var i = 0; i < hourRange.length; i++) {
    var hourString = makeHourRange(hourRange[i]);
    var match = null;
    for (var j = 0; j < rawData.length; j++) {
      if (rawData[j]['sent_hour'] == hourString) {
        match = rawData[j];
        break;
      }
    }
    // Fall back to a zero total for hours with no sent emails
    ourData.push(match !== null ? match : {'sent_hour': hourString, 'total': 0});
  }
  return ourData;
}
fillBlanks(data);
Being able to query our database in JavaScript is convenient, but ideally we fix the problem at its source. To do so, we can reuse our Python code by turning it into a Jython UDF for Pig and calling it from our script. Note that Pig passes tuples to Jython as Python tuples and bags as Python lists of tuples.
@outputSchema("sent_dist:bag{t:(sent_hour:chararray, total:int)}")
def fill_in_blanks(sent_dist):
    print sent_dist  # debug: show the incoming bag
    out_data = list()
    hours = [ '%02d' % i for i in range(24) ]
    for hour in hours:
        entry = [x for x in sent_dist if x[0] == hour]
        if entry:
            entry = entry[0]
            print entry.__class__  # debug: show the type of the matched tuple
            out_data.append(tuple([entry[0], entry[1]]))
        else:
            out_data.append(tuple([hour, 0]))
    return out_data
/* Load our Jython UDFs */
register 'udfs.py' using jython as funcs;
emails = load '/me/tmp/thu_emails/' using AvroStorage();
...
/* Here we apply our Jython UDF, fill_in_blanks() to fill holes in our time series. */
filled_dist = foreach sent_distributions generate email, funcs.fill_in_blanks(sent_dist) as sent_dist;
store filled_dist into '/tmp/filled_distributions.avro' using AvroStorage();
store filled_dist into 'mongodb://localhost/agile_data.sent_dist' using MongoStorage();
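Because a bag reaches the UDF as a list of tuples, the UDF's logic can be checked in plain Python before running it under Pig. The sketch below repeats the function body without the @outputSchema decorator, which is only available when the script runs under Pig, and without the debug prints. The name fill_in_blanks_local and the sample bag are assumptions for illustration.

```python
def fill_in_blanks_local(sent_dist):
    """Same logic as the UDF; sent_dist is a list of (sent_hour, total) tuples."""
    out_data = []
    for hour in ['%02d' % i for i in range(24)]:
        entry = [x for x in sent_dist if x[0] == hour]
        if entry:
            out_data.append((entry[0][0], entry[0][1]))
        else:
            out_data.append((hour, 0))
    return out_data

# Hypothetical bag shaped the way Pig hands it to Jython: a list of tuples.
bag = [('09', 7), ('17', 1)]
result = fill_in_blanks_local(bag)
print(len(result))   # 24
print(result[9])     # ('09', 7)
print(result[0])     # ('00', 0)
```

The returned list of two-element tuples matches the bag of (sent_hour, total) tuples declared in the UDF's output schema.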
We've fixed one bug four different ways. In practice, we might process data at any layer of the stack, but prudence tells us to push our processing as deep in the stack as we can, so that we reach a simple, globally consistent view of our entities and their relationships.
Conclusion
We might summarize what we've done so far in these steps:
Create interesting, inter-connected records. The bar for interesting is initially low. We will improve it over time based on user feedback, traffic analysis and noodling.
Store these records as objects in a document store, like so:
key => {property1, property2, links => [key1, key2, key3]}
Split records as properties increase and become complex to avoid deep nesting. Or go at it as a document. Both approaches are valid if they fit your data.
Use a lightweight web framework like Sinatra or Bottle to emit the key/value data as JSON, or use a key/value store that returns JSON in the first place.
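As a minimal sketch of the record shape described in these steps, the following builds two linked records and emits one as JSON using only the standard library. The keys, property names, and values are hypothetical, and the in-memory dict stands in for a real document store.

```python
import json

# Hypothetical store: key => {properties..., links => [other keys]}
records = {
    'alice@example.com': {
        'total_sent': 42,
        'links': ['bob@example.com', 'carol@example.com'],
    },
    'bob@example.com': {'total_sent': 7, 'links': ['alice@example.com']},
}

def emit_json(key):
    """Return the record for key as a JSON string, or 'null' if it is unknown."""
    return json.dumps(records.get(key), sort_keys=True)

print(emit_json('alice@example.com'))
```

Each entry in links is itself a key, so a client can follow it to the next record, which is the same pivot our email address buttons provide.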
Extracting entities and their properties, linking between records, and starting a user session with a pile of links is a strategy that looks like this:










