9781449326265

Chapter 3. Agile Tools

Scalability = Simplicity

As NoSQL tools like Hadoop have developed alongside data science and big data, much focus has been on the plumbing of analytics applications. In this book, we teach you to build applications that use such infrastructure. We take this plumbing for granted and build applications that depend on it. As such, we devote only two chapters to infrastructure: one introducing our development tools, and one on scaling them up in the cloud.

In choosing our tools we seek scalability, but above all we seek simplicity. While the concurrent systems required to drive a modern analytics application at any kind of scale are complex, we still need to be able to focus on the task at hand: creating value for the user. When tools are too complex, we start to focus on the problem the tools are supposed to solve, and not on mining data and building applications. An efficient stack enables collaboration by team members who are not experts at distributed systems.

The stack we've chosen for this book is not definitive. It has been selected as an example of the kind of end-to-end setup you should aim for in order to rapidly and effectively build analytics applications. The takeaway should be an example stack you can use to jumpstart your application, and a standard to which you should hold other stacks.

Agile Data Processing

Figure 3.1. Flow of data processing in Agile Data

Events -> Collectors -> Bulk Storage -> Batch Processing -> Distributed Store -> Application Server -> Browser -> User

The first step to building analytics applications is to plumb your application from end to end: from collecting raw data to displaying something on the user's screen. This is important because models can get complex fast, and you need user feedback plugged into the equation from the start, lest you start iterating without feedback (also known as the death spiral).

The components of our stack are thus:

  • Events are the things logs represent. An event is an occurrence that happens and is logged along with its features and timestamps.

    Events come in many forms: logs from servers, sensors, or financial transactions, as well as actions our users take in our own application. To facilitate data exchange among different tools and languages, events are serialized in a common, agreed-upon format.

  • Collectors are event aggregators. They collect events from numerous sources and log them in aggregate to bulk storage, or queue them for action by sub-realtime workers.

    We'll be using either Kafka or Flume as our collectors, but we're not sure which one just yet.

  • Bulk Storage is a filesystem capable of parallel access by many concurrent processes. We'll be using S3 in place of the Hadoop Filesystem for this purpose.

  • A Distributed NoSQL Store is a multi-node key/value or document store. In Agile Data we use these to publish data for consumption by web applications and other services. We'll be using MongoDB as our NoSQL store. Note that the store in this role needn't be NoSQL; MySQL makes a fine key/value store.

  • A minimalist web Application Server enables us to plumb our data as JSON through to the client for visualization, with minimal overhead.

  • A modern browser or mobile application enables us to present our data as an interactive experience for our users, who provide data through interaction and events describing those actions. In this book we focus on web applications.

This list may look daunting, but in practice these tools are easy to set up and match the crunch points in data science, as we'll see. This setup scales easily, and is optimized for analytic processing.
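
To make the flow concrete, here is a toy walk through the same stages using only the Python standard library. It is a sketch under loud assumptions: JSON lines stand in for Avro, a list stands in for bulk storage, a dict stands in for the distributed store, and every function name is hypothetical rather than part of any tool in our stack.

```python
import json
from collections import Counter

def collect(events):
    """Collector: serialize each event to one JSON line (a stand-in for Avro)."""
    return [json.dumps(event, sort_keys=True) for event in events]

def batch_process(lines):
    """Batch processing: count events per (from, to) pair."""
    counts = Counter()
    for line in lines:
        event = json.loads(line)
        counts[(event["from"], event["to"])] += 1
    return counts

def publish(counts):
    """Distributed store: key/value records the web application can look up."""
    return dict(("%s|%s" % pair, total) for pair, total in counts.items())

def serve(store, sender, recipient):
    """Application server: return one record as JSON for the browser."""
    key = "%s|%s" % (sender, recipient)
    record = {"from": sender, "to": recipient, "total": store.get(key, 0)}
    return json.dumps(record, sort_keys=True)

# Events -> Collectors -> Bulk Storage -> Batch Processing -> Store -> Server
events = [
    {"from": "a@example.com", "to": "b@example.com"},
    {"from": "a@example.com", "to": "b@example.com"},
    {"from": "c@example.com", "to": "b@example.com"},
]
bulk_storage = collect(events)
store = publish(batch_process(bulk_storage))
response = serve(store, "a@example.com", "b@example.com")
print(response)
```

Each function can be swapped for a real component without touching the others, which is the property we want from the full stack.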

Setting up a Virtual Environment for Python

In this book we use Python 2.7, which may or may not be the version you normally use. For this reason, we'll be using a virtual environment. To set up venv, install the virtualenv package.

With pip:

pip install virtualenv

With easy_install:

easy_install virtualenv

Then, to set up your virtual environment:

virtualenv -p `which python2.7` venv --distribute
source venv/bin/activate

Now you can pip install packages and they will build under the venv/ directory. To exit your virtual environment:

deactivate
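
If you are unsure whether the environment is active, a quick check from inside Python can help. The sketch below relies on two interpreter attributes: older virtualenv releases set sys.real_prefix, while the standard library's venv module and newer virtualenv releases make sys.base_prefix differ from sys.prefix. The function name is our own.

```python
import sys

def in_virtual_environment():
    """Return True when the interpreter appears to run inside a virtual environment."""
    # Older virtualenv releases set sys.real_prefix.
    if hasattr(sys, "real_prefix"):
        return True
    # venv (and newer virtualenv) point sys.prefix at the environment,
    # leaving sys.base_prefix at the original installation.
    return getattr(sys, "base_prefix", sys.prefix) != sys.prefix

print("python %d.%d, virtual environment: %s" % (
    sys.version_info[0], sys.version_info[1], in_virtual_environment()))
```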

Serializing Data with Avro

Figure 3.2. Serializing Events

Events

In our stack, we use a serialization system called Avro. Avro allows us to access our data in a common format in many languages.

Avro for Python

Installation

Installing Avro for Python can be tricky. When installing on Mac OS X, be aware of https://issues.apache.org/jira/browse/AVRO-981. You must first build and install the snappy compression library, available at http://code.google.com/p/snappy/. Using a package manager to do so is recommended. Then install python-snappy via easy_install, pip or from source at https://github.com/andrix/python-snappy. With python-snappy installed, Avro for Python should install without problems.

To install the Python Avro client from source:

[bash]$ git clone https://github.com/apache/avro.git
[bash]$ cd avro/lang/py
[bash]$ python setup.py install
        

To install using pip or easy_install:

pip install avro
easy_install avro

Testing

Try writing and reading a few records with a simple schema to verify that our installation works:

[bash]$ python

Example 3.1. Writing avros in python, test_avro.py

from avro import schema, datafile, io
import pprint
OUTFILE_NAME = '/tmp/messages.avro'
SCHEMA_STR = """{
    "type": "record",
    "name": "Message",
    "fields" : [
      {"name": "message_id", "type": "int"},
      {"name": "topic", "type": "string"},
      {"name": "user_id", "type": "int"}
    ]
}"""
SCHEMA = schema.parse(SCHEMA_STR)
# Create a 'record' (datum) writer
rec_writer = io.DatumWriter(SCHEMA)

# Create a 'data file' (avro file) writer
df_writer = datafile.DataFileWriter(
  open(OUTFILE_NAME, 'wb'),
  rec_writer,
  writers_schema = SCHEMA
)

df_writer.append( {"message_id": 11, "topic": "Hello galaxy", "user_id": 1} )
df_writer.append( {"message_id": 12, "topic": "Jim is silly!", "user_id": 1} )
df_writer.append( {"message_id": 23, "topic": "I like apples.", "user_id": 2} )
df_writer.close()
            

Verify that the messages are present:

[bash]$ ls -lah /tmp/messages.avro
-rw-r--r--  1 rjurney  wheel   263B Jan 23 17:30 /tmp/messages.avro

Now verify that we can read records back:

Example 3.2. Reading avros in Python

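from avro import schema, datafile, io
import pprint

# Same file we wrote in Example 3.1, defined again so this example runs on its own
OUTFILE_NAME = '/tmp/messages.avro'

# Create a 'record' (datum) reader
rec_reader = io.DatumReader()

# Create a 'data file' (avro file) reader; Avro files are binary, so open with 'rb'
df_reader = datafile.DataFileReader(
  open(OUTFILE_NAME, 'rb'),
  rec_reader
)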

# Read all records stored inside
pp = pprint.PrettyPrinter()
for record in df_reader:
  pp.pprint(record)
            

The output should look like this:

{u'message_id': 11, u'topic': u'Hello galaxy', u'user_id': 1}
{u'message_id': 12, u'topic': u'Jim is silly!', u'user_id': 1}
{u'message_id': 23, u'topic': u'I like apples.', u'user_id': 2}
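
It can help to see how a record relates to its schema without the Avro library in the way. The sketch below parses the Message schema as plain JSON and checks a record's field names and types against it. It is an illustration only: the conforms function is hypothetical, it handles just the int and string types used here, and it is not how Avro itself validates data.

```python
import json

SCHEMA_STR = """{
    "type": "record",
    "name": "Message",
    "fields": [
      {"name": "message_id", "type": "int"},
      {"name": "topic", "type": "string"},
      {"name": "user_id", "type": "int"}
    ]
}"""

# Map the two Avro primitive types used above to Python types.
# type(u"") is unicode on Python 2 and str on Python 3.
PYTHON_TYPES = {"int": (int,), "string": (str, type(u""))}

def conforms(record, schema):
    """Check that a record has exactly the schema's fields, with matching types."""
    fields = schema["fields"]
    if set(record) != set(field["name"] for field in fields):
        return False
    return all(isinstance(record[field["name"]], PYTHON_TYPES[field["type"]])
               for field in fields)

schema = json.loads(SCHEMA_STR)
good = {"message_id": 11, "topic": "Hello galaxy", "user_id": 1}
bad = {"message_id": "11", "topic": "Hello galaxy"}
print(conforms(good, schema))
print(conforms(bad, schema))
```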
          

Collecting Data

Figure 3.3. Collecting Data via IMAP

Collecting Data

We'll be collecting your own email via IMAP and storing it to disk with Avro. Our Avro email schema starts simply, with a unique ID and a raw dump of the email:

Example 3.3. Initial, Raw Avro Schema for Email, src/avro/raw_email.schema

{
    "type":"record",
    "name":"RawEmail",
    "fields":
    [
        {
            "name":"thread_id",
            "type":["string", "null"],
            "doc":""
        },
        {
            "name":"raw_email",
            "type": ["string", "null"],
            "doc":""
        }
    ]
}

Once we extract properties and entities from it through data processing, our schema might have entries for each and every property of an email.

For simplicity, we've extracted a reasonable schema up front. In practice, this takes time and multiple iterations and should be done as you go along.
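
As a rough idea of what extracting properties from a raw email involves, the sketch below uses Python's standard email package. The sample message, the extract_properties helper and the choice of output fields are all assumptions made for illustration; they are not the schema used later in the book.

```python
import email
from email.utils import getaddresses

# A made-up raw message standing in for the raw_email field
RAW_EMAIL = "\n".join([
    "From: Alice Example <alice@example.com>",
    "To: bob@example.com, Carol <carol@example.com>",
    "Subject: Grass seed",
    "Message-ID: <1234@example.com>",
    "",
    "Put grass on the lawn...",
])

def extract_properties(raw_email):
    """Turn a raw email string into a dict of structured properties."""
    message = email.message_from_string(raw_email)

    def addresses(header):
        # A header may hold several addresses; keep only the address part
        pairs = getaddresses(message.get_all(header, []))
        return [address.lower() for _, address in pairs]

    return {
        "message_id": message.get("Message-ID"),
        "from": addresses("From"),
        "to": addresses("To"),
        "subject": message.get("Subject"),
        "body": message.get_payload(),
    }

props = extract_properties(RAW_EMAIL)
print(props["from"], props["to"], props["subject"])
```

Each key that proves useful would become a field in a richer Avro schema.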

We'll use a simple utility to encapsulate the complexity of this operation. If an error or a slow internet connection prevents you from downloading your entire inbox, that is ok. You only need a few megabytes of data to work the examples - although more data makes the examples richer and more rewarding.

Example 3.4. Scraping IMAP with Python imaplib

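Python's imaplib makes connecting to Gmail easy:

import imaplib

def init_imap(username, password, folder):
  # Connect to Gmail's IMAP server over SSL and log in
  imap = imaplib.IMAP4_SSL('imap.gmail.com', 993)
  imap.login(username, password)
  # Select the folder to read; count holds the number of messages in it
  status, count = imap.select(folder)
  return imap, count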
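
Once a folder is selected, messages can be pulled down with imaplib's search and fetch calls. The sketch below is one way to do that, not the helper script's actual code: fetch_raw_emails is a hypothetical name, and FakeImap is a stand-in that mimics the shape of imaplib's return values so the example runs without a network connection.

```python
def fetch_raw_emails(imap, limit=10):
    """Yield (message_number, raw_bytes) for up to `limit` messages in the selected folder."""
    status, data = imap.search(None, 'ALL')
    if status != 'OK':
        return
    # data[0] is a space-separated list of message numbers
    for number in data[0].split()[:limit]:
        status, parts = imap.fetch(number, '(RFC822)')
        if status != 'OK':
            continue
        for part in parts:
            # The message body arrives as the second item of a tuple
            if isinstance(part, tuple):
                yield number, part[1]

class FakeImap(object):
    """Hypothetical stand-in for imaplib.IMAP4_SSL, for offline testing only."""
    def search(self, charset, criterion):
        return 'OK', [b'1 2']

    def fetch(self, number, parts):
        return 'OK', [(number + b' (RFC822 {5}', b'hello'), b')']

emails = list(fetch_raw_emails(FakeImap()))
print(emails)
```

With a real connection, the object returned by init_imap would be passed in place of FakeImap().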

With this in place, a helper script lets us scrape our own inbox like so:

Usage: gmail.py <username@gmail.com> <password> <output_directory>
[bash]$ ./gmail.py <username@gmail.com> <password> /me/tmp/my_emails.avro
    

Email subjects will print to the screen as they download. This can take a while if you want the entire inbox, so it is best to leave it to download overnight. If your internet connection is too slow, you can use the Data Science Toolkit on Amazon EC2 at http://www.datasciencetoolkit.org/developerdocs#setup to download your inbox faster.

You can stop the download at any time with Control-C and move on.

[jira] [Commented] (PIG-2489) Input Path Globbing{} not working with PigStorageSchema or PigStorage('\t', '-schema');
[jira] [Created] (PIG-2489) Input Path Globbing{} not working with PigStorageSchema or PigStorage('\t', '-schema');
Re: hbase dns lookups
Re: need help in rendering treemap
RE: HBase 0.92.0 is available for download
Prescriptions Ready at Walgreens
Your payment to AT&T MOBILITY has been sent
Prometheus Un Bound commented on your status.
Re: HBase 0.92.0 is available for download
Prescriptions Ready at Walgreens
How Logical Plan Generator works?
Re: server-side SVG-based d3 graph generation, and SVG display on IE8
neil kodner (@neilkod) favorited one of your Tweets!
    

Now that we've got data, we can begin processing it.

Data Processing with Pig

Figure 3.4. Processing Data with Pig

Processing Data

Introduction

 

Perl is the duct tape of the Internet.

 
 --Hassan Schroeder, Sun's first webmaster

Pig is the duct tape of big data. We use it to define data-flows to allow us to pipe data between best-of-breed tools and languages in a structured, coherent way. Because Pig is a client-side technology, you can run it on local data, against a Hadoop cluster, or via Amazon's Elastic MapReduce (EMR). This enables us to work locally, and at scale with the same tools.

Installing Pig

To install Pig on your local machine, follow the 'Getting Started' directions at http://pig.apache.org/docs/r0.10.0/start.html.

Download the latest stable build of Pig from http://www.apache.org/dyn/closer.cgi/pig. At the time of writing, the latest version is Pig 0.10.0.

cd /me
wget http://apache.osuosl.org/pig/pig-0.10.0/pig-0.10.0.tar.gz
tar -xvzf pig-0.10.0.tar.gz
echo 'export PATH=$PATH:/me/pig-0.10.0/bin' >> ~/.bash_profile # or whatever your shell uses
source ~/.bash_profile

Example 3.5. Processing data with Pig

Now test pig out on the emails from your inbox we stored as avros. Run pig in local mode (instead of Hadoop mode) via -x local and put log files in /tmp via -l /tmp to keep from cluttering our workspace.

[bash]$ pig -l /tmp -x local

Our Pig script flows our data through filters to clean it, then projects, groups and counts it.

REGISTER /me/pig/contrib/piggybank/java/piggybank.jar

REGISTER /me/pig/build/ivy/lib/Pig/avro-1.5.3.jar
REGISTER /me/pig/build/ivy/lib/Pig/json-simple-1.1.jar

/* This gives us a shortcut to call our Avro storage function */
DEFINE AvroStorage org.apache.pig.piggybank.storage.avro.AvroStorage();
rmf '/tmp/sent_counts.txt'

-- Load our emails using Pig's AvroStorage User Defined Function (UDF)
messages = LOAD '/me/tmp/my_emails.avro' USING AvroStorage();

-- Filter out missing from/to addresses to limit our processed data to valid records
messages = FILTER messages BY from IS NOT NULL AND to IS NOT NULL;

-- Project out all unique combinations of from/to in this message, then lowercase the emails
-- Note: Bug here if dupes with different case in one email.  Do in a foreach/generate.
smaller = FOREACH messages GENERATE FLATTEN(from), FLATTEN(to) AS to;
pairs = FOREACH smaller GENERATE LOWER(from) AS from, LOWER(to) AS to;

-- Now group the data by unique pairs of addresses, take a count, and store as text in /tmp
froms = GROUP pairs BY (from, to);
sent_counts = FOREACH froms GENERATE FLATTEN(group) AS (from, to), COUNT(pairs) AS total;
sent_counts = ORDER sent_counts BY total;
STORE sent_counts INTO '/tmp/sent_counts.txt';

Since we stored without specifying a storage function, Pig uses PigStorage. By default, PigStorage produces tab-separated values, so we can simply cat the file or open it in Excel. This is a good example of our data growing as we project it, and then shrinking when we store our final metric: in this case a simple count.

[bash]$ cat /tmp/sent_counts.txt/*

    erictranslates@gmail.com	d3-js@googlegroups.com	1
    info@meetup.com	russell.jurney@gmail.com	1
    jira@apache.org	pig-dev@hadoop.apache.org	1
    desert_rose_170@hotmail.com	user@hbase.apache.org	1
    fnickels@gmail.com	d3-js@googlegroups.com	1
    l.garulli@gmail.com	gremlin-users@googlegroups.com	1
    punk.kish@gmail.com	d3-js@googlegroups.com	1
    lists@ruby-forum.com	user@jruby.codehaus.org	1
    rdm@cfcl.com	ruby-99@meetup.com	1
    sampd@stumbleupon.com	user@pig.apache.org	1
    sampd@stumbleupon.com	user@hive.apache.org	1
    kate.jurney@gmail.com	russell.jurney@gmail.com	2
    bob@novus.com	d3-js@googlegroups.com	2
    dalia.mohsobhy@hotmail.com	user@hbase.apache.org	2
    hugh.lomas@lodestarbpm.com	d3-js@googlegroups.com	2
    update+mkd57whm@facebookmail.com	russell.jurney@gmail.com	2
    notification+mkd57whm@facebookmail.com	138456936208061@groups.facebook.com	3
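
Because the output is plain tab-separated text, it is just as easy to read back in Python. The sketch below parses two sample rows copied from the listing above; the read_sent_counts name and the dict layout are our own choices.

```python
import csv

# Two rows in PigStorage's default format: from <TAB> to <TAB> total
SAMPLE = (
    "kate.jurney@gmail.com\trussell.jurney@gmail.com\t2\n"
    "bob@novus.com\td3-js@googlegroups.com\t2\n"
)

def read_sent_counts(lines):
    """Parse tab-separated (from, to, total) rows into a list of dicts."""
    reader = csv.reader(lines, delimiter="\t")
    return [
        {"from": row[0].strip(), "to": row[1].strip(), "total": int(row[2])}
        for row in reader
        if row  # skip blank lines
    ]

records = read_sent_counts(SAMPLE.splitlines())
for record in records:
    print(record)
```

To read the real output, pass open file objects for the part files that Pig writes inside the /tmp/sent_counts.txt directory.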

Figure 3.5. Pig output in Excel

Pig output in Microsoft Excel spreadsheet

You can see how the data flows in the image below. Each line of a Pig Latin script specifies some transformation on the data, and these transformations are executed stepwise as data flows through the script.

Figure 3.6. Data-flow through a Pig Latin Script

Data-flow through a Pig Latin Script
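
For readers more at home in Python than Pig Latin, the same stepwise flow can be mimicked on a handful of in-memory records. This is only an analogy: it assumes each message's from and to fields are lists of address strings, which is a guess at the shape of the data, and it runs on one machine rather than on Hadoop.

```python
from collections import Counter

# Made-up messages; from/to are assumed to be lists of address strings
messages = [
    {"from": ["Kate.Jurney@gmail.com"], "to": ["russell.jurney@gmail.com"]},
    {"from": ["kate.jurney@gmail.com"], "to": ["russell.jurney@gmail.com", "bob@novus.com"]},
    {"from": None, "to": ["russell.jurney@gmail.com"]},
]

# FILTER: keep only records with both from and to present
valid = [m for m in messages if m["from"] is not None and m["to"] is not None]

# FOREACH ... FLATTEN, then LOWER: one lowercased row per (from, to) combination
pairs = [(f.lower(), t.lower()) for m in valid for f in m["from"] for t in m["to"]]

# GROUP BY (from, to) and COUNT
sent_counts = Counter(pairs)

# ORDER BY total
ordered = sorted(sent_counts.items(), key=lambda item: item[1])

for (sender, recipient), total in ordered:
    print("%s\t%s\t%d" % (sender, recipient, total))
```

Note how the data grows at the flatten step and shrinks again at the count, just as in the Pig script.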

Publishing Data with MongoDB

Figure 3.7. Publishing data to MongoDB

Publishing Data

Introduction

To get our data out to a web application, we need to publish it with some kind of database. While many choices are appropriate, we choose MongoDB for its ease of use, document-orientation and its Hadoop and Pig integration.

Installing MongoDB

Note: Excellent instructions for installing MongoDB are available at http://www.mongodb.org/display/DOCS/Quickstart. An excellent tutorial is available here: http://www.mongodb.org/display/DOCS/Tutorial. I recommend working these brief tutorials before moving on.

Download MongoDB for your operating system at http://www.mongodb.org/downloads.

[bash]$ cd /me
[bash]$ wget http://fastdl.mongodb.org/osx/mongodb-osx-x86_64-2.0.2.tgz
[bash]$ tar -xvzf mongodb-osx-x86_64-2.0.2.tgz
[bash]$ sudo mkdir -p /data/db/
[bash]$ sudo chown `id -u` /data/db

Now start the MongoDB server:

[bash]$ cd /me/mongodb-osx-x86_64-2.0.2
[bash]$ bin/mongod &

Now open the mongo shell, and get help:

[bash]$ bin/mongo
> help

Finally, create our collection and insert and query a record:

> use agile_data
> e = {from: 'russell.jurney@gmail.com', to: 'bumper1700@hotmail.com', subject: 'Grass seed', body: 'Put grass on the lawn...'}
> db.email.save(e)
> db.email.find()
{ "_id" : ObjectId("4f21c5f7c6ef8a98a43d921b"), "from" : "russell.jurney@gmail.com", "to" : "bumper1700@hotmail.com", "subject" : "Grass seed", "body" : "Put grass on the lawn..." }

We're cooking with Mongo!

Installing MongoDB's Java Driver

MongoDB's Java driver is available at https://github.com/mongodb/mongo-java-driver/downloads. At the time of writing, the 2.7.3 version is the latest stable build: https://github.com/downloads/mongodb/mongo-java-driver/mongo-2.7.3.jar.

wget https://github.com/downloads/mongodb/mongo-java-driver/mongo-2.7.3.jar
mv mongo-2.7.3.jar /me/mongo-hadoop/

Installing mongo-hadoop

Once we have the Java driver to MongoDB, we're ready to integrate with Hadoop. MongoDB's Hadoop integration is available at https://github.com/mongodb/mongo-hadoop and can be downloaded at https://github.com/mongodb/mongo-hadoop/tarball/master as a tar/gzip file.

cd /me
git clone git@github.com:rjurney/mongo-hadoop.git
cd /me/mongo-hadoop
git checkout ca271348c6dc636355cf18619843a45c579285a4
sbt package

Pushing data to MongoDB from Pig

Pushing data to MongoDB from Pig is easy.

Speculative Execution

We haven't set any indexes in MongoDB, so it is possible for copies of entries to be written. To avoid this, we must turn off speculative execution in our Pig script.

set mapred.map.tasks.speculative.execution false

Hadoop uses a feature called 'speculative execution' to fight the bane of concurrent systems called 'skew.' Skew is when one part of the data, assigned to some part of the system for processing, takes much longer than the rest of the data. Perhaps there are 10,000 entries for most keys in your data, but one has 1,000,000. That key can end up taking much longer to process than the others. To combat this, Hadoop runs a race: multiple mappers or reducers will process the lagging chunk of data. The first one wins!

This is fine when writing to the Hadoop Filesystem. It is less fine when writing to a database without primary keys that will happily accept duplicates. So we turn this feature off in the script below, via 'set mapred.map.tasks.speculative.execution false'.
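
A small simulation shows why duplicate task attempts matter. Plain Python structures stand in for the database here: a list plays the role of a store with no unique key, and a dict keyed on (from, to) plays the role of a store that enforces uniqueness. Nothing below uses the MongoDB or Hadoop APIs.

```python
record = {"from": "kate.jurney@gmail.com", "to": "russell.jurney@gmail.com", "total": 2}

# Two speculative attempts of the same task both emit the same record
attempts = [dict(record), dict(record)]

# A store with no unique key happily accepts both copies
append_only_store = []
for attempt in attempts:
    append_only_store.append(attempt)

# A store keyed on (from, to) overwrites instead, so a repeated write is harmless
keyed_store = {}
for attempt in attempts:
    keyed_store[(attempt["from"], attempt["to"])] = attempt

print(len(append_only_store))
print(len(keyed_store))
```

Turning off speculative execution removes the second attempt; enforcing a unique key would instead make the second write harmless.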

Example 3.6. Pig to MongoDB

REGISTER /me/mongo-hadoop/mongo-2.7.3.jar
REGISTER /me/mongo-hadoop/core/target/mongo-hadoop-core-1.0.0.jar
REGISTER /me/mongo-hadoop/pig/target/mongo-hadoop-pig-1.0.0.jar
REGISTER /me/mongo-hadoop/target/mongo-hadoop-1.0.0.jar

/* This must be set, or we can see duplicate values in MongoDB! */
set mapred.map.tasks.speculative.execution false

sent_counts = LOAD '/tmp/sent_counts.txt' AS (from:chararray, to:chararray, total:int);
STORE sent_counts INTO 'mongodb://localhost/agile_data.sent_counts' USING com.mongodb.hadoop.pig.MongoStorage;

Now query our data in Mongo!

> use agile_data
          
> db.sent_counts.find()
  { "from" : "erictranslates@gmail.com", "to" : "d3-js@googlegroups.com", "total" : 1 }
  { "from" : "info@meetup.com", "to" : "russell.jurney@gmail.com", "total" : 1 }
  { "from" : "jira@apache.org", "to" : "pig-dev@hadoop.apache.org", "total" : 1 }
  { "from" : "desert_rose_170@hotmail.com", "to" : "user@hbase.apache.org", "total" : 1 }
  { "from" : "fnickels@gmail.com", "to" : "d3-js@googlegroups.com", "total" : 1 }
  { "from" : "l.garulli@gmail.com", "to" : "gremlin-users@googlegroups.com", "total" : 1 }
  { "from" : "punk.kish@gmail.com", "to" : "d3-js@googlegroups.com", "total" : 1 }
  { "from" : "lists@ruby-forum.com", "to" : "user@jruby.codehaus.org", "total" : 1 }
  { "from" : "rdm@cfcl.com", "to" : "ruby-99@meetup.com", "total" : 1 }
  { "from" : "sampd@stumbleupon.com", "to" : "user@pig.apache.org", "total" : 1 }
  { "from" : "sampd@stumbleupon.com", "to" : "user@hive.apache.org", "total" : 1 }
  { "from" : "kate.jurney@gmail.com", "to" : "russell.jurney@gmail.com", "total" : 2 }
  { "from" : "bob@novus.com", "to" : "d3-js@googlegroups.com", "total" : 2 }
  { "from" : "dalia.mohsobhy@hotmail.com", "to" : "user@hbase.apache.org", "total" : 2 }
  { "from" : "hugh.lomas@lodestarbpm.com", "to" : "d3-js@googlegroups.com", "total" : 2 }
  { "from" : "update+mkd57whm@facebookmail.com", "to" : "russell.jurney@gmail.com", "total" : 2 }
  { "from" : "notification+mkd57whm@facebookmail.com", "to" : "138456936208061@groups.facebook.com", "total" : 3 }

> db.sent_counts.find({from: 'kate.jurney@gmail.com', to: 'russell.jurney@gmail.com'})
  { "from" : "kate.jurney@gmail.com", "to" : "russell.jurney@gmail.com", "total" : 2 }

Congratulations, you've published 'agile data!'


Searching Data with ElasticSearch

ElasticSearch is emerging as 'Hadoop for search,' in that it provides a robust, easy-to-use search solution that lowers the barrier to entry for anyone wanting to search their data, large or small. ElasticSearch has a simple RESTful JSON interface, so we can use it from the command line or from any language. We'll be using ElasticSearch to search our data, to make it easy to find the records we'll be working so hard to create.

Installation

An excellent tutorial on ElasticSearch is available at http://www.elasticsearchtutorial.com/elasticsearch-in-5-minutes.html.

ElasticSearch is available for download at http://www.elasticsearch.org/download/2011/12/19/0.18.6.html.

wget https://github.com/downloads/elasticsearch/elasticsearch/elasticsearch-0.18.6.tar.gz
tar -xvzf elasticsearch-0.18.6.tar.gz
cd elasticsearch-0.18.6
mkdir plugins
bin/elasticsearch -f

That's it! Our local search engine is up and running.

ElasticSearch and Pig with Wonderdog

Infochimps' Wonderdog provides integration between Hadoop, Pig and ElasticSearch. With Wonderdog, we can load and store data from Pig to and from our search engine. This is extremely powerful, because it lets us plug a search engine into the end of our data pipelines.

Installing Wonderdog

You can download Wonderdog here: https://github.com/infochimps/wonderdog/downloads.

git clone https://github.com/infochimps/wonderdog.git
cd wonderdog
mvn install

Wonderdog and Pig

To use Wonderdog with Pig, register the required jars, define its storage function, and then load and store our data:

register '/me/wonderdog/target/*.jar';
register '/me/elasticsearch-0.18.6/lib/*.jar';

define ElasticSearch com.infochimps.elasticsearch.pig.ElasticSearchStorage();

sent_counts = LOAD '/tmp/sent_counts.txt' AS (from:chararray, to:chararray, total:int);
STORE sent_counts INTO 'es://sent_counts/sent_counts?json=false&size=1000' USING 
  ElasticSearch('/me/elasticsearch-0.18.6/config/elasticsearch.yml', '/me/elasticsearch-0.18.6/plugins');

Searching our Data

Now, searching our data is easy, using curl:

curl -XGET 'http://localhost:9200/sent_counts/sent_counts/_search?q=from:russell.jurney@gmail.com&pretty=true'

"hits" : {
  "total" : 89,
  "max_score" : 0.77511376,
  "hits" : [ {
    "_index" : "sent_counts",
    "_type" : "sent_counts",
    "_id" : "vahClJhFQm2BS_6fKcV9oQ",
    "_score" : 0.77511376, "_source" : {"to":"dev@pig.apache.org","total":"13","from":"russell.jurney@gmail.com"}
  }, {
    "_index" : "sent_counts",
    "_type" : "sent_counts",
    "_id" : "vGNA6EmHR_-OZdgR398ujA",
    "_score" : 0.77511376, "_source" : {"to":"iefinkel@gmail.com","total":"1","from":"russell.jurney@gmail.com"}
  },
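
Since the interface is plain HTTP and JSON, the same request is easy to assemble and unpack in Python. The sketch below builds the query URL used in the curl command and pulls the stored documents out of a response body. The search_url and sources helpers are hypothetical, and the sample response is a trimmed copy of the output above rather than a live result.

```python
import json

try:
    from urllib.parse import urlencode  # Python 3
except ImportError:
    from urllib import urlencode        # Python 2

def search_url(index, doc_type, query, host="http://localhost:9200"):
    """Build a URI search request like the curl example."""
    params = urlencode([("q", query), ("pretty", "true")])
    return "%s/%s/%s/_search?%s" % (host, index, doc_type, params)

def sources(response_text):
    """Pull the stored documents out of a search response body."""
    response = json.loads(response_text)
    return [hit["_source"] for hit in response["hits"]["hits"]]

# Trimmed sample of a response, for illustration only
SAMPLE_RESPONSE = """{"hits": {"total": 1, "hits": [
  {"_index": "sent_counts", "_type": "sent_counts",
   "_source": {"to": "dev@pig.apache.org", "total": "13", "from": "russell.jurney@gmail.com"}}
]}}"""

url = search_url("sent_counts", "sent_counts", "from:russell.jurney@gmail.com")
print(url)
print(sources(SAMPLE_RESPONSE))
```

The colon and at sign in the query are percent-encoded by urlencode, which the server decodes back to the original query string.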

Clients for ElasticSearch for many languages are available at http://www.elasticsearch.org/guide/appendix/clients.html.

Python and ElasticSearch

For Python, pyelasticsearch is a good choice. To make it work, we'll first need to install the Python Requests library.

With pip:

[bash]$ pip install requests
[bash]$ pip install pyelasticsearch

With easy_install:

[bash]$ easy_install requests
[bash]$ easy_install pyelasticsearch

Using pyelasticsearch is easy.

import pyelasticsearch
conn = pyelasticsearch.ElasticSearch('http://localhost:9200/')
result = conn.search("russell.jurney@gmail.com AND kate.jurney@gmail.com", indexes=["sent_counts"])
top_hit = result['hits']['hits'][0]
print(top_hit['_source']['from'] + " " + top_hit['_source']['to'] + " " + top_hit['_source']['total'])

russell.jurney@gmail.com kate.jurney@gmail.com 12

Reflecting on our Workflow

Compared to querying MySQL or MongoDB directly, that might seem hard. Notice, however, that our stack has been optimized for time-consuming and thoughtful data processing, with occasional publishing. Also, this way we won't hit a wall when our real-time queries stop scaling as they become increasingly complex.

Once our application is plumbed efficiently, the team can work together efficiently, but not before. The stack is the foundation of our agility.

Lightweight Web Applications

The next step is turning our published data into an interactive application. We'll use lightweight web frameworks to do that.

Figure 3.8. To the Web with Python and Flask!

Data Applications

We choose lightweight web frameworks because they are simple and fast to work with. Unlike CRUD applications, mined data is the star of the show here. We use read-only databases and simple application frameworks because that fits with the applications we build and how we offer value.

Given the examples in Python/Flask, you can easily implement a solution in Sinatra, Rails, Django, Node.js or your favorite language and web framework.

Python and Flask

Flask

 

Flask is a fast, simple and lightweight WSGI micro web-framework for Python.

 
 --Adapted from the Bottle documentation

Excellent instructions for using Flask are available at http://flask.pocoo.org/.

[bash]$ pip install Flask
[bash]$ python hello.py

Flask Echo

#!/usr/bin/env python
# Flask:
from flask import Flask
app = Flask(__name__)

@app.route("/<input>")
def hello(input):
    return input

if __name__ == "__main__":
    app.run(debug=True)

[bash]$ curl http://localhost:5000/hello%20world!

hello world!
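
Flask sits on top of WSGI, the standard Python interface between web servers and applications. To show how little is going on underneath, here is the same echo behavior written as a bare WSGI callable with no framework at all. This is a sketch for comparison, not how Flask is implemented; the names are our own, and it is only meant for ASCII paths like the one above.

```python
def echo_app(environ, start_response):
    """Echo the request path back to the client."""
    # WSGI servers hand us the already-decoded path in PATH_INFO
    text = environ.get("PATH_INFO", "/").lstrip("/")
    body = text.encode("utf-8")
    start_response("200 OK", [
        ("Content-Type", "text/plain; charset=utf-8"),
        ("Content-Length", str(len(body))),
    ])
    return [body]

# Call the application directly, the way a WSGI server would
captured = {}

def fake_start_response(status, headers):
    captured["status"] = status
    captured["headers"] = dict(headers)

result = echo_app({"PATH_INFO": "/hello world!"}, fake_start_response)
print(captured["status"], result)

# To serve it for real (not run here):
#   from wsgiref.simple_server import make_server
#   make_server("localhost", 5000, echo_app).serve_forever()
```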

Displaying sent_counts in Flask

Tutorial

Find out more about MongoDB's Python driver at http://www.mongodb.org/display/DOCS/Python+Language+Center. Detailed installation instructions are available at http://api.mongodb.org/python/current/installation.html.

Install pymongo with pip:

pip install pymongo

with easy_install:

easy_install pymongo

from source:

[bash]$ git clone git://github.com/mongodb/mongo-python-driver.git pymongo
[bash]$ cd pymongo/
[bash]$ python setup.py install

Now, use pymongo to display the sent_counts we stored in Mongo using Pig and MongoStorage:

from pymongo import Connection
import json
from flask import Flask

app = Flask(__name__)
connection = Connection()
db = connection.agile_data

@app.route("/<input>")
def echo(input):
    return input

@app.route("/sent_counts/<ego1>/<ego2>")
def sent_counts(ego1, ego2):
    # The documents we stored from Pig have 'from', 'to' and 'total' fields
    sent_count = db['sent_counts'].find_one({'from': ego1, 'to': ego2})
    plain = {'from': sent_count['from'], 'to': sent_count['to'], 'total': sent_count['total']}
    return json.dumps(plain)

if __name__ == "__main__":
    app.run(debug=True)

Now visit http://localhost:5000/sent_counts/k@123.org/common-user@hadoop.apache.org (or comparable, for your data) and you will see something like:

{"from": "k@123.org", "to": "common-user@hadoop.apache.org", "total": 8}

And we're done!

Figure 3.9. Undecorated Data on the Web

{"from": "k@123.org", "to": "common-user@hadoop.apache.org", "total": 8}
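
One rough edge is worth smoothing early: pymongo's find_one returns None when nothing matches, so a request for an unknown pair would raise an error inside the view. The sketch below separates the lookup into a plain function that returns a body and a status code, which is a shape a Flask view can return directly. It assumes the documents carry the from, to and total fields shown in the MongoDB query output earlier; sent_counts_response and FakeCollection are hypothetical names, and the fake collection exists only so the example runs without a database.

```python
import json

def sent_counts_response(collection, ego1, ego2):
    """Return (json_body, status_code) for a from/to pair."""
    sent_count = collection.find_one({'from': ego1, 'to': ego2})
    if sent_count is None:
        # No such pair: report it instead of failing on a None lookup
        return json.dumps({'error': 'no such pair'}), 404
    plain = {'from': sent_count['from'], 'to': sent_count['to'],
             'total': sent_count['total']}
    return json.dumps(plain, sort_keys=True), 200

class FakeCollection(object):
    """Hypothetical stand-in for a pymongo collection, for offline testing."""
    def __init__(self, documents):
        self.documents = documents

    def find_one(self, query):
        for document in self.documents:
            if all(document.get(key) == value for key, value in query.items()):
                return document
        return None

collection = FakeCollection([
    {'_id': 1, 'from': 'k@123.org', 'to': 'common-user@hadoop.apache.org', 'total': 8},
])
print(sent_counts_response(collection, 'k@123.org', 'common-user@hadoop.apache.org'))
print(sent_counts_response(collection, 'nobody@example.com', 'k@123.org'))
```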

Conclusion

Congratulations! You've published data on the web. Now, let's make it presentable...

Presenting our Data

Figure 3.10. Presenting our data with Bootstrap and D3.js

Presentation in the user's browser

Introduction

Design and presentation impact the value of your work. In fact, one way to think of Agile Data is 'data design.' The output of our data models matches our views, and in that sense design and data-processing are not distinct. Instead, they are part of the same collaborative activity: data design. With that in mind, it is best that we start out with a solid, clean design for our data and work from there.

 

Bootstrap is Twitter's toolkit for kickstarting CSS for websites, apps, and more. It includes base CSS styles for typography, forms, buttons, tables, grids, navigation, alerts, and more. To get started -- checkout http://twitter.github.com/bootstrap!

 
 --Bootstrap Project

Installing Bootstrap

Bootstrap is available at https://github.com/twitter/bootstrap.git. To install, place it with the static files of your application.

git clone https://github.com/twitter/bootstrap.git
cd bootstrap
git checkout 2.0-wip

To invoke bootstrap, simply reference it as CSS from within your HTML page:

<link href="/bootstrap/docs/assets/css/bootstrap.css" rel="stylesheet">

Booting Bootstrap

It only takes a little editing of an example to arrive at a home page for our project.

Figure 3.11. Bootstrap 2.0 Hero Example

Agile Data Homepage, with the text: I feel legitimate already (because this web page has pretty CSS)! Isn't it amazing what a good presentation can do? We'll be using Bootstrap components to rapidly spin up interfaces to wrap our data. This enables us to do Data Design!

Let's try wrapping a previous example in a table, styled with Bootstrap.

Tables!

That's right: tables for tabular data! Bootstrap lets us use them without shame. Now we'll update our controller to stash our data, and create a simple template to print a table.

In index.py:

from flask import Flask, render_template
...
@app.route("/sent_counts/<ego1>/<ego2>")
def sent_counts(ego1, ego2):
    # The documents we stored from Pig have 'from', 'to' and 'total' fields
    sent_count = db['sent_counts'].find_one({'from': ego1, 'to': ego2})
    data = {}
    data['keys'] = '_id', 'from', 'to', 'total'
    data['values'] = sent_count['_id'], sent_count['from'], sent_count['to'], sent_count['total']
    return render_template('table.html', data=data)

And in our template, table.html:

<head>
...
<link href="/static/bootstrap/docs/assets/css/bootstrap.css" rel="stylesheet">
</head>
<body>
<div class="container" style="margin-top: 100px;">
  <table class="table table-striped table-bordered table-condensed">
    <thead>
      <tr>
      {% for key in data['keys'] %}
        <th>{{ key }}</th>
      {% endfor %}
      </tr>
    </thead>
    <tbody>
      <tr>
      {% for value in data['values'] %}
        <td>{{ value }}</td>
      {% endfor %}
      </tr>
    </tbody>
  </table>
</div>
</body>

The result is human readable data with very little trouble!
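
To see what the template is doing for us, the same table can be produced with plain string formatting. The sketch below is a stand-in written for Python 3 and is not Jinja2: render_table is a hypothetical helper, and the explicit escape call mirrors the escaping that Flask applies automatically to values in .html templates.

```python
from html import escape

def render_table(keys, values):
    """Render one header row and one data row as a Bootstrap-styled table."""
    head = "".join("<th>%s</th>" % escape(str(key)) for key in keys)
    row = "".join("<td>%s</td>" % escape(str(value)) for value in values)
    return (
        '<table class="table table-striped table-bordered table-condensed">'
        "<thead><tr>%s</tr></thead>"
        "<tbody><tr>%s</tr></tbody>"
        "</table>"
    ) % (head, row)

table_html = render_table(
    ("from", "to", "total"),
    ("k@123.org", "<common-user@hadoop.apache.org>", 8),
)
print(table_html)
```

Escaping matters here because email headers are user-supplied text and can contain characters such as angle brackets.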

Figure 3.12. Simple data in a Bootstrap styled table

Application logic moves between batch processing all the way to the browser in Agile Data, as needed.

Note

In practice, we may use client-side templating languages like Mustache. For clarity's sake, we use Jinja2 templates in this book.

Visualizing Data with D3.js

D3.js enables data-driven documents.

 

Introducing D3.js

d3 is not a traditional visualization framework. Rather than provide a monolithic system with all the features anyone may ever need, d3 solves only the crux of the problem: efficient manipulation of documents based on data. This gives d3 extraordinary flexibility, exposing the full capabilities of underlying technologies such as CSS3, HTML5 and SVG.

 
 --Mike Bostock, mbostock.github.com/d3/

We'll be using D3.js to create charts in our application. You can install d3 in your web application via:

git clone https://github.com/mbostock/d3.git

We'll be making charts with d3 in Chapter 7. For now, take a look at the examples directory to see what is possible with d3.

Summary

We've created a very simple app with a single, very simple feature. This is a great starting point, but so what?

What's important about the application isn't what it does but that it is a pipeline in which every stage is easy to modify. It is a pipeline that will scale without our worrying about optimization at each step, where optimization becomes a matter of resource cost rather than the cost of re-engineering.

As we'll see in the next chapter, because we've created an arbitrarily scalable pipeline where every stage is easily modifiable, it is possible to return to agility. We won't run into a wall as soon as we need to switch from a relational database to something else that 'scales better', and we aren't confined by the limitations of tools designed for other tasks, like online transaction processing.

We now have total freedom to use best of breed tools within this framework to solve hard problems and produce value. We can choose any language, any framework, any library and glue it together to get things built.

Figure 3.13. Online Transaction Processing (OLTP) and NoSQL OnLine Analytic Processing (OLAP)

Online Transaction Processing (OLTP) and NoSQL OnLine Analytic Processing (OLAP)

Site last updated on: July 26, 2012 at 05:06:45 AM PDT

View 1 comment

  1. gnikolaropoulos – Posted Feb. 28, 2013

    "As NoSQL tools like Hadoop, data science and big data have developed, much focus has been on the plumbing of analytics applications." I think there is something wrong with the syntax here. Are "data science" and "big data" tools?

    Edited on February 28, 2013, 11:11 p.m. PST


  1. paulbunkham – Posted March 14, 2013

    Typo 'week seek' should be 'we seek' I think.


  1. terryjbates – Posted Oct. 20, 2012

    Since I was running this in a Python interpreter, 'OUTFILE_NAME' was already defined via the code I pasted in from Example 3.1. Otherwise, it will need to be redefined. Or, to keep these separate, maybe define 'INFILE_NAME' and set it to "/tmp/messages.avro".


  1. terryjbates – Posted Oct. 20, 2012

    This portion is a bit unclear. Not sure if the unique ID is to be generated manually via the code we will write to import the IMAP data, or if it is to be gotten from the emails themselves. Closest thing I see that could act as the "thread_id" might be the X-Gmail-Received: header.


  1. terryjbates – Posted Oct. 20, 2012

    I would definitely never suggest using a command-line script that prompts for an email password. If someone makes the mistake of using the history command, the password will be exposed.

  2. terryjbates – Posted Oct. 21, 2012

    Greetings,

    Too curious to wait. Wrote my own version of gmail.py script:

    https://github.com/terryjbates/agile-data/blob/master/chapter-3/gmail.py

  3. terryjbates – Posted Oct. 25, 2012

    Had to modify gmail.py in such a way that it would not choke on things that required .decode('utf8') run on them. Also noticed that there were some messages that lacked 'To' and 'From' in my IMAP download. Adjusted the code accordingly to accommodate and not die.
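The two fixes described in this comment (tolerant decoding and missing 'To'/'From' headers) can be sketched with the standard library alone. This is a minimal illustration, not the code from the linked gmail.py: the helper names `to_text` and `extract_fields` are hypothetical, the output field names are assumptions, and the sketch is written for Python 3 rather than the Python 2 used elsewhere in this chapter.

```python
import email
from email.utils import getaddresses


def to_text(raw, encoding="utf-8"):
    """Decode bytes to text, replacing undecodable bytes instead of raising."""
    if isinstance(raw, bytes):
        return raw.decode(encoding, errors="replace")
    return raw


def extract_fields(raw_message):
    """Return a flat dict for one message (hypothetical field names).

    Missing headers become empty values rather than raising KeyError.
    """
    msg = email.message_from_string(to_text(raw_message))
    return {
        "from": msg.get("From", ""),
        # get_all returns the default when the header is absent
        "to": [addr for _, addr in getaddresses(msg.get_all("To", []))],
        "subject": msg.get("Subject", ""),
    }
```

Calling `extract_fields` on a message with no 'To' or 'From' header yields an empty string and an empty list for those fields, so a downstream writer can decide whether to skip the record or store it as-is.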


  1. terryjbates – Posted Oct. 27, 2012

    I do not see the immediate relevancy of the DST to this problem yet.

  2. Mark Birbeck – Posted Nov. 4, 2012

    @terryjbates: I think the DST is intended to run on an EC2 instance, so they probably mean that the download should be run in the cloud, without the need for your local machine to be connected.


  1. terryjbates – Posted Oct. 20, 2012

    Is "/me" to be a user-specified directory?


  1. terryjbates – Posted Oct. 21, 2012

    The Pig install directions may need greater fleshing out. Otherwise, the subsequent REGISTER statements will fail in the .pig script:

    cd <pig install dir>; ant; cd <pig install dir>/contrib/piggybank; ant

    https://cwiki.apache.org/confluence/display/PIG/PiggyBank

    Edited on October 21, 2012, 9:33 p.m. PDT


  1. terryjbates – Posted Oct. 21, 2012

    The line:

    messages = LOAD '/me/tmp/my_emails.avro' USING AvroStorage();

    Probably should be something like:

    messages = LOAD '/tmp/myemails.avro' USING AvroStorage() AS (thread_id:int, date:chararray, from:chararray, to:chararray, subject:chararray);

    ...or whatever the schema was made to be. I kept getting errors since pig had no idea what 'from' meant in subsequent portions of the .pig script.

    Edited on October 21, 2012, 9:33 p.m. PDT


  1. terryjbates – Posted Nov. 22, 2012

    "Tab-Seperated-Values" should be "Tab-Separated-Values."


  1. terryjbates – Posted Oct. 24, 2012

    Think this should read:

    cat /tmp/sent_counts.txt/*

    If you cat the .avro file, you might be in for a world of hurt, since it is binary.

    Edited on October 24, 2012, 12:56 p.m. PDT
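The same point applies when reading the results from Python: a Pig STORE target is a directory of part files, not a single file. The following is a minimal sketch, assuming the output was written as tab-delimited text (the PigStorage default) and using a hypothetical helper name `read_pig_output`; it is written for Python 3.

```python
import glob
import os


def read_pig_output(output_dir):
    """Yield rows (lists of fields) from every part file in a Pig output directory.

    Assumes tab-delimited text output; marker files such as _SUCCESS are
    skipped because only names starting with 'part-' are matched.
    """
    for path in sorted(glob.glob(os.path.join(output_dir, "part-*"))):
        with open(path, encoding="utf-8") as handle:
            for line in handle:
                line = line.rstrip("\n")
                if line:
                    yield line.split("\t")
```

For the path in the comment above, `list(read_pig_output("/tmp/sent_counts.txt"))` would collect every row across all part files.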


  1. terryjbates – Posted Oct. 24, 2012

    Would suggest moving this section immediately below "Installing mongo-hadoop." Otherwise, readers may wonder if this directory was/is already existing or needed to be created manually.

    Edited on October 24, 2012, 12:03 p.m. PDT


  1. terryjbates – Posted Oct. 23, 2012

    Would suggest the following:

    cd <pig install dir>
    git clone git@github.com:rjurney/mongo-hadoop.git
    cd <pig install dir>/mongo-hadoop/
    git checkout ca271348c6dc636355cf18619843a45c579285a4
    cd <pig install dir>/mongo-hadoop
    ./sbt package

    Otherwise, this set of directions might only work if "." was part of the user's PATH variable (which would not be a good idea).

  2. terryjbates – Posted Oct. 24, 2012

    Also, I am definitely not conversant with Java at all. Curious as to what exactly the "./sbt package" is actually doing; I see a flurry of stuff flying past my screen. JMX? Scala? Compilation?


  1. Karmi – Posted July 6, 2012

    Or rather, there is an excellent tutorial available in the elasticsearch README directly: https://github.com/elasticsearch/elasticsearch#getting-started ...


  1. Karmi – Posted July 6, 2012

    It makes more sense to link to the general elasticsearch download page, http://www.elasticsearch.org/download/, which lists the latest version.


  1. terryjbates – Posted Oct. 26, 2012

    I had to make some changes to elasticsearch.yml so that this Pig script would not tank:

    network.host: 127.0.0.1
    discovery.zen.ping.multicast.enabled: false
    discovery.zen.ping.unicast.hosts: ["localhost"]


  1. terryjbates – Posted Oct. 26, 2012

    This code is inaccurate. The pip and easy_install directives point to 'esclient' but the actual Python snippet seems to be using "pyelasticsearch". Confusing. Here is the code I have used:

    import esclient
    
    def print_dict(dict):
        for key, value in dict.iteritems():
            print key, value
            print "#######"
            print
    
    def main():
        es_object = esclient.ESClient("http://localhost:9200/")
    
        query_string_args = {
            'q':'nobody@gmail.com AND amazon'
    
        }
        result = es_object.search(query_string_args=query_string_args, indexes=['sent_counts'])
        #print_dict(result)
        print result['hits']['total']
    
    if __name__ == '__main__':
        main()
    

    Edited on October 26, 2012, 9:36 p.m. PDT


  1. terryjbates – Posted Oct. 26, 2012

    This code needs some tweaking:

    plain = {'from': sent_count['from'], 'to': sent_count['to'], 'total': sent_count['total']}
    

    Otherwise, you will be looking for keys called literally 'ego1' and 'ego2' which is not desired.

    Another thing I have noticed is that in my gmail import script, I did not remove the Sender/Receiver First and Last names. So my routes look like so:

    http://localhost:5000/sent_counts/%22Greg.%20Foo%22%20%3Cg.foo@mail.hgen.pitt.edu%3E/%22terry%20bates%22%20%3Cterryjbates@nowhere.com%3E
    

    Given that we are reading in 'ego1' and 'ego2', it should be simple enough to allow this to work as is, or figure out how to be flexible enough to find records and omit the First and Last name. (or so I think)
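One way to get the flexibility suggested above is to reduce each path segment to a bare address before looking up the record. This is a minimal sketch using only the standard library; `bare_address` is a hypothetical helper, not part of the chapter's Flask app, and it is written for Python 3. It assumes the stored records are keyed by bare address, which is not confirmed by the original text.

```python
from email.utils import parseaddr
from urllib.parse import unquote


def bare_address(url_segment):
    """Turn a URL-encoded 'Name <addr>' path segment into a bare address.

    A segment that is already a bare address is returned unchanged.
    """
    # parseaddr splits '"Greg. Foo" <g.foo@example.com>' into (name, address)
    _, address = parseaddr(unquote(url_segment))
    return address
```

Inside a Flask view the route parameters normally arrive already percent-decoded, so the `unquote` call mainly matters when the helper is used on raw URLs outside the framework.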


  1. terryjbates – Posted Oct. 26, 2012

    Noticed that the flask instructions seem to indicate that one must place things like bootstrap under the flask application's directory. So extra commands:

    mkdir flask
    cd flask
    *edit python flask app under flask directory*
    mkdir static
    cd static
    git clone https://github.com/twitter/bootstrap.git
    cd bootstrap
    git checkout 2.0.0-wip
    

    Bizarrely enough, when I did git checkout and hit tab a few times, it displayed a list of branches to choose from.

    Edited on October 27, 2012, 7:03 p.m. PDT


  1. terryjbates – Posted Oct. 27, 2012

    Umm. What HTML page?


  1. Karmi – Posted July 6, 2012

    To "install" the d3.js library in your application, the only thing you have to do is to insert a <script> tag pointing to https://github.com/mbostock/d3/blob/master/d3.v2.min.js -- cloning the repository makes sense only when you want to develop d3.js or check the source code.

    (Most of the examples are available online at https://github.com/mbostock/d3/wiki/Gallery)


  1. Mark Birbeck – Posted Nov. 4, 2012

    Have you considered using VMs such as VirtualBox? When coupled with Vagrant and Chef it makes for a nice 'standard' development environment which would be the same for everyone, regardless of whether they use a Mac or Linux. Also, the environment would be much the same as that used on the deployed system.
