Chapter 3. Agile Tools
Scalability = Simplicity
As NoSQL tools like Hadoop have developed alongside data science and big data, much of the focus has been on the plumbing of analytics applications. In this book, we teach you to build applications that use such infrastructure. We take this plumbing for granted and build applications that depend on it. As such, we devote only two chapters to infrastructure: one introducing our development tools, and one on scaling them up in the cloud.
In choosing our tools we seek scalability, but above all we seek simplicity. While the concurrent systems required to drive a modern analytics application at any kind of scale are complex, we still need to be able to focus on the task at hand: creating value for the user. When tools are too complex, we start to focus on the problem the tools are supposed to solve, and not on mining data and building applications. An efficient stack enables collaboration by team members who are not experts at distributed systems.
The stack we've chosen for this book is not definitive. It has been selected as an example of the kind of end-to-end setup you should aim for in order to rapidly and effectively build analytics applications. The takeaway should be an example stack you can use to jumpstart your application, and a standard to which you should hold other stacks.
Agile Data Processing
The first step to building analytics applications is to plumb your application from end to end: from collecting raw data to displaying something on the users’ screen. This is important, because models can get complex fast, and you need user feedback plugged into the equation from the start, lest you start iterating without feedback (also known as the death spiral).
The components of our stack are thus:
Events are the things logs represent. An event is an occurrence that happens and is logged along with its features and timestamps.
Events come in many forms: logs from servers, sensors or financial transactions, as well as actions our users take in our own application. In order to facilitate data exchange among different tools and languages, events are serialized in a common, agreed-upon format.
Collectors are event aggregators. They collect events from numerous sources and log them in aggregate to bulk storage, or queue them for action by sub-realtime workers.
We'll be using either Kafka or Flume as our collectors, but we're not sure which one just yet.
Bulk Storage is a filesystem capable of parallel access by many concurrent processes. We'll be using S3 in place of the Hadoop Filesystem for this purpose.
A Distributed NoSQL Store is a multi-node key/value or document store. In Agile data we use these to publish data for consumption by web applications and other services. We'll be using MongoDB as our NoSQL store. Note, the store in this role needn't be NoSQL. MySQL makes a fine key/value store.
A minimalist web Application Server enables us to plumb our data as JSON through to the client for visualization, with minimal overhead.
A modern browser or mobile application enables us to present our data as an interactive experience for our users, who provide data through interaction and events describing those actions. In this book we focus on web applications.
This list may look daunting, but in practice these tools are easy to set up and match the crunch points in data science, as we'll see. This setup scales easily, and is optimized for analytic processing.
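To make the notion of an event concrete, here is a minimal sketch of one being serialized into a common format. It uses JSON from the standard library purely for illustration; the stack in this chapter uses Avro for this role. The field names (event_type, timestamp, features) and the helper functions are assumptions made for this example.

```python
import json
import time


def make_event(event_type, features, timestamp=None):
    """Bundle an occurrence with its features and a timestamp (hypothetical layout)."""
    return {
        "event_type": event_type,
        "timestamp": timestamp if timestamp is not None else time.time(),
        "features": features,
    }


def serialize_event(event):
    """Encode the event as a JSON string so that any language can read it."""
    return json.dumps(event, sort_keys=True)


def deserialize_event(payload):
    """Decode a JSON string back into a dictionary."""
    return json.loads(payload)


# A user action in our own application, logged with a fixed timestamp
event = make_event("page_view", {"user_id": 1, "path": "/inbox"}, timestamp=1327360200)
payload = serialize_event(event)
print(payload)
```

Any tool that can parse the agreed-upon format can consume the event, which is the property the rest of the stack relies on.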
Setting up a Virtual Environment for Python
In this book we use Python 2.7, which may or may not be the version you normally use. For this reason, we'll be using a virtual environment. To set up venv, install the virtualenv package.
With pip:
pip install virtualenv
With easy_install:
easy_install virtualenv
Then, to set up your virtual environment:
virtualenv -p `which python2.7` venv --distribute
source venv/bin/activate
Now you can pip install packages and they will build under the venv/ directory. To exit your virtual environment:
deactivate
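If you are unsure whether the environment is active, a short check from inside Python can help. The sketch below is an illustration rather than part of the book's tooling; the function name in_virtual_environment is made up for this example.

```python
import sys


def in_virtual_environment():
    """Return True when the interpreter is running inside a virtual environment."""
    # Older virtualenv releases set sys.real_prefix. The built-in venv module
    # (Python 3.3+) instead makes sys.base_prefix differ from sys.prefix.
    if hasattr(sys, "real_prefix"):
        return True
    return getattr(sys, "base_prefix", sys.prefix) != sys.prefix


print("Python %d.%d, virtual environment: %s"
      % (sys.version_info[0], sys.version_info[1], in_virtual_environment()))
```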
Serializing Data with Avro
In our stack, we use a serialization system called Avro. Avro allows us to access our data in a common format in many languages.
Avro for Python
Installation
Installing Avro for Python can be tricky. When installing on Mac OS X, be aware of https://issues.apache.org/jira/browse/AVRO-981. You must first build and install the snappy compression library, available at http://code.google.com/p/snappy/. Using a package manager to do so is recommended. Then install python-snappy via easy_install, pip or from source at https://github.com/andrix/python-snappy. With python-snappy installed, Avro for Python should install without problems.
To install the Python Avro client from source:
[bash]$ git clone https://github.com/apache/avro.git
[bash]$ cd avro/lang/py
[bash]$ python setup.py install
To install using pip or easy_install:
pip install avro
easy_install avro
Testing
Try writing and reading a few records with a simple schema to verify that our installation works:
[bash]$ python
Example 3.1. Writing avros in python, test_avro.py
from avro import schema, datafile, io
import pprint
OUTFILE_NAME = '/tmp/messages.avro'
SCHEMA_STR = """{
"type": "record",
"name": "Message",
"fields" : [
{"name": "message_id", "type": "int"},
{"name": "topic", "type": "string"},
{"name": "user_id", "type": "int"}
]
}"""
SCHEMA = schema.parse(SCHEMA_STR)
# Create a 'record' (datum) writer
rec_writer = io.DatumWriter(SCHEMA)
# Create a 'data file' (avro file) writer
df_writer = datafile.DataFileWriter(
    open(OUTFILE_NAME, 'wb'),
    rec_writer,
    writers_schema = SCHEMA
)
df_writer.append( {"message_id": 11, "topic": "Hello galaxy", "user_id": 1} )
df_writer.append( {"message_id": 12, "topic": "Jim is silly!", "user_id": 1} )
df_writer.append( {"message_id": 23, "topic": "I like apples.", "user_id": 2} )
df_writer.close()
Verify that the messages are present:
[bash]$ ls -lah /tmp/messages.avro
-rw-r--r-- 1 rjurney wheel 263B Jan 23 17:30 /tmp/messages.avro
Now verify that we can read records back:
Example 3.2. Reading avros in Python
from avro import schema, datafile, io
import pprint

# The file written in Example 3.1
OUTFILE_NAME = '/tmp/messages.avro'

# Test reading avros
rec_reader = io.DatumReader()

# Create a 'data file' (avro file) reader; avro files are binary, so open with 'rb'
df_reader = datafile.DataFileReader(
    open(OUTFILE_NAME, 'rb'),
    rec_reader
)

# Read all records stored inside
pp = pprint.PrettyPrinter()
for record in df_reader:
    pp.pprint(record)

df_reader.close()
The output should look like this:
{u'message_id': 11, u'topic': u'Hello galaxy', u'user_id': 1}
{u'message_id': 12, u'topic': u'Jim is silly!', u'user_id': 1}
{u'message_id': 23, u'topic': u'I like apples.', u'user_id': 2}
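Before appending records, it can be handy to check them against the schema by hand. The following is a rough sketch using only the standard library. It is not how the avro package validates data: the helper names and the small type table are assumptions for illustration, covering only the int, string and null types and unions of them.

```python
import json

SCHEMA_STR = """{
    "type": "record",
    "name": "Message",
    "fields": [
        {"name": "message_id", "type": "int"},
        {"name": "topic", "type": "string"},
        {"name": "user_id", "type": "int"}
    ]
}"""


def _is_int(value):
    # bool is a subclass of int in Python, so exclude it explicitly
    return isinstance(value, int) and not isinstance(value, bool)


def _is_string(value):
    try:
        return isinstance(value, basestring)  # Python 2
    except NameError:
        return isinstance(value, str)  # Python 3


# Hypothetical table mapping a few Avro type names to Python checks
CHECKS = {"int": _is_int, "string": _is_string, "null": lambda v: v is None}


def matches_type(value, avro_type):
    """Check a value against a primitive type name or a union (list) of them."""
    if isinstance(avro_type, list):
        return any(matches_type(value, t) for t in avro_type)
    return CHECKS[avro_type](value)


def record_problems(record, schema):
    """Return a list of problems; an empty list means the record fits the schema."""
    problems = []
    names = [field["name"] for field in schema["fields"]]
    for field in schema["fields"]:
        if field["name"] not in record:
            problems.append("missing field: " + field["name"])
        elif not matches_type(record[field["name"]], field["type"]):
            problems.append("wrong type for field: " + field["name"])
    for key in record:
        if key not in names:
            problems.append("unexpected field: " + key)
    return problems


schema = json.loads(SCHEMA_STR)
good = record_problems({"message_id": 11, "topic": "Hello galaxy", "user_id": 1}, schema)
bad = record_problems({"message_id": "11", "topic": "Hello galaxy"}, schema)
print(good)
print(bad)
```

A check like this catches mismatched records early, before the writer raises an error partway through a file.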
Collecting Data
We'll be collecting your own email via IMAP, and storing it to disk with Avro. Our avro email schema starts simply, with a unique ID and a raw dump of the email:
Example 3.3. Initial, Raw Avro Schema for Email, src/avro/raw_email.schema
{
"type":"record",
"name":"RawEmail",
"fields":
[
{
"name":"thread_id",
"type":["string", "null"],
"doc":""
},
{
"name":"raw_email",
"type": ["string", "null"],
"doc":""
}
]
}
Once we extract properties and entities from it through data processing, our schema might have entries for each and every property of an email.
For simplicity, we've extracted a reasonable schema up front. In practice, this takes time and multiple iterations and should be done as you go along.
We'll use a simple utility to encapsulate the complexity of this operation. If an error or a slow internet connection prevents you from downloading your entire inbox, that is ok. You only need a few megabytes of data to work the examples - although more data makes the examples richer and more rewarding.
Example 3.4. Scraping IMAP with Python imaplib
Python's imaplib makes connecting to gmail easy:
import imaplib

def init_imap(username, password, folder):
    # Connect to Gmail over SSL, log in and select the folder to read
    imap = imaplib.IMAP4_SSL('imap.gmail.com', 993)
    imap.login(username, password)
    status, count = imap.select(folder)
    return imap, count
With this in place, and with a helper script, we can scrape our own inbox like so:
Usage: gmail.py <username@gmail.com> <password> <output_directory>
[bash]$ ./gmail.py <username@gmail.com> <password> /me/tmp/my_emails.avro
Email subjects will print to the screen as they download. This can take a while if you want the entire inbox, so it is best to leave it to download overnight. If your internet connection is too slow, you can use the Data Science Toolkit on Amazon EC2 at http://www.datasciencetoolkit.org/developerdocs#setup to download your inbox faster.
You can stop the download at any time with Ctrl-C to move on.
[jira] [Commented] (PIG-2489) Input Path Globbing{} not working with PigStorageSchema or PigStorage('\t', '-schema');
[jira] [Created] (PIG-2489) Input Path Globbing{} not working with PigStorageSchema or PigStorage('\t', '-schema');
Re: hbase dns lookups
Re: need help in rendering treemap
RE: HBase 0.92.0 is available for download
Prescriptions Ready at Walgreens
Your payment to AT&T MOBILITY has been sent
Prometheus Un Bound commented on your status.
Re: HBase 0.92.0 is available for download
Prescriptions Ready at Walgreens
How Logical Plan Generator works?
Re: server-side SVG-based d3 graph generation, and SVG display on IE8
neil kodner (@neilkod) favorited one of your Tweets!
Now that we've got data, we can begin processing it.
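For reference, the per-message step of a download script might look like the sketch below. The contents of gmail.py are not shown in this chapter, so this is an assumption about its shape, not its actual code. In particular, using the Message-ID header as the thread_id is a stand-in chosen for illustration, and both function names are hypothetical.

```python
import email


def to_raw_email_record(raw_email):
    """Build a dict matching the RawEmail schema from one raw message."""
    message = email.message_from_string(raw_email)
    # Assumption: Message-ID stands in for a thread identifier. The schema
    # allows null, so a missing header simply becomes None.
    thread_id = message.get("Message-ID")
    return {"thread_id": thread_id, "raw_email": raw_email}


def subject_of(raw_email):
    """Extract the subject line, as printed while the download runs."""
    return email.message_from_string(raw_email).get("Subject")


# A made-up message used only to exercise the helpers
SAMPLE = (
    "Message-ID: <abc123@example.com>\n"
    "From: alice@example.com\n"
    "To: bob@example.com\n"
    "Subject: Re: hbase dns lookups\n"
    "\n"
    "Body text.\n"
)

record = to_raw_email_record(SAMPLE)
print(subject_of(SAMPLE))
print(record["thread_id"])
```

Each record produced this way has exactly the two fields of the raw schema, so it can be appended with a DataFileWriter as in Example 3.1.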
Data Processing with Pig
Introduction
Perl is the duct tape of the Internet.
--Hassan Schroeder, Sun's first webmaster
Pig is the duct tape of big data. We use it to define data-flows to allow us to pipe data between best-of-breed tools and languages in a structured, coherent way. Because Pig is a client-side technology, you can run it on local data, against a Hadoop cluster, or via Amazon's Elastic MapReduce (EMR). This enables us to work locally, and at scale with the same tools.
Installing Pig
To install Pig on your local machine, follow the 'Getting Started' directions at http://pig.apache.org/docs/r0.10.0/start.html.
Download the latest stable build of Pig from http://www.apache.org/dyn/closer.cgi/pig. At the time of writing, the latest version is Pig 0.10.0.
cd /me
wget http://apache.osuosl.org/pig/pig-0.10.0/pig-0.10.0.tar.gz
tar -xvzf pig-0.10.0.tar.gz
echo 'export PATH=$PATH:/me/pig-0.10.0/bin' >> ~/.bash_profile # or whatever your shell uses
source ~/.bash_profile
Example 3.5. Processing data with Pig
Now test pig out on the emails from your inbox we stored as avros. Run pig in local mode (instead of Hadoop mode) via -x local and put log files in /tmp via -l /tmp to keep from cluttering our workspace.
[bash]$ pig -l /tmp -x local
Our Pig script flows our data through filters to clean it, then projects, groups and counts it.
REGISTER /me/pig/contrib/piggybank/java/piggybank.jar
REGISTER /me/pig/build/ivy/lib/Pig/avro-1.5.3.jar
REGISTER /me/pig/build/ivy/lib/Pig/json-simple-1.1.jar

/* This gives us a shortcut to call our Avro storage function */
DEFINE AvroStorage org.apache.pig.piggybank.storage.avro.AvroStorage();

rmf '/tmp/sent_counts.txt'

-- Load our emails using Pig's AvroStorage User Defined Function (UDF)
messages = LOAD '/me/tmp/my_emails.avro' USING AvroStorage();

-- Filter out missing from/to addresses to limit our processed data to valid records
messages = FILTER messages BY from IS NOT NULL AND to IS NOT NULL;

-- Project out all unique combinations of from/to in this message, then lowercase the emails
-- Note: Bug here if dupes with different case in one email. Do in a foreach/generate.
smaller = FOREACH messages GENERATE FLATTEN(from), FLATTEN(to) AS to;
pairs = FOREACH smaller GENERATE LOWER(from) AS from, LOWER(to) AS to;

-- Now group the data by unique pairs of addresses, take a count, and store as text in /tmp
froms = GROUP pairs BY (from, to);
sent_counts = FOREACH froms GENERATE FLATTEN(group) AS (from, to), COUNT(pairs) AS total;
sent_counts = ORDER sent_counts BY total;
STORE sent_counts INTO '/tmp/sent_counts.txt';
Since we stored without specifying a storage function, Pig uses PigStorage. By default, PigStorage produces tab-separated values, so we can simply cat the file or open it in Excel. This is a good example of our data growing as we project it, and then shrinking when we store our final metric: in this case a simple count.
[bash]$ cat /tmp/sent_counts.txt/*
erictranslates@gmail.com d3-js@googlegroups.com 1
info@meetup.com russell.jurney@gmail.com 1
jira@apache.org pig-dev@hadoop.apache.org 1
desert_rose_170@hotmail.com user@hbase.apache.org 1
fnickels@gmail.com d3-js@googlegroups.com 1
l.garulli@gmail.com gremlin-users@googlegroups.com 1
punk.kish@gmail.com d3-js@googlegroups.com 1
lists@ruby-forum.com user@jruby.codehaus.org 1
rdm@cfcl.com ruby-99@meetup.com 1
sampd@stumbleupon.com user@pig.apache.org 1
sampd@stumbleupon.com user@hive.apache.org 1
kate.jurney@gmail.com russell.jurney@gmail.com 2
bob@novus.com d3-js@googlegroups.com 2
dalia.mohsobhy@hotmail.com user@hbase.apache.org 2
hugh.lomas@lodestarbpm.com d3-js@googlegroups.com 2
update+mkd57whm@facebookmail.com russell.jurney@gmail.com 2
notification+mkd57whm@facebookmail.com 138456936208061@groups.facebook.com 3
You can see how the data flows in the image below. Each line of a Pig Latin script specifies some transformation on the data, and these transformations are executed stepwise as data flows through the script.
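To see the same stepwise flow outside of Pig, the script's stages can be mirrored in a few lines of plain Python. This is a sketch for intuition only and runs on a single machine. It assumes each message is a dictionary whose from and to values are lists of address strings (which is what the FLATTEN calls suggest), and the sample messages are made up.

```python
from collections import Counter


def sent_counts(messages):
    """Count messages per lowercase (from, to) pair, smallest count first."""
    # FILTER: keep only messages with both from and to addresses.
    # Unlike IS NOT NULL, this also drops empty address lists.
    valid = [m for m in messages if m.get("from") and m.get("to")]
    # FOREACH ... FLATTEN, then LOWER: one pair per sender/recipient combination
    pairs = [
        (sender.lower(), recipient.lower())
        for m in valid
        for sender in m["from"]
        for recipient in m["to"]
    ]
    # GROUP ... COUNT
    counts = Counter(pairs)
    # ORDER BY total, with ties broken by address so the output is stable
    return sorted(
        ((sender, recipient, total) for (sender, recipient), total in counts.items()),
        key=lambda row: (row[2], row[0], row[1]),
    )


messages = [
    {"from": ["Kate.Jurney@gmail.com"], "to": ["russell.jurney@gmail.com"]},
    {"from": ["kate.jurney@gmail.com"], "to": ["russell.jurney@gmail.com"]},
    {"from": ["bob@novus.com"], "to": ["d3-js@googlegroups.com", "russell.jurney@gmail.com"]},
    {"from": None, "to": ["nobody@example.com"]},
]

rows = sent_counts(messages)
for row in rows:
    print("%s\t%s\t%d" % row)
```

The intermediate list of pairs is larger than the input, and the final counts are smaller, which is the grow-then-shrink pattern described above.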
Publishing Data with MongoDB
Introduction
To get our data out to a web application, we need to publish it with some kind of database. While many choices are appropriate, we choose MongoDB for its ease of use, document-orientation and its Hadoop and Pig integration.
Installing MongoDB
Note: Excellent instructions for installing MongoDB are available at http://www.mongodb.org/display/DOCS/Quickstart. An excellent tutorial is available here: http://www.mongodb.org/display/DOCS/Tutorial. I recommend working these brief tutorials before moving on.
Download MongoDB for your operating system at http://www.mongodb.org/downloads.
[bash]$ cd /me
[bash]$ wget http://fastdl.mongodb.org/osx/mongodb-osx-x86_64-2.0.2.tgz
[bash]$ tar -xvzf mongodb-osx-x86_64-2.0.2.tgz
[bash]$ sudo mkdir -p /data/db/
[bash]$ sudo chown `id -u` /data/db
Now start the MongoDB server:
[bash]$ cd /me/mongodb-osx-x86_64-2.0.2
[bash]$ bin/mongod &
Now open the mongo shell, and get help:
[bash]$ bin/mongo
> help
Finally, create our collection and insert and query a record:
> use agile_data
> e = {from: 'russell.jurney@gmail.com', to: 'bumper1700@hotmail.com', subject: 'Grass seed', body: 'Put grass on the lawn...'}
> db.email.save(e)
> db.email.find()
{ "_id" : ObjectId("4f21c5f7c6ef8a98a43d921b"), "from" : "russell.jurney@gmail.com", "to" : "bumper1700@hotmail.com", "subject" : "Grass seed", "body" : "Put grass on the lawn..." }
We're cooking with Mongo!
Installing MongoDB's Java Driver
MongoDB's Java driver is available at https://github.com/mongodb/mongo-java-driver/downloads. At the time of writing, the 2.7.3 version is the latest stable build: https://github.com/downloads/mongodb/mongo-java-driver/mongo-2.7.3.jar.
wget https://github.com/downloads/mongodb/mongo-java-driver/mongo-2.7.3.jar
mv mongo-2.7.3.jar /me/mongo-hadoop/
Installing mongo-hadoop
Once we have the Java driver to MongoDB, we're ready to integrate with Hadoop. MongoDB's Hadoop integration is available at https://github.com/mongodb/mongo-hadoop and can be downloaded at https://github.com/mongodb/mongo-hadoop/tarball/master as a tar/gzip file.
cd /me
git clone git@github.com:rjurney/mongo-hadoop.git
cd /me/mongo-hadoop
git checkout ca271348c6dc636355cf18619843a45c579285a4
sbt package
Pushing data to MongoDB from Pig
Pushing data to MongoDB from Pig is easy.
Speculative Execution
We haven't set any indexes in MongoDB, so it is possible for copies of entries to be written. To avoid this, we must turn off speculative execution in our Pig script.
set mapred.map.tasks.speculative.execution false
Hadoop uses a feature called 'speculative execution' to fight the bane of concurrent systems called 'skew.' Skew is when one part of the data, assigned to some part of the system for processing, takes much longer than the rest of the data. Perhaps most keys in your data have 10,000 entries, but one has 1,000,000. That key can end up taking much longer to process than the others. To combat this, Hadoop runs a race: multiple mappers or reducers will process the lagging chunk of data. The first one wins!
This is fine when writing to the Hadoop Filesystem. It is less fine when writing to a database without primary keys that will happily accept duplicates. So we turn this feature off in the script below, via 'set mapred.map.tasks.speculative.execution false'.
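The effect is easy to reproduce in miniature. The sketch below is a simplified simulation, not Hadoop code: it lets two attempts at the same chunk both run to completion and compares a store that appends whatever it receives with one that writes by key. All names and records in it are made up for illustration.

```python
def run_with_speculation(records, attempts, write):
    """Simulate several task attempts processing the same chunk of records."""
    for _ in range(attempts):
        for record in records:
            write(record)


records = [
    {"from": "a@example.com", "to": "b@example.com", "total": 1},
    {"from": "c@example.com", "to": "d@example.com", "total": 2},
]

# A store without keys appends everything it is given, so a second
# attempt at the same chunk produces duplicates.
append_only = []
run_with_speculation(records, attempts=2, write=append_only.append)

# A keyed store overwrites on the same key, so repeated writes are harmless.
keyed = {}


def upsert(record):
    keyed[(record["from"], record["to"])] = record


run_with_speculation(records, attempts=2, write=upsert)

print(len(append_only))  # 4: every record was written twice
print(len(keyed))        # 2: one entry per (from, to) key
```

Turning speculative execution off avoids the duplicate attempts in the first place; keying the writes is the alternative when the store supports it.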
Example 3.6. Pig to MongoDB
REGISTER /me/mongo-hadoop/mongo-2.7.3.jar
REGISTER /me/mongo-hadoop/core/target/mongo-hadoop-core-1.0.0.jar
REGISTER /me/mongo-hadoop/pig/target/mongo-hadoop-pig-1.0.0.jar
REGISTER /me/mongo-hadoop/target/mongo-hadoop-1.0.0.jar
/* This must be set, or we can see duplicate values in MongoDB! */
set mapred.map.tasks.speculative.execution false
sent_counts = LOAD '/me/tmp/sent_counts.txt' AS (from:chararray, to:chararray, total:int);
STORE sent_counts INTO 'mongodb://localhost/agile_data.sent_counts' USING com.mongodb.hadoop.pig.MongoStorage;
Now query our data in Mongo!
> use agile_data
> db.sent_counts.find()
{ "from" : "erictranslates@gmail.com", "to" : "d3-js@googlegroups.com", "total" : 1 }
{ "from" : "info@meetup.com", "to" : "russell.jurney@gmail.com", "total" : 1 }
{ "from" : "jira@apache.org", "to" : "pig-dev@hadoop.apache.org", "total" : 1 }
{ "from" : "desert_rose_170@hotmail.com", "to" : "user@hbase.apache.org", "total" : 1 }
{ "from" : "fnickels@gmail.com", "to" : "d3-js@googlegroups.com", "total" : 1 }
{ "from" : "l.garulli@gmail.com", "to" : "gremlin-users@googlegroups.com", "total" : 1 }
{ "from" : "punk.kish@gmail.com", "to" : "d3-js@googlegroups.com", "total" : 1 }
{ "from" : "lists@ruby-forum.com", "to" : "user@jruby.codehaus.org", "total" : 1 }
{ "from" : "rdm@cfcl.com", "to" : "ruby-99@meetup.com", "total" : 1 }
{ "from" : "sampd@stumbleupon.com", "to" : "user@pig.apache.org", "total" : 1 }
{ "from" : "sampd@stumbleupon.com", "to" : "user@hive.apache.org", "total" : 1 }
{ "from" : "kate.jurney@gmail.com", "to" : "russell.jurney@gmail.com", "total" : 2 }
{ "from" : "bob@novus.com", "to" : "d3-js@googlegroups.com", "total" : 2 }
{ "from" : "dalia.mohsobhy@hotmail.com", "to" : "user@hbase.apache.org", "total" : 2 }
{ "from" : "hugh.lomas@lodestarbpm.com", "to" : "d3-js@googlegroups.com", "total" : 2 }
{ "from" : "update+mkd57whm@facebookmail.com", "to" : "russell.jurney@gmail.com", "total" : 2 }
{ "from" : "notification+mkd57whm@facebookmail.com", "to" : "138456936208061@groups.facebook.com", "total" : 3 }
> db.sent_counts.find({from: 'kate.jurney@gmail.com', to: 'russell.jurney@gmail.com'})
{ "from" : "kate.jurney@gmail.com", "to" : "russell.jurney@gmail.com", "total" : 2 }
Congratulations, you've published 'agile data!'
Searching Data with ElasticSearch
ElasticSearch is emerging as 'Hadoop for search,' in that it provides a robust, easy-to-use search solution that lowers the barrier to entry for anyone wanting to search their data, large or small. ElasticSearch has a simple RESTful JSON interface, so we can use it from the command line or from any language. We'll be using ElasticSearch to search our data, to make it easy to find the records we'll be working so hard to create.
Installation
An excellent tutorial on ElasticSearch is available at http://www.elasticsearchtutorial.com/elasticsearch-in-5-minutes.html.
ElasticSearch is available for download at http://www.elasticsearch.org/download/2011/12/19/0.18.6.html.
wget https://github.com/downloads/elasticsearch/elasticsearch/elasticsearch-0.18.6.tar.gz
tar -xvzf elasticsearch-0.18.6.tar.gz
cd elasticsearch-0.18.6
mkdir plugins
bin/elasticsearch -f
That's it! Our local search engine is up and going.
ElasticSearch and Pig with Wonderdog
Infochimps' Wonderdog provides integration between Hadoop, Pig and ElasticSearch. With Wonderdog, we can load and store data from Pig to and from our search engine. This is extremely powerful, because it lets us plug a search engine into the end of our data pipelines.
Installing Wonderdog
You can download Wonderdog here: https://github.com/infochimps/wonderdog/downloads.
git clone https://github.com/infochimps/wonderdog.git
cd wonderdog
mvn install
Wonderdog and Pig
To use Wonderdog with Pig, load the required jars and define the ElasticSearch storage function, then store our data with it:
register '/me/wonderdog/target/*.jar';
register '/me/elasticsearch-0.18.6/lib/*.jar';
define ElasticSearch com.infochimps.elasticsearch.pig.ElasticSearchStorage();
sent_counts = LOAD '/me/tmp/sent_counts.txt' AS (from:chararray, to:chararray, total:int);
STORE sent_counts INTO 'es://sent_counts/sent_counts?json=false&size=1000' USING
ElasticSearch('/me/elasticsearch-0.18.6/config/elasticsearch.yml', '/me/elasticsearch-0.18.6/plugins');
Searching our Data
Now, searching our data is easy, using curl:
curl -XGET 'http://localhost:9200/sent_counts/sent_counts/_search?q=from:russell.jurney@gmail.com&pretty=true'
"hits" : {
"total" : 89,
"max_score" : 0.77511376,
"hits" : [ {
"_index" : "sent_counts",
"_type" : "sent_counts",
"_id" : "vahClJhFQm2BS_6fKcV9oQ",
"_score" : 0.77511376, "_source" : {"to":"dev@pig.apache.org","total":"13","from":"russell.jurney@gmail.com"}
}, {
"_index" : "sent_counts",
"_type" : "sent_counts",
"_id" : "vGNA6EmHR_-OZdgR398ujA",
"_score" : 0.77511376, "_source" : {"to":"iefinkel@gmail.com","total":"1","from":"russell.jurney@gmail.com"}
},
Clients for ElasticSearch for many languages are available at http://www.elasticsearch.org/guide/appendix/clients.html.
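Because the interface is plain HTTP and JSON, the curl request above can also be assembled, and its response unpacked, with the standard library alone. The sketch below only builds the URL and reads a parsed response, without making a network call. The function names are hypothetical, and the sample response is trimmed from the output shown above.

```python
try:
    from urllib import quote  # Python 2
except ImportError:
    from urllib.parse import quote  # Python 3


def search_url(host, index, doc_type, field, value):
    """Build a URI search URL of the form used with curl above."""
    query = quote("%s:%s" % (field, value), safe=":@")
    return "%s/%s/%s/_search?q=%s&pretty=true" % (host, index, doc_type, query)


def sources(response):
    """Pull the stored documents out of a parsed search response."""
    return [hit["_source"] for hit in response["hits"]["hits"]]


url = search_url("http://localhost:9200", "sent_counts", "sent_counts",
                 "from", "russell.jurney@gmail.com")
print(url)

# A response trimmed from the curl output above, already parsed from JSON
response = {
    "hits": {
        "total": 89,
        "hits": [
            {"_index": "sent_counts", "_type": "sent_counts",
             "_source": {"to": "dev@pig.apache.org", "total": "13",
                         "from": "russell.jurney@gmail.com"}},
        ],
    }
}
for doc in sources(response):
    print(doc["from"] + " " + doc["to"] + " " + doc["total"])
```

Note that total comes back as a string in this index, which is why it can be concatenated directly.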
Python and ElasticSearch
For Python, pyelasticsearch is a good choice. To make it work, we'll first need to install the Python Requests library.
With pip:
[bash]$ pip install requests
[bash]$ pip install pyelasticsearch
With easy_install:
[bash]$ easy_install requests
[bash]$ easy_install pyelasticsearch
Using pyelasticsearch is easy.
import pyelasticsearch
conn = pyelasticsearch.ElasticSearch('http://localhost:9200/')
result = conn.search("russell.jurney@gmail.com AND kate.jurney@gmail.com", indexes=["sent_counts"])
top_hit = result['hits']['hits'][0]
print(top_hit['_source']['from'] + " " + top_hit['_source']['to'] + " " + top_hit['_source']['total'])
russell.jurney@gmail.com kate.jurney@gmail.com 12
Reflecting on our Workflow
Compared to querying MySQL or MongoDB directly, that might seem hard. Notice, however, that our stack has been optimized for time-consuming and thoughtful data processing, with occasional publishing. Also, this way we won't hit a wall when our real-time queries stop scaling as they become increasingly complex.
Once our application is plumbed efficiently, the team can work together efficiently, but not before. The stack is the foundation of our agility.
Lightweight Web Applications
The next step is turning our published data into an interactive application. We'll use lightweight web frameworks to do that.
We choose lightweight web frameworks because they are simple and fast to work with. Unlike CRUD applications, mined data is the star of the show here. We use read-only databases and simple application frameworks because that fits with the applications we build and how we offer value.
Given the examples in Python/Flask, you can easily implement a solution in Sinatra, Rails, Django, Node.js or your favorite language and web framework.
Python and Flask
Flask
Flask is a fast, simple and lightweight WSGI micro web-framework for Python.
--Bottle Documentation
Excellent instructions for using Flask are available at http://flask.pocoo.org/.
[bash]$ pip install Flask
[bash]$ python hello.py
Flask Echo
#!/usr/bin/env python
# Flask:
from flask import Flask
app = Flask(__name__)

# Echo back whatever path segment is requested
@app.route("/<input>")
def hello(input):
    return input

if __name__ == "__main__":
    app.run(debug=True)
[bash]$ curl http://localhost:5000/hello%20world!
hello world!
Displaying sent_counts in Flask
Tutorial
Find out more about MongoDB's Python driver at http://www.mongodb.org/display/DOCS/Python+Language+Center. Detailed installation instructions are available at http://api.mongodb.org/python/current/installation.html.
Install pymongo with pip:
pip install pymongo
with easy_install:
easy_install pymongo
from source:
[bash]$ git clone git://github.com/mongodb/mongo-python-driver.git pymongo
[bash]$ cd pymongo/
[bash]$ python setup.py install
Now, use pymongo to display the sent_counts we stored in Mongo using Pig and MongoStorage:
from pymongo import Connection
import json
from flask import Flask

app = Flask(__name__)
connection = Connection()
db = connection.agile_data

@app.route("/<input>")
def echo(input):
    return input

@app.route("/sent_counts/<ego1>/<ego2>")
def sent_counts(ego1, ego2):
    # The documents stored by Pig have the fields from, to and total
    sent_count = db['sent_counts'].find_one({'from': ego1, 'to': ego2})
    plain = {'from': sent_count['from'], 'to': sent_count['to'], 'total': sent_count['total']}
    return json.dumps(plain)

if __name__ == "__main__":
    app.run(debug=True)
Now visit http://localhost:5000/sent_counts/k@123.org/common-user@hadoop.apache.org (or comparable, for your data) and you will see:
{"from": "k@123.org", "to": "common-user@hadoop.apache.org", "total": 8}
And we're done!
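One detail worth guarding against: pymongo's find_one returns None when no document matches, and the _id field it returns is an ObjectId that json.dumps cannot encode. The helper below is a sketch of one way to handle both cases. The function name is hypothetical, and it assumes the documents carry the from, to and total fields written by the Pig script earlier.

```python
import json


def sent_count_json(document):
    """Serialize a sent_counts document, or return None when nothing matched."""
    if document is None:
        return None
    # Copy only the JSON-friendly fields; _id is deliberately left out
    plain = {key: document[key] for key in ("from", "to", "total")}
    return json.dumps(plain, sort_keys=True)


# A stand-in for a document as returned by find_one; object() plays the ObjectId
found = {"_id": object(), "from": "k@123.org",
         "to": "common-user@hadoop.apache.org", "total": 8}
print(sent_count_json(found))
print(sent_count_json(None))
```

In a route, a None result is the natural place to return a 404 response instead of a body.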
Conclusion
Congratulations! You've published data on the web. Now, let's make it presentable...
Presenting our Data
Introduction
Design and presentation impact the value of your work. In fact, one way to think of Agile Data is 'data design.' The output of our data models matches our views, and in that sense design and data-processing are not distinct. Instead, they are part of the same collaborative activity: data design. With that in mind, it is best that we start out with a solid, clean design for our data and work from there.
Bootstrap is Twitter's toolkit for kickstarting CSS for websites, apps, and more. It includes base CSS styles for typography, forms, buttons, tables, grids, navigation, alerts, and more. To get started -- checkout http://twitter.github.com/bootstrap!
--Bootstrap Project
Installing Bootstrap
Bootstrap is available at https://github.com/twitter/bootstrap.git. To install, place it with the static files of your application.
git clone https://github.com/twitter/bootstrap.git
cd bootstrap
git checkout 2.0-wip
To invoke bootstrap, simply reference it as CSS from within your HTML page:
<link href="/bootstrap/docs/assets/css/bootstrap.css" rel="stylesheet">
Booting Bootstrap
It only takes a little editing of an example to arrive at a home page for our project.
Let's try wrapping a previous example in a table, styled with Bootstrap.
Tables!
That's right: tables for tabular data! Bootstrap lets us use them without shame. Now we'll update our controller to stash our data, and create a simple template to print a table.
In index.py:
from flask import Flask, render_template
...
@app.route("/sent_counts/<ego1>/<ego2>")
def sent_counts(ego1, ego2):
    # The documents stored by Pig have the fields from, to and total
    sent_count = db['sent_counts'].find_one({'from': ego1, 'to': ego2})
    data = {}
    data['keys'] = '_id', 'from', 'to', 'total'
    data['values'] = sent_count['_id'], sent_count['from'], sent_count['to'], sent_count['total']
    return render_template('table.html', data=data)
And in our template, table.html:
<head>
...
<link href="/static/bootstrap/docs/assets/css/bootstrap.css" rel="stylesheet">
</head>
<body>
<div class="container" style="margin-top: 100px;">
<table class="table table-striped table-bordered table-condensed">
<thead>
{% for key in data['keys'] %}
<th>{{ key }}</th>
{% endfor %}
</thead>
<tbody>
<tr>
{% for value in data['values'] %}
<td>{{ value }}</td>
{% endfor %}
</tr>
</tbody>
</table>
</div>
</body>
The result is human readable data with very little trouble!
Note
In practice, we may use client-side templating languages like Mustache. For clarity's sake, we use Jinja2 templates in this book.
Visualizing Data with D3.js
D3.js enables data-driven documents.
Introducing D3.js
d3 is not a traditional visualization framework. Rather than provide a monolithic system with all the features anyone may ever need, d3 solves only the crux of the problem: efficient manipulation of documents based on data. This gives d3 extraordinary flexibility, exposing the full capabilities of underlying technologies such as CSS3, HTML5 and SVG.
--Mike Bostock, mbostock.github.com/d3/
We'll be using D3.js to create charts in our application. You can install d3 in your web application via:
git clone https://github.com/mbostock/d3.git
We'll be making charts with d3 in Chapter 7. For now, take a look at the examples directory to see what is possible with d3.
Summary
We've created a very simple app with a single, very simple feature. This is a great starting point, but so what?
What's important about the application isn't what it does, it's that it's a pipeline where it's easy to modify every stage. A pipeline that will scale without worrying about optimization at each step, where optimization becomes a function of cost in terms of resource efficiency, but not in terms of the cost of re-engineering.
As we'll see in the next chapter, because we've created an arbitrarily scalable pipeline where every stage is easily modifiable, it is possible to return to agility. We won't run into a wall as soon as we need to switch from a relational database to something else that 'scales better', and we aren't confining ourselves to the limitations imposed by tools designed for other tasks, like online transaction processing.
We now have total freedom to use best of breed tools within this framework to solve hard problems and produce value. We can choose any language, any framework, any library and glue it together to get things built.

















