9781449302641

Programming Pig

Alan F Gates


Dedication

To my wife Barbara and our boys Adam and Joel. Their support, encouragement, and sacrificed Saturdays have made this book possible.

Preface
Data Addiction
Who Should Read This Book
Conventions Used in This Book
Using Code Examples
Safari® Books Online
How to Contact Us
Acknowledgements
1. Introduction
What Is Pig?
Pig on Hadoop
Pig Latin, a Parallel Dataflow Language
What Is Pig Useful For?
Pig Philosophy
Pig's History
Code Examples in this Book
2. Installing and Running Pig
Downloading and Installing Pig
Downloading the Pig Package from Apache
Downloading Pig from Cloudera
Downloading Pig Artifacts from Maven
Downloading the Source
Running Pig
Running Pig Locally On Your Machine
Running Pig on Your Hadoop Cluster
Running Pig in the Cloud
Command Line and Configuration Options
Return Codes
3. Grunt
Entering Pig Latin Scripts in Grunt
HDFS Commands in Grunt
Controlling Pig from Grunt
4. Pig's Data Model
Types
Scalar Types
Complex Types
Nulls
Schemas
Casts
5. Introduction to Pig Latin
Preliminary Matters
Relations and Fields
Case Sensitivity
Comments
Input and Output
Load
Store
Dump
Relational Operations
Foreach
Filter
Group
Order by
Distinct
Join
Limit
Sample
Parallel
User Defined Functions
Registering UDFs
Define and UDFs
Calling Static Java Functions
6. Advanced Pig Latin
Advanced Relational Operations
Advanced Features of Foreach
Using Different Join Implementations
Cogroup
Union
Cross
Integrating Pig with Legacy Code and MapReduce
Stream
Mapreduce
Non-Linear Data Flows
Controlling Execution
Set
Setting the Partitioner
Pig Latin Preprocessor
Parameter Substitution
Macros
Including other Pig Latin Scripts
7. Developing and Testing Pig Latin Scripts
Development Tools
Syntax Highlighting and Checking
Describe
Explain
Illustrate
Pig Statistics
MapReduce Job Status
Debugging Tips
Testing Your Scripts with PigUnit
8. Making Pig Fly
Writing Your Scripts to Perform Well
Filter Early and Often
Project Early and Often
Set up your Joins Properly
Use Multiquery When Possible
Choose the Right Data Type
Select the Right Level of Parallelism
Writing Your UDF to Perform
Tune Pig and Hadoop for your Job
Using Compression in Intermediate Results
Data Layout Optimization
Bad Record Handling
9. Embedding Pig Latin in Python
Compile
Bind
Binding Multiple Sets of Variables
Run
Running Multiple Bindings
Utility Methods
10. Writing Evaluation and Filter Functions
Writing an Evaluation Function in Java
Where Your UDF Will Run
Evaluation Function Basics
Input and Output Schemas
Error Handling and Progress Reporting
Constructors and Passing Data from Front End to Back End
Overloading UDFs
Memory Issues in Eval Funcs
Algebraic Interface
Accumulator Interface
Python UDFs
Writing Filter Functions
11. Writing Load and Store Functions
Load Functions
Front End Planning Functions
Passing Information from the Front End to the Back End
Back End Data Reading
Additional Load Function Interfaces
Store Functions
Store Function Front End Planning
Store Functions and UDFContext
Writing Data
Failure Cleanup
Storing Metadata
12. Pig And Other Members Of The Hadoop Community
Pig And Hive
Cascading
NoSQL Databases
HBase
Cassandra
Metadata In Hadoop
A. Built In User Defined Functions and Piggybank
Built In UDFs
Built in Load and Store Functions
Built in Evaluation and Filter Functions
Piggybank
B. Overview of Hadoop
MapReduce
Map Phase
Combiner Phase
Shuffle Phase
Reduce Phase
Output Phase
Distributed Cache
Handling Failure
Hadoop Distributed File System
Site last updated on: August 10, 2011 at 10:50:07 AM PDT