
Chapter 2. Installing and Running Pig

Downloading and Installing Pig

Before you can run Pig on your machine or your Hadoop cluster, you will need to download and install it. If someone else has taken care of this, you can skip ahead to the section called “Running Pig”.

You can download Pig as a complete package or as source code that you build. You can also get it as part of a Hadoop distribution.

Downloading the Pig Package from Apache

This is the official version of Apache Pig. It comes packaged with all of the jars needed to run Pig. It can be downloaded by going to Pig's release page.

Pig does not need to be installed on your Hadoop cluster. It runs from the machine you launch Hadoop jobs from. Though you can run Pig from your laptop or desktop, in practice most cluster owners set up one or more machines that have access to their Hadoop cluster but are not part of the cluster (that is, they are not data nodes or task nodes). This makes it easier for administrators to update Pig and associated tools, and to secure access to the clusters. These machines are called gateway machines or edge machines. In this book I will use the term gateway machine.

It is on these gateway machines that you will need to install Pig. If your Hadoop cluster is accessible from your desktop or laptop then you can install Pig there as well. Also, you can install Pig on your local machine if you plan to use Pig in local mode.

The core of Pig is written in Java and is thus portable across operating systems. The shell script that starts Pig is a bash script, so it requires a Unix environment. Hadoop, which Pig depends on even in local mode, also requires a Unix environment for its file system operations. In practice, most Hadoop clusters run a flavor of Linux. Many Pig developers develop and test Pig on Mac OS X.

Pig requires Java 1.6, and Pig versions 0.5 through 0.9 require Hadoop 0.20. For future versions, check the download page for information on which version(s) of Hadoop they require. The correct version of Hadoop is included with the Pig download, so if you plan to use Pig in local mode or install it on a gateway machine where Hadoop is not currently installed, there is no need to download Hadoop separately.

Once you have downloaded Pig, you can place it anywhere you like on your machine. Pig does not depend on being in a certain location. To install it, place the tarball in the directory of your choosing and type:

tar xzf filename

where filename is the tar file you downloaded.

The only other setup in preparation for running Pig is making sure that the environment variable JAVA_HOME is set to the directory that contains your Java distribution. Pig will fail immediately if this value is not in the environment. You can set this in your shell, specify it on the command line when you invoke Pig, or set it explicitly in your copy of the Pig script pig, located in the bin directory that you just unpacked. You can find the appropriate value for JAVA_HOME by doing which java and stripping the bin/java from the end of the result.
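As a minimal sketch of that last step, the shell function below strips the trailing bin/java from a path. The function name java_home_from and the example path are hypothetical, and the sketch uses the shell builtin command -v in place of which. One caveat, which depends on your system: the java found on your PATH is often a symbolic link (for example /usr/bin/java), in which case the stripped result is not the real Java directory and you will need to resolve the link first.

```shell
# Strip a trailing /bin/java from a path to get a candidate JAVA_HOME.
java_home_from() {
  printf '%s\n' "${1%/bin/java}"
}

# Hypothetical path, for illustration only.
java_home_from /opt/jdk1.6.0/bin/java    # prints /opt/jdk1.6.0

# Use the java on your PATH, if there is one.
if command -v java >/dev/null 2>&1; then
  JAVA_HOME="$(java_home_from "$(command -v java)")"
  export JAVA_HOME
  echo "JAVA_HOME=$JAVA_HOME"
fi
```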

Downloading Pig from Cloudera

In addition to the official Apache version, there are companies that repackage and distribute Hadoop and associated tools. The most popular of these currently is Cloudera. They produce RPMs for Red Hat based systems and packages for use with APT on Debian systems. They also provide tarballs for other systems that cannot use one of these package managers.

The upside of using a distribution like Cloudera's is that all of the tools are packaged and tested together. Also, if you need professional support, it is available. The downside is that you are constrained to move at the speed of your distribution provider. There is a delay between an Apache release of Pig and its availability in various distributions.

For complete instructions on downloading and installing Hadoop and Pig from Cloudera, see Cloudera's download site. Note that you have to download Pig separately. It is not part of the Hadoop package.

Downloading Pig Artifacts from Maven

In addition to the official release available from Pig's Apache site, it is possible to download Pig from Apache's Maven repository. This site includes jar files for Pig, for the source code, for the javadocs, and the pom file that defines Pig's dependencies. Development tools that are Maven aware can use this to pull down Pig's source and javadoc. If you use maven or ant in your build process you can also pull the Pig jar automatically from this repository.

Downloading the Source

When you download Pig from Apache you also get the Pig source code. This enables you to debug your version of Pig or just to peruse the code to see how it works. But if you want to live on the edge and try out a feature or a bug fix before it is available in a release, you can download the source from Apache's Subversion repository. You can also apply patches that have been uploaded to Pig's issue tracking system but not yet checked into the code repository. Information on checking out Pig using svn or cloning the repository via git is available on Pig's version control page.

Running Pig

Pig can be run locally on your machine or on your Hadoop cluster. You can also run Pig as part of Amazon's Elastic MapReduce service.

Running Pig Locally On Your Machine

Running Pig locally on your machine is referred to in Pig parlance as local mode. Local mode is useful for prototyping and debugging your Pig Latin scripts. Some people also use it for small data when they want to apply the same processing to large data, so that their data pipeline is consistent across data of different sizes, but they do not want to waste cluster resources on small files and small jobs.

In versions 0.6 and before, Pig executed scripts in local mode itself. Starting with version 0.7, it uses the Hadoop class LocalJobRunner, which reads from the local file system and executes MapReduce jobs locally. This has the nice property that Pig jobs run locally in the same way as they will on your cluster, and they all run in one process, which makes debugging much easier. The downside is that it is slow. Setting up a local instance of Hadoop has approximately a twenty-second overhead, so even tiny jobs take at least that long.[1]

Example 2.1. Running Pig in Local Mode

Let's run a Pig Latin script in local mode. See the section called “Code Examples in this Book” for how to download the data and Pig Latin for this example. This simple script loads a file NYSE_dividends, groups rows in it together by their stock ticker symbol, and then calculates the average dividend for each symbol.

--average_dividend.pig
-- load data from NYSE_dividends, declaring the schema to have 4 fields
dividends = load 'NYSE_dividends' as (exchange, symbol, date, dividend);
-- group rows together by stock ticker symbol
grouped   = group dividends by symbol;
-- calculate the average dividend per symbol
avg       = foreach grouped generate group, AVG(dividends.dividend);
-- store the results to average_dividend
store avg into 'average_dividend';
				

If you use head -5 to look at the NYSE_dividends you will see:

NYSE    CPO 2009-12-30  0.14
NYSE    CPO 2009-09-28  0.14
NYSE    CPO 2009-06-26  0.14
NYSE    CPO 2009-03-27  0.14
NYSE    CPO 2009-01-06  0.14

This matches the schema we declared in our Pig Latin script. The first field is the exchange this stock is traded on, the second field is the stock ticker symbol, the third is the date the dividend was paid, and the fourth is the amount of the dividend.
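To make the expected result concrete, the calculation the script performs (group by the second field, average the fourth) can be roughly cross-checked outside Pig with awk. This is only a sketch: the function name avg_dividend_by_symbol is hypothetical, it assumes the fields are separated by tabs (the default delimiter for Pig's load), and it prints averages to six significant digits, so the formatting may differ from Pig's output.

```shell
# Average the 4th field (dividend) per value of the 2nd field (symbol).
# Assumes tab-separated input; sorts by symbol because awk's iteration
# order over an array is unspecified.
avg_dividend_by_symbol() {
  awk -F '\t' '
    { sum[$2] += $4; count[$2]++ }
    END { for (s in sum) printf "%s\t%.6g\n", s, sum[s] / count[s] }
  ' "$1" | sort
}
```

Running avg_dividend_by_symbol NYSE_dividends on a small sample and comparing a few symbols against Pig's results is a quick sanity check, not a replacement for the script.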

Remember that to run Pig you will need to set the JAVA_HOME environment variable to the directory that contains your Java distribution.

Change directories to the directory NYSE_dividends is in. You can then run this example on your local machine by doing:

pig_path/bin/pig -x local average_dividend.pig

where pig_path is the path to where you installed Pig on your local machine.

The result should be a lot of output on your screen. Much of this is MapReduce's LocalJobRunner generating logs. But some of it is Pig telling you how it will execute the script, giving you status as it executes, etc. Near the bottom of the output you should see the simple message Success!. This means all went well. The script stores its output to average_dividend, so you might expect to find a file by that name in your local directory. Instead you will find a directory of that name containing a file named part-r-00000. Since Hadoop is a distributed system and usually processes data in parallel, when it outputs data to a “file” it creates a directory with the file's name, and each writer creates a separate part file in that directory. In this case we had one writer, so we have one part file. We can look in that part file for the results by doing:

cat average_dividend/part-r-00000 | head -5

which returns:

CA      0.04
CB      0.35
CE      0.04
CF      0.1
CI      0.04
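When a job has more than one writer, the output directory holds several part files. A minimal sketch for reading them all from the local file system follows. The function name cat_parts is hypothetical, and the sketch assumes only that every part file name begins with part-.

```shell
# Print every part file in a local output directory, in name order.
cat_parts() {
  for part in "$1"/part-*; do
    # Skip the unexpanded pattern when no part files exist.
    if [ -f "$part" ]; then
      cat "$part"
    fi
  done
}
```

For the example above, cat_parts average_dividend | head -5 would show the same five rows.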

Running Pig on Your Hadoop Cluster

Most of the time you will be running Pig on your Hadoop cluster. As was covered in the section called “Downloading and Installing Pig”, Pig runs locally on your machine or your gateway machine. All of the parsing, checking, and planning is done locally. Pig then executes MapReduce jobs in your cluster.

Note

Throughout the book when I say “your gateway machine” I mean the machine you are launching Pig jobs from. Usually this will be one or more machines that have access to your Hadoop cluster. But, depending on your configuration, it could be your local machine as well.

The only thing Pig needs to know to run on your cluster is where your cluster's NameNode and JobTracker are located. The NameNode is the manager of HDFS and the JobTracker coordinates MapReduce jobs. In Hadoop 0.18 and before, these locations are found in your hadoop-site.xml file. In Hadoop 0.20 and later they are in separate files hdfs-site.xml and mapred-site.xml.

If you are already running Hadoop jobs from your gateway machine via MapReduce or another tool then you most likely have these files present. If not, the best course is to copy these files from nodes in your cluster to a location on your gateway machine. This guarantees that you get the proper addresses plus any site specific settings.

If, for whatever reason, it is not possible to copy the appropriate files from your cluster, you can create a hadoop-site.xml file yourself. It will look like:

<configuration>
  <property>
    <name>fs.default.name</name>
    <value>namenode_hostname:port</value>
  </property>

  <property>
    <name>mapred.job.tracker</name>
    <value>jobtrack_hostname:port</value>
  </property>
</configuration>

You will need to find the names and ports for your NameNode and JobTracker from your cluster administrator.

Once you have located, copied, or created these files, you will need to tell Pig the directory they are in by setting the PIG_CLASSPATH environment variable to that directory. Note that this must point to the directory that the XML file is in, not the file itself. Pig will read all XML and properties files in that directory.
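Because pointing PIG_CLASSPATH at the file rather than its directory is an easy mistake, a small guard can help. The following is a sketch under stated assumptions: the function name use_hadoop_conf is hypothetical, and the warning about missing XML files is a convenience of this sketch, not something Pig itself does.

```shell
# Set PIG_CLASSPATH to a configuration directory, refusing plain files.
use_hadoop_conf() {
  if [ ! -d "$1" ]; then
    echo "PIG_CLASSPATH must be a directory, not a file: $1" >&2
    return 1
  fi
  # Warn, but do not fail, when the directory holds no XML files.
  ls "$1"/*.xml >/dev/null 2>&1 || echo "warning: no XML files in $1" >&2
  PIG_CLASSPATH="$1"
  export PIG_CLASSPATH
}
```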

Example 2.2. Running Pig On Your Cluster

Let's run the same script on your cluster that we ran in Example 2.1, “Running Pig in Local Mode”. If you are running on a Hadoop cluster you have never used before, you will most likely need to create a home directory. Pig can do this for you:

PIG_CLASSPATH=hadoop_conf_dir pig_path/bin/pig -e fs -mkdir /user/username

where hadoop_conf_dir is the directory in which your hadoop-site.xml or hdfs-site.xml and mapred-site.xml files are located, pig_path is the path to Pig on your gateway machine, and username is your username on the gateway machine. If you are using Pig 0.5 or before, change fs -mkdir to mkdir.

Remember, you need to set JAVA_HOME before executing any Pig commands. See the section called “Downloading the Pig Package from Apache” for details.

In order to run this example on your cluster you will first need to copy the data to your cluster.

PIG_CLASSPATH=hadoop_conf_dir pig_path/bin/pig -e fs -copyFromLocal NYSE_dividends NYSE_dividends

If you are running on Pig 0.5 or before, change fs -copyFromLocal to copyFromLocal.

Now, you are ready to run the Pig Latin script itself:

PIG_CLASSPATH=hadoop_conf_dir pig_path/bin/pig average_dividend.pig

The first few lines will tell you how Pig is connecting to your cluster. After that it will describe its progress executing your script. It is important that you verify that Pig is connecting to the appropriate filesystem and JobTracker by checking that these values match the values for your cluster. If the filesystem is listed as file:/// or the JobTracker says localhost, then Pig did not connect to your cluster. You will need to check that you entered the values properly in your configuration files and set PIG_CLASSPATH properly to the directory that contains those files.
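If you save Pig's output to a file, the filesystem check can be scripted. The sketch below makes no assumption about the exact wording of Pig's messages; it only searches for the string file:///. The function name ran_against_local_fs and the log file name pig_run.log are hypothetical.

```shell
# Succeed (exit status 0) if the saved output mentions the local file system.
ran_against_local_fs() {
  grep -q 'file:///' "$1"
}

# Typical use, shown as comments because pig_path is a placeholder:
#   pig_path/bin/pig average_dividend.pig 2>&1 | tee pig_run.log
#   if ran_against_local_fs pig_run.log; then
#     echo "Pig did not connect to your cluster" >&2
#   fi
```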

Near the end of the output there should be a line saying Success!. This means that your execution succeeded. You can see the results by doing:

PIG_CLASSPATH=hadoop_conf_dir pig_path/bin/pig -e fs -cat average_dividend

which should give you the same connection information and then a dump of all the stock ticker symbols and their average dividends. If you are using Pig 0.5 or before, change fs -cat to cat.

You may have noticed that in Example 2.1, “Running Pig in Local Mode” I made a point of saying that average_dividend is a directory and thus you have to cat the part file contained in that directory, but in this example I ran cat directly on average_dividend. If you list average_dividend, you will see that it is still a directory in this example. But in HDFS cat can operate on directories. See Chapter 3, Grunt for a discussion of this.
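The per-run steps of this example (copy the data, run the script, view the results) can be collected into one small function. This is a sketch, not a recommended tool: the function name run_dividend_example is hypothetical, it takes the configuration directory and the path to the pig script as arguments, and it stops at the first step that fails. Creating the home directory is left out because it only needs to happen once.

```shell
# Run the three steps of the cluster example, stopping on the first failure.
#   $1: directory holding your Hadoop configuration files
#   $2: path to the pig launch script, for example pig_path/bin/pig
run_dividend_example() {
  conf_dir="$1"
  pig_bin="$2"
  PIG_CLASSPATH="$conf_dir" "$pig_bin" -e fs -copyFromLocal NYSE_dividends NYSE_dividends &&
    PIG_CLASSPATH="$conf_dir" "$pig_bin" average_dividend.pig &&
    PIG_CLASSPATH="$conf_dir" "$pig_bin" -e fs -cat average_dividend
}
```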


Running Pig in the Cloud

Cloud computing[2] along with the software as a service (SaaS) model has taken off in recent years. This has been fortuitous for hardware intensive applications like Hadoop. Setting up and maintaining a Hadoop cluster is an expensive proposition in terms of hardware acquisition, facility costs, and maintenance and administration. Many users find that it is cheaper for them to rent the hardware they need instead.

Whether you or your organization decides to use Hadoop and Pig in the cloud or on owned and operated machines, the instructions for running Pig on your cluster are the same; see the section called “Running Pig on Your Hadoop Cluster”.

Amazon's Elastic MapReduce (EMR) cloud offering is different, however. Rather than allowing customers to rent machines for any type of process (like Amazon's Elastic Compute Cloud (EC2) service and other cloud services), EMR allows users to rent virtual Hadoop clusters. These clusters read data from and write data to Amazon's Simple Storage Service (S3). This means users do not even need to set up their own Hadoop cluster as they would if they used EC2 or a similar service.

EMR users can access their rented Hadoop cluster via their browser, ssh, or a web services API. EMR is at http://aws.amazon.com/elasticmapreduce/. I suggest beginning with the nice tutorial which will introduce you to the service.

Command Line and Configuration Options

Pig has a number of command line options. You can see the full list by doing pig -h. Most of these options will be discussed in the sections that cover the features they control. In this section I will discuss the remaining miscellaneous options.

-e or -execute

Execute a single command in Pig. For example pig -e fs -ls will list your home directory.

-h or -help

List the available command line options.

-h properties

List the properties that Pig will use if they are set by the user.

-P or -propertyFile

Specify a property file that Pig should read.

-version

Print the version of Pig.

Pig also has a number of Java properties it uses. The entire list can be printed out by doing pig -h properties. Specific properties will be covered in sections that cover the features they control.

Hadoop also has a number of Java properties it uses to determine its behavior. For example, you can pass options to the JVM that runs your map and reduce tasks by setting mapred.child.java.opts. In Pig version 0.8 and later these can be passed to Pig, and Pig will pass them on to Hadoop when it invokes Hadoop. In earlier versions the properties had to be in the hadoop-site.xml file so that the Hadoop client itself would pick them up.

Properties can be passed to Pig on the command line using -D in the same format as any Java property. For example, bin/pig -Dexectype=local. When placed on the command line, these property definitions must come before any Pig specific command line options (such as -x local). They can also be specified in the conf/pig.properties file that is part of your Pig distribution. Finally, you can specify a separate properties file by using -P. If properties are specified on both the command line and in a properties file, the command line specification takes precedence.
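The ordering rule is easy to get wrong when arguments are assembled by a script. As an illustration only, the sketch below reorders a list of arguments so that every -D definition comes first and prints the resulting command line without running it. The function name build_pig_command and the -Xmx1024m value are hypothetical, and the sketch does not handle arguments that contain spaces.

```shell
# Print a Pig command line with all -D definitions ahead of other options.
build_pig_command() {
  pig_bin="$1"
  shift
  props=""
  opts=""
  for arg in "$@"; do
    case "$arg" in
      -D*) props="$props $arg" ;;
      *)   opts="$opts $arg" ;;
    esac
  done
  printf '%s%s%s\n' "$pig_bin" "$props" "$opts"
}

build_pig_command bin/pig -x local -Dmapred.child.java.opts=-Xmx1024m average_dividend.pig
# prints: bin/pig -Dmapred.child.java.opts=-Xmx1024m -x local average_dividend.pig
```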

Return Codes

Pig uses return codes to communicate success or failure. Table 2.1, “Pig Return Codes” describes these return codes.

Table 2.1. Pig Return Codes

Value  Meaning                          Comments
0      success
1      retriable failure
2      failure
3      partial failure                  Used with multi-query; see the section called “Non-Linear Data Flows”.
4      illegal arguments passed to Pig
5      IOException thrown               Would usually be thrown by a UDF.
6      PigException thrown              Usually means a Python UDF raised an exception.
7      ParseException thrown            Can happen after parsing if variable substitution is being done.
8      Throwable thrown                 An unexpected exception.
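A wrapper script can turn these codes into readable messages. The following is a minimal sketch based only on the table above; the function name describe_pig_rc is hypothetical.

```shell
# Translate a Pig return code into the meaning listed in Table 2.1.
describe_pig_rc() {
  case "$1" in
    0) echo "success" ;;
    1) echo "retriable failure" ;;
    2) echo "failure" ;;
    3) echo "partial failure" ;;
    4) echo "illegal arguments passed to Pig" ;;
    5) echo "IOException thrown" ;;
    6) echo "PigException thrown" ;;
    7) echo "ParseException thrown" ;;
    8) echo "Throwable thrown" ;;
    *) echo "unknown return code" ;;
  esac
}

# Typical use, shown as comments because pig_path is a placeholder:
#   pig_path/bin/pig -x local average_dividend.pig
#   rc=$?
#   echo "Pig exited with $rc: $(describe_pig_rc "$rc")"
describe_pig_rc 3    # prints: partial failure
```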



[1] Another reason for switching to MapReduce for local mode was that, as Pig added features that took advantage of more advanced features of MapReduce, it became difficult or impossible to replicate those features in local mode. Thus local mode and MapReduce mode were diverging in their feature set.

[2] Being the current flavor of the month, the term cloud computing is being used to describe just about anything that takes more than one computer and is not located on a person's desktop. In this chapter I use cloud computing to mean the ability to rent a cluster of computers and place software of your choosing on those computers.

Site last updated on: August 10, 2011 at 10:50:07 AM PDT