Chapter 18. Common tasks
WHAT YOU WILL LEARN IN THIS CHAPTER
Working with CSV Data
Reading and writing XML
Parsing and manipulating dates
Using the built-in debugger
Profiling your program
By this time you have a good idea of what Perl programming is about and you should have a solid grasp of the fundamentals. If you’ve followed along carefully and worked through the exercises, you could possibly even qualify for some entry-level developer positions. However, developers constantly get strange tasks thrown at them and it’s important to be able to handle them. These last two chapters cover some of those tasks. This chapter handles common tasks that you will likely need to perform and the next chapter touches on some advanced topics that, once mastered, will take you to the next level of Perl.
By now you should know that the CPAN is the first place to look to see if someone else has already handled your task. We’ll show you a few of the more popular CPAN modules for handling various tricky data tasks and also how to analyze your programs when things go awry.
CSV data
One common file format is CSV. CSV stands for comma-separated values and a quick example should make it easy to understand what a CSV file is:
Name,Age,Occupation
John Public,28,Waiter
Curtis Poe,44,Software Engineer
Leïla Contraire,36,Political Advisor
Basically, for CSV data, records are separated by newlines and fields are separated by commas. If a field contains a newline, comma or double quotes, it’s generally enclosed in double quotes. However, there is no formal specification that all CSV producers follow and this can make things a bit more difficult. As the sample data later in this section shows, double quotes within double quotes are often escaped with more double quotes, but some files will escape double quotes with backslashes. Sometimes people will use single quotes instead of double quotes, or they’ll enclose everything in quotes that is not a number.
All of these factors can make parsing CSV data a bit of a challenge. Here’s a common (and broken) method of CSV parsing:
use strict;
use warnings;

my $file = 'example_18_1_jobs.csv';
open my $fh, '<', $file or die "Cannot open $file for reading: $!";
while ( my $line = <$fh> ) {
    chomp($line);
    my @fields = split /,/, $line;
    s/^"|"$//g foreach @fields;
    printf "Name: %20s Age: %3d Occupation: %10s\n", @fields;
}
Here’s a typical bit of CSV data (note the embedded newline in Alice Baker’s occupation):
Name,Age,Occupation
John Public,28,Bum
"Curtis ""Ovid"" Poe",44,Software Engineer
"Contraire, Leïla",36,Political Advisor
Alice Baker,44,"CEO,
MegaCorp"
Note
example_18_1_jobs.csv available for download at Wrox.com.
Running that with our sample program generates complete garbage:
Name: John Public Age: 28 Occupation: Bum
Name: Curtis ""Ovid"" Poe Age: 44 Occupation: Software Engineer
Argument "Leïla" isn't numeric in printf at parse_csv.pl line 13, <$fh> line 4.
Name: Contraire Age: 0 Occupation: 36
Name: Alice Baker Age: 44 Occupation: CEO
Missing argument in printf at parse_csv.pl line 13, <$fh> line 6.
Missing argument in printf at parse_csv.pl line 13, <$fh> line 6.
Name: MegaCorp Age: 0 Occupation:
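A quick way to see what went wrong is to count the fields that split produces for one of the quoted records. The following is only a minimal sketch and is not one of the downloadable examples; the record is copied from the sample data and the variable names are our own.

```perl
use strict;
use warnings;

# One record from the sample data. The first field contains a comma,
# so the whole field is wrapped in double quotes.
my $line = '"Contraire, Leïla",36,Political Advisor';

# split knows nothing about CSV quoting, so it cuts the name in two.
my @fields = split /,/, $line;

print scalar(@fields), " fields instead of 3\n";
print "[$_]\n" for @fields;
```

The record has three logical fields, but split returns four pieces, which is why every value after the name ends up in the wrong printf slot.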
Reading
Rather than writing lots of code to handle these special cases, install the Text::CSV_XS module from the CPAN. As the author, H. Merijn Brand, points out, the module could just as well be described as parsing ASV (anything-separated values) because of its extreme flexibility. Here’s how to parse that file:
use strict;
use warnings;
use Text::CSV_XS;

my $file = 'example_18_1_jobs.csv';
open my $fh, '<', $file or die "Cannot open $file for reading: $!";
my $headers = <$fh>;    # discard headers
my $csv = Text::CSV_XS->new( { binary => 1, eol => $/ } );
while ( my $row = $csv->getline($fh) ) {
    printf "Name: %20s Age: %3d Occupation: %10s\n", @$row;
}
Note
example_18_1_parse_csv.pl available for download at Wrox.com.
And the output:
Name: John Public Age: 28 Occupation: Bum
Name: Curtis "Ovid" Poe Age: 44 Occupation: Software Engineer
Name: Contraire, Leïla Age: 36 Occupation: Political Advisor
Name: Alice Baker Age: 44 Occupation: CEO,
MegaCorp
You’ll notice that Alice Baker has her profession printed over two lines, but that’s because that was how it was represented in the original data file.
Note
If you have trouble compiling Text::CSV_XS, you may consider installing the Text::CSV module from the CPAN. It offers a pure Perl alternative. It’s not as fast, but it works.
Note the arguments to the constructor:
my $csv = Text::CSV_XS->new( { binary => 1, eol => $/ } );
By default, Text::CSV_XS assumes all data is ASCII. If you have newlines embedded in your fields or if any of your characters are above 0x7E (the tilde), then you must pass binary => 1 to the constructor to ensure it parses correctly. The eol argument is documented as taking $/, though you can change this, if needed.
You may recall that we described $/ in Chapter 9, Files and Directories. The $/ variable is the Perl built-in variable for the input record separator. For example, when you read from a file handle in scalar context, Perl returns data up to and including the input record separator (and chomp() removes that separator). Text::CSV_XS uses $/ to handle reading lines for you. You use $csv->getline($fh) instead of the normal <$fh> to read the filehandle because newlines embedded in fields are not really input record separators.
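If the role of $/ is hazy, the following minimal sketch may help. It reads from an in-memory filehandle rather than a real file, and the semicolon separator and sample strings are made up purely for illustration.

```perl
use strict;
use warnings;

# An in-memory filehandle holding three records separated by semicolons.
my $data = "alpha;beta;gamma";
open my $fh, '<', \$data or die "Cannot open in-memory file: $!";

my @records;
{
    # Change the input record separator only inside this block.
    local $/ = ';';
    while ( my $record = <$fh> ) {
        chomp $record;    # chomp removes the current value of $/
        push @records, $record;
    }
}
close $fh;

print join( ' | ', @records ), "\n";    # alpha | beta | gamma
```

Using local keeps the change from leaking into other code that expects to read newline-terminated lines.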
Writing
Obviously, if we can read CSV, we want to write it, too. In this case, let’s just print it to the console so you can see what’s going on.
use Text::CSV_XS;

my $csv = Text::CSV_XS->new({ binary => 1, eol => $/ });
my @input = (
    [ 'Name', 'Age', 'Occupation' ],
    [ 'John Public', 28, 'Bum' ],
    [ 'Curtis "Ovid" Poe', 44, 'Software Engineer' ],
    [ 'Contraire, Leïla', 36, 'Political Advisor' ],
    [ 'Alice Baker', 44, "CEO,\nMegaCorp" ],
);
foreach my $input (@input) {
    if ( $csv->combine(@$input) ) {
        print $csv->string;
    }
    else {
        printf "combine() failed on argument: %s\n", $csv->error_input;
    }
}
Note
example_18_3_write_csv.pl available for download at Wrox.com.
And that prints out:
Name,Age,Occupation
"John Public",28,Bum
"Curtis ""Ovid"" Poe",44,"Software Engineer"
"Contraire, Leïla",36,"Political Advisor"
"Alice Baker",44,"CEO,
MegaCorp"
You’ll note that we’ve handled escaping quotes correctly without worrying about it.
As you can see, the Text::CSV_XS constructor takes the same arguments as we used for reading. Later, we use the $csv->combine(LIST) to combine a list of arguments into a single CSV string and then the $csv->string method is used for printing it. If the $csv->combine method returns false, we call $csv->error_input to understand what input caused the actual error.
The Text::CSV_XS module is very flexible. If you wanted to write out the data in a tab-separated format, you could pass sep_char => "\t" to the constructor (and use this to read tab-separated format, too). You can change the quote character and many other behaviors within the module to get exactly the data you need.
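As a rough sketch of the tab-separated case, the constructor call might look like the following. The rows are borrowed from the earlier example, the eol value is set to an explicit newline, and the $tsv and $out names are our own; treat it as an illustration rather than the one right way to configure the module.

```perl
use strict;
use warnings;
use Text::CSV_XS;

# Same options as before, plus sep_char to switch from commas to tabs.
my $tsv = Text::CSV_XS->new( { binary => 1, eol => "\n", sep_char => "\t" } )
    or die "Cannot create Text::CSV_XS object: " . Text::CSV_XS->error_diag;

my @rows = (
    [ 'Name', 'Age', 'Occupation' ],
    [ 'John Public', 28, 'Bum' ],
);

my $out = '';
foreach my $row (@rows) {
    $tsv->combine(@$row)
        or die "combine() failed on argument: " . $tsv->error_input;
    $out .= $tsv->string;
}
print $out;
```

The same object can be handed a filehandle and used with getline() to read tab-separated data back in.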
Basic XML
The Extensible Markup Language (XML) is a format for encoding documents that is designed to be readable by both humans and computers. When handled correctly, it is both powerful and flexible. Here’s a simple example of an XML document that might be used to represent a library of books:
<?xml version="1.0" encoding="UTF-8" ?>
<library>
  <book isbn="1118013840">
    <title>Beginning Perl</title>
    <authors>
      <author>Curtis "Ovid" Poe"</author>
    </authors>
    <publisher>Wrox</publisher>
  </book>
  <book isbn="0596526741">
    <title>Perl Hacks</title>
    <authors>
      <author>chromatic</author>
      <author>Damian Conway</author>
      <author>Curtis "Ovid" Poe"</author>
    </authors>
    <publisher>O'Reilly Media</publisher>
  </book>
</library>
Note
example_18_4_library.xml available for download at Wrox.com.
Obviously, we could fit a lot more information in that document, including synopses, genres, and many other things that we might feel are useful. The power of XML is that it is both flexible and moderately easy to read. We’ll cover some of the more popular choices for XML reading and writing and we’ll use our example_18_4_library.xml example for our sample XML document.
Note
The problem with XML is that the XML specification (http://www.w3.org/XML/Core/#Publications) is large and complex enough that many authors, thinking XML is just angle brackets for grouping data, write XML parsers and generators that are broken in many ways.
The problem is serious enough that your author reluctantly (and with a bit of criticism) released Data::XML::Variant to allow authors to systematically write “bad” XML. Generally you don’t want to do this, but when working with other parties, they often have an XML “specification” that requires attributes in a specific order, does not allow quoting of attributes, allows unclosed tags, illegal characters and other problems.
It’s strongly recommended that you do not use Data::XML::Variant unless you have no other choice.
Reading
There are many Perl modules for reading and writing XML and XML::Simple is one of the more popular choices, but it has a variety of limitations. Still, it’s so easy to use that many people prefer it to more robust solutions.
To show you its ease of use, we can parse our example XML snippet with:
use strict;
use warnings;
use XML::Simple;
use Data::Dumper;

$Data::Dumper::Indent   = 1;
$Data::Dumper::Sortkeys = 1;
my $document = XMLin( 'library.xml', forcearray => ['author'] );
print Dumper($document);
Note
example_18_5_xml_simple.pl available for download at Wrox.com.
And that will print out:
$VAR1 = {
'book' => [
{
'authors' => {
'author' => [
'Curtis "Ovid" Poe"'
]
},
'isbn' => '1118013840',
'publisher' => 'Wrox',
'title' => 'Beginning Perl'
},
{
'authors' => {
'author' => [
'chromatic',
'Damian Conway',
'Curtis "Ovid" Poe"'
]
},
'isbn' => '0596526741',
'publisher' => 'O\'Reilly Media',
'title' => 'Perl Hacks'
}
]
};
You can also use XML::Simple to output your XML. With the $document variable above, we can do this:
print XMLout(
    $document,
    ValueAttr => { book => 'isbn' },
    RootName  => 'library',
);
And that outputs:
<library>
  <book isbn="1118013840" publisher="Wrox" title="Beginning Perl">
    <authors>
      <author>Curtis "Ovid" Poe"</author>
    </authors>
  </book>
  <book isbn="0596526741" publisher="O'Reilly Media" title="Perl Hacks">
    <authors>
      <author>chromatic</author>
      <author>Damian Conway</author>
      <author>Curtis "Ovid" Poe"</author>
    </authors>
  </book>
</library>
You’ll notice that the XML is not the same as our original document. The XML::Simple documentation offers suggestions, but makes it clear that XML::Simple should only be used when all of the following are true:
You’re not interested in text content consisting only of whitespace
You don’t mind that when things get slurped into a hash the order is lost
You don’t want fine-grained control of the formatting of generated XML
You would never use a hash key that was not a legal XML element name
You don’t need help converting between different encodings
In other words, XML::Simple is handy, but it’s limited enough that you will likely outgrow it quickly.
Other programmers prefer finer control with the XML::Twig module. This module allows you to treat XML documents as trees. A tree is a data structure that allows you to represent complex data as a “tree” with “branches” representing the different elements. For example, in our sample XML, the library element is the root of our tree and the tree structure might look like Figure 18.1, “XML Tree structure”.
By using a tree-based XML parser, you can select any or all of the branches of the tree and do something with them. For example, here’s one way to fetch just the titles:
use XML::Twig;

my @titles;
my $twig = XML::Twig->new(
    twig_handlers => {
        '//library/book/title' => sub { push @titles => $_->text },
    },
);
$twig->parsefile('library.xml');
printf "%s\n" => join ' | ', @titles;
And that should print out:
Beginning Perl | Perl Hacks
There are several ways to use XML::Twig and the method we show is one of the most memory efficient (which is one of the reasons many people use XML::Twig for XML processing). The keys of the twig_handlers hash reference are written in a subset of XPath, a language used to select elements and attributes in XML documents. The values are subroutines that allow you to manipulate the XML or fetch data from it. Inside each subroutine, the $_ variable is set to the current node of the tree.
Note
A handy tutorial for understanding XPath is available at http://www.w3schools.com/xpath/.
Or maybe you want to create a hash with the keys being the unique ISBN numbers and the values being the titles:
use XML::Twig;
use Data::Dumper;

my %books;
my $twig = XML::Twig->new(
    twig_handlers => {
        '//library/book' =>
            # att() is the documented accessor for an element's attributes
            sub { $books{ $_->att('isbn') } = $_->first_child('title')->text },
    },
);
$twig->parsefile('library.xml');
print Dumper( \%books );
And that prints out:
$VAR1 = {
'0596526741' => 'Perl Hacks',
'1118013840' => 'Beginning Perl'
};
If you really wanted to be elaborate, you could rewrite the entire XML document in a manner similar to XSLT (Extensible Stylesheet Language Transformations). Here’s how to rewrite our XML as HTML lists:
use strict;
use warnings;
use XML::Twig;

my $twig = XML::Twig->new(
    twig_handlers => {
        '//library'      => sub { $_->set_tag('ol') },
        '//library/book' => sub {
            $_->set_tag('li');
            $_->set_atts( {} );
        },
        '//library/book/title'     => sub { $_->set_tag('strong') },
        '//library/book/publisher' => sub { $_->delete },
        '//library/book/authors'   => \&rewrite_authors,
    },
    pretty_print => 'indented',
    no_prolog    => 1,
    comments     => 'drop',
);
$twig->parsefile('example_18_4_library.xml');
print $twig->toString;

sub rewrite_authors {
    my $authors = $_;
    my @authors = map { $_->text } $authors->children('author');
    $authors->set_tag('p');
    $authors->set_text( join ' - ', @authors );
}
Note
example_18_6_xml_twig.pl available for download at Wrox.com.
Running that with our sample XML should print out the following:
<ol>
  <li>
    <strong>Beginning Perl</strong>
    <p>Curtis "Ovid" Poe"</p>
  </li>
  <li>
    <strong>Perl Hacks</strong>
    <p>chromatic - Damian Conway - Curtis "Ovid" Poe"</p>
  </li>
</ol>
There are, of course, many other powerful modules for parsing XML. XML::LibXML, XML::Parser, XML::SAX and XML::Compile are just a few of the many useful (and sometimes incomprehensible) XML parsers available. Just because we showed XML::Simple and XML::Twig does not mean that they are the best. Your choice of module should reflect your needs.
Writing
Naturally, if we read XML, we have to write it, too. XML::Simple’s XMLout() function will allow us to do that, but it’s not very flexible. Instead, we’ll turn to XML::Writer to handle this task. It makes writing XML a breeze. It generally needs to write its output to an IO::File object, but for simplicity’s sake, we’ll use XML::Writer::String to let us print our XML directly.
We should start by making sure we design a good data structure to represent our XML data. XML is a set of tags, each of which may be self-closed (<tag/>) or closed later (<tag>...</tag>). Each tag may have zero or more attributes and contain either a string value or zero or more tags. We’ll represent each tag as an array reference. The first item will be the tag name, so an empty tag will look like this:
# <name/>
[ 'name' ]
The tag might have attributes, so the second element will be a hash reference. An empty hash reference means the tag has no attributes. Otherwise, the name/value pairs represent attributes:
# <name version="1.0"/>
[ name => { version => '1.0' } ]
Any array elements after the hashref should either be a single string, representing the text value, or more array references for nested tags:
# <name version="1.0">Bob</name>
[ name => { version => '1.0' }, 'Bob' ]
# <name version="1.0">
# <first>Bob</first>
# <last>Dobbs</last>
# </name>
[ name => { version => '1.0' },
[ first => {}, 'Bob' ],
[ last => {}, 'Dobbs' ],
]
If you squint, you can even see that it looks a little bit like XML.
To make it a touch cleaner to read, let’s make empty attribute hash references optional:
[ name => { version => '1.0' },
    [ first => 'Bob' ],
    [ last  => 'Dobbs' ],
]
Now that we have a clean data structure for our XML generation, let’s use XML::Writer and XML::Writer::String to create the sample XML we’ve been using for this chapter:
use strict;
use warnings;
use XML::Writer;
use XML::Writer::String;

my @to_xml = (
    library =>
    [ book => { isbn => '1118013840' } =>
        [ title => 'Beginning Perl' ],
        [ authors =>
            [ author => 'Curtis "Ovid" Poe' ],
        ],
        [ publisher => 'Wrox' ],
    ],
    [ book => { isbn => '0596526741' } =>
        [ title => 'Perl Hacks' ],
        [ authors =>
            [ author => 'chromatic' ],
            [ author => 'Damian Conway' ],
            [ author => 'Curtis "Ovid" Poe' ],
        ],
        [ publisher => "O'Reilly Media" ],
    ],
);
my $output = XML::Writer::String->new;
my $writer = XML::Writer->new(
    OUTPUT      => $output,
    DATA_MODE   => 1,
    DATA_INDENT => 2,
);
$writer->xmlDecl;
write_element($writer, @to_xml);
$writer->end;
print $output->value;

sub write_element {
    my ( $writer, $element, @next ) = @_;

    # This allows the attributes hashref to be optional
    my ( $attributes, @elements ) = 'HASH' eq ref $next[0]
      ? @next             # we had attributes
      : ( {}, @next );    # we did not have attributes
    $writer->startTag($element, %$attributes);
    foreach my $next_element (@elements) {
        ref $next_element
          ? write_element($writer, @$next_element)
          : $writer->characters($next_element);
    }
    $writer->endTag;
}
Note
example_18_7_xml_writer.pl available for download at Wrox.com.
Running this code will print our desired XML:
<?xml version="1.0"?>
<library>
  <book isbn="1118013840">
    <title>Beginning Perl</title>
    <authors>
      <author>Curtis "Ovid" Poe</author>
    </authors>
    <publisher>Wrox</publisher>
  </book>
  <book isbn="0596526741">
    <title>Perl Hacks</title>
    <authors>
      <author>chromatic</author>
      <author>Damian Conway</author>
      <author>Curtis "Ovid" Poe</author>
    </authors>
    <publisher>O'Reilly Media</publisher>
  </book>
</library>
At the top of the xml_writer.pl code, we use XML::Writer and XML::Writer::String. Ordinarily, XML::Writer expects to write the data to an IO::File object, but we use XML::Writer::String to make it easier to directly see the output as you’re working.
Next, we have our data structure which, as you can see, mirrors our example XML perfectly.
Then we have our code for writing the XML declaration (that’s the <?xml version="1.0"?> bit), writing the actual XML, and finally printing out our result:
$writer->xmlDecl;
write_element($writer, @to_xml);
$writer->end;
print $output->value;
Finally, we have a recursive subroutine that walks the data structure to print the XML:
sub write_element {
    my ( $writer, $element, @next ) = @_;

    # This allows the attributes hashref to be optional
    my ( $attributes, @elements ) = 'HASH' eq ref $next[0]
      ? @next             # we had attributes
      : ( {}, @next );    # we did not have attributes
    $writer->startTag($element, %$attributes);
    foreach my $next_element (@elements) {
        ref $next_element
          ? write_element($writer, @$next_element)
          : $writer->characters($next_element);
    }
    $writer->endTag;
}
Since objects in Perl are references, any changes made to the object will persist and we don’t need to return it (this simplifies our code somewhat). We can write out an XML start tag with this:
$writer->startTag($element, %$attributes);
Then we loop over the remaining elements. If an element is an array reference, we recursively call write_element() with its contents; if it is a string, such as Wrox, we write it directly after the tag:
$writer->characters($next_element);
Finally, we call $writer->endTag which will automatically print the closing tag for whichever $writer->startTag we last called.
Note
The data structure for writing XML might look a bit strange, but it’s a real-world example. Your author had to work from home for a couple of weeks due to a back injury and rewrote the XML generation for the Programme Information Platform (PIPs) for the BBC. The data structure is therefore identical to the data structure used to provide XML data for the world’s largest broadcaster’s data feeds for metadata. It also has a side effect of being unambiguously serializable in both JSON and YAML formats.
If you’re curious about a tiny behind-the-scenes look at a powerful application powered by Perl, you can read more about PIPs at:
http://www.bbc.co.uk/blogs/bbcinternet/2009/02/what_is_pips.html
Note that the sample code for this chapter is skipping much of the data validation and error reporting that was necessary, and also doesn’t include the work needed to deserialize XML, YAML and JSON back into the same Perl data structure (XML is a bit tricky, but the YAML and JSON deserialization is very straightforward).
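To illustrate the serialization point, here is a minimal sketch that round-trips the small name structure from earlier through JSON. It assumes the core JSON::PP module (bundled with Perl since 5.14) and is our own illustration, not the code used in PIPs.

```perl
use strict;
use warnings;
use JSON::PP;

# The same array-of-arrays structure used for the XML writer.
my $name = [
    name => { version => '1.0' },
    [ first => 'Bob' ],
    [ last  => 'Dobbs' ],
];

# canonical() sorts hash keys so the output is stable between runs.
my $json = JSON::PP->new->canonical;
my $text = $json->encode($name);
print "$text\n";

# Decoding gives back an equivalent Perl data structure.
my $copy = $json->decode($text);
print $copy->[2][1], "\n";    # Bob
```

Because tags are arrays and attributes are hashes, nothing in the structure needs special treatment to survive the trip.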
Date Handling
In 1999, your author was working as a mainframe programmer and part of his job was dealing with the infamous Y2K issue. Many systems had years stored as two bytes, meaning, for example, that the year 00 would often be interpreted as 1900 rather than 2000. There was a rather strange conspiracy theory running around that programmers across the planet had somehow managed to work together to create a non-existent problem in order to guarantee job security. In response, your author penned the following haiku:
Is Y2K real?
The problem's being solved by
Those who can't find dates.
(For the pedants: yes, we know it’s a senryu).
Today, handling dates is as tricky as ever and programmers invariably underestimate the subtleties involved. Fortunately, Perl offers a variety of excellent tools for making your life far easier. In your programming career, you will eventually be confronted with your first date. Don’t blow it; listen to the experts.
DateTime
The DateTime module, written by Dave Rolsky, is your first choice when learning to navigate this tricky area. If you need to work with dates and times regularly, we strongly recommend that you not only read the documentation thoroughly, but you also read the http://datetime.perl.org/ Web site.
We’ve already seen a fair amount of use of this module in the book, so we won’t belabor it now, but we’ll review a couple of features that are easy to forget about.
Let’s say that you’re passed a DateTime object and you want to know if it’s after now.
if ( $datetime > DateTime->now ) {
    # $datetime is in the future
}
Sometimes you’ll get a string containing a date and you’ll want to parse that into a DateTime object. DateTime::Format::Strptime is a good choice (you can also use DateTime::Format::Builder for hard-to-parse cases).
use DateTime::Format::Strptime;
my $parser = DateTime::Format::Strptime->new( pattern => '%Y-%m-%d' );
my $datetime = $parser->parse_datetime('1967-06-20');
And you can even represent time down to the nanosecond level:
my $dt_ns = DateTime->new(
    year       => 2012,
    month      => 5,
    day        => 23,
    hour       => 22,
    minute     => 35,
    second     => 16,
    nanosecond => 130,
);
Date::Tiny and DateTime::Tiny
Because dates are so much harder than people think, DateTime has to be a large module: it is slow to load and takes about three or four megabytes of memory. If all you need to do is represent a date and don’t care about durations, comparisons, or other aspects of date math, you might find the Date::Tiny and DateTime::Tiny modules by Adam Kennedy interesting. Not only are they much smaller (around 100 bytes of memory), but they’re much faster. However, they achieve this by deliberately excluding many features that DateTime provides. Their basic usage is very similar:
my $date = Date::Tiny->new(
    year  => 1967,
    month => 6,
    day   => 20,
);
my $datetime = DateTime::Tiny->new(
    year   => 2006,
    month  => 12,
    day    => 31,
    hour   => 10,
    minute => 45,
    second => 32,
);
The Date::Tiny module is very useful when you only need a date and the DateTime::Tiny module is useful when you need both a date and a time. Your author has found these very handy on performance-sensitive systems when he needs to generate a date or datetime string for now very quickly:
my $today = Date::Tiny->now;
my $now   = DateTime::Tiny->now;
Unlike DateTime, each of these has a simple, built-in date string parser. The date or datetime strings are expected to be in ISO 8601 format:
my $birthday       = Date::Tiny->from_string('1967-06-20');
my $party_like_its = DateTime::Tiny->from_string('1998-12-31T23:59:59');
Note
In case you’ve wondered why the date 1967-06-20 shows up so often in the text, it’s because this is your author’s birthday. Now you have no excuse for forgetting.
Both the Date::Tiny and DateTime::Tiny modules provide a DateTime method for returning the object as its corresponding DateTime object. This is very useful if you discover that you need to manipulate the DateTime in ways that the Date::Tiny and DateTime::Tiny modules do not allow.
my $dt1   = $birthday->DateTime;
my $party = $party_like_its->DateTime->add( seconds => 1 );
Note
The trick of wrapping the variable identifier in curly braces in a string also works outside of strings:
my ${foo} = 3;
print $foo; # prints 3
However, do not do this. It doesn’t add any value here and is good only for obfuscating code and showing off (usually a bad idea). We mention it here for completeness.
This program shows off a lot of idiomatic Perl and shows you what a “real world” program actually looks like. It’s worth going over a few times to make sure you understand what it’s doing.
Understanding your program
Writing programs is great, but often we find ourselves in the situation where something has gone wrong and we need to understand what it is. Perl offers a rich variety of tools to help us understand these issues and we’ll take a look at some of them now.
Debugger
The debugger for Perl is probably one of the most useful tools you can use, but many Perl programmers avoid it. Basic usage is quite simple, but the Fear Of The Command Line seems to intimidate many developers. This is actually rather understandable because the debugger is very cryptic. You will find, though, that taking the time to learn the debugger pays off handsomely. We’ll look at some of the basics.
There are several ways to use the Perl debugger, the most common of which is this:
perl -d some_program.pl
That command will run some_program.pl in the debugger. Rather than try to explain everything, let’s look at a sample program. Imagine that, for some strange reason, you have a list of strings and you want to return the number of letters in a string if it’s a palindrome (hey, you try coming up with interesting examples for a book this long!), but zero if it’s not a palindrome. For example, the sentence “Murder for a jar of red rum” should have a length of 21, but “Hey, dude”, should return 0. So let’s look at our sample program:
use strict;
use warnings;
use Data::Dumper;

my @strings = (
    'Dogma? I am God.',
    'I did, did I?',
    'Lager, sir, is regal.',
    'This is not a palindrome',
    'Murder for a jar of red rum.',
    'Reviled did I live, said I, as evil I did deliver.',
);

my %lengths = map { $_ => plength($_) } @strings;
print Dumper \%lengths;

sub plength {
    my $word = @_;
    $word =~ s/\W//g;
    return 0 unless $word eq reverse $word;
    return length $word;
}
Note
example_18_8_palindrome.pl available for download at Wrox.com.
That’s pretty straightforward, but it prints out:
$VAR1 = {
'Reviled did I live, said I, as evil I did deliver.' => 1,
'This is not a palindrome' => 1,
'Lager, sir, is regal.' => 1,
'Dogma? I am God.' => 1,
'I did, did I?' => 1,
'Murder for a jar of red rum.' => 1
};
Obviously, it’s not true that those are all of length one, and clearly one of them is not a palindrome (which one is an exercise for the reader). So let’s fire up our debugger and figure out what’s going on.
$ perl -d palindrome.pl
Loading DB routines from perl5db.pl version 1.33
Editor support available.
Enter h or `h h' for help, or `man perldebug' for more help.
main::(palindrome.pl:5): my @strings = (
main::(palindrome.pl:6): 'Dogma? I am God.',
main::(palindrome.pl:7): 'I did, did I?',
main::(palindrome.pl:8): 'Lager, sir, is regal.',
main::(palindrome.pl:9): 'This is not a palindrome',
main::(palindrome.pl:10): 'Murder for a jar of red rum.',
main::(palindrome.pl:11): 'Reviled did I live, said I, as ...
main::(palindrome.pl:12): );
DB<1>
When you first fire up the debugger, it displays the first line or lines of code that it’s about to run. Whenever a new line of code shows up in the debugger, it’s a line of code that is about to be executed, not one that has already been executed.
So we can see above that Perl is about to evaluate the contents of the @strings array. Our code started with three use statements, but those aren’t shown because they happen at compile time and the debugger (usually) starts at the first runtime statement.
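The compile-time versus runtime distinction is easy to see outside the debugger. A use statement behaves roughly like a BEGIN block that loads the module, so this minimal sketch (our own, not one of the downloadable examples) lets a BEGIN block stand in for use. Even though the block appears after the first push, it runs first:

```perl
use strict;
use warnings;

our @order;

# An ordinary statement: executed at runtime, in file order.
push @order, 'runtime';

# A BEGIN block runs as soon as it has been compiled, before any
# runtime statement in the file. "use Module" works the same way.
BEGIN { push @order, 'compile time' }

print join( ' -> ', @order ), "\n";    # compile time -> runtime
```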
So what do we do? Let’s type n. That will advance to the ‘n’ext line of code.
DB<1> n
main::(palindrome.pl:14): my %lengths = map { $_ => plength($_) } @strings;
Now we see that the next line of code we’re going to execute is the map statement. Since there are six elements in @strings, we could hit n six times to execute this six times:
DB<1> n
main::(palindrome.pl:14): my %lengths=map{$_=>plength($_)}@strings;
DB<1> n
main::(palindrome.pl:14): my %lengths=map{$_=>plength($_)}@strings;
DB<1> n
main::(palindrome.pl:14): my %lengths=map{$_=>plength($_)}@strings;
DB<1> n
main::(palindrome.pl:14): my %lengths=map{$_=>plength($_)}@strings;
DB<1> n
main::(palindrome.pl:14): my %lengths=map{$_=>plength($_)}@strings;
DB<1> n
main::(palindrome.pl:14): my %lengths=map{$_=>plength($_)}@strings;
DB<1> n
main::(palindrome.pl:15): print Dumper \%lengths;
However, that doesn’t let us see what’s happening in the plength subroutine, so after the first n command, let’s type s to step into the subroutine.
DB<1> s
main::(palindrome.pl:14): my %lengths=map{$_=>plength($_)}@strings;
DB<1> s
main::plength(palindrome.pl:18): my $word = @_;
The first s steps into the map command and the second s steps into plength and shows us that we’re about to execute the first line of the subroutine. We’ll type n again to go to the next line:
DB<1> n
main::plength(palindrome.pl:19): $word =~ s/\W//g;
Now that we’ve executed the my $word = @_; line, $word has a value, so let’s look at that by using the p command. The p command is shorthand for print:
DB<2> p $word
1
Ah hah! As you can see, $word has a value of 1. So what’s in @_? We’ll use the x command for this. It’s like the debugger version of Data::Dumper, but with a slightly different output:
DB<5> x \@_
0  ARRAY(0x7ff0398ac3a8)
   0  'Dogma? I am God.'
The x command dumps out the variables. If it’s a reference, it displays the reference type and address (something like ARRAY(0x7ff0398ac3a8)), and then shows the contents of the variable. In this case, we see we have a one-element array containing the string Dogma? I am God.. Obviously, we’re passing in the correct value, but now we should understand what happened. We assigned an array to a scalar, which gives the number of elements in the array, and that’s why $word contained the value of 1. So quit the debugger with the q command and fix the first line of the subroutine by forcing list context:
my ($word) = @_;
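The difference the parentheses make can be checked in isolation. This is a minimal sketch with made-up variable names, using a plain array in place of @_:

```perl
use strict;
use warnings;

my @args = ('Dogma? I am God.');

my $count   = @args;    # scalar context: the number of elements
my ($first) = @args;    # list context: the first element

print "count: $count\n";    # count: 1
print "first: $first\n";    # first: Dogma? I am God.
```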
Now when we run our program again, we get the following output:
$VAR1 = {
'Reviled did I live, said I, as evil I did deliver.' => 0,
'This is not a palindrome' => 0,
'Lager, sir, is regal.' => 0,
'Dogma? I am God.' => 0,
'I did, did I?' => 8,
'Murder for a jar of red rum.' => 0
};
Hmm, the I did, did I? line worked, but not the rest. Since plength() has a line that returns 0, that line looks awfully suspicious. Let’s run the debugger again. As you probably noticed, our output above was rather limited. We usually only saw one line at a time and that can make it hard to see what’s going on. So let’s fix that.
When you are in the debugger, you can type v to see a “view” of the lines surrounding your current line. The current line is designated with a ==> marker. For example:
DB<1> v
15: print Dumper \%lengths;
16
17 sub plength {
18==> my $word = @_;
19: $word =~ s/\W//g;
20: return 0 unless $word eq reverse $word;
21: return length $word;
22 }
DB<1>
As you can see from the above, we’re on the first line of the plength() subroutine, but with the extra lines before and after, it’s much easier to see where we are and to understand what’s going on.
Obviously, you don’t want to type v every time you enter a command, so when you enter the debugger, before you type anything else, type {{v. The {{ command construct, followed by any debugger command, tells the debugger to execute that command before every debugger prompt (hey, we already said the debugger was cryptic!). So we’ll do that now, followed by the n command to move to the next line of code.
$ perl -d palindrome.pl
Loading DB routines from perl5db.pl version 1.33
Editor support available.
Enter h or `h h' for help, or `man perldebug' for more help.
main::(palindrome.pl:5): my @strings = (
main::(palindrome.pl:6): 'Dogma? I am God.',
main::(palindrome.pl:7): 'I did, did I?',
main::(palindrome.pl:8): 'Lager, sir, is regal.',
main::(palindrome.pl:9): 'This is not a palindrome',
main::(palindrome.pl:10): 'Murder for a jar of red rum.',
main::(palindrome.pl:11): 'Reviled did I live, said I, as ...
main::(palindrome.pl:12): );
DB<1> {{v
DB<2> n
main::(palindrome.pl:14): my %lengths=map {$_=>plength($_)}@strings;
auto(-1) DB<2> v
11 'Reviled did I live, said I, as evil I did deliver.',
12 );
13
14==> my %lengths = map { $_ => plength($_) } @strings;
15: print Dumper \%lengths;
16
17 sub plength {
18: my ($word) = @_;
19: $word =~ s/\W//g;
20: return 0 unless $word eq reverse $word;
DB<2>
The debugger is looking better already. Let’s step into the plength subroutine. We’ll do this by setting a breakpoint with b plength (you can set breakpoints with either line numbers or subroutine names) and then hitting c to continue to the breakpoint, then we’ll use n a couple of times to get to the desired line of code.
DB<2> b plength
DB<3> c
main::plength(palindrome.pl:18): my ($word) = @_;
auto(-1) DB<3> v
15: print Dumper \%lengths;
16
17 sub plength {
18==>b my ($word) = @_;
19: $word =~ s/\W//g;
20: return 0 unless $word eq reverse $word;
21: return length $word;
22 }
DB<3> n
main::plength(palindrome.pl:19): $word =~ s/\W//g;
auto(-1) DB<3> v
16
17 sub plength {
18:b my ($word) = @_;
19==> $word =~ s/\W//g;
20: return 0 unless $word eq reverse $word;
21: return length $word;
22 }
DB<3> n
main::plength(palindrome.pl:20): return 0 unless $word eq reverse $word;
auto(-1) DB<3> v
17 sub plength {
18:b my ($word) = @_;
19: $word =~ s/\W//g;
20==> return 0 unless $word eq reverse $word;
21: return length $word;
22 }
DB<3>
Note
If you, like your author, prefer to always have several lines of context in your debugger output, create a file in your home directory named .perldb. Add the following text:
@DB::typeahead = ('{{v');
When you launch the debugger, Perl will find that file and execute those Perl commands. In this case, before you type anything, the debugger will “type” the commands present in the array. This will give you the lines of context that you are looking for.
See perldoc perldebug for a full explanation of the debugger and perldoc perldebtut for a tutorial on using it.
At this point, remember what our output was:
$VAR1 = {
'Reviled did I live, said I, as evil I did deliver.' => 0,
'This is not a palindrome' => 0,
'Lager, sir, is regal.' => 0,
'Dogma? I am God.' => 0,
'I did, did I?' => 8,
'Murder for a jar of red rum.' => 0
};
Clearly we have a problem where we’re returning 0 from plength() and we’re on the line of code that is responsible for this:
return 0 unless $word eq reverse $word;
So let’s use the p command to print out some values. Notice that for DB<6> we can even print out the value of an expression (in this case the eq check in the line of code that’s the problem):
DB<3> p $word
DogmaIamGod
DB<4> $t = reverse $word
DB<5> p $t
doGmaIamgoD
DB<6> p ( $word eq reverse $word ) ? 'Yes' : 'No'
No
This makes it clear that we want a case-insensitive check, so we’ll change the value of $word:
DB<7> $word = lc $word
DB<8> p ( $word eq reverse $word ) ? 'Yes' : 'No'
Yes
DB<9> n
main::plength(palindrome.pl:21): return length $word;
auto(-1) DB<13> v
18:b my ($word) = @_;
19: $word =~ s/\W//g;
20: return 0 unless $word eq reverse $word;
21==> return length $word;
22 }
DB<10>
As you can see, changing the value of $word to lc $word allows the code to continue correctly. Now it’s obvious how to fix it.
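One way to carry that change from the debugger session into the source is sketched below. This is a minimal sketch that assumes the fix belongs inside plength() itself, right after the non-word characters are stripped; other placements would work too.

```perl
use strict;
use warnings;

sub plength {
    my ($word) = @_;
    $word =~ s/\W//g;      # strip spaces and punctuation
    $word = lc $word;      # compare case-insensitively
    return 0 unless $word eq reverse $word;
    return length $word;
}

print plength('Dogma? I am God.'), "\n";            # prints 11
print plength('I did, did I?'), "\n";               # prints 8
print plength('This is not a palindrome'), "\n";    # prints 0
```

Because eq forces scalar context on its operands, reverse $word reverses the characters of the string rather than reversing a list.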
Table 18-1 lists common debugger commands for handy reference. We haven’t covered everything in this section and there’s a lot more to learn.
Table 18-1. Common debugger commands
| Command | Meaning |
|---|---|
| n | Go to the next line of code. Do not enter a subroutine call. |
| s | Step into a subroutine call. |
| b subname | Set a breakpoint at a subroutine name. |
| b linenumber | Set a breakpoint at the specified line number. |
| c | Continue executing code until the next breakpoint. |
| p | Print the value of a variable or expression. |
| x | Like the p command, but will “dump” references. |
| v | View a range of lines around the current line. |
| T | Show a stacktrace. |
| q | Quit the debugger. |
| {{ command | Execute command before every debugger prompt. |
| w expression | Set a global watch expression. |
| W expression | Delete a global watch expression. |
| h | Display debugger help. |
| r | Return from a subroutine. |
| S pattern | Display all subroutines matching pattern. |
Profiling
So you have a large, working piece of software. It’s comprised of several modules, but it’s slow and buggy. You’re not sure why, so how do you fix it? That’s where various profiling tools come in handy.
Devel::Cover
A few years ago in London, your author was at a gathering of London Perl Mongers when one of the attendees sheepishly admitted that they had just started testing and only 1% of their code was covered by tests. However, they had started writing tests by focusing on real bugs that were being reported in their system and their phone support people reported a significant drop in calls. Some experienced developers are aware of this and instead of writing tests for all of their code, they focus their tests on the most critical parts of their code and hope to come back later and write tests for the rest.
But what does code coverage mean? Imagine the following subroutine:
sub is_temperature_out_of_bounds {
my $celsius = shift;
if ( $celsius > 40 ) {
return 1;
}
elsif ( $celsius < 10 ) {
return 1;
}
else {
return;
}
}
The is_temperature_out_of_bounds() subroutine should return a true value if the temperature is greater than 40 degrees Celsius or less than 10 degrees Celsius, and a false value otherwise. Some tests might look like this:
ok is_temperature_out_of_bounds(50), '50 degrees is too high';
ok !is_temperature_out_of_bounds(30), '30 degrees is ok';
In this case, our tests clearly miss the case where the temperature is less than 10 degrees. The subroutine is simple enough that we may think it’s not important, but if someone changes this subroutine in the future, it would be very unfortunate to not have full tests covering all possible conditions and lines of code.
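A test file that exercises every branch of the subroutine might look like the following sketch. The boundary tests at exactly 40 and 10 degrees are our own addition; they assume the limits themselves count as in bounds, which is what the > and < comparisons in the code imply.

```perl
use strict;
use warnings;
use Test::More;

sub is_temperature_out_of_bounds {
    my $celsius = shift;
    if ( $celsius > 40 ) {
        return 1;
    }
    elsif ( $celsius < 10 ) {
        return 1;
    }
    else {
        return;
    }
}

# One test per branch of the if/elsif/else.
ok  is_temperature_out_of_bounds(50), '50 degrees is too high';
ok  is_temperature_out_of_bounds(5),  '5 degrees is too low';
ok !is_temperature_out_of_bounds(30), '30 degrees is ok';

# Boundary values (assumption: the limits themselves are in bounds).
ok !is_temperature_out_of_bounds(40), '40 degrees is ok';
ok !is_temperature_out_of_bounds(10), '10 degrees is ok';

done_testing;
```

With these tests in place, a coverage report for this subroutine should show every statement and branch as exercised.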
So how do you know which code is actually covered by your test suite? That’s where Paul Johnson’s excellent Devel::Cover module comes in. This module will give you excellent reports on exactly what is covered in your test suites. Let’s take a look at the code coverage for AI::Prolog, a module your author wrote to do logic programming in Perl.
Note
AI::Prolog implements an interpreter for a language called Prolog. In most programming languages, you tell the computer step-by-step how to solve problems. In Prolog and other logic programming languages, you give them all of the data you know about a problem and the rules of how the data is related. Then when you present it with a problem, the language figures out how to solve the problem for you! Logic programming languages are very fascinating. Your author recommends that anyone who wants to be a top-notch developer learn multiple programming paradigms including logic programming.
For AI::Prolog, instead of installing it via the CPAN, we download it from the CPAN and untar the distribution (tar zxvf AI-Prolog-0.741.tar.gz, or Windows users can double-click the icon). Type perl Makefile.PL and accept the default prompt for installing the aiprolog shell and then type make.
Now that you’ve built the distribution, you can test it with Devel::Cover.
cover -delete
HARNESS_PERL_SWITCHES=-MDevel::Cover make test
cover
The cover -delete command tells Devel::Cover to delete any previous code coverage runs.
The HARNESS_PERL_SWITCHES environment variable tells Perl to load Devel::Cover for every test that it runs. You should see output similar to the following (with some warnings deleted for clarity — your output will not be identical):
$ HARNESS_PERL_SWITCHES=-MDevel::Cover make test
t/01pod.t ................ ok
t/05examples.t ........... ok
t/10choicepoint.t ........ ok
t/20term.t ............... ok
t/25cut.t ................ ok
t/25number.t ............. ok
t/30termlist.t ........... ok
t/35clause.t ............. ok
t/35primitive.t .......... ok
t/35step.t ............... ok
t/50engine.t ............. ok
t/60aiprolog.t ........... ok
t/80math.t ............... ok
t/80preprocessor.t ....... ok
t/80preprocessor_math.t .. ok
t/90results.t ............ ok
All tests successful.
Files=19, Tests=461, 16 wallclock secs
Result: PASS
Then you issue the cover command:
$ cover
Reading database from /tmp/AI-Prolog-0.741/cover_db

---------------------------- ------ ------ ------ ------ ------ ------ ------
File                           stmt   bran   cond    sub    pod   time  total
---------------------------- ------ ------ ------ ------ ------ ------ ------
lib/AI/Prolog.pm               69.7   37.5    n/a   72.2   88.9   22.6   67.0
lib/AI/Prolog/ChoicePoint.pm  100.0    n/a    n/a  100.0    0.0    2.2   85.7
lib/AI/Prolog/Engine.pm        83.3   70.3   75.0   77.1   60.0    4.5   78.5
...olog/Engine/Primitives.pm   59.5   12.5    0.0   90.9    0.0    1.0   55.7
...I/Prolog/KnowledgeBase.pm   30.4   16.7    0.0   46.2    0.0    0.4   27.2
lib/AI/Prolog/Parser.pm        82.7   75.0   58.8   75.0    0.0   44.8   76.7
...og/Parser/PreProcessor.pm  100.0    n/a    n/a  100.0    0.0    2.8   94.1
...rser/PreProcessor/Math.pm   96.8   85.7  100.0   95.5    0.0    3.3   93.5
lib/AI/Prolog/Term.pm          77.7   66.7   58.9   82.6    0.0   12.2   68.8
lib/AI/Prolog/Term/Cut.pm     100.0    n/a    n/a  100.0    0.0    0.1   88.5
lib/AI/Prolog/Term/Number.pm  100.0  100.0   66.7  100.0    0.0    0.2   88.2
lib/AI/Prolog/TermList.pm      97.2   83.3   66.7  100.0    0.0    2.1   89.5
...Prolog/TermList/Clause.pm   95.2   75.0    n/a  100.0    0.0    0.7   85.3
...log/TermList/Primitive.pm  100.0   50.0    n/a  100.0    0.0    2.7   84.6
...I/Prolog/TermList/Step.pm  100.0    n/a    n/a  100.0    0.0    0.5   95.0
Total                          76.2   61.8   58.7   81.9   12.7  100.0   70.4
---------------------------- ------ ------ ------ ------ ------ ------ ------

HTML output written to /tmp/AI-Prolog-0.741/cover_db/coverage.html
done.
Note
If you do not have a Makefile.PL or Build.PL file for your code, you can run coverage with prove:
HARNESS_PERL_SWITCHES=-MDevel::Cover prove -l t
For every module in the distribution, we have a percentage of coverage for all statements (stmt), branches (bran), conditionals (cond), subroutines (sub), and documentation (pod). The time column merely represents the percent of time the tests spent in each module. An n/a result means that the particular type of code to cover was not found. The totals across the bottom and down the right hand side are averages for the amounts (except for the time column). The number 70.4 in the bottom right-hand portion of the result shows the overall code coverage percent. 70.4% is not bad, but it’s not particularly great, either.
Warning
Many programmers new to testing make the mistake of thinking they should shoot for 100% code coverage with their tests. Many types of code, such as GUIs or threaded code, are intrinsically hard to test and the amount of stress you find in trying to test virtually untestable code sometimes means that manual testing is fine. Remember, you have deadlines and code to deliver and if you have hard to test code, focus your tests on those areas of your code that are the most critical.
Statements represent individual lines of code (as separated by semicolons). Branches represent things like if/else conditions. Conditional coverage examines Boolean operators such as if ( ($x && $y) || !$z) { ... }. POD coverage uses a heuristic to determine if subroutines not beginning with underscores have POD documentation for them.
Note
Even having 100% coverage for your code does not mean that it is bug free because for larger systems, it’s generally impossible to test all possible combinations of inputs and all of the different paths through the code. Thorough code coverage is good, but it’s no “silver bullet” in ensuring that your code works as expected.
Knowing that we have code not covered by our tests isn’t very helpful unless we know which code is not covered. Note that the second to last line of our output was this:
HTML output written to /tmp/AI-Prolog-0.741/cover_db/coverage.html
Open that up in a browser and you should see output similar to Figure 18.2, “Figure 18-1”.
Many of the items in that report are underlined. Those are hyperlinks that let you drill down to individual modules and see what lines of code your tests have missed. Note that if you have some code with no code coverage, it might actually be dead code you can delete!
Devel::NYTProf
Knowing what code your tests cover is great, but what if your code runs about as fast as a paraplegic cheetah? Not so great. Often in working with large-scale systems, we find that network latency, database access or simple disk operations are responsible for our slow code, but not always. The first problem to solve when looking for slow code is to identify which code is actually slow. When you’re working on a system with a few hundred thousand lines of code, this is not a trivial problem. That’s where Devel::NYTProf comes in. Written by Tim Bunce (the author of the DBI module we covered in Chapter 16, Databases) and Adam Kaplan, the Devel::NYTProf module is often used with test suites to determine where your slow code is.
Note
To get a much better introduction to Devel::NYTProf, see Tim Bunce’s screencast on the topic at http://blip.tv/timbunce/devel-nytprof-v4-oscon-201007-3932242.
It contains many excellent tips and tricks that you should know when trying to find performance problems in your code.
Note that Devel::NYTProf recommends Perl version 5.8.9 or better, with 5.10.1 or better being preferred.
The basic way of using Devel::NYTProf is to execute perl with the -d flag:
perl -d:NYTProf some_perl.pl
The -d flag, as explained earlier in this chapter, starts Perl with the debugger. However, followed by a colon and an $identifier, Perl will attempt to load Devel::$identifier and run the some_perl.pl program listed on the command line. With the -d:NYTProf argument, Perl will load Devel::NYTProf and then run some_perl.pl.
In this case, we’ll run it on the maze.pl program that we wrote in Chapter 7, Subroutines. We’ll use the downloadable version as that has more interesting timing information.
perl -d:NYTProf maze.pl
When using Devel::NYTProf, the program will generally take 3 to 4 times longer to run, but this is far faster than earlier (and broken) profilers were. Then you can open the Devel::NYTProf output in your favorite browser:
nytprofhtml --open
Note
The command nytprofhtml --open may not work on your system. Instead, you can use this:
nytprofhtml nytprof.out
And that will create a directory full of HTML files you can browse. It’s the same thing as nytprofhtml --open, but without the magical opening of a browser window for you.
The output should resemble Figure 18.3, “Devel::NYTProf output”.
The first page of output, shown in Figure 18.3, “Devel::NYTProf output”, contains the 15 slowest subroutines (though you can see all of the subroutines if you like), but to understand them, you need to know what the columns mean.
The column headers give you a good idea of what issues to look for. The Calls column represents the number of times the subroutine was called. You can see that the relatively fast diagnostics::CORE::subst was called a whopping 7801 times! Even though each individual call is fast, that many calls add up to a significant total.
The P column is the number of places the subroutine was called from. For diagnostics::CORE::subst, we can see that it was called from seven places. The F column represents the number of files the subroutine was called from.
Exclusive and Inclusive time columns confuse a few people at first. The Inclusive Time represents how much time the subroutine took to run, including the time of any subroutine calls it made. The Exclusive Time column represents the time the subroutine took to run, excluding the timing of any of its subroutine calls. That’s why Exclusive Time should always be equal to or less than Inclusive Time.
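The distinction can be sketched with two hypothetical subroutines, outer() and inner(), and a hypothetical elapsed() helper that simply measures wallclock time around a call. None of these names come from maze.pl or from Devel::NYTProf; this only illustrates what the two columns mean.

```perl
use strict;
use warnings;
use Time::HiRes ();

# Hypothetical subroutines: outer() does a little work of its own
# (simulated with a short sleep) and then calls inner().
sub inner { Time::HiRes::sleep(0.2); return }
sub outer { Time::HiRes::sleep(0.1); inner(); return }

# Hypothetical helper: wallclock seconds spent running a code reference.
sub elapsed {
    my $code  = shift;
    my $start = Time::HiRes::time();
    $code->();
    return Time::HiRes::time() - $start;
}

my $inclusive = elapsed( \&outer );    # outer() plus everything it calls
my $inner     = elapsed( \&inner );    # inner() on its own

printf "outer() inclusive: about %.1fs\n", $inclusive;
printf "outer() exclusive: about %.1fs\n", $inclusive - $inner;
```

A profiler would report outer() with an Inclusive Time of roughly 0.3 seconds but an Exclusive Time of only about 0.1 seconds, because the other 0.2 seconds belong to inner().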
And finally, we have the Subroutine column, which names the offending subroutine.
So how do we use this? Well, we can guess that diagnostics::CORE::subst was probably called from the diagnostics pragma, so merely removing that pragma should speed things up a bit.
The top two lines, though, are interesting:
Calls  P  F  Exclusive  Inclusive  Subroutine
400    1  1  4.41s      4.41s      Time::HiRes::usleep (xsub)
402    3  1  2.76s      2.76s      main::CORE:system (opcode)
For the downloadable version of this program, we repeatedly redraw the maze, in slow motion, to let you see the recursive rendering of the maze. Let’s take out all the usleep and system calls to ensure that our code renders as quickly as possible. Running our profiler again gives us new results, as shown in Figure 18.4, “Devel::NYTProf output”.
As you can see, we’ve gone from over 7 seconds to around half a second. That’s great, but clearly this is not a real-world example and, in this case, we’ve dramatically changed the behavior of our code.
Subroutines with names like BEGIN@... represent compile time code, such as loading use warnings. Others, such as the top two tunnel() and have_not_visited() subroutines, are clearly examples of code we can look at for further optimization. To figure out how to make them run faster, don’t guess. Benchmark them! That’s what we’ll cover in the next section.
Benchmark
The Benchmark module is one of the core modules released with Perl version 5. It’s used to benchmark some code to see how long it takes to run, and compare it with alternative versions of code that do the same thing.
Warning
It is very common among developers (sometimes even experienced ones!) to worry about the performance of their programs when they should not. Though there are times this makes sense, programmers tend to be incredibly bad at judging which parts of their software they should optimize. Even if you know that a routine is slow, if it only takes 0.2% of your program’s running time, it’s probably not worth speeding up. That’s why Devel::NYTProf is an excellent tool for finding out what parts of your program are the real trouble spots.
Just remember one rule: when your program runs fast enough for your needs, stop optimizing.
Let’s take a look at a concrete example, using the example of a factorial. The factorial of a number can be defined as that number times the factorial of that number minus one, with the factorial of zero being defined as one. In other words, the factorial of 4 is 24 (4 * 3 * 2 * 1). We could write this with a recursive function that clearly defines our intent:
sub fac {
my $number = shift;
return 1 if 0 == $number;
return $number * fac( $number - 1 );
}
Note
A common error in factorial functions is to return 0 for the factorial of 0 because programmers forget (or don’t know) that the proper result is 1. That’s why our factorial program has return 1 if 0 == $number rather than when 1 == $number.
This seems fine, but what if profiling the code shows that this function is called thousands of times? Would it be worthwhile to eliminate the overhead of the recursive function call? Let’s find out by using the timethese() function from Benchmark. One way of using this function looks like this (see the documentation for full details):
timethese(
$number_of_times_to_run_the_code,
{
name1 => \&subref1,
name2 => \&subref2,
}
);

use strict;
use warnings;
use Benchmark 'timethese';
sub recursive_factorial {
my $number = shift;
return 1 if 0 == $number;
return $number * recursive_factorial( $number - 1 );
}
sub loop_factorial {
my $number = shift;
return 1 if 0 == $number or 1 == $number;
my $factorial = 1;
for ( 2 .. $number ) {
$factorial *= $_;
}
return $factorial;
}
timethese(
1_000_000,
{
'recursive' => sub { recursive_factorial(15) },
'loop' => sub { loop_factorial(15) },
}
);
Note
example_18_9_factorial.pl available for download at Wrox.com.
So we’re going to run our recursive and loop versions of factorial one million times each, computing the factorial of 15. Here’s the output from this on my computer (reformatted slightly to fit the book):
Benchmark: timing 1000000 iterations of loop, recursive...
loop: 2 wallclock secs (1.92 CPU) @ 520833.33/s (n=1000000)
recursive: 7 wallclock secs (6.66 CPU) @ 150150.15/s (n=1000000)
As you can see, the loop version of the factorial function is more than three times as fast as the recursive version. You might be happy with that, but can we go faster? Sure we can.
The factorial function is what’s known as a pure function. A pure function has no side effects (such as deleting files or altering global variables) and always returns the same output for the same input. Pure functions are great candidates for caching, so let’s cache the factorial and return the cache if it’s found. This can take a bit more code, but for very “hot” pieces of code (code that gets run a lot), it can be worth the effort:
{
my %factorial_for;
sub cached_factorial {
my $number = shift;
unless (exists $factorial_for{$number}) {
if ( 0 == $number or 1 == $number ) {
$factorial_for{$number} = 1;
}
else {
my $factorial = 1;
for ( 2 .. $number ) {
$factorial *= $_;
}
$factorial_for{$number} = $factorial;
}
}
return $factorial_for{$number};
}
}
Or, if you’re using Perl version 5.10.0 or better:
use 5.10.0;
sub cached_factorial {
state %factorial_for;
my $number = shift;
unless (exists $factorial_for{$number}) {
if ( 0 == $number or 1 == $number ) {
$factorial_for{$number} = 1;
}
else {
my $factorial = 1;
for ( 2 .. $number ) {
$factorial *= $_;
}
$factorial_for{$number} = $factorial;
}
}
return $factorial_for{$number};
}
And we add this to our timethese function:
timethese(
1_000_000,
{
'recursive' => sub { recursive_factorial(15) },
'loop' => sub { loop_factorial(15) },
'cached' => sub { cached_factorial(15) },
}
);
And here are our results:
Benchmark: timing 1000000 iterations of cached, loop, recursive...
cached: 0 wallclock secs (0.47 CPU) @ 2127659.57/s (n=1000000)
loop: 3 wallclock secs (1.96 CPU) @ 510204.08/s (n=1000000)
recursive: 7 wallclock secs (6.60 CPU) @ 151515.15/s (n=1000000)
Note that the wallclock time (the amount of time it took from the user’s perspective) is rounded off, but when you look at the number after the @ sign, you see that the cached version executed over two million times per second while the loop version only ran about half a million times per second, so the cached version is roughly four times faster than the loop and fourteen times faster than our recursive function. Sometimes more lines of code run faster than fewer!
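If you would rather not write the caching by hand, the core Memoize module can wrap an existing pure function in a cache for you. This is a different technique from the hand-rolled %factorial_for cache shown above, and we have not benchmarked it here, so treat the following as a minimal sketch rather than a drop-in replacement.

```perl
use strict;
use warnings;
use Memoize;

sub loop_factorial {
    my $number    = shift;
    my $factorial = 1;
    $factorial *= $_ for 2 .. $number;    # empty range for 0 and 1
    return $factorial;
}

# Replace loop_factorial() with a wrapper that caches results by argument.
memoize('loop_factorial');

print loop_factorial(15), "\n";    # computed on the first call
print loop_factorial(15), "\n";    # served from the cache afterwards
```

Memoize adds its own overhead on every call, so it is worth benchmarking against the hand-written cache before assuming it is the faster choice.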
When benchmarking code, it’s important to remember a few things:
There’s usually no point in benchmarking code before you’ve profiled your program.
Always make sure that every version you’re benchmarking behaves identically.
Run your benchmark several times. Other processes running on your system can interfere with benchmarks.
If your faster code is too complicated to understand, is it really worth it?
When it’s fast enough, stop benchmarking!
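Two of these points can be sketched in code: checking that the versions agree before timing them, and using cmpthese() from Benchmark, which prints a comparison chart instead of raw timings. The range 0 .. 15 used for the check is our own assumption about which inputs matter.

```perl
use strict;
use warnings;
use Benchmark 'cmpthese';

sub recursive_factorial {
    my $number = shift;
    return 1 if 0 == $number;
    return $number * recursive_factorial( $number - 1 );
}

sub loop_factorial {
    my $number    = shift;
    my $factorial = 1;
    $factorial *= $_ for 2 .. $number;
    return $factorial;
}

# Make sure both versions behave identically before timing them.
for my $n ( 0 .. 15 ) {
    die "Mismatch for $n"
      unless recursive_factorial($n) == loop_factorial($n);
}

# A negative count means "run each version for at least that many
# CPU seconds" rather than a fixed number of iterations.
cmpthese(
    -1,
    {
        recursive => sub { recursive_factorial(15) },
        loop      => sub { loop_factorial(15) },
    }
);
```

The chart that cmpthese() prints shows the rate for each version and the percentage difference between them, which is often easier to read than the timethese() output.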
Perl::Critic
Understanding how much of your code is covered by tests and how well your code performs is great, but how do you know you’ve written good code? Perl::Critic is a highly configurable static analysis tool that can “read” your Perl code and, while it won’t tell you if the code is any good, it can tell you where problem spots lie in your code.
Perl::Critic applies policies to your code and analyzes each file to determine if it violates the policies. These policies can have one of five levels of severity, from gentle (level 5) to brutal (level 1). The default policies that ship with Perl::Critic are mostly derived from the book Perl Best Practices, written by Damian Conway. Some of the policy violations seem a bit out of date (such as RCS keywords $Id$ not found, a reference to older version control systems), whereas others will catch potentially serious issues with your code ("return" statement followed by "sort" at line 6, column 5. Behavior is undefined if called in scalar context.). Perl::Critic is not limited to the Perl Best Practices. You can write your own policies and there are many other policies on the CPAN for you to download and apply. You can even create a .perlcriticrc file, explained in perldoc Perl::Critic (the module) and perldoc perlcritic (the command line tool).
Two common ways of using the perlcritic tool are to pass it a filename or a directory:
perlcritic some_program.pl
perlcritic lib/
By default, Perl::Critic runs in “gentle” mode and only reports on the most severe violations, or ones that are likely to cause your program issues. So let’s run this on the example_18_9_factorial.pl benchmarking program we wrote earlier in this chapter:
$ perlcritic example_18_9_factorial.pl
example_18_9_factorial.pl source OK
That’s great. We have no serious violations here. That’s equivalent to:
$ perlcritic --gentle example_18_9_factorial.pl
$ perlcritic -5 example_18_9_factorial.pl
Let’s kick it up to the --stern level (reformatted slightly):
$ perlcritic -4 example_18_9_factorial.pl
Code not contained in explicit package at line 1, column 1.
Violates encapsulation. (Severity: 4)
Module does not end with "1;" at line 46, column 1.
Must end with a recognizable true value. (Severity: 4)
In this case we see that we haven’t started our code with a package name and it doesn’t end with a 1 on the last line as we would expect a module to end. For each violation, the output describes the problem, gives the line and column where it was found, briefly explains why the policy matters, and includes the severity level.
However, those policies are for modules and this is just a simple script and we don’t care about those, so let’s exclude them. The --exclude parameter takes a regular expression as its argument and any policies whose names match that pattern will be excluded:
$ perlcritic -4 --exclude 'package|module' example_18_9_factorial.pl
example_18_9_factorial.pl source OK
Next is the --harsh level, or level -3.
$ perlcritic -3 --exclude 'package|module' example_18_9_factorial.pl
example_18_9_factorial.pl source OK
So far so good. Now for the --cruel level, or level -2:
$ perlcritic -2 --exclude 'package|module' example_18_9_factorial.pl
RCS keywords $Id$ not found at line 1, column 1.
See page 441 of PBP. (Severity: 2)
RCS keywords $Revision$, $HeadURL$, $Date$ not found at line 1, column 1.
See page 441 of PBP. (Severity: 2)
RCS keywords $Revision$, $Source$, $Date$ not found at line 1, column 1.
See page 441 of PBP. (Severity: 2)
"unless" block used at line 28, column 9.
See page 97 of PBP. (Severity: 2)
1_000_000 is not one of the allowed literal values (0, 1, 2). Use the Readonly or Const::Fast module or the "constant" pragma instead at line 45, column 5.
Unnamed numeric literals make code less maintainable. (Severity: 2)
(Note that we’ve omitted a couple of violations for the sake of brevity).
The RCS keywords violations are references to older source control management systems (used to keep track of changes in your source code) such as CVS or Subversion. They would substitute certain values in place of RCS keywords for some automated documentation. Your author uses a program called git to handle this, so these aren’t relevant to him. Also notice the text See page 441 of PBP. In lieu of an explanation of the importance of these violations, you are referred to page 441 of the Perl Best Practices book.
The unless block used violation is actually a valid concern. Many developers get confused by unless blocks because they can make straightforward logic a bit of a nightmare:
unless ( $foo || $bar ) {
...
}
Even the most experienced developers might be tripped up by the above code. It only runs if both $foo and $bar are false, so maybe the perlcritic violation has pointed out something about my code that might make it harder to maintain.
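To convince yourself of that reading, the following sketch compares the unless form with an equivalent if form over every combination of true and false values. The helper names with_unless() and with_if() are hypothetical and exist only for this comparison.

```perl
use strict;
use warnings;

# Hypothetical helpers that return 1 when their block would run.
sub with_unless {
    my ( $foo, $bar ) = @_;
    unless ( $foo || $bar ) { return 1 }
    return 0;
}

sub with_if {
    my ( $foo, $bar ) = @_;
    if ( !$foo && !$bar ) { return 1 }
    return 0;
}

# Print the result of both forms for all four combinations.
for my $foo ( 0, 1 ) {
    for my $bar ( 0, 1 ) {
        printf "foo=%d bar=%d unless=%d if=%d\n",
          $foo, $bar, with_unless( $foo, $bar ), with_if( $foo, $bar );
    }
}
```

The two columns always agree, and only the row where both values are false reports 1. Many readers find the if form easier to follow at a glance, which is the point of the policy.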
Usually you want to be warned about this but you don’t think it’s a problem in this code, so you’ll annotate the source code to tell Perl::Critic not to worry about this. You need to read perldoc Perl::Critic::PolicySummary to understand what policy you’ve violated:
unless (exists $factorial_for{$number}) { ## no critic 'ProhibitUnlessBlocks'
Or you can add the --statistics switch to get a summary at the end, including the formal names of the policies you’ve violated:
1 files.
3 subroutines/methods.
41 statements.
51 lines, consisting of:
7 blank lines.
0 comment lines.
0 data lines.
44 lines of Perl code.
0 lines of POD.
Average McCabe score of subroutines was 4.00.
13 violations.
Violations per file was 13.000.
Violations per statement was 0.317.
Violations per line of code was 0.255.
2 severity 4 violations.
9 severity 2 violations.
2 severity 1 violations.
1 violations of CodeLayout::ProhibitTrailingWhitespace.
1 violations of CodeLayout::RequireTidyCode.
1 violations of ControlStructures::ProhibitUnlessBlocks.
3 violations of Miscellanea::RequireRcsKeywords.
1 violations of Modules::RequireEndWithOne.
1 violations of Modules::RequireExplicitPackage.
1 violations of Modules::RequireVersionVar.
4 violations of ValuesAndExpressions::ProhibitMagicNumbers.
In this case, we can see that it was ControlStructures::ProhibitUnlessBlocks that we have violated, but just the last part of the name is required when you add an annotation to your code telling Perl::Critic to ignore the issue.
Let’s look at the next violation:
1_000_000 is not one of the allowed literal values (0, 1, 2). Use the Readonly or Const::Fast module or the "constant" pragma instead at line 45, column 5. Unnamed numeric literals make code less maintainable. (Severity: 2)
This one is certainly a problem. If you’re going to hard-code literal values in your code, it’s better to declare them at the top of your code and use a descriptive name:
use constant NUMBER_OF_TIMES_TO_RUN => 1_000_000;
Not only does this help to document your code, it makes it easier to find all of the values in your code that are more likely to need to change at a later date.
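Applied to a cut-down version of the benchmark program, the change might look like the sketch below. The smaller iteration count is our own choice to keep the illustration quick; the original program uses 1_000_000.

```perl
use strict;
use warnings;
use Benchmark 'timethese';

# A named constant replaces the bare numeric literal in the call below.
# (Assumption: a smaller count than the original, for a quicker run.)
use constant NUMBER_OF_TIMES_TO_RUN => 100_000;

sub loop_factorial {
    my $number    = shift;
    my $factorial = 1;
    $factorial *= $_ for 2 .. $number;
    return $factorial;
}

timethese(
    NUMBER_OF_TIMES_TO_RUN,
    {
        loop => sub { loop_factorial(15) },
    }
);
```

Note the plain comma after the constant in the timethese() call. A fat comma (=>) would quote the name on its left and pass the string NUMBER_OF_TIMES_TO_RUN instead of the number.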
Right now, we’ve seen a few cases where there are policies we don’t like. Perhaps we want to disable them globally. You can create a .perlcriticrc file in your home directory (or a custom one in your code directory) with the following contents:
exclude = RequireRCSKeywords RequireTidyCode RequireFinalReturn

[TestingAndDebugging::RequireUseStrict]
equivalent_modules = Dancer

[TestingAndDebugging::RequireUseWarnings]
equivalent_modules = Dancer
The exclude = line turns off several policies that I don’t want (obviously, this is very subjective). The TestingAndDebugging::RequireUseStrict and TestingAndDebugging::RequireUseWarnings sections tell Perl::Critic that we don’t require strict and warnings when using the Dancer module. (Dancer is a Web framework and using it turns on strictures and warnings for you.)
We can use our .perlcriticrc to include new policies we have created, change the default warning level and do many other things. Perl::Critic can be an excellent tool for making sure your coding standards are met.
Summary
In this chapter you learned about numerous small tasks that, while not core Perl, are nonetheless common enough, and tricky enough, that it’s worth having a basic exposure to them. You’ve learned about reading and writing CSV files, different ways of handling XML and a bit more about dates and times.
You’ve also learned about a variety of very useful tools, such as the debugger, that help you better understand how your programs behave. You’ve learned a bit about Devel::Cover, code that can tell you how well your test suites cover your code base. You’ve learned to use Devel::NYTProf to uncover performance problems in your code and to use the Benchmark module to test whether alternative implementations are actually faster.
Finally, you’ve been exposed to Perl::Critic, a tool that lets you uncover potential problems in your code.
Exercises
1. Describe at least three potential problems with the following code to read a CSV file:
open my $fh, '<', $file
or die "Cannot open $file for reading: $!";
while ( my $line = <$fh> ) {
my ( $name, $rank, $notes ) = split /,/ => $line;
print <<"END";
Name: $name
Rank: $rank
Notes: $notes
END
}
2. Why might you use DateTime::Tiny instead of the DateTime module? List some strengths and weaknesses of each.
3. Why should you use Devel::NYTProf? When should you not use it? What are some of the problems with aggressively optimizing your code for performance?
4. Type in the following program:
use Getopt::Long;
my $name = Nobody;
my $times = 3;
GetOptions(
'name=s' => \$name,
'times=i' => \$times,
) or die;
hello( $name, $times );
sub hello {
for ( 1 .. $_[1] ) {
print "$_[0]\n";
}
}
The program is correct and does what is intended. Try running the program both with and without arguments if you’re unsure of what it’s doing. Then run this command for the “gentle” warnings from perlcritic.
perlcritic -5 program.pl
Now run perlcritic -1 program.pl and read the violations. How do they differ? Do you agree or disagree with what Perl::Critic is reporting?
5. Make the Perl::Critic violations reported in Exercise 4 go away. You may wish to read the Perl::Critic documentation to fix some of these issues. See --profile in perldoc perlcritic for a useful start. You may wish to run perlcritic with --statistics to see the full names of the policy violations.
WHAT YOU LEARNED IN THIS CHAPTER
| TOPIC | DESCRIPTION |
|---|---|
| | A Perl module to handle correctly reading and writing CSV data. |
| | A simple but inflexible method of reading and writing XML data. |
| | An excellent XML parsing module. |
| | A useful module for writing correct XML. |
| DateTime | A full-featured date and time manipulation/presentation module. |
| Date::Tiny | A minimalistic date object. Good when you don’t need date math. |
| DateTime::Tiny | Like Date::Tiny, but for dates and times. |
| Perl Debugger | Used to run Perl programs in debug mode and understand their behavior. |
| Devel::Cover | A module used to tell you what code is covered by your test suite. |
| Devel::NYTProf | A module used to profile your program and identify slow code. |
| Benchmark | A module used to compare the performance characteristics of different versions of code. |
| Perl::Critic | A code analysis module that allows you to find possible problems in your code. |
| perlcritic | The command line interface to Perl::Critic. |
Answers to exercises
1. Describe at least three potential problems with the following code to read a CSV file:
open my $fh, '<', $file
or die "Cannot open $file for reading: $!";
while ( my $line = <$fh> ) {
my ( $name, $rank, $notes ) = split /,/ => $line;
print <<"END";
Name: $name
Rank: $rank
Notes: $notes
END
}
Commas might be embedded in quotes, breaking the split on commas.
Newlines might be embedded in quotes, causing the filehandle read to return only part of a record.
Quotes are used only to quote columns with special characters and are not part of the data. The program above does not remove them.
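To see the first and third problems in action, here is a small sketch that runs the same split on a single hypothetical record (the names and values are made up for illustration):

```perl
use strict;
use warnings;

# Hypothetical record: the first field has a comma inside the quotes,
# and the last field contains doubled (escaped) double quotes.
my $line = qq{"Public, John",Private,"Says ""hello"""\n};

my @fields = split /,/ => $line;

# A correct CSV parser would find three fields here. The naive split
# finds four, and the quotes and trailing newline stay in the data.
printf "%d fields\n", scalar @fields;
print "[$_]\n" for @fields;
```

The name is cut in half at the embedded comma, and none of the quote characters are stripped or unescaped.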
2. Why might you use DateTime::Tiny instead of the DateTime module? List some strengths and weaknesses of each.
DateTime::Tiny would be used when you only need a simple date and time object. It can tell you its day, month, hour, and so on. It’s also easy to print as a string. It’s very lightweight and very fast. However, it does not support date comparisons or other forms of date manipulation. It can be inflated to a DateTime object.
The DateTime module is the most complete date and time manipulation solution available on the CPAN. It’s extremely comprehensive and flexible (including excellent handling of time zones, though those were not discussed in the chapter) and allows you to compare dates and times and do simple date math. Unfortunately, the module is very large and slow to load and often provides more functionality than a simple program might need.
3. Why should you use Devel::NYTProf? When should you not use it? What are some of the problems with aggressively optimizing your code for performance?
You should generally run Devel::NYTProf when your code is running slowly and you need to figure out why. If your code is running fast enough, running Devel::NYTProf can be interesting, but it can also be a distraction when you have other tasks that you need to accomplish. When your program runs fast enough, you should consider leaving it alone and not falling prey to the endless tweaking that so many programmers are prone to. Further, over-optimizing your code can sometimes make it harder to read. Clean, simple code tends to be easier to maintain and often has fewer bugs than heavily optimized but obscure code.
4. Type in the following program:
use Getopt::Long;
my $name = Nobody;
my $times = 3;
GetOptions(
'name=s' => \$name,
'times=i' => \$times,
) or die;
hello( $name, $times );
sub hello {
for ( 1 .. $_[1] ) {
print "$_[0]\n";
}
}
The program is correct and does what is intended. Try running the program both with and without arguments if you’re unsure of what it’s doing. Then run this command for the “gentle” warnings from perlcritic.
perlcritic -5 program.pl
Now run perlcritic -1 program.pl and read the violations. How do they differ? Do you agree or disagree with what Perl::Critic is reporting?
Running perlcritic -5 program.pl produces the following output:
Code before strictures are enabled at line 3, column 1. See page 429 of PBP. (Severity: 5)
Running perlcritic -1 program.pl produces the following:
RCS keywords $Id$ not found at line 1, column 1. See page 441 of PBP. (Severity: 2)
RCS keywords $Revision$, $HeadURL$, $Date$ not found at line 1, column 1. See page 441 of PBP. (Severity: 2)
RCS keywords $Revision$, $Source$, $Date$ not found at line 1, column 1. See page 441 of PBP. (Severity: 2)
Code not contained in explicit package at line 1, column 1. Violates encapsulation. (Severity: 4)
No package-scoped "$VERSION" variable found at line 1, column 1. See page 404 of PBP. (Severity: 2)
Code before strictures are enabled at line 3, column 1. See page 429 of PBP. (Severity: 5)
Code before warnings are enabled at line 3, column 1. See page 431 of PBP. (Severity: 4)
3 is not one of the allowed literal values (0, 1, 2). Use the Readonly or Const::Fast module or the "constant" pragma instead at line 4, column 13. Unnamed numeric literals make code less maintainable. (Severity: 2)
"die" used instead of "croak" at line 9, column 6. See page 283 of PBP. (Severity: 3)
Module does not end with "1;" at line 13, column 1. Must end with a recognizable true value. (Severity: 4)
Always unpack @_ first at line 13, column 1. See page 178 of PBP. (Severity: 4)
Subroutine "hello" does not end with "return" at line 13, column 1. See page 197 of PBP. (Severity: 4)
Return value of flagged function ignored - print at line 15, column 9. See pages 208,278 of PBP. (Severity: 1)
The RCS keywords violation makes little sense if you are not using external tools (such as CVS or Subversion) that support RCS keywords.
Issues such as not using strict or warnings are generally agreed to be problematic, but some of the reported issues (such as not having a package name) don’t appear relevant to scripts. Others, such as die instead of croak, don’t make much sense in this context.
The “3 is not one of the allowed literal values” violation, with its suggestion to replace the value with a read-only constant, is clearly not applicable here, where we want a default value that can change.
Unpacking @_ is almost always good advice unless this is very hot code for which maximum performance is critical.
Finally, we have a curious combination: one violation for not ending the subroutine with a return statement and another for ignoring the value returned by print.
Interesting that you can get more violations reported than there are lines of code, eh?
5. Make the Perl::Critic violations reported in Exercise 4 go away. You may wish to read the Perl::Critic documentation to fix some of these issues. See --profile in perldoc perlcritic for a useful start. You may wish to run perlcritic with --statistics to see the full names of the policy violations.
First, we’ll create a perlcriticrc file specifically for scripts. Save this as perlcriticscripts:
exclude = RequireRCSKeywords

[-Modules::RequireExplicitPackage]
[-Modules::RequireEndWithOne]
[-Modules::RequireVersionVar]
Here’s one way to rewrite the program to make it pass the strictest level:
use strict;
use warnings;
use Getopt::Long;
sub hello {
my ( $name, $times ) = @_;
for ( 1 .. $times ) {
print "$name\n"; ## no critic 'RequireCheckedSyscalls'
}
return;
}
my $name = 'Nobody';
my $times = 3; ## no critic 'ProhibitMagicNumbers'
GetOptions(
'name=s' => \$name,
'times=i' => \$times,
) or die; ## no critic 'RequireCarping'
hello( $name, $times );
You can then verify this works with:
$ perlcritic -1 --profile perlcriticscripts my_program.pl
my_program.pl source OK
Perl::Critic violations are often very subjective and you may feel that they are not suitable for your needs. That’s OK, but make sure you understand why Perl::Critic is warning about the issues it finds. If you don’t understand the violation, you may very well be writing problematic code without realizing it.







