Chapter 10. Sort, map and grep
WHAT YOU WILL LEARN IN THIS CHAPTER
Sorting lists alphabetically and numerically
Creating custom sorts with sort subroutines
Using grep to efficiently filter lists
Using map to efficiently transform lists
Understanding common traps when sorting and filtering lists
Combing map, sort and grep to create powerful list manipulations
By this time, you have a sufficient understanding of Perl that you’re very likely to be able to use it for small tasks in relation to your day-to-day work. However, there’s an odd sort of “litmus test” for Perl developers. For some reason, really understanding sort, map and grep seems to be the difference between beginner and intermediate Perl developers. Once you cross this threshold, you’re well on your way to being a Perl expert.
Though we’ve touched on sort, map and grep briefly, we have deliberately kept their usage simple. Now we’ll see a bit more about their full power.
The one thing to remember is that each of these creates a new list from an old list.
Basic sorting
The sort builtin sorts a list and returns a new list. It has three forms:
sort LIST
sort BLOCK LIST
sort SUBNAME LIST
As an example of each:
@passengers = sort @passengers;
@passengers = sort { $a->{age} <=> $b->{age} } @passengers;
@passengers = women_and_children_first @passengers;Sorting alphabetically
The simplest sort in Perl is this:
my @list = sort qw(this is a list);
print “@list”;
That will print out the Yoda-esque “a is list this”. By default, sort will sort items with a string comparison. Well, actually it will sort items via a numeric comparison of the string’s code point, from lower to higher values. So the following line:
# declare our source code as UTF-8 use utf8; # and we're printing UTF-8 binmode STDOUT, 'encoding(UTF-8)'; print join ' ', sort qw/b 日 aa 国 1 本 a A/;
Will print:
1 A a aa b 国 日 本
Note
If you’re unsure of why Perl sorts in this order, you can covert each character to its UTF-8 code point with this (obviously, we’ve left out the aa for this example):
use utf8::all;
print join ' ', map { as_code_point($_) } sort qw/b 日 国 1 本 a A/;
sub as_code_point {
my $char = shift;
die "Only characters!" if length($char) > 1;
return "U+" . uc sprintf "%04x", ord $char;
}And that prints out:
U+0031 U+0041 U+0061 U+0062 U+56FD U+65E5 U+672C
As you can see, by default Perl’s sort will sort characters in ascending order by their numerical ord() value. If you don’t understand the map, don’t worry. We’ll explain it carefully later in this chapter.
If you want to see the decimal value of the numbers, use this:
use utf8::all; print join ' ', map ord, sort qw/b 日 国 1 本 a A/;
Sorting Numerically
Perl’s default sort will (more or less) sort strings as characters. This means that if you do this:
print join "\n", sort qw/1 9 10 99 222/;
You get this:
1 10 222 9 99
You probably meant to sort your numbers numerically. In this case, we provide a sort block:
print join "\n", sort { $a <=> $b } qw/1 9 10 99 222/;And now we get the correct sort order:
1 9 10 99 222
In the form sort BLOCK LIST, Perl will iterate over the pairs of items in the list and set the special package variables $a and $b to each element in the list in turn. The <=> operator (sometimes called the spaceship operator) was covered in Chapter 4, Working With data. In this case, it tells Perl to compare $a and $b as numbers instead of strings.
Warning
The $a and $b variables are special package variables. Do not declare them with my or else your sort routines are likely to break.
Reverse sorting
Many times you want a list reversed. You could do this:
my @reversed_names = reverse sort @names;
That reads clearly, but for larger lists, this is inefficient. It sorts the list into ascending order and then reverses it into descending order. Why not just sort directly into reverse descending order? In this case, you can use a sort block and swap the $a and $b variables.
my @reversed_names = sort { $b cmp $a } @names;
This works when sorting numbers in descending order, too:
my @descending = sort { $b <=> $a } @numbers;Complex sort conditions
When sorting on a single value, sorting is fairly straightforward, but if you need to sort on multiple values, you need to use a sort block or sort subroutine. Example 10.1, “Complex sorting in Perl” has an example of complex sorting.
Example 10.1. Complex sorting in Perl
use strict;
use warnings;
use diagnostics;
my @employees = (
{
name => 'Sally Jones',
years => 4,
payscale => 4,
},
{
name => 'Abby Hoffman',
years => 1,
payscale => 10,
},
{
name => 'Jack Johnson',
years => 4,
payscale => 5,
},
{
name => 'Mr. Magnate',
years => 12,
payscale => 1,
},
);
@employees =
sort {
$b->{years} <=> $a->{years}
||
$a->{payscale} <=> $b->{years}
}
@employees;
printf "Name Years Payscale\n";
foreach my $employee (@employees) {
printf "%-15s %2d %2d\n" => @{$employee}{qw/name years payscale/};
}Note
employee.pl available for download at Wrox.com.
Running Example 10.1, “Complex sorting in Perl” will print:
Name Years Payscale Mr. Magnate 12 1 Sally Jones 4 4 Jack Johnson 4 5 Abby Hoffman 1 10
The idea in this case is that we want to print a list of employees. They should be printed from the highest number of years in the company to the lowest. That’s our first sort condition:
$b->{years} <=> $a->{years}Note that the $b and the $a are reversed in order to provide a descending sort.
However, but Sally Jones and Jack Johnson have the same number of years with the company. The highest payscale is 1 and the lowest is 10 and in case there’s a tie, we want to print employees from highest to lowest payscale (in other words, from 1 to 10).
$a->{payscale} <=> $b->{years}You may remember that the <=> operator returns 0 (zero) if the two terms are equal, so we can use the || operator to sort by payscale if the employees have the same number of years with the company:
@employees =
sort {
$b->{years} <=> $a->{years}
||
$a->{payscale} <=> $b->{years}
}
@employees;What happens if the employees have the same number of years and the same payscale? Well, just throw in a sort by name:
@employees =
sort {
$b->{years} <=> $a->{years}
||
$a->{payscale} <=> $b->{years}
||
$a->{name} cmp $b->{name}
}
@employees;This looks like a lot of work, but the || operator short-circuits. That means since only one of the conditions is required to be true, as soon as one of the conditions evaluates as true, the subsequent conditions are not evaluated. It’s actually a fairly efficient sort.
Writing a sort subroutine
In handling complex sorts, you might find that this is a bit daunting:
@employees =
sort {
$b->{years} <=> $a->{years}
||
$a->{payscale} <=> $b->{years}
||
$a->{name} cmp $b->{name}
}
@employees;The fix is simple. Put that sort block into a subroutine and replace the block with the subroutine name:
sub by_seniority_then_pay_then_name {
$b->{years} <=> $a->{years}
||
$a->{payscale} <=> $b->{years}
||
$a->{name} cmp $b->{name}
}
@employees = sort by_seniority_then_pay_then_name @employees;When you have a complex sort condition, giving it a named sort subroutine improves readability quite a bit. As an added bonus, if you need to replicate a complex sort elsewhere, you already have the code handy.
Note
For those who are curious, Perl’s sort, by default, is stable. This means that if two values compare the same way, they will be returned in the same order they were originally found. Thus, if we left of the sorting by name condition in our employee sort, all employees with the same years and payscale would be guaranteed to be returned in order that they were in the original list. This is very useful, particularly if you have a list that is already partially sorted. This is far more common than you think.
Some people prefer to not use the $a and $b variables. They are not strictly required in the sort subroutine. If you want to use variables with names of your choosing (and not $a or $b) you’ll need to use a $$ prototype to force passing $a and $b to the sort sub for assignment to your variables:
sub by_seniority_then_pay_then_name($$) {
my ( $employee1, $employee2 ) = @_;
$employee2->{years} <=> $employee1->{years}
||
$employee1->{payscale} <=> $employee2->{years}
||
$employee1->{name} cmp $employee2->{name}
}
@employees = sort by_seniority_then_pay_then_name @employees;Be aware that if you do this, the sort subroutine will be a bit slower because, as an optimization in Perl the $a and $b variables are automatically aliased by Perl when sort is encountered.
Sorting and Unicode Fun!
Why do we sort data? We could argue that there are two reasons for this:
Making it faster for computers to find data
Making it faster for humans to find data
If all we care about is to make it faster for computers to find data, then the default sort behavior is often fine. However, humans are an annoyingly troublesome lot. In Swedish, the letter z comes before the letter ö, but in German it’s the other way around. If you’re sorting data for display to people, they will complain very bitterly (and quite rightly) if they have trouble finding what they need because the sort order of the data is not what they expect, so you need to make sure that you’re sorting correctly for your target audience.
And here’s another fun example. Run the following code:
use utf8::all;
use charnames ":full";
print "\N{ANGSTROM SIGN}\n";
print "\N{LATIN CAPITAL LETTER A WITH RING ABOVE}\n";That prints out:
Å Å
Those are the Unicode code points U+212B and U+00C5, respectively, but for purposes of sorting or comparison they are supposed to be considered the same character. Further, Unicode has combining character to indicate that two symbols should be combined. This gives Unicode a great flexibility in representing different characters. Using the two code points above, along with an upper case A and a COMBINING RING ABOVE gives this code:
use charnames ':short';
binmode STDOUT, ':encoding(UTF-8)';
print "\N{U+212B}\n";
print "\N{U+00C5}\n";
print "\N{U+0041}\N{U+030A}\n";Which prints:
Å Å Å
Many computers will actually print those slightly differently, but they should look generally similar and for purposes of sorting and comparing (cmp), they must, as already stated, be considered the same character despite being different code points. We’re repeating ourselves because it’s important and there’s a good chance you’ll get it wrong. But don’t feel bad. Perl’s default sort builtin also gets this wrong.
Warning
Note that while Å (U+212B), Å (U+00C5) and Å (U+0041 U+030A) are considered to be identical characters, the fact that they look the same is an accident. Do not rely on a character’s appearance to decide whether or not two characters are the same.
So how do you get this right? Collation. Collation, for our purposes, is defining the correct order for data. Sorting, by contrast, is putting data into that correct order. The Unicode Collation Algorithm, described at http://www.unicode.org/reports/tr10/, tells you how to do properly collate Unicode data. Fortunately, the Unicode::Collate module was first included with Perl in version 5.7.3 and it implements the Unicode Collation Algorithm for you.
So as a general rule, sorting is handled correctly with the code shown in Example 10.2, “Using Unicode::Collate”.
Example 10.2. Using Unicode::Collate
use strict;
use warnings;
use diagnostics;
use utf8::all;
use Unicode::Collate;
my @apples = (
"\N{U+212B}pples",
"\N{U+00C5}pples",
"\N{U+0041}\N{U+030A}pples",
"apples",
"Apples",
);
my @bad = sort @apples;
my @sorted = Unicode::Collate->new->sort(@apples);
print "Original: @apples\n";
print "Sorted: @bad\n";
print "Collated: @sorted\n";Note
collate.pl available for download at Wrox.com.
Running collate.pl prints out:
Original: Åpples Åpples Åpples apples Apples Sorted: Apples Åpples apples Åpples Åpples Collated: apples Apples Åpples Åpples Åpples
You’ll notice that the second, Sorted:, line starts with Apples Åpples apples. Clearly that’s not right, but Perl’s default sort does not recognize the U+0041U+0030A as being combined. It merely sorts on the numeric value of the individual octets, leading to incorrect sorting.
Unicode::Collate is great, but you often need to sort according to a specific locale. In Perl, you can do this:
use locale; # but don't really do this
That tells Perl to use the proper sorting for your systems LC_COLLATE environment variable. Unfortunately, many programmers have been bitten by this because not all operating systems support this, nor are the locales guaranteed to be installed, ensuring that this method is not portable. Instead, use Unicode::Collate::Locale.
We previously mentioned that in Swedish, the letter z comes before the letter ö, but the sort order is reversed in German. Example 10.3, “Using Unicode::Collate::Locale to sort according to locale.” shows the use of Unicode::Collate::Locale to get the correct sort order.
Example 10.3. Using Unicode::Collate::Locale to sort according to locale.
use strict;
use warnings;
use utf8::all;
use Unicode::Collate::Locale;
my @letters = qw(z ö);
my @reversed = reverse @letters;
my $german = Unicode::Collate::Locale->new( locale => 'de_DE' );
my $swedish = Unicode::Collate::Locale->new( locale => 'sv_SE' );
foreach my $letters ( \@letters, \@reversed ) {
print "Original: @$letters\n";
my @german = $german->sort(@$letters);
my @swedish = $swedish->sort(@$letters);
print "German: @german\n";
print "Swedish: @swedish\n\n";
}Note
locale_sort.pl available for download at Wrox.com.
When you run Example 10.2, “Using Unicode::Collate”, you should see:
Original: z ö German: ö z Swedish: z ö Original: ö z German: ö z Swedish: z ö
Unicode::Collate::Locale will first released with Unicode::Collate version 0.55 in August of 2010, so you may need to install a newer version of Unicode::Collate from the CPAN.
Map and grep
Many times you want to transform or filter a list instead of (or in addition to) sorting the list. We’ll start with filtering the list first.
Using grep
We’ve used the grep builtin a few times in this book and we’ve deliberately kept the examples simple to make the basic use clear. However, we’ll pretend you’ve never heard of it just to give you a quick refresher. The grep builtin takes a list and produces another list of all values matching grep’s criteria. For example, to only use numbers greater than zero:
my @greater = grep { $_ > 0 } @numbers;The grep builtin takes two forms:
NEWLIST = grep BLOCK LIST; NEWLIST = grep EXPRESSION, LIST;
The first form, used in our code example above, is probably the most popular. We could have written our “greater than zero” filter as any of these three:
my @greater = grep { $_ > 0 } @numbers;
my @greater = grep $_ > 0, @numbers;
my @greater = grep( $_ > 0, @numbers );Note that grep BLOCK does not take a comma after the block while grep EXPRESSION does.
When using grep, we iterate over every element in the LIST, setting each element in turn to $_. The grep builtin will only return elements for which the BLOCK or EXPRESSION returns true. You can have arbitrarily complex expressions in the grep. To grab the palindromes from a list:
my @palindromes = grep { uc eq reverse uc } @words;Note
Your author debated quite a bit about writing this palindrome checker:
my @palindromes = grep { uc eq reverse uc } @words;Ignoring Unicode issues here (in some encodings, characters are different depending on their location in a word), it might seem “friendlier” to write the code like this:
my @palindromes = grep { uc($_) eq scalar reverse uc($_) }
@words;The reason the first version works is because uc operates on the $_ version by default. The scalar builtin is often used with reverse to force it to reverse a string, but the eq forces scalar context, rendering the scalar keyword redundant. While we recommend you use the longer form to avoid confusion, you need to get used to seeing the shorter forms, so they will be used from time to time.
Because a bare regex matches against $_, this is often seen in grep. To find words beginning with the vowels a, e, i, o or u:
my @starts_with_vowels = grep { /^[aeiou]/ } @words;Because grep returns a list, you can combine this with sort. To find all numbers greater than or equal to ten and return them sorted from lowest to highest:
my @numbers = ( 13, 3, −2, 7, 270, 19, −3.2, 10.1 );
my @result = sort { $a <=> $b } grep { $_ >= 10 } @numbers;
print join ', ', @result;And that prints out:
10.1, 13, 19, 270
When chaining list builtins like this, many people prefer to write them on separate lines to make things more clear:
my @result = sort { $a <=> $b }
grep { $_ >= 10 } @numbers;One thing to keep in mind when using list builtins like grep is that they operate on an entire list. Sometimes you’ll see code like this:
my @positive = grep { $_ > 0 } @numbers;
my $first = $positive[0];That can be very inefficient, particularly if you have a lot of @numbers. A for loop with last is better.
my $first;
for (@numbers) {
if ( $_ > 0 ) {
$first = $_;
last;
}
}That for loop terminates the search through @numbers on the first successful match, if any. Of course, if none of the @numbers are greater than zero, it’s not more efficient than the grep.
Using map
The map builtin, like the grep builtin, takes a list and returns a new list. It maps old values to new values. Its syntax is virtually identical to grep’s:
NEWLIST = map BLOCK LIST; NEWLIST = map EXPRESSION, LIST;
So to upper-case every word in a list:
my @UPPER = map { uc } @words;Like grep, map operators on every element in a list, so only use it if you want to transform an entire list. And like grep, because it returns a list, you can chain with grep and sort. So let’s say you have an array of numbers and you want to take the square roots of those numbers greater than zero:
my @roots = map { sqrt($_) }
grep { $_ > 0 } @numbers.Aliasing issues
One significant issue to be aware of with both map and grep is that when individual elements of the list are assigned to $_, they are aliased to the original value. This means that if you change the value of $_, you’re changing the value of the original item. Here’s a short program to demonstrate just how terribly wrong things can go if you’re not aware of this. We’ll use a lesser-known feature of Data::Dumper to name our variables in our output to make it easier to follow.
use Data::Dumper;
my @numbers = qw{ 1 2 3 4 5 };
my @incremented = map $_++, @numbers; # No!
print Data::Dumper->Dump(
[ \@numbers, \@incremented ],
[ '*numbers', '*incremented' ]
);And that prints out:
@numbers = (
'2',
'3',
'4',
'5',
'6'
);
@incremented = (
'1',
'2',
'3',
'4',
'5'
);What’s going on here? We haven’t incremented the numbers we intended and we incremented the original list we meant to leave alone!
Because $_ is an alias, the $_++ changes the original numbers, but because the ++ is the postfix version of the autoincrement operator, it changes the value after it’s been returned. Thus, our new list gets the old values and our old list gets the new values! Here is one way to write that the way you intended it:
my @incremented = map $_ + 1, @numbers;
Also, if you use the block for, you can localize the value of $_ in the block, but it’s beginning to look like an ugly hack.
my @incremented = map { local $_ = $_; ++$_ }, @numbers;Note
Many Perl developers are not aware of this alternate syntax for Data::Dumper and if they are, they often avoid is because it’s rather cumbersome to use. See perldoc Data::Dumper to understand it.
You author has released Data::Dumper::Names and Data::Dumper::Simple to the CPAN to make this easier to use. They behave slightly differently, so read their documentation to understand the differences.
Trying to do too much
At their core, sort, map and grep all take an existing list and return a new list. There’s nothing very complicated about this, but the sorting, filtering or transforming may be complicated. Keep them simple or else they’re hard to follow. We’ve already shown how to do with sort. We’ve taken this:
@employees= {
$b->{years} <=> $a->{years}
||
$a->{payscale} <=> $b->{years}
||
$a->{name} cmp $b->{name}
} @employees;And turned it into this:
@employees = sort by_seniority_then_pay_then_name @employees;
However, you may remember this rather complicated grep:
my %is_prime;
@primes = grep {
( exists $is_prime{$_} and $is_prime{$_} )
or
( $is_prime{$_} = is_prime($_) )
} @numbers;It was deliberately left complicated in the code so that we could show how some people will write a grep statement. Instead, you could push the caching into the is_prime() function:
{
my %is_prime;
sub is_prime {
my $number = $_[0];
return $is_prime{$number} if exists $is_prime{$number};
$is_prime{$number} = 0 if $number < 2;
$is_prime{$number} = 1 if $number == 2;
for ( 2 .. int sqrt($number) ) {
if ( !($number % $_) ) {
$is_prime{$number} = 0;
last;
}
}
$is_prime{$number} = 1 if !exists $is_prime{$number};
return $is_prime{$number};
}
}By doing that, our grep then becomes:
@primes = grep { is_prime($_) } @numbers;Not only is that much easier to read, any other code that now calls is_prime() gets to take advantage of the caching.
Also, the actual grep() is faster than it was originally. You can discover easier ways to verify that with the Benchmark module that we’ll cover in Chapter 18, Common tasks.
Trying To Be Clever
The sort, map and grep functions are the sort of functions which developers love to abuse. Sadly, many developers think being clever is more important than writing maintainable code and so they’ll write something like this:
my %person = (
id => 6,
name => 'The Prisoner',
profession => 'Ex-Spy',
status => 'Silent',
);
my $result = '';
foreach ( sort { $a ne 'id' } keys %person ) {
my $value = $person{$_};
$result .= sprintf "%-10s - $value\n", $_;
}
print $result;And that prints out:
id - 6 name - The Prisoner status - Silent profession - Ex-Spy
The order of name, status and profession may be different on your system, but id is guaranteed to be first. Why? Because of the strange sort { $a ne 'id' } construction. As you may recall, both the cmp and <=> operators return −1, 0 or 1 depending on whether the left value is less than, equal to or greater than the right value. The $a ne 'id' will return false (the empty string, in this case) which Perl will treat as 0 (zero) and $a ne $anything_else will return 1, ensuring that the value 'id' will always be sorted first.
This sort of cleverness can be fun if you’re just playing around, but for serious code, particularly in a production environment where someone may need to fix something quickly, please try to be clear. Here’s a better solution:
my $format = "%-10s - %s\n";
# always have id as the first line
my $result = sprintf $format, 'id', delete $person{id};
foreach ( keys %person ) {
my $value = $person{$_};
$result .= sprintf $format, $_, $value;
}Putting it all together
Now that we’ve covered sort, map, and grep, it’s time to put all of them together for some more advanced techniques. You won’t see these as often in your code, but when you need them, they’ll really help you out.
Schwartzian Transform (a.k.a., decorate, sort, undecorate)
The Schwartzian Transform is a famous technique named after the well-known Perl hacker Randal Schwartz. He explains this technique in a Unix Review column at http://www.stonehenge.com/merlyn/UnixReview/col64.html.
Basically, many sorting operations can be extremely expensive if we need to calculate the value to be sorted on. The Schwartzian Transform allows us to calculate once and only once a sort key, sort on it, and then strip the sort key. In other languages, this technique is sometimes known as decorate, sort, undecorate.
Let’s say that you have the following data in a file:
James|007|Spy Number 6|6|Ex-spy Agent 99|99|Spy with unknown name Napoleon Solo|11|Uncle spy Unknown|666|Maybe a spy
Except that it’s actually about 300,000 lines or so and you need to sort on the number between the pipes, such as |007|. So you try a straightforward sort like this:
my @sorted = sort by_id <>;
sub by_id {
$a =~ /\|(\d+)/;
my $a_id = $1;
$b =~ /\|(\d+)/;
my $b_id = $1;
return $a_id <=> $b_id;
}What you’ve done is use a regular expression to extract the number into a variable and then you return the variables compared with the spaceship operator. You’ve also been good and you’ve written a sort subroutine to make the sort operation easier to follow, but when you run this, you find it takes 7 or 8 seconds to run.
This might be fast enough for you. If this is a one-off script, you may not care about speed. If this is part of a process that runs once a night, you again may not care about speed. However, if you call this a lot, that extra time might be very problematic, so we’ll switch to the Schwartzian Transform.
my @sorted = map { $_->[0] } # undecorate
sort { $a->[1] <=> $b->[1] } # sort
map { /\|(\d+)/; [ $_, $1 ] } <>; # decorateNow, instead of 7 or 8 seconds, it takes about 2 seconds to sort 300,000 lines, but how does it work?
Note
If you don’t remember <>, the diamond operator, go back to Chapter 9, Files and Directories and reread “The Diamond Operator” section. This operator takes a bit of time to get used to.
Both sorts compare every number to every other number, but the naive sort has to use a regular expression to extract the id every time, even if it’s been previously extracted. With the Schwartzian Transform, we extract the number only once with the first map, known as the “decorate” step.
map { /\|(\d+)/; [ $_, $1 ] } <>; # decorateThe [ $_, $1 ] is an array reference which has the original value in the first element (remember, that’s index 0) and the sort key (the id, in this case) from $1 in the second element (index 1);
Then the second step, the sort, just sorts on element 1, the sort key:
sort { $a->[1] <=> $b->[1] } # sortFinally, the last map (undecorate), returns element 0, the original string.
my @sorted = map { $_->[0] } # undecorateBy not re-extracting the id for every comparison, we gain a significant speed improvement and that is the beauty of the Schwartzian Transform.
Note that the intermediate container doesn’t need to be an array. You can use a hash reference if you prefer.
my @sorted = map { $_->{original} }
sort { $a->{id} <=> $b->{id} }
map { /\|(\d+)/; { original => $_, id => $1 } } <>;Hash lookups are slightly slower than array lookups, but even the code above is twice as fast as the naive sort.
For the micro-optimization fans, you can shave another 5% to 10% of the time by using index instead of regular expressions.
my @sorted = map { $_->[0] }
sort { $a->[1] <=> $b->[1] }
map { my $i = 1 + index $_, "|";
my $length = index( $_, "|", $i ) - $i;>
[ $_, substr $_, $i, $length ]
} <>;How that works is an exercise left for the reader. You will find people trying to shave tiny amounts of time with code like this, but unless you can prove why you need to save that time, we strongly urge you not to write code this tricky. It’s a beast to maintain and will make your coworkers unhappy with your “cleverness”. If you must write that, document it well.
Guttman-Rosler Transform
This sort technique is very advanced and you can skip this section if you wish, but we include it for completeness.
When using sort, if you can eliminate the dereferencing in the sort block, you will have a faster sort. In the paper “A Fresh Look at Efficient Perl Sorting” (http://www.sysarch.com/Perl/sort_paper.html) by Uri Guttman and Larry Rosler, we have the Guttman-Rosler Transform introduced (though they did not call it that themselves).
my @sorted = map { substr $_, 4 }
sort
map { /\|(\d+)/; pack("A4", $1).$_ } <>;This looks strange and you’ll want to read perldoc perlpacktut to understand how pack works. The pack template, A4, creates a four octet ASCII string out of the id and we prepend that to the beginning of the string. As a result, the sort itself can rely on its standard sort behavior and the final map removes the packed data. On the computer your author is using to write this chapter, Table 10.1, “Sort Time Comparisons” shows the typical performance of each sort method on our test file with a 300,000 line file with random integers for ids:
Table 10.1. Sort Time Comparisons
Sort Method | Approximate Time |
|---|---|
Naïve sort | 7.5 seconds |
Schartzian Transform | 2.5 seconds |
Guttman-Rosler Transform | 1.5 seconds |
As you can see, the Guttman-Rosler transform is very fast, but it is rarely used as most Perl developers are not familiar with the pack function.
Summary
In this chapter you’ve learned about sort, map, and grep, three powerful functions that transform a list into a new list. You’ve learned a variety of sorting techniques and how to use sort subroutines. You’ve learned how to efficiently filter a list of values with grep and how to transform an old list of values into a new list of values with map. Finally, you’ve learned ways of combining these functions together to get powerful data manipulation techniques.
Exercises
Given the following list of hexadecimal numbers, print them in descending numeric order;
my @numbers = ( 0x23, 0xAA, 0xaa, 0x01, 0xfB );
Given a list of numbers, use
grepto return only the numbers that are perfect squares. Then print them in ascending numeric order.Assume the following list of numbers:
my @numbers = ( 28, 49, 1000, 4, 25, 49, 529 );
Write the
grepin bothBLOCKandEXPRESSIONform.NEWLIST = grep BLOCK LIST; NEWLIST = grep EXPRESSION, LIST;
What happens if one of the values in the
@numbersarray is actually the stringGet a job, hippy!? How would this change your code?Given the following list, write a
grepstatement that creates a new list, but without duplicate elements. There a several ways you can solve this.my @list = qw( bob sally Andromalius sally bob ned Andromalius );
Given the following array:
my @employees = ( { first_name => 'Sally', last_name => 'Jones', years => 4, payscale => 4, }, { first_name => 'Abby', last_name => 'Hoffman', years => 1, payscale => 10, }, { first_name => 'Jack', last_name => 'Johnson', years => 4, payscale => 5, }, { first_name => 'Mr.', last_name => 'Magnate', years => 12, payscale => 1, }, );
Use map to create a new array that looks like this:
my @names = ( 'Jack Johnson', 'Sally Jones', 'Mr. Magnate', );
Note that the names are sorted by last name, ascending, and we exclude employees who have been with the company a year or less.
WHAT YOU LEARNED IN THIS CHAPTER
Topic | Key Concepts |
|---|---|
sort | Used to order a list of data |
Sort subroutines | Making complex sorts easier to understand |
Sorting Unicode | Using the Unicode Collation Algorithm to sort different character sets |
grep | Only select elements of a list which pass desired criteria |
map | Transform every element of a list, making a new list |
map and sort traps | Aliasing, complexity and “clever” code are not your friends |
Complex sorts | Using map and sort together for more efficient sorting |
Answers to exercises
Given the following list of hexadecimal numbers, print them in descending numeric order;
my @numbers = ( 0x23, 0xAA, 0xaa, 0x01, 0xfB );
See
hex()andoct()in Chapter 4, Working With data if you need a refresher on the0x...syntax.Those are merely the hexadecimal representation of numbers, so a descending numeric sort is just:
my @sorted = sort { $b <=> $a } @numbers; print join ', ' => @sorted;And that will print:
251, 170, 170, 35, 1
Of course, you may not want to print the decimal values when the original numbers were in hexadecimal.
my @numbers = ( 0x23, 0xAA, 0xaa, 0x01, 0xfB ); my @sorted = sort { $b <=> $a } @numbers; print join ', ' => map { sprintf "0x%X", $_ } @sorted;And that prints:
0xFB, 0xAA, 0xAA, 0x23, 0x1
Given a list of numbers, use
grepto return only the numbers that are perfect squares. Then print them in ascending numeric order.Assume the following list of numbers:
my @numbers = ( 28, 49, 1000, 4, 25, 49, 529 );
Write the
grepin bothBLOCKandEXPRESSIONform.NEWLIST = grep BLOCK LIST; NEWLIST = grep EXPRESSION, LIST;
The
BLOCKform looks like this:my @numbers = ( 28, 49, 1000, 4, 25, 49, 529 ); my @squares = sort { $a <=> $b } grep { int(sqrt($_)) == sqrt($_) } @numbers; print join ', ' => @squares;What the
grepis doing is take the integer value of the square root and comparing it against the square root. So for the number9, we get3 == 3and that’s a perfect square. However, the square root of 1000 is reduced to something like31.6227766016838 == 31and that is clearly not true, so 1000 is not a perfect square.The
EXPRESSIONform looks like this:my @squares = sort { $a <=> $b } grep int(sqrt($_)) == sqrt($_), @numbers;It can also be written like this:
my @squares = sort { $a <=> $b } grep(int(sqrt($_)) == sqrt($_), @numbers_;What happens if one of the values in the
@numbersarray is actually the stringGet a job, hippy!? How would this change your code?There are several ways of handling this, but one way is to realize that perfect squares must be positive integers, so the following would do the trick:
my @squares = sort { $a <=> $b } grep { /^[0-9]+$/ && ( int(sqrt($_)) == sqrt($_) ) } @numbers;In other words, we use the regular expression
/^[0-9]+$/to guarantee that we only have digits.Given the following list, write a
grepstatement that creates a new list, but without duplicate elements. There a several ways you can solve this.my @list = qw( bob sally Andromalius sally bob ned Andromalius );
One solution:
my %seen; my @unique = grep { not $seen{$_}++ } @list;This is a moderately common idiom in Perl (though many people just use the
List::MoreUtils uniqfunction). Here’s how it works.The first time that
bobis encountered ingrep, we have this:not $seen{'bob'}++We know that
$seen{'bob'}must be false the first time it’s encountered because there is no entry in the%seenhash and it evaluates as this:not undef
And that evaluates as true, allow
grepto say “bob’s OK and we’ll pass him along”. However, the++autoincrement operator kicks in after Perl has returned the value. It sees theundefvalue, treats it as0(zero) and adds1to it.The second time that bob is encountered in the
grep,$seen{'bob'}has the value of1andnot 1is0(zero) and that evaluates as false. Thus,grepwill not return any value’s after they are seen for the first time.Given the following array:
my @employees = ( { first_name => 'Sally', last_name => 'Jones', years => 4, payscale => 4, }, { first_name => 'Abby', last_name => 'Hoffman', years => 1, payscale => 10, }, { first_name => 'Jack', last_name => 'Johnson', years => 4, payscale => 5, }, { first_name => 'Mr.', last_name => 'Magnate', years => 12, payscale => 1, }, );
Use map to create a new array that looks like this:
my @names = ( 'Jack Johnson', 'Sally Jones', 'Mr. Magnate', );
Note that the names are sorted by last name, ascending, and we exclude employees who have been with the company a year or less.
The map and sort looks like this:
my @names = map { "$_->{first_name} $_->{last_name}" }
sort { $a->{last_name} cmp $b->{last_name} }
grep { $_->{years} > 1 }
@employees;Unlike our other examples, we’re using all of sort, map and grep. The grep will filter the list before the sort because there is no point in sorting values that we will throw away (particularly if the list we’re sorting is large). The sort, of course, comes before the map because after the map, we no longer have our data structure and it would be harder to sort on that last name.
There comes a time when you, as a Perl developer, will feel very comfortable with these techniques. Here’s how it might be rewritten with a for loop. We’ll use a sort subroutine to make the sort clearer.
sub by_last_name { $a->{last_name} cmp $b->{last_name} }
my @names;
foreach my $employee ( sort by_last_name @employees ) {
next if $employee->{years} <= 1;
push @names, "$employee->{first_name} $employee->{last_name}";
}It’s entirely up to you (and the circumstances of your code) which method you find cleaner and easier to maintain.





Add a comment



Add a comment