9781118013847
sort_comma_map_and_grep.html

Chapter 10. Sort, map and grep

WHAT YOU WILL LEARN IN THIS CHAPTER

  • Sorting lists alphabetically and numerically

  • Creating custom sorts with sort subroutines

  • Using grep to efficiently filter lists

  • Using map to efficiently transform lists

  • Understanding common traps when sorting and filtering lists

  • Combing map, sort and grep to create powerful list manipulations

By this time, you have a sufficient understanding of Perl that you’re very likely to be able to use it for small tasks in relation to your day-to-day work. However, there’s an odd sort of “litmus test” for Perl developers. For some reason, really understanding sort, map and grep seems to be the difference between beginner and intermediate Perl developers. Once you cross this threshold, you’re well on your way to being a Perl expert.

Though we’ve touched on sort, map and grep briefly, we have deliberately kept their usage simple. Now we’ll see a bit more about their full power.

The one thing to remember is that each of these creates a new list from an old list.

Basic sorting

The sort builtin sorts a list and returns a new list. It has three forms:

sort LIST
sort BLOCK LIST
sort SUBNAME LIST
As an example of each:
@passengers = sort                             @passengers;
@passengers = sort { $a->{age} <=> $b->{age} } @passengers;
@passengers = women_and_children_first         @passengers;

Sorting alphabetically

The simplest sort in Perl is this:

my @list = sort qw(this is a list);

print “@list”;

That will print out the Yoda-esque “a is list this”. By default, sort will sort items with a string comparison. Well, actually it will sort items via a numeric comparison of the string’s code point, from lower to higher values. So the following line:

# declare our source code as UTF-8
use utf8;
# and we're printing UTF-8
binmode STDOUT, 'encoding(UTF-8)';
print join ' ', sort qw/b 日 aa 国 1 本 a A/;

Will print:

1 A a aa b 国 日 本

Note

If you’re unsure of why Perl sorts in this order, you can covert each character to its UTF-8 code point with this (obviously, we’ve left out the aa for this example):

use utf8::all;
print join ' ', map { as_code_point($_) } sort qw/b 日 国 1 本 a A/;
sub as_code_point {
   my $char = shift;
   die "Only characters!" if length($char) > 1;
   return "U+" . uc sprintf "%04x", ord $char;
}

And that prints out:

U+0031 U+0041 U+0061 U+0062 U+56FD U+65E5 U+672C

As you can see, by default Perl’s sort will sort characters in ascending order by their numerical ord() value. If you don’t understand the map, don’t worry. We’ll explain it carefully later in this chapter.

If you want to see the decimal value of the numbers, use this:

use utf8::all;
print join ' ', map ord, sort qw/b 日 国 1 本 a A/;

Sorting Numerically

Perl’s default sort will (more or less) sort strings as characters. This means that if you do this:

print join "\n", sort qw/1 9 10 99 222/;

You get this:

1
10
222
9
99

You probably meant to sort your numbers numerically. In this case, we provide a sort block:

print join "\n", sort { $a <=> $b } qw/1 9 10 99 222/;

And now we get the correct sort order:

1
9
10
99
222

In the form sort BLOCK LIST, Perl will iterate over the pairs of items in the list and set the special package variables $a and $b to each element in the list in turn. The <=> operator (sometimes called the spaceship operator) was covered in Chapter 4, Working With data. In this case, it tells Perl to compare $a and $b as numbers instead of strings.

Warning

The $a and $b variables are special package variables. Do not declare them with my or else your sort routines are likely to break.

Reverse sorting

Many times you want a list reversed. You could do this:

my @reversed_names = reverse sort @names;

That reads clearly, but for larger lists, this is inefficient. It sorts the list into ascending order and then reverses it into descending order. Why not just sort directly into reverse descending order? In this case, you can use a sort block and swap the $a and $b variables.

my @reversed_names = sort { $b cmp $a } @names;
This works when sorting numbers in descending order, too:
my @descending = sort { $b <=> $a } @numbers;

Complex sort conditions

When sorting on a single value, sorting is fairly straightforward, but if you need to sort on multiple values, you need to use a sort block or sort subroutine. Example 10.1, “Complex sorting in Perl” has an example of complex sorting.

Example 10.1. Complex sorting in Perl

use strict;
use warnings;
use diagnostics;
my @employees = (
    {
        name     => 'Sally Jones',
        years    => 4,
        payscale => 4,
    },
    {
        name     => 'Abby Hoffman',
        years    => 1,
        payscale => 10,
    },
    {
        name     => 'Jack Johnson',
        years    => 4,
        payscale => 5,
    },
    {
        name     => 'Mr. Magnate',
        years    => 12,
        payscale => 1,
    },
);
@employees =
  sort {
    $b->{years}    <=> $a->{years}
    ||
    $a->{payscale} <=> $b->{years}
  }
  @employees;
printf "Name         Years Payscale\n";
foreach my $employee (@employees) {
    printf "%-15s %2d       %2d\n" => @{$employee}{qw/name years payscale/};
}

Note

employee.pl available for download at Wrox.com.

Running Example 10.1, “Complex sorting in Perl” will print:

Name         Years Payscale
Mr. Magnate     12        1
Sally Jones      4        4
Jack Johnson     4        5
Abby Hoffman     1       10

The idea in this case is that we want to print a list of employees. They should be printed from the highest number of years in the company to the lowest. That’s our first sort condition:

$b->{years} <=> $a->{years}

Note that the $b and the $a are reversed in order to provide a descending sort.

However, but Sally Jones and Jack Johnson have the same number of years with the company. The highest payscale is 1 and the lowest is 10 and in case there’s a tie, we want to print employees from highest to lowest payscale (in other words, from 1 to 10).

$a->{payscale} <=> $b->{years}

You may remember that the <=> operator returns 0 (zero) if the two terms are equal, so we can use the || operator to sort by payscale if the employees have the same number of years with the company:

@employees =
  sort {
    $b->{years}    <=> $a->{years}
    ||
    $a->{payscale} <=> $b->{years}
  }
  @employees;

What happens if the employees have the same number of years and the same payscale? Well, just throw in a sort by name:

@employees =
  sort {
    $b->{years}    <=> $a->{years}
    ||
    $a->{payscale} <=> $b->{years}
    ||
    $a->{name}     cmp $b->{name}
  }
  @employees;

This looks like a lot of work, but the || operator short-circuits. That means since only one of the conditions is required to be true, as soon as one of the conditions evaluates as true, the subsequent conditions are not evaluated. It’s actually a fairly efficient sort.

Writing a sort subroutine

In handling complex sorts, you might find that this is a bit daunting:

@employees =
  sort {
    $b->{years}    <=> $a->{years}
    ||
    $a->{payscale} <=> $b->{years}
    ||
    $a->{name}     cmp $b->{name}
  }
  @employees;

The fix is simple. Put that sort block into a subroutine and replace the block with the subroutine name:

sub by_seniority_then_pay_then_name {
    $b->{years}    <=> $a->{years}
    ||
    $a->{payscale} <=> $b->{years}
    ||
    $a->{name}     cmp $b->{name}
}
@employees = sort by_seniority_then_pay_then_name @employees;

When you have a complex sort condition, giving it a named sort subroutine improves readability quite a bit. As an added bonus, if you need to replicate a complex sort elsewhere, you already have the code handy.

Note

For those who are curious, Perl’s sort, by default, is stable. This means that if two values compare the same way, they will be returned in the same order they were originally found. Thus, if we left of the sorting by name condition in our employee sort, all employees with the same years and payscale would be guaranteed to be returned in order that they were in the original list. This is very useful, particularly if you have a list that is already partially sorted. This is far more common than you think.

Some people prefer to not use the $a and $b variables. They are not strictly required in the sort subroutine. If you want to use variables with names of your choosing (and not $a or $b) you’ll need to use a $$ prototype to force passing $a and $b to the sort sub for assignment to your variables:

sub by_seniority_then_pay_then_name($$) {
    my ( $employee1, $employee2 ) = @_;
    $employee2->{years}    <=> $employee1->{years}
    ||
    $employee1->{payscale} <=> $employee2->{years}
    ||
    $employee1->{name}     cmp $employee2->{name}
}
@employees = sort by_seniority_then_pay_then_name @employees;

Be aware that if you do this, the sort subroutine will be a bit slower because, as an optimization in Perl the $a and $b variables are automatically aliased by Perl when sort is encountered.

Sorting and Unicode Fun!

Why do we sort data? We could argue that there are two reasons for this:

  • Making it faster for computers to find data

  • Making it faster for humans to find data

If all we care about is to make it faster for computers to find data, then the default sort behavior is often fine. However, humans are an annoyingly troublesome lot. In Swedish, the letter z comes before the letter ö, but in German it’s the other way around. If you’re sorting data for display to people, they will complain very bitterly (and quite rightly) if they have trouble finding what they need because the sort order of the data is not what they expect, so you need to make sure that you’re sorting correctly for your target audience.

And here’s another fun example. Run the following code:

use utf8::all;
use charnames ":full";
print "\N{ANGSTROM SIGN}\n";
print "\N{LATIN CAPITAL LETTER A WITH RING ABOVE}\n";

That prints out:

Å
Å

Those are the Unicode code points U+212B and U+00C5, respectively, but for purposes of sorting or comparison they are supposed to be considered the same character. Further, Unicode has combining character to indicate that two symbols should be combined. This gives Unicode a great flexibility in representing different characters. Using the two code points above, along with an upper case A and a COMBINING RING ABOVE gives this code:

use charnames ':short';
binmode STDOUT, ':encoding(UTF-8)';
print "\N{U+212B}\n";
print "\N{U+00C5}\n";
print "\N{U+0041}\N{U+030A}\n";

Which prints:

Å
Å
Å

Many computers will actually print those slightly differently, but they should look generally similar and for purposes of sorting and comparing (cmp), they must, as already stated, be considered the same character despite being different code points. We’re repeating ourselves because it’s important and there’s a good chance you’ll get it wrong. But don’t feel bad. Perl’s default sort builtin also gets this wrong.

Warning

Note that while (U+212B), Å (U+00C5) and Å (U+0041 U+030A) are considered to be identical characters, the fact that they look the same is an accident. Do not rely on a character’s appearance to decide whether or not two characters are the same.

So how do you get this right? Collation. Collation, for our purposes, is defining the correct order for data. Sorting, by contrast, is putting data into that correct order. The Unicode Collation Algorithm, described at http://www.unicode.org/reports/tr10/, tells you how to do properly collate Unicode data. Fortunately, the Unicode::Collate module was first included with Perl in version 5.7.3 and it implements the Unicode Collation Algorithm for you.

So as a general rule, sorting is handled correctly with the code shown in Example 10.2, “Using Unicode::Collate”.

Example 10.2. Using Unicode::Collate

use strict;
use warnings;
use diagnostics;
use utf8::all;
use Unicode::Collate;
my @apples = (
    "\N{U+212B}pples",
    "\N{U+00C5}pples",
    "\N{U+0041}\N{U+030A}pples",
    "apples",
    "Apples",
);
my @bad    = sort @apples;
my @sorted = Unicode::Collate->new->sort(@apples);
print "Original: @apples\n";
print "Sorted:   @bad\n";
print "Collated: @sorted\n";

Note

collate.pl available for download at Wrox.com.

Running collate.pl prints out:

Original: Åpples Åpples Åpples apples Apples
Sorted:   Apples Åpples apples Åpples Åpples
Collated: apples Apples Åpples Åpples Åpples

You’ll notice that the second, Sorted:, line starts with Apples Åpples apples. Clearly that’s not right, but Perl’s default sort does not recognize the U+0041U+0030A as being combined. It merely sorts on the numeric value of the individual octets, leading to incorrect sorting.

Unicode::Collate is great, but you often need to sort according to a specific locale. In Perl, you can do this:

use locale; # but don't really do this

That tells Perl to use the proper sorting for your systems LC_COLLATE environment variable. Unfortunately, many programmers have been bitten by this because not all operating systems support this, nor are the locales guaranteed to be installed, ensuring that this method is not portable. Instead, use Unicode::Collate::Locale.

We previously mentioned that in Swedish, the letter z comes before the letter ö, but the sort order is reversed in German. Example 10.3, “Using Unicode::Collate::Locale to sort according to locale.” shows the use of Unicode::Collate::Locale to get the correct sort order.

Example 10.3. Using Unicode::Collate::Locale to sort according to locale.

use strict;
use warnings;
use utf8::all;
use Unicode::Collate::Locale;
my @letters  = qw(z ö);
my @reversed = reverse @letters;
my $german  = Unicode::Collate::Locale->new( locale => 'de_DE' );
my $swedish = Unicode::Collate::Locale->new( locale => 'sv_SE' );
foreach my $letters ( \@letters, \@reversed ) {
    print "Original: @$letters\n";
    my @german  = $german->sort(@$letters);
    my @swedish = $swedish->sort(@$letters);
    print "German:   @german\n";
    print "Swedish:  @swedish\n\n";
}

Note

locale_sort.pl available for download at Wrox.com.

When you run Example 10.2, “Using Unicode::Collate”, you should see:

Original: z ö
German:   ö z
Swedish:  z ö
Original: ö z
German:   ö z
Swedish:  z ö

Unicode::Collate::Locale will first released with Unicode::Collate version 0.55 in August of 2010, so you may need to install a newer version of Unicode::Collate from the CPAN.

Map and grep

Many times you want to transform or filter a list instead of (or in addition to) sorting the list. We’ll start with filtering the list first.

Using grep

We’ve used the grep builtin a few times in this book and we’ve deliberately kept the examples simple to make the basic use clear. However, we’ll pretend you’ve never heard of it just to give you a quick refresher. The grep builtin takes a list and produces another list of all values matching grep’s criteria. For example, to only use numbers greater than zero:

my @greater = grep { $_ > 0 } @numbers;

The grep builtin takes two forms:

NEWLIST = grep BLOCK       LIST;
NEWLIST = grep EXPRESSION, LIST;

The first form, used in our code example above, is probably the most popular. We could have written our “greater than zero” filter as any of these three:

my @greater = grep { $_ > 0 } @numbers;
my @greater = grep   $_ > 0,  @numbers;
my @greater = grep(  $_ > 0,  @numbers );

Note that grep BLOCK does not take a comma after the block while grep EXPRESSION does.

When using grep, we iterate over every element in the LIST, setting each element in turn to $_. The grep builtin will only return elements for which the BLOCK or EXPRESSION returns true. You can have arbitrarily complex expressions in the grep. To grab the palindromes from a list:

my @palindromes = grep { uc eq reverse uc } @words;

Note

Your author debated quite a bit about writing this palindrome checker:

my @palindromes = grep { uc eq reverse uc } @words;

Ignoring Unicode issues here (in some encodings, characters are different depending on their location in a word), it might seem “friendlier” to write the code like this:

my @palindromes = grep { uc($_) eq scalar reverse uc($_) }
  @words;

The reason the first version works is because uc operates on the $_ version by default. The scalar builtin is often used with reverse to force it to reverse a string, but the eq forces scalar context, rendering the scalar keyword redundant. While we recommend you use the longer form to avoid confusion, you need to get used to seeing the shorter forms, so they will be used from time to time.

Because a bare regex matches against $_, this is often seen in grep. To find words beginning with the vowels a, e, i, o or u:

my @starts_with_vowels = grep { /^[aeiou]/ } @words;

Because grep returns a list, you can combine this with sort. To find all numbers greater than or equal to ten and return them sorted from lowest to highest:

my @numbers = ( 13, 3, −2, 7, 270, 19, −3.2, 10.1 );
my @result = sort { $a <=> $b } grep { $_ >= 10 } @numbers;
print join ', ', @result;

And that prints out:

10.1, 13, 19, 270

When chaining list builtins like this, many people prefer to write them on separate lines to make things more clear:

my @result = sort { $a <=> $b }
             grep { $_ >= 10  } @numbers;

One thing to keep in mind when using list builtins like grep is that they operate on an entire list. Sometimes you’ll see code like this:

my @positive = grep { $_ > 0 } @numbers;
my $first    = $positive[0];

That can be very inefficient, particularly if you have a lot of @numbers. A for loop with last is better.

my $first;
for (@numbers) {
    if ( $_ > 0 ) {
        $first = $_;
        last;
    }
}

That for loop terminates the search through @numbers on the first successful match, if any. Of course, if none of the @numbers are greater than zero, it’s not more efficient than the grep.

Using map

The map builtin, like the grep builtin, takes a list and returns a new list. It maps old values to new values. Its syntax is virtually identical to grep’s:

NEWLIST = map BLOCK       LIST;
NEWLIST = map EXPRESSION, LIST;

So to upper-case every word in a list:

my @UPPER = map { uc } @words;

Like grep, map operators on every element in a list, so only use it if you want to transform an entire list. And like grep, because it returns a list, you can chain with grep and sort. So let’s say you have an array of numbers and you want to take the square roots of those numbers greater than zero:

my @roots = map  { sqrt($_) }
            grep { $_ > 0 } @numbers.

Aliasing issues

One significant issue to be aware of with both map and grep is that when individual elements of the list are assigned to $_, they are aliased to the original value. This means that if you change the value of $_, you’re changing the value of the original item. Here’s a short program to demonstrate just how terribly wrong things can go if you’re not aware of this. We’ll use a lesser-known feature of Data::Dumper to name our variables in our output to make it easier to follow.

use Data::Dumper;
my @numbers = qw{ 1 2 3 4 5 };
my @incremented = map $_++, @numbers;   # No!
print Data::Dumper->Dump(
    [ \@numbers, \@incremented ],
    [ '*numbers', '*incremented' ]
);

And that prints out:

@numbers = (
             '2',
             '3',
             '4',
             '5',
             '6'
           );
@incremented = (
                 '1',
                 '2',
                 '3',
                 '4',
                 '5'
               );

What’s going on here? We haven’t incremented the numbers we intended and we incremented the original list we meant to leave alone!

Because $_ is an alias, the $_++ changes the original numbers, but because the ++ is the postfix version of the autoincrement operator, it changes the value after it’s been returned. Thus, our new list gets the old values and our old list gets the new values! Here is one way to write that the way you intended it:

my @incremented = map $_ + 1, @numbers;

Also, if you use the block for, you can localize the value of $_ in the block, but it’s beginning to look like an ugly hack.

my @incremented = map { local $_ = $_; ++$_ }, @numbers;

Note

Many Perl developers are not aware of this alternate syntax for Data::Dumper and if they are, they often avoid is because it’s rather cumbersome to use. See perldoc Data::Dumper to understand it.

You author has released Data::Dumper::Names and Data::Dumper::Simple to the CPAN to make this easier to use. They behave slightly differently, so read their documentation to understand the differences.

Trying to do too much

At their core, sort, map and grep all take an existing list and return a new list. There’s nothing very complicated about this, but the sorting, filtering or transforming may be complicated. Keep them simple or else they’re hard to follow. We’ve already shown how to do with sort. We’ve taken this:

@employees= {
    $b->{years}    <=> $a->{years}
    ||
    $a->{payscale} <=> $b->{years}
    ||
    $a->{name}     cmp $b->{name}
} @employees;

And turned it into this:

@employees = sort by_seniority_then_pay_then_name @employees;

However, you may remember this rather complicated grep:

my %is_prime;
@primes = grep {
             ( exists $is_prime{$_} and $is_prime{$_} )
             or
             ( $is_prime{$_} = is_prime($_) )
          } @numbers;

It was deliberately left complicated in the code so that we could show how some people will write a grep statement. Instead, you could push the caching into the is_prime() function:

{
    my %is_prime;
    sub is_prime {
        my $number = $_[0];
        return $is_prime{$number} if exists $is_prime{$number};
        $is_prime{$number} = 0 if $number < 2;
        $is_prime{$number} = 1 if $number == 2;
        for ( 2 .. int sqrt($number) ) {
             if ( !($number % $_) ) {
                $is_prime{$number} = 0;
                last;
            }
        }
        $is_prime{$number} = 1 if !exists $is_prime{$number};
        return $is_prime{$number};
    }
}

By doing that, our grep then becomes:

@primes = grep { is_prime($_) } @numbers;

Not only is that much easier to read, any other code that now calls is_prime() gets to take advantage of the caching.

Also, the actual grep() is faster than it was originally. You can discover easier ways to verify that with the Benchmark module that we’ll cover in Chapter 18, Common tasks.

Trying To Be Clever

The sort, map and grep functions are the sort of functions which developers love to abuse. Sadly, many developers think being clever is more important than writing maintainable code and so they’ll write something like this:

my %person = (
    id         => 6,
    name       => 'The Prisoner',
    profession => 'Ex-Spy',
    status     => 'Silent',
);
my $result = '';
foreach ( sort { $a ne 'id' } keys %person ) {
    my $value = $person{$_};
    $result .= sprintf "%-10s - $value\n", $_;
}
print $result;

And that prints out:

id         - 6
name       - The Prisoner
status     - Silent
profession - Ex-Spy

The order of name, status and profession may be different on your system, but id is guaranteed to be first. Why? Because of the strange sort { $a ne 'id' } construction. As you may recall, both the cmp and <=> operators return −1, 0 or 1 depending on whether the left value is less than, equal to or greater than the right value. The $a ne 'id' will return false (the empty string, in this case) which Perl will treat as 0 (zero) and $a ne $anything_else will return 1, ensuring that the value 'id' will always be sorted first.

This sort of cleverness can be fun if you’re just playing around, but for serious code, particularly in a production environment where someone may need to fix something quickly, please try to be clear. Here’s a better solution:

my $format = "%-10s - %s\n";
# always have id as the first line
my $result = sprintf $format, 'id', delete $person{id};
foreach ( keys %person ) {
    my $value = $person{$_};
    $result .= sprintf $format, $_, $value;
}

Putting it all together

Now that we’ve covered sort, map, and grep, it’s time to put all of them together for some more advanced techniques. You won’t see these as often in your code, but when you need them, they’ll really help you out.

Schwartzian Transform (a.k.a., decorate, sort, undecorate)

The Schwartzian Transform is a famous technique named after the well-known Perl hacker Randal Schwartz. He explains this technique in a Unix Review column at http://www.stonehenge.com/merlyn/UnixReview/col64.html.

Basically, many sorting operations can be extremely expensive if we need to calculate the value to be sorted on. The Schwartzian Transform allows us to calculate once and only once a sort key, sort on it, and then strip the sort key. In other languages, this technique is sometimes known as decorate, sort, undecorate.

Let’s say that you have the following data in a file:

James|007|Spy
Number 6|6|Ex-spy
Agent 99|99|Spy with unknown name
Napoleon Solo|11|Uncle spy
Unknown|666|Maybe a spy

Except that it’s actually about 300,000 lines or so and you need to sort on the number between the pipes, such as |007|. So you try a straightforward sort like this:

my @sorted = sort by_id <>;
sub by_id {
    $a =~ /\|(\d+)/;
    my $a_id = $1;
    $b =~ /\|(\d+)/;
    my $b_id = $1;
    return $a_id <=> $b_id;
}

What you’ve done is use a regular expression to extract the number into a variable and then you return the variables compared with the spaceship operator. You’ve also been good and you’ve written a sort subroutine to make the sort operation easier to follow, but when you run this, you find it takes 7 or 8 seconds to run.

This might be fast enough for you. If this is a one-off script, you may not care about speed. If this is part of a process that runs once a night, you again may not care about speed. However, if you call this a lot, that extra time might be very problematic, so we’ll switch to the Schwartzian Transform.

my @sorted = map  { $_->[0] }                    # undecorate
             sort { $a->[1] <=> $b->[1] }        # sort
             map  { /\|(\d+)/; [ $_, $1 ] } <>;  # decorate

Now, instead of 7 or 8 seconds, it takes about 2 seconds to sort 300,000 lines, but how does it work?

Note

If you don’t remember <>, the diamond operator, go back to Chapter 9, Files and Directories and reread “The Diamond Operator” section. This operator takes a bit of time to get used to.

Both sorts compare every number to every other number, but the naive sort has to use a regular expression to extract the id every time, even if it’s been previously extracted. With the Schwartzian Transform, we extract the number only once with the first map, known as the “decorate” step.

map  { /\|(\d+)/; [ $_, $1 ] } <>;  # decorate

The [ $_, $1 ] is an array reference which has the original value in the first element (remember, that’s index 0) and the sort key (the id, in this case) from $1 in the second element (index 1);

Then the second step, the sort, just sorts on element 1, the sort key:

sort { $a->[1] <=> $b->[1] }        # sort

Finally, the last map (undecorate), returns element 0, the original string.

my @sorted = map  { $_->[0] }       # undecorate

By not re-extracting the id for every comparison, we gain a significant speed improvement and that is the beauty of the Schwartzian Transform.

Note that the intermediate container doesn’t need to be an array. You can use a hash reference if you prefer.

my @sorted = map  { $_->{original} }
             sort { $a->{id} <=> $b->{id} }
             map  { /\|(\d+)/; { original => $_, id => $1 } } <>;

Hash lookups are slightly slower than array lookups, but even the code above is twice as fast as the naive sort.

For the micro-optimization fans, you can shave another 5% to 10% of the time by using index instead of regular expressions.

my @sorted = map  { $_->[0] }
             sort { $a->[1] <=> $b->[1] }
             map  { my $i      = 1 + index $_, "|";
                    my $length = index( $_, "|", $i ) - $i;>
                   [ $_, substr $_, $i, $length ]
             } <>;

How that works is an exercise left for the reader. You will find people trying to shave tiny amounts of time with code like this, but unless you can prove why you need to save that time, we strongly urge you not to write code this tricky. It’s a beast to maintain and will make your coworkers unhappy with your “cleverness”. If you must write that, document it well.

Guttman-Rosler Transform

This sort technique is very advanced and you can skip this section if you wish, but we include it for completeness.

When using sort, if you can eliminate the dereferencing in the sort block, you will have a faster sort. In the paper “A Fresh Look at Efficient Perl Sorting” (http://www.sysarch.com/Perl/sort_paper.html) by Uri Guttman and Larry Rosler, we have the Guttman-Rosler Transform introduced (though they did not call it that themselves).

my @sorted = map { substr $_, 4 }
       sort
       map { /\|(\d+)/; pack("A4", $1).$_ } <>;

This looks strange and you’ll want to read perldoc perlpacktut to understand how pack works. The pack template, A4, creates a four octet ASCII string out of the id and we prepend that to the beginning of the string. As a result, the sort itself can rely on its standard sort behavior and the final map removes the packed data. On the computer your author is using to write this chapter, Table 10.1, “Sort Time Comparisons” shows the typical performance of each sort method on our test file with a 300,000 line file with random integers for ids:

Table 10.1. Sort Time Comparisons

Sort Method

Approximate Time

Naïve sort

7.5 seconds

Schartzian Transform

2.5 seconds

Guttman-Rosler Transform

1.5 seconds


As you can see, the Guttman-Rosler transform is very fast, but it is rarely used as most Perl developers are not familiar with the pack function.

Summary

In this chapter you’ve learned about sort, map, and grep, three powerful functions that transform a list into a new list. You’ve learned a variety of sorting techniques and how to use sort subroutines. You’ve learned how to efficiently filter a list of values with grep and how to transform an old list of values into a new list of values with map. Finally, you’ve learned ways of combining these functions together to get powerful data manipulation techniques.

Exercises

  1. Given the following list of hexadecimal numbers, print them in descending numeric order;

    my @numbers = ( 0x23, 0xAA, 0xaa, 0x01, 0xfB );
  2. Given a list of numbers, use grep to return only the numbers that are perfect squares. Then print them in ascending numeric order.

    Assume the following list of numbers:

    my @numbers = ( 28, 49, 1000, 4, 25, 49, 529 );

    Write the grep in both BLOCK and EXPRESSION form.

    NEWLIST = grep BLOCK       LIST;
    NEWLIST = grep EXPRESSION, LIST;

    What happens if one of the values in the @numbers array is actually the string Get a job, hippy!? How would this change your code?

  3. Given the following list, write a grep statement that creates a new list, but without duplicate elements. There a several ways you can solve this.

    my @list = qw(
       bob
       sally
       Andromalius
       sally
       bob
       ned
       Andromalius
    );
  4. Given the following array:

    my @employees = (
        {
            first_name => 'Sally',
            last_name  => 'Jones',
            years      => 4,
            payscale   => 4,
        },
        {
            first_name => 'Abby',
            last_name  => 'Hoffman',
            years      => 1,
            payscale   => 10,
        },
        {
            first_name => 'Jack',
            last_name  => 'Johnson',
            years      => 4,
            payscale   => 5,
        },
        {
            first_name => 'Mr.',
            last_name  => 'Magnate',
            years      => 12,
            payscale   => 1,
        },
    );

Use map to create a new array that looks like this:

my @names = (
  'Jack Johnson',
  'Sally Jones',
  'Mr. Magnate',
);

Note that the names are sorted by last name, ascending, and we exclude employees who have been with the company a year or less.

WHAT YOU LEARNED IN THIS CHAPTER

Topic

Key Concepts

sort

Used to order a list of data

Sort subroutines

Making complex sorts easier to understand

Sorting Unicode

Using the Unicode Collation Algorithm to sort different character sets

grep

Only select elements of a list which pass desired criteria

map

Transform every element of a list, making a new list

map and sort traps

Aliasing, complexity and “clever” code are not your friends

Complex sorts

Using map and sort together for more efficient sorting

Answers to exercises

  1. Given the following list of hexadecimal numbers, print them in descending numeric order;

    my @numbers = ( 0x23, 0xAA, 0xaa, 0x01, 0xfB );

    See hex() and oct() in Chapter 4, Working With data if you need a refresher on the 0x... syntax.

    Those are merely the hexadecimal representation of numbers, so a descending numeric sort is just:

    my @sorted = sort { $b <=> $a } @numbers;
    print join ', ' => @sorted;

    And that will print:

    251, 170, 170, 35, 1

    Of course, you may not want to print the decimal values when the original numbers were in hexadecimal.

    my @numbers = ( 0x23, 0xAA, 0xaa, 0x01, 0xfB );
    my @sorted = sort { $b <=> $a } @numbers;
    print join ', ' => map { sprintf "0x%X", $_ }  @sorted;

    And that prints:

    0xFB, 0xAA, 0xAA, 0x23, 0x1
  2. Given a list of numbers, use grep to return only the numbers that are perfect squares. Then print them in ascending numeric order.

    Assume the following list of numbers:

    my @numbers = ( 28, 49, 1000, 4, 25, 49, 529 );

    Write the grep in both BLOCK and EXPRESSION form.

    NEWLIST = grep BLOCK       LIST;
    NEWLIST = grep EXPRESSION, LIST;

    The BLOCK form looks like this:

    my @numbers = ( 28, 49, 1000, 4, 25, 49, 529 );
    my @squares = sort { $a <=> $b }
                  grep { int(sqrt($_)) == sqrt($_) } @numbers;
    print join ', ' => @squares;

    What the grep is doing is take the integer value of the square root and comparing it against the square root. So for the number 9, we get 3 == 3 and that’s a perfect square. However, the square root of 1000 is reduced to something like 31.6227766016838 == 31 and that is clearly not true, so 1000 is not a perfect square.

    The EXPRESSION form looks like this:

    my @squares = sort { $a <=> $b }
                  grep int(sqrt($_)) == sqrt($_), @numbers;

    It can also be written like this:

    my @squares = sort { $a <=> $b }
                  grep(int(sqrt($_)) == sqrt($_), @numbers_;

    What happens if one of the values in the @numbers array is actually the string Get a job, hippy!? How would this change your code?

    There are several ways of handling this, but one way is to realize that perfect squares must be positive integers, so the following would do the trick:

    my @squares = sort { $a <=> $b }
                  grep { /^[0-9]+$/ && ( int(sqrt($_)) == sqrt($_) ) } @numbers;

    In other words, we use the regular expression /^[0-9]+$/ to guarantee that we only have digits.

  3. Given the following list, write a grep statement that creates a new list, but without duplicate elements. There a several ways you can solve this.

    my @list = qw(
       bob
       sally
       Andromalius
       sally
       bob
       ned
       Andromalius
    );

    One solution:

    my %seen;
    my @unique = grep { not $seen{$_}++ } @list;

    This is a moderately common idiom in Perl (though many people just use the List::MoreUtils uniq function). Here’s how it works.

    The first time that bob is encountered in grep, we have this:

    not $seen{'bob'}++

    We know that $seen{'bob'} must be false the first time it’s encountered because there is no entry in the %seen hash and it evaluates as this:

    not undef

    And that evaluates as true, allow grep to say “bob’s OK and we’ll pass him along”. However, the ++ autoincrement operator kicks in after Perl has returned the value. It sees the undef value, treats it as 0 (zero) and adds 1 to it.

    The second time that bob is encountered in the grep, $seen{'bob'} has the value of 1 and not 1 is 0 (zero) and that evaluates as false. Thus, grep will not return any value’s after they are seen for the first time.

  4. Given the following array:

    my @employees = (
        {
            first_name => 'Sally',
            last_name  => 'Jones',
            years      => 4,
            payscale   => 4,
        },
        {
            first_name => 'Abby',
            last_name  => 'Hoffman',
            years      => 1,
            payscale   => 10,
        },
        {
            first_name => 'Jack',
            last_name  => 'Johnson',
            years      => 4,
            payscale   => 5,
        },
        {
            first_name => 'Mr.',
            last_name  => 'Magnate',
            years      => 12,
            payscale   => 1,
        },
    );

Use map to create a new array that looks like this:

my @names = (
  'Jack Johnson',
  'Sally Jones',
  'Mr. Magnate',
);

Note that the names are sorted by last name, ascending, and we exclude employees who have been with the company a year or less.

The map and sort looks like this:

my @names = map  { "$_->{first_name} $_->{last_name}"  }
            sort { $a->{last_name} cmp $b->{last_name} }
            grep { $_->{years} > 1                     }
              @employees;

Unlike our other examples, we’re using all of sort, map and grep. The grep will filter the list before the sort because there is no point in sorting values that we will throw away (particularly if the list we’re sorting is large). The sort, of course, comes before the map because after the map, we no longer have our data structure and it would be harder to sort on that last name.

There comes a time when you, as a Perl developer, will feel very comfortable with these techniques. Here’s how it might be rewritten with a for loop. We’ll use a sort subroutine to make the sort clearer.

sub by_last_name { $a->{last_name} cmp $b->{last_name} }
my @names;
foreach my $employee ( sort by_last_name @employees ) {
    next if $employee->{years} <= 1;
    push @names, "$employee->{first_name} $employee->{last_name}";
}

It’s entirely up to you (and the circumstances of your code) which method you find cleaner and easier to maintain.

Site last updated on: July 5, 2012 at 11:41:08 AM PDT
Cover for Beginning Perl (Wrox)

View 2 comments

  1. perrettdl – Posted June 19, 2012

    Table "Schartzian Transform" "Schwartzian Transform"

  2. Curtis Poe – Posted June 27, 2012

    Fixed. Thank you!

    The author has indicated that the issue raised in this comment has been resolved.

Add a comment

View 2 comments

  1. Marcel Grünauer – Posted June 14, 2012

    I know headings are supposed to start with an uppercase letter, but map and grep are functions, so I assume "Sort" refers directly to the function, not the concept, and so maybe should be lowercase as well. Also, maybe function names could be in monospace everywhere.

    The author has indicated that the issue raised in this comment has been resolved.

  2. Curtis Poe – Posted June 27, 2012

    @Marcel: these have been fixed. Thanks.

    The author has indicated that the issue raised in this comment has been resolved.

Add a comment

View 2 comments

  1. Marcel Grünauer – Posted June 14, 2012

    "convert", not "covert".

    The author has indicated that the issue raised in this comment has been resolved.

  2. Curtis Poe – Posted June 27, 2012

    Fixed. Thanks!

    The author has indicated that the issue raised in this comment has been resolved.

Add a comment

View 2 comments

  1. Marcel Grünauer – Posted June 14, 2012

    Consistency: This is called a "sort block" just above, but a "sort routine" (with monospaced "sort") here.

    The author has indicated that the issue raised in this comment has been resolved.

  2. Curtis Poe – Posted June 27, 2012

    Fixed. Thanks!

    The author has indicated that the issue raised in this comment has been resolved.

Add a comment

View 2 comments

  1. Friend – Posted June 15, 2012

    That should be $a->{payscale} <=> $b->{payscale} as also for the next few lines of code too

  2. Curtis Poe – Posted June 27, 2012

    Ouch! Fixed. Thanks :)

    The author has indicated that the issue raised in this comment has been resolved.

Add a comment

View 1 comment

  1. Friend – Posted June 15, 2012

    'ö' gets sorted as if it was a simple 'o'. Not easy at all. You're right.

Add a comment

View 1 comment

  1. chrisjack1 – Posted June 27, 2012

    s/That/It/

    The author has indicated that the issue raised in this comment has been resolved.

Add a comment

View 2 comments

  1. Marcel Grünauer – Posted June 14, 2012

    "grep" should be monospaced everywhere.

    The author has indicated that the issue raised in this comment has been resolved.

  2. Curtis Poe – Posted June 27, 2012

    Fixed. Thanks.

    The author has indicated that the issue raised in this comment has been resolved.

Add a comment

View 2 comments

  1. Friend – Posted June 15, 2012

    ... they often avoid IT because it's rather ...

    BTW: I prefer Data::Dump

  2. Curtis Poe – Posted June 27, 2012

    Fixed. Thanks.

    The author has indicated that the issue raised in this comment has been resolved.

Add a comment

View 2 comments

  1. Marcel Grünauer – Posted June 14, 2012

    $b->{years} should be $b->{payscale}

    The author has indicated that the issue raised in this comment has been resolved.

  2. Curtis Poe – Posted June 27, 2012

    Kill me now. I missed that a bunch because, by chance, my data was such that it did not expose the bug.

    Fixed.

    The author has indicated that the issue raised in this comment has been resolved.

Add a comment

View 2 comments

  1. perrettdl – Posted June 19, 2012

    @employees= { should read @employees = sort {

  2. Curtis Poe – Posted June 27, 2012

    Fixed, thank you!

    The author has indicated that the issue raised in this comment has been resolved.

Add a comment

View 2 comments

  1. Friend – Posted June 15, 2012

    You can get the same and IMHO more readable results by using 'sprintf("%04".$_)' for the string you wish to add. But it would be slower to encode than with pack.

  2. Curtis Poe – Posted June 27, 2012

    Agreed on both points. However, the definition of the Guttman-Rosler Transform requires the pack at the beginning. For maximum performance.

    The author has indicated that the issue raised in this comment has been resolved.

Add a comment