Softpanorama
(slightly skeptical) Open Source Software Educational Society

May the source be with you, but remember the KISS principle ;-)

Google   


Perl Sort

News Recommended Links Reference Sorting algorithms Tutorial Unix sort
Grep and map   Regular expressions History Humor Etc

[From Perl Cookbook] Perl sort function takes an optional code block, which lets you replace the default alphabetic comparison subroutine with your own.

The comparison function should return a negative number if $a ought to appear before $b in the output list, 0 if they're the same and their order doesn't matter, or a positive number if $a ought to appear after $b.

Perl has two operators that behave this way: <=> for sorting numbers in ascending numeric order, and cmp for sorting strings in ascending alphabetic order. By default, sort uses cmp-style comparisons.

Here's code that sorts the list of PIDs in @pids, lets the user select one, then sends it a TERM signal followed by a KILL signal. We use a code block that compares $a to $b with <=> to sort numerically:

# @pids is an unsorted array of process IDs
foreach my $pid (sort { $a <=> $b } @pids) {
    print "$pid\n";
}
print "Select a process ID to kill:\n";
chomp ($pid = <>);
die "Exiting ... \n" unless $pid && $pid =~ /^\d+$/;
kill('TERM',$pid);
sleep 2;
kill('KILL',$pid);

If you use $a <=> $b or $a cmp $b, the list will be sorted in ascending order. For a descending sort, all we have to do is swap $a and $b in the sort subroutine:

@descending = sort { $b <=> $a } @unsorted;

Comparison routines must be consistent; that is, they should always return the same answer when called with the same values.

You can also say sort SUBNAME LIST where SUBNAME is the name of a comparison subroutine returning -1, 0, or +1. In the interests of speed, the normal calling conventions are bypassed, and the values to be compared magically appear for the duration of the subroutine in the global package variables $a and $b. Because of the odd way Perl calls this subroutine, it may not be recursive.

A word of warning: $a and $b are set in the package active in the call to sort, which may not be the same as the one that the SUBNAME function passed to sort was compiled in! For example:

package Sort_Subs;
sub revnum { $b <=> $a }

package Other_Pack;
@all = sort Sort_Subs::revnum 4, 19, 8, 3;

This will silently fail (unless you have -w in effect, in which case it will vocally fail), because the sort call sets the package variables $a and $b in its own package, Other_Pack, but the revnum function uses its own package's versions. This is another reason why in-lining sort functions is easier, as in:

@all = sort { $b <=> $a } 4, 19, 8, 3;


Notes:
  • Those pages are written by people for whom English is not a native language. Some amount of grammar and spelling errors should be expected.
  • This is a Spartan WHYFF (We Help You For Free) site. It cannot replace the best teachers and the best books.
  • The site contain some obsolete pages as it develops like a living tree... Some links on older pages are broken. Please try to use Google, Open directory, etc. to find a replacement link (see HOWTO search the WEB for details). We would appreciate if you can mail us a correct link.

Search Amazon by keywords:

Google   
Open directory

Research Index

 

Old News ;-)

[Mar 24, 2008] Perl Training Australia - Perl's sort function

A common task when working with any type of data is to sort that data into a consistent ordering. A list of people may be sorted by name, a list of files may be sorted by size, and a list of phone calls may be sorted by time or duration.

Perl makes it easy to sort lists of values. If I have a list of names, I can produce a sorted list of names with one simple line:

    @sorted_names = sort @names;

By itself, Perl's sort functions will place elements into "asciibetical" order. That is, according to the ordering of the ASCII table. For the most part this is the same as alphabetical, except that case does matter -- all strings starting in upper case will always be listed before those starting in lower case. That's very fast, but it's not always what we want.

What if we wanted to sort our list in true alphabetical order, or by numerical order instead? To do that, we need to pass an argument to Perl's sort function.

    @sorted_names = sort {lc $a cmp lc $b} @names;

The argument to sort is a block that Perl will call to determine how to order two elements. The rules here are simple. Two special variables, $a and $b, are bound by Perl to the values being compared. Perl then invokes our code and expects to see a value of -1, 0, or +1, depending on whether $a is less than, equal to, or greater than $b.

In our example, the 'cmp' operator returns -1, 0, or +1 for these three conditions, and compares both of its arguments as strings. Since we wanted to sort our values alphabetically, regardless of case, we also call 'lc' on each of the values, to provide us with a lower-cased version of each string.

If we want to perform a numerical sort, we simply need to pass in a different comparison block:

    @sorted_numbers = sort {$a <=> $b} @numbers;

Here we're using the '<=>' operator, popularly known as the 'starship operator'. It also returns values of -1, 0, or +1 as appropriate, but compares its arguments as numbers, not strings.

If we want to have a list in descending order (largest to smallest) rather than the other way around, we only need to swap the positions of $a and $b:

    @descending_numbers = sort {$b <=> $a } @numbers;

As you can see, Perl's sort is very flexible.

It's also possible to pass an arbitrary subroutine to Perl's sort function. This allows us to not only use some very complex comparison functions, but can also improve readability. Let's pretend we two subroutines declared, using our sort blocks from above:

    sub numerically    { $a <=> $b }
    sub alphabetically { lc $a cmp lc $b }

Now we can write:

    @sorted_numbers = sort numerically    @numbers;
    @sorted_names   = sort alphabetically @names;

NEXT WEEK: Using the Schwartzian Transform when working with expensive comparison algorithms

 

Sort::Merge Perl extension for merge sort
 

Sorting Data - Perl Programming

Perl Sorting

Now we get to the more complicated ways. If our "unsorted" were very large, it could take a while to run. At http://perlmonks.thepen.com/145659.html, I found this:
#!/usr/bin/perl

my @words=<>;
#Guttman Rosler Transform or GRT
my @sorted=map { substr($_,4) }
sort
map { pack("LA*",tr/eE/eE/,$_) } @words;
foreach(@sorted) {
print;
}

This needs a lot of explanation. First, "map". Perl's "map" function is like a "foreach" loop: whatever you want to do to an array is in the first argument, the array you work on is its second. It's better than a foreach loop though, because it gives us back a new array. As a very simple example:

@lcwords=map {lc($_)} @words;

So the "map { pack("LA*",tr/eE/eE/,$_) } @words;" part of the example above returns a new array that consists of a packed 4 byte number followed by the orginal contents of each line. That number is, of course, the count provided by "tr". We use pack here because it's very quick, but if we had more complex needs, we could use sprintf or even build the string ourselves. We just have to make sure we can un-build it at the end.

It's then ordered by the "sort" on the previous line, and then the first "map { substr($_,4) }" works on that array, stripping out the packed 4 byte number that made our previous sorting work.

In spite of the double map use, this will actually run much faster than what we did before. Even on a relatively small file, "time" will show that this can be twice as fast.

 

Recommended Links

Perl Sorting Techniques.

Get an alphabetical sort of words, but make 'aardvark' always come last.

(Now, why you would want to do that is another question...)
@sorted = sort { 
                if($a eq 'aardvark') { return -1; }
                elsif($b eq 'aardvark') { return 1; }
                else { return $a cmp $b; } 
               } @words;
    

A Fresh Look at Efficient Perl Sorting

Sorting can be a major bottleneck in Perl programs. Performance can vary by orders of magnitude, depending on how the sort is written. In this paper, we examine Perl´s sort function in depth and describe how to use it with simple and complex data. Next we analyze and compare several well-known Perl sorting optimizations (including the Orcish Maneuver and the Schwartzian Transform). We then show how to improve their performance significantly, by packing multiple sortkeys into a single string. Finally, we present a fresh approach, using the sort function with packed sortkeys and without a sortsub. This provides much better performance than any of the other methods, and is easy to implement directly or by using a new module we created, Sort::Records.

NOTE: Sort::Records died during development but five years later, Sort::Maker was released and does all that was promised and more. Find it on CPAN

 Perl's sort function

Perl makes it easy to sort lists of values. If I have a list of names, I can produce a sorted list of names with one simple line:

    @sorted_names = sort @names;

By itself, Perl's sort functions will place elements into "asciibetical" order. That is, according to the ordering of the ASCII table. For the most part this is the same as alphabetical, except that case does matter -- all strings starting in upper case will always be listed before those starting in lower case. That's very fast, but it's not always what we want.

What if we wanted to sort our list in true alphabetical order, or by numerical order instead? To do that, we need to pass an argument to Perl's sort function.

    @sorted_names = sort {lc $a cmp lc $b} @names;

The argument to sort is a block that Perl will call to determine how to order two elements. The rules here are simple. Two special variables, $a and $b, are bound by Perl to the values being compared. Perl then invokes our code and expects to see a value of -1, 0, or +1, depending on whether $a is less than, equal to, or greater than $b.

In our example, the 'cmp' operator returns -1, 0, or +1 for these three conditions, and compares both of its arguments as strings. Since we wanted to sort our values alphabetically, regardless of case, we also call 'lc' on each of the values, to provide us with a lower-cased version of each string.

If we want to perform a numerical sort, we simply need to pass in a different comparison block:

    @sorted_numbers = sort {$a <=> $b} @numbers;

Here we're using the '<=>' operator, popularly known as the 'starship operator'. It also returns values of -1, 0, or +1 as appropriate, but compares its arguments as numbers, not strings.

If we want to have a list in descending order (largest to smallest) rather than the other way around, we only need to swap the positions of $a and $b:

    @descending_numbers = sort {$b <=> $a } @numbers;

As you can see, Perl's sort is very flexible.

It's also possible to pass an arbitrary subroutine to Perl's sort function. This allows us to not only use some very complex comparison functions, but can also improve readability. Let's pretend we two subroutines declared, using our sort blocks from above:

    sub numerically    { $a <=> $b }
    sub alphabetically { lc $a cmp lc $b }

Now we can write:

    @sorted_numbers = sort numerically    @numbers;
    @sorted_names   = sort alphabetically @names;

Tutorial

Sorting (jan 96) by Randal L. Schwartz

One of the most important tasks in managing data is getting it into some sort of sensible order. Perl provides a fairly powerful sort operator, which has a tremendous amount of flexibility. I'm going to talk about some sorts of sorts, and hopefully you'll sort everything out by the time you're finished reading. (And no, despite the similarity to my name, I will not talk about ``random sorts''.)

Let's take a simple case. I have the words in a list somewhere in the perl program, and I want to sort them into alphabetical (technically, ascending ASCII) order. Easy enough:

        @somelist = ("Fred","Barney","Betty","Wilma");
        @sortedlist = sort @somelist;

This puts the value of (``Barney'',``Betty'',``Fred'',``Wilma'') into @sortedlist. If I had had these names in a file, I could have read them from the file:

        #!/usr/bin/perl
        @somelist = <>; # read everything
        @sortedlist = sort @somelist;
        print @sortedlist;

In this case, @somelist (and thus @sortedlist) will also have newlines at the end of each name. That's OK here, because it won't affect the sort order, and it makes printing them out that much easier.

Of course, I can shorten this a bit:

        #!/usr/bin/perl
        @somelist = <>;
        print sort @somelist;

Or even further:

        #!/usr/bin/perl
        print sort <>;

(I suppose this is what gives Perl its reputation for being cryptic.) Here, I've used no variables at all. However, it does indeed sort everything being read, and print the result.

These sorts of sorts are fine if the data is textual. However, if the data is numeric, we'll get a bad order. That's because comparing 15 with 3 as strings will place 15 before 3, not after it. Because the default sort is textual, we need some other way to tell sort to sort numerically, not textually.

Anything besides a textual sort of the values of the element of the list has to be handled with a ``sort subroutine''. The way this works is simple -- at some point, when perl is looking at two elements from the larger list, trying to figure out how to order those two elements, it has to perform a comparison of some sort (heh). By default, this is an ASCII string comparison. However, you can give your own comparison function using a sort subroutine, with the following interface rules:

1. your sort subroutine will be called repeatedly with two elements of the larger list.

2. the two elements will appear in the scalar variables $a and $b. (No need to make them local or look at @_.)

3. you need to ``compare'' $a and $b in the sort subroutine, and decide which is bigger.

4. you need to return -1, 0, or +1, depending on whether $a is ``less than'', ``equal to'', or ``greater than'' $b, using your comparison operation.

Those of you familiar with the qsort() library function from C should recognize this stuff. In fact, Perl uses the qsort() function, so it's no surprise.

So, here's a sort subroutine that does the job in comparing $a and $b numerically, rather than as text:

        sub numerically {
                if ($a < $b) { -1; }
                elsif ($a == $b) { 0; }
                else { +1; }
        }

Now, all we have to do is tell Perl to use this sort subroutine as the comparison function, rather than the built-in ASCII ascending sort. That's done by placing the name of the subroutine (without any leading ampersand) between the keyword ``sort'' and the list of things to be sorted. For example:

        @newlist = sort numerically 32,1,4,8,16,2;

And now, instead of the list coming out in ASCII order (as it would if I had left out the ``numerically'' word), I get the powers of two in proper numeric sequence in @newlist.

The comparison of $a and $b numerically to generate one of -1, 0, or +1, is performed often enough that Larry Wall believed it warranted its own operator, <=>, which has come to be known as the ``spaceship operator'' for reasons I would rather not discuss. So, I can shorten ``numerically'' down to this:

        sub numerically {
                $a <=> $b;
        }

Now this is short enough that it seems a waste to have to define a separate subroutine, and in fact Perl allows an even more compact notation: the inline sort block, which looks like this:

        @newlist = sort { $a <=> $b; } @oldlist;

The interface to this inline sort block is exactly as I've described for the subroutine above. It's just a little more compact. Personally, I use this style whenever the sort subroutine is under 40 characters or so, and break down to create a real subroutine above that.

Let's look at reading a list of numbers from the input again:

        #!/usr/bin/perl
        print sort numerically <>;
        sub numerically { $a <=> $b; }

Now, if I present this program with a list of numbers, I'll get the sorted list of numbers. This is functionally equivalent to a Unix ``sort'' command with a ``-n'' switch.

Let's get a little crazier. Suppose I have a file that has people's names in the first column, and bowling scores in the second column:

        Fred 210
        Barney 195
        Betty 200
        Wilma 170
        Dino 30

and that I want to sort this file based on bowling scores. Well, getting the data into the program is pretty simple:

        #!/usr/bin/perl
        @data = <>;

but each element of @data looks like: ``Fred 210\n'', and so on. How do I sort this list @data, but look only at the number and not the name?

Well, I'd need to pull the number out of the string. How do I do that? One way is with split:

        $a = "Fred 210\n";
        ($name,$score) = split /\s+/, $a;

Here, I split $a by whitespace, yielding a two element list. The first element goes into $name (which I really don't care about) and the second element goes into $score. There. Now all I have to do is tell Perl to look at just the score:

        sub scores {
                ($name_a,$score_a) = split /\s+/, $a;
                ($name_b,$score_b) = split /\s+/, $b;
                $score_a <=> $score_b;
        }

and in fact, this will do it!

        #!/usr/bin/perl
        sub scores { ... } # as above
        print sort scores <>;

So, what's wrong with this picture? Well, it'd be just fine if we only looked at each entry in the list once. However, after we're done comparing Fred's score to Barney (and decide Fred is better), we also have to compare Fred's score to Betty's score. That means that we've had to split Fred's data twice so far. In fact, for a huge list, it'll have to perform the very same split over and over and over again.

There's a few ways out of this. One is to compute a separate array that has only the scores, and then sort that array. Let's look at that first.

The goal is to first read the data, and then compute an associative array whose keys represent a particular element of the array, and values represent the precomputed scores. Then, we are reducing the problem to one of an associative array lookup instead of a (perhaps) expensive split.

        @data = <>; # read data
        foreach (@data) {
                ($name,$score) = split; # get score
                $score{$_} = $score; # record it
        }

Now, $score{``Fred 210\n''} will be just 210, and so on, for each of the original elements of @data.

Next, we have to use the information. We need a subroutine that, given two elements of @data in $a and $b, looks up the corresponding scores in %score, and compares those numerically:

        sub score {
                $score{$a} <=> $score{$b};
        }

and this indeed does it. Let's put it all together:

        #!/usr/bin/perl
        @data = <>; # read data
        foreach (@data) {
                ($name,$score) = split; # get score
                $score{$_} = $score; # record it
        }
        print sort {
                $score{$a} <=> $score{$b};
        } @data;

Note that in this version, I recoded the sort subroutine as an inline block instead. (I'm just trying to give you a lot of alternative notations to play with.)

Another way to tackle the problem is to massage the list into a list of pairs of things. The second element of each pair (actually, an anonymous list of two elements) will be the computed sort determinant, and the first element will be the original data value (so we can get back to the original data). This is best handled with the ``map'' operator (not available in older Perl versions).

        @pairs = map {
                ($name, $score) = split;
                [ $_, $score ];
        } @data;

Here, the block of code is executed for each element of @data, with $_ set to the element. This causes each element to be split into $name and $score, and then I build a two-element anonymous list from the $score and the original value $_. These are collected into a new list. If @data had five elements, then @pairs has five elements, each of which is a reference to a two-element anonymous list. Ouch!

The next step is to sort the @pairs list. Within the sort subroutine, $a and $b will be references to two-element lists. The second element of each list is the sort key, and is addressed like $a->[1]. So, we get a sort subroutine like this:

        sub mangle {
                $a->[1] <=> $b->[1];
        }

and the sort looks like this:

        @sorted = sort mangle @pairs;

Now, @sorted is still the same pairs of data, but sorted according to the scores (did you forget we were still working with the scores?). I have to peel away the anonymous lists to get back the original data, while still preserving the order. Easy -- map to the rescue again:

        @finally = map {
                $_->[0];
        } @sorted;

This is because $_ will be each element of @sorted -- a reference to an anonymous list, and therefore $->[0] will fetch the first element of the anonymous list pointed to by $_, which is the original data. Whew!

Of course, in the Perl tradition, I can shove all this stuff together in a very lisp-like way. You've got to read this back to front to see what is happening:

        #!/usr/bin/perl
        print
                map { $_->[0] }
                sort { $a->[1] <=> $b->[1] }
                map {
                        ($name,$score) = split;
                        [$_,$score];
                } <>;

Eeek. But hey, it works!

One last optimization: I can put the split directly inside the anonymous list creation:

        #!/usr/bin/perl
        print
                map { $_->[0] }
                sort { $a->[1] <=> $b->[1] }
                map { [$_, (split)[1] ] }
                <>;

which works because the split here is being pulled apart by a ``literal slice'' -- only the second element of the list remains after we slice it up.

Perl provides some powerful sorting techniques, which can really be a boon once mastered. I hope I have inspired you more than I've confused you. Next time, something entirely different.

 

Reference

sort - sort a list of values

sort SUBNAME LIST

sort BLOCK LIST

sort LIST

Sorts the LIST and returns the sorted list value. If SUBNAME or BLOCK is omitted, sort()s in standard string comparison order. If SUBNAME is specified, it gives the name of a subroutine that returns an integer less than, equal to, or greater than 0, depending on how the elements of the array are to be ordered. (The <=> and cmp operators are extremely useful in such routines.) SUBNAME may be a scalar variable name (unsubscripted), in which case the value provides the name of (or a reference to) the actual subroutine to use. In place of a SUBNAME, you can provide a BLOCK as an anonymous, in-line sort subroutine.

In the interests of efficiency the normal calling code for subroutines is bypassed, with the following effects: the subroutine may not be a recursive subroutine, and the two elements to be compared are passed into the subroutine not via @_ but as the package global variables $a and $b (see example below). They are passed by reference, so don't modify $a and $b. And don't try to declare them as lexicals either.

You also cannot exit out of the sort block or subroutine using any of the loop control operators described in the perlsyn manpage or with goto().

When use locale is in effect, sort LIST sorts LIST according to the current collation locale. See the perllocale manpage.

Examples:

 

    # sort lexically
    @articles = sort @files;

 

    # same thing, but with explicit sort routine
    @articles = sort {$a cmp $b} @files;

 

    # now case-insensitively
    @articles = sort {uc($a) cmp uc($b)} @files;

 

    # same thing in reversed order
    @articles = sort {$b cmp $a} @files;

 

    # sort numerically ascending
    @articles = sort {$a <=> $b} @files;

 

    # sort numerically descending
    @articles = sort {$b <=> $a} @files;

 

    # sort using explicit subroutine name
    sub byage {
        $age{$a} <=> $age{$b};  # presuming numeric
    }
    @sortedclass = sort byage @class;

 

    # this sorts the %age hash by value instead of key
    # using an in-line function
    @eldest = sort { $age{$b} <=> $age{$a} } keys %age;

 

    sub backwards { $b cmp $a; }
    @harry = ('dog','cat','x','Cain','Abel');
    @george = ('gone','chased','yz','Punished','Axed');
    print sort @harry;
            # prints AbelCaincatdogx
    print sort backwards @harry;
            # prints xdogcatCainAbel
    print sort @george, 'to', @harry;
            # prints AbelAxedCainPunishedcatchaseddoggonetoxyz

 

    # inefficiently sort by descending numeric compare using
    # the first integer after the first = sign, or the
    # whole record case-insensitively otherwise

 

    @new = sort {
        ($b =~ /=(\d+)/)[0] <=> ($a =~ /=(\d+)/)[0]
                            ||
                    uc($a)  cmp  uc($b)
    } @old;

 

    # same thing, but much more efficiently;
    # we'll build auxiliary indices instead
    # for speed
    @nums = @caps = ();
    for (@old) {
        push @nums, /=(\d+)/;
        push @caps, uc($_);
    }

 

    @new = @old[ sort {
                        $nums[$b] <=> $nums[$a]
                                 ||
                        $caps[$a] cmp $caps[$b]
                       } 0..$#old
               ];

 

    # same thing using a Schwartzian Transform (no temps)
    @new = map { $_->[0] }
        sort { $b->[1] <=> $a->[1]
                        ||
               $a->[2] cmp $b->[2]
        } map { [$_, /=(\d+)/, uc($_)] } @old;

If you're using strict, you MUST NOT declare $a and $b as lexicals. They are package globals. That means if you're in the main package, it's

 

    @articles = sort {$main::b <=> $main::a} @files;

or just

 

    @articles = sort {$::b <=> $::a} @files;

but if you're in the FooPack package, it's

 

    @articles = sort {$FooPack::b <=> $FooPack::a} @files;

The comparison function is required to behave. If it returns inconsistent results (sometimes saying $x[1] is less than $x[2] and sometimes saying the opposite, for example) the results are not well-defined.

 



Copyright © 1996-2007 by Dr. Nikolai Bezroukov. www.softpanorama.org was created as a service to the UN Sustainable Development Networking Programme (SDNP) in the author free time. Submit comments This document is an industrial compilation designed and created exclusively for educational use and is placed under the copyright of the Open Content License(OPL). Original materials copyright belong to respective owners. Quotes are made for educational purposes only in compliance with the fair use doctrine.

Standard disclaimer: The statements, views and opinions presented on this web page are those of the author and are not endorsed by, nor do they necessarily reflect, the opinions of the author present and former employers, SDNP or any other organization the author may be associated with. We do not warrant the correctness of the information provided or its fitness for any purpose.

Last modified: March 25, 2008