## Chapter 9. Files and Directories

• Working with Files

• Working with Directories

• Understanding Unicode

• Useful File Manipulation Modules

Up to this point, except for a brief discussion of @ARGV in Chapter 3, Variables, the data in your program has been embedded in the program itself. That’s not very useful. In the real world, we’re constantly reading data from files, Web services, databases and a variety of other sources. Here we’ll introduce you to the basics of reading from and writing to files and directories.

## Basic file handling

As you probably know by now, most common operating systems have their data internally organized around files and directories. Even if the data is stored in a database, it’s probably represented as files somewhere. Perl makes it very easy to read and write those files, and we’ll show you the most common ways of doing that.

### Opening and reading a file

For this section type the following into a file named targets.txt in a directory named chapter_9.

James|007|Spy
Number 6|6|Ex-spy
Agent 99|99|Spy with unknown name
Napoleon Solo|11|Uncle spy
# This guy is only rumored to exist. Not everyone believes it.
Unknown|666|Maybe a spy

Those are names, case numbers and bizarre job titles for people your overly optimistic intelligence agency wishes to interrogate.

To open a file, you use the open() builtin. The two most common forms of open() are:

open FILEHANDLE, MODE, FILENAME
open FILEHANDLE, FILENAME

The first syntax above is referred to as the “three argument open” and the second is the “two argument open”. The second is an older version of open() and it’s generally frowned upon today, but we’ll explain it just a bit so you can understand it if you see it in legacy code (if someone is still writing using the two argument open today, it’s either because they must support a version of Perl prior to version 5.6 or they don’t know any better).

The arguments to open() are:

• FILEHANDLE — the identifier you will use elsewhere to read or write to the file

• MODE — specifies if you are opening the file to read and/or write to it

• FILENAME — mostly, just what it looks like, the name of the file in your system

### Note

See perldoc -f open for way more information than you expected. perldoc perlopentut is good, too. If you need fine-grained control over how to open files (such as dying if you try to open for writing a file that already exists), see perldoc -f sysopen. It’s also explained in detail in perlopentut.

To open a file in read mode, use the < sign for the mode. Here’s what it looks like:

my $filename = 'chapter_9/targets.txt';
open my $spies_to_espy, '<', $filename
    or die "Cannot open '$filename' for reading: $!";

That is a lot of new stuff at once, so we will break it down carefully. The my $spies_to_espy variable contains the filehandle that you will use to access the contents of $filename. Like variables, a filehandle with a descriptive name leads to clearer code. You will find that filehandle is commonly abbreviated as $fh.

The < tells Perl you’re going to open the file for reading.

If the attempt to open the file fails, the open() builtin returns false and sets the special $! variable. $! contains a human-readable description of the error. You can print $! to provide an error message. If the above file does not exist, that might print:

Cannot open 'chapter_9/targets.txt' for reading: No such file or directory at my_program.pl line 17.

When using open() and other related functions, always include the or die section at the end. Otherwise, Perl may ignore the error and silently Do The Wrong Thing and that would be disappointing. To automate remembering, you can install the handy autodie module from the CPAN and it will take care of this for you:

use autodie;
my $filename = 'chapter_9/targets.txt';
open my $filehandle, '<', $filename;

If open() fails, you’ll get a virtually identical error message to the one above.

The autodie module was included with Perl as of version 5.10.1, so if you have that version of Perl or newer, you won’t need to install it separately.
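As a minimal sketch of how a failure surfaces under autodie, the following wraps a failing open() in eval and captures the error text. The helper name try_open and the file path are hypothetical, chosen only for illustration.

```perl
use strict;
use warnings;
use autodie;    # open(), close() and friends now die on failure

# Hypothetical helper: returns undef on success, or the error text on failure.
sub try_open {
    my $filename = shift;
    my $error;
    eval {
        open my $fh, '<', $filename;    # no "or die" needed under autodie
        close $fh;
        1;
    } or $error = "$@";
    return $error;
}

# This path is an assumption and is expected not to exist.
my $error = try_open('chapter_9/no_such_file.txt');
print defined $error ? "open failed: $error\n" : "open succeeded\n";
```

The eval block is only there so the sketch can show the message. In a normal program you would usually let the exception end the program.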

### Note

Windows, and some other operating systems, use the backslash, \, as a file name delimiter. This could be an issue in Perl, which uses the \ to specify characters like tab, \t, and newline, \n.

When you attempt to do something like:

my $filename = "chapter_9\targets.txt";

In a double quoted string, the \t is the tab character but your file name is probably not chapter_9<TAB>argets.txt. You could escape the \ like this:

my $filename = "chapter_9\\targets.txt";

But that can quickly start to get really ugly:

my $file = "path\\to\\some\\$file";

In a Perl program just use forward slashes and internally Perl will Do The Right Thing for your operating system.

my $file = "path/to/some/$file";

or

my $file = "C:/path/to/some/$file";

That’s much cleaner.
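If you would rather not hard-code any separator at all, the core File::Spec module can join the pieces for you. This is only a sketch. The helper name path_to and the directory names are assumptions.

```perl
use strict;
use warnings;
use File::Spec;

# Hypothetical helper: join directory names and a file name using
# whatever separator the current operating system expects.
sub path_to {
    my @parts = @_;
    return File::Spec->catfile(@parts);
}

my $file = path_to( 'path', 'to', 'some', 'targets.txt' );
print "$file\n";
```

For short, fixed paths the forward-slash form shown above is usually enough. File::Spec is more useful when the pieces of the path come from variables.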

Now that we have opened the file, let’s read from it and print the name, case number and description of each record. Example 9.1, “Reading and parsing a file” shows the code to do this.

Example 9.1. Reading and parsing a file

use strict;
use warnings;
use diagnostics;
my $filename = 'chapter_9/targets.txt';
open my $spies_to_espy, '<', $filename
    or die "Cannot open '$filename' for reading: $!";
while ( my $line = <$spies_to_espy> ) {
    next if $line =~ /^\s*#/;    # skip comments!
    chomp($line);
    my ( $name, $case_number, $description ) = split /\|/, $line;
    print "$name ($case_number): $description\n";
}

### Note

The sharp-eyed among you may have wondered what’s really going on with using a while loop and a filehandle. What if the filehandle just returns an empty string or some other value that evaluates to false? It still Just Works because when reading filehandles in a while loop, Perl magically converts it as follows:

while ( my $line = <$fh> ) { ... }
# becomes
while ( defined ( my $line = <$fh> ) ) { ... }

Remember that the assignment, my $line = <$fh>, will return the value of the entire expression and the filehandle can only return undef at EOF (end of file). Thus, the while loop works. This behavior happens because Perl knows that’s what you really need here. Don’t rely on this behavior for other uses of while.
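To see the implicit defined() at work, here is a small sketch that reads from an in-memory filehandle whose last line is the string "0" with no trailing newline. The helper name count_lines is hypothetical.

```perl
use strict;
use warnings;

# Hypothetical helper: count the lines readable from a string, using an
# in-memory filehandle so the sketch needs no real file.
sub count_lines {
    my $text = shift;
    open my $fh, '<', \$text
        or die "Cannot open in-memory file: $!";
    my $count = 0;
    while ( my $line = <$fh> ) {    # implicitly wrapped in defined()
        $count++;
    }
    close $fh or die "Cannot close in-memory file: $!";
    return $count;
}

# The last "line" is "0", which is false in boolean context.
# The loop still sees it, so both lines are counted.
print count_lines("first\n0"), "\n";
```

Without the implicit defined(), the loop would stop at the "0" line and report one line instead of two.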

The $/ variable defaults to whatever the newline character is for your operating system. For Windows, this is the carriage return plus line feed (\r\n). For Unix-like systems such as Linux, Mac OS X, AIX and so on, it’s just the line feed character (\n) and for versions of Mac OS prior to OS X, it’s just the carriage return (\r). Other operating systems may use different characters, but Perl takes care of this for you.

### Note

If you have a file from another operating system or that delimits “records” with a different character, you can assign to the $/ variable to ensure that lines are split correctly. Just be sure to use the local() builtin with it to avoid having other parts of your system picking up the new value.

You can also read in an entire file into a scalar by setting $/ to undef. This is often referred to as “slurp mode”. Just using a bare local $/; will set $/ to an uninitialized value:

my $file_contents = slurp('chapter_9/targets.txt');
print $file_contents;

sub slurp {
    my $file = shift;
    open my $fh, '<', $file
        or die "Cannot open '$file' for reading: $!";
    local $/;
    my $contents = <$fh>;
    return $contents;
}

That’s written for clarity. However, you’ll often see it written like this:

sub slurp {
    my $file = shift;
    open my $fh, '<', $file
        or die "Cannot open '$file' for reading: $!";
    return do { local $/; <$fh> };
}

The next line of code should be clear. Comments in the file are marked with a leading # symbol, and we skip them. The \s* allows you to have zero or more whitespace characters in front of the # symbol.

next if $line =~ /^\s*#/;    # skip comments!

Then we have the chomp():

chomp($line);

As you may recall from Chapter 4, Working With data, chomp() removes anything matching $/ from the end of the variable. In this case, we don’t strictly need to do this because we’re adding the newline back in when we print the data, but it is a very good habit to get into. You often store data in variables and probably do not want the line separator.

Then we split the line on the pipe character, |. Because split() expects a regular expression as its first argument and the | is used for alternation, we need to escape it to match a literal pipe character.

my ( $name, $case_number, $description ) = split /\|/, $line;

And finally we print our results:

print "$name ($case_number): $description\n";

The final line closes our filehandle:

close $fh or die "Could not close '$filename': $!";

Note that if the filehandle falls out of scope, Perl will close the filehandle for you. You’ll see many programs take advantage of this feature and not close their filehandles.

Note that in the condition of a while loop, the <> operator assigns to $_ by default, so you can omit the my $line = if you prefer:

while (<$fh>) {
    next if /^\s*#/;    # skip comments!
    chomp;
    my ( $name, $case_number, $description ) = split /\|/, $_;
    print "$name ($case_number): $description\n";
}
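If you want to keep the records rather than print them straight away, the same loop can collect them into a data structure. The following is a sketch only. The helper name read_targets and the hash keys are assumptions, and an in-memory file stands in for targets.txt so the code runs on its own.

```perl
use strict;
use warnings;

# Hypothetical helper: read pipe-delimited records from an open filehandle
# and return a list of hash references, skipping comment lines.
sub read_targets {
    my $fh = shift;
    my @targets;
    while ( my $line = <$fh> ) {
        next if $line =~ /^\s*#/;    # skip comments
        chomp $line;
        my ( $name, $case_number, $description ) = split /\|/, $line;
        push @targets, {
            name        => $name,
            case_number => $case_number,
            description => $description,
        };
    }
    return @targets;
}

# An in-memory file stands in for chapter_9/targets.txt here.
my $sample = "James|007|Spy\n# a comment\nNumber 6|6|Ex-spy\n";
open my $fh, '<', \$sample or die "Cannot open in-memory file: $!";
my @targets = read_targets($fh);
close $fh or die "Cannot close in-memory file: $!";

print "$_->{name} ($_->{case_number}): $_->{description}\n" for @targets;
```

Passing the filehandle in, instead of a filename, lets the same function read from a real file, an in-memory string or any other handle.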

#### Reading Files the Wrong Way

For versions of Perl prior to version 5.6 (released over a decade ago!), you will often see this syntax:

open FH, $filename
    or die "Cannot open '$filename' for reading: $!";

Or:

open FH, "<$filename"
    or die "Cannot open '$filename' for reading: $!";

This combines a few practices that are today considered very bad. The FH looks like a bareword and should not be allowed with use strict, but in this instance, it’s considered to be a typeglob. You use it like a normal filehandle:

while ( my $line = <FH> ) { ... }

This is considered bad practice because typeglobs are global and there can be some very strange bugs associated with other portions of your program messing with global variables. Imagine trying to debug what’s going wrong with this:

open FH or die $!;

That’s perfectly legal and it might just open a file in read mode, but we won’t cover this monstrosity here (again, see perlopentut for the gory bits).

Note that we’re also using the two argument form of open() in this bad example:

open FH, $filename;
# and
open FH, "<$filename";

For the first, we’ve simply omitted the < mode. If that’s left off, Perl assumes read mode. For the second, it’s included in the string, along with the filename. That does the same thing. It has to do with making this seem a bit more familiar to Unix programmers, but suffice it to say that it’s strongly discouraged today. Some operating systems allow filenames to start with characters that might be interpreted by Perl as changing the mode of the file open. Thus, in the good ol’ days, a simple open FH, $filename may have very unexpected behavior. Don’t do that. Stick with the three argument open.

### Note

For more information on typeglobs, see “Typeglobs and Filehandles” in perldoc perldata.

#### Writing Files

Writing files has a similar syntax, but we use > to open the file in “write mode”. If you wish to append to a file, use >>. So to add Maxwell Smart as a new “target” in targets.txt, you could write the following:

open my $fh, '>>', $filename
    or die "Cannot open '$filename' for appending: $!";
print $fh "Maxwell Smart|86|Definitely a spy\n";

And now the file should contain:

James|007|Spy
Number 6|6|Ex-spy
Agent 99|99|Spy with unknown name
Napoleon Solo|11|Uncle spy
# This guy is only rumored to exist. Not everyone believes it.
Unknown|666|Maybe a spy
Maxwell Smart|86|Definitely a spy

print $fh "Maxwell Smart|86|Definitely a spy\n";

Note that there’s no comma after the $fh. That’s what lets Perl know that $fh is a filehandle it’s printing to instead of something to print. So if you see something like this when you weren’t expecting any output:

GLOB(0xbfe220)Maxwell Smart|86|Definitely a spy

You probably put a comma after the filehandle, telling Perl that it’s something to print instead of a filehandle to print to.

If you want, you can rewrite the file by reading it and then writing to it. Let’s sort the lines of the file and strip the comments from it. Here’s one way of doing that.

my $filename = 'chapter_9/targets.txt';
open my $fh, '<', $filename
    or die "Cannot open '$filename' for reading: $!";
# each element in @lines gets one line from the file
# remember grep from Chapter 4?
my @lines = sort grep { !/^\s*#/ } <$fh>;
close $fh or die "Cannot close '$filename': $!";
# reuse $fh rather than declaring it a second time in the same scope
open $fh, '>', $filename
    or die "Cannot open '$filename' for writing: $!";
print $fh @lines;
close $fh or die "Cannot close '$filename': $!";

Once again, this code is building on everything you’ve learned so far. There’s nothing too magical here.

There is another way of rewriting a file. We need four things: seek(), tell(), truncate() and read-write mode.

To open a file in read-write mode, prepend the mode with a +. In this case, we’ll use +< mode. There is a corresponding +> mode, but you should probably never use it because it will delete the contents of your file first. That’s probably not very helpful. Here’s our new program.

my $filename = 'chapter_9/targets.txt';
open my $fh, '+<', $filename
    or die "Cannot open '$filename' in read-write mode: $!";
my @lines = sort grep { !/^\s*#/ } <$fh>;
seek $fh, 0, 0 or die "Cannot seek '$filename', 0, 0: $!";
print $fh @lines;
truncate $fh, tell($fh)
    or die "Cannot truncate '$filename': $!";
close $fh or die "Cannot close '$filename': $!";

The seek() function has the following syntax:

seek FILEHANDLE, OFFSET, STARTINGAT

The values for STARTINGAT are

• 0 — set the new position in bytes to OFFSET

• 1 — set the new position to the current position plus OFFSET

• 2 — set the new position to the end of file plus OFFSET (this is usually a negative value).

The tell() function returns the position of the filehandle, in bytes. The truncate() builtin tells Perl to truncate the file at the given position. This may seem a bit confusing, but it’s what Perl needs to know to handle this.

And again, don’t forget that you can use autodie to make this simpler:

use autodie;
my $filename = 'chapter_9/targets.txt';
open my $fh, '+<', $filename;
my @lines = sort grep { !/^\s*#/ } <$fh>;
seek $fh, 0, 0;
print $fh @lines;
truncate $fh, tell($fh);
close $fh;

Though your author usually uses autodie, we’re avoiding it in examples to constantly remind you to check the success or failure of your system calls. As usual, see perldoc -f for the various functions to learn more about them.
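To see how OFFSET and STARTINGAT interact, here is a small sketch that seeks around a ten-byte in-memory file. It uses the SEEK_SET, SEEK_CUR and SEEK_END constants from the core Fcntl module as readable names for 0, 1 and 2. The helper name read_at is hypothetical.

```perl
use strict;
use warnings;
use Fcntl qw(SEEK_SET SEEK_CUR SEEK_END);    # named versions of 0, 1 and 2

my $data = "0123456789";
open my $fh, '<', \$data or die "Cannot open in-memory file: $!";

# Hypothetical helper: seek, then read $length bytes from the new position.
sub read_at {
    my ( $handle, $offset, $whence, $length ) = @_;
    seek $handle, $offset, $whence or die "Cannot seek: $!";
    read $handle, my $buffer, $length;
    return $buffer;
}

print read_at( $fh, 2, SEEK_SET, 3 ), "\n";     # 3 bytes, starting at byte 2
print "position is now ", tell($fh), "\n";
print read_at( $fh, 1, SEEK_CUR, 2 ), "\n";     # skip 1 byte, then read 2
print read_at( $fh, -3, SEEK_END, 3 ), "\n";    # the last 3 bytes
```

The named constants do the same job as the bare numbers used above. They are simply easier to read when you return to the code later.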

### File test operators

When you’re working with files or directories, you often want to know things about them first. For example, you might want to see if a file exists before trying to read it. The -e filetest operator does this. We’ll also use the -f operator to find out if it’s a file.

my $filename = 'somefile';
if ( -e $filename && -f $filename ) { ... }

Every time you use a filetest operator, the system makes another stat() call (see perldoc -f stat) and this can be expensive, so Perl lets you use a special filehandle named _. When a filetest operator is used, subsequent filetest operators can use _, which contains the results from the last stat() call. This is generally much less expensive, particularly if you’re stacking many filetest operators:

# does it exist? Is it a file? Is it readable?
if ( -e $filename && -f _ && -r _ ) { ... }

Also, if you’re using Perl 5.9.1 or better, you can stack the operators and write the above as:
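The stacked form writes the operators one after another in front of a single operand. The following is a minimal sketch of it. The helper name is_readable_file is hypothetical, and a temporary file from File::Temp is used only so there is something known to exist.

```perl
use strict;
use warnings;
use File::Temp 'tempfile';

# Hypothetical helper: true if $filename exists, is a plain file and is
# readable. Stacked filetest operators need a Perl that supports them.
sub is_readable_file {
    my $filename = shift;
    return ( -e -f -r $filename ) ? 1 : 0;
}

# A temporary file gives us something that is known to exist.
my ( $fh, $filename ) = tempfile( UNLINK => 1 );

print is_readable_file($filename)           ? "yes\n" : "no\n";
print is_readable_file("$filename.missing") ? "yes\n" : "no\n";
```

The stacked operators give the same answer as the version that uses the _ filehandle, with less to type.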

### Note

We’ve mentioned that while (<>) is the same as while (defined($_ = <ARGV>)). But how do we know this? Perl has a very handy module named B::Deparse. The B:: modules are “Backend” modules and they let you see some things about Perl that are normally not visible. In this case, let’s use B::Deparse to “deparse” the while (<>) construct.

perl -MO=Deparse -e 'while (<>) {}'

That prints out:

while (defined($_ = <ARGV>)) {
    ();
}
-e syntax OK

You can see the changed code and you’ll also note that it’s been neatly formatted. B::Deparse has a number of interesting options to help you better understand complicated code. See perldoc B::Deparse for more information. The -M switch for Perl tells it to load the module requested, in this case the mysteriously named O (that’s the letter ‘oh’, not the number ‘zero’). See perldoc O to understand how that loads B::Deparse. And if you’re really brave, see perldoc B for a better understanding of the B:: modules, but be warned: it’s dense.

### Temporary Files

Sometimes you need to create temporary files that disappear when your program ends. For example, you may want to filter a file, but write it out to a tempfile first. Other times, you may want to create a tempfile and feed it to another program. There are several ways to do this, but we’ll use the File::Temp module as it’s fairly common.

use File::Temp 'tempfile';
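As a minimal sketch of tempfile() in use, the following writes one line to a temporary file and reads it back. The SUFFIX value and the variable names are assumptions made for illustration.

```perl
use strict;
use warnings;
use File::Temp 'tempfile';

# tempfile() returns an open read-write filehandle and the file's name.
# UNLINK => 1 asks File::Temp to remove the file when the program exits.
my ( $fh, $filename ) = tempfile( SUFFIX => '.txt', UNLINK => 1 );

print $fh "Maxwell Smart|86|Definitely a spy\n";
close $fh or die "Cannot close '$filename': $!";

# The file still exists until the program ends, so we can read it back.
open my $in, '<', $filename
    or die "Cannot open '$filename' for reading: $!";
my $first_line = <$in>;
close $in or die "Cannot close '$filename': $!";

print "Read back from $filename: $first_line";
```

Because tempfile() picks the name for you, you avoid clashing with a file that some other program has already created.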
}
else {
    $config{$key} = $value;
}
}
print Dumper(\%config);
__DATA__
# max_tries = 3
max_tries = 2
timeout = 30
# only these people are OK
user = Ovid
user = Sally
user = Bob

### Note

reading_from_data.pl available for download at Wrox.com.

Running the code in Example 9.2, “Reading DATA” will print something similar to:

$VAR1 = {
'max_tries' => '2',
'timeout' => '30',
'user' => [
'Ovid',
'Sally',
'Bob'
]
};
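A parser that produces a structure like this one could be sketched as follows. This is not the book’s listing. The helper name read_config and the parsing rules are assumptions, and an in-memory filehandle is used in place of DATA so the sketch is easy to run on its own.

```perl
use strict;
use warnings;
use Data::Dumper;

# Hypothetical helper: parse "key = value" lines from a filehandle.
# A key that appears more than once collects its values in an array reference.
sub read_config {
    my $fh = shift;
    my %config;
    while ( my $line = <$fh> ) {
        next if $line =~ /^\s*#/;    # skip comments
        next if $line !~ /\S/;       # skip blank lines
        chomp $line;
        my ( $key, $value ) = $line =~ /^\s*(\w+)\s*=\s*(.*?)\s*$/
            or die "Cannot parse config line: '$line'";
        if ( !exists $config{$key} ) {
            $config{$key} = $value;
        }
        elsif ( ref $config{$key} ) {
            push @{ $config{$key} }, $value;
        }
        else {
            $config{$key} = [ $config{$key}, $value ];
        }
    }
    return \%config;
}

my $text = <<'END';
# max_tries = 3
max_tries = 2
timeout = 30
# only these people are OK
user = Ovid
user = Sally
user = Bob
END

open my $fh, '<', \$text or die "Cannot open in-memory file: $!";
my $config = read_config($fh);
close $fh or die "Cannot close in-memory file: $!";

print Dumper($config);
```

To read from a real DATA section instead, you would pass \*DATA to the function in place of the in-memory filehandle.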

In this case, we’ve used the DATA section of our code to embed a tiny config file. As a general rule, you can only read from the DATA section once, but if you really need to read from it more than once:

# Find the start of the __DATA__ section
my $data_start = tell DATA;
while ( <DATA> ) {
    # do something
}
# Reset DATA filehandle to start of __DATA__
seek DATA, $data_start, 0;

In case you’re wondering, yes, you can also write to the DATA section if you have the correct permission but this is generally a bad idea and is left as an exercise for the foolhardy (hint: if you get it wrong, you can overwrite your program).

### Note

The example of using a DATA section for configuration works, but be aware that this is only to show you how __DATA__ works. There are plenty of useful modules on the CPAN for handling configuration files. Some very popular ones are AppConfig, Config::General, Config::Std and Config::Tiny. You could still keep your config in the DATA section, but you really want it to be in a separate file as this is something that others are likely to need to read and edit.

### binmode

When working with text files, opening the file and reading and writing to it is generally handled transparently. However, what happens if you open a file written on a Linux system and being read on a Windows system? As we explained earlier, the $/ variable defaults to the newline character, but that is \n on Linux and \r\n on Windows. Perl silently converts newline characters to the appropriate newline character for your operating system. This means that reading and writing text files (such as XML or YAML documents) works transparently, regardless of the operating system you are on.

What happens if you’re working with a binary file, such as an image? You don’t want Perl to try and “fix” the newlines, so you open the file and use the binmode builtin.

my $image = 'really_cool.jpg';
open my $fh, '<', $image
    or die "Cannot open '$image' for reading: $!";
binmode $fh;    # treat it as a binary file

With the above code, you don’t have to worry about newlines being translated.

### Note

See perldoc -f binmode for more information.

The binmode builtin accepts an optional “layer” description (older versions of Perl referred to this as the “discipline”). The :raw layer is the default when no layer is given, so the following two lines are equivalent:

binmode $fh;
binmode $fh, ':raw';

If you want to tell Perl that the file is UTF-8 (covered later in this chapter), you can use the :encoding(UTF-8) layer:

my $kanji_examples = 'kanji.txt';
open my $fh, '<', $kanji_examples
    or die "Cannot open '$kanji_examples' for reading: $!";
closedir $dh or die "Cannot close '$directory': $!";

### Warning

Do not be tempted to think that readdir() only returns files and directories. Depending on what your operating system supports, an entry might be a symbolic link (-l), a named pipe (-p) or a socket (-S). We will generally not be covering those in this book, but you should be aware of this as it’s a common beginner mistake.

Note that opendir() does not have a three-argument form. You do not “write” to directories, though you can certainly create directories and files in them.

### Globbing

You can also use the File::Glob module to “glob” directories. This uses the common file globbing semantics. For example, *.txt will match any file with a .txt extension. You can use the glob() builtin or the angle brackets for this behavior.

### Note

See perldoc File::Glob for more information on glob() and <>.

The following are three equivalent ways of listing all directory entries with a .txt extension. We’ll start using autodie to make our life simpler.

Using opendir():

use strict;
use warnings;
use autodie;
my $dir = 'drafts/';
opendir( my $dh, $dir );
my @txt = grep { /\.txt$/ } readdir($dh);
print join "\n", @txt;
closedir $dh;

Using glob():

use strict;
use warnings;
use autodie;
my $dir = 'drafts';
my @txt = glob("$dir/*.txt");
print join "\n", @txt;

Using <>:

use strict;
use warnings;
use autodie;
my $dir = 'drafts';
my @txt = <$dir/*.txt>;    # no quotes!
print join "\n", @txt;

### Note

Typeglobs and fileglobs are not the same thing. We apologize for the confusion.

## Unicode

If a coworker asks why their programming language doesn’t compile and you don’t recognize the programming language, you already know the basic problem with Unicode: when Perl is processing data, it needs to know what character set it is encoded as. As the world becomes more interconnected, it’s increasingly important that different systems are able to communicate correctly. We are introducing this now because as you’re reading and writing files, it’s becoming increasingly common to find that those files are not ASCII or Latin-1, as many developers assume (or more correctly, many developers aren’t aware of the issues).

### Warning

Any version of Perl prior to 5.6 is broken by default for Unicode. 5.12 is sometimes considered the minimum “safe” version and 5.14 offers a level of Unicode support that few other languages can equal.

### What is Unicode?

In the good ol’ days of programming (arbitrarily defined as “when your author was growing up”), aspiring programmers were typing game programs directly from the BASIC listing in programming magazines. These programs were written in ASCII, the American Standard Code for Information Interchange. Back then, characters tended to be represented by 7 or 8 bits of data. ASCII characters took seven bits of data, with values ranging from 0 to 127. Eight bit numbers could use characters from 128 to 255. Different systems often represented the 128 to 255 numbers in different ways and were sometimes referred to as extended ASCII. You might have had interesting graphic figures or you may have had accented characters.

But what did the Japanese do when they wanted to write 日本国? Clearly having only 256 characters is not enough for many writing systems.

The Unicode standard is a way of describing every character in every writing system with a single number.
This number is called a code point and it’s comprised of one or more octets. We use the word octet to refer to eight bits, so all characters that can be represented by the numbers 0 to 255 take up one octet of space. Your author’s wife is French and her first name is Leïla. The ï in Leïla is represented as the code point U+00EF (the 00EF is hexadecimal). The letters A and a are U+0041 and U+0061, respectively, and 国 is U+56FD.

However, a code point describes a character, but it doesn’t describe the encoding of that character. The EF in code point U+00EF is the decimal number 239. That number can be described in 8 bits as 11101111. Some encodings, such as UTF-8 and UTF-16, encode that in 16 bits (two octets). UTF-32 encodes that in 32 bits (4 octets).

### Note

A bit is a single 0 or 1. 8 bits form an octet. Many people refer to 8 bits as one byte, but in reality, a byte’s length is dependent on the machine you’re running it on, so we use the word octet to avoid ambiguity.

What’s important to remember here is that the code point associated with a character has no relation to the encoding. Any given character encoding (such as UTF-8, UTF-32, and so on) is free to encode any code point in any way it desires, so long as the coding is unambiguous. UTF-8 has an advantage over many other encodings because ASCII characters are represented identically in ASCII and UTF-8, making it backwards-compatible with ASCII. This is why UTF-8 tends to be the dominant encoding for Unicode. If you send ASCII to a system that is expecting UTF-8, it will often work just fine.

That doesn’t tell you, however, how to use Unicode.

### Two simple rules

A typical workflow for a program is:

• Initialization

• Input

• Calculation

• Output

The two simple rules are: decode all of your text input and encode all of your text output. With this, you can ensure that inside of your Perl program, you’re working with Perl’s internal string format and don’t have to worry about errors that occur when you’re trying to concatenate strings in different encodings.

#### Decoding Your Data

Decoding your data means “decode your data to Perl’s internal format”. What is Perl’s internal format? It doesn’t matter. If Perl ever needs to change that internal format, you should not be relying on knowing the details. Suffice it to say that Perl will generally treat your text data as binary data instead of characters until you decode it.

This is the hard part. You have to find out what the encoding of your source data is! So if your data is in 7bit-jis (a Japanese pre-Unicode encoding), you could use the Encode::decode() function to transform it into Perl’s internal format:

use Encode qw(encode decode);
my $string = decode('7bit-jis', $byte_string);

And now Perl will happily handle this for you, including reporting its length correctly.

However, it’s better to not have to decode strings on a string-by-string basis. It’s better to decode them at the source, if possible (thus making it harder to forget). You can use Perl’s IO layers to handle that. One way is to specify the layer with binmode():

open my $fh, '<', $some_file or die $!;
binmode $fh, ':encoding(7bit-jis)';

Or better still, specify it with the mode because it’s harder to miss:

open my $fh, '<:encoding(7bit-jis)', $some_file or die $!;
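To make “decode at the source” concrete, here is a sketch that reads UTF-8 octets through an :encoding(UTF-8) layer and compares the number of octets with the number of characters. The helper name first_line_length is hypothetical, and an in-memory filehandle stands in for a real file. The three characters are written as code point escapes so the sketch does not depend on how the source file itself is saved.

```perl
use strict;
use warnings;
use Encode qw(encode);

# Build the UTF-8 octets for three characters (plus a newline) from their
# code points: U+65E5, U+672C and U+56FD.
my $octets = encode( 'UTF-8', "\x{65E5}\x{672C}\x{56FD}\n" );

# Hypothetical helper: read the first line from a string of octets, decoding
# it with the given encoding layer, and return its length in characters.
sub first_line_length {
    my ( $bytes, $encoding ) = @_;
    open my $fh, "<:encoding($encoding)", \$bytes
        or die "Cannot open in-memory file: $!";
    my $line = <$fh>;
    close $fh or die "Cannot close in-memory file: $!";
    chomp $line;
    return length $line;
}

print "octets:     ", length($octets) - 1, "\n";    # minus the newline
print "characters: ", first_line_length( $octets, 'UTF-8' ), "\n";
```

Because the layer does the decoding as the data is read, every string that comes out of the filehandle is already in Perl’s internal format.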

If you don’t know the encoding of your source data, ask the person who sent you the data. If that fails, Encode (first shipped with Perl 5.7.3) includes the Encode::Guess module. It’s not a bad module, but it’s a “guess” at the encoding. Read the documentation carefully and be aware that it will guess wrong from time to time.

Now that you’ve decoded your data and done fun things with it, you need to encode it back to its original format before you send it along. Not surprisingly, the encode() function from Encode does this for you.

use Encode qw(encode decode);
my $encoded = encode('7bit-jis', $string);

Or again, use the IO layers:

open my $fh, '>:encoding(7bit-jis)', $some_file or die $!;

Then, when you write the data out to the console, a file or some other data sink, it will be encoded correctly.

#### A Typical Unicode Nightmare

So decode your input and encode your output. Not too bad, right? Well, that’s until you try it. First, let’s look at this code snippet.

my $string = '日本国';
my $length = length($string);
print "$string has $length characters\n";

And that prints out (assuming you have the correct font installed):

日本国 has 9 characters

Of course, that’s not true. It has 9 octets, but it clearly has 3 characters. So the first thing that many people do is this:

use utf8;
my $string = '日本国';
my $length = length($string);
print "$string has $length characters\n";

Many people assume that use utf8 means “magically make everything UTF-8”, but that’s not correct. We get the following output:

Wide character in print at /var/tmp/eval_Yrhm.pl line 4.
日本国 has 3 characters

### Note

You can cut-and-paste 日本国 from http://en.wikipedia.org/wiki/Japan since you are unlikely to be able to type those characters directly.

Note the strange Wide character in print warning, but we now have the correct length. The use utf8 pragma only tells Perl that our source code is UTF-8. It doesn’t tell Perl that your output is UTF-8, so Perl is expecting a binary output to the STDOUT filehandle but you’ve sent UTF-8, so let’s fix that.

use utf8;
my $string = '日本国';
my $length = length($string);
binmode STDOUT, ':encoding(UTF-8)';
print "$string has $length characters\n";

And that will give us the correct output with no warnings (note that the Wide Character in Print warning will occur even if you don’t use warnings).

Alternatively, if you don’t want to force STDOUT to be UTF-8, you could just encode the string from Perl’s internal format to UTF-8 and this will also make the warning go away.

use utf8;
use Encode qw(encode decode);
my $string = '日本国';
my $length = length($string);
$string    = encode('UTF-8', $string);
print "$string has $length characters\n";

But we’re still not quite where we want to be in understanding this. The use utf8 pragma tells Perl that your source code is UTF-8, but it doesn’t tell Perl that your input is UTF-8. Try this:

use utf8;
use Encode qw(encode decode);
my $string = shift @ARGV;
my $length = length($string);
$string    = encode('UTF-8', $string);
print "$string has $length characters\n";

If you save that as length.pl and run that with perl length.pl 日本国, you will get output similar to:

æ¥æ¬å½ has 9 characters

You won’t even get a warning. Why? Because you haven’t decoded the data and Perl assumes it’s Latin-1 data (ISO-8859-1) that it already knows how to deal with. When you explicitly decode the data, everything works as expected:

use utf8;
use Encode qw(encode decode);
my $string = decode('UTF-8', shift);
my $length = length($string);
$string    = encode('UTF-8', $string);
print "$string has $length characters\n";

If you are unsure of what encodings your system provides, the following one-liner will print all of them for you:

perl -MEncode -e 'print join "\n" => Encode->encodings(":all")'

### Warning

Be very careful when using the UTF-8 layer. Many Perl references will tell you to do something like this:

binmode STDOUT, ':utf8';

Or this:

open my $fh, '<:utf8', $filename;

This is extremely bad because :utf8 is not the same as :encoding(UTF-8). The :encoding(UTF-8) layer says “this filehandle is guaranteed to be UTF-8” and it will die if you feed it invalid data. The :utf8 layer says “this filehandle is in UTF-8”, but it doesn’t verify that this is true. As a result, programs that use the :utf8 layer can be deliberately fed invalid data and this is a security hole. Do not use the :utf8 layer. Read http://www.perlmonks.org/?node_id=644786 for more information.

### Note

Just because your source code is UTF-8 doesn’t mean that your text editor or IDE is set to recognize or save your source code as UTF-8. Consult your editor’s documentation on how to do this.

Also, your terminal program may not default to UTF-8. Check how to set your terminal’s preferences for displaying UTF-8 data correctly. This is often in a preference entitled “Character Encoding” or something similar. In the event that your terminal cannot handle UTF-8 data, use a modern terminal program.

In the event that your terminal and editor/IDE both claim to handle UTF-8 data correctly and you still see garbage on the screen, you may have to ensure you have the correct fonts installed. You’ll need to consult your operating system’s documentation for how to do this.
### Lots of complicated rules

Before we go further, it’s recommended that you read the following:

perldoc perlunitut
perldoc perlunifaq
perldoc perlunicode
perldoc perluniintro
perldoc Encode

Unfortunately, while the two simple rules will cover general cases, they won’t cover all cases because they can’t, but we’re going to cover a few issues to be aware of.

#### Case Folding

Case folding is converting all of the characters in a string to upper or lower case. This is useful when you want to make case-insensitive comparisons. It’s also often a dangerous thing to do with Unicode. Consider the following program:

use utf8;
binmode(STDOUT, ":encoding(UTF-8)");
print uc("σ"), "\n";    # Greek small letter sigma
print uc("ς"), "\n";    # Greek small final letter sigma

That will print out the same letter twice, an upper-case sigma character:

Σ
Σ

The σ and ς characters are the same lower-case sigma character, but the latter is used at the end of the word. When you call uc() on them, they both resolve to an upper-case sigma, Σ. This leads to this problem:

use utf8;
binmode(STDOUT, ":encoding(UTF-8)");
print lc(uc("σ")), "\n";    # Greek small letter sigma
print lc(uc("ς")), "\n";    # Greek small final letter sigma

That will print σ twice, meaning that case-folding is not round-trip safe in Unicode.

In fact, in earlier versions of Perl, in some cases, characters in the range 128 to 255 would often have strange behavior when you tried to use lc, uc, ucfirst, and so on. When used as characters, they would sometimes be considered Unicode code points and when used as bytes, they could be considered “unassigned characters” and not match \w in regular expressions. The solution is simple:

use feature 'unicode_strings';

Unfortunately, that feature was not added until Perl 5.11.3 (a development release). So today it’s argued that you should really use Perl 5.12 or better (preferably 5.14) if you really want to be “Unicode safe”.
#### Converting Between Encodings

Suppose you need to convert between UTF-16 and ISO-8859-1 (Latin-1). To do this, you must convert from one encoding to Perl’s internal format and then convert to the desired format:

my $string = decode('UTF-16', $utf16_data);
my $latin1 = encode('iso-8859-1', $string);

However, ISO-8859-1 can represent only a small subset of the characters that UTF-16 can, so you may lose data.

#### Wide Character In Print

You’ll see this warning a lot when you’re working with character encodings and you’re not being careful. When this happens, it’s because you haven’t specified your encoding layer. Perl then assumes your data is ISO-8859-1 (for backwards-compatibility) and tries to output UTF-8. Any data that doesn’t fit in the ISO-8859-1 range emits this warning. That’s why you got this warning with this code snippet we used earlier:

use utf8;
my $string = '日本国';
my $length = length($string);
print "$string has $length characters\n";
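
The warning goes away once Perl knows how to turn the characters into bytes. As a rough sketch using only the core Encode module, you can either encode the string yourself before printing or put an encoding layer on the filehandle. The helper name utf8_bytes is hypothetical.

```perl
use strict;
use warnings;
use Encode qw(encode);

# Hypothetical helper: turn a character string into UTF-8 bytes for output.
sub utf8_bytes {
    my ($characters) = @_;
    return encode( 'UTF-8', $characters );
}

# The same three characters as the snippet above, written as escapes
my $string = "\x{65E5}\x{672C}\x{56FD}";
my $length = length($string);

# Option 1: encode by hand; STDOUT receives bytes, so there is no warning
print utf8_bytes("$string has $length characters\n");

# Option 2: declare the layer once, then print character strings directly
binmode STDOUT, ':encoding(UTF-8)';
print "$string has $length characters\n";
```

Use one option or the other for a given filehandle. Printing already-encoded bytes through an encoding layer would encode them twice.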

#### Assuming Everything is UTF-8

The input data may be read from files, the command line, sockets, and other data sources. The output data may be written to STDOUT, files, or other data sinks. To tell Perl that all input and output data is UTF-8, you can set the PERL_UNICODE environment variable to AS. The A and S letter combination is described in the -C section of perldoc perlrun.

Unfortunately, it’s not as simple as setting the environment variable in your code. You must set this before you run your program. On a Linux style system, you can do this:

PERL_UNICODE=AS perl program.pl

Or you can export the variable and it will be set for all programs:

export PERL_UNICODE=AS

On Windows, the syntax is:

set PERL_UNICODE=AS

This can be a hassle to do every time and it may very well be the wrong thing to do if you have non-UTF-8 data.

#### is_utf8()

Sometimes you’ll see this in code:

use Encode 'is_utf8';
if ( is_utf8($string) ) {
    # wrong!
}

Or the identical:

if ( utf8::is_utf8($string) ) {
    # wrong!
}

This does not work as you think it does. The is_utf8() function is used internally to determine if Perl should treat a string as Latin-1 or UTF-8. However, just because the utf8 flag is set does not mean that the string is actually UTF-8. Like the Encode::Guess module, it’s just a guess. Instead of relying on it, explicitly set your encoding layers as described earlier.
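
If what you actually want to know is whether a string of bytes is well-formed UTF-8, one approach is to attempt a strict decode and see whether it fails. This is a sketch under that assumption, not the only way to do it, and the helper name is_valid_utf8_bytes is hypothetical.

```perl
use strict;
use warnings;
use Encode qw(decode FB_CROAK LEAVE_SRC);

# Hypothetical helper: true if $bytes is well-formed UTF-8.
# FB_CROAK makes decode() die on malformed input; LEAVE_SRC asks
# decode() not to modify the string it was given.
sub is_valid_utf8_bytes {
    my ($bytes) = @_;
    return eval { decode( 'UTF-8', $bytes, FB_CROAK | LEAVE_SRC ); 1 } ? 1 : 0;
}

my $good = "\xCE\xA3";    # UTF-8 encoding of an upper-case sigma
my $bad  = "abc\xFF";     # 0xFF can never appear in UTF-8

# The internal flag is off for both byte strings, so it tells us nothing
printf "flag:  %d %d\n",
  ( utf8::is_utf8($good) ? 1 : 0 ), ( utf8::is_utf8($bad) ? 1 : 0 );

# A strict decode distinguishes them
printf "valid: %d %d\n", is_valid_utf8_bytes($good), is_valid_utf8_bytes($bad);
```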

#### A UTF-8 shortcut

If you want a shortcut for assuming that @ARGV, your filehandles and your source code are all UTF-8, you can install the utf8::all module from the CPAN.

use utf8::all;

You may recall this program from earlier:

use utf8;
use Encode qw(encode decode);
my $string = decode('UTF-8', shift);
my $length = length($string);
$string    = encode('UTF-8', $string);
print "$string has $length characters\n";

With the utf8::all pragma, that becomes:

use utf8::all;
my $string = shift @ARGV;
my $length = length($string);
print "$string has $length characters\n";

In other words, it makes it easier to write programs with UTF-8 data. It’s not perfect, but it’s a good start.

#### Printing Unicode

By now you already know how to open your STDOUT to handle printing Unicode, but what about typing those funny characters? Well, you don’t have to. One way of avoiding this is the charnames pragma:

use utf8::all;
use charnames ':short';
# note that double-quoted strings are required
print "\N{greek:Sigma} is an upper-case sigma.\n";

And that prints (with no warning due to utf8::all):

Σ is an upper-case sigma.

The \N{} construct with charnames is resolved at compile-time, so you cannot use variables there.
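
When the name is only known at run time, the charnames module offers charnames::vianame(), which looks up a name and returns its code point. A small sketch of that approach follows. The helper char_from_name is hypothetical, and the name is hard-coded here only to keep the example short.

```perl
use strict;
use warnings;
use charnames ':full';

# Hypothetical helper: look up a Unicode character by its full name at
# run time. Returns undef if the name is unknown.
sub char_from_name {
    my ($name) = @_;
    my $code_point = charnames::vianame($name);
    return defined $code_point ? chr($code_point) : undef;
}

binmode STDOUT, ':encoding(UTF-8)';
my $name = 'GREEK CAPITAL LETTER SIGMA';    # could come from user input
print char_from_name($name), " is an upper-case sigma.\n";
```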

You can also use the Unicode fullnames:

use utf8::all;
use charnames ':full';
print "\N{GREEK SMALL LETTER ETA WITH DASIA AND PERISPOMENI}\n";

Which prints ἧ.

If you know the code point but not the name, you can use \N{U+codepoint}. Again, remember this is done at compile time. For example, the code point for the smiley face character is U+263A, so you can print it with this:

use utf8::all;
print "\N{U+263A}\n";

Or you can just fall back to the chr() function:

print chr(0x263a);

See http://unicode.org/charts/ for a list of the appropriate names you may wish to print.

#### Unicode Character Properties and Regular Expressions

The character ἧ is a Greek letter, but is it upper or lower case? You can try Unicode character properties to find out:

use utf8::all;
my $character = 'ἧ';
if ( $character =~ /\p{Lowercase}/ ) {
    print "$character is lower case\n";
}
if ( $character =~ /\p{Uppercase}/ ) {
    print "$character is upper case\n";
}

That correctly prints ἧ is lower case.

Unicode properties describe something about a character. They might describe the case of the letter, the script used, whether or not it’s a math symbol or punctuation, and so on. Unicode is so all-encompassing (and it must be, since it is trying to handle all writing systems) that you will find many strange things in Unicode land. Here’s one of them:

use utf8::all;
# latin capital letter d with small letter z
my $character = "\N{U+01F2}";
if ( $character =~ /\p{Lowercase}/ ) {
    print "$character is lower case\n";
}
if ( $character =~ /\p{Uppercase}/ ) {
    print "$character is upper case\n";
}
if ( $character =~ /\p{Titlecase_Letter}/ ) {
    print "$character is title case\n";
}

And that prints:

ǲ is title case

This is because the Latin capital letter d with small letter z is considered a Titlecase character and is not upper or lower case. Fun, eh?
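
You can see all three case forms of this letter with the uc(), ucfirst(), and lc() builtins, because ucfirst() applies the titlecase mapping. The sketch below prints code points instead of the characters themselves so that it does not depend on your fonts. The code_point helper is made up for illustration.

```perl
use strict;
use warnings;

# Hypothetical helper: format a one-character string as U+XXXX.
sub code_point {
    my ($character) = @_;
    return sprintf 'U+%04X', ord $character;
}

my $small = "\x{01F3}";    # latin small letter dz

print 'lower: ', code_point( lc $small ),      "\n";
print 'title: ', code_point( ucfirst $small ), "\n";
print 'upper: ', code_point( uc $small ),      "\n";
```

The three lines show three different code points: U+01F3, U+01F2, and U+01F1.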

### Note

See perldoc perluniprop for a full list of the Unicode properties supported and how to use them. See also Chapter 4, Character Properties, of the Unicode version 6 standard: http://www.unicode.org/versions/Unicode6.0.0/ch04.pdf. perldoc perlunicode also has a list of common properties in the “Unicode Character Properties” section.

You can spend a long time understanding Unicode and this section of the book is far too short, but here are a couple of good starting points for understanding Unicode and some of the associated issues.

First, read Joel Spolsky’s famous “The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!)” article at http://www.joelonsoftware.com/articles/Unicode.html

Second, read the Stack Overflow question “Why does modern Perl avoid UTF-8 by default?” at http://stackoverflow.com/questions/6162484/why-does-modern-perl-avoid-utf-8-by-default

In his answer there, Tom Christiansen explains, in depth, many of the traps to be aware of. It’s mind-bending, but it begins to give you an idea of what you’re up against.

Also, http://en.wikipedia.org/wiki/Free_software_Unicode_typefaces has a list of Free Unicode fonts you can install if you’re tired of seeing broken characters when you try to print Unicode.

## Useful Modules

If you start working frequently with the file system, you’ll be happy to know that many Perl modules are available to take away the drudgery. Further, as they get new features added and bugs fixed, they’ll correctly handle issues that you don’t want to have to worry about.

### File::Find

The File::Find module was released with Perl 5. It’s a great module that, unfortunately, is showing its age. You’ll often find when working with Perl that older modules are stable, powerful, and have difficult interfaces. This is because when Perl 5 was released, many people were still experimenting with all of its features and trying to figure out the best way to work with them. File::Find is a module from that era and its interface is clumsy, but it works very well. It has a variety of options, but you have to do most of the work yourself. Here’s one way to delete all empty text files in a directory and its subdirectories.

use File::Find;
find( \&wanted, 'some_directory/' );
sub wanted {
    if ( /\.txt$/ && -f $_ && -z _ ) {
        # only delete empty text files
        unlink $_ or die "Could not unlink '$File::Find::name': $!";
    }
}

You could also have written that as (but this is a touch clumsy):

use File::Find;
find(
    sub {
        if ( /\.txt$/ && -f $_ && -z _ ) {
            unlink $_ or die "Could not unlink '$File::Find::name': $!";
        }
    },
    'some_directory',
);

From the documentation:

find(\&wanted,  @directories);
find(\%options, @directories);

The find() function does a depth-first search over the given @directories in the order they are given. For each file or directory found, it calls the wanted() subroutine. (See below for details on how to use the wanted() function). Additionally, for each directory found, it will chdir() into that directory and continue the search, invoking the wanted() function on each file or subdirectory in the directory.

Every time the wanted() function is called, the following three variables will be set:

• $File::Find::name — Full path to the file or directory found

• $File::Find::dir — Full path to the current directory found

• $_ — The short name of the file or directory found

In this case, the “full path” is relative to the starting directory. When you start a Perl program, its “current directory” is generally the directory you were in when you started the program. However, you can call chdir($some_directory) and Perl will attempt to change its current directory to that directory. Thus, the $_ variable is relative to the current directory that the File::Find::find() function is in at the time. In other words, suppose you write:

find( sub { print "$_ -> $File::Find::name\n" }, 'notes/' );

If there is a file named notes/some_file.txt, the following variables will be set when that file is reached:

• $File::Find::name: notes/some_file.txt

• $File::Find::dir: notes/

• $_: some_file.txt

Because the find() function will change into the directory it’s searching at the time, file test operators and functions such as open and unlink should operate on $_ instead of $File::Find::name. However, the latter is very useful if you need to do error reporting:

# the $_ is optional with unlink as it defaults to $_
unlink $_ or die "Could not unlink '$File::Find::name': $!";

It’s also useful if you need to collect the names for later use:

my @html_docs;
find( \&html_documents, @directories );
sub html_documents {
    push @html_docs, $File::Find::name
      if /\.html?$/;
}

When the find() function is finished, your Perl program’s current directory will be the one you started with, so working with the @html_docs array needs the full paths relative to the current directory and not just the short name in $_.
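
To try this without touching real data, the following sketch builds a small throwaway tree with the core File::Temp module and collects the HTML files in it. The file names and the find_html helper are invented for illustration, and the sketch assumes a system where "/" is the directory separator in the names File::Find reports.

```perl
use strict;
use warnings;
use File::Find;
use File::Path qw(make_path);
use File::Spec;
use File::Temp qw(tempdir);

# Hypothetical helper: return the full names of all .htm/.html files
# under the given directories, sorted.
sub find_html {
    my @directories = @_;
    my @html_docs;
    find(
        sub {
            # $_ is the short name; $File::Find::name includes the path
            push @html_docs, $File::Find::name if -f $_ && /\.html?$/;
        },
        @directories
    );
    return sort @html_docs;
}

# Build a scratch directory that is removed when the program exits
my $root = tempdir( CLEANUP => 1 );
make_path( File::Spec->catdir( $root, 'notes', 'old' ) );
for my $file ( 'index.html', 'notes/todo.txt', 'notes/old/page.htm' ) {
    my $path = File::Spec->catfile( $root, split m{/}, $file );
    open my $fh, '>', $path or die "Cannot create '$path': $!";
    close $fh or die "Cannot close '$path': $!";
}

print "$_\n" for find_html($root);
```

The file test uses $_ because find() has already changed into the directory, while the collected value is $File::Find::name so that the list is still usable after find() returns.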

See perldoc File::Find for many more options for this module.

### File::Path

File::Path was released with Perl 5.001 and lets you manipulate file paths and not just individual files and directories.

use autodie ':all';
use File::Path qw(make_path remove_tree);
make_path('path/to/create/', 'another/path/to/create');
remove_tree('path/to/remove');

Those should be self-explanatory. The latter removes a “tree” because path/to/remove/ may have a complete directory tree underneath it. As with other modules listed here, see the documentation to understand all that it can do. We’ve only covered the basics here. We’ve used autodie to make error handling a bit safer, but the docs show a slightly different approach.
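
The approach shown in the File::Path documentation is to pass an error option, which collects any problems into an array instead of warning or dying. Here is a hedged sketch of that style. The directory names and the make_paths_checked helper are invented, and all of the work happens inside a temporary directory.

```perl
use strict;
use warnings;
use File::Path qw(make_path remove_tree);
use File::Spec;
use File::Temp qw(tempdir);

# Hypothetical helper: create directories and return a list of error
# messages (empty on success) instead of dying.
sub make_paths_checked {
    my @paths = @_;
    make_path( @paths, { error => \my $errors } );
    my @messages;
    for my $diagnostic ( @{ $errors || [] } ) {
        # each entry is a one-pair hash: file name => message
        my ( $file, $message ) = %$diagnostic;
        push @messages,
          $file eq '' ? "general error: $message" : "$file: $message";
    }
    return @messages;
}

my $root   = tempdir( CLEANUP => 1 );
my $target = File::Spec->catdir( $root, 'path', 'to', 'create' );

my @problems = make_paths_checked($target);
print @problems ? "Failed: @problems\n" : "Created $target\n";

remove_tree( File::Spec->catdir( $root, 'path' ) );
print "Removed\n" unless -e File::Spec->catdir( $root, 'path' );
```

Collecting errors this way lets the caller decide whether a failure is fatal, which can be more convenient than an exception when you are creating many paths at once.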

### File::Find::Rule

We haven’t covered object-oriented Perl yet (that’s Chapter 12, Object Oriented Perl), but the File::Find::Rule module is so useful that we’ll briefly explain it. If you don’t understand what’s going on, bookmark this page to return to after you’ve read Chapter 12, Object Oriented Perl.

File::Find::Rule is an excellent alternative to the File::Find module because it has a cleaner syntax that is easier to follow. Our code to find HTML documents becomes this:

my @html_docs = File::Find::Rule
    ->file
    ->name(qr/\.html?$/)
    ->in(@directories);

The -> syntax, as you may recall, is the “dereferencing operator”. In this case it’s also used when we call methods on an object. We’ll cover objects more in Chapter 12, Object Oriented Perl, but for now, be aware that ->file, ->name, and ->in are sort of like subroutine calls. With the File::Find::Rule examples, just note the syntax and try these examples on your own. You’ll understand this better when we cover objects.

Moving along, here’s how to find empty files:

my @empty = File::Find::Rule->file->empty->in(@directories);

You’ll note how naturally that reads. The file() method means “find only files”. The empty() method means “find only empty files” (or directories, if you asked for directories). The in() method means, well, I’m sure you get the idea by now. The name() method seen just a bit earlier takes a glob or regex and returns everything matching that.

So let’s say you’re converting a project from the Subversion source control system to Git and you want to delete all of Subversion’s annoying .svn directories. You could do this:

use File::Path 'remove_tree';
use File::Find::Rule;
my @svn_dirs = File::Find::Rule->directory->name('.svn')->in($dir);
foreach my $svn_dir (@svn_dirs) {
    remove_tree($svn_dir)
      or die "Cannot rmdir($svn_dir): $!";
}

File::Find::Rule also provides an exec() method. Like File::Find, it takes a callback (a sub reference passed to it). Unlike File::Find, it passes relevant variables to the subref as arguments, so the above could be written as:

File::Find::Rule->find->directory->name('.svn')->exec( sub {
    my ( $short_name, $directory, $fullname ) = @_;
    # operate on the short name, report the full name
    remove_tree($short_name)
      or die "Cannot rmdir($fullname): $!";
} )->in(@directories);

If the exec() method is encountered, the $short_name, $directory and $fullname are passed to the subref. These are analogous to the $_, $File::Find::dir and $File::Find::name variables used with File::Find.

Of course, sometimes you prefer an iterator. This is handy when you’re working with a very large directory structure and you want to process everything as it’s encountered rather than waiting for a list to be generated. So instead of this:

my @html_docs = File::Find::Rule->file->name(qr/\.html?$/)->in(@directories);

You could write this:

my $find = File::Find::Rule
    ->file
    ->name(qr/\.html?$/)
    ->start(@directories);
while ( defined ( my $html_document = $find->match ) ) {
    # do something with $html_document
}

Or maybe you want to print all files greater than a half meg?

File::Find::Rule
    ->file
    ->size('>.5M')
    ->exec(sub {
        my ( $short_name, $directory, $fullname ) = @_;
        print "$fullname\n";
    })->in(@ARGV);

Like File::Find, File::Find::Rule has many options, so reading the documentation is very useful.

## Summary

We’ve covered the basics of file and directory manipulation in Perl. You’ve learned how to open files and read and write to them. You’ve learned about file test operators to check for interesting properties about your filesystem and how to use binmode() to tell Perl how it’s supposed to read and write the data in filehandles.

Also, because this is the first chapter to start working with data outside of your program, we’ve introduced Unicode. It’s a complicated topic and one that more and more programmers are expected to understand. Due to the Internet, what was previously a problem encountered by only a handful of people is one that many must now deal with and understand. You will save yourself much grief in your future career by coming to grips with it now.

## Exercises

1. The Unix cat utility takes a list of files as arguments and concatenates them, printing the result to STDOUT. Write this utility in Perl as cat.pl (if you know the Unix cat utility, you don’t have to provide the rest of the behavior).

2. Modify cat.pl to strip comments and blank lines. Consider a comment to be any line in a file that begins with zero or more spaces followed by a # symbol.

3. Write a program, codepoints2char.pl, that will take a list of decimal (not hexadecimal) numbers and print the Unicode character for each. Assume UTF-8. Try running it with:

perl codepoints2char.pl 3232 95 3232

Note that this exercise is problematic because it requires the proper fonts installed for the codepoints you wish to display. The 3232 (U+0CA0) code point is from Kannada, one of the Dravidian languages of India. You may need to search for and install a free Kannada font.

4. Write a program, chars2codepoints.pl, that will take a list of words on the command line and print out, in decimal, their codepoints separated by spaces, with each word’s list of code points on a separate line. You can search Wikipedia for interesting lists of words written in other scripts.

For extra credit, print out the value as a Unicode code point. In other words, decimal 3232 becomes U+0CA0. (Hint: see sprintf() in Chapter 4, Working With data)

