Sample solutions and discussion
Perl Quiz of The Week #14 (20030611)

Write a program which generates an HTML table of contents for a
directory of files.  It should read the directory, producing a list
of the files, and write out a file "toc.html" in that same directory,
in the following format:

Table of Contents for [Directory Name]

Table of Contents for [Directory Name]

file1
file2
... lastfile
The files should be listed in alphabetic order.

The program should accept a command-line argument that tells it what
directory to index; if the argument is omitted, it should index the
current directory.

----------------------------------------------------------------

I only saw one solution posted on the perl-qotw-discuss list, from
Pr. Offer Kaye; I also wrote one myself.

Offer's uses a number of standard modules to take care of escaping
HTML and URLs:

        use strict;
        use warnings;
        use Cwd;
        use URI::file;
        use HTML::Entities;

        my $toc_file = "toc.html";
        if ($#ARGV > 0) {die "Incorrect usage. Use \"$0 dir_name\" or just: \"$0\"\n"}
        my $dir_name = ($#ARGV == 0) ? $ARGV[0] : cwd();
        opendir(DIR, $dir_name) or die "Can't opendir $dir_name: $!\n";
        my @list = sort {lc($a) cmp lc($b)} readdir(DIR);
        chdir $dir_name or die "Couldn't chdir to $dir_name for some reason...\n";
        open(OUT,">$toc_file") or die "Couldn't open $toc_file for writing: $!\n";
        print OUT "Table of Contents for ";
        print OUT encode_entities($dir_name),"\n";
        print OUT "Table of Contents for ";
        print OUT encode_entities($dir_name),"\n";
        for (@list){
          next if (($_ eq ".") or ($_ eq ".."));
          my $u = encode_entities(URI::file->new($_));
          print OUT "\n";
        }
        print OUT "";

Most of this seems completely straightforward and I don't have much
to say about it.  I wonder about using 'cwd()' instead of just '.'.
I suppose '.' might not be portable, but if you want it to be
portable, it would seem to me to be simpler to use
File::Spec->curdir(), which then just returns '.'.

The output from Pr. Kaye's program is a little peculiar if the
filenames are peculiar.  For example, I created a file named
'<blink>ooky' to make sure that things didn't start blinking.  The
encode_entities(URI::file->new(...)) call turns this into
'%3Cblink%3Eooky'.  This is fine when it appears as a URL, but as
HTML text, it's wrong; you actually see '%3Cblink%3Eooky' on the
page, percent signs and all.  It should have been turned into
'&lt;blink&gt;ooky', which would have displayed as '<blink>ooky'.

Since I expected everyone would use the modules, I thought I'd see if
it was possible to do it without the modules, while still using only
features from the Llama book.  This turns out to be possible, but
just barely.  (The modules are *not* discussed in the Llama book.)

The main part of the program, of course, is almost the same:

        for my $f (sort readdir D) {
          my $url = escape_url($f);
          my $html = escape_html($f);
          print "$html\n";
        }

The big question is how to implement 'escape_url' and 'escape_html'
with only Llama features.  'escape_html' is quite easy:

        sub escape_html {
          my $t = shift;
          $t =~ s/&/&amp;/g;
          $t =~ s/</&lt;/g;
          $t =~ s/>/&gt;/g;
          return $t;
        }

For this project, this is just fine.

'escape_url', however, is much trickier.  Normally, I would write
something like this:

        sub escape_url {
          my $url = shift;
          $url =~ s/([^0-9A-Za-z._-])/sprintf "%%%02x", ord $1/ge;
          return $url;
        }

This uses *three* non-Llama features:

    1. The ord() function

       Page 166 says "While constructing and interpreting such a byte
       string is fairly straightforward using 'chr' and 'ord' (not
       presented here)...".  Since they're not presented, I can't use
       them.

    2. sprintf "%x"

       'printf' and 'sprintf' are discussed, but never the "%x"
       escape.

    3. s///e

       s/// is of course discussed in detail, but never /e.  The book
       mentions it in passing at the very very end; there's an
       example on page 253.  I almost went ahead and used it.  Then I
       noticed that typeglobs are also mentioned on page 253, so that
       allowing page 253 features would make a mockery of the 'Llama
       only' restriction.  So I decided that s///e was forbidden.

(Reminder: The 'Llama only' rule applies only to me; everyone else
can of course use whatever they want to.)

The big difficulty: how to convert "x" to hexadecimal without using
ord() and sprintf()?  I briefly considered using something like

        s/%/%25/g;
        s/\./%2c/g;
        s/$title

$title

";
        for my $f (sort readdir D) {
          my $url = escape_url($f);
          my $html = escape_html($f);
          print "$html\n";
        }
        print "\n\n";

        ################################################################

        sub escape_url {
          my @chars = split //, $_[0];
          my $result = "";
          for (@chars) {
            my $ord = Ord($_);
            if ($good_char{$_}) {
              $result .= $_;
            } else {
              my $h0 = $ord % 16;
              my $h1 = ($ord - $h0) / 16;
              $result .= "%$hex[$h1]$hex[$h0]";
            }
          }
          return $result;
        }

        sub escape_html {
          my $t = shift;
          $t =~ s/&/&amp;/g;
          $t =~ s/</&lt;/g;
          $t =~ s/>/&gt;/g;
          return $t;
        }

----------------------------------------------------------------

1. Alert readers will notice that my program forgot to escape the
   directory name in the title of the document.  I didn't realize
   this until I saw Pr. Kaye's solution.  Whoops!

2. I violated my own spec, which said "[the program] should write out
   a file "toc.html" in that same directory...".  When time came to
   write the program, I decided the spec was dumb, and opted to have
   it write the output to STDOUT instead.  If you think the specified
   behavior is better, just add

        open STDOUT, ">", "$dir/toc.html" or die ...;

   near the top of the program.  I still think the specified behavior
   is dumb.  The program is more flexible without it.  What was I
   thinking, anyway?

3. Pr. Kaye says:

        Note that my solution will not work for ALL cases- just the
        more common. Plus, the HTML file created is very simplistic.
        A better ('better' as in more complete/robust) solution would
        perhaps be to determine the proper encoding (either based on
        the filenames or through a command-line switch) and create a
        (valid) XHTML file, properly formatted for that encoding. Or
        perhaps use UTF-8 regardless.

   This is a huge issue that I never considered at all.  It reminded
   me of the time I asked for some simple calendrical computation and
   the -discuss list was awash with people asking about the French
   revolutionary calendar.  But Pr. Kaye raises a good point.
   Pr. Kaye lives in Israel, and may very well encounter files whose
   names contain Hebrew characters.  I have no idea of the correct
   way to deal with this.
   I would not be surprised to learn that my program fails miserably
   when presented with filenames containing Hebrew characters.  (I
   would also not be surprised to learn that it works perfectly as
   long as the output file contains a line specifying that the
   character encoding is UTF-16.  I really can't exaggerate my
   ignorance here.)

   Pr. Kaye continues:

        Another point is that I haven't got a way to test this on
        different file systems, so I'm not sure it will work
        everywhere- although I've done my best to try to make the
        solution robust in that sense.  But I'm pretty sure that the
        script will not currently work across file systems.

4. Robert Spier asks:

        At this point, I'm stumped.  What does Pr. stand for?

   I liked referring to people as "Mr." in previous reports.  But
   then I got worried, because this assumes that all these people are
   men.  I do not want to assume that, and really, I don't know.  So
   I decided to invent a new formal title for programmers that would
   evade the issue.  'Pr.' stands for 'Programmer'.  Lawyers get to
   impress people by putting 'Esq.' after their names.  Now
   programmers can impress people by using a special title too.

Thanks again to everyone who participated, including Pr. Kaye and
also those people who worked the problem in private and said nothing.
I will post another quiz tonight.
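
Postscript on the escaping and encoding points discussed above: the
same filename needs two different escapings, one for the URL and one
for the visible text.  The following is only a minimal sketch, not
either author's program.  It makes several assumptions: the names
'escape_url_utf8' and 'toc_line' are hypothetical, the '<a href>'
layout of a link line is assumed, filenames are taken to be already
decoded Perl character strings, and non-ASCII characters are
percent-encoded as their UTF-8 bytes, which is one common convention
for URLs.  It freely uses the non-Llama features (ord, "%X", s///e)
and the core Encode module.

```perl
use strict;
use warnings;
use Encode qw(encode);

# Hypothetical helper: percent-encode every byte outside a small safe
# set, working on the UTF-8 bytes of the name (an assumed convention).
sub escape_url_utf8 {
    my ($name) = @_;
    my $bytes = encode('UTF-8', $name);
    # The hyphen is last in the class so it is literal, not a range.
    $bytes =~ s/([^0-9A-Za-z._-])/sprintf "%%%02X", ord $1/ge;
    return $bytes;
}

# Escape the characters that are special in HTML text.
sub escape_html {
    my ($text) = @_;
    $text =~ s/&/&amp;/g;
    $text =~ s/</&lt;/g;
    $text =~ s/>/&gt;/g;
    return $text;
}

# Build one link line; the markup used here is an assumption.
sub toc_line {
    my ($name) = @_;
    my $url  = escape_url_utf8($name);   # for the href attribute
    my $html = escape_html($name);       # for the visible text
    return qq{<a href="$url">$html</a>};
}

binmode STDOUT, ':encoding(UTF-8)';

# prints: <a href="%3Cblink%3Eooky">&lt;blink&gt;ooky</a>
print toc_line('<blink>ooky'), "\n";

# A four-letter Hebrew name followed by ".txt"
print toc_line("\x{05E9}\x{05DC}\x{05D5}\x{05DD}.txt"), "\n";
```

For the output to display properly, the generated page would also
have to declare the encoding it is written in; that part is not shown
here.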