    use MIME::Latin1 qw(latin1_to_ascii);

    $dirty = "Fran\347ois";
    print latin1_to_ascii($dirty);      # prints out "Fran\c,ois"
"7bit"
encoder/decoder for handling the case where a user wants to 7bit-encode a
document that contains 8-bit (presumably Latin-1) characters. It provides a
mapping whereby every 8-bit character is mapped to a unique sequence of two
7-bit characters that approximates the appearance or pronunciation of the
Latin-1 character. For example:
    This...                  maps to...
    --------------------------------------------------
    A c with a cedilla       c,
    A C with a cedilla       C,
    An "AE" ligature         AE
    An "ae" ligature         ae
    Yen sign                 Y-
I call each of these 7-bit 2-character encodings mnemonic encodings, since they (hopefully) are visually reminiscent of the 8-bit characters they are meant to represent.
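A minimal sketch of the idea, assuming a hand-written table that holds only the five characters listed above. The names %MNEMONIC, _sketch_map_char, and sketch_latin1_to_ascii are hypothetical and are not the module's own code; the leading "\" follows the output shown in the synopsis.

```perl
use strict;
use warnings;

# Hypothetical table holding only the five characters listed above;
# the byte values are the standard Latin-1 code points.
my %MNEMONIC = (
    "\xE7" => 'c,',    # c with a cedilla
    "\xC7" => 'C,',    # C with a cedilla
    "\xC6" => 'AE',    # "AE" ligature
    "\xE6" => 'ae',    # "ae" ligature
    "\xA5" => 'Y-',    # Yen sign
);

# Map one 8-bit character to "\" plus its mnemonic; characters that
# are missing from this small table pass through unchanged.
sub _sketch_map_char {
    my ($c) = @_;
    return exists $MNEMONIC{$c} ? '\\' . $MNEMONIC{$c} : $c;
}

# Hypothetical stand-in for the real function, for illustration only.
sub sketch_latin1_to_ascii {
    my ($str) = @_;
    $str =~ s/([\x80-\xFF])/_sketch_map_char($1)/ge;
    return $str;
}

print sketch_latin1_to_ascii("Fran\347ois"), "\n";    # prints Fran\c,ois
```

The real table covers many more characters; this one exists only to make the mapping concrete.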
    \xy

Where xy is a two-character sequence that visually approximates the Latin-1
character. For example:

    c cedilla        => \c,
    n tilde          => \n~
    AE ligature      => \AE
    small o slash    => \o/
The sequences are taken almost exactly from the Sun character composition sequences for generating these characters. The translation may be further tweaked by the (optional) OPTS string:
\xy:
\<<Fran\c,ois M\u"ller\>> c:\usr\games
"\"
is not inserted, making the output more compact:
<<Franc,ois Mu"ller>> c:\usr\games
"\"
output, but any other occurrences of "\" are escaped as well by turning them
into "\\". Unlike the other options, this produces output which may easily be
parsed and turned back into the original 8-bit characters, so in a way it is
its own full-fledged encoding... and given that "\" is a rare-enough
character, not much uglier than the normal output:
\<<Fran\c,ois M\u"ller\>> c:\\usr\\games
You may use ascii_to_latin1 to decode this.
ascii_to_latin1 maps the escape sequences (\xy) back into actual 8-bit characters.
    # Assume $enc holds the actual text...  \<<Fran\c,ois \\ M\u"ller\>>
    print ascii_to_latin1($enc);
Unrecognized sequences are turned into '?' characters.
Note: you must have specified the "ENCODE" option when encoding in order to decode!
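To see why the "ENCODE" style can be decoded while the other styles cannot, here is a rough round-trip sketch. It assumes a two-entry table; sketch_encode and sketch_decode (and their helpers) are hypothetical names and not the module's implementation. Because every literal "\" is doubled on the way out, a "\" in the encoded text always starts either "\\" or a two-character mnemonic.

```perl
use strict;
use warnings;

# Hypothetical two-entry table; the real module's table is much larger.
my %TO_ASCII   = ("\xE7" => 'c,', "\xFC" => 'u"');
my %FROM_ASCII = reverse %TO_ASCII;

# Encode one character: double a literal backslash, turn a known
# 8-bit character into "\xy", and leave anything else alone.
sub _sketch_enc_char {
    my ($c) = @_;
    return '\\\\' if $c eq '\\';
    return exists $TO_ASCII{$c} ? '\\' . $TO_ASCII{$c} : $c;
}

sub sketch_encode {
    my ($str) = @_;
    $str =~ s/(\\|[\x80-\xFF])/_sketch_enc_char($1)/ge;
    return $str;
}

# Decode one escape body: "\" stands for a literal backslash, a known
# mnemonic becomes its 8-bit character, anything else becomes '?'.
sub _sketch_dec_seq {
    my ($seq) = @_;
    return '\\' if $seq eq '\\';
    return exists $FROM_ASCII{$seq} ? $FROM_ASCII{$seq} : '?';
}

sub sketch_decode {
    my ($str) = @_;
    $str =~ s/\\(\\|..)/_sketch_dec_seq($1)/ge;
    return $str;
}

my $orig = "Fran\xE7ois c:\\usr\\games";
my $enc  = sketch_encode($orig);
print $enc, "\n";                                  # prints Fran\c,ois c:\\usr\\games
print sketch_decode($enc) eq $orig ? "round trip ok\n" : "mismatch\n";
```

Without the doubling step, the "\u" in "c:\usr" would be read as the start of a mnemonic, which is why the decoder depends on the encoder having escaped backslashes.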
    80 81 82 83 84 85 86 87 88 89 8A 8B 8C 8D 8E 8F
    90 91 92 93 94 95 96 97 98 99 9A 9B 9C 9D 9E 9F
To allow this scheme to work properly for all 8-bit-on characters, the general rule is: the first hex digit is DOWNcased, and the second hex digit is UPcased. Hence, these are all decodable sequences:
a0 a1 a2 a3 a4 a5 a6 a7 a8 a9 aA aB aC aD aE aF
This "downcase-upcase" style is so we don't conflict with mnemonically-encoded ligatures like "ae" and "AE", the latter of which could reasonably have been represented as "Ae".
Note that we must never have a mnemonic encoding that could be mistaken for a hex sequence from "80" to "fF", since the ambiguity would make it impossible to decode. (However, "12", "34", "Ff", etc. are perfectly fine.)
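The downcase-upcase rule can be sketched as follows. Both helper names are hypothetical, and the pattern is only my reading of the rule as stated above: a pair counts as hex only when its first digit is 8, 9, or a lowercase a-f and its second digit is 0-9 or an uppercase A-F.

```perl
use strict;
use warnings;

# Hypothetical helper: format a byte as two hex digits, with the first
# digit lowercase and the second uppercase, e.g. 0xAB becomes "aB".
sub sketch_hex_pair {
    my ($byte) = @_;
    return sprintf('%x', $byte >> 4) . sprintf('%X', $byte & 0x0F);
}

# A pair is treated as a hex sequence only in this exact case pattern,
# which leaves "ae", "AE", "Ae", "Ff", and "12" free for mnemonics.
sub sketch_is_hex_pair {
    my ($pair) = @_;
    return $pair =~ /\A[89a-f][0-9A-F]\z/ ? 1 : 0;
}

print join(' ', map { sketch_hex_pair($_) } 0x80, 0x9F, 0xA0, 0xAE, 0xFF), "\n";
# prints: 80 9F a0 aE fF

print join(' ', map { sketch_is_hex_pair($_) } qw(aE ae AE Ff 12)), "\n";
# prints: 1 0 0 0 0
```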
Thanks to Rolf Nelson for reporting the "gap" in the encoding.
Use ascii_to_latin1() to perform the reverse mapping. I will
strive for backwards-compatibility in that code.
All rights reserved. This program is free software; you can redistribute it and/or modify it under the same terms as Perl itself.