NAME

MIME::Latin1 - translate ISO-8859-1 into 7-bit approximations

SYNOPSIS

    use MIME::Latin1 qw(latin1_to_ascii);
    
    $dirty = "Fran\347ois";
    print latin1_to_ascii($dirty);      # prints out "Fran\c,ois"

DESCRIPTION

This is a small package used by the "7bit" encoder/decoder to handle the case where a user wants to 7bit-encode a document that contains 8-bit (presumably Latin-1) characters. It maps every 8-bit character to a unique sequence of two 7-bit characters that approximates the appearance or pronunciation of the Latin-1 character. For example:

    This...                   maps to...
    --------------------------------------------------
    A c with a cedilla        c,
    A C with a cedilla        C,
    An "AE" ligature          AE
    An "ae" ligature          ae
    Yen sign                  Y-

I call each of these 7-bit 2-character encodings mnemonic encodings, since they (hopefully) are visually reminiscent of the 8-bit characters they are meant to represent.
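As a rough illustration of the idea, the five mappings listed above can be held in a plain hash keyed by the Latin-1 byte. This is only a sketch: %MNEMONIC and mnemonic_for are hypothetical names, and the module's real table covers far more characters than this excerpt.

```perl
use strict;
use warnings;

# Hypothetical excerpt of a mnemonic table: Latin-1 byte => two 7-bit characters.
# Only the five entries from the table above are included.
my %MNEMONIC = (
    "\347" => 'c,',   # c with a cedilla
    "\307" => 'C,',   # C with a cedilla
    "\306" => 'AE',   # "AE" ligature
    "\346" => 'ae',   # "ae" ligature
    "\245" => 'Y-',   # Yen sign
);

# Return the mnemonic for one byte, or undef if this small table lacks it.
sub mnemonic_for {
    my ($byte) = @_;
    return $MNEMONIC{$byte};
}

# List the excerpt as "octal code => mnemonic".
printf "%03o => %s\n", ord($_), $MNEMONIC{$_} for sort keys %MNEMONIC;
```

Because each mnemonic is unique, the same table can be inverted to decode, which is what makes the ENCODE option described later reversible.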

PUBLIC INTERFACE

latin1_to_ascii STRING,[OPTS]

Function. Map the Latin-1 characters in the string to sequences of the form:

    \xy

where xy is a two-character sequence that visually approximates the Latin-1 character. For example:

     c cedilla      => \c,
     n tilde        => \n~
     AE ligature    => \AE
     small o slash  => \o/

The sequences are taken almost exactly from the Sun character composition sequences for generating these characters. The translation may be further tweaked by the (optional) OPTS string:

READABLE

Currently the default. Only 8-bit characters are affected, and their output is of the form \xy:

      \<<Fran\c,ois M\u"ller\>>   c:\usr\games

NOSLASH

Exactly like READABLE, except the leading "\" is not inserted, making the output more compact:

      <<Franc,ois Mu"ller>>       c:\usr\games

ENCODE

Not only is the leading "\" output, but any other occurrences of "\" are escaped as well by turning them into "\\". Unlike the other options, this produces output which may easily be parsed and turned back into the original 8-bit characters, so in a way it is its own full-fledged encoding... and given that "\" is a rare-enough character, not much uglier than the normal output:

      \<<Fran\c,ois M\u"ller\>>   c:\\usr\\games

You may use ascii_to_latin1 to decode this.

Note: as of 3.12, the options string, if defined, must be exactly one of the above options. Composite options like ``ENCODE|NOSLASH'' are no longer supported (most would be self-contradictory anyway).
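The difference between the three option values can be sketched in a few lines. This is not the module's code: sketch_latin1_to_ascii and its four-entry table are hypothetical, and the "??" placeholder for bytes missing from the table is an assumption of the sketch only.

```perl
use strict;
use warnings;

# Hypothetical four-entry table, just enough for the examples above.
my %MNEMONIC = (
    "\253" => '<<',   # left guillemet
    "\273" => '>>',   # right guillemet
    "\347" => 'c,',   # c with a cedilla
    "\374" => 'u"',   # u with a diaeresis
);

# Illustrative stand-in for latin1_to_ascii; OPTS defaults to READABLE.
sub sketch_latin1_to_ascii {
    my ($str, $opt) = @_;
    $opt = 'READABLE' unless defined $opt;

    # ENCODE escapes existing backslashes first, so they stay distinguishable
    # from the backslashes that introduce mnemonics.
    $str =~ s/\\/\\\\/g if $opt eq 'ENCODE';

    # NOSLASH omits the leading backslash; the other two options emit it.
    my $lead = ($opt eq 'NOSLASH') ? '' : '\\';

    # Replace each 8-bit byte; bytes missing from this tiny table become "??".
    $str =~ s/([\200-\377])/$lead . (exists $MNEMONIC{$1} ? $MNEMONIC{$1} : '??')/ge;
    return $str;
}

my $in = "\253Fran\347ois M\374ller\273   c:\\usr\\games";
print sketch_latin1_to_ascii($in, $_), "\n" for qw(READABLE NOSLASH ENCODE);
```

Escaping the existing backslashes before inserting the mnemonic ones matters: done in the other order, the freshly inserted backslashes would be doubled too.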

ascii_to_latin1 STRING

Function. Map the Latin-1 escapes in the string (sequences of the form \xy) back into actual 8-bit characters.

   # Assume $enc holds the actual text...    \<<Fran\c,ois \\ M\u"ller\>>
   print ascii_to_latin1($enc);

Unrecognized sequences are turned into '?' characters.

Note: you must have specified the "ENCODE" option when encoding in order to decode!
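Decoding can be pictured as a single left-to-right pass over the backslash sequences. The sketch below assumes ENCODE-style input and uses a hypothetical four-entry reverse table; sketch_ascii_to_latin1 is an illustrative name, not the module's implementation.

```perl
use strict;
use warnings;

# Hypothetical reverse table (mnemonic => Latin-1 byte) with four entries only.
my %BYTE_FOR = (
    '<<' => "\253",
    '>>' => "\273",
    'c,' => "\347",
    'u"' => "\374",
);

# Illustrative stand-in for ascii_to_latin1.
sub sketch_ascii_to_latin1 {
    my ($enc) = @_;
    # After a backslash, "\\" is tried first, then any two-character pair.
    # Pairs missing from the table turn into '?'.
    $enc =~ s{\\(\\|..)}{
        $1 eq '\\'            ? '\\'
      : exists $BYTE_FOR{$1}  ? $BYTE_FOR{$1}
      :                         '?'
    }gesx;
    return $enc;
}

my $enc = '\\<<Fran\\c,ois \\\\ M\\u"ller\\>>';   # the text from the example above
my $dec = sketch_ascii_to_latin1($enc);
print length($dec), " bytes after decoding\n";
```

Text produced with READABLE or NOSLASH cannot be fed through this safely: a literal "\usr" would be misread as an escape, which is why ENCODE is required for round trips.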

NOTES

Hex encoding

Characters in the octal range \200-\237 (hexadecimal \x80-\x9F) currently do not have mnemonic Latin-1 equivalents, and therefore are represented by the hex sequences ``80'' through ``9F'', where the second hex digit is upcased. That is:

   80  81  82  83  84  85  86  87  88  89  8A  8B  8C  8D  8E  8F
   90  91  92  93  94  95  96  97  98  99  9A  9B  9C  9D  9E  9F

To allow this scheme to work properly for all 8-bit-on characters, the general rule is: the first hex digit is DOWNcased, and the second hex digit is UPcased. Hence, these are all decodable sequences:

   a0  a1  a2  a3  a4  a5  a6  a7  a8  a9  aA  aB  aC  aD  aE  aF

This ``downcase-upcase'' style is so we don't conflict with mnemonically-encoded ligatures like ``ae'' and ``AE'', the latter of which could reasonably have been represented as ``Ae''.

Note that we must never have a mnemonic encoding that could be mistaken for a hex sequence from ``80'' to ``fF'', since the ambiguity would make it impossible to decode. (However, ``12'', ``34'', ``Ff'', etc. are perfectly fine.)
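The downcase-upcase rule is easy to state in code. The following sketch uses hypothetical helper names (hex_pair, is_hex_pair, byte_from_pair) and only illustrates the rule as described here; it is not taken from the module.

```perl
use strict;
use warnings;

# Encode a byte value as a hex pair: first digit downcased, second upcased.
sub hex_pair {
    my ($code) = @_;
    return sprintf('%x%X', $code >> 4, $code & 0x0F);
}

# Recognize a pair as a hex sequence in the range "80".."fF" under that rule.
sub is_hex_pair {
    my ($pair) = @_;
    return $pair =~ /\A[89a-f][0-9A-F]\z/ ? 1 : 0;
}

# Decode a recognized pair back to its byte value; Perl's hex() ignores case.
sub byte_from_pair {
    my ($pair) = @_;
    return is_hex_pair($pair) ? hex($pair) : undef;
}

print hex_pair($_), " " for 0x80, 0x9F, 0xAA, 0xFF;   # 80 9F aA fF
print "\n";
```

Under this check the ligature mnemonics "ae" and "AE" are rejected as hex pairs, as are "12", "34" and "Ff", while "aE" is accepted.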

Thanks to Rolf Nelson for reporting the "gap" in the encoding.

Other restrictions

The first character of a 2-character encoding cannot be a "\". This is because ``\\'' represents an encoded ``\'': allowing ``\\x'' would introduce an ambiguity for the decoder.

Going backwards

Since the mappings may fluctuate over time as I get more input, anyone writing a translator would be well-advised to use ascii_to_latin1() to perform the reverse mapping. I will strive for backwards-compatibility in that code.

Got a problem?

If you have better suggestions for some of the character representations, please contact me.

AUTHOR

VERSION