NAME

MIME::ParserBase - abstract class for parsing MIME mail

SYNOPSIS

This is an abstract class; however, here's how one of its concrete subclasses is used:

    use MIME::Parser;
    
    # Create a new parser object:
    my $parser = new MIME::Parser;
    
    # Parse an input stream:
    $entity = $parser->read(\*STDIN) or die "couldn't parse MIME stream";
    
    # Congratulations: you now have a (possibly multipart) MIME entity!
    $entity->dump_skeleton;          # for debugging

There are also some convenience methods:

    # Parse an in-core MIME message:
    $entity = $parser->parse_data($message)
          || die "couldn't parse MIME message";
    
    # Parse already-split input (as "deliver" would give it to you):
    $entity = $parser->parse_two("msg.head", "msg.body")
          || die "couldn't parse MIME files";

In case a parse fails, it's nice to know who sent it to us. So...

    # Parse an input stream:
    $entity = $parser->read(\*STDIN);
    if (!$entity) {           # oops!
        my $decapitated = $parser->last_head;    # last top-level head
    }

You can also alter the behavior of the parser:

    # Parse contained "message/rfc822" objects as nested MIME streams:
    $parser->parse_nested_messages('REPLACE');
     
    # Automatically attempt to RFC-1522-decode the MIME headers:
    $parser->decode_headers(1);

DESCRIPTION

This is the class that contains all the knowledge for parsing MIME streams. It's an abstract class, containing no methods governing the output of the parsed entities: such methods belong in the concrete subclasses.

You can inherit from this class to create your own subclasses that parse MIME streams into MIME::Entity objects. One such subclass, MIME::Parser, is already provided in this kit.

PUBLIC INTERFACE

Construction, and setting options

new ARGS...

Class method. Create a new parser object. Passes any subsequent arguments on to the init() method.

Once you create a parser object, you can then set up various parameters before doing the actual parsing. Here's an example using one of our concrete subclasses:

    my $parser = new MIME::Parser;
    $parser->output_dir("/tmp");
    $parser->output_prefix("msg1");
    my $entity = $parser->read(\*STDIN);

decode_headers ONOFF

Instance method. If set true, then the parser will attempt to decode the MIME headers as per RFC-1522 the moment it sees them. This will probably be of most use to those of you who expect some international mail, especially mail from individuals with 8-bit characters in their names.

If set false, no attempt at decoding will be done.

With no argument, just returns the current setting.

Warning: some folks already have code which assumes that no decoding is done, and since this is pretty new and radical stuff, I have initially made ``off'' the default setting for backwards compatibility in 2.05. However, I will possibly change this in future releases, so please: if you want a particular setting, declare it when you create your parser object.
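To give a feel for what RFC-1522 decoding does to a header value, here is a minimal sketch. It does not use this parser's own decoder: it assumes the MIME-Header encoding from the core Encode module as a stand-in, and the sample header value is made up for illustration.

```perl
use strict;
use warnings;
use Encode qw(decode);

# A raw header value containing an RFC-1522 "encoded word" (made-up sample).
my $raw = '=?ISO-8859-1?Q?Andr=E9?= Pirard <PIRARD@vm1.ulg.ac.be>';

# Assumption: Encode's 'MIME-Header' encoding stands in for the parser's
# own header decoder; it turns encoded words into ordinary characters.
my $decoded = decode('MIME-Header', $raw);

binmode STDOUT, ':encoding(UTF-8)';
print "$decoded\n";
```

With decoding off, your code sees the raw =?...?= form and must deal with it itself.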

interface ROLE,[VALUE]

Instance method. During parsing, the parser normally creates instances of certain classes, like MIME::Entity. However, you may want to create a parser subclass that uses your own experimental head, entity, etc. classes (for example, your ``head'' class may provide some additional MIME-field-oriented methods).

If so, then this is the method that your subclass should invoke during init. Use it like this:

    package MyParser;
    @ISA = qw(MIME::Parser);
    ...
    sub init {
        my $self = shift;
        $self->SUPER::init(@_);        # do my parent's init
        $self->interface(ENTITY_CLASS => 'MIME::MyEntity');
        $self->interface(HEAD_CLASS   => 'MIME::MyHead');
        $self;                         # return
    }

With no VALUE, returns the VALUE currently associated with that ROLE.

last_head

Instance method. Return the top-level MIME header of the last stream we attempted to parse. This is useful for replying to people who sent us bad MIME messages.

    # Parse an input stream:
    $entity = $parser->read(\*STDIN);
    if (!$entity) {           # oops!
        my $decapitated = $parser->last_head;    # last top-level head
    }

parse_nested_messages OPTION

Instance method. Some MIME messages will contain a part of type message/rfc822: literally, the text of an embedded mail/news/whatever message. The normal behavior is to save such a message just as if it were a text/plain document, without attempting to decode it. However, you can change this: before parsing, invoke this method with the OPTION you want:

If OPTION is false, the normal behavior will be used.

If OPTION is true, the body of the message/rfc822 part is decoded (after all, it might be encoded!) into a temporary filehandle, which is then rewound and parsed by this parser, creating an entity object. What happens then is determined by the OPTION:

NEST or 1

The contained message becomes a ``part'' of the message/rfc822 entity, as though the message/rfc822 were a special kind of multipart entity. However, the message/rfc822 header (and the content-type) is retained.

Warning: since it is not legal MIME for anything but multipart to have a ``part'', the message/rfc822 message will appear to have no content if you simply print() it out. You will have to get at the reparsed body manually, by the MIME::Entity::parts() method.

IMHO, this option is probably only useful if you're processing messages, but not saving or re-sending them. If you do need to save or re-send them, it is best not to use ``parse nested'' at all.

REPLACE

The contained message replaces the message/rfc822 entity, as though the message/rfc822 ``envelope'' never existed.

Warning: notice that, with this option, all the header information in the message/rfc822 header is lost. This might seriously bother you if you're dealing with a top-level message, and you've just lost the sender's address and the subject line. :-/.

Thanks to Andreas Koenig for suggesting this method.

Parsing messages

parse_data DATA

Instance method. Parse a MIME message that's already in-core. You may supply the DATA in any of a number of ways...

A scalar which holds the message.
A ref to a scalar which holds the message. This is an efficiency hack.
A ref to an array of scalars. The array elements are simply joined to produce a scalar; no newlines are inserted!

Returns a MIME::Entity, which may be a single entity, or an arbitrarily-nested multipart entity. Returns undef on failure.
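The three accepted forms of DATA can all be reduced to a reference to one scalar. The following is only a sketch of that idea: data_to_scalar_ref is a hypothetical helper name, not a method of this class, and the real implementation may differ.

```perl
use strict;
use warnings;

# Hypothetical helper (not part of MIME::ParserBase): reduce any of the
# three accepted DATA forms to a reference to a single scalar.
sub data_to_scalar_ref {
    my ($data) = @_;
    if (!ref $data) {                 # plain scalar holding the message
        return \$data;
    }
    elsif (ref $data eq 'SCALAR') {   # ref to scalar: used as-is, no copy
        return $data;
    }
    elsif (ref $data eq 'ARRAY') {    # ref to array: joined, no newlines added
        my $joined = join '', @$data;
        return \$joined;
    }
    die "unsupported DATA type: " . ref($data) . "\n";
}

my @lines = ("Content-type: text/plain\n", "\n", "Hello\n");
my $ref   = data_to_scalar_ref(\@lines);
print $$ref;
```

Note that the array elements must already carry their own line endings, since none are inserted when joining.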

Note: where the parsed body parts are stored (e.g., in-core vs. on-disk) is not determined by this class, but by the subclass you use to do the actual parsing (e.g., MIME::Parser). For efficiency, if you know you'll be parsing a small amount of data, it is probably best to tell the parser to store the parsed parts in core. For example, here's a short test program, using MIME::Parser:

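        use MIME::Parser;
        
        # A tiny but complete MIME message; the first blank line ends the header:
        my $msg = <<EOF;
    Content-type: text/html
    Content-transfer-encoding: 7bit

    <H1>Hello, world!</H1>

    EOF
        # Keep the parsed parts in core rather than on disk:
        my $parser = new MIME::Parser;
        $parser->output_to_core('ALL');
        my $entity = $parser->parse_data($msg)
            or die "couldn't parse MIME message";
        $entity->print(\*STDOUT);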

parse_two HEADFILE, BODYFILE

Instance method. Convenience front-end onto read(), intended for programs running under mail-handlers like deliver, which splits the incoming mail message into a header file and a body file.

Simply give this method the paths to the respective files. These must be pathnames: Perl ``open-able'' expressions won't work, since the pathnames are shell-quoted for safety.

WARNING: it is assumed that, once the files are cat'ed together, there will be a blank line separating the head part and the body part.

Returns the parsed entity, or undef on error.
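If you are not sure your mail handler leaves the required blank line between the two files, you can check before parsing. This is a sketch only: has_blank_separator is a hypothetical helper, not part of this class, and it works on the file contents rather than the pathnames.

```perl
use strict;
use warnings;

# Hypothetical check (not part of this class): once the head text and the
# body text are concatenated, is there a blank line between them?
sub has_blank_separator {
    my ($head, $body) = @_;
    return 1 if $head =~ /\n\r?\n\z/;                   # head ends with the blank line
    return 1 if $head =~ /\n\z/ && $body =~ /\A\r?\n/;  # body starts with it
    return 0;
}

print has_blank_separator("Subject: hi\n\n", "text\n") ? "ok\n" : "missing\n";
print has_blank_separator("Subject: hi\n",   "text\n") ? "ok\n" : "missing\n";
```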

read INSTREAM

Instance method. Takes a MIME-stream and splits it into its component entities, each of which is decoded and placed in a separate file in the parser's output_dir().

The INSTREAM can be given as a readable FileHandle, a globref'd filehandle (like \*STDIN), or as any blessed object conforming to the MIME::IO (or IO::) interface.

Returns a MIME::Entity, which may be a single entity, or an arbitrarily-nested multipart entity. Returns undef on failure.

WRITING SUBCLASSES

All you have to do to write a subclass is to provide or override the following methods:

init ARGS...

Instance method, private. Initialize the new parser object, with any args passed to new().

You don't need to override this in your subclass. If you override it, however, make sure you call the inherited method to init your parents!

    package MyParser;
    @ISA = qw(MIME::Parser);
    ...
    sub init {
        my $self = shift;
        $self->SUPER::init(@_);        # do my parent's init
        
        # ...my init stuff goes here... 
        
        $self;                         # return
    }

Should return the self object on success, and undef on failure.

new_body_for HEAD

Abstract instance method. Based on the HEAD of a part we are parsing, return a new body object (any desirable subclass of MIME::Body) for receiving that part's data (both will be put into the ``entity'' object for that part).

If you want the parser to do something other than write its parts out to files, you should override this method in a subclass. For an example, see MIME::Parser.

Note: the reason that we don't use the ``interface'' mechanism for this is that your choice of (1) which body class to use, and (2) how its new() method is invoked, may be very much based on the information in the header.

You are of course free to override any other methods as you see fit, like new.

NOTES

This is an abstract class. If you actually want to parse a MIME stream, use one of the children of this class, like the backwards-compatible MIME::Parser.

Under the hood

RFC-1521 gives us the following BNF grammar for the body of a multipart MIME message:

      multipart-body  := preamble 1*encapsulation close-delimiter epilogue

      encapsulation   := delimiter body-part CRLF

      delimiter       := "--" boundary CRLF 
                                   ; taken from Content-Type field.
                                   ; There must be no space between "--" 
                                   ; and boundary.

      close-delimiter := "--" boundary "--" CRLF 
                                   ; Again, no space by "--"

      preamble        := discard-text   
                                   ; to be ignored upon receipt.

      epilogue        := discard-text   
                                   ; to be ignored upon receipt.

      discard-text    := *(*text CRLF)

      body-part       := <"message" as defined in RFC 822, with all 
                          header fields optional, and with the specified 
                          delimiter not occurring anywhere in the message 
                          body, either on a line by itself or as a substring 
                          anywhere.  Note that the semantics of a part 
                          differ from the semantics of a message, as 
                          described in the text.>

From this we glean the following algorithm for parsing a MIME stream:

    PROCEDURE parse
    INPUT
        A FILEHANDLE for the stream.
        An optional end-of-stream OUTER_BOUND (for a nested multipart message).
    
    RETURNS
        The (possibly-multipart) ENTITY that was parsed.
        A STATE indicating how we left things: "EOF", "DELIM", or "CLOSE".
    
    BEGIN   
        LET OUTER_DELIM = "--OUTER_BOUND".
        LET OUTER_CLOSE = "--OUTER_BOUND--".
    
        LET ENTITY = a new MIME entity object.
        LET STATE  = "OK".
    
        Parse the (possibly empty) header, up to and including the
        blank line that terminates it.   Store it in the ENTITY.
    
        IF the MIME type is "multipart":
            LET INNER_BOUND = get multipart "boundary" from header.
            LET INNER_DELIM = "--INNER_BOUND".
            LET INNER_CLOSE = "--INNER_BOUND--".
    
            Parse preamble:
                REPEAT:
                    Read (and discard) next line
                UNTIL (line is INNER_DELIM) OR we hit EOF (error).
    
            Parse parts:
                REPEAT:
                    LET (PART, STATE) = parse(FILEHANDLE, INNER_BOUND).
                    Add PART to ENTITY.
                UNTIL (STATE != "DELIM").
    
            Parse epilogue:
                REPEAT (to parse epilogue): 
                    Read (and discard) next line
                UNTIL (line is OUTER_DELIM or OUTER_CLOSE) OR we hit EOF
                LET STATE = "EOF", "DELIM", or "CLOSE" accordingly.
     
        ELSE (if the MIME type is not "multipart"):
            Open output destination (e.g., a file)
    
            DO:
                Read, decode, and output data from FILEHANDLE
            UNTIL (line is OUTER_DELIM or OUTER_CLOSE) OR we hit EOF.
            LET STATE = "EOF", "DELIM", or "CLOSE" accordingly.
    
        ENDIF
    
        RETURN (ENTITY, STATE).
    END

For reasons discussed in MIME::Entity, we can't just discard the ``discard text'': some mailers actually put data in the preamble.
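The procedure above translates fairly directly into Perl. What follows is a sketch, not the code this class actually runs: it reads from an array of lines instead of a filehandle, stores entities as plain hashes rather than MIME::Entity objects, decodes nothing, and simply throws away the preamble and epilogue. The names parse and $classify are mine.

```perl
use strict;
use warnings;

# Sketch only: parse($lines, $outer_bound) consumes lines from the array ref
# and returns (ENTITY, STATE), where STATE is "EOF", "DELIM", or "CLOSE".
sub parse {
    my ($lines, $outer_bound) = @_;
    my %entity = (head => [], body => [], parts => []);

    # Header: everything up to and including the terminating blank line.
    while (defined(my $line = shift @$lines)) {
        $line =~ s/\r?\n\z//;
        last if $line eq '';
        push @{ $entity{head} }, $line;
    }

    # Classify a line against a boundary: '' means "ordinary line".
    my $classify = sub {
        my ($line, $bound) = @_;
        return 'EOF' unless defined $line;
        return ''    unless defined $bound;
        (my $bare = $line) =~ s/\r?\n\z//;
        return 'DELIM' if $bare eq "--$bound";
        return 'CLOSE' if $bare eq "--$bound--";
        return '';
    };

    # Unfold continuation lines, then look for a multipart boundary.
    my $head = join "\n", @{ $entity{head} };
    $head =~ s/\n[ \t]+/ /g;
    my ($inner_bound) =
        $head =~ m{^content-type:\s*multipart/[^\n]*?\bboundary="?([^";\n]+)"?}im;

    my $state;
    if (defined $inner_bound) {
        # Preamble: discard lines until the first inner delimiter.
        my $s = '';
        while ($s ne 'DELIM') {
            $s = $classify->(shift @$lines, $inner_bound);
            return (\%entity, 'EOF') if $s eq 'EOF';    # malformed: no parts
        }
        # Parts: recurse until a part ends on something other than a delimiter.
        my $part_state;
        do {
            (my $part, $part_state) = parse($lines, $inner_bound);
            push @{ $entity{parts} }, $part;
        } while ($part_state eq 'DELIM');
        # Epilogue: discard lines until the enclosing boundary or EOF.
        do {
            $state = $classify->(shift @$lines, $outer_bound);
        } until ($state ne '');
    }
    else {
        # Single part: collect body lines until the enclosing boundary or EOF.
        while (1) {
            my $line = shift @$lines;
            $state = $classify->($line, $outer_bound);
            last if $state ne '';
            push @{ $entity{body} }, $line;
        }
    }
    return (\%entity, $state);
}

my @msg = map { "$_\n" } (
    'Content-Type: multipart/mixed; boundary="xyz"', '',
    'preamble text',
    '--xyz', 'Content-Type: text/plain', '', 'first part',
    '--xyz', '', 'second part',
    '--xyz--',
    'epilogue text',
);
my ($top, $state) = parse(\@msg);
printf "%d parts, final state %s\n", scalar @{ $top->{parts} }, $state;
```

The second part in the sample has an empty header, which the grammar allows since all header fields of a body-part are optional.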

Questionable practices

Multipart messages are always read line-by-line

Multipart document parts are read line-by-line, so that the encapsulation boundaries may easily be detected. However, bad MIME composition agents (for example, naive CGI scripts) might return multipart documents where the parts are, say, unencoded bitmap files... and, consequently, where such ``lines'' might be veeeeeeeeery long indeed.

A better solution for this case would be to set up some form of state machine for input processing. This will be left for future versions.
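One possible shape for such input processing, offered purely as an illustration and not as what a future version will do: read fixed-size chunks and search for the delimiter, carrying over just enough of the previous chunk to catch a delimiter that straddles two reads. The name find_delimiter_offset and the chunk size are assumptions of mine.

```perl
use strict;
use warnings;

# Sketch: return the byte offset of the first occurrence of $delim in the
# stream, reading $chunk_size bytes at a time; -1 if it never appears.
sub find_delimiter_offset {
    my ($fh, $delim, $chunk_size) = @_;
    my ($buf, $base) = ('', 0);      # $base = stream offset of $buf's start
    while (read($fh, my $chunk, $chunk_size)) {
        $buf .= $chunk;
        my $pos = index($buf, $delim);
        return $base + $pos if $pos >= 0;

        # Keep only a tail that could still be the start of a match.
        my $keep = length($delim) - 1;
        if (length($buf) > $keep) {
            $base += length($buf) - $keep;
            $buf = $keep ? substr($buf, -$keep) : '';
        }
    }
    return -1;
}

my $data = "aaaa\n--xyz\nbbbb";
open my $fh, '<', \$data or die "can't open in-memory handle: $!";
print find_delimiter_offset($fh, "\n--xyz", 3), "\n";
```

Memory use is then bounded by the chunk size plus the delimiter length, however long the ``lines'' are.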

Multipart parts read into temp files before decoding

In my original implementation, the MIME::Decoder classes had to be aware of encapsulation boundaries in multipart MIME documents. While this decode-while-parsing approach obviated the need for temporary files, it resulted in inflexible and complex decoder implementations.

The revised implementation uses a temporary file (a la tmpfile()) during parsing to hold the encoded portion of the current MIME document or part. This file is deleted automatically after the current part is decoded and the data is written to the ``body stream'' object; you'll never see it, and should never need to worry about it.

Some folks have asked for the ability to bypass this temp-file mechanism, I suppose because they assume it would slow down their application. I considered accommodating this wish, but the temp-file approach solves a lot of thorny problems in parsing, and it also protects against hidden bugs in user applications (what if you've directed the encoded part into a scalar, and someone unexpectedly sends you a 6 MB tar file?). Finally, I'm just not convinced that the temp-file use adds significant overhead.
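The spool-then-decode idea can be sketched in a few lines. This is not the parser's actual code: it assumes IO::File's new_tmpfile() for the anonymous temp file and the core MIME::Base64 module in place of the MIME::Decoder classes, and spool_and_decode is a name I made up.

```perl
use strict;
use warnings;
use IO::File;
use MIME::Base64 qw(decode_base64);

# Sketch: spool the still-encoded lines of one part into an anonymous
# temp file, then rewind and decode. Returns the decoded bytes.
sub spool_and_decode {
    my (@encoded_lines) = @_;
    my $tmp = IO::File->new_tmpfile
        or die "can't create temp file: $!";
    print {$tmp} @encoded_lines;
    seek($tmp, 0, 0) or die "can't rewind: $!";

    # Decoding line by line assumes each line holds a multiple of 4
    # base64 characters, as standard encoders produce.
    my $decoded = '';
    while (defined(my $line = <$tmp>)) {
        $decoded .= decode_base64($line);
    }
    close $tmp;      # the anonymous file goes away with the handle
    return $decoded;
}

print spool_and_decode("SGVsbG8s\n", "IHdvcmxkIQ==\n"), "\n";
```

The decoder never has to know about encapsulation boundaries: by the time it runs, the part has already been cut out of the stream.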

Fuzzing of CRLF and newline on input

RFC-1521 dictates that MIME streams have lines terminated by CRLF ("\r\n"). However, it is extremely likely that folks will want to parse MIME streams where each line ends in the local newline character "\n" instead.

An attempt has been made to allow the parser to handle both CRLF and newline-terminated input.
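In practice this kind of tolerance comes down to stripping either terminator before comparing a line against anything. A minimal sketch, with strip_eol being a hypothetical name rather than a method of this class:

```perl
use strict;
use warnings;

# Sketch: strip one trailing line terminator, whether CRLF or bare newline,
# so later comparisons (e.g. against a boundary) see the same text.
sub strip_eol {
    my ($line) = @_;
    $line =~ s/\r?\n\z//;
    return $line;
}

for my $raw ("--boundary\r\n", "--boundary\n") {
    print strip_eol($raw) eq '--boundary' ? "match\n" : "no match\n";
}
```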

Fuzzing of CRLF and newline on output

The "7bit" and "8bit" decoders will decode both a "\n" and a "\r\n" end-of-line sequence into a "\n".

The "binary" decoder (default if no encoding specified) still outputs stuff verbatim... so a MIME message with CRLFs and no explicit encoding will be output as a text file that, on many systems, will have an annoying ^M at the end of each line... but this is as it should be.

Inability to handle multipart boundaries that contain newlines

First, let's get something straight: this is an evil, EVIL practice, and is incompatible with RFC-1521... hence, it's not valid MIME.

If your mailer creates multipart boundary strings that contain newlines when they appear in the message body, give it two weeks' notice and find another one. If your mail robot receives MIME mail like this, regard it as syntactically incorrect MIME, which it is.

Why do I say that? Well, in RFC-1521, the syntax of a boundary is given quite clearly:

      boundary := 0*69<bchars> bcharsnospace
        
      bchars := bcharsnospace / " "
      
      bcharsnospace :=    DIGIT / ALPHA / "'" / "(" / ")" / "+" / "_"
                   / "," / "-" / "." / "/" / ":" / "=" / "?"

All of which means that a valid boundary string cannot have newlines in it, and any newlines in such a string in the message header are expected to be solely the result of folding the string (i.e., inserting to-be-removed newlines for readability and line-shortening only).
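The grammar above maps onto a regular expression almost one-to-one. Here is a sketch of a validity check built that way; is_valid_boundary is a hypothetical helper of my own, not something this class provides.

```perl
use strict;
use warnings;

# The bcharsnospace set from the grammar above, as a character class.
my $nospace = qr/[0-9A-Za-z'()+_,\-.\/:=?]/;

# boundary := 0*69<bchars> bcharsnospace
# i.e. up to 69 characters that may include spaces, then one that may not.
sub is_valid_boundary {
    my ($boundary) = @_;
    return $boundary =~ /\A(?:$nospace| ){0,69}$nospace\z/ ? 1 : 0;
}

print is_valid_boundary('----ABC-123----'),    "\n";   # no newline inside
print is_valid_boundary("----ABC-\n 123----"), "\n";   # newline inside
```

Because newline is not in either character set, any boundary containing one is rejected, as is a boundary ending in a space or running past 70 characters.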

Yet, there is at least one brain-damaged user agent out there that composes mail like this:

      MIME-Version: 1.0
      Content-type: multipart/mixed; boundary="----ABC-
       123----"
      Subject: Hi... I'm a dork!
      
      This is a multipart MIME message (yeah, right...)
      
      ----ABC-
       123----
      
      Hi there!

We have got to discourage practices like this (and the recent file upload idiocy where binary files that are part of a multipart MIME message aren't base64-encoded) if we want MIME to stay relatively simple, and MIME parsers to be relatively robust.

Thanks to Andreas Koenig for bringing a baaaaaaaaad user agent to my attention.

WARNINGS

binmode

New, untested binmode() calls were added in module version 1.11... if binmode() is not a NOOP on your system, please pay careful attention to your output, and report any anomalies. It is possible that "make test" will fail on such systems, since some of the tests involve checking the sizes of the output files. That doesn't necessarily indicate a problem.

If anyone wants to test out this package's handling of both binary and textual email on a system where binmode() is not a NOOP, I would be most grateful. If stuff breaks, send me the pieces (including the original email that broke it, and at the very least a description of how the output was screwed up).

AUTHOR

VERSION

$Revision: 3.203 $ $Date: 1997/01/22 08:40:01 $