    use MIME::Parser;

    # Create a new parser object:
    my $parser = new MIME::Parser;

    # Parse an input stream:
    $entity = $parser->read(\*STDIN) or die "couldn't parse MIME stream";

    # Congratulations: you now have a (possibly multipart) MIME entity!
    $entity->dump_skeleton;          # for debugging
There are also some convenience methods:
    # Parse an in-core MIME message:
    $entity = $parser->parse_data($message)
        || die "couldn't parse MIME message";

    # Parse already-split input (as "deliver" would give it to you):
    $entity = $parser->parse_two("msg.head", "msg.body")
        || die "couldn't parse MIME files";
In case a parse fails, it's nice to know who sent it to us. So...
    # Parse an input stream:
    $entity = $parser->read(\*STDIN);
    if (!$entity) {                              # oops!
        my $decapitated = $parser->last_head;    # last top-level head
    }
You can also alter the behavior of the parser:
    # Parse contained "message/rfc822" objects as nested MIME streams:
    $parser->parse_nested_messages('REPLACE');

    # Automatically attempt to RFC-1522-decode the MIME headers:
    $parser->decode_headers(1);
This is the class that contains all the knowledge for parsing MIME streams. It's an abstract class, containing no methods governing the output of the parsed entities: such methods belong in the concrete subclasses.
You can inherit from this class to create your own subclasses that parse MIME streams into MIME::Entity objects. One such subclass, MIME::Parser, is already provided in this kit.
Once you create a parser object, you can then set up various parameters before doing the actual parsing. Here's an example using one of our concrete subclasses:
    my $parser = new MIME::Parser;
    $parser->output_dir("/tmp");
    $parser->output_prefix("msg1");
    my $entity = $parser->read(\*STDIN);
If set false, no attempt at decoding will be done.
With no argument, just returns the current setting.
Warning: some folks already have code which assumes that no decoding is done, and since this is pretty new and radical stuff, I have initially made ``off'' the default setting for backwards compatibility in 2.05. However, I will possibly change this in future releases, so please: if you want a particular setting, declare it when you create your parser object.
If so, then this is the method that your subclass should invoke during init. Use it like this:
    package MyParser;
    @ISA = qw(MIME::Parser);
    ...
    sub init {
        my $self = shift;
        $self->SUPER::init(@_);        # do my parent's init
        $self->interface(ENTITY_CLASS => 'MIME::MyEntity');
        $self->interface(HEAD_CLASS   => 'MIME::MyHead');
        $self;                         # return
    }
With no VALUE, returns the VALUE currently associated with that ROLE.
    # Parse an input stream:
    $entity = $parser->read(\*STDIN);
    if (!$entity) {                              # oops!
        my $decapitated = $parser->last_head;    # last top-level head
    }
message/rfc822: literally, the text of an embedded mail/news/whatever message. The normal behavior is to save such a message just as if it were a text/plain document, without attempting to decode it. However, you can change this: before parsing, invoke this method with the OPTION you want:
If OPTION is false, the normal behavior will be used.
If OPTION is true, the body of the message/rfc822 part is decoded (after all, it might be encoded!) into a temporary filehandle, which is then rewound and parsed by this parser, creating an entity object. What happens then is determined by the OPTION:
message/rfc822 entity, as though the message/rfc822 were a special kind of multipart entity. However, the message/rfc822 header (and the content-type) is retained.
Warning: since it is not legal MIME for anything but multipart to have a ``part'', the message/rfc822 message will appear to have no content if you simply print() it out. You will have to get at the reparsed body manually, by the MIME::Entity::parts() method.
IMHO, this option is probably only useful if you're processing messages, but not saving or re-sending them. In such cases, it is best to not use ``parse nested'' at all.
message/rfc822 entity, as though the message/rfc822 ``envelope'' never existed.
Warning: notice that, with this option, all the header information in the message/rfc822 header is lost. This might seriously bother you if you're dealing with a top-level message, and you've just lost the sender's address and the subject line. :-/
Note: where the parsed body parts are stored (e.g., in-core vs. on-disk) is not determined by this class, but by the subclass you use to do the actual parsing (e.g., MIME::Parser). For efficiency, if you know you'll be parsing a small amount of data, it is probably best to tell the parser to store the parsed parts in core. For example, here's a short test program, using MIME::Parser:
    use MIME::Parser;

    my $msg = <<EOF;
    Content-type: text/html
    Content-transfer-encoding: 7bit

    <H1>Hello, world!</H1>;

    EOF
    $parser = new MIME::Parser;
    $parser->output_to_core('ALL');
    $entity = $parser->parse_data($msg);
    $entity->print(\*STDOUT);
Simply give this method the paths to the respective files. These must be pathnames: Perl ``open-able'' expressions won't work, since the pathnames are shell-quoted for safety.
WARNING: it is assumed that, once the files are cat'ed together, there will be a blank line separating the head part and the body part.
Returns the parsed entity, or undef on error.
output_dir().
The INSTREAM can be given as a readable FileHandle, a globref'd filehandle (like \*STDIN), or as any blessed object conforming to the MIME::IO (or IO::) interface.
Returns a MIME::Entity, which may be a single entity, or an arbitrarily-nested multipart entity. Returns undef on failure.
You don't need to override this in your subclass. If you override it, however, make sure you call the inherited method to init your parents!
    package MyParser;
    @ISA = qw(MIME::Parser);
    ...
    sub init {
        my $self = shift;
        $self->SUPER::init(@_);        # do my parent's init

        # ...my init stuff goes here...

        $self;                         # return
    }
Should return the self object on success, and undef on failure.
If you want the parser to do something other than write its parts out to files, you should override this method in a subclass. For an example, see MIME::Parser.
Note: the reason that we don't use the ``interface'' mechanism for this is that your choice of (1) which body class to use, and (2) how its new() method is invoked, may be very much based on the information in the header.
multipart-body := preamble 1*encapsulation close-delimiter epilogue
encapsulation := delimiter body-part CRLF
delimiter := "--" boundary CRLF ; taken from Content-Type field.
                                ; There must be no space between "--"
                                ; and boundary.
close-delimiter := "--" boundary "--" CRLF ; Again, no space by "--"
preamble := discard-text ; to be ignored upon receipt.
epilogue := discard-text ; to be ignored upon receipt.
discard-text := *(*text CRLF)
body-part := <"message" as defined in RFC 822, with all header fields optional, and with the specified delimiter not occurring anywhere in the message body, either on a line by itself or as a substring anywhere. Note that the semantics of a part differ from the semantics of a message, as described in the text.>
From this we glean the following algorithm for parsing a MIME stream:
    PROCEDURE parse
    INPUT
        A FILEHANDLE for the stream.
        An optional end-of-stream OUTER_BOUND (for a nested multipart message).

    RETURNS
        The (possibly-multipart) ENTITY that was parsed.
        A STATE indicating how we left things: "END" or "ERROR".

    BEGIN
        LET OUTER_DELIM = "--OUTER_BOUND".
        LET OUTER_CLOSE = "--OUTER_BOUND--".
        LET ENTITY = a new MIME entity object.
        LET STATE  = "OK".

        Parse the (possibly empty) header, up to and including the
        blank line that terminates it.   Store it in the ENTITY.

        IF the MIME type is "multipart":
            LET INNER_BOUND = get multipart "boundary" from header.
            LET INNER_DELIM = "--INNER_BOUND".
            LET INNER_CLOSE = "--INNER_BOUND--".

            Parse preamble:
                REPEAT:
                    Read (and discard) next line
                UNTIL (line is INNER_DELIM) OR we hit EOF (error).

            Parse parts:
                REPEAT:
                    LET (PART, STATE) = parse(FILEHANDLE, INNER_BOUND).
                    Add PART to ENTITY.
                UNTIL (STATE != "DELIM").

            Parse epilogue:
                REPEAT (to parse epilogue):
                    Read (and discard) next line
                UNTIL (line is OUTER_DELIM or OUTER_CLOSE) OR we hit EOF
                LET STATE = "EOF", "DELIM", or "CLOSE" accordingly.

        ELSE (if the MIME type is not "multipart"):
            Open output destination (e.g., a file)

            DO:
                Read, decode, and output data from FILEHANDLE
            UNTIL (line is OUTER_DELIM or OUTER_CLOSE) OR we hit EOF.
            LET STATE = "EOF", "DELIM", or "CLOSE" accordingly.

        ENDIF

        RETURN (ENTITY, STATE).
    END
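The procedure above can be sketched in plain Perl. This is only an illustration under simplifying assumptions, not the module's actual implementation: the input is an array reference of lines standing in for the FILEHANDLE, entities are plain hashes, header folding is ignored, and body data is collected without decoding. The names `parse_entity` and `boundary_state` are hypothetical.

```perl
use strict;
use warnings;

# Classify a line against a boundary: "DELIM", "CLOSE", or "" (neither).
sub boundary_state {
    my ($line, $bound) = @_;
    return '' unless defined $bound;
    (my $text = $line) =~ s/\r?\n\z//;          # tolerate CRLF or newline
    return 'DELIM' if $text eq "--$bound";
    return 'CLOSE' if $text eq "--$bound--";
    return '';
}

# Hypothetical sketch of PROCEDURE parse. $lines is consumed from the front.
# Returns (ENTITY, STATE), where STATE is "EOF", "DELIM", or "CLOSE".
sub parse_entity {
    my ($lines, $outer_bound) = @_;
    my %entity = (head => {}, body => [], parts => []);
    my $state  = 'EOF';

    # Parse the (possibly empty) header, including its terminating blank line.
    while (defined(my $line = shift @$lines)) {
        last if $line =~ /\A\r?\n?\z/;
        $entity{head}{ lc($1) } = $2
            if $line =~ /\A([^:\s]+):\s*(.*?)\r?\n?\z/;
    }

    my $type = $entity{head}{'content-type'} // 'text/plain';
    if ($type =~ m{\Amultipart/}i) {
        my ($inner) = $type =~ /boundary="?([^";]+)"?/i
            or die "multipart entity without a boundary";

        # Preamble: discard lines up to the first inner delimiter.
        my $inner_state = '';
        while (defined(my $line = shift @$lines)) {
            $inner_state = boundary_state($line, $inner);
            last if $inner_state;
        }
        die "no opening delimiter found" unless $inner_state eq 'DELIM';

        # Parts: recurse until a part ends on something other than a delimiter.
        my $part;
        do {
            ($part, $inner_state) = parse_entity($lines, $inner);
            push @{ $entity{parts} }, $part;
        } while ($inner_state eq 'DELIM');

        # Epilogue: discard lines up to the outer boundary or EOF.
        while (defined(my $line = shift @$lines)) {
            my $s = boundary_state($line, $outer_bound);
            if ($s) { $state = $s; last; }
        }
    }
    else {
        # Body: collect lines until the outer boundary or EOF.
        while (defined(my $line = shift @$lines)) {
            my $s = boundary_state($line, $outer_bound);
            if ($s) { $state = $s; last; }
            push @{ $entity{body} }, $line;
        }
    }
    return (\%entity, $state);
}
```

Calling `parse_entity(\@lines)` with no second argument parses a top-level message; the recursion passes the inner boundary down as the OUTER_BOUND of each part.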
For reasons discussed in MIME::Entity, we can't just discard the ``discard text'': some mailers actually put data in the preamble.
A better solution for this case would be to set up some form of state machine for input processing. This will be left for future versions.
The revised implementation uses a temporary file (a la tmpfile()) during parsing to hold the encoded portion of the current MIME document or part. This file is deleted automatically after the current part is decoded and the data is written to the ``body stream'' object; you'll never see it, and should never need to worry about it.
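As a rough illustration of that spool-then-decode flow (not the module's actual code), the sketch below writes encoded lines to an anonymous temporary file, rewinds it, and decodes into an output filehandle. It assumes base64 data whose lines each hold a whole number of 4-character groups, as standard encoders produce; the helper name `spool_and_decode` is hypothetical.

```perl
use strict;
use warnings;
use IO::File;
use MIME::Base64 qw(decode_base64);

# Hypothetical helper: spool base64-encoded lines to an anonymous temp file,
# rewind it, then decode line by line into $out (any writable filehandle).
sub spool_and_decode {
    my ($encoded_lines, $out) = @_;

    my $tmp = IO::File->new_tmpfile
        or die "can't create temp file: $!";
    print {$tmp} $_ for @$encoded_lines;

    seek($tmp, 0, 0) or die "can't rewind temp file: $!";
    while (defined(my $line = <$tmp>)) {
        print {$out} decode_base64($line);
    }
    close($tmp);    # the anonymous temp file goes away once closed
    return 1;
}
```

Because the encoded data lives on disk rather than in a scalar, memory use stays bounded by the longest line no matter how large the part is.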
Some folks have asked for the ability to bypass this temp-file mechanism, I suppose because they assume it would slow down their application. I considered accommodating this wish, but the temp-file approach solves a lot of thorny problems in parsing, and it also protects against hidden bugs in user applications (what if you've directed the encoded part into a scalar, and someone unexpectedly sends you a 6 MB tar file?). Finally, I'm just not convinced that the temp-file use adds significant overhead.
"\r\n"). However, it is extremely likely that folks will want to parse MIME streams where each line ends in the local newline character "\n" instead.
An attempt has been made to allow the parser to handle both CRLF and newline-terminated input.
"7bit" and "8bit" decoders will decode both a "\n" and a "\r\n" end-of-line sequence into a "\n".
The "binary" decoder (default if no encoding specified) still outputs stuff verbatim... so a MIME message with CRLFs and no explicit encoding will be output as a text file that, on many systems, will have an annoying ^M at the end of each line... but this is as it should be.
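The difference between the two behaviors can be sketched as follows. These helpers are hypothetical stand-ins written for illustration, not the decoders shipped with the kit: the first normalizes a trailing CRLF or bare newline to "\n", as described for "7bit" and "8bit", and the second passes data through untouched, as described for "binary".

```perl
use strict;
use warnings;

# Hypothetical stand-in for the "7bit"/"8bit" behavior:
# a trailing "\r\n" or "\n" becomes a single "\n".
sub decode_text_line {
    my ($line) = @_;
    $line =~ s/\r?\n\z/\n/;
    return $line;
}

# Hypothetical stand-in for the "binary" behavior: output verbatim,
# so any "\r" before the newline survives (the "^M" mentioned above).
sub decode_binary_line {
    my ($line) = @_;
    return $line;
}
```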
If your mailer creates multipart boundary strings that contain newlines when they appear in the message body, give it two weeks' notice and find another one. If your mail robot receives MIME mail like this, regard it as syntactically incorrect MIME, which it is.
Why do I say that? Well, in RFC-1521, the syntax of a boundary is given quite clearly:
    boundary := 0*69<bchars> bcharsnospace

    bchars := bcharsnospace / " "

    bcharsnospace := DIGIT / ALPHA / "'" / "(" / ")" / "+" / "_"
                   / "," / "-" / "." / "/" / ":" / "=" / "?"
All of which means that a valid boundary string cannot have newlines in it, and any newlines in such a string in the message header are expected to be solely the result of folding the string (i.e., inserting to-be-removed newlines for readability and line-shortening only).
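A mail robot that wants to check an unfolded boundary string against this grammar could use something like the sketch below. The helper name `is_valid_boundary` is hypothetical, and the pattern is a direct transcription of the quoted syntax (up to 69 bchars followed by one bcharsnospace), not code taken from this kit.

```perl
use strict;
use warnings;

# Transcription of the RFC-1521 boundary syntax quoted above.
my $BOUNDARY_RE = qr{
    \A
    [0-9A-Za-z'()+_,\-./:=?\ ]{0,69}    # 0*69<bchars> (space allowed)
    [0-9A-Za-z'()+_,\-./:=?]            # bcharsnospace (no trailing space)
    \z
}x;

# Hypothetical helper: returns 1 if $boundary is syntactically valid, else 0.
sub is_valid_boundary {
    my ($boundary) = @_;
    return 0 unless defined $boundary;
    return ($boundary =~ $BOUNDARY_RE) ? 1 : 0;
}
```

A boundary containing an embedded newline fails this check, since a newline is not among the permitted characters.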
Yet, there is at least one brain-damaged user agent out there that composes mail like this:
    MIME-Version: 1.0
    Content-type: multipart/mixed; boundary="----ABC-
     123----"
    Subject: Hi... I'm a dork!

    This is a multipart MIME message (yeah, right...)

    ----ABC-
     123----

    Hi there!
We have got to discourage practices like this (and the recent file upload idiocy where binary files that are part of a multipart MIME message aren't base64-encoded) if we want MIME to stay relatively simple, and MIME parsers to be relatively robust.
Thanks to Andreas Koenig for bringing a baaaaaaaaad user agent to my attention.
binmode() calls were added in module version 1.11... if binmode() is not a NOOP on your system, please pay careful attention to your output, and report any anomalies. It is possible that "make test" will fail on such systems, since some of the tests involve checking the sizes of the output files. That doesn't necessarily indicate a problem.
If anyone wants to test out this package's handling of both binary and textual email on a system where binmode() is not a NOOP, I would be most grateful. If stuff breaks, send me the pieces (including the original email that broke it, and at the very least a description of how the output was screwed up).
All rights reserved. This program is free software; you can redistribute it and/or modify it under the same terms as Perl itself.