Pod::Parse - Parse perl's pod files.
THIS TK SNAPSHOT SHOULD BE REPLACED BY A CPAN MODULE
A module designed to simplify the job of parsing and formatting ``pods'',
the documentation format used by perl5. This consists of several different
functions to present and modify predigested pod files.
This is a work in progress, so I may have some stuff wrong, perhaps badly.
Some of my more reaching guesses:
-
An =index paragraph should be split into lines, and each line placed inside
an `X' formatting command which is then prepended to the next paragraph,
like this:
=index foo
foo2
foo3
foo2!subfoo

Foo!
Will become:
X<foo>X<foo2>X<foo3>X<foo2!subfoo>Foo!
-
A related change: that an `X' command is to be used for indexing data. This
implies that all formatters need to at least ignore the `X' command.
-
Inside an =command, no special significance is to be placed on the first
line of the argument. Thus the following two lines should be parsed
identically:
=item 1. ABC
=item 1.
ABC
Note that neither of these is identical to this:
=item 1.

ABC
which puts the ``ABC'' in a separate paragraph.
-
I actually violate this rule twice: in parsing =index commands, and in
passing through the =pragma commands. I hope this makes sense.
-
I added the =comment command, which simply ignores the next paragraph.
-
I also added =pragma, which also ignores the next paragraph, but this time
it gives the formatter a chance at doing something sinister with it.
This module has two goals: first, to simplify the use of the pod format,
and second, to codify the pod format. While perlpod contains
some information, it hardly gives the entire story. Here I present ``the
rules'', or at least the rules as far as I've managed to work them out.
- Paragraphs: The basic element
-
The fundamental ``atom'' of a pod file is the paragraph, where a paragraph
is defined as the text up to the next completely blank line (``\n\n''). Any
pod parser will read in paragraphs sequentially, deciding what to do with
each based solely on the current state and on the text at the _beginning_
of the paragraph.
- Commands: The method of communication
-
A paragraph that starts with the `=' symbol is assumed to be a special
command. All of the alphanumeric characters directly after the `=' are
assumed to be part of the name of the command, up to the first whitespace.
Anything past that whitespace is considered ``the argument'', and the
argument continues until the end of the paragraph, regardless of newlines
or other whitespace.
- Text: Commands that aren't Commands
-
A paragraph that doesn't start with `=' is treated as either of two types
of text. If it starts with a space or tab, it is considered a verbatim
paragraph, which will be printed out... verbatim. No formatting changes
whatsoever may be done. (Actually, this isn't quite true, but I'll get back
to that at a later date.)
A paragraph that doesn't start with whitespace or `=' is assumed to consist
of formatted text that can be molded as the formatter sees fit.
Reformatting to fit margins, whatever, it's fair game. These paragraphs
also can contain a number of different formatting codes, which verbatim
paragraphs can't. These formatting codes are covered later.
- =cut: The uncommand
-
There is one command that needs special mention: =cut. Anything after a
paragraph starting with =cut is simply ignored by the formatter. In
addition, any text before a valid command is equally ignored. Any valid `='
command will re-enable formatting. This fact is used to great benefit by
Perl, which is glad to
ignore anything between an `=' command and `=cut', so you can embed a pod
document right inside a perl program, and neither will bother the other.
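The paragraph rules above can be sketched in a few lines of Perl. This is only an illustration of the rules as stated here, not the code Pod::Parse uses; classify_pod() and the tag names it returns are hypothetical.

```perl
use strict;
use warnings;

# Classify pod paragraphs according to the rules described above.
# classify_pod() and its tag names are hypothetical, not part of Pod::Parse.
sub classify_pod {
    my ($source) = @_;
    my @out;
    my $in_pod = 0;    # text before the first command is ignored
    # A paragraph ends at a completely blank line.
    for my $para (split /\n{2,}/, $source) {
        (my $p = $para) =~ s/\n+\z//;    # drop trailing newlines
        next unless length $p;
        if ($p =~ /^=(\w+)\s*(.*)\z/s) {
            my ($name, $arg) = ($1, $2);
            if ($name eq 'cut') { $in_pod = 0; next; }
            $in_pod = 1;    # any other command re-enables formatting
            push @out, [command => $name, $arg];
        }
        elsif (!$in_pod) {
            next;           # skipped: before a command or after =cut
        }
        elsif ($p =~ /^[ \t]/) {
            push @out, [verbatim => $p];
        }
        else {
            push @out, [text => $p];
        }
    }
    return @out;
}

my $src = join "\n\n",
    'junk before the first command',
    '=head1 NAME',
    "Some I<text>\nhere.",
    '    verbatim line',
    '=cut',
    'my $x = 1;';

my @p = classify_pod($src);
printf "%-8s %s\n", $_->[0], $_->[-1] for @p;
```

Only the three paragraphs between =head1 and =cut survive; the leading junk and the trailing Perl statement are dropped.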
- Reference to paragraph commands
-
- =cut
-
Ignore anything till the next paragraph starting with `='.
- =head1
-
A top-level heading. Anything after the command (either on the same line or
on further lines) is included in the heading, up until the end of the
paragraph.
- =head2
-
Secondary heading. Same as =head1, but different. No, there isn't a head3,
head4, etc.
- =over [N]
-
Start a list. The N is the number of characters to indent by. Not all
formatters will listen to this, though. A good number to use is 4.
While =over sounds like it should just be indentation, it's more complex
than that. It actually starts a nested environment, specifically for the
use of =item's. As this command recurses properly, you can use more than
one; you just have to make sure they are closed off properly by =back
commands.
- =back
-
Ends the last =over block. Resets the indentation to whatever it was
previously. Closes off the list of =item's.
- =item
-
The point behind =over and =back. This command should only be used between
them. The argument supplied should be consistent (within a list) with one of
three types: enumeration, itemization, or description. To exemplify:
An itemized list
=over 4
=item *
A bulleted item
=item *
Another bulleted item
=back
An enumerated list
=over 4
=item 1.
First item.
=item 2.
Second item.
=back
A described list
=over 4
=item Item #1
First item
=item Item #2 (which isn't really like #1, but is the second).
Second item
=back
If you aren't consistent about the arguments to =item, Pod::Parse will
complain.
- =comment
-
Ignore this paragraph.
- =pragma
-
Ignore this paragraph, as well, unless you know what you are doing.
- =index
-
Undecided at this time, but probably magic involving X<>.
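The consistency rule for =item arguments described above could be checked along the following lines. This is a sketch only: item_type() and check_list() are hypothetical helpers, the patterns are guesses at what counts as a bullet or a number, and the numeric codes simply mirror the list types used elsewhere in this document (1 bulleted, 2 enumerated, 3 text, 0 no items).

```perl
use strict;
use warnings;

# Guess the list type from the start of an =item argument.
# 1 = bulleted, 2 = enumerated, 3 = text (description).
# Hypothetical helper; the patterns are assumptions.
sub item_type {
    my ($arg) = @_;
    return 1 if $arg =~ /^\*(?:\s|\z)/;           # "*"
    return 2 if $arg =~ /^\d+[.)]?(?:\s|\z)/;     # "1", "1." or "1)"
    return 3;                                     # anything else
}

# Return the common type of a list, 0 for an empty list,
# or undef if the items disagree.
sub check_list {
    my @items = @_;
    return 0 unless @items;
    my $type = item_type($items[0]);
    for my $arg (@items) {
        return undef if item_type($arg) != $type;
    }
    return $type;
}

for my $list (['*', '*'], ['1.', '2.'], ['*', '1.']) {
    my $type = check_list(@$list);
    print join(' ', @$list), ': ',
        defined $type ? "type $type" : 'inconsistent', "\n";
}
```

Because only the start of the argument is examined, ``=item 1. ABC'' and ``=item 1.'' followed by ``ABC'' on the next line are both treated as enumerated, which matches the rule that the first line of an argument carries no special significance.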
- Reference to formatting directives
-
- B<...>
-
Format text inside the brackets as bold.
- I<...>
-
Format text inside the brackets as italics.
- Z<>
-
Replace with a zero-width character. You'll probably figure out some uses
for this.
- And yet more that I haven't described yet...
-
The Parse function takes a list of files as an argument. If no argument is
given, it defaults to the contents of @ARGV. Parse then reads through each
file and returns the data as a list. Each element of this list will be a
nested list containing data from a paragraph of the pod file. Elements
pertaining to ``=over'' paragraphs will themselves contain the nested
entries for all of the paragraphs within that list. Thus, it's easier to
parse the output of Parse using a recursive parser. (Um, did that parse?)
It is highly recommended that you use the output of Simplify, not Parse, as it's
simpler.
The output will consist of a list, where each element in the list matches
one of these prototypes:
- [0,0,0,0,$filename]
-
This is produced at the beginning of each file parsed, where $filename is
the name of that file.
- [-1,0,0,0,$filename]
-
End of same.
- [1,$line,$pos,0,$verbatim]
-
This is produced for each paragraph of verbatim text. $verbatim is the
text, $line is the line offset of the paragraph within the file, and $pos
is the byte offset. (In all of the following elements, $pos and $line have
identical meanings, so I'll skip explaining them each time.)
- [2,$line,$pos,$level,$heading]
-
Produced by a =head1 or =head2 command. $level is either 1 or 2, and
$heading is the argument.
- [3,$line,$pos,0,$item]
-
$item is the argument from an =item paragraph.
- [4,$line,$pos,0,$index]
-
$index is the argument from an =index paragraph.
- [6,$line,$pos,0,$text]
-
Normal formatted text paragraph. $text is the text.
- [7,$line,$pos,0,$pragma]
-
$pragma is the argument from a =pragma paragraph.
- [8,$line,$pos,$indentation,$type,...]
-
This item is produced for each matching =over/=back pair. $indentation is
the argument to =over, $type is 1 if the embedded =item's are bulleted, 2
if they are enumerated, 3 if they are text, and 0 if there are no items.
The ``...'' indicates an unlimited number of further elements which are
themselves nested arrays in exactly the format being described. In other
words, a list item includes all the paragraphs inside the list inside
itself. (Clear? No? Nevermind.)
- [9,$line,$pos,0,$cut]
-
$cut contains the text from a =cut paragraph. You shouldn't need to use
this, but I _suppose_ it might be necessary to do special breaks on a cut.
I doubt it though. This one is ``depreciated'', as Larry put it. Or perhaps
disappreciated.
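Since list elements nest, a recursive walk is the natural way to read this structure. The sketch below uses hand-built data in the shape documented above; it was not produced by running Parse, the offsets are made up, and walk() and %kind are hypothetical names.

```perl
use strict;
use warnings;

# Hand-built data in the documented shape; NOT real Parse output.
my @parsed = (
    [0, 0, 0, 0, 'example.pod'],
    [2, 1, 0, 1, 'NAME'],
    [8, 3, 12, 4, 1,
        [3, 5, 20, 0, '*'],
        [6, 7, 28, 0, 'A bulleted item'],
    ],
    [-1, 0, 0, 0, 'example.pod'],
);

# Labels for the type codes listed above (the labels are made up).
my %kind = (
    0 => 'file', -1 => 'endfile', 1 => 'verbatim', 2 => 'head',
    3 => 'item', 4 => 'index', 6 => 'text', 7 => 'pragma',
    8 => 'list', 9 => 'cut',
);

# Walk the nested structure, returning one indented line per element.
sub walk {
    my ($depth, @elements) = @_;
    my @lines;
    for my $el (@elements) {
        my ($code, $line, $pos, $extra, $data, @children) = @$el;
        my $label = $kind{$code} // "unknown($code)";
        if ($code == 8) {
            # For a list, $extra is the indentation and $data the type;
            # everything after that is a nested element.
            push @lines, ('  ' x $depth) . "list indent=$extra type=$data";
            push @lines, walk($depth + 1, @children);
        }
        else {
            push @lines, ('  ' x $depth) . "$label: $data";
        }
    }
    return @lines;
}

print "$_\n" for walk(0, @parsed);
```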
The Simplify procedure takes as its input the convoluted output from
Parse(), and outputs a much simpler array consisting of pairs of commands
and arguments, designed to be easy (easier?) to parse in your pod
formatting code.
It is used very simply by saying something like:
@Pod = Simplify(Parse());
while ($cmd = shift @Pod) {
    $arg = shift @Pod;
    #...
}
Where #... is the code that responds to any of the commands from the
following list. Note that you are welcome to ignore any of the commands
that you want to. Many contain duplicate information, or at least
information that will go unused. A formatter based on this data can be
quite simple indeed. (See pod2text for entirely too simple an example.)
- "filename"
-
The argument contains the name of the pod file that is being parsed. These
will be present at the start of each file. You should open an output file,
output headers, etc., based on this, and not when you start parsing.
- "endfile"
-
The end of the file. Each file will be ended before the next one begins,
and after all files are done with. You can do end processing here. The
argument is the same name as in ``filename''.
- "setline"
-
This gives you a chance to record the ``current'' input line, probably for
debugging purposes. In this case, ``current'' means that the next command
you see that was derived from an input paragraph will start at the line of
the file given by the argument.
- "setloc"
-
Same as setline, but the byte offset in the input, instead of the line
offset.
- "pragma"
-
The argument contains the text of a pragma command.
- "text"
-
The argument contains a paragraph of formatted text.
- "verbatim"
-
The argument contains a paragraph of verbatim text.
- "cut"
-
A =cut command was hit. You shouldn't really need to listen for this one.
- "index"
-
The argument contains an =index paragraph. (Note: Current =index commands
are not fed through, but turned into X<> commands.)
- "head1"
-
- "head2"
-
The argument contains the argument from a header command.
- "setindent"
-
If you are tracking indentation, use the argument to set the indentation
level.
- "listbegin"
-
Start a list environment. The argument is the type of list (1,2,3 or 0).
- "listend"
-
Ends a list environment. Same argument as listbegin.
- "listtype"
-
The argument is the type of list. You can just record the argument when you
see one of these, instead of paying attention to listbegin & listend.
- "over"
-
The argument is the indentation. It's probably better to listen to the
``list...'' commands.
- "back"
-
Ends an ``over'' list. The argument is the original indentation.
- "item"
-
The argument is the text of the =item command.
Note that all of these various commands you've seen are synchronized
properly so you don't have to pay attention to all at once, but they are
all output for your benefit. Consider the following example:
listtype 2
listbegin 2
setindent 4
over 4
item 1.
text Item #1
item 2.
text Item #2
setindent 0
listend 2
back 0
listtype 0
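A formatter that consumes such a stream can pick out the commands it cares about and skip the rest. The sketch below feeds the example stream above, typed in by hand rather than produced by Simplify(Parse()), to a toy formatter; render() is a hypothetical name.

```perl
use strict;
use warnings;

# The command/argument pairs from the example above, written by hand.
my @Pod = (
    listtype => 2, listbegin => 2, setindent => 4, over => 4,
    item => '1.', text => 'Item #1',
    item => '2.', text => 'Item #2',
    setindent => 0, listend => 2, back => 0, listtype => 0,
);

# A toy formatter: it listens to setindent, item and text and
# ignores every other command.
sub render {
    my @stream = @_;
    my ($indent, $out) = (0, '');
    while (@stream) {
        my ($cmd, $arg) = splice @stream, 0, 2;
        if    ($cmd eq 'setindent') { $indent = $arg }
        elsif ($cmd eq 'item')      { $out .= "$arg\n" }
        elsif ($cmd eq 'text')      { $out .= (' ' x $indent) . "$arg\n" }
    }
    return $out;
}

print render(@Pod);
```

Looping on the length of the list, rather than on the truth of each command, is a small design choice in this sketch: it keeps the loop running even if a command name were ever an empty or false value.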
=head2 Normalize
This command is normally invoked by Parse, so you shouldn't need to deal
with it. It just cleans up text a little, turning spare '<', '>', and '&'
characters into HTML escapes (&lt;, etc.) as well as generating warnings for
some pod formatting mistakes.
A little more aggressive formatting based on heuristics. Not applied by
default, as it might confuse your own heuristics.
This hash is exported from Pod::Parse, and contains default ASCII
translations for some common HTML escape sequences. You might like to use
this as a basis for an %HTML_Escapes hash in your own formatter.
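As an illustration of how a formatter might use such a table, here is a minimal sketch. The hash below is hand-written and called %My_Escapes to make clear it is not the exported one; its keys (escape names without the `&' and `;') and the unescape() helper are assumptions, so check the module source for the real layout.

```perl
use strict;
use warnings;

# Hand-written stand-in for an escape table; keys and contents are
# assumptions, not the hash exported by Pod::Parse.
my %My_Escapes = (
    amp  => '&',
    lt   => '<',
    gt   => '>',
    quot => '"',
);

# Replace known HTML escapes with their ASCII translation and leave
# unknown ones untouched.
sub unescape {
    my ($text) = @_;
    $text =~ s/&(\w+);/exists $My_Escapes{$1} ? $My_Escapes{$1} : "&$1;"/ge;
    return $text;
}

print unescape('a &lt; b &amp;&amp; c &foo;'), "\n";
```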