So far we've managed to parse simple patterns that could have been specified with a simple regular
expression. Any parser for a nontrivial grammar needs other abilities as well: it must be able to choose
from a list of alternatives, to repeat patterns, and to form nested scopes in which to match other
content.
"Parser::MGC" provides a set of methods that take one or more "CODE" references that perform some parsing
step, and form a higher-level construction out of them. These can be used to build more complex parsers
out of simple ones. It is this recursive structure that gives "Parser::MGC" its main power over simple
one-shot regexp matching.
Any nontrivial grammar is likely to be formed from multiple named rules. It is natural therefore to split
the parser for such a grammar into methods whose names reflect the structure of the grammar to be parsed.
Each of the structure-forming methods that takes "CODE" references invokes each one with the parser
object itself as the first argument. This makes it simple to invoke sub-rules by passing references to
the method subs themselves, because the parser object will already be passed as the invocant.
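
As a minimal sketch of that calling convention, the following subclass passes a method reference straight to "sequence_of" (described in more detail later) rather than wrapping it in an anonymous sub. The package name "RefDemo" and the rule name "parse_word" are hypothetical, chosen only for this illustration; "can" is Perl's standard "UNIVERSAL" method for looking up a method as a "CODE" reference.

```perl
package RefDemo;   # hypothetical package name, for illustration only
use strict;
use warnings;
use parent 'Parser::MGC';

# A sub-rule: one identifier, upper-cased so the rule's effect is visible
sub parse_word
{
   my $self = shift;
   return uc $self->token_ident;
}

sub parse
{
   my $self = shift;
   # Same effect as: $self->sequence_of( sub { $self->parse_word } )
   # because the parser object is passed to the CODE reference as its invocant
   return $self->sequence_of( $self->can( "parse_word" ) );
}

package main;

my $words = RefDemo->new->from_string( "foo bar" );
print "@$words\n";   # prints "FOO BAR"
```

Writing "\&parse_word" would work here as well, but looking the method up with "can" means a subclass that overrides "parse_word" is still respected.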
The following examples will build together into a parser for a simple C-like expression language.
Optional Rules
The simplest of the structure-forming methods, "maybe", attempts to run the parser step it is given and
if it succeeds, returns the value returned by that step. If it fails by throwing an exception, then the
"maybe" call simply returns "undef" and resets the current parse position back to where it was before it
started. This allows writing a grammar that includes an optional element, similar to the "?" quantifier
in a regular expression.
    sub parse_type
    {
       my $self = shift;
       my $storage = $self->maybe( sub {
          $self->token_kw(qw( static auto typedef ));
       } );
       return MyGrammar::Type->new( $self->parse_ident, $storage );
    }
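
To see the backtracking in action, here is a small self-contained sketch along the same lines. It assumes a hypothetical "MaybeDemo" package and returns a plain array reference instead of a "MyGrammar::Type" object, so that it can run without any other classes.

```perl
package MaybeDemo;   # hypothetical package name, for illustration only
use strict;
use warnings;
use parent 'Parser::MGC';

sub parse
{
   my $self = shift;
   # Optional storage-class keyword; undef if the next identifier is not one
   my $storage = $self->maybe( sub {
      $self->token_kw(qw( static auto typedef ));
   } );
   # Because maybe() reset the position on failure, the identifier is still unread
   my $name = $self->token_ident;
   return [ $name, $storage ];
}

package main;

for my $input ( "static counter", "counter" ) {
   my ( $name, $storage ) = @{ MaybeDemo->new->from_string( $input ) };
   printf "%-16s name=%s storage=%s\n", $input, $name, $storage // "(none)";
}
```

In the second input the "token_kw" call consumes "counter", finds that it is not an accepted keyword and fails; "maybe" then rewinds so that "token_ident" can read the same word again.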
Repeated Rules
The next structure-forming method, "sequence_of", attempts to run the parser step it is given multiple
times until it fails, and returns an "ARRAY" reference collecting up all the return values from each
iteration that succeeded. By itself, "sequence_of" can never fail; if the body never matches then it just
yields an empty array and consumes nothing from the input. This allows writing a grammar that includes a
repeating element, similar to the "*" quantifier in a regular expression.
    sub parse_statements
    {
       my $self = shift;
       my $statements = $self->sequence_of( sub {
          $self->parse_statement;
       } );
       return MyGrammar::Statements->new( $statements );
    }
Often it is the case that the grammar requires at least one item to be present, and should not accept an
empty parse of zero elements. This can be achieved in code by testing the size of the returned array, and
using the "fail" method. This could be considered similar to the "+" quantifier in a regular expression.
    sub parse_statements
    {
       my $self = shift;
       my $statements = $self->sequence_of( sub {
          $self->parse_statement;
       } );
       @$statements > 0 or $self->fail( "Expected at least one statement" );
       return MyGrammar::Statements->new( $statements );
    }
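
The difference between the two forms can be tried out with a runnable sketch that repeats integers rather than statements. The package "SeqDemo" and the rule "parse_numbers" are hypothetical names used only here; the "toplevel" constructor argument selects which method "from_string" starts from.

```perl
package SeqDemo;   # hypothetical package name, for illustration only
use strict;
use warnings;
use parent 'Parser::MGC';

# Zero or more integers, like "*": never fails by itself
sub parse_numbers
{
   my $self = shift;
   return $self->sequence_of( sub { $self->token_int } );
}

# One or more integers, like "+": reject an empty result explicitly
sub parse
{
   my $self = shift;
   my $numbers = $self->parse_numbers;
   @$numbers > 0 or $self->fail( "Expected at least one number" );
   return $numbers;
}

package main;

my $some = SeqDemo->new->from_string( "1 2 3" );
print "matched: @$some\n";

# With parse_numbers as the toplevel rule, empty input is accepted
my $none = SeqDemo->new( toplevel => "parse_numbers" )->from_string( "" );
print "count: ", scalar @$none, "\n";

# The default toplevel, parse, throws because no integer matched
my $ok = eval { SeqDemo->new->from_string( "x" ); 1 };
print "failed as expected\n" if !$ok;
```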
Another case that often happens is that the grammar requires some simple separation pattern between each
parsed item, such as a comma. The "list_of" method helps here because it automatically handles those
separating patterns between the items, returning a reference to an array containing only the actual
parsed items without the separators.
    sub parse_expression_list
    {
       my $self = shift;
       my $exprs = $self->list_of( ",", sub {
          $self->parse_expression;
       } );
       return MyGrammar::ExpressionList->new( $exprs );
    }
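
A minimal runnable variant, assuming a hypothetical "ListDemo" package whose items are plain integers, shows that the separators and the whitespace around them do not appear in the result.

```perl
package ListDemo;   # hypothetical package name, for illustration only
use strict;
use warnings;
use parent 'Parser::MGC';

sub parse
{
   my $self = shift;
   # Items are integers; the "," separators are consumed but not returned
   return $self->list_of( ",", sub { $self->token_int } );
}

package main;

my $items = ListDemo->new->from_string( "10, 20 ,30" );
print join( " ", @$items ), "\n";   # prints "10 20 30"
```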
Alternate Rules
To handle a choice of multiple different alternatives in the grammar, the "any_of" method takes an
ordered list of parser steps, and attempts to invoke each in turn. It yields as its result the result of
the first one of these that didn't fail. This allows writing a grammar that allows a choice of multiple
different rules at some point, similar to the "|" alternation in a regular expression.
    sub parse_statement
    {
       my $self = shift;
       $self->any_of(
          sub { $self->parse_declaration },
          # Keep the expression as the result, rather than the ";" returned by expect
          sub { my $expr = $self->parse_expression; $self->expect( ';' ); $expr },
          sub { $self->parse_block_statement },
       );
    }
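
As a self-contained sketch of ordered choice, the following hypothetical "AnyDemo" parser tries three of the basic token methods in turn and tags its result with the alternative that matched. The tag names are made up for this example.

```perl
package AnyDemo;   # hypothetical package name, for illustration only
use strict;
use warnings;
use parent 'Parser::MGC';

sub parse
{
   my $self = shift;
   # Alternatives are tried in order; the first one that does not fail wins
   return $self->any_of(
      sub { [ int    => $self->token_int    ] },
      sub { [ string => $self->token_string ] },
      sub { [ ident  => $self->token_ident  ] },
   );
}

package main;

for my $input ( '42', '"hi"', 'foo' ) {
   my ( $kind, $value ) = @{ AnyDemo->new->from_string( $input ) };
   print "$input is a $kind with value $value\n";
}

# If every alternative fails, any_of itself fails
my $ok = eval { AnyDemo->new->from_string( "+" ); 1 };
print "no alternative matched\n" if !$ok;
```

Each failed alternative leaves the parse position where it was before the attempt, so the next one starts from the same place.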
Scoping Rules
The final structure-forming method has no direct analogy to a regular expression, though usually similar
structures can be found. To handle the case where some nested structure has to be handled between opening
and closing markers, the "scope_of" method can be used. It takes three arguments, being the opening
marker, a parser step to handle the contents of the body, and the closing marker. It expects to find each
of these in sequence, and returns the value that the inner parsing step returned.
However, what makes it more interesting is that during execution of the inner parsing step, the basic
token functions all take into account the closing marker. No token function will return a result if the
stream now looks like the scope closing marker. Instead, they'll all fail claiming to be at the end of
the scope. This makes it much simpler to parse, for example, lists of values surrounded by braces.
    sub parse_array_initialiser
    {
       my $self = shift;
       $self->scope_of( "{", sub { $self->parse_expression_list }, "}" );
    }
During execution of the inner call to "parse_expression_list", any occurrence of the "}" marker in the
stream will appear to be the end of the stream, so the inner call stops at what is hopefully the right
place (barring other syntax errors) and terminates correctly.
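
The following sketch combines "scope_of" with "list_of" and "any_of" to parse nested brace-delimited lists of integers. The package "ScopeDemo" and its rule names are hypothetical, and the values are plain integers rather than full expressions.

```perl
package ScopeDemo;   # hypothetical package name, for illustration only
use strict;
use warnings;
use parent 'Parser::MGC';

# A value is either an integer or a nested brace-delimited array
sub parse_value
{
   my $self = shift;
   return $self->any_of(
      sub { $self->token_int },
      sub { $self->parse_array },
   );
}

sub parse_array
{
   my $self = shift;
   # Inside the scope, "}" looks like end-of-input to the token methods,
   # so the inner list_of stops there without any explicit check
   return $self->scope_of( "{", sub {
      $self->list_of( ",", sub { $self->parse_value } );
   }, "}" );
}

sub parse
{
   my $self = shift;
   return $self->parse_array;
}

package main;

my $tree = ScopeDemo->new->from_string( "{ 1, { 2, 3 }, 4 }" );
print "inner list: @{ $tree->[1] }\n";   # prints "inner list: 2 3"

my $empty = ScopeDemo->new->from_string( "{}" );
print "empty count: ", scalar @$empty, "\n";
```

Note that the "{}" case needs no special handling: the inner "list_of" sees the closing marker as the end of its input and returns an empty array.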
perl v5.40.0 2024-11-21 Parser::MGC::Tutorial(3pm)