XML::SAX::ByRecord is a SAX machine that treats a document as a series of records. Everything before and
after the records is emitted as-is, while the records are excerpted into mini-documents and run
one at a time through the filter pipeline contained in ByRecord.
The output is a document with exactly the same content before, after, and between the records as the
input document, but with each record run through the filter pipeline. So if a document has 10 records in
it, the per-record filter pipeline will see 10 sets of ( start_document, body of record, end_document )
events. An example is below.
This has several use cases:
• Big, record oriented documents
Big documents can be processed one record at a time with various DOM-oriented processors like
XML::Filter::XSLT.
• Streaming XML
Small sections of an XML stream can be run through a document processor without holding up the
stream.
• Record oriented style sheets / processors
Sometimes it's just plain easier to write a style sheet or SAX filter that applies to a single record
at a time, rather than having to run through a series of records.
Topology
Here's how the innards look:
   +-----------------------------------------------------------+
   |                  An XML::SAX::ByRecord                    |
   |    Intake                                                 |
   |   +----------+    +---------+         +--------+  Exhaust |
 --+-->| Splitter |--->| Stage_1 |-->...-->| Merger |----------+----->
   |   +----------+    +---------+         +--------+          |
   |               \                            ^              |
   |                \                           |              |
   |                 +---------->---------------+              |
   |                   Events not in any records               |
   |                                                           |
   +-----------------------------------------------------------+
The "Splitter" is an XML::Filter::DocSplitter by default, and the "Merger" is an XML::Filter::Merger by
default. The line that bypasses the "Stage_1 ..." filter pipeline is used for all events that do not
occur in a record. All events that occur in a record pass through the filter pipeline.
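As a rough illustration of this topology, the following toy model mimics the split / bypass / merge flow
in plain Perl. It is not how XML::Filter::DocSplitter or XML::Filter::Merger are implemented: events are
simple [ type, payload ] pairs rather than real SAX calls, by_record() is a hypothetical helper, and the
sketch assumes that a record is any element sitting directly under the root element.

```perl
use strict;
use warnings;

# Toy model only: events are [ type, payload ] pairs, not real SAX calls.
# Assumption: a "record" is any element that is a direct child of the root.
sub by_record {
    my ( $filter, @events ) = @_;
    my $depth = 0;
    my ( @record, @out );
    for my $ev ( @events ) {
        my $type = $ev->[0];
        $depth++ if $type eq 'start_element';
        if ( $depth >= 2 ) {
            # "Splitter": events inside a record are buffered for the pipeline.
            push @record, $ev;
            if ( $type eq 'end_element' && $depth == 2 ) {
                # Record complete: run it through the filter as a mini-document.
                my @doc = $filter->( [ 'start_document' ], @record, [ 'end_document' ] );
                # "Merger": drop the per-record document events, keep the rest.
                push @out, grep { $_->[0] !~ /_document$/ } @doc;
                @record = ();
            }
        }
        else {
            # Bypass line: events outside any record go straight to the output.
            push @out, $ev;
        }
        $depth-- if $type eq 'end_element';
    }
    return @out;
}

# A stand-in for the "Stage_1 ..." pipeline: uppercase character data
# and count how many mini-documents were seen.
my $docs_seen = 0;
my $uc = sub {
    $docs_seen++;
    my @filtered;
    for my $ev ( @_ ) {
        push @filtered,
            $ev->[0] eq 'characters' ? [ 'characters', uc $ev->[1] ] : $ev;
    }
    return @filtered;
};

my @events = (
    [ 'start_document' ],
    [ 'start_element', 'root' ], [ 'characters', ' a ' ],
    [ 'start_element', 'rec' ],  [ 'characters', 'b' ], [ 'end_element', 'rec' ],
    [ 'characters', ' c ' ],
    [ 'start_element', 'rec' ],  [ 'characters', 'd' ], [ 'end_element', 'rec' ],
    [ 'characters', ' e ' ],
    [ 'end_element', 'root' ],
    [ 'end_document' ],
);

my @merged = by_record( $uc, @events );
print join( ' ', map { defined $_->[1] ? "$_->[0]($_->[1])" : $_->[0] } @merged ), "\n";
print "mini-documents seen by the filter: $docs_seen\n";
```

Only the text inside the two rec elements reaches the filter; the text between records travels along the
bypass line and comes out unchanged.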
Example
Here's a quick little filter to uppercase text content:
    package My::Filter::Uc;

    use vars qw( @ISA );
    @ISA = qw( XML::SAX::Base );

    use XML::SAX::Base;

    sub characters {
        my $self = shift;
        my ( $data ) = @_;
        $data->{Data} = uc $data->{Data};
        $self->SUPER::characters( @_ );
    }
And here's a little machine that uses it:
    use XML::SAX::Machines qw( Pipeline ByRecord );

    my $out;
    my $m = Pipeline(
        ByRecord( "My::Filter::Uc" ),
        \$out,
    );
When fed a document like:
    <root> a
    <rec>b</rec> c
    <rec>d</rec> e
    <rec>f</rec> g
    </root>
the output looks like:
    <root> a
    <rec>B</rec> c
    <rec>D</rec> e
    <rec>F</rec> g
    </root>
and My::Filter::Uc received three sets of events:
    start_document
    start_element: <rec>
    characters: 'b'
    end_element: </rec>
    end_document

    start_document
    start_element: <rec>
    characters: 'd'
    end_element: </rec>
    end_document

    start_document
    start_element: <rec>
    characters: 'f'
    end_element: </rec>
    end_document
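To try the example end to end, a script along the following lines should work. It is a sketch under a few
assumptions: XML::SAX::Machines and a working XML::SAX parser are installed, the filter class lives in the
same file as the machine (so it is already loaded when ByRecord receives its name), and the document is
fed in with the machine's parse_string method.

```perl
use strict;
use warnings;

package My::Filter::Uc;

use XML::SAX::Base;
our @ISA = qw( XML::SAX::Base );

# Uppercase all character data, then pass the event downstream.
sub characters {
    my $self = shift;
    my ( $data ) = @_;
    $data->{Data} = uc $data->{Data};
    $self->SUPER::characters( @_ );
}

package main;

use XML::SAX::Machines qw( Pipeline ByRecord );

# The serialized output document is collected in $out.
my $out = '';
my $m = Pipeline(
    ByRecord( "My::Filter::Uc" ),
    \$out,
);

# Feed the sample document from above through the machine.
$m->parse_string( <<'END_XML' );
<root> a
<rec>b</rec> c
<rec>d</rec> e
<rec>f</rec> g
</root>
END_XML

print $out, "\n";
```

The text inside each rec element should come out uppercased, while "a", "c", "e", and "g" stay lowercase
because they never pass through the filter.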