XML::SAX::ByRecord - Record oriented processing of (data) documents

Authors

       •   Barry Slaymaker

       •   Chris Prather <chris@prather.org>

Copyright And License

       This software is copyright (c) 2013 by Barry Slaymaker.

       This is free software; you can redistribute it and/or modify it under the same terms as the Perl 5
       programming language system itself.

perl v5.34.0                                       2022-06-28                            XML::SAX::ByRecord(3pm)

Credit

       Proposed by Matt Sergeant, with advice from Kip Hampton and Robin Berjon.

Description

       XML::SAX::ByRecord is a SAX machine that treats a document as a series of records.  Everything before and
       after the records is emitted as-is, while each record is excerpted into its own mini-document and run,
       one at a time, through the filter pipeline contained in ByRecord.

       The output is a document with exactly the same content before, after, and between the records as the
       input document, but with each record run through the filter pipeline.  So if a document has 10 records
       in it, the per-record filter pipeline will see 10 sets of ( start_document, body of record,
       end_document ) events.  An example is below.

       This has several use cases:

       •   Big, record oriented documents

           Big documents can be treated a record at a time with various DOM oriented processors like
           XML::Filter::XSLT.

       •   Streaming XML

           Small sections of an XML stream can be run through a document processor without holding up the
           stream.

       •   Record oriented style sheets / processors

           Sometimes it's just plain easier to write a style sheet or SAX filter that applies to a single record
           at a time, rather than having to run through a series of records.

   Topology
       Here's how the innards look:

          +-----------------------------------------------------------+
          |                  An XML::SAX::ByRecord                    |
          |    Intake                                                 |
          |   +----------+    +---------+         +--------+  Exhaust |
        --+-->| Splitter |--->| Stage_1 |-->...-->| Merger |----------+----->
          |   +----------+    +---------+         +--------+          |
          |               \                            ^              |
          |                \                           |              |
          |                 +---------->---------------+              |
          |                   Events not in any records               |
          |                                                           |
          +-----------------------------------------------------------+

       The "Splitter" is an XML::Filter::DocSplitter by default, and the "Merger" is an XML::Filter::Merger by
       default.  The line that bypasses the "Stage_1 ..." filter pipeline is used for all events that do not
       occur in a record.  All events that occur in a record pass through the filter pipeline.
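
       The routing in the diagram can be illustrated with a toy model in plain Perl.  This is only a sketch
       and does not use the real XML::SAX classes: the event representation ( [ type, value ] pairs ), the
       by_record() function, and the assumption that every direct child of the root element is a record are
       all made up for illustration.

```perl
use strict;
use warnings;

# Toy model of the ByRecord topology (NOT the real XML::SAX classes).
# Events are [ type, value ] pairs.  A "record" is assumed to be any
# element that is a direct child of the root element.
sub by_record {
    my ( $record_filter, @events ) = @_;
    my ( @out, @record );
    my $depth = 0;

    for my $ev (@events) {
        my ( $type ) = @$ev;
        $depth++ if $type eq 'start';

        if ( $depth >= 2 ) {
            # Inside a record: buffer the event for the filter pipeline.
            push @record, $ev;
        }
        else {
            # Outside any record: bypass the pipeline unchanged.
            push @out, $ev;
        }

        if ( $type eq 'end' ) {
            $depth--;
            if ( $depth == 1 && @record ) {
                # Record complete: filter it as one unit, then merge it back.
                push @out, $record_filter->(@record);
                @record = ();
            }
        }
    }
    return @out;
}

# Per-record filter: uppercase character data, pass other events through.
my $uc = sub {
    return map { $_->[0] eq 'chars' ? [ chars => uc $_->[1] ] : $_ } @_;
};

my @events = (
    [ start => 'root' ], [ chars => ' a ' ],
    [ start => 'rec' ],  [ chars => 'b' ], [ end => 'rec' ], [ chars => ' c ' ],
    [ start => 'rec' ],  [ chars => 'd' ], [ end => 'rec' ], [ chars => ' e ' ],
    [ end   => 'root' ],
);

# Serialize the merged event stream back to text.
my %fmt = ( start => '<%s>', end => '</%s>', chars => '%s' );
my $xml = join '', map { sprintf $fmt{ $_->[0] }, $_->[1] } by_record( $uc, @events );
print "$xml\n";    # <root> a <rec>B</rec> c <rec>D</rec> e </root>
```

       Only the text inside the records is uppercased; the text between records takes the bypass path and is
       left alone.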

   Example
       Here's a quick little filter to uppercase text content:

           package My::Filter::Uc;

           use vars qw( @ISA );
           @ISA = qw( XML::SAX::Base );

           use XML::SAX::Base;

           sub characters {
               my $self = shift;
               my ( $data ) = @_;
               $data->{Data} = uc $data->{Data};
               $self->SUPER::characters( @_ );
           }

       And here's a little machine that uses it:

           use XML::SAX::Machines qw( Pipeline ByRecord );

           my $out;
           my $m = Pipeline(
               ByRecord( "My::Filter::Uc" ),
               \$out,    ## collects the serialized output
           );

       When fed a document like:

           <root> a
               <rec>b</rec> c
               <rec>d</rec> e
               <rec>f</rec> g
           </root>

       the output looks like:

           <root> a
               <rec>B</rec> c
               <rec>D</rec> e
               <rec>F</rec> g
           </root>

       and My::Filter::Uc receives three sets of events:

           start_document
           start_element: <rec>
           characters:    'b'
           end_element:   </rec>
           end_document

           start_document
           start_element: <rec>
           characters:    'd'
           end_element:   </rec>
           end_document

           start_document
           start_element: <rec>
           characters:    'f'
           end_element:   </rec>
           end_document
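
       For convenience, the pieces of this example can be combined into a single script.  This is a sketch
       under a few assumptions: XML::SAX::Machines and a SAX parser (for instance the XML::SAX::PurePerl
       parser that ships with XML::SAX) are installed, the filter lives in the same file as the script, and
       the event hash is copied rather than modified in place.  The %INC line is a precaution for the
       single-file layout and is not part of the ByRecord API.

```perl
use strict;
use warnings;

package My::Filter::Uc;
use parent 'XML::SAX::Base';

# Mark the inline package as loaded, in case the machine tries to
# require() the class by name (a precaution, not required by the API).
BEGIN { $INC{'My/Filter/Uc.pm'} = __FILE__ }

sub characters {
    my ( $self, $data ) = @_;
    # Pass on a copy of the event with the text uppercased.
    $self->SUPER::characters( { %$data, Data => uc $data->{Data} } );
}

package main;
use XML::SAX::Machines qw( Pipeline ByRecord );

my $out = '';
my $m = Pipeline(
    ByRecord( "My::Filter::Uc" ),
    \$out,    # a scalar reference collects the serialized output
);

$m->parse_string( <<'XML' );
<root> a
    <rec>b</rec> c
    <rec>d</rec> e
    <rec>f</rec> g
</root>
XML

print $out, "\n";
```

       If everything is installed, the text inside each <rec> element should come out uppercased while the
       text between the records is unchanged.  The exact serialization (declaration, whitespace) depends on
       the writer in use.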

Methods

       new
               my $d = XML::SAX::ByRecord->new( @channels, \%options );

           Longhand for calling the ByRecord function exported by XML::SAX::Machines.

Name

       XML::SAX::ByRecord - Record oriented processing of (data) documents

Synopsis

           use XML::SAX::Machines qw( ByRecord ) ;

           my $m = ByRecord(
               "My::RecordFilter1",
               "My::RecordFilter2",
               ...
               {
                   Handler => $h, ## optional
               }
           );

           $m->parse_uri( "foo.xml" );

Version

       version 0.46

Writing An Aggregator

       To be written.  In short, an aggregator needs to provide "start_manifold_processing" and
       "end_manifold_processing".  See XML::Filter::Merger and its source code for a starting point.