HTML::LinkExtractor - Extract links from an HTML document

Author

       D.H (PodMaster)

       Please use http://rt.cpan.org/ to report bugs.

       Just go to http://rt.cpan.org/NoAuth/Bugs.html?Dist=HTML-Scrubber to see a bug list and/or report new
       ones.

Description

       HTML::LinkExtractor is used for extracting links from HTML.  It is very similar to HTML::LinkExtor,
       except that besides getting the URL, you also get the link-text.

       Example ( please run the examples ):

           use HTML::LinkExtractor;
           use Data::Dumper;

           my $input = q{If <a href="http://perl.com/"> I am a LINK!!! </a>};
           my $LX = new HTML::LinkExtractor();

           $LX->parse(\$input);

           print Dumper($LX->links);
           __END__
           # the above example will yield
           $VAR1 = [
                     {
                       '_TEXT' => '<a href="http://perl.com/"> I am a LINK!!! </a>',
                       'href' => bless(do{\(my $o = 'http://perl.com/')}, 'URI::http'),
                       'tag' => 'a'
                     }
                   ];

       "HTML::LinkExtractor" will also correctly extract nested link-type tags.

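       As a rough sketch of what nested extraction might look like, the snippet below parses an anchor that
       wraps an image and lists the tag name of every link found.  The sample HTML and the example.com URLs
       are made up for illustration, and the order of the results is not something this page promises.

```perl
use strict;
use warnings;
use HTML::LinkExtractor;

# Hypothetical input: an <img> nested inside an <a>.
my $nested = q{<a href="http://example.com/"><img src="http://example.com/logo.png"> Home </a>};

my $LX = HTML::LinkExtractor->new();
$LX->parse(\$nested);

# Each entry is a hashref carrying a 'tag' key plus the tag's URL attributes.
my @tags = map { $_->{tag} } @{ $LX->links };
print "found: @tags\n";
```
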
License

       Copyright (c) 2003, 2004 by D.H. (PodMaster).  All rights reserved.

       This module is free software; you can redistribute it and/or modify it  under  the  same  terms  as  Perl
       itself.  The LICENSE file contains the full text of the license.

perl v5.36.0                                       2022-10-16                                 LinkExtractor(3pm)

Methods

"$LX->new([\&callback,[$baseUrl,[1]]])"
       Accepts 3 arguments, all of which are optional.  If for example you want to pass a $baseUrl, but don't
       want to have a callback invoked, just put "undef" in place of a subref.

       This is the only class method.

       1.  a callback (a sub reference, as in "sub{}" or "\&sub"), which is called each time a new link is
           encountered (for the tags in @HTML::LinkExtractor::TAGS_IN_NEED, this means after the closing
           tag is encountered).

           The callback receives an object reference ($LX) and a link hashref.

       2.  a base URL (passed to URI->new, so it's up to you to make sure it's valid), which is used to
           convert all relative URIs to absolute ones.

               $ALinkP{href} = URI->new_abs( $ALink{href}, $base );

       3.  A  "boolean"  (just stick with 1).  See the example in "DESCRIPTION".  Normally, you'd get back _TEXT
           that looks like

               '_TEXT' => '<a href="http://perl.com/"> I am a LINK!!! </a>',

           If you turn this option on, you'll get the following instead

               '_TEXT' => ' I am a LINK!!! ',

           The private utility function "_stripHTML" does this by using HTML::TokeParser's method
           get_trimmed_text.

           You can turn this feature on and off by using "$LX->strip(undef || 0 || 1)"
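
       To make the argument order concrete, here is a minimal sketch that skips the callback with "undef"
       and passes only a base URL.  The page fragment and the example.com base are assumptions invented for
       this illustration; the result relies on the URI->new_abs behaviour described in item 2.

```perl
use strict;
use warnings;
use HTML::LinkExtractor;

# Hypothetical page fragment with a relative link.
my $page = q{<a href="/about.html"> About </a>};

# undef in the callback slot: links are collected and read back via links().
# The base URL below is an assumption made up for this example.
my $LX = HTML::LinkExtractor->new( undef, 'http://example.com/' );
$LX->parse(\$page);

# href is a URI object; interpolating it yields the absolute URL string.
my ($link) = @{ $LX->links };
my $href = "$link->{href}";
print "$href\n";
```

       Stringifying the URI object is only one option; the object's own methods (host, path and so on) stay
       available if you keep it as is.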

   "$LX->parse($filename||*FILEHANDLE||\$FileContent)"
       Each time you call "parse", you should pass it a $filename, a *FILEHANDLE, or a "\$FileContent".

       Each time you call "parse" a new "HTML::TokeParser" object is created and stored in "$this->{_tp}".

       You shouldn't need to mess with the TokeParser object.
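
       The three input forms could be exercised roughly like this.  The temporary file, the lexical
       filehandle (used here in place of a bareword *FILEHANDLE), and the "count_links" helper are all
       assumptions of this sketch, not part of the module.

```perl
use strict;
use warnings;
use File::Temp qw( tempfile );
use HTML::LinkExtractor;

my $html = q{<a href="http://perl.com/"> Perl </a>};

# Write the sample markup to a temporary file so it can be parsed by name.
my ( $out, $filename ) = tempfile( UNLINK => 1 );
print {$out} $html;
close $out or die "close failed: $!";

# Hypothetical helper: parse one source with a fresh extractor, return the link count.
sub count_links {
    my ($source) = @_;
    my $LX = HTML::LinkExtractor->new();
    $LX->parse($source);
    return scalar @{ $LX->links };
}

open my $fh, '<', $filename or die "cannot open $filename: $!";

my $from_file   = count_links($filename);    # a filename
my $from_handle = count_links($fh);          # an open filehandle
my $from_string = count_links( \$html );     # a reference to the content

close $fh;
print "$from_file $from_handle $from_string\n";
```
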

   "$LX->links()"
       Only  after  you  call  "parse"  will this method return anything.  This method returns a reference to an
       ArrayOfHashes, which basically looks like (Data::Dumper output)

           $VAR1 = [ { tag => 'img', src => 'image.png' }, ];

       Please note that if you provide a callback this array will be empty.
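
       A small sketch of the callback case, assuming you want to keep the links yourself: the callback
       pushes each link hashref onto an array of its own (the @seen name is made up here), and "links" is
       then expected to hold nothing.

```perl
use strict;
use warnings;
use HTML::LinkExtractor;

my $html = q{<a href="http://perl.com/"> Perl </a>};

# Collect links ourselves; the callback gets the extractor and a link hashref.
my @seen;
my $LX = HTML::LinkExtractor->new( sub {
    my ( $extractor, $link ) = @_;
    push @seen, $link;
} );
$LX->parse(\$html);

# With a callback installed, links() is documented to come back empty.
my $stored = scalar @{ $LX->links || [] };
printf "callback saw %d link(s), links() holds %d\n", scalar @seen, $stored;
```
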

   "$LX->strip([0||1])"
       If you pass in "undef" (or nothing), returns the state of the option.  Passing in a true or  false  value
       sets the option.

       If you wanna know what the option does see "$LX->new([\&callback, [$baseUrl, [1]]])"
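
       Toggling the option between two parses might look like the sketch below.  It reuses the input from
       the description and reads the last entry of "links" after each pass; the exact whitespace of the
       stripped text is not asserted here.

```perl
use strict;
use warnings;
use HTML::LinkExtractor;

my $html = q{<a href="http://perl.com/"> I am a LINK!!! </a>};
my $LX   = HTML::LinkExtractor->new();

# First pass with the option off: _TEXT should keep the surrounding markup.
$LX->strip(0);
$LX->parse(\$html);
my $raw = $LX->links->[-1]{_TEXT};

# Second pass with the option on: _TEXT should be the link text only.
$LX->strip(1);
$LX->parse(\$html);
my $plain = $LX->links->[-1]{_TEXT};

print "raw:   $raw\n";
print "plain: $plain\n";
```
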

Name

       HTML::LinkExtractor - Extract links from an HTML document

Synopsis

           ## the demo
           perl LinkExtractor.pm
           perl LinkExtractor.pm file.html otherfile.html

           ## or if the module is installed, but you don't know where

           perl -MHTML::LinkExtractor -e" system $^X, $INC{q{HTML/LinkExtractor.pm}} "
           perl -MHTML::LinkExtractor -e' system $^X, $INC{q{HTML/LinkExtractor.pm}} '

           ## or

           use HTML::LinkExtractor;
           use LWP::Simple qw( get );    # get() is exported by LWP::Simple, not by LWP

           my $base = 'http://search.cpan.org';
           my $html = get($base.'/recent');
           my $LX = new HTML::LinkExtractor();

           $LX->parse(\$html);

           print qq{<base href="$base">\n};

           for my $Link( @{ $LX->links } ) {
           ## new modules are linked  by /author/NAME/Dist
               if( $$Link{href}=~ m{^\/author\/\w+} ) {
                   print $$Link{_TEXT}."\n";
               }
           }

           undef $LX;
           __END__

           ## or

           use HTML::LinkExtractor;
           use Data::Dumper;

           my $input = q{If <a href="http://perl.com/"> I am a LINK!!! </a>};
           my $LX = new HTML::LinkExtractor(
               sub {
                   print Data::Dumper::Dumper(@_);
               },
               'http://perlFox.org/',
           );

           $LX->parse(\$input);
           $LX->strip(1);
           $LX->parse(\$input);
           __END__

           #### Calculate the total size of a web page:
           #### adds up the sizes of all the images, stylesheets, and so on

           use strict;
           use LWP::Simple;    # exports get() and head()
           use HTML::LinkExtractor;

           my $url  = shift || 'http://www.google.com';
           my $html = get($url);
           die "could not fetch $url\n" unless defined $html;
           my $Total = length $html;

           print "initial size $Total\n";

           my $LX = new HTML::LinkExtractor(
               sub {
                   my( $X, $tag ) = @_;

                   # skip the tags that carry link text (anchors and the like)
                   unless( grep {$_ eq $tag->{tag} } @HTML::LinkExtractor::TAGS_IN_NEED ) {
                       print "$$tag{tag}\n";

                       for my $urlAttr ( @{$HTML::LinkExtractor::TAGS{$$tag{tag}}} ) {
                           if( exists $$tag{$urlAttr} ) {
                               # in list context head() returns the document length second
                               my $size = (head( $$tag{$urlAttr} ))[1];
                               $Total += $size if $size;
                               print "adding $size\n" if $size;
                           }
                       }
                   }
               },
               $url,
               0
           );

           $LX->parse(\$html);

           print "The total size of \n$url\n is $Total bytes\n";
           __END__

What's A Link-Type Tag

       Take a look at %HTML::LinkExtractor::TAGS to see what I consider to be a link-type tag.

       Take  a  look  at @HTML::LinkExtractor::VALID_URL_ATTRIBUTES to see all the possible tag attributes which
       can contain URI's (the links!!)

       Take a look at @HTML::LinkExtractor::TAGS_IN_NEED to see the tags for  which  the  '_TEXT'  attribute  is
       provided, like "<a href="#"> TEST </a>"
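
       One way to take that look is to print the variables directly.  This is only a sketch: it assumes the
       values in %TAGS are array references of attribute names (as the synopsis treats them) and guards
       against entries that have no attribute list.

```perl
use strict;
use warnings;
use HTML::LinkExtractor;

# List each link-type tag together with the attributes checked for URLs.
for my $tag ( sort keys %HTML::LinkExtractor::TAGS ) {
    my $attrs = $HTML::LinkExtractor::TAGS{$tag};
    # Guard against entries with no attribute list (the docs mention 'meta').
    my @attrs = ref $attrs eq 'ARRAY' ? @$attrs : ();
    printf "%-10s %s\n", $tag, join ', ', @attrs;
}

# Tags whose link entries also carry a '_TEXT' key.
my @text_tags = @HTML::LinkExtractor::TAGS_IN_NEED;
print "tags with _TEXT: @text_tags\n";
```
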

   How can that be?!?!
       I took a look at %HTML::Tagset::linkElements and the following URLs

           http://www.blooberry.com/indexdot/html/tagindex/all.htm
           http://www.blooberry.com/indexdot/html/tagpages/a/a-hyperlink.htm
           http://www.blooberry.com/indexdot/html/tagpages/a/applet.htm
           http://www.blooberry.com/indexdot/html/tagpages/a/area.htm
           http://www.blooberry.com/indexdot/html/tagpages/b/base.htm
           http://www.blooberry.com/indexdot/html/tagpages/b/bgsound.htm
           http://www.blooberry.com/indexdot/html/tagpages/d/del.htm
           http://www.blooberry.com/indexdot/html/tagpages/d/div.htm
           http://www.blooberry.com/indexdot/html/tagpages/e/embed.htm
           http://www.blooberry.com/indexdot/html/tagpages/f/frame.htm
           http://www.blooberry.com/indexdot/html/tagpages/i/ins.htm
           http://www.blooberry.com/indexdot/html/tagpages/i/image.htm
           http://www.blooberry.com/indexdot/html/tagpages/i/iframe.htm
           http://www.blooberry.com/indexdot/html/tagpages/i/ilayer.htm
           http://www.blooberry.com/indexdot/html/tagpages/i/inputimage.htm
           http://www.blooberry.com/indexdot/html/tagpages/l/layer.htm
           http://www.blooberry.com/indexdot/html/tagpages/l/link.htm
           http://www.blooberry.com/indexdot/html/tagpages/o/object.htm
           http://www.blooberry.com/indexdot/html/tagpages/q/q.htm
           http://www.blooberry.com/indexdot/html/tagpages/s/script.htm
           http://www.blooberry.com/indexdot/html/tagpages/s/sound.htm

           And the special cases

           <!DOCTYPE HTML SYSTEM "http://www.w3.org/DTD/HTML4-strict.dtd">
           http://www.blooberry.com/indexdot/html/tagpages/d/doctype.htm
           '!doctype'  is really a process instruction, but is still listed
           in %TAGS with 'url' as the attribute

           and

           <meta HTTP-EQUIV="Refresh" CONTENT="5; URL=http://www.foo.com/foo.html">
           http://www.blooberry.com/indexdot/html/tagpages/m/meta.htm
           If there is a valid url, 'url' is set as the attribute.
           The meta tag has no 'attributes' listed in %TAGS.