Softpanorama

May the source be with you, but remember the KISS principle ;-)
(slightly skeptical) Educational society promoting "Back to basics" movement against IT overcomplexity and  bastardization of classic Unix

Perl html parsing


HTML pro-parsing tips

Thursday, 10 July by David Farrell

Perl has some fantastic modules for parsing HTML and one of the best is XML::LibXML. It's an interface to the libxml2 C library; super fast but also super-picky. I've often found XML::LibXML croaking on relatively simple - but incorrectly formed HTML. If you find this, do not give up! This article shares 3 simple techniques for overcoming malformed HTML when parsing with XML::LibXML.

Tip 1: turn on recovery mode

If XML::LibXML is croaking on a later part of the HTML, try turning on recovery mode, which will return all of the correctly parsed HTML up until XML::LibXML encountered the error.

use XML::LibXML;

my $xml = XML::LibXML->new( recover => 1 );
my $dom = $xml->load_html( string => $html );

With recovery mode set to 1, the parser will still warn about parsing errors. To suppress the warnings, set recover to 2.
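Editor's note: a minimal sketch of silent recovery, assuming XML::LibXML is installed from CPAN. The sample markup, the paragraph_texts helper and the variable names are made up for illustration. The module is loaded at run time so that a missing installation produces a readable warning rather than a compile error.

```perl
use strict;
use warnings;

# Hypothetical sample input: the <p> and <b> elements are never closed.
my $html = '<html><body><p>first<p>second <b>bold</body></html>';

# Return the text of every <p> element, or an empty list when the
# module is not available.
sub paragraph_texts {
    my ($markup) = @_;
    unless ( eval { require XML::LibXML; 1 } ) {
        warn "XML::LibXML is not installed\n";
        return;
    }

    # recover => 2 asks the parser to carry on after errors without warning.
    my $parser = XML::LibXML->new( recover => 2 );
    my $dom    = $parser->load_html( string => $markup );
    return map { $_->textContent } $dom->findnodes('//p');
}

my @texts = paragraph_texts($html);
print scalar(@texts), " paragraph(s) found\n";
print "  $_\n" for @texts;
```

Recovery returns whatever the parser managed to build, so it is worth checking that the nodes you need are actually present in the result before relying on them.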

Tip 2: sanitize the input first with HTML::Scrubber

Sometimes recovery mode alone is not enough - XML::LibXML will croak at the first whiff of HTML if there are two doctype declarations for example. In these situations, consider sanitizing the HTML with HTML::Scrubber.

HTML::Scrubber provides both whitelist and blacklist functions to include or exclude HTML tags and attributes. It's a powerful combination which allows you to create a custom filter to scrub the HTML that you want to parse.

By default HTML::Scrubber removes all tags, but in the case of a duplicate doctype declaration, you just need that one tag removed. Let's remove all div tags too for good measure:

use HTML::Scrubber;
use XML::LibXML;

my $scrubber = HTML::Scrubber->new( deny  => [ 'doctype', 'div' ],
                                    allow => '*' );
my $scrubbed_html = $scrubber->scrub($html);
my $dom = XML::LibXML->load_html( string => $scrubbed_html );

The "deny" rule sets the scrubber blacklist (what to exclude) and the "allow" rule specifies the whitelist (what to include). Here we passed an asterisk ("*") to allow, which means allow everything, but because we're denying div and doctype tags, they'll be removed.

Tip 3: extract a subset of data with a regex capture

If the subset HTML you want to parse has a unique identifier (such as an id attribute), consider using a regex capture to extract it from the HTML document. You can then scrub or immediately parse this subset with XML::LibXML.

For example recently I had to extract an HTML table from a badly-formed web page. Fortunately the table had an id attribute, which made extracting it with a regex a piece-of-cake:

if ( $html =~ /(<table id="t2">.*?<\/table>)/s ) {
    my $dom = XML::LibXML->load_html( string => $1 );
    ...
}

Note the use of the "s" modifier in the regex: it lets "." match newline characters, so the capture can span multiple lines. Many HTML pages contain newlines and you don't want your match to fail because of that.
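Editor's note: the effect of the "s" modifier can be checked with a short script that uses only core Perl. The page fragment below is a made-up example, not the page from the article.

```perl
use strict;
use warnings;

# Hypothetical page fragment whose table spans several lines.
my $html = <<'HTML';
<div>
<table id="t2">
<tr><td>1</td></tr>
</table>
</div>
HTML

# Without /s, "." does not match a newline, so the pattern cannot
# reach the closing tag and the match fails.
my $without_s = $html =~ /(<table id="t2">.*?<\/table>)/ ? 1 : 0;

# With /s, "." also matches newlines and the whole table is captured.
my ($table) = $html =~ /(<table id="t2">.*?<\/table>)/s;

print "without /s: ", ( $without_s     ? "matched" : "no match" ), "\n";
print "with /s:    ", ( defined $table ? "matched" : "no match" ), "\n";
```

Because the quantifier is non-greedy, the capture ends at the first closing table tag, so a table nested inside the target table would cut the capture short.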

Conclusion

Hopefully these tips will make parsing HTML with XML::LibXML easier. My GitHub account has a web scraper script that uses some of these tips. If you're looking for an entirely different approach to parsing HTML, check out XML::Rabbit and HTML::TreeBuilder.



[Aug 16, 2020] How to trim a line from leading and trailing blanks without using regex or non-standard modules

Aug 16, 2020 | perlmonks.org

on Aug 14, 2020 at 02:24 UTC ( # 11120704 ) likbez has asked for the wisdom of the Perl Monks concerning the following question:

Is there any way to trim both leading and trailing blanks in a text line (one of the most common operations in text processing; often implemented as trim function which BTW was present in Perl 6) without resorting to regular expressions (which are definitely an overkill for this particular purpose)? This is clearly an important special case.

So far the most common solution is to use something like $line =~ s/^\s+|\s+$//g which clearly is an abuse of regex.

See, for example, https://perlmaven.com/trim

Or install String::Util, which is not a standard module and as such creates difficulties in an enterprise environment.


hippo on Aug 14, 2020 at 06:46 UTC

Re: How to trim a line from leading and trailing blanks without using regex or non-standard modules
without resorting to regular expressions (which are definitely an overkill for this particular purpose)?

Sure, just write your own function to do it. Having written that you will then come to the conclusion that regular expressions are definitely not an overkill for this particular purpose.

This is clearly an important special case. ... which clearly is an abuse of regex.

You keep using that word. I don't think it means what you think it means.

🦛

LanX on Aug 14, 2020 at 03:28 UTC

Re: How to trim a line from leading and trailing blanks without using regex or non-standard modules

> which clearly is an abuse of regex.

Why is it an abuse of regex?

Problem is that \s is a meta character for any white-space not only blank " " , but only usable inside regex.°

So if you want the exact same semantic, it'll become far more complicated than this regex.

But better define your own trim() using a regex inside.

Cheers Rolf
(addicted to the Perl Programming Language :)
Wikisyntax for the Monastery

°) compare Re^3: How to trim a line from leading and trailing blanks without using regex or non-standard modules

you !!! on Aug 14, 2020 at 19:39 UTC

Re^2: How to trim a line from leading and trailing blanks without using regex or non-standard modules



So if you want the exact same semantic, it'll become far more complicated than this regex.

I agree. That's a good point. Thank you !

In other words, it is not easy to design a good trim function without regex, but it is possible to design one that uses regex while treating a single-quoted string as a special case.

For example

trim(' ',$line)
vs
trim(/\s/.$line)
BTW this is impossible in Python which implements regex via library, unless you add a new lexical type to the Language (regex string instead of raw string that is used).

LanX on Aug 15, 2020 at 01:04 UTC

Re^3: How to trim a line from leading and trailing blanks without using regex or non-standard modules
> trim(/\s/.$line)

I doubt this is valid syntax.

you probably mean

trim( qr/\s/, $line)

see Re^3: How to trim a line from leading and trailing blanks without using regex or non-standard modules for a slightly better implementation

> this is impossible in Python

passing regex inside a string is fine in Perl, why not in Python?

Cheers Rolf
(addicted to the Perl Programming Language :)
Wikisyntax for the Monastery
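Editor's note: one way to read the trim(' ', $line) versus trim(qr/\s/, $line) idea is a function that accepts either a plain string or a compiled regex. The sketch below is an assumption about how that could look and is not code from the thread. It puts the text first and the pattern second, the same order as the trim that appears later in the thread, and it takes a plain string literally by passing it through quotemeta.

```perl
use strict;
use warnings;

# Trim a pattern from both ends of a string and return the result.
# $what may be a compiled regex (qr//) or a plain string; a plain
# string is matched literally. With no second argument, \s is used.
sub trim {
    my ( $text, $what ) = @_;
    $what = qr/\s/ unless defined $what;
    $what = quotemeta($what) unless ref($what) eq 'Regexp';
    $text =~ s/^(?:$what)+|(?:$what)+$//g;
    return $text;
}

my $line = " \n . aaa . \n ";

my $all_whitespace = trim($line);          # newlines are removed as well
my $blanks_only    = trim( $line, ' ' );   # only the outer spaces are removed

print "[$all_whitespace]\n";
print "[$blanks_only]\n";
```

Returning a copy instead of modifying the argument in place keeps the caller's string intact, which is a design choice of this sketch rather than something the thread prescribes.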

kcott on Aug 14, 2020 at 09:35 UTC

Re: How to trim a line from leading and trailing blanks without using regex or non-standard modules

G'day likbez ,

I will usually reach for one of Perl's string handling functions (e.g. index , rindex , substr , and so on) in preference to a regex when that is appropriate; however, in this case, I would say that the regex makes for much cleaner code.

You could implement a trim() function using the guts of this code (which uses neither a regex nor any modules, standard or otherwise):

$ perl -E '
    my @x = (" a b c ", "d e f ", " g h i", "j k l", " ", "");
    say "*** Initial strings ***"; say "|$_|" for @x;

    for my $i (0 .. $#x) {
        my $str = $x[$i];
        while (0 == index $str, " ") {
            $str = substr $str, 1;
        }
        my $str_end = length($str) - 1;
        while ($str_end == rindex $str, " ") {
            $str = substr $str, 0, $str_end;
            --$str_end;
        }
        $x[$i] = $str;
    }

    say "*** Final strings ***"; say "|$_|" for @x;
'
*** Initial strings ***
| a b c |
|d e f |
| g h i|
|j k l|
| |
||
*** Final strings ***
|a b c|
|d e f|
|g h i|
|j k l|
||
||

If your question was genuinely serious, please Benchmark a trim() function using something like I've provided against another trim() function using a regex. You could obviously do the same for ltrim() and rtrim() functions.

[As others have either asked or alluded to, please explain phrases such as "definitely an overkill", "important special case" and "abuse of regex". Unfortunately, use of such language makes your post come across as some sort of trollish rant -- I'm not saying that was your intent, just how it presents itself.]

-- Ken
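Editor's note: the suggested comparison can be set up with the core Benchmark module. The sketch below is only an illustration: the non-regex function uses substr with two indices rather than the index/rindex loop from the post, it strips the space character only, and the function names and the sample string are made up. Timings depend on the machine and the Perl version, so no results are shown.

```perl
use strict;
use warnings;
use Benchmark qw(cmpthese);

# Regex version: strips any whitespace matched by \s.
sub trim_regex {
    my ($text) = @_;
    $text =~ s/^\s+|\s+$//g;
    return $text;
}

# Non-regex version: strips the space character only, using substr.
sub trim_substr {
    my ($text) = @_;
    my $start = 0;
    my $end   = length($text);
    $start++ while $start < $end && substr( $text, $start, 1 ) eq ' ';
    $end--   while $end > $start && substr( $text, $end - 1, 1 ) eq ' ';
    return substr( $text, $start, $end - $start );
}

my $sample = '   some text with padding   ';

# Run each function for at least one CPU second and print a comparison table.
cmpthese( -1, {
    regex  => sub { trim_regex($sample) },
    substr => sub { trim_substr($sample) },
} );
```

The two functions agree only on strings padded with plain spaces, so the input should stay within that case for the comparison to be fair.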

LanX on Aug 14, 2020 at 11:22 UTC

Re^2: How to trim a line from leading and trailing blanks without using regex or non-standard modules



I suppose your solution works only for blank " " and not for other whitespace characters like "\n"

So it's not exactly the same like with \s °

DB<11> $a="x \n \n \n "
DB<12> $a =~ s/\s+$//
DB<13> x $a
0  'x'
DB<14>

The OP should be clearer about the semantics he wants.

Cheers Rolf
(addicted to the Perl Programming Language :)
Wikisyntax for the Monastery

see also Re: How to trim a line from leading and trailing blanks without using regex or non-standard modules

kcott on Aug 15, 2020 at 11:02 UTC

Re^3: How to trim a line from leading and trailing blanks without using regex or non-standard modules

G'day Rolf ,

That's a valid point. My main intent with that code was really to show the complexity of the solution when a regex or module were not used. Anyway, adding a little more complexity, you can trim whatever blanks you want:

$ perl -E '
    my @blanks = (" ", "\n", "\r", "\t");
    my @x = (
        " a b c ", "d e f \r ", " \t g h i",
        "j k l", " ", "\n", "\n\nXYZ\n\n", ""
    );
    say "*** Initial strings ***"; say "|$_|" for @x;

    for my $i (0 .. $#x) {
        my $str = $x[$i];
        while (grep { 0 == index $str, $_ } @blanks) {
            $str = substr $str, 1;
        }
        my $str_end = length($str) - 1;
        while (grep { $str_end == rindex $str, $_ } @blanks) {
            $str = substr $str, 0, $str_end;
            --$str_end;
        }
        $x[$i] = $str;
    }

    say "*** Final strings ***"; say "|$_|" for @x;
'
*** Initial strings ***
| a b c |
| e f
| g h i|
|j k l|
| |
|
|
|

XYZ

|
||
*** Final strings ***
|a b c|
|d e f|
|g h i|
|j k l|
||
||
|XYZ|
||

You're quite correct about "The OP should be clearer ..." . The word 'blank' is often used to mean various things: a single space, multiple consecutive spaces, a whitespace character, multiple consecutive whitespace characters, and I have also seen it used to refer to a zero-length string. Similarly, the word 'space' can mean a single space, any gap between visible characters, and so on. So, as with many posts, we're left with guessing the most likely meaning from the context.

My belief, that a regex is a better option, strengthens as the complexity of the non-regex and non-module code increases. :-)

-- Ken

jwkrahn on Aug 14, 2020 at 03:58 UTC

Re: How to trim a line from leading and trailing blanks without using regex or non-standard modules

(IMHO) the most common solution is:

s/^\s+//, s/\s+$// for $line;

Marshall on Aug 14, 2020 at 04:33 UTC

Re^2: How to trim a line from leading and trailing blanks without using regex or non-standard modules




s/^\s+|\s+$//g has been benchmarked. And I now think this is faster and "better" than 2 statements. There is one post at Re^3: script optmization that shows some benchmarks.

This is certainly not an "abuse" of regex. This is what regex is for! The Perl regex engine continually becomes better and usually faster between releases.

perlfan on Aug 14, 2020 at 12:23 UTC

Re: How to trim a line from leading and trailing blanks without using regex or non-standard modules

> $line =~ s/^\s+|\s+$//g which clearly is an abuse of regex.

Why do you say that?

> trim function which BTW was present in Perl 6

You say this like it's a good thing. I bet there is also one in PHP.

karlgoethebier on Aug 14, 2020 at 12:34 UTC

Re^2: How to trim a line from leading and trailing blanks without using regex or non-standard modules



You won

"The Crux of the Biscuit is the Apostrophe"

perl -MCrypt::CBC -E 'say Crypt::CBC->new(-key=>'kgb',-cipher=>"Blowfish")->decrypt_hex($ENV{KARL});' Help

LanX on Aug 14, 2020 at 14:43 UTC

Re^3: How to trim a line from leading and trailing blanks without using regex or non-standard modules
DB<33> sub trim { $_[1] //= qr/\s/; $_[0] =~ s/^[$_[1]]+|[$_[1]]+$//g }
DB<34> $a = $b = " \n . aaa . \n "
DB<35> trim $a
DB<36> trim $b, " "
DB<37> x $a,$b
0  '. aaa .'
1  ' . aaa . '
DB<38>

Cheers Rolf
(addicted to the Perl Programming Language :)
Wikisyntax for the Monastery

[Aug 13, 2020] Perl: function to trim string leading and trailing whitespace

Jan 01, 2011 | stackoverflow.com


Is there a built-in function to trim leading and trailing whitespace such that trim(" hello world ") eq "hello world" ?

Landon Kuhn , 2011-01-04 20:10:52


FYI: string equality in Perl is tested by the operator eq . – A. Rex Jan 4 '11 at 20:16


Here's one approach using a regular expression:
$string =~ s/^\s+|\s+$//g ;     # remove both leading and trailing whitespace

Perl 6 will include a trim function:

$string .= trim;

Source: Wikipedia

Mark Byers , 2011-01-04 20:13:55


I look this up about once a month. Too bad I can't upvote it each time. – kyle Oct 29 '14 at 19:31

Ether , 2011-01-04 20:33:47

This is available in String::Util with the trim method:

Editor's note: String::Util is not a core module, but you can install it from CPAN with [sudo] cpan String::Util .

use String::Util 'trim';
my $str = "  hello  ";
$str = trim($str);
print "string is now: '$str'\n";

prints:

string is now: 'hello'

However it is easy enough to do yourself:

$str =~ s/^\s+//;
$str =~ s/\s+$//;


@mklement0 nor will it ever be. But this is not relevant, since everyone should be using modules from the CPAN. – Ether Jun 9 '15 at 21:12


@Ether With all due respect, I really appreciate knowing that this is a non-core module. This post is talking about using a module in lieu of a fairly simple regex one-liner. If the module is core, I would be much more open to it. It is relevant in this case. – UncleCarl Mar 1 '18 at 16:57


There's no built-in trim function, but you can easily implement your own using a simple substitution:
sub trim {
    (my $s = $_[0]) =~ s/^\s+|\s+$//g;
    return $s;
}

or using non-destructive substitution in Perl 5.14 and later:

sub trim {
   return $_[0] =~ s/^\s+|\s+$//rg;
}

Eugene Yarmash , answered Jan 4 '11 at 20:14

According to this perlmonk's thread :
$string =~ s/^\s+|\s+$//g;

brettkelly , 2011-01-04 20:13:55


Complete howto in the perlfaq here: http://learn.perl.org/faq/perlfaq4.html#How-do-I-strip-blank-space-from-the-beginning-end-of-a-string-

Nanne , 2011-01-04 20:15:16


For those that are using Text::CSV I found this thread and then noticed within the CSV module that you could strip it out via switch:
$csv = Text::CSV->new({allow_whitespace => 1});

The logic is backwards in that if you want to strip then you set to 1. Go figure. Hope this helps anyone.

Douglas , answered Dec 3 '14 at 16:44

One option is Text::Trim :
use Text::Trim;
print trim("  example  ");

Recommended Links

Perl module - Wikipedia, the free encyclopedia

Perl Module Mechanics by Steven McDougall 2007 March 02

This page describes the mechanics of creating, compiling, releasing and maintaining Perl modules.

This is not a reference manual. Rather, it is a running account of how to do these things. More to the point, it is an account of how I do these things. Accordingly

See also Perl Module Anatomy by Steven McDougall. Short and not very informative notes about the contents of the .pm file

The very very short tutorial about modules in Perl [ New: 06/16/99 ] by Mark-Jason Dominus: a really short tutorial about Perl modules.

The Seven Useful Uses of local

perltoot - Tom Christiansen's object-oriented tutorial for perl

IBM developerWorks: Parsing with Perl modules(Apr 30, 2000)

modules

Perl for Newbies - Part 3 - The Perl Beginners' Site

Perl Module Primer By Dan Ragle


Installing Your Own Perl Modules

Creating (and Maintaining) Perl Modules

Amazon.com- Perl Modules- Eric Foster-Johnson- Books

Perl Module Review- Class--Trait - O'Reilly ONLamp Blog

O'Reilly Perl Center

Writing Apache Modules with Perl and C - by Lincoln D Stein - 744 pages
Learning Perl Objects References and Modules - by Randal L Schwartz - 228 pages

Reference

Man Pages:
perlmod Perl modules (packages and symbol tables)
perlref Perl references and nested data structures
perlobj Perl objects
perltoot Tom's object-oriented tutorial for perl
perlbot Bag'o Object Tricks (advanced stuff)

Websites:

genome-www.stanford.edu/perlOOP A little collection I put together as I was learning Object-oriented Perl programming. See especially the examples page.
www.perl.com/CPAN/CPAN.html CPAN homepage
search.cpan.org Search CPAN for a module by author, category, or module name
bioperl.org The Bioperl project - modules for bioinformatics



Copyright © 1996-2021 by Softpanorama Society. www.softpanorama.org was initially created as a service to the (now defunct) UN Sustainable Development Networking Programme (SDNP) without any remuneration. This document is an industrial compilation designed and created exclusively for educational use and is distributed under the Softpanorama Content License. Original materials copyright belong to respective owners. Quotes are made for educational purposes only in compliance with the fair use doctrine.


Last updated: August 15, 2020