Can I use Perl regular expressions to match balanced text?
(contributed by brian d foy) Your first try should probably be the Text::Balanced module, which is in the Perl standard library since Perl 5.8. It has a variety of functions to deal with tricky text. The Regexp::Common module can also help by providing canned patterns you can use. As of Perl 5.10, you can match balanced text with regular expressions using recursive patterns. Before Perl 5.10, you had to resort to various tricks such as using Perl code in (??{}) sequences. Here’s an example using a recursive regular expression. The goal is to capture all of the text within angle brackets, including the text in nested angle brackets. This sample text has two “major” groups: a group with one level of nesting and a group with two levels of nesting.
Historically, Perl regular expressions were not capable of matching balanced text. As of more recent versions of perl including 5.6.1 experimental features have been added that make it possible to do this. Look at the documentation for the (??{ }) construct in recent perlre manual pages to see an example of matching balanced parentheses. Be sure to take special notice of the warnings present in the manual before making use of this feature. CPAN contains many modules that can be useful for matching text depending on the context. Damian Conway provides some useful patterns in Regexp::Common. The module Text::Balanced provides a general solution to this problem. One of the common applications of balanced text matching is working with XML and HTML. There are many modules available that support these needs. Two examples are HTML::Parser and XML::Parser. There are many others. An elaborate subroutine (for 7-bit ASCII only) to pull out balanced and possibly nested single chars, like ` and ‘,
Although Perl regular expressions are more powerful than “mathematical” regular expressions, because they feature conveniences like backreferences (\1 and its ilk), they still aren’t powerful enough — with the possible exception of bizarre and experimental features in the development-track releases of Perl. You still need to use non-regex techniques to parse balanced text, such as the text enclosed between matching parentheses or braces, for example. An elaborate subroutine (for 7-bit ASCII only) to pull out balanced and possibly nested single chars, like ` and ‘, { and }, or ( and ) can be found in http://www.perl.com/CPAN/authors/id/TOMC/scripts/pull_quotes.gz . The C::Scan module from CPAN contains such subs for internal usage, but they are undocumented.
Historically, Perl regular expressions were not capable of matching balanced text. As of more recent versions of perl including 5.6.1 experimental features have been added that make it possible to do this. Look at the documentation for the (??{ }) construct in recent perlre manual pages to see an example of matching balanced parentheses. Be sure to take special notice of the warnings present in the manual before making use of this feature. CPAN contains many modules that can be useful for matching text depending on the context. Damian Conway provides some useful patterns in Regexp::Common. The module Text::Balanced provides a general solution to this problem. One of the common applications of balanced text matching is working with XML and HTML . There are many modules available that support these needs. Two examples are HTML::Parser and XML::Parser. There are many others. An elaborate subroutine (for 7-bit ASCII only) to pull out balanced and possibly nested single chars, like “‘” and
Historically, Perl regular expressions were not capable of matching balanced text. As of more recent versions of perl including 5.6.1 experimental features have been added that make it possible to do this. Look at the documentation for the (??{ }) construct in recent perlre manual pages to see an example of matching balanced parentheses. Be sure to take special notice of the warnings present in the manual before making use of this feature. CPAN contains many modules that can be useful for matching text depending on the context. Damian Conway provides some useful patterns in Regexp::Common. The module Text::Balanced provides a general solution to this problem. One of the common applications of balanced text matching is working with XML and HTML. There are many modules available that support these needs. Two examples are HTML::Parser and XML::Parser. There are many others. An elaborate subroutine (for 7-bit ASCII only) to pull out balanced and possibly nested single chars, like CW”`” and