Regular Expressions with PHP and HTML

You can use regular expressions to locate specific tags in HTML. My last post gave some examples of this. Here I will explain how to execute the code, how to parse the HTML in different formats and how to compose regular expressions like the ones in the previous post. Finally we will create our own regular expression which pulls out the src attribute from img tags in HTML code. This could be used to create a list of images in a given webpage.

1 – Regular expressions in PHP

To search a block of text for a number of possible matches you need to use the function preg_match_all. This uses the following format:
(text adapted from PHP Documentation)

preg_match_all ( string $pattern, string $subject, array &$matches [, int $flags [, int $offset]] )
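
As a minimal sketch of the call, assuming an invented subject string and a simple pattern (neither comes from the PHP documentation), the following finds every match and counts them:

```php
<?php
// Hypothetical subject text, chosen only for illustration.
$subject = 'cat, cot, cut and cart';

// preg_match_all returns the number of full matches (or false on error)
// and fills $matches by reference.
$count = preg_match_all('/c[aou]t/', $subject, $matches);

// With the default flag PREG_PATTERN_ORDER, $matches[0] holds every full match.
print_r($matches[0]); // cat, cot, cut ('cart' does not fit the pattern)
echo $count, "\n";    // 3
```

The optional $flags and $offset arguments are left at their defaults here.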

2 – Parsing HTML

There are two ways of searching the HTML: using the literal HTML code in the regex, e.g. ‘<IMG>’, or using the character entities, such as ‘&lt;IMG&gt;’. To use the latter, apply the function htmlspecialchars to the HTML first. In this example we will use the literal HTML code. Some people’s expressions use one method, some use the other.
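
The two approaches can be compared side by side. This is only a sketch; the HTML fragment and the variable names are made up for the illustration:

```php
<?php
// Hypothetical HTML fragment used only for illustration.
$html = '<img src="logo.png">';

// Approach 1: search the literal HTML.
$literalFound = preg_match('/<img/i', $html);

// Approach 2: convert the special characters to entities first,
// then search for the entity form of the same tag.
$escaped = htmlspecialchars($html);
$entityFound = preg_match('/&lt;img/i', $escaped);

echo $escaped, "\n"; // &lt;img src=&quot;logo.png&quot;&gt;
var_dump($literalFound, $entityFound); // int(1) int(1)
```

Whichever form you pick, the pattern and the subject have to agree: a literal pattern will not match entity-encoded text, and vice versa.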

3 – Composing regular expressions in PHP
(Advice adapted from Perl regular expressions for the common man)
PHP’s preg functions use Perl-compatible regular expressions (PCRE), so the syntax is very close to Perl’s and resources for Perl regex will be useful.

Start off by enclosing your expression in delimiters, usually ‘/’, e.g. ‘/img/’. Without delimiters the preg functions will reject the pattern.

Matching the start of the subject
/^start/ will match ‘starter’

Matching the end of the subject
/end$/ will match ‘worldsend’

Matching options
/sh[ou]t/ will match shot or shut

Matching any character
/sh.t/ will match shit shot shut shxt
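
A quick way to check the anchor, option and any-character examples above is to run them through preg_match. This is a sketch; the non-matching inputs ‘restart’ and ‘shat’ are added here for contrast and are not from the examples above:

```php
<?php
// preg_match returns 1 when the pattern matches and 0 when it does not.
$startsWith = preg_match('/^start/', 'starter');  // 'start' is at the beginning
$notAtStart = preg_match('/^start/', 'restart');  // 'start' is not at the beginning
$endsWith   = preg_match('/end$/', 'worldsend');  // 'end' is at the very end
$classHit   = preg_match('/sh[ou]t/', 'shut');    // u is one of the options
$classMiss  = preg_match('/sh[ou]t/', 'shat');    // a is not one of the options
$anyChar    = preg_match('/sh.t/', 'shxt');       // . accepts any single character

var_dump($startsWith, $notAtStart, $endsWith, $classHit, $classMiss, $anyChar);
// int(1) int(0) int(1) int(1) int(0) int(1)
```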

Matching one or more characters
/sh.+t/ will match shoooot and shuuut; the .+ takes as many characters as it can, up to the last t

Matching zero or more characters
/sh.*t/ will match shot, sht, shuuut etc.
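
The difference between + and * shows up on the input ‘sht’, where there is nothing between the sh and the t. A small sketch, with variable names invented for the illustration:

```php
<?php
$plusHit  = preg_match('/sh.+t/', 'shoooot'); // at least one character between sh and t
$plusMiss = preg_match('/sh.+t/', 'sht');     // + needs at least one character, so no match
$starHit  = preg_match('/sh.*t/', 'sht');     // * also accepts zero characters
$starLong = preg_match('/sh.*t/', 'shuuut');  // and any number of them

var_dump($plusHit, $plusMiss, $starHit, $starLong);
// int(1) int(0) int(1) int(1)
```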

Escaping characters
/a \+ b/ will match a + b

Make a search case insensitive
/sense/i will match sense, Sense or SENSE
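
Escaping and the i flag can be checked the same way. The sketch below also uses preg_quote, a standard PHP function that escapes special characters for you; the variable names are made up for the illustration:

```php
<?php
$escaped   = preg_match('/a \+ b/', 'a + b'); // \+ is a literal plus sign
$unescaped = preg_match('/a + b/', 'a + b');  // here + repeats the space instead, so no match

// preg_quote escapes regex special characters (and the delimiter passed as second argument).
$quoted = preg_quote('a + b', '/');

$caseHit  = preg_match('/sense/i', 'SENSE');  // i ignores case
$caseMiss = preg_match('/sense/', 'SENSE');   // without i the case must agree

echo $quoted, "\n"; // a \+ b
var_dump($escaped, $unescaped, $caseHit, $caseMiss);
// int(1) int(0) int(1) int(0)
```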

Make the match stop at the first b
Imagine the subject is ‘a xxxx b ffff b’. The regex /a.+b/ will match the entire string, because .+ is greedy; /a.+?b/ will match only ‘a xxxx b’, because the ? makes the quantifier stop as early as possible.
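
The greedy and lazy forms can be compared directly. This sketch runs /a.+b/ and /a.+?b/ against the same subject:

```php
<?php
$subject = 'a xxxx b ffff b';

// Greedy: .+ grabs as much as possible, so the match runs to the last b.
preg_match('/a.+b/', $subject, $greedy);

// Lazy: .+? grabs as little as possible, so the match stops at the first b.
preg_match('/a.+?b/', $subject, $lazy);

echo $greedy[0], "\n"; // a xxxx b ffff b
echo $lazy[0], "\n";   // a xxxx b
```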

Extract characters from the subject to appear in the $matches array
Use (), e.g. /text(.+)remove/ will take the word ‘pass’ out of textpassremove and return it along with the full match, textpassremove.
Reuse extracted characters in the same regex: use \1 for the first group, \2 for the second. In a single-quoted PHP string write them as they are; in a double-quoted string double the backslash (\\1, \\2).
e.g. /<(.+?)>.+<\/\1>/ means take the text between the < and the very next > found, then find a </ (the / is escaped as \/ because it is also the delimiter) followed by the same text found in the first (.+?), followed by a closing >. So this would match very simple HTML tags such as ‘<b>test</b>’ or ‘<i>text</i>’. It won’t work for more complex tags though.
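
To see what ends up in the $matches array, here is a short sketch of both ideas. The sample markup in $html is invented for the illustration:

```php
<?php
// Capture group: $m[0] is the whole match, $m[1] is the bracketed part.
preg_match('/text(.+)remove/', 'textpassremove', $m);
print_r($m); // [0] => textpassremove, [1] => pass

// Backreference: \1 must repeat whatever group 1 captured.
// A single-quoted PHP string passes the backslashes through unchanged.
$html = '<b>test</b> and <i>text</i>'; // hypothetical sample markup
preg_match_all('/<(.+?)>.+<\/\1>/', $html, $tags);

print_r($tags[0]); // <b>test</b>, <i>text</i>
print_r($tags[1]); // b, i
```

With preg_match_all the captured groups are collected per group: $tags[1] lists every value of the first bracket, in the order the matches were found.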

Match HTML tags with attributes
/<(.+?).*?>.+<\/\1>/i will match a tag that carries attributes, e.g. ‘<a href="page.html">link</a>’

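And to extract the src attribute from img tags, allowing for spaces between the < and the img:
‘/<\s*img[^>]*?\ssrc="(.*?)"[^>]*>/i’
The quotes inside the pattern must be straight double quotes. \s* allows optional whitespace after the <, and [^>]* keeps the match inside a single tag; a greedy .* in those positions would run on to the last img on the line and skip the earlier ones.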
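
Putting it together, a list of images could be built roughly like this. It is a sketch under a few assumptions: the page fragment is invented, the src values are assumed to be double-quoted, and the pattern is a stricter variant that uses \s* after the < and [^>]* so that a match cannot run past the end of its own tag:

```php
<?php
// Hypothetical page fragment; in practice $html might come from file_get_contents().
$html = '<p><img src="a.png" alt="A"> text < IMG class="x" src="b.jpg"></p>';

// \s*    optional whitespace between < and img
// [^>]*? any other attributes, without leaving the tag
// (.*?)  lazily captures the src value up to the closing quote
$pattern = '/<\s*img[^>]*?\ssrc="(.*?)"[^>]*>/i';

preg_match_all($pattern, $html, $matches);

// $matches[1] holds the captured src values, one per img tag.
print_r($matches[1]); // a.png, b.jpg
```

Regular expressions like this are fine for quick jobs on markup you control; for arbitrary pages, single-quoted or unquoted src values would need a different pattern.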

2 Comments

mck182

Hey,

I came across this site when searching for php regexp info and there’s one mistake, which made me mad, as I was just trying to find quickly the regexp pattern I needed, which I found here, so I copied it and tried to use it right away, but it didn’t work as it should, so I started to go through my code and after like 15 minutes I read the rest of the text and suddenly it was clear :)

So please fix the line /a+.?b/ will match ‘a xxxx b’ as it should be /a.+?b/… it will save many worth minutes for people like me ;)

kirby.mark

Cheers, I’ve fixed that
