Regular Expressions with PHP and HTML
You can use regular expressions to locate specific tags in HTML. My last post gave some examples of this. Here I will explain how to execute the code, how to parse the HTML in different formats and how to compose regular expressions like the ones in the previous post. Finally, we will create our own regular expression which pulls out the src attribute from img tags in HTML code. This could be used to create a list of the images in a given webpage.
1 – Regular expressions in PHP
To search a block of text for a number of possible matches you need to use the function preg_match_all. This uses the following format:
(text adapted from PHP Documentation)
preg_match_all ( string $pattern, string $subject, array &$matches [, int $flags [, int $offset]] )
- pattern – The pattern to search for, as a string.
- subject – The input string – in our example the HTML code.
- matches – Just make up a variable name such as $matches and put it in here. It does not need to exist beforehand; it is passed by reference and filled with the results of the search
- flags - Can be a combination of the following flags (note that it doesn’t make sense to use PREG_PATTERN_ORDER together with PREG_SET_ORDER)
- PREG_PATTERN_ORDER - Orders results so that $matches[0] is an array of full pattern matches, $matches[1] is an array of strings matched by the first parenthesized subpattern, and so on. This will be useful to us for extracting the src: it will be returned in the $matches[1] array if it is the first subpattern.
- PREG_SET_ORDER - Orders results so that $matches[0] is an array of first set of matches, $matches[1] is an array of second set of matches, and so on.
- PREG_OFFSET_CAPTURE - If this flag is added, the offset (in bytes) of each match within the subject string is also included in the results
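To make the flags a bit more concrete, here is a minimal sketch that runs the same search three ways. The sample string and the variable names ($byPattern, $bySet, $withOffsets) are made up for illustration.

```php
<?php
// Made-up sample input; any HTML string would do.
$html = '<b>one</b> and <b>two</b>';
$pattern = '/<b>(.+?)<\/b>/';

// PREG_PATTERN_ORDER: results are grouped by subpattern.
// $byPattern[0] holds the full matches, $byPattern[1] the first subpattern.
$count = preg_match_all($pattern, $html, $byPattern, PREG_PATTERN_ORDER);
print_r($byPattern);

// PREG_SET_ORDER: results are grouped by match.
// $bySet[0] is the first match with its subpatterns, $bySet[1] the second.
preg_match_all($pattern, $html, $bySet, PREG_SET_ORDER);
print_r($bySet);

// PREG_OFFSET_CAPTURE: every result becomes a [text, offset] pair.
preg_match_all($pattern, $html, $withOffsets, PREG_OFFSET_CAPTURE);
print_r($withOffsets[1][0]);
```

The return value ($count here) is the number of full matches found.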
2 – Parsing HTML
There are two ways of searching the HTML: using the literal HTML code in the regex, e.g. ‘<IMG>’, or using the character codes (HTML entities), e.g. ‘&lt;IMG&gt;’. To use the latter, apply the function htmlspecialchars to the HTML first. In this example we will use the literal HTML code. Some people’s expressions use one method, some use the other.
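As a rough sketch of the difference between the two methods, the snippet below searches the same made-up tag once as raw HTML and once after htmlspecialchars has converted it. The input string and variable names are only for illustration.

```php
<?php
// Made-up input.
$html = '<IMG src="a.png">';

// Method 1: search the raw HTML with a literal tag in the pattern.
$rawHit = preg_match('/<img/i', $html);

// Method 2: convert the special characters to entities first,
// then search for the entity form of the tag.
$escaped = htmlspecialchars($html);
$escapedHit = preg_match('/&lt;img/i', $escaped);

// The literal pattern no longer finds anything in the converted text.
$literalOnEscaped = preg_match('/<img/i', $escaped);

var_dump($rawHit, $escaped, $escapedHit, $literalOnEscaped);
```

Whichever method you pick, the pattern and the subject have to be in the same form.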
3 – Composing regular expressions in PHP
(Advice adapted from Perl regular expressions for the common man)
PHP’s preg functions use Perl-compatible regular expressions (PCRE), so the syntax is almost the same as Perl’s and resources for Perl regex will be useful.
Start off by enclosing your expression in delimiters, usually ‘/’, e.g. ‘/img/’. It won’t work otherwise.
Matching the start of the subject
/^start/ will match ‘starter’ but not ‘restart’
Matching the end of the subject
/end$/ will match ‘worldsend’ but not ‘endless’
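A quick way to check the two anchors is to run them through preg_match, which returns 1 for a match and 0 for none. The test words ‘restart’ and ‘endless’ are my own additions for contrast.

```php
<?php
// ^ ties the pattern to the start of the subject.
$startHit  = preg_match('/^start/', 'starter');
$startMiss = preg_match('/^start/', 'restart');

// $ ties the pattern to the end of the subject.
$endHit  = preg_match('/end$/', 'worldsend');
$endMiss = preg_match('/end$/', 'endless');

var_dump($startHit, $startMiss, $endHit, $endMiss);
```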
Matching options
/sh[ou]t/ will match shot or shut
Matching any character
/sh.t/ will match shit shot shut shxt
Matching one or more characters
/sh.+t/ will match shot, shoooot or shuuut: at least one character of any kind, running up to the last t it can find
Matching zero or more characters
/sh.*t/ will match shot, sht, shuuut etc
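The four patterns above can be compared side by side with preg_grep, which returns the entries of an array that match a pattern. This is only a sketch: the word list and the labels are made up, and I have added ^ and $ to each pattern so that it has to cover the whole word.

```php
<?php
// Made-up word list to try each pattern against.
$words = ['shot', 'shut', 'shxt', 'sht', 'shoooot'];

// ^ and $ are an addition for this example so that each pattern
// must match the whole word rather than part of it.
$patterns = [
    'options'      => '/^sh[ou]t$/',
    'any one'      => '/^sh.t$/',
    'one or more'  => '/^sh.+t$/',
    'zero or more' => '/^sh.*t$/',
];

$results = [];
foreach ($patterns as $label => $pattern) {
    // preg_grep keeps the original keys, so reindex with array_values.
    $results[$label] = array_values(preg_grep($pattern, $words));
    echo $label, ': ', implode(', ', $results[$label]), "\n";
}
```

Note that ‘sht’ only shows up for the last pattern, because .* is the only one that allows nothing at all between the h and the t.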
Escaping characters
/a \+ b/ will match a + b
Make a search case insensitive
/sense/i will match sense, Sense or SENSE
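Here is a small sketch of escaping and the i modifier in action. It also uses preg_quote, a built-in PHP function not mentioned above, which escapes special characters for you; the variable names are made up.

```php
<?php
// An escaped + is a literal plus sign.
$literalPlus = preg_match('/a \+ b/', 'a + b');

// Unescaped, " +" means "one or more spaces", so 'a + b' does not match.
$unescaped = preg_match('/a + b/', 'a + b');

// preg_quote adds the backslashes; '/' is passed as the delimiter in use.
$quoted = preg_quote('a + b', '/');

// The i modifier ignores case.
$insensitive = preg_match('/sense/i', 'SENSE');
$sensitive   = preg_match('/sense/', 'SENSE');

var_dump($literalPlus, $unescaped, $quoted, $insensitive, $sensitive);
```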
Make the match stop at the first b
Imagine the subject is ‘a xxxx b ffff b’. The regex /a.+b/ is greedy and will match the entire string, whereas /a.+?b/ is lazy and will match only ‘a xxxx b’.
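The greedy and lazy behaviour is easy to see by running both forms against the same subject. In this sketch the patterns are written as /a.+b/ and /a.+?b/, and $greedy and $lazy are made-up variable names.

```php
<?php
$subject = 'a xxxx b ffff b';

// Greedy: .+ grabs as much as it can, so the match runs to the last b.
preg_match('/a.+b/', $subject, $greedy);

// Lazy: .+? grabs as little as it can, so the match stops at the first b.
preg_match('/a.+?b/', $subject, $lazy);

var_dump($greedy[0], $lazy[0]);
```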
Extract characters from the subject to appear in the $matches array
Use (), e.g. /text(.+)remove/ will take the word ‘pass’ out of ‘textpassremove’ and return it in $matches[1], along with the full match ‘textpassremove’ in $matches[0]
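In code, the capture example looks something like this (preg_match is used here because only one match is expected):

```php
<?php
// The parentheses mark the part of the match to pull out.
preg_match('/text(.+)remove/', 'textpassremove', $matches);

// $matches[0] is the full match, $matches[1] the first parenthesized part.
print_r($matches);
```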
Reuse extracted characters in the same regex
Use \1 for the first group, \2 for the second, and so on. Inside a single-quoted PHP string write them as they are; inside a double-quoted string escape the backslash, i.e. \\1 and \\2.
e.g. /<(.+?)>.+<\/\1>/ means take the text between the < and the very next > found, then find a </ followed by the same text found in the first (.+?), followed by a closing >. So this would match very simple HTML tags such as ‘<b>test</b>’ or ‘<i>text</i>’. It won’t work for more complex tags though.
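A short sketch of the backreference at work, using made-up sample HTML. It also shows that the single-quoted and double-quoted spellings of the pattern come out as the same string once the backslashes are doubled.

```php
<?php
// Made-up sample input with two simple tags.
$html = '<b>test</b> <i>text</i>';

// Single-quoted string: \/ and \1 reach the regex engine unchanged.
$pattern = '/<(.+?)>.+<\/\1>/';
preg_match_all($pattern, $html, $matches);
print_r($matches);

// Double-quoted string: each backslash has to be doubled.
$samePattern = "/<(.+?)>.+<\\/\\1>/";
var_dump($pattern === $samePattern);

// A closing tag that differs from the opening tag is not matched.
$mismatch = preg_match($pattern, '<b>broken</i>');
var_dump($mismatch);
```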
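Match HTML tags with attributes
/<(.+?).*?>.+<\/\1>/i will match a tag that carries attributes, such as ‘<a href="page.html">link</a>’
And to extract the src attribute from img tags, allowing for spaces between the < and the img
'/<\s*img[^>]*src="(.*?)"[^>]*>/i'
The quotes inside the pattern must be straight double quotes. Using \s* and [^>]* rather than .* keeps each match inside a single tag, so several images on one line are all found.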
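Putting it all together, here is a minimal sketch of how the list of images might be built. The function name list_image_sources and the sample markup are made up, and the pattern uses \s* and [^>]* so that each match stays inside one tag. It only handles src values in double quotes.

```php
<?php
/**
 * Hypothetical helper (not from any library): return the src value
 * of every img tag found in $html.
 */
function list_image_sources(string $html): array
{
    // \s* allows spaces after the <, [^>]* skips other attributes without
    // leaving the tag, and (.*?) captures the src value up to the next quote.
    $pattern = '/<\s*img[^>]*src="(.*?)"[^>]*>/i';
    preg_match_all($pattern, $html, $matches, PREG_PATTERN_ORDER);

    // With PREG_PATTERN_ORDER the first subpattern lands in $matches[1].
    return $matches[1];
}

// Made-up sample markup with two images on one line.
$html = '<p><img src="a.png" alt="A"> text < IMG class="x" src="b.jpg"></p>';
$sources = list_image_sources($html);
print_r($sources);
```

For a real page you would load the HTML into $html first, for example with file_get_contents.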


2 Comments
mck182
Hey,
I came across this site when searching for php regexp info and there’s one mistake, which made me mad, as I was just trying to find quickly the regexp pattern I needed, which I found here, so I copied it and tried to use it right away, but it didn’t work as it should, so I started to go through my code and after like 15 minutes I read the rest of the text and suddenly it was clear
So please fix the line /a+.?b/ will match ‘a xxxx b’ as it should be /a.+?b/… it will save many worth minutes for people like me
kirby.mark
Cheers, I’ve fixed that