When I first mashed together add-rel-lightbox, I used regular expressions to control and replace the relevant parts. This was my fault. Regexes were the only tool I had. It was a bad thing. Parsing HTML with regex is wrong.

[…]using regex to parse HTML has doomed humanity to an eternity of dread torture and security holes using regex as a tool to process HTML establishes a breach between this world and the dread realm of c͒ͪo͛ͫrrupt entities (like SGML entities, but more corrupt) a mere glimpse of the world of regex parsers for HTML will instantly transport a programmer's consciousness into a world of ceaseless screaming, he comes, the pestilent slithy regex-infection will devour your HTML parser, application and existence for all time like Visual Basic only worse[…]

The whole Stack Overflow answer makes a very sane and reasonable argument for not using regexes to parse HTML, so I’d recommend taking a look. However, if you’re not totally convinced, perhaps you’d prefer “Bring Me Your Regexes! I Will Create HTML To Break Them!”. That did convince me.
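To make the problem concrete, here is a small sketch (in Python, purely for illustration; the plugin itself is PHP) that compares a deliberately naive link regex with a real parser from the standard library. The pattern, the `HrefCollector` class and the sample markup are all made up for this example and are not taken from the plugin.

```python
import re
from html.parser import HTMLParser

# A naive pattern: assumes lowercase tags, double quotes, and href as the
# only attribute. Hypothetical, but typical of a first attempt.
NAIVE_LINK = re.compile(r'<a href="([^"]+)">')


class HrefCollector(HTMLParser):
    """Collects the href of every <a> start tag the parser actually sees."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        # Tag and attribute names arrive lowercased; values arrive unquoted.
        if tag == "a":
            href = dict(attrs).get("href")
            if href is not None:
                self.hrefs.append(href)


def hrefs_by_regex(html):
    return NAIVE_LINK.findall(html)


def hrefs_by_parser(html):
    collector = HrefCollector()
    collector.feed(html)
    collector.close()
    return collector.hrefs


# Four ways of writing the same link. Only the first suits the regex.
SAMPLES = [
    '<a href="photo.jpg"><img src="thumb.jpg"></a>',
    "<a class='big' href='photo.jpg'><img src='thumb.jpg'></a>",
    '<A\n  HREF="photo.jpg"\n  title="a > b"><img src="thumb.jpg"></A>',
    '<!-- <a href="old.jpg"> --><a href="photo.jpg">x</a>',
]

if __name__ == "__main__":
    for sample in SAMPLES:
        print(hrefs_by_regex(sample), hrefs_by_parser(sample))
```

The regex misses the second and third samples entirely and picks up the commented-out link in the fourth, while the parser reports the one real link each time. A cleverer pattern would fix these particular cases, but there is always another variation waiting.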

PHP does have a DOM extension, so it’s possible to use that to operate on XML elements, but it all looked too complex and involved for me to find a way in. S.C. Chen’s PHP Simple HTML DOM Parser, on the other hand, made the whole thing fairly straightforward.

The whole, commented, revised code is up at GitHub. I still have a few regular expression match calls, but they’re all now operating on the HTML attributes returned by the DOM parser, instead of trying to pull the details out of raw HTML.

After including the Simple HTML DOM library, the script:

  • Finds any link-wrapped images.

  • Checks that the link points to an image file and doesn’t already have a “lightbox” relation property applied. (As a side effect, any link with rel="nolightbox" will also be skipped.)

  • Adds the popup caption from the database if it’s a single hosted image, or part of a gallery.

  • And finally, adds “lightbox[post-(post_id)]” to the rel attribute of the link.
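As a rough illustration of those steps, here is a minimal sketch in Python using only the standard library. It is not the plugin's code (that is PHP built on Simple HTML DOM), and every name in it (`LightboxRewriter`, `add_lightbox_rel`, `IMAGE_URL`, the list of image extensions) is an assumption made for the example. The caption lookup is left out because it depends on the database, and the sketch expects a post fragment rather than a whole page, so doctype declarations are not preserved.

```python
import re
from html import escape
from html.parser import HTMLParser

# Assumed set of image extensions; the regex only ever sees an attribute value.
IMAGE_URL = re.compile(r"\.(?:jpe?g|png|gif|bmp)$", re.IGNORECASE)


def render_tag(tag, attrs):
    """Serialise a start tag from the (name, value) pairs the parser returned."""
    parts = [tag]
    for name, value in attrs:
        parts.append(name if value is None else f'{name}="{escape(value, quote=True)}"')
    return "<" + " ".join(parts) + ">"


def wants_lightbox(attrs):
    """True if href looks like an image and rel does not mention 'lightbox'."""
    found = dict(attrs)
    href, rel = found.get("href") or "", found.get("rel") or ""
    # Substring test, so rel="nolightbox" is skipped as well.
    return bool(IMAGE_URL.search(href)) and "lightbox" not in rel


def add_rel(attrs, post_id):
    """Append the lightbox token to rel, creating the attribute if needed."""
    token = f"lightbox[post-{post_id}]"
    result, seen = [], False
    for name, value in attrs:
        if name == "rel":
            value, seen = (f"{value} {token}" if value else token), True
        result.append((name, value))
    return result if seen else result + [("rel", token)]


class LightboxRewriter(HTMLParser):
    """Re-emits the markup, holding each <a> back until its contents are known."""

    def __init__(self, post_id):
        super().__init__(convert_charrefs=False)
        self.post_id = post_id
        self.out, self.inner = [], []
        self.link, self.has_img = None, False

    def _emit(self, text):
        (self.inner if self.link is not None else self.out).append(text)

    def flush_link(self):
        attrs = self.link
        if self.has_img and wants_lightbox(attrs):
            attrs = add_rel(attrs, self.post_id)
        self.out.append(render_tag("a", attrs))
        self.out.extend(self.inner)
        self.link, self.inner, self.has_img = None, [], False

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            if self.link is not None:  # unclosed earlier link: emit it first
                self.flush_link()
            self.link = attrs
            return
        if tag == "img" and self.link is not None:
            self.has_img = True
        self._emit(self.get_starttag_text())  # other tags pass through verbatim

    def handle_startendtag(self, tag, attrs):
        self.handle_starttag(tag, attrs)

    def handle_endtag(self, tag):
        if tag == "a" and self.link is not None:
            self.flush_link()
        self._emit(f"</{tag}>")

    def handle_data(self, data):
        self._emit(data)

    def handle_entityref(self, name):
        self._emit(f"&{name};")

    def handle_charref(self, name):
        self._emit(f"&#{name};")

    def handle_comment(self, data):
        self._emit(f"<!--{data}-->")


def add_lightbox_rel(html, post_id):
    parser = LightboxRewriter(post_id)
    parser.feed(html)
    parser.close()
    if parser.link is not None:  # input ended inside an unclosed link
        parser.flush_link()
    return "".join(parser.out)
```

Calling `add_lightbox_rel('<a href="pic.jpg"><img src="thumb.jpg"></a>', 42)` should give back the same markup with `rel="lightbox[post-42]"` added to the link, while links to pages, links without an image, and links already carrying a lightbox (or nolightbox) relation come through unchanged.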

This should be more robust than version 0.3, although I haven’t seen any problems with that version in practice. It does, however, also make the code much easier to extend in the future.

Although I’ve been meaning to do this for a while, I’ve only just got round to coding it up. Everything in the “DOMcontrolled” branch looks good in initial testing, so the version is currently set at 0.4.RC1, which I’ll roll out after it’s survived more prodding.


See also: Coding Horror > Parsing HTML the Cthulhu Way