do()
does not.
—Theory
    my $dt = DateTime->new( some date );
    no warnings 'redefine';
    local *DateTime::now = sub { $dt->clone };
    # run the test and compare against $dt
If the module under test also used DateTime, I had predictable values for testing. Of course, edge cases and special conditions depend on the value injected as 'now' (as morungos already mentioned). I don't know if this is the best way to do it, but at least I feel I won't be surprised...
And if bugs arise, adding new test cases then is very simple.
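As a sketch of how that looks in a complete test: the next_billing_date function below is invented for illustration (it just needs to call DateTime->now internally), and DateTime must be installed from CPAN.

```perl
use strict;
use warnings;
use Test::More tests => 1;
use DateTime;

# Hypothetical function under test -- invented for illustration.
# It calls DateTime->now internally, which is what we mock.
sub next_billing_date {
    my $now = DateTime->now;
    return $now->clone->add( months => 1 )->truncate( to => 'month' );
}

{
    my $dt = DateTime->new( year => 2010, month => 1, day => 15 );
    no warnings 'redefine';
    local *DateTime::now = sub { $dt->clone };

    is( next_billing_date()->ymd, '2010-02-01',
        'billing date computed from the injected "now"' );
}
```

Because the override is local, the real DateTime::now is restored as soon as the block exits, so other tests are unaffected.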
Once you have that small part, there is a case for saying you should parse it as an HTML fragment and walk over the results. But usually you do not want that level of detail or flexibility. A regexp is often easy to write because you start with an example of what you are matching and then selectively replace certain parts with (.+) capturing groups; refine it a bit more as necessary until it works for all the example input you have. You know nothing about what the format will be when you fetch the page tomorrow, so you can't worry about it now. (It will usually be exactly the same, or else quite different requiring a rewrite.)
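That workflow can be sketched in a few lines; the snippet and the class name are invented for illustration. You start from a literal copy of the markup and swap the varying part for a capturing group:

```perl
use strict;
use warnings;

# Literal example markup (invented), with the varying value
# replaced by a lazy capturing group in the pattern.
my $html = '<span class="price">$19.99</span>';
if ( $html =~ m{<span class="price">\$(.+?)</span>} ) {
    print "price: $1\n";    # prints "price: 19.99"
}
```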
I don’t know what you are trying to do but last I tried, I found it far easier to locate a tag on an HTML page with a CSS selector than to laboriously match out the text with a pattern (not to mention having to manually handle charsets, unescaping etc. in that case). Even XPath is still a lot easier.
Are you using a DOM and manually walking it with nested loops or something? I can’t imagine how else you’d find it even harder than using regexps.
That said, a regexp approach can be declarative too. Because you can start by taking an example of the HTML you want to match and adding placeholders, it is immediately clear what HTML structure is matched - as long as you keep your regexp nicely formatted and commented using /x. I would paste in some example code I use to scrape Outlook Web Access, but this comment form doesn't allow plain text.
The ideal would be a kind of template language where you write pseudo-HTML with placeholders, this is then parsed into a structured query and that is matched against the parsed HTML document. There are various XML query languages, but I don't think any of them will cope well with the tag soup found on typical web pages.
No amount of /x will make regexps look less messy than a CSS selector. Aside from that, the regexp will break not only when the layout changes, but on even the most minor variation: the quotes around an attribute changing from double to single, or being removed altogether; the order of attributes changing; a comment being inserted somewhere; etc. Some of these contingencies you can defend against, but at the price of uglying up the regexp badly.
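To illustrate the trade-off: here is one href match (example tags invented) defended against quote style, missing quotes, and extra attributes in any order. It works, but compare its bulk to the CSS selector `a[href]`:

```perl
use strict;
use warnings;

# Matching href while tolerating quote variations and
# other attributes appearing before it (examples invented).
my $re = qr{
    <a \s [^>]*?        # any attributes before href
    href \s* = \s*      # the attribute itself, spacing optional
    ["']?               # double, single, or no quote
    ( [^"'\s>]+ )       # the URL
}x;

for my $tag ( q{<a href="/foo">}, q{<a href='/foo'>},
              q{<a class=ext href=/foo>} ) {
    print "$1\n" if $tag =~ $re;    # prints "/foo" each time
}
```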
Also, as I said, what you get from a regexp match is a fragment of HTML – even if it does not contain tags. At the least, you have to deal with encodings and entities yourself.
If you know the markup is not going to change in any such way and you are dealing with a highly restricted set of values – something like a script that scrapes your online status out of your WLAN router’s web interface or some such –, then sure, sometimes regexes are easier.
But they really don’t scale very far.
Try something like Web::Scraper or the screenscraping stuff in Mojo sometime if you haven’t. Believe me, you’re making yourself a lot of unnecessary work.
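For comparison, the Mojo route looks roughly like this; Mojo::DOM ships with the Mojolicious distribution, and the markup here is invented (something like the router status-page example above):

```perl
use strict;
use warnings;
use Mojo::DOM;

# Parse a fragment and pull out one element with a CSS selector.
my $dom = Mojo::DOM->new(
    '<div id="status">You are <b>online</b></div>'
);
print $dom->at('#status b')->text, "\n";    # prints "online"
```

Charsets and entity unescaping are handled by the parser, so what you get back is text, not a fragment of HTML.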
I don't see a complete solution to the variable-renaming issue, but perhaps it can be included in the distance metric between string elements.
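One blunt way to fold renaming into the comparison is to normalize identifiers to a placeholder before computing the distance. A sketch, handling scalars only; real code would want a proper tokenizer rather than a regexp:

```perl
use strict;
use warnings;

# Map every scalar variable to the same placeholder so that
# renamed-but-identical fragments compare equal (sketch only).
sub normalize {
    my ($code) = @_;
    $code =~ s/\$\w+/\$V/g;
    return $code;
}

print normalize('my $total = $price * $qty;') eq
      normalize('my $sum = $cost * $n;') ? "same\n" : "differ\n";
# prints "same"
```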
Finally, duplicate code is closely related to compressibility. If there's a compression algorithm implementation that allows fine-grained inspection of the compression tables & their statistics, that might be a fruitful area too.
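Short of inspecting the compression tables, the idea can be tried crudely with the core Compress::Zlib module: concatenating two similar fragments should compress nearly as well as the larger one alone. This is a rough normalized-compression-distance sketch, not a tuned clone detector:

```perl
use strict;
use warnings;
use Compress::Zlib qw(compress);

# Lower score = more shared structure between the two strings.
sub ncd {
    my ( $x, $y ) = @_;
    my ( $cx, $cy, $cxy ) = map { length compress($_) } $x, $y, $x . $y;
    my ( $min, $max ) = $cx < $cy ? ( $cx, $cy ) : ( $cy, $cx );
    return ( $cxy - $min ) / $max;
}

my $a = "for my \$i (1 .. 10) { push \@out, \$i * 2 }\n" x 5;
my $b = "Lorem ipsum dolor sit amet, consectetur adipiscing.\n" x 5;
printf "dup: %.2f  distinct: %.2f\n", ncd( $a, $a ), ncd( $a, $b );
```

The scores are only useful relative to each other, but duplicated code should reliably land below unrelated code.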