morungos

  • About: I blog about Perl. I also like pizza and coffee
  • Commented on Finding Duplicate Code in Perl
    I've done something similar, but across multiple programming languages. We typically had multi-gigabyte collections, and precise locations were harder, but the basic approach I used was very different, and probably worth discussing. First, I worked a line at a time...
  • Commented on Q: When not to use Regexp? A: HTML parsing
    Well, I have had to use regexes, without any sensible alternative, when we needed byte-level positioning information for the content. We were using early KinoSearch, and needed to tokenize HTML contents. Kino used byte-level positions at the time. We were...
  • Commented on Date arithmetic can be dangerous
    I once had a similar problem, partly due to daylight saving time. It's an interesting question: should tests of date/time be based on the module tester's environment, or on a specific environment established by the module author? My preferred...
  • Commented on RTF::Parser is looking for a new home
    I'd be happy to adopt RTF::Parser - we use it, and I have other Officey modules, so it would make sense. (Msg sent to cpan email address as well, just in case.)...
  • Commented on Custom Hacks and Comfort Levels
    I use a mocked-up C-style preprocessor for SQL. I add a few #ifdef style rules for conditional SQL. It allows for dialects, if you can live with the grottiness of the C-type syntax. Perhaps best, we have a trap for...
  • Commented on Why Doesn't the BBC Upgrade Their Software Often Enough?
    A great post on a huge problem, and we see another side to this. We're building a Catalyst product which has to be installed by around 10 different organizations, each of which outsources to shops with different protocols. In one...
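
The "#ifdef style rules for conditional SQL" mentioned in the Custom Hacks and Comfort Levels comment above can be illustrated roughly as follows. This is only a minimal sketch in Python, not morungos's actual preprocessor: the function name `preprocess_sql`, the supported directives (`#ifdef`, `#ifndef`, `#else`, `#endif`), and the `POSTGRES` symbol are all assumptions made for illustration.

```python
import re

# Hypothetical directive syntax: "#ifdef NAME", "#ifndef NAME", "#else", "#endif".
DIRECTIVE = re.compile(r"\s*#\s*(ifdef|ifndef|else|endif)\b\s*(\w+)?")


def preprocess_sql(text, defined):
    """Return the SQL lines that survive the conditional directives.

    `defined` is the set of symbols treated as defined, e.g. {"POSTGRES"}.
    """
    out = []
    stack = []  # one boolean per open conditional: is its current branch active?
    for line in text.splitlines():
        m = DIRECTIVE.match(line)
        if m is None:
            # Ordinary SQL: keep it only if every enclosing branch is active.
            if all(stack):
                out.append(line)
            continue
        directive, name = m.group(1), m.group(2)
        if directive in ("ifdef", "ifndef"):
            if name is None:
                raise ValueError("missing symbol after #" + directive)
            stack.append((name in defined) == (directive == "ifdef"))
        elif directive == "else":
            if not stack:
                raise ValueError("#else without #ifdef")
            stack[-1] = not stack[-1]
        else:  # endif
            if not stack:
                raise ValueError("#endif without #ifdef")
            stack.pop()
    if stack:
        raise ValueError("unterminated conditional block")
    return "\n".join(out)


if __name__ == "__main__":
    sql = """SELECT id, name
FROM users
#ifdef POSTGRES
LIMIT 10 OFFSET 20
#else
LIMIT 20, 10
#endif
"""
    print(preprocess_sql(sql, {"POSTGRES"}))
```

Keeping one boolean per open conditional means nested blocks work without extra bookkeeping: a line is emitted only when every enclosing branch is active.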

  • Ed Avis commented on Q: When not to use Regexp? A: HTML parsing

    Yes, I have sometimes found it easier to use HTML::TableExtract or other modules which truly parse the HTML and walk over its structure. It depends on how well-structured the page is and what you are trying to extract. A CSS selector is a nice declarative approach, and as long as you are matching based on particular attributes rather than 'the third row in the fourth table', you will be fairly robust against minor layout changes.

    That said, a regexp approach can be declarative too. Because you can start by taking an example of the HTML you want to match and adding placeholders, it…

  • Brad Gilbert commented on Q: When not to use Regexp? A: HTML parsing
  • Aristotle commented on Q: When not to use Regexp? A: HTML parsing

    Yes, that is what the link to “the cthulhu way” points to.

  • Aristotle commented on Q: When not to use Regexp? A: HTML parsing
    “Because you can start by taking an example of the HTML you want to match and adding placeholders, it is immediately clear what HTML structure is matched - as long as you keep your regexp nicely formatted and commented using /x.”

    No amount of /x will make regexps look less messy than a CSS selector. Aside from that, the regexp will break not only when the layout breaks, but for even the most minor variation, such as the quotes around an attribute changing from double to single, or being removed altogether; the order of attributes changing; a comment being…

  • Ken Williams commented on Finding Duplicate Code in Perl

    There are a million variants on edit-distance-like techniques, all under the "sequence alignment" umbrella. I'd actually recommend using a technique that finds the best local alignments between the text and itself, then any promising off-diagonal alignments can be pursued.

    I don't see a complete solution to the variable-renaming issue, but perhaps it can be included in the distance metric between string elements.

    Finally, duplicate code is closely related to compressibility. If there's a compression algorithm implementation that allows fine-grained inspection of th…
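
The link between duplicate code and compressibility in Ken Williams's comment can be shown with a rough experiment. The sketch below is in Python and uses zlib as a stand-in compressor. It does not inspect the compressor's internals, which is what the comment seems to be heading towards. It only measures how many extra compressed bytes a chunk costs when appended to text already seen. The function name `added_cost` and the sample snippets are assumptions made for illustration.

```python
import zlib


def compressed_size(text):
    """Size in bytes of the zlib-compressed UTF-8 encoding of `text`."""
    return len(zlib.compress(text.encode("utf-8"), 9))


def added_cost(context, chunk):
    """Extra compressed bytes needed to append `chunk` to `context`.

    A chunk that repeats material already present in `context` can be
    encoded mostly as back-references, so its added cost stays small.
    """
    return compressed_size(context + chunk) - compressed_size(context)


if __name__ == "__main__":
    existing = (
        "sub total {\n"
        "    my ($items) = @_;\n"
        "    my $sum = 0;\n"
        "    $sum += $_->{price} * $_->{qty} for @$items;\n"
        "    return $sum;\n"
        "}\n"
    )
    duplicate = existing  # a pasted copy of the same routine
    novel = (
        "open my $fh, '<', $path or die \"cannot read $path: $!\";\n"
        "while (my $line = <$fh>) { chomp $line; push @rows, $line }\n"
    )
    print("duplicate chunk:", added_cost(existing, duplicate), "extra bytes")
    print("novel chunk:    ", added_cost(existing, novel), "extra bytes")
```

A low added cost relative to the chunk's length suggests the chunk duplicates something earlier in the text. zlib only looks back over a 32 KB window, so this particular stand-in would not scale to large collections without chunking.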


About blogs.perl.org

blogs.perl.org is a common blogging platform for the Perl community. Written in Perl with a graphic design donated by Six Apart, Ltd.