Finding Duplicate Code in Perl
When working with legacy code, it's useful to have a variety of tools that let you better understand your code base. For example, I recently wrote about finding unused subroutines. That heuristic approach was fine because I was still going to inspect the code manually rather than automatically remove the code.
So now I've hacked out a rough "duplicate code finder" for Perl. It focuses on cut-n-drool code and has found more than I would have thought (even in my code!). If a developer changes variable names, it won't find the duplication, but if I hacked around with B::Deparse, I could fix that, too.
Here's the hack:
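To give a feel for the approach, here is a rough, illustrative sketch of the same idea (not the actual script): pull the named subs out of each .pm file with PPI, squash the whitespace, and flag any two sub bodies that hash to the same value.

    #!/usr/bin/env perl
    use strict;
    use warnings;
    use Digest::MD5 'md5_hex';
    use File::Find;
    use PPI;

    # Collect every .pm file under the directories named on the command line.
    my @files;
    find( sub { push @files, $File::Find::name if /\.pm\z/ }, @ARGV );

    # Map a digest of each whitespace-normalized sub body to where it was seen.
    my %seen;
    for my $file (@files) {
        my $doc  = PPI::Document->new($file) or next;
        my $subs = $doc->find('PPI::Statement::Sub') or next;
        for my $sub (@$subs) {
            next if $sub->isa('PPI::Statement::Scheduled');   # skip BEGIN/END blocks
            my $block = $sub->block or next;                  # skip forward declarations
            ( my $body = $block->content ) =~ s/\s+/ /g;      # squash whitespace
            push @{ $seen{ md5_hex($body) } },
                sprintf '%s (%s, line %d)', $sub->name, $file, $sub->line_number;
        }
    }

    # Any digest seen more than once is a likely cut-n-drool candidate.
    for my $locations ( grep { @$_ > 1 } values %seen ) {
        print "Possible duplicates:\n";
        print "    $_\n" for @$locations;
    }

Because the hash is over the literal (whitespace-normalized) text, this only catches verbatim cut-n-drool; renamed variables defeat it, which is exactly the limitation mentioned above.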
On a recent run, it took about 15 minutes to find duplicate code in a small project with 111 .pm files, though on DBIx::Class, it took less than five minutes on 375 files.
Patches welcome!
In other news, my "Beginning Perl" book has its first review on Amazon and it's five stars. I feel like I've dodged a bullet :)
Update: I just ran this against Catalyst, Moose and DBIx::Class. The first two ran quickly and were remarkably clean. DBIx::Class is still running and here's some sample output (not all of it):
My thoughts on improving this? Hack B::Deparse to emit the same variable name for every variable. Then I would cache two versions of the code: one with normal B::Deparse and the other with the hacked version. That way I could find more duplicated logic and then use the "normal" B::Deparse output to show the actual code with correct variable names. However, the code would run a bit slower as you would have to deparse twice.
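For anyone who wants to experiment with that, the hook is probably B::Deparse's padname method. A minimal, untested sketch, which keeps the sigil so the output remains legal Perl:

    package B::Deparse::SameName;
    use strict;
    use warnings;
    use parent 'B::Deparse';

    # Collapse every lexical to the same generic name; keeping the sigil
    # ($, @, %) means the deparsed code is still syntactically valid Perl.
    sub padname {
        my $self = shift;
        my $name = $self->SUPER::padname(@_);
        return substr( $name, 0, 1 ) . 'var';
    }

    package main;

    # Deparse the same code twice: faithfully for display, and with names
    # collapsed for comparison.
    my $code      = sub { my $total = 0; $total += $_ for @_; return $total };
    my $faithful  = B::Deparse->new->coderef2text($code);
    my $collapsed = B::Deparse::SameName->new->coderef2text($code);

Comparing the collapsed output finds renamed-variable duplicates, while the faithful output is what you would show to the user.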
Now this is going to be very useful. Thanks Ovid :)
Interesting approach. A few months back I worked on a copy-paste detector that uses the same approach as is used for natural text: Levenshtein distance.
In my case I ran all the relevant code through PPI to get subs and individual code blocks out of the document, removed noise from them (whitespace, etc.) and then simply compared each text block to all the others, outputting them sorted by Levenshtein distance.
It still needs some fine-tuning and doesn't run all that fast, but on a big body of legacy code it has already helped me find duplicate blocks that were copy-pasted and subtly altered, for example by changing how values are returned or how some vars are named.
I should publicize that sometime.
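(For readers who want to try that pipeline, here is a rough sketch of its shape, not the detector itself: PPI pulls out the subs, whitespace is stripped, and every pair is ranked by edit distance via Text::LevenshteinXS.)

    use strict;
    use warnings;
    use PPI;
    use Text::LevenshteinXS 'distance';

    # Pull the subs out of each file given on the command line and strip
    # whitespace so formatting differences don't inflate the distance.
    my @blocks;
    for my $file (@ARGV) {
        my $doc  = PPI::Document->new($file) or next;
        my $subs = $doc->find('PPI::Statement::Sub') or next;
        for my $sub (@$subs) {
            ( my $text = $sub->content ) =~ s/\s+//g;
            push @blocks, { where => "$file line " . $sub->line_number, text => $text };
        }
    }

    # Compare every block against every other one; the smallest distances
    # are the most suspicious pairs.
    my @pairs;
    for my $i ( 0 .. $#blocks - 1 ) {
        for my $j ( $i + 1 .. $#blocks ) {
            push @pairs, [
                distance( $blocks[$i]{text}, $blocks[$j]{text} ),
                $blocks[$i]{where}, $blocks[$j]{where},
            ];
        }
    }
    printf "%6d  %s <=> %s\n", @$_ for sort { $a->[0] <=> $b->[0] } @pairs;

The pairwise comparison is quadratic in the number of blocks, which is part of why this approach gets slow on large code bases.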
I do think the edit distance would be useful, but often find that duplicated code is simply duplicated. I would, of course, prefer to also find subtle variable renaming, and stuff, but so far this simplistic approach has been fast and useful.
I also think you should publicize your work :)
I have found that when using words as atoms, not chars, you can compute the Levenshtein distance much faster. I think it is applicable to the code comparison case.
Here is the function I am using:
Best regards
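(The word-as-atom idea is easy to sketch, purely as an illustration and not the function referred to above: split both blocks into words and run the usual dynamic program over the two word lists, which means far fewer cells than a character-by-character comparison.)

    use strict;
    use warnings;
    use List::Util 'min';

    # Levenshtein distance over words rather than characters: shorter arrays
    # to compare, so the O(n*m) dynamic program runs on far fewer cells.
    sub word_levenshtein {
        my ( $a_text, $b_text ) = @_;
        my @a = split /\s+/, $a_text;
        my @b = split /\s+/, $b_text;

        my @prev = ( 0 .. scalar @b );
        for my $i ( 1 .. scalar @a ) {
            my @curr = ($i);
            for my $j ( 1 .. scalar @b ) {
                my $cost = $a[ $i - 1 ] eq $b[ $j - 1 ] ? 0 : 1;
                push @curr, min(
                    $prev[$j] + 1,           # deletion
                    $curr[ $j - 1 ] + 1,     # insertion
                    $prev[ $j - 1 ] + $cost, # substitution
                );
            }
            @prev = @curr;
        }
        return $prev[-1];
    }

    print word_levenshtein( 'my $x = shift', 'my $y = shift' ), "\n";    # 1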
I don't know if this is any use, but it's a Perl module for finding the closest match to a string in an array:
https://metacpan.org/module/Text::Fuzzy
It's optimized for scanning over arrays to find the closest match.
I tested a wide variety of modules for calculating Levenshtein distance and I think I touched Text::Fuzzy too. The main problem is that there are two different algorithms (the normal one and the Damerau variant) and varying compatibility with UTF-8. Right now I've ended up using Text::LevenshteinXS, though it's been a while and I might need to reevaluate.
That kind of pure-Perl Levenshtein analysis would be too slow for the amount of code I'm working with.
However, you have a brilliant idea there: when reducing the text, I can cut out large amounts of it by replacing the names of functions/variables/etc. with unique identifiers that use a minimal number of characters. So, thanks for that idea. :D
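(A hedged sketch of that reduction, using PPI to map each distinct variable to a short, stable identifier before comparison; it only touches PPI::Token::Symbol, so function names and other barewords are left alone.)

    use strict;
    use warnings;
    use PPI;

    # Rewrite every variable in a snippet so that distinct names map to
    # short identifiers ($a1, $a2, ...) before measuring distance.
    sub shrink_identifiers {
        my ($code) = @_;
        my $doc = PPI::Document->new( \$code ) or return $code;

        my ( %map, $n );
        for my $sym ( @{ $doc->find('PPI::Token::Symbol') || [] } ) {
            my $name  = $sym->content;            # e.g. '$long_variable_name'
            my $sigil = substr $name, 0, 1;
            $map{$name} //= $sigil . 'a' . ++$n;
            $sym->set_content( $map{$name} );
        }
        return $doc->serialize;
    }

    print shrink_identifiers('my $total_price = $unit_price * $quantity;'), "\n";
    # my $a1 = $a2 * $a3;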
Mithaldu,
I don't (yet) support edit distance, but I do support (and this is the default with the command-line tool) handling renamed variables and subroutines. I have the beginnings of a distribution here.
Cheers, Ovid
Oh, and that distribution also has a --jobs switch so you can take advantage of multiple cores, if you have them. It's been very helpful for me.
I've done something similar, but across multiple programming languages. We typically had multi-gigabyte collections, and precise locations were harder, but the basic approach I used was very different, and probably worth discussing.
First, I worked a line at a time through the files. We couldn't parse, because of the use of different languages. For each line I computed a hash (Adler32 for speed) with a little canonicalization of spaces and case. Then I used a rolling hash over the line hashes (yes, I know, but it worked) with a window size of about 5, which seemed to be comparable to PMD's CPD -- I know that uses tokens rather than lines, but it was good enough. I could have used a rolling hash on tokens directly, but identifying a mostly language-independent set of tokens is far from trivial.
I'd planned to write a second pass to validate results, but other tasks took priority once the basics worked. Anyway, with the rolling hash, any collision at all implies a candidate duplicate window, which can then be tested in more detail, now knowing which lines and which files to assess. Using this as a prefilter might add a good amount of performance and also let the process scale. As an interim measure, we needed to estimate the amount of duplication as a metric, and this approach was sufficient for that; it could process gigabytes as fast as they came off the disk.
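A condensed sketch of that prefilter, assuming Compress::Zlib's adler32 and, for clarity, re-hashing each window rather than rolling it incrementally:

    use strict;
    use warnings;
    use Compress::Zlib 'adler32';

    my $WINDOW = 5;    # lines per window, roughly what is described above

    # Hash each line after canonicalizing case and whitespace, then hash each
    # sliding window of line hashes. Any window hash seen in two places is a
    # candidate duplicate region worth inspecting more closely.
    my %candidates;
    for my $file (@ARGV) {
        open my $fh, '<', $file or next;
        my @lines;    # [ original line number, hash of canonicalized line ]
        while ( my $line = <$fh> ) {
            ( my $canon = lc $line ) =~ s/\s+//g;
            next unless length $canon;    # skip blank lines
            push @lines, [ $., adler32($canon) ];
        }
        for my $i ( 0 .. $#lines - $WINDOW + 1 ) {
            my $window = join ',', map { $_->[1] } @lines[ $i .. $i + $WINDOW - 1 ];
            push @{ $candidates{ adler32($window) } }, "$file line $lines[$i][0]";
        }
    }

    for my $hits ( grep { @$_ > 1 } values %candidates ) {
        print "Candidate duplicate window:\n";
        print "    $_\n" for @$hits;
    }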
I'd intended to roll this from Perl into a pure C utility. That's still in my dev folder, unfortunately.
There are a million variants on edit-distance-like techniques, all under the "sequence alignment" umbrella. I'd actually recommend using a technique that finds the best local alignments between the text and itself; any promising off-diagonal alignments can then be pursued.
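(For the curious, the classic local-alignment dynamic program, Smith-Waterman, over two token lists looks roughly like this; the scoring values and the whitespace tokenization are arbitrary choices for illustration.)

    use strict;
    use warnings;
    use List::Util 'max';

    # Smith-Waterman local alignment over two token lists: the highest-scoring
    # cell marks the end of the best locally matching region.
    sub best_local_alignment {
        my ( $left, $right ) = @_;    # array refs of tokens
        my ( $match, $mismatch, $gap ) = ( 2, -1, -1 );

        my @H = map { [ (0) x ( @$right + 1 ) ] } 0 .. @$left;
        my ( $best, $best_i, $best_j ) = ( 0, 0, 0 );

        for my $i ( 1 .. @$left ) {
            for my $j ( 1 .. @$right ) {
                my $s = $left->[ $i - 1 ] eq $right->[ $j - 1 ] ? $match : $mismatch;
                my $score = max(
                    0,
                    $H[ $i - 1 ][ $j - 1 ] + $s,    # match or mismatch
                    $H[ $i - 1 ][$j] + $gap,        # gap in one sequence
                    $H[$i][ $j - 1 ] + $gap,        # gap in the other
                );
                $H[$i][$j] = $score;
                ( $best, $best_i, $best_j ) = ( $score, $i, $j ) if $score > $best;
            }
        }
        return ( $best, $best_i, $best_j );    # score and end of the best region
    }

    # Example: align two similar one-liners token by token.
    my @x = split /\s+/, 'my $x = shift; return $x * 2;';
    my @y = split /\s+/, 'my $y = shift; return $y * 2 + 1;';
    my ( $score, $end_x, $end_y ) = best_local_alignment( \@x, \@y );
    print "best local score $score ending at tokens $end_x/$end_y\n";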
I don't see a complete solution to the variable-renaming issue, but perhaps it can be included in the distance metric between string elements.
Finally, duplicate code is closely related to compressibility. If there's a compression algorithm implementation that allows fine-grained inspection of the compression tables & their statistics, that might be a fruitful area too.