Repeated Code

mako132 on 2002-10-25T02:12:16

As we ended this week's code review session, the facilitator remarked that next week's code (two large CGIs of 2600 and 3900 lines each) would be difficult to get through. I replied that they are actually about a third of the real size because they consist of so much repeated code and HTML.

I got to thinking there must be some way I could find all the repeated code before I start reviewing it and so I wrote a script which gives output like:

Line 309 repeated on line(s) 610, 752, 841, 891, 1100, 1484, 1502, 1546, 1567
Line 310 repeated on line(s) 611, 753, 842, 892, 1101, 1485, 1503, 1547, 1568
Line 311 repeated on line(s) 612, 754, 843, 893, 1102, 1486, 1548, 1569
Line 312 repeated on line(s) 613, 755, 844, 894, 1103, 1487, 1504, 1549, 1570
Line 313 repeated on line(s) 614, 759, 848, 898, 1104, 1488, 1508, 1553, 1574
Line 314 repeated on line(s) 615, 760, 849, 899, 1105, 1489, 1509, 1554, 1575
Line 315 repeated on line(s) 616, 1510, 1555
Line 316 repeated on line(s) 617, 762, 851, 901, 1491, 1511, 1556, 1577
Line 317 repeated on line(s) 618, 763, 852, 902, 1108, 1492, 1512, 1557, 1578
Line 318 repeated on line(s) 619, 764, 853, 903, 1109, 1493, 1513, 1558, 1579

The above is an example of a block of 10 lines of code repeated 8 times, with a minor variation.

So with a report like this, I can take a printout of the code (with line numbers) and a highlighter pen, and mark out the large repeated code blocks.

Naturally, my script detects things like "$i = 0;" or "=cut", but such code is usually not found in large blocks of lines, so it doesn't stand out.
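The original script isn't shown, but its core idea can be sketched. This is a hypothetical reconstruction in Python (purely for illustration; the author's actual script was presumably Perl, and `find_repeats` is a made-up name): index each nontrivial line by its trimmed text, then report later occurrences against the first. The 5-character threshold matches the one the author mentions later in the thread.

```python
# Hypothetical sketch of the repeated-line report: map each trimmed line
# (5+ characters) to the 1-based line numbers where it occurs, then emit
# one report line per first occurrence.

def find_repeats(lines, min_len=5):
    where = {}  # trimmed line text -> list of 1-based line numbers
    for n, line in enumerate(lines, start=1):
        text = line.strip()
        if len(text) < min_len:        # skip short idioms like "$i = 0;"
            continue
        where.setdefault(text, []).append(n)
    report = []
    for numbers in where.values():
        if len(numbers) > 1:
            first, *rest = numbers
            report.append("Line %d repeated on line(s) %s"
                          % (first, ", ".join(map(str, rest))))
    return sorted(report, key=lambda r: int(r.split()[1]))
```

Running this over a file's lines yields exactly the kind of report quoted above, ready for the highlighter-pen pass.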

Now I'm thinking about how to take this forward - perhaps process the original file and insert footnotes, or glyphs like [snip], in place of the repeated code.


Substantive content of Perl programs

TorgoX on 2002-10-25T04:41:03

That is very clever!

A random idea: measuring the degree of redundancy in a Perl source file. I think it'd have to be something other than just simple text redundancy, since you don't want short symbol names (or hash keys) being favored over long ones.

Re:Substantive content of Perl programs

mako132 on 2002-10-25T14:36:16

It was interesting to run it against things like XML/Parser.pm, CGI.pm, LWP.pm. LWP.pm has only 3 repeated lines (of size 5 char or more).

Of course the statistic needs to be weighted by the number of lines and length of the lines.

What is debatable is whether one should toss single-line idioms such as

@out = sort { $a <=> $b } @list;

which might occur in several places, into a single subroutine.

Re:Substantive content of Perl programs

jdavidb on 2002-10-25T16:58:41

I think a feature to detect the longest sequence of repeated lines would be useful, too.
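jdavidb's suggestion can be sketched too (again in Python for illustration; `longest_repeated_run` is a hypothetical helper): for every pair of matching lines, extend the match downward and keep the longest run found. A naive O(n^2) scan is plenty for files of a few thousand lines.

```python
# Sketch: find the longest run of consecutive lines that appears verbatim
# at two different places in the file.  Runs are kept non-overlapping.

def longest_repeated_run(lines):
    best = (0, 0, 0)   # (run length, first position, second position), 1-based
    n = len(lines)
    for i in range(n):
        for j in range(i + 1, n):
            if lines[i] != lines[j]:
                continue
            k = 0
            # extend while lines keep matching and the runs don't overlap
            while j + k < n and i + k < j and lines[i + k] == lines[j + k]:
                k += 1
            if k > best[0]:
                best = (k, i + 1, j + 1)
    return best
```

On the report quoted earlier, this would flag the ten-line block starting at line 309 as a single long repeat rather than ten one-line ones.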

Visualizing it

dws on 2002-10-25T19:22:39

I wonder if you might get additional traction by charting the data. I'm thinking along the lines of using GD, and dropping a dot at each "X repeats at Y" point. You'd end up with an upper triangular matrix that should show diagonal stripes where repeats occur.
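A text-only sketch of this idea (in Python for illustration; a real version might use GD or a plotting library, and `repeat_matrix` is a made-up name): drop a mark at (x, y) whenever line y repeats line x, with y > x. Copy-pasted blocks show up as diagonal streaks of marks.

```python
# Build an upper-triangular character grid: '#' where line y repeats
# line x (y > x), '.' elsewhere.  Diagonal runs of '#' are repeated blocks.

def repeat_matrix(lines):
    n = len(lines)
    grid = [["." for _ in range(n)] for _ in range(n)]
    for x in range(n):
        for y in range(x + 1, n):
            if lines[x] == lines[y] and lines[x].strip():
                grid[x][y] = "#"
    return "\n".join("".join(row) for row in grid)
```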

Algorithm::Diff, colored output

barries on 2002-10-25T19:48:13

A suggestion: to prevent wear and tear on those poor highlighters, you might want to emit HTML colorized output. (Use a templating system if you want to let others develop their own output formats; TT2 and HTML::Template serve opposite ends of the complexity spectrum, not to mention things like HTML::Mason.)
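The colorized-output idea could look something like this (a Python sketch rather than a templated Perl solution; `colorize` and the color palette are made up): assign each group of repeated lines a background color and wrap its occurrences in styled spans.

```python
# Sketch: render a file as HTML, wrapping each group of repeated lines
# (5+ characters) in a <span> with a per-group background color.

from html import escape

COLORS = ["#ffd", "#dfd", "#ddf", "#fdd"]   # arbitrary pale backgrounds

def colorize(lines, min_len=5):
    counts = {}
    for line in lines:
        t = line.strip()
        if len(t) >= min_len:
            counts[t] = counts.get(t, 0) + 1
    groups = {}          # repeated line text -> group index
    for t in counts:
        if counts[t] > 1:
            groups[t] = len(groups)
    out = ["<pre>"]
    for line in lines:
        t = line.strip()
        if t in groups:
            color = COLORS[groups[t] % len(COLORS)]
            out.append('<span style="background:%s">%s</span>'
                       % (color, escape(line)))
        else:
            out.append(escape(line))
    out.append("</pre>")
    return "\n".join(out)
```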

I also wanted to mention mjd's Algorithm::Diff (which Text::Diff uses) in case that would help you find longest common subsequences.

- Barrie

Re:Algorithm::Diff, colored output

mako132 on 2002-10-25T20:16:16

Definitely would do so if I either had a color printer or a laptop with a large monitor to take to the code review meeting. My 10" just isn't big enough.

I suspected someone like mjd would have worked on this before...but taking a look at the two modules you reference, I can't immediately see how you'd apply them to a single file.

Re:Algorithm::Diff, colored output

barries on 2002-10-25T20:32:04

Algorithm::Diff could be used, given a chunk of code that you've already identified as repeated, to look for "similar enough" chunks elsewhere in the file. It calculates longest common subsequences, so it can be used to identify chunks of repeated code, is all (another poster mentioned something which triggered me to think of A::D).

You could also use it to diff the lines that were tweaked between the original code chunk and the copy-paste-tweaked code chunk. Lots of visual diffs do that sort of char-by-char diffing.
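The longest-common-subsequence idea can be illustrated with Python's difflib, a rough analog of Perl's Algorithm::Diff (the two Perl snippets below are made-up examples, not from the reviewed code): diff an original chunk against a copy-paste-tweaked copy to see exactly which lines changed.

```python
# Diff an original chunk against a tweaked copy; the "replace" opcodes
# from SequenceMatcher pinpoint the lines that were edited after pasting.

import difflib

original = ["open FH, $file;", "while (<FH>) {", "  print;", "}"]
tweaked  = ["open FH, $other;", "while (<FH>) {", "  print uc;", "}"]

sm = difflib.SequenceMatcher(None, original, tweaked)
tweaks = []   # (original lines, replacement lines) pairs
for tag, i1, i2, j1, j2 in sm.get_opcodes():
    if tag == "replace":
        tweaks.append((original[i1:i2], tweaked[j1:j2]))
        print("tweaked:", original[i1:i2], "->", tweaked[j1:j2])
```

Visual diff tools do the same thing a second time, character by character, within each pair of tweaked lines.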

I mentioned Text::Diff as code that uses Algorithm::Diff and, thinking of it now, you could use it as an output formatter: make a copy of the original file and substitute the original multiline chunk of code in for each place you found an altered copy of it. You could then run the doctored file and the original file through Text::Diff in Unified or Table mode to get a side-by-side view. But since you have a limited display, that's probably not useful.

I've been tempted to make a templatized output formatter for Text::Diff with intraline diffing, but ETIME.

- Barrie