Automatic refactoring tool to find similar or duplicate Java/JavaScript source code?
I'm looking for a tool to find duplicate or similar code in Java/JavaScript. I can't give an exact definition of "similar", but I'd like the tool to be smart enough to give me advice on refactoring the code, e.g.,
(1) class A and class B have similar methods (e.g., 5 methods with the same name, the same arguments, and a similar implementation appear in both classes); the tool should then advise moving these similar methods into a base class.
(2) class A has similar lines of code in several different places; the tool should advise moving these similar lines into a single method.
I tried PMD, which can find duplicate lines of code, but it's not clever enough. It did not find the similar code that is widely spread across one of my projects.
Is there such a tool?
Solution
Our CloneDR tool finds duplicated code by comparing abstract syntax trees produced by parsers. (It comes in language-specific versions for many languages, including Java and JavaScript.)
This means it can find cloned code in spite of format changes and modifications to the body of the clone, both of which are often made while cloning. Found clones match language concepts such as expressions, declarations, statements, functions, and even classes. Clones that are similar are reported along with the differences/variation points as proposed parameters.
It can find clone sets with multiple instances (we've seen some applications with hundreds of clones of a single bit of code), and it can find clones across many source files.
It produces HTML reports that are directly readable by people, and XML reports that can be processed by other downstream tools. (You can see some sample HTML reports via the link).
Similarity is hard to define, and in fact you can define it in many ways. CloneDR defines it as the number of identical elements (technically, AST nodes) across a clone set divided by the total number of elements across the clone set. This ratio is a value between 0 and 1. It is compared against a threshold; we've found that 95% (a ratio of 0.95) is surprisingly robust as a threshold in terms of the quality of reported clones.
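To make the ratio concrete, here is a minimal sketch for a clone set with just two instances. It is not CloneDR's code: the Node type, the lockstep matching rule, and all names are assumptions made for illustration. Each matched node is counted once per instance, so the ratio is twice the shared node count over the combined size of both trees.

```java
import java.util.List;

final class CloneSimilarity {
    /** Toy AST node: a label plus ordered children (hypothetical, not CloneDR's data model). */
    record Node(String label, List<Node> children) {
        static Node of(String label, Node... children) {
            return new Node(label, List.of(children));
        }

        /** Number of nodes in this subtree. */
        int size() {
            int n = 1;
            for (Node c : children) n += c.size();
            return n;
        }
    }

    /** Counts nodes that match when both trees are walked top-down in lockstep (an assumed matching rule). */
    static int shared(Node a, Node b) {
        if (!a.label().equals(b.label())) return 0;
        int n = 1;
        int k = Math.min(a.children().size(), b.children().size());
        for (int i = 0; i < k; i++) {
            n += shared(a.children().get(i), b.children().get(i));
        }
        return n;
    }

    /** Identical elements across both instances divided by all elements across both instances. */
    static double similarity(Node a, Node b) {
        return 2.0 * shared(a, b) / (a.size() + b.size());
    }

    public static void main(String[] args) {
        Node left = Node.of("*", Node.of("a"), Node.of("b"));
        Node right = Node.of("*", Node.of("x"), Node.of("y"));
        double s = similarity(left, right);
        // 0.95 mirrors the 95% threshold mentioned in the answer.
        System.out.println(s + (s >= 0.95 ? " -> reported" : " -> below threshold"));
    }
}
```

In this toy model every differing leaf lowers the ratio, so only large subtrees with few variation points can clear a 0.95 threshold.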
It is useful to establish a minimum size for interesting clones. The expression a*b is a clone of x*y (with 2 parameters) but isn't useful to report because it is too small. CloneDR also uses a size threshold which we call "line count", but which is in fact the size of the clone in elements divided by the average number of elements per line across the entire code base. This produces clones which usually have more lines than the threshold, but it will also find clones of enormous expressions that sit within a single line. We've found that 5-6 "lines" is also fairly robust in terms of reported clone quality.
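The "line count" arithmetic can be sketched as follows. This is only an illustration of the calculation described above; the method names and the sample numbers are made up, and nothing here is CloneDR's actual API.

```java
final class CloneSizeThreshold {
    /** Clone size in elements divided by the code base's average number of elements per line. */
    static double effectiveLines(int cloneElements, long totalElements, long totalLines) {
        double averagePerLine = (double) totalElements / totalLines;
        return cloneElements / averagePerLine;
    }

    /** A clone is worth reporting only if its effective "line count" reaches the minimum. */
    static boolean isInteresting(int cloneElements, long totalElements, long totalLines, double minLines) {
        return effectiveLines(cloneElements, totalElements, totalLines) >= minLines;
    }

    public static void main(String[] args) {
        // Hypothetical code base: 100,000 AST elements over 10,000 lines, i.e. 10 elements per line.
        long totalElements = 100_000;
        long totalLines = 10_000;
        // A 3-element expression such as a*b counts as a fraction of a "line".
        System.out.println(isInteresting(3, totalElements, totalLines, 6.0));
        // A 70-element clone counts as 7 "lines", even if it is written on one physical line.
        System.out.println(isInteresting(70, totalElements, totalLines, 6.0));
    }
}
```

Because the measure counts elements rather than physical lines, one very long expression can pass the threshold while several short, sparse lines may not.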
This table shows how effective the AST matching approach of CloneDR is compared to many other clone detection tools (ranking it "very well"). The only one that comes close is CCDIML …. which is an academic re-implementation of the CloneDR approach. There are other approaches (namely PDG-based approaches) which can more effectively detect clones that have been scattered about, but in practice, in my personal experience, people who clone code don't usually cut the cloned part into a bunch of separate pieces and scatter them about; they are just too lazy. YMMV.
[Table from: Roy, Cordy, Koschke: Comparison and Evaluation of Code Clone Detection Techniques and Tools: A Qualitative Approach, Science of Computer Programming, Volume 74, Issue 7, May 2009. This paper sketches many different clone detection approaches and evaluates their effectiveness.]
[PMD isn't listed, but it apparently uses Rabin-Karp string matching, which makes it "text based" in the terms of the above table, rather than AST matching.]
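For contrast, here is a rough sketch of what text-based matching with a Rabin-Karp style rolling hash looks like. It shows the general technique only and is not PMD's implementation; the class name, the line-based windows, and the hash constants are all assumptions.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

final class RollingHashDuplicates {
    private static final long BASE = 1_000_003L;     // arbitrary multiplier for the polynomial hash
    private static final long MOD = 1_000_000_007L;  // arbitrary large prime modulus

    /** Returns pairs {i, j}, i < j, where the windows of `window` lines starting at i and j are identical. */
    static List<int[]> findDuplicateWindows(List<String> lines, int window) {
        List<int[]> result = new ArrayList<>();
        int n = lines.size();
        if (window <= 0 || n < window) return result;

        // Reduce every (trimmed) line to a number in [0, MOD).
        long[] h = new long[n];
        for (int i = 0; i < n; i++) {
            h[i] = Math.floorMod((long) lines.get(i).trim().hashCode(), MOD);
        }

        long highPow = 1; // BASE^(window-1) mod MOD, the weight of the line leaving the window
        for (int i = 1; i < window; i++) highPow = highPow * BASE % MOD;

        long hash = 0;
        for (int i = 0; i < window; i++) hash = (hash * BASE + h[i]) % MOD;

        Map<Long, List<Integer>> seen = new HashMap<>();
        for (int start = 0; ; start++) {
            List<Integer> earlier = seen.computeIfAbsent(hash, key -> new ArrayList<>());
            for (int prev : earlier) {
                // Equal hashes can be collisions, so confirm with a direct comparison.
                if (sameWindow(lines, prev, start, window)) result.add(new int[] {prev, start});
            }
            earlier.add(start);
            if (start + window >= n) break;
            // Roll the window: drop the oldest line, shift, and append the next line.
            hash = (hash - h[start] * highPow % MOD + MOD) % MOD;
            hash = (hash * BASE + h[start + window]) % MOD;
        }
        return result;
    }

    private static boolean sameWindow(List<String> lines, int a, int b, int window) {
        for (int k = 0; k < window; k++) {
            if (!lines.get(a + k).trim().equals(lines.get(b + k).trim())) return false;
        }
        return true;
    }

    public static void main(String[] args) {
        List<String> exact = List.of("int t = a * b;", "log(t);", "return t;",
                                     "// ...", "int t = a * b;", "log(t);", "return t;");
        List<String> renamed = List.of("int t = a * b;", "log(t);", "return t;",
                                       "// ...", "int u = x * y;", "log(u);", "return u;");
        System.out.println(findDuplicateWindows(exact, 3).size());
        System.out.println(findDuplicateWindows(renamed, 3).size());
    }
}
```

A matcher like this only sees runs of identical text, which is why a copy with renamed variables slips past it while an AST comparison can still report it with the names as parameters.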
Re OP's requirements:
CloneDR (like every other tool I know of) will NOT report a set of similar methods spread across multiple classes as a single clone if those methods occur in different orders in the different classes. In this case, CloneDR is more likely to report the individual methods as clones; the net result is the same. It will find such a set if the members occur sequentially in the same order in the different classes, as happens when one class body has been copied wholesale from another.
Similar code blocks across multiple methods are quite commonly detected. The generated report shows how the similar code blocks are related, including an abstracted version of the code which is essentially the parameterized code block you need for a method body.
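As an illustration of what such a parameterized block amounts to, here is a small hand-written before/after pair. It is not CloneDR output; the class, the method names, and the numbers are invented for the example. The two loops differ only in the array they read and the rate they apply, so those two variation points become the parameters of the extracted method.

```java
final class ExtractedCloneExample {
    /** Before: the same loop shape appears twice with two variation points (array and rate). */
    static double totalsBefore(double[] prices, double[] fees) {
        double priceSum = 0;
        for (double p : prices) {
            priceSum += p * 1.20;
        }
        double feeSum = 0;
        for (double f : fees) {
            feeSum += f * 1.05;
        }
        return priceSum + feeSum;
    }

    /** The abstracted clone: the variation points have become parameters. */
    static double scaledSum(double[] values, double rate) {
        double sum = 0;
        for (double v : values) {
            sum += v * rate;
        }
        return sum;
    }

    /** After: each clone instance is replaced by a call with its own arguments. */
    static double totalsAfter(double[] prices, double[] fees) {
        return scaledSum(prices, 1.20) + scaledSum(fees, 1.05);
    }

    public static void main(String[] args) {
        double[] prices = {10.0, 20.0};
        double[] fees = {1.0, 2.0};
        // Both versions should print the same total.
        System.out.println(totalsBefore(prices, fees));
        System.out.println(totalsAfter(prices, fees));
    }
}
```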