i-nth - Data clone detection and visualization in spreadsheets

Authors

Felienne Hermans, Ben Sedee, Martin Pinzger, & Arie van Deursen

Abstract

Spreadsheets are widely used in industry: it is estimated that end-user programmers outnumber programmers by a factor of 5. However, spreadsheets are error-prone, and numerous companies have lost money because of spreadsheet errors. One of the causes of spreadsheet problems is the prevalence of copy-pasting.

In this paper, we study this cloning in spreadsheets. Based on existing text-based clone detection algorithms, we have developed an algorithm to detect data clones in spreadsheets: formulas whose values are copied as plain text in a different location.

To evaluate the usefulness of the proposed approach, we conducted two evaluations: a quantitative evaluation in which we analyzed the EUSES corpus, and a qualitative evaluation consisting of two case studies. The results of the evaluations clearly indicate that:

Data clones are common.
Data clones pose threats similar to those code clones pose.
Our approach supports users in finding and resolving data clones.
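
To make the notion of a data clone concrete, the following is a minimal sketch, not the algorithm from the paper. It assumes a sheet is given as a mapping from cell address to a hypothetical `Cell` record holding the cell's value and, for formula cells, the formula text. It only performs naive value matching: a plain-data cell is reported as a candidate clone when its value equals the computed value of some formula cell. The names `Cell` and `find_data_clone_candidates` are illustrative assumptions.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Dict, List, Optional, Tuple


@dataclass(frozen=True)
class Cell:
    """Hypothetical cell record: a value plus an optional formula string."""
    value: float
    formula: Optional[str] = None  # None means the cell holds plain data


def find_data_clone_candidates(sheet: Dict[str, Cell]) -> List[Tuple[str, str]]:
    """Return (formula_address, data_address) pairs with equal values.

    Simplified illustration: exact value matching only, with no
    filtering of common values and no grouping of neighboring cells.
    """
    # Index every formula cell by its computed value.
    formulas_by_value = defaultdict(list)
    for address, cell in sheet.items():
        if cell.formula is not None:
            formulas_by_value[cell.value].append(address)

    # A plain-data cell whose value matches a formula result is a candidate.
    pairs = []
    for address, cell in sheet.items():
        if cell.formula is None:
            for source in formulas_by_value.get(cell.value, []):
                pairs.append((source, address))
    return sorted(pairs)


# Example: A3 computes a total, and D7 holds the same number as plain text.
sheet = {
    "A1": Cell(120.0),
    "A2": Cell(80.5),
    "A3": Cell(200.5, formula="=SUM(A1:A2)"),
    "D7": Cell(200.5),
    "D8": Cell(42.0),
}
candidates = find_data_clone_candidates(sheet)
print(candidates)
```

In this example the only candidate is the pair linking the formula in A3 to the pasted value in D7. Matching single values this way would presumably produce many coincidental hits on real spreadsheets (small integers, for instance), so a practical detector would need further filtering; that part is outside the scope of this sketch.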

Sample

The clone detection pop-up shows the copy-paste dependency for our example. On the formula side, we show where the data is copied to, and on the data side, we indicate the source.

The key contributions of this paper are as follows:

The definition of data clones in spreadsheets.
An approach for their automatic detection and visualization.
An implementation of that approach into our existing spreadsheet analysis toolkit Breviz.
A quantitative evaluation of the proposed clone detection algorithm on the EUSES corpus.
A real-life evaluation with 31 spreadsheets from a Dutch non-profit organization and 1 from academia.

Publication

2012, Delft University of Technology, Software Engineering Research Group, Technical Report Series

Full article

Data clone detection and visualization in spreadsheets