Fuse: A reproducible, extendable, internet-scale corpus of spreadsheets

Authors

Titus Barik, Kevin Lubick, Justin Smith, John Slankas, & Emerson Murphy-Hill

Abstract

Spreadsheets are perhaps the most ubiquitous form of end-user programming software. This paper describes a corpus, called FUSE, containing 2,127,284 URLs that return spreadsheets (and their HTTP server responses), and 249,376 unique spreadsheets, contained within a public web archive of over 26.83 billion pages.

Obtained using nearly 60,000 hours of computation, the resulting corpus exhibits several useful properties over prior spreadsheet corpora, including reproducibility and extendability. Our corpus is unencumbered by any license agreements, available to all, and intended for wide usage by end-user software engineering researchers.

In this paper, we detail the data and the spreadsheet extraction process, describe the data schema, and discuss the trade-offs of FUSE with other corpora.

Sample

The sample figure from the paper shows the log-log distributions of formula cells and input cells across all of FUSE. The two distributions differ markedly.
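As a rough illustration of what a log-log distribution involves, the sketch below turns per-spreadsheet cell counts into log-scaled points. It assumes the plot relates a cell count to the number of spreadsheets with that count, which the summary here does not state. The function name `log_log_points` and the sample data are hypothetical, not part of FUSE.

```python
import math
from collections import Counter


def log_log_points(cell_counts):
    """Return (log10(count), log10(number of spreadsheets)) pairs.

    cell_counts holds one integer per spreadsheet, e.g. its number of
    formula cells. Zero counts are dropped because log10(0) is undefined.
    """
    frequency = Counter(c for c in cell_counts if c > 0)
    return [
        (math.log10(count), math.log10(sheets))
        for count, sheets in sorted(frequency.items())
    ]


# Hypothetical formula-cell counts for five spreadsheets.
formula_cells = [1, 1, 10, 100, 0]
points = log_log_points(formula_cells)
```

Running the same function over input-cell counts gives a second series, so the two distributions can be compared on shared axes.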

We also gathered over 450 metrics, including the number of times a given Excel function (such as SUM or VLOOKUP) is used, the total number of input cells (i.e., cells that are not formulas), the number of numeric input cells, and the most common formula used.
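To make these metrics concrete, here is a minimal sketch that computes a few of them for one sheet. It is not the extraction pipeline used for FUSE. It assumes a hypothetical in-memory representation in which a sheet is a list of rows, each cell is a string or None, and a formula is any cell whose text starts with "=". The names `sheet_metrics` and `FUNCTION_CALL` are illustrative only.

```python
import re
from collections import Counter

# Assumption: a function call looks like NAME( inside a formula.
# This simple pattern also matches function-like text inside string
# literals, which a real formula parser would exclude.
FUNCTION_CALL = re.compile(r"([A-Z][A-Z0-9.]*)\s*\(")


def is_number(text):
    """Return True if the cell text parses as a floating-point number."""
    try:
        float(text)
        return True
    except ValueError:
        return False


def sheet_metrics(rows):
    """Compute a handful of per-sheet metrics from a grid of cell strings."""
    function_counts = Counter()
    formula_counts = Counter()
    input_cells = 0
    numeric_input_cells = 0

    for row in rows:
        for cell in row:
            if cell is None or cell == "":
                continue  # empty cells count as neither formula nor input
            if cell.startswith("="):
                formula_counts[cell] += 1
                function_counts.update(FUNCTION_CALL.findall(cell.upper()))
            else:
                input_cells += 1
                if is_number(cell):
                    numeric_input_cells += 1

    most_common = formula_counts.most_common(1)
    return {
        "function_counts": dict(function_counts),
        "formula_cells": sum(formula_counts.values()),
        "input_cells": input_cells,
        "numeric_input_cells": numeric_input_cells,
        "most_common_formula": most_common[0][0] if most_common else None,
    }


# Hypothetical three-row sheet.
sheet = [
    ["1", "2", "=SUM(A1:B1)"],
    ["x", "3.5", "=SUM(A1:B1)"],
    ["", None, "=VLOOKUP(A1,A1:B2,2,FALSE)"],
]
metrics = sheet_metrics(sheet)
```

For this sample sheet, `metrics` reports four input cells, three of them numeric, with SUM used twice and VLOOKUP once.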

Publication

2015, IEEE/ACM Working Conference on Mining Software Repositories (MSR), May

Full article

Fuse: A reproducible, extendable, internet-scale corpus of spreadsheets