i-nth - A grammar for spreadsheet formulas evaluated on two large datasets

Authors

Efthimia Aivaloglou, David Hoepelman, & Felienne Hermans

Abstract

Spreadsheets are ubiquitous in the industrial world and often perform a role similar to other computer programs in many different domains. However, there does not exist a reliable grammar that is concise enough to facilitate research on spreadsheet formula code bases.

This paper presents a grammar for spreadsheet formulas that is compatible with the existing spreadsheet formula language, is compact enough to feasibly implement with a parser generator, and produces parse trees suited to further manipulation and analysis.
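To make the idea of a formula grammar that yields analyzable parse trees concrete, here is a minimal sketch in Python. It is not the grammar from the paper: the four-rule toy grammar, the tuple-based tree layout, and the names tokenize, Parser, parse, and called_functions are all assumptions made for illustration. It covers only numbers, simple cell references, uppercase function calls, and the four arithmetic operators.

```python
import re

# Hypothetical toy grammar (NOT the grammar from the paper):
#   formula := "=" expr
#   expr    := term (("+" | "-") term)*
#   term    := factor (("*" | "/") factor)*
#   factor  := NUMBER | CELL | NAME "(" [expr ("," expr)*] ")" | "(" expr ")"
TOKEN = re.compile(r"\s*(?:(\d+(?:\.\d+)?)|([A-Z]+\d+)|([A-Z]+)|(.))")


def tokenize(text):
    """Split a formula string into (kind, text) pairs."""
    tokens = []
    for number, cell, name, op in TOKEN.findall(text):
        if number:
            tokens.append(("NUMBER", number))
        elif cell:
            tokens.append(("CELL", cell))
        elif name:
            tokens.append(("NAME", name))
        elif op.strip():  # ignore stray whitespace
            tokens.append(("OP", op))
    return tokens


class Parser:
    """Recursive-descent parser for the toy grammar above."""

    def __init__(self, tokens):
        self.tokens = tokens
        self.pos = 0

    def peek(self):
        if self.pos < len(self.tokens):
            return self.tokens[self.pos]
        return ("EOF", "")

    def take(self, value=None):
        kind, text = self.peek()
        if value is not None and text != value:
            raise SyntaxError(f"expected {value!r}, got {text!r}")
        self.pos += 1
        return kind, text

    def formula(self):
        self.take("=")
        tree = self.expr()
        if self.peek()[0] != "EOF":
            raise SyntaxError(f"unexpected {self.peek()[1]!r}")
        return tree

    def expr(self):
        node = self.term()
        while self.peek() in (("OP", "+"), ("OP", "-")):
            op = self.take()[1]
            node = ("BinOp", op, node, self.term())
        return node

    def term(self):
        node = self.factor()
        while self.peek() in (("OP", "*"), ("OP", "/")):
            op = self.take()[1]
            node = ("BinOp", op, node, self.factor())
        return node

    def factor(self):
        kind, text = self.take()
        if kind == "NUMBER":
            return ("Number", float(text))
        if kind == "CELL":
            return ("Cell", text)
        if kind == "NAME":  # function call
            self.take("(")
            args = []
            if self.peek() != ("OP", ")"):
                args.append(self.expr())
                while self.peek() == ("OP", ","):
                    self.take()
                    args.append(self.expr())
            self.take(")")
            return ("Call", text, args)
        if (kind, text) == ("OP", "("):
            node = self.expr()
            self.take(")")
            return node
        raise SyntaxError(f"unexpected {text!r}")


def parse(text):
    return Parser(tokenize(text)).formula()


def called_functions(node):
    """Collect function names from a parse tree, left to right."""
    if node[0] == "Call":
        return [node[1]] + [n for arg in node[2] for n in called_functions(arg)]
    if node[0] == "BinOp":
        return called_functions(node[2]) + called_functions(node[3])
    return []
```

Calling parse("=SUM(A1,B2*2)+1") returns a nested tuple whose root is a "+" node with a SUM call on the left, and called_functions walks such a tree to list the functions it uses. A walker of this kind hints at how parse trees support the sort of analysis the abstract describes; a real formula grammar must also handle ranges, sheet references, strings, comparison operators, and many other constructs that this sketch rejects.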

We evaluate the grammar against more than one million unique formulas extracted from the well-known EUSES and Enron spreadsheet datasets, successfully parsing 99.99% of them. Additionally, we use the grammar to analyze these datasets and measure how frequently language features are used in spreadsheet formulas.

Finally, we identify smelly constructs and edge cases in the syntax of formulas.

Sample

This is the syntax diagram of the Formula production rule, with most of its production rules expanded.

Publication

2015, 15th IEEE International Working Conference on Source Code Analysis and Manipulation, September

Full article

A grammar for spreadsheet formulas evaluated on two large datasets