Authors
Patrick Koch
Abstract
Increasing availability and popularity of end user programming environments has enabled a vast number of people to craft custom solutions for their computation needs. Electronic spreadsheets in particular are commonly used as a programming tool for computerized calculation and modelling. While spreadsheets often fulfil critical tasks, they are mostly developed by domain experts with little programming experience. Those users are seldom aware of quality issues in their spreadsheet programs that may nevertheless lead to faults with significant adverse effects.
In recognition of this problem, researchers have worked over the last decades to introduce a variety of techniques to avoid, find, and fix quality issues and faults within spreadsheets. In particular, combinations of visualization and static analysis approaches like spreadsheet smells demonstrate promising results by providing easy to process feedback for users about detected problems. Prominent, established approaches are, however, limited in their effectiveness due to a number of shortcomings. They do not infer existing structural information that is inherent to spreadsheet programs in a consistent manner, make seldom use of such structural properties, and are generally constrained to focus on a limited set of detectable issues.
In the course of this thesis, we therefore present a selection of measures that address the stated drawbacks:
- (1) We set up a formal language for describing common spreadsheet properties and introduce an algorithmic procedure that infers structures of different granularities that serve as proxy to measure inherent structural characteristics. We then outline guidelines to use these proxies to formulate beneficial measures, and demonstrate their use by formulating novel measures as well as revisions of existing quality measures.
- (2) We design and discuss a three-pronged conceptual approach for augmenting contemporary spreadsheet systems with the assistance of structural information. We present the application of such concepts alongside the introduction of Fritz, our research tool for static analysis and visualization of spreadsheets. The tool implements the proposed structure inference process and allows measurement of a variety of structure-based and common metrics. It provides structural informations via graphical representation embedded in a spreadsheet grid, context-relevant measurements in form of auxiliary tables, and warnings about common quality issues by means of graphical indicators.
- (3) We establish a systematic approach that uses specific spreadsheet measurements to train machine learning (ML) models for the task of predicting faulty cells in spreadsheets. We first evaluate the approach utilizing smell metrics and simple classifiers. Based on this groundwork, we then generalize the process to include a wider range of metrics, to support a variety of ML models, and to use more datasets for evaluation. During the final analysis, based on a custom formalized catalog of 64 spreadsheet metrics, the method is able to correctly identify more than 70% of existing faults within three tested spreadsheet collections.
In summary, this work improves means of safeguarding spreadsheet quality and correctness by introducing well-defined measures of structural spreadsheet properties and demonstrating the beneficial uses thereof. Such structure metrics can also be used in a presented methodology for fault prediction in spreadsheets that shows promising results and lends itself to inclusion in common spreadsheet processors.
Sample
We have an example spreadsheet which calculates a car loan. The Value View, the default representation of cells within a spreadsheet environment, is given on top. Based on this perspective, the example makes a straightforward impression.
However, when investigating the Formula View, as given in the bottom, it becomes apparent that the end result depends on a mesh of interwoven and in part inconsistent calculations.
Such calculation structures are hardly uncommon and emphasize the need for sophisticated tools that assist end users in maintaining the quality and correctness of their spreadsheets.
Given the circumstances outlined above, it is of little surprise that spreadsheets, even when used for critical tasks, are often error prone.
[Note: To further emphasise the point, even this simple example contains an error: The Value View and the Formula View for column E are inconsistent.]
Publication
2019, Ph.D thesis, University of Klagenfurt, September
Full article
For good measure: Structural measures for the improvement of spreadsheet quality and correctness