20 December 2017
With all the hype around deep learning neural nets and machine learning algorithms, it's easy to forget that most of the work in data science involves accessing and preparing data for analysis. Not all data is Kaggle-ready. In reality, data is often far from perfect.
Do your consultant (and budget) a favor and follow these rules of thumb when using spreadsheets to collect and organize your data:
- Do not rely on spreadsheet formatting to indicate associations in your data.
- Never merge spreadsheet cells.
- Always use Data Validation tools for data entry.
- Never (ever!) delete rows of data just because you want them excluded from the analysis.
- Create a key that explains each column of data in a table.
- Preserve the integrity of the data by separating the data from the analysis.
- Use a fixed spreadsheet template and collect data in a series of spreadsheet files (rather than a series of tabs in a file).
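To make a few of these rules concrete, here is a minimal Python sketch of what they might look like on the analysis side. It assumes each spreadsheet is exported to CSV from the same fixed template. The column names, the file names, and the `exclude` flag column (one possible way to leave rows out without deleting them) are all hypothetical and are not taken from the full article.

```python
import csv
import io

# Hypothetical column key: every column in the template gets an explanation.
COLUMN_KEY = {
    "sample_id": "unique identifier for the sample",
    "site": "collection site code",
    "weight_g": "sample weight in grams",
    "exclude": "'yes' if the row should be left out of the analysis, else 'no'",
}


def load_sheet(text, source_name):
    """Read one CSV export and check its header against the fixed template."""
    reader = csv.DictReader(io.StringIO(text))
    if reader.fieldnames != list(COLUMN_KEY):
        raise ValueError(f"{source_name}: columns do not match the template")
    rows = []
    for row in reader:
        row["source"] = source_name  # record which file each row came from
        rows.append(row)
    return rows


# Two hypothetical files that share the same template (inlined as strings
# here so the sketch runs without any files on disk).
sheet_a = "sample_id,site,weight_g,exclude\n1,A,10.5,no\n2,A,99.0,yes\n"
sheet_b = "sample_id,site,weight_g,exclude\n3,B,11.2,no\n"

all_rows = load_sheet(sheet_a, "site_a.csv") + load_sheet(sheet_b, "site_b.csv")

# The collected data stays intact; the analysis works on a filtered copy.
analysis_rows = [r for r in all_rows if r["exclude"] == "no"]
mean_weight = sum(float(r["weight_g"]) for r in analysis_rows) / len(analysis_rows)
```

Because every file follows the same template, combining them is a simple concatenation, and the flagged row is still available in `all_rows` if anyone later asks why it was left out.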
Full article: 7 rules for spreadsheets and data preparation for analysis and machine learning