Take a look at the title of this article by the Associated Press (link):
“Britain’s missing 16,000 coronavirus cases thought linked to use of Excel spreadsheets for tracking”.
And here’s why: “‘Glitch’ believed to be connected to ‘well-known’ file size limitations on popular spreadsheet software”.
The popular spreadsheet software in question is Excel. And we’re talking about a technical glitch that caused nearly 16,000 new coronavirus infections to go unreported. That’s major.
How could that be? How could anyone track anything meaningful using Excel, and how could it be a surprise that data went missing?
Are companies storing data in Excel? The same mutable Excel that anyone can open, edit freely, and then occasionally accidentally close, leaving nary a trace that anything was changed?
In this week’s Tech Tuesday I will talk about my take on Excel and why it should not be used as a database.
Excel – A Love / Hate Relationship
I’ve had a long love/hate relationship with Excel, since first using it at Abbott Labs along with macros in Visual Basic, in an attempt to replace a very robust, expensive analysis tool called Matlab. It sort of worked. But my work wasn’t locked down in such a way that I was confident it would be there the next morning, just as I’d left it.
Data are like young children; they need to be monitored closely or they’ll disappear! Sometimes forever.
Back then, at Abbott, the only way to be really sure no one would (or could) edit the data stored in my Excel files was to save the data they contained in a different format altogether. I usually opted for HTML files because they "showed" the data in tables, visually similar to Excel, but couldn't be casually opened and edited the way a spreadsheet can.
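That kind of read-only snapshot is trivial to produce today. Here's a minimal sketch using pandas (the table contents and filename are hypothetical, just standing in for the Excel data described above):

```python
import pandas as pd

# Hypothetical results table standing in for data that would
# otherwise live in a mutable Excel file.
results = pd.DataFrame({
    "sample_id": [101, 102, 103],
    "measurement": [4.2, 3.9, 4.5],
})

# Write a static HTML snapshot: it renders as a table much like Excel,
# but nobody can double-click it open and silently change a cell.
results.to_html("results_snapshot.html", index=False)
```

The snapshot preserves what the data looked like at a point in time, which was exactly the goal back then.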
But keeping a bunch of Excel (or other types of) files as a proxy for real data storage, especially without backups and version control, is a horrible idea. Yet I still see companies do it all the time.
Let’s look at one reason why this type of data storage can be such a bad idea. Take for example the case where data were piped directly into Excel from a web service that collects purchase data.
Those data would likely be a mess, full of unclean values in need of attention before you could glean any insights. Think, for example, of missing values or duplicate entries.
Hence, there needs to be a process to “clean” the data before analysis. Excel provides a horrible foundation for this type of work. What’s worse, it’s very painful to replicate work done in Excel the next time the analysis needs to be run. In other words, cleaning or preparing data for analysis in Excel represents the opposite of a robust repeatable process!
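For contrast, here's what a repeatable cleaning step looks like outside of Excel. This is a minimal sketch in pandas; the raw purchase data below is invented for illustration, but the pattern (a function you can rerun on every new data pull) is the point:

```python
import pandas as pd

# Hypothetical raw purchase data as it might arrive from a web service:
# one duplicated row and one missing amount.
raw = pd.DataFrame({
    "order_id": [1, 1, 2, 3, 4],
    "amount":   [9.99, 9.99, None, 24.50, 5.00],
})

def clean(df: pd.DataFrame) -> pd.DataFrame:
    """Repeatable cleaning pass: drop exact duplicates, then drop
    rows with no recorded amount."""
    return df.drop_duplicates().dropna(subset=["amount"])

cleaned = clean(raw)
```

Run the same function next week on next week's data and you get the same treatment, documented in code. Try capturing that in a chain of manual Excel edits.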
On top of that, Excel has frightfully small limits, not only on the number of rows (a hard cap of 1,048,576 per worksheet) but also on the types of processing it's capable of doing. Frankly, it's simply not a data storage or robust analysis tool for any kind of scalable use.
I believe that Excel often works best at the end of a data pipeline when there is clean data to be graphed or, dare I say, analyzed. Using it as a storage or early analysis tool when data aren’t in good condition, is a mistake.
For these reasons, I haven’t used Excel professionally for meaningful data analyses in nearly two decades, yet I have to interface with it nearly every week because of its widespread use.
On the one hand, Excel is an indispensable tool for a wide array of tasks and general question-answering. On the other hand, its utility and the small learning curve are exactly why companies tend to overuse it, sometimes leading to horrible data loss or invalid analyses – as was the case in Britain.
There are many easy-to-use, mature database solutions available right now that come with only a slightly steeper learning curve than Excel. Some good examples are Google's BigQuery and Amazon's Redshift.
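To give a flavor of the idea without any cloud setup, here's a minimal sketch using SQLite, which ships with Python. It's not one of the warehouses named above, but it illustrates the same shift: data lives in a real database, and every question you ask of it is an explicit, repeatable query (the table and figures are invented for illustration):

```python
import sqlite3

# An in-memory SQLite database standing in for a real data store.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE purchases (order_id INTEGER PRIMARY KEY, amount REAL)"
)
conn.executemany(
    "INSERT INTO purchases VALUES (?, ?)",
    [(1, 9.99), (2, 24.50), (3, 5.00)],
)
conn.commit()

# Unlike a shared spreadsheet, this computation is explicit,
# repeatable, and can't be broken by someone nudging a cell.
total = conn.execute("SELECT SUM(amount) FROM purchases").fetchone()[0]
```

The schema also enforces structure (a primary key, typed columns) that a spreadsheet simply never will.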
What do you think? Am I being too hard on this handy tool, or are you with me on my fight against spreadsheet data science? 🙂
In case you hold the former view, Coursera offers a highly-rated course called Introduction to Data Analysis Using Excel.
Hit reply; this is quite a charged topic and I’d love to hear your thoughts on it.
Have a good week!
Data Science Rap Video? Try Not to Love This!
Who thought data scientists could rap? Here’s an amazing rap battle between a data scientist and a classical statistician, arguing for predictive algorithmic models versus inferential data models respectively. This is seriously good. Thanks for the suggestion, MM.
The Simpsons is the Best Television Show to Take Your English Vocabulary to the Next Level
With 32 seasons entertaining millions of people, The Simpsons is considered one of the greatest TV shows of all time. But have you ever thought that this show could help take your English vocabulary to the next level? It can, especially if you're learning English as a second language. In this article, data science enthusiast Frank Andrade makes the case by comparing the vocabulary used in the dialogue of 83 popular TV shows, including Friends, Game of Thrones, The Big Bang Theory, Grey's Anatomy and more.
Data Quality: Garbage in Garbage Out
Fixing typos and the like is a $4.5 billion market. Poor data quality leads to bad decision-making and lost business opportunities. Gartner regularly surveys the cost of bad data, and for years it has remained above $10 million per enterprise, even as companies spend around $200,000 annually on data quality tools. Read more about this here.