The hero of this story is a fictional character, but any resemblance to reality is NOT accidental. The idea for this post came from a beer (well, beers) with friends, where we stumbled onto the topic of data munging (data preparation, by its birth name) and how much misery it can cause. Laughing, we tried to outbid each other with wilder and wilder data quality monsters, the kind that had us tearing our hair out on past projects.
We reached a point where we could see the Data-Frankenstein materialising in front of our slightly clouded eyes. Here are the most important attributes of the monster, so that the unfortunates who run into it can recognize it:
- Data tables split into multiple pieces, typically by year, but in the worst cases by month
- Tables without headers, or with the header repeated every 50 rows
- The number of columns in each extract is almost random
- Judging by the value sets, the order of the columns is mixed up
- The delimiter appears inside the free-text field, so the content shifts to the wrong columns when the data is imported into the analytical tool
- The formats of obvious columns (like date or financial fields) are totally different
- The end of critically important fields (such as ID) is cut off
- Records are duplicated as the result of a well-meaning join
- There are characters in the free text field we have never seen before
- The apostrophe used to quote a text value is missing from one end of the text
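To make a few of these symptoms concrete, here is a minimal Python sketch that deals with three of them: repeated header rows, mixed date formats and duplicated records. The sample extract, the column names and the list of date formats are all made up for illustration; a real monster will need its own list, ideally confirmed with whoever produced the extract.

```python
import csv
import io
from datetime import date, datetime

# Hypothetical extract: semicolon-delimited, header repeated mid-file,
# three different date formats, and one duplicated record.
RAW = """id;signup_date;amount
1001;2015-03-01;12.50
1002;01/04/2015;7.00
id;signup_date;amount
1002;01/04/2015;7.00
1003;20150502;3.25
"""

# Assumed formats, tried in this order. Day-first vs. month-first is a guess
# that has to be verified against the source system.
DATE_FORMATS = ("%Y-%m-%d", "%d/%m/%Y", "%Y%m%d")


def parse_date(text: str) -> date:
    """Try each known format in turn and fail loudly on anything else."""
    for fmt in DATE_FORMATS:
        try:
            return datetime.strptime(text.strip(), fmt).date()
        except ValueError:
            continue
    raise ValueError(f"unrecognized date: {text!r}")


def clean(raw: str) -> list[tuple[str, date, float]]:
    """Return typed records without repeated headers or exact duplicates."""
    reader = csv.reader(io.StringIO(raw), delimiter=";")
    header = next(reader)
    seen = set()
    rows = []
    for row in reader:
        if not row or row == header:  # skip blank lines and repeated headers
            continue
        record = (row[0], parse_date(row[1]), float(row[2]))
        if record in seen:  # drop exact duplicates
            continue
        seen.add(record)
        rows.append(record)
    return rows


rows = clean(RAW)
```

Failing loudly on an unknown date format is a deliberate choice here: a silent guess tends to hide exactly the kind of surprise this monster is made of.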
Here comes the question: what should you do when you come across a freak like this? There are several options:
- Run! – It is the first thing that comes to mind, but a brave data fella does not flinch.
- Return it to the sender! – Ask for a version that is at least acceptable. If possible, get the database dump file, or get close to the colleague doing the extract; you might save yourself a few unpleasant tasks.
- Fight it! – In the end you will have no choice but to roll up your sleeves and do it. It is not easy, but after a few nerve-racking data cleaning cases you will have scripts and procedures that turn the beast into a harmless kitty. Through this struggle you will get to know the data, and your brain will already be working on the reports and variables you can create from the domesticated Frankenstein.
- Do not overdo it! – It is surprisingly easy to sink into this kind of work, so always keep your eyes on the target and do not spend time fixing fields that are useless from the business point of view.
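As an example of the kind of script that comes out of such a fight, here is a small sketch for the shifted-column problem caused by a delimiter inside the free text. It rests on two assumptions that will not always hold: there is exactly one free-text column, and no other column ever contains the delimiter. The function name and the sample line are hypothetical.

```python
def repair_shifted_row(fields: list[str], n_columns: int, text_index: int,
                       delimiter: str = ";") -> list[str]:
    """Merge surplus fields back into the free-text column at text_index.

    Assumes the free-text column is the only one that can contain the
    delimiter, so every extra field must belong to it.
    """
    surplus = len(fields) - n_columns
    if surplus < 0:
        raise ValueError("row has too few fields; it cannot be repaired here")
    if surplus == 0:
        return fields
    end = text_index + surplus + 1
    merged = delimiter.join(fields[text_index:end])
    return fields[:text_index] + [merged] + fields[end:]


# Hypothetical row: id, date, free-text note, amount (4 columns expected).
line = "1001;2015-03-01;called twice; no answer; retry;12.50"
fixed = repair_shifted_row(line.split(";"), n_columns=4, text_index=2)
```

Rows with too few fields are rejected rather than padded, because a short row usually means something else is broken, such as a line break inside the text.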
P.S.: We thought about creating a dataset like this artificially, but realized that it is actually a complex task. After all, the systems that produce these data monsters took many years to develop.
Hiflylabs creates business value from data. The core of the team has been working together for 15 years, currently with more than 50 passionate employees.