Data-Frankenstein: Data Quality horror stories


The hero of this story is a fictional character; any resemblance to reality is NOT accidental. The idea for this post came from a beer (well, beers) with friends, when we got onto the topic of data munging (data preparation, by its birth name) and how much misery it can cause. Laughing, we tried to outbid each other with wilder and wilder data quality monsters, the kind that had made us tear our hair out on past projects.

We reached a point where we could see the Data-Frankenstein materialising in front of our (slightly clouded) eyes. Here are the monster's most important attributes, so that those unfortunate enough to run into it can recognize it:

  • Data tables split into multiple pieces, typically by year, but in the worst cases by month
  • Tables without headers, or with the header repeated every 50 rows
  • The number of columns in each extract is almost random
  • Judging by the value sets, the order of the columns is mixed up
  • The delimiter also appears inside the free-text field, so the content is shifted when the data is imported into the analytical tool
  • The formats of obvious columns (like date or financial fields) are wildly inconsistent
  • The end of critically important fields (such as ID) is cut off
  • As a result of a well-meaning join, the records are duplicated
  • There are characters in the free text field we have never seen before
  • The apostrophe (quote character) is missing from one end of a text value
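A few of these symptoms can be spotted mechanically before anyone opens the file by hand. The following is only a minimal sketch in Python, not a complete detector: the function name `profile_extract`, the semicolon delimiter and the assumption that the first non-empty row is the header are all ours, made up for illustration.

```python
import csv
import io
from collections import Counter


def profile_extract(text, delimiter=";"):
    """Report a few symptoms of a messy delimited extract (illustrative only)."""
    rows = [row for row in csv.reader(io.StringIO(text), delimiter=delimiter) if row]
    if not rows:
        return {"column_counts": {}, "repeated_headers": 0, "duplicate_rows": 0}

    header = rows[0]  # assumption: the first non-empty row is the header
    body = rows[1:]

    # The header showing up again in the body means it was embedded every N rows.
    repeated_headers = sum(1 for row in body if row == header)
    data_rows = [tuple(row) for row in body if row != header]

    # Rows whose field count differs from the rest hint at stray delimiters.
    column_counts = Counter(len(row) for row in data_rows)

    # Exact duplicates, for example after a well-meaning join.
    duplicate_rows = len(data_rows) - len(set(data_rows))

    return {
        "column_counts": dict(column_counts),
        "repeated_headers": repeated_headers,
        "duplicate_rows": duplicate_rows,
    }


# Hypothetical sample: one shifted row, one embedded header, one duplicate.
sample = (
    "id;name;comment\n"
    "1;Ann;ok\n"
    "2;Bob;late; again\n"
    "id;name;comment\n"
    "1;Ann;ok\n"
)
report = profile_extract(sample)
```

On this sample the report flags one row with four fields instead of three, one repeated header and one duplicate record. Real monsters will need more checks (encodings, truncated IDs, unbalanced quotes), but even a crude profile like this tells you how bad the patient is.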

So here comes the question: what do you do when you come across a freak like this? There are several options:

  1. Run! – It is the first thing that comes to mind, but a brave data fella does not flinch.
  2. Return it to the sender! – Ask for a version that is at least acceptable. If possible, get the database dump file, or get close to the colleague doing the extract; you might save yourself a few unpleasant tasks.
  3. Fight it! – In the end you will have no choice but to roll up your sleeves and do it. It is not easy, but after a few nerve-racking data cleaning cases you will have scripts and procedures that turn the beast into a harmless kitty. Through this struggle you get to know the data, and your brain will already be working on the reports and variables you can build from the domesticated Frankenstein.
  4. Do not overdo it! – It is surprisingly easy to sink into this kind of work, so always keep your eyes on the target and do not spend time fixing fields that are useless from a business point of view.
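To give a flavour of the "scripts and procedures" mentioned in option 3, here is a small, hedged sketch in Python. It drops embedded header rows, sets aside rows with the wrong number of columns, normalizes one date column and removes duplicates. The helper names `parse_date` and `clean_rows` and the list of date formats are assumptions for illustration; your own monster will dictate the real ones.

```python
from datetime import datetime

# Assumed formats; ambiguous inputs such as 03/04/2021 are read as month/day.
DATE_FORMATS = ("%Y-%m-%d", "%d.%m.%Y", "%m/%d/%Y")


def parse_date(value):
    """Try each known format in turn; return None if nothing matches."""
    for fmt in DATE_FORMATS:
        try:
            return datetime.strptime(value.strip(), fmt).date()
        except ValueError:
            continue
    return None


def clean_rows(rows, header, date_column):
    """Return (cleaned, rejected) rows; rejected ones need a human look."""
    date_idx = header.index(date_column)
    seen = set()
    cleaned, rejected = [], []
    for row in rows:
        if row == header:
            continue  # header embedded in the middle of the data
        if len(row) != len(header):
            rejected.append(row)  # probably shifted by a stray delimiter
            continue
        fixed = list(row)
        parsed = parse_date(row[date_idx])
        fixed[date_idx] = parsed.isoformat() if parsed else None
        key = tuple(fixed)
        if key in seen:
            continue  # duplicate once the date is normalized
        seen.add(key)
        cleaned.append(fixed)
    return cleaned, rejected


# Hypothetical input rows, already split into fields.
header = ["id", "date"]
rows = [
    ["1", "2020-01-31"],
    ["id", "date"],
    ["1", "31.01.2020"],
    ["2", "01/15/2021"],
    ["3", "soon", "x"],
    ["4", "n/a"],
]
cleaned, rejected = clean_rows(rows, header, "date")
```

Note that the script only touches the one column we asked for, in the spirit of option 4: unparseable dates become `None` rather than being guessed, and rows that do not fit are parked in `rejected` instead of being repaired at any cost.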

P.S.: We thought about creating a dataset like this artificially but realized that it is actually a complex task. After all, the systems that create these data monsters took many years to develop.

Hiflylabs creates business value from data. The core of the team has been working together for 15 years, currently with more than 50 passionate employees.

Data: Power or Democracy?


We are witnessing an interesting battle in huge enterprises.

Managers at various levels realized some time ago that controlling access to their company's data assets is the best way to strengthen their own positions. Many have also recognized that it is worth taking the lead when it comes to data warehouses and analytical systems. By being the source of the initiative, they can shape the data structures according to their own perspectives. Not only does this help them manage the important processes, it also gives them an advantage, because sooner or later others will turn to them for information. (Not many of them reckoned with the extra workload this means for their departments, which, regardless of their original role, become data providers.)

Data has become a factor of power in the eyes of many managers.

On the other hand, workflows have become more data intensive than ever before, even for those who work at the bottom of the organizational hierarchy. Most modern organizations have recognized that it is worthwhile to extend the rules of democracy to the usage of data and to allow their employees to access it. As a result, workflows need fewer steps and decisions are made more quickly. What is more, employees become more motivated because they see the general goals more clearly. Noticing this trend, all significant Business Intelligence (BI) suppliers have moved towards self-service systems. Experts can get answers to more and more complex problems through an interface that is easy to understand and to operate, and so data as a factor of power has started to slip out of the hands of overbearing managers. Interestingly, the start-ups in the field of Big Data have set similar goals. They provide access to Big Data datasets through a simple interface, so that no special knowledge is required to analyse the data.

I believe that the democratization of data is irreversible. Although it is still possible to play politics cleverly within the rules of democracy, doing so calls for different tools than the era of absolute monarchies did…


The picture was taken by the Royal Navy.