
Typical days in data science: data preparation

It is cool to be a data scientist.

A short introduction does not make it clear that this attractive-looking job is, for the most part, donkey work, and the reason is data preparation.


As one survey published in Forbes shows, 76% of data scientists enjoy this kind of task the least, yet they spend around 80% of their time on it. Interestingly, the ratio was similar 10 years ago, when we wrote about it in our book released back then.

Because we spend so much time on data cleansing, we have created special expressions of our own to describe it: data digging, massaging, play-doh-ing, plucking… Still, I know only a handful of people who left the profession, got bored, or started to hate working with data because of this. We love cooking, so we peel the vegetables as well. Besides, the quality of data cleansing matters a lot. The final result often depends on this phase: the withered parts have to be cut out while the delicious bites are processed.

Meanwhile, the technology that supports data analysis is developing at a dizzying pace. Many of these developments target data cleansing, aiming to make it less time-consuming and give data experts more time for the actual analysis. Venture capital flows to startups that concentrate on Big Data interpretation, and naturally the giants of the data industry are working on their own solutions as well.

So can we hope that the split of working time, 80% data cleansing and 20% analysis, will become a thing of the past? I doubt it, and I do not expect any significant change in the next few years.

Tools that support data cleansing are going to get better and better. However, this will lead us to involve data sources that we would not even think about using today. Faster road vehicles did not only reduce the time we spend travelling; they also allowed us to reach more distant destinations.

There is one more area that promises to take over the data cleansing phase: there are initiatives to use artificial intelligence (AI) for data interpretation. Of course, AI is gaining ground in more and more fields; for example, within a few years there will be less need for drivers.

There will always be chefs, even when machines help them peel the vegetables. And there will be data analysts as well, with more and more useful tools to support their work.

Hiflylabs creates business value from data. The core of the team has been working together for 15 years, currently with more than 50 passionate employees.

The picture is owned by Cathy Scola.

Data! Scared?

Having made my living from working with data for 15 years, I have frequently been asked lately: “Should we be afraid of the trend that is becoming more and more obvious from every direction, that the era of data is coming (or is it already here)?”


Biased for natural and understandable reasons, I look at this new “megatrend” with enthusiasm rather than fear, and, knowing the reality, I can only laugh at the sensationalist tabloid news.

I find it appropriate that, following an initiative of the German government, everybody calls this phenomenon “Industry 4.0”, that is, the fourth industrial revolution. After the steam engine, mass production and robotics, it is now the exchange of data between production lines and supply chains, and the optimization built on that exchange, that leads to considerable improvements in efficiency.

The term is appropriate because it shows that humanity has experienced something similar before: although it led to significant changes in society, on the whole it brought development. It is also appropriate because it does not underestimate the importance of these changes and does not dismiss the Big Data phenomenon as the craziness of “geeks”.

The development of technology has made the production and processing of data way cheaper than ever before. It has opened up new opportunities (pressures?) for more and more industries to deal with data. While dealing with data used to be the privilege of telecommunication and financial corporations, nowadays the use of self-generated and third-party data sources determines the competitiveness not only of the aforementioned manufacturing companies but also of companies in trading, logistics, media, and even agriculture.

To conclude, I can only stress this: let us not be afraid of data, but get the most out of it, because this revolution will have its losers as well…
