Typical days in data science: data preparation

It is cool to be a data scientist.

A short introduction does not make it clear that this attractive-looking job is, for the most part, donkey work because of data preparation.


As one survey published in Forbes shows, 76% of data scientists enjoy this kind of task the least, yet they spend around 80% of their time on it. It’s interesting that this ratio was similar 10 years ago, when we wrote about it in our book released back then.

While spending that much time on data cleansing, we have created special, unique expressions to describe it: data digging, massaging, play-doh-ing, plucking… Still, I know only a handful of people who left the profession, got bored or started to hate working with data because of this. We love cooking, so we do the vegetable peeling as well. Besides, the quality of data cleansing matters a lot. The final result often depends on this phase: the withered parts have to be cut out while the delicious bites are being processed.

Meanwhile, the technology that supports data analysis is developing at a dizzying pace. Many developments target data cleansing, aiming to make it less time-consuming and to give data experts more time for the actual analysis. Venture capital flows to startups that concentrate on Big Data interpretation, but naturally the giants of the data industry are also working on their own solutions.

So can we hope that the ratio of 80% data cleansing to 20% analysis is going to disappear? I doubt it, and I do not expect any significant change within the next few years.

Tools that support data cleansing are going to get better and better. However, this will lead us to involve data sources that we would not even think about using today. Faster road vehicles did not only reduce the time we spend traveling; they also allowed us to reach more distant destinations.

There is one more area that promises to take over the data cleansing phase: there are initiatives to use artificial intelligence (AI) for data interpretation. Of course, AI is setting foot in more and more fields; for example, within a few years there will be less need for drivers.

There will always be chefs, even when machines help them with vegetable peeling. And there will be data analysts as well, with more and more useful tools to support their work.

Hiflylabs creates business value from data. The core of the team has been working together for 15 years, currently with more than 50 passionate employees.

The picture is owned by Cathy Scola. (www.flickr.com)

Dangerous little elephant?


The elephant is said to be the most dangerous animal on the African savanna. Nevertheless, the icon of Hadoop, the quasi-standard Big Data processing tool, is a cute little elephant (by the way, the technology was named after a stuffed animal belonging to the son of Doug Cutting, one of the inventors of Hadoop). Yet, in my experience, most corporations are cautious about, or even afraid of, this new technology.

However, Hadoop-based systems can provide computing capacity at a fraction of the cost of traditional suppliers.

Why do companies not jump at this opportunity? After many conversations and a few pilot projects, I came to the following conclusions:

  • Access management

With a little exaggeration, Hadoop allows basically anybody to access any data. Or one may have no access at all. Companies prefer more sophisticated access management than this.

  • Operational management

For most IT operational management teams, keeping track of the versions of their current technology is a challenge in itself. On top of that, they are afraid of a system that operates differently. Unlike products that come with the usual supplier support, Hadoop is mostly free but comes with no support. (Cloudera and Hortonworks, which, by the way, also have Hungarian developer teams, do provide support services, but not at prices that fit Hungarian IT budgets.)

  • User interface

Currently, specialized knowledge is necessary to upload and query data using Hadoop. Many are working on an easy-to-use interface for Hadoop, and it is probably for this very reason that no standard has yet emerged from the different solutions.

Nowadays, traditional companies get seriously interested in Hadoop only when circumstances force them to change their current technology. Without such pressure, Hadoop usually stays in the “interesting experimentation” category. Only the most committed companies are exceptions.

Meanwhile, the big database vendors (Microsoft, Oracle, IBM, SAP, SAS, etc.) are all working on taming the Hadoop elephant, and I think they will be successful.


Hadoop is 10 years old. I am convinced that by the time it reaches adolescence, it will understand the world of business much better and will change it as well…



What does a cow drink?

What does a cow drink? Milk?

What does a data analyst eat? Pie chart?


The more important data analysis gets, the more companies want to put together a team of data analysts. To hire and, more importantly, to keep good data analysts, it does not hurt to know that they are primarily motivated by “interesting projects”. Pie charts will not be enough.

There is a growing demand for data scientists around the world. The USA has only half as many data scientists as the market demands. Moreover, in 2016 data scientist became the most wanted job on Glassdoor. The same effect can be felt in Hungary: it is becoming harder and harder to find the right people for certain positions.

A good data analyst is not only open to business problems but also willing to “get his hands dirty” by writing queries and analytical code. The true data analyst is not a “geek” or a “scientist” (in the strict meaning of the word) but someone who understands which results are most effective in improving the business.

I don’t know exactly what the situation is around the world, but here in Hungary, in my experience, the most important skill is business understanding. If a candidate is receptive to technology, then learning it is much easier than improving business skills.

A good data scientist wants to see his or her work drive action and hates working in vain. An international corporation hired a lot of data scientists in Hungary with the promise of interesting projects, but ended up losing most of its team because the projects arrived from California more slowly than expected. A lot of people also left the Budapest team of an online service company that switched from a B2C to a B2B strategy, which reduced the importance of data analysis.

A lot depends on a creative atmosphere, so putting together and motivating a small team of data analysts is a particularly hard challenge. If the workload shifts from business-critical analysis toward “prosaic” reporting tasks, the valuable employees will consider leaving the company for something more interesting.

Given a fair level of income, the secret to keeping the “data ninjas” lies not in money but in diverse and inspiring projects.


Data: Power or Democracy?


We are witnessing an interesting battle in huge enterprises.

Managers at different levels have long realized that they can best strengthen their positions if they are the ones who control access to their company's data assets. Many have also recognized that it is worth taking the lead when it comes to data warehouses and analytical systems. By being the source of the initiative, they can shape the data structures according to their own perspectives. Not only does this help them manage the important processes, but they also gain an advantage because sooner or later others will turn to them for information. (Not many have calculated the extra workload this means for their departments, which, regardless of their original role, will become data providers.)

Data has become a factor of power in the eyes of many managers.

On the other hand, workflows have become more data-intensive than ever before, even for those who work at the bottom of the organizational hierarchy. Most modern organizations have recognized that it is worthwhile to extend the rules of democracy to the use of data and to allow their employees to access it. As a result, workflows require fewer steps and decisions are made more quickly. What is more, employees become more motivated as they see the general goals more clearly. Noticing this trend, all significant Business Intelligence (BI) suppliers have moved towards self-service systems. Experts can get answers to more and more complex problems through an interface that is easy to understand and operate, so data as a factor of power has started to slip out of the hands of overbearing managers. Interestingly, the start-ups in the field of Big Data have set similar goals. They provide access to Big Data datasets through a simple interface, so that no special knowledge is required to analyse data.

I believe that the democratization of data is irreversible. Although it is possible to play politics cleverly within the rules of democracy, doing so requires different tools than in the era of absolute monarchies…


The picture was taken by the Royal Navy.