
Data enrichment


In practical data mining, it is a common experience that there is far more room to improve forecasts by introducing new data aspects than by changing the type of model used. This is sometimes simplified as: "more data, better forecasts". More data here does not mean more gigabytes, but new perspectives that let us describe clients' behavior in depth. For example, to predict expected purchases, we might include data from the customer service system rather than building on past purchases alone.
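In practice, this kind of enrichment often comes down to joining a new data source onto the existing feature table. A minimal sketch with pandas, where the table and column names (client_id, tickets_last_year, etc.) are purely illustrative assumptions:

```python
import pandas as pd

# Hypothetical illustration: enrich purchase history with
# customer-service data (all names and values are assumptions).
purchases = pd.DataFrame({
    "client_id": [1, 2, 3],
    "past_purchases": [5, 2, 9],
})
service = pd.DataFrame({
    "client_id": [1, 2, 3],
    "tickets_last_year": [0, 4, 1],
})

# A left join keeps every client from the purchase data and adds
# the new behavioral aspect as an extra feature column.
features = purchases.merge(service, on="client_id", how="left")
print(features)
```

The left join matters: clients missing from the new source stay in the training table instead of silently disappearing.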

This is data enrichment, and it is quite onerous work.

On the one hand, you have to forget the rule which says: "a data scientist's job is to make the best forecast out of the given data". Instead, you have to think about new data options that open up new perspectives.

On the other hand, significant resources have to be invested in data preparation. After all, if valuable data were easy to get, we would probably have it already. Therefore we must be careful about which data sources we intend to use: it can easily take a tremendous amount of time to acquire a source, merge it with our existing data, and finally adapt it for analysis.

Regarding data enrichment, getting data from the internet may seem an obvious solution at first sight. Why is this option so often preferred? Surprisingly, internal company data is often much harder to access: it has to be requested from other units, order documents have to be written, legal department approvals gathered, and so on. Purchasing data from professional data services is often not an option either, because of the price, the difficulty of the procurement process, or simply the lack of appropriate data.

Public data, which can be obtained easily, is, on the contrary, the Wild West itself. The challenges of that route deserve a separate post.

I personally would prefer the middle ground. I think the best form of data enrichment would be different companies sharing data with each other. This is not impossible even legally, provided data protection legislation is complied with. Personal data protection can be respected if the shared data never refers to individual clients but to micro segments. Micro segments are formed along categories such as geo-demographic factors (age, gender, education, etc.), income, and social status.
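Forming micro segments is essentially an aggregation step: the client level is dropped before anything leaves the company. A minimal pandas sketch, where the segment categories and figures are invented for illustration:

```python
import pandas as pd

# Hypothetical client-level data (names and values are assumptions).
clients = pd.DataFrame({
    "age_band": ["18-30", "18-30", "31-45", "31-45"],
    "gender":   ["F", "F", "M", "M"],
    "invoice":  [40.0, 50.0, 80.0, 90.0],
})

# Aggregating to micro segments removes the client level entirely:
# only per-segment averages and counts would be shared.
segments = (
    clients.groupby(["age_band", "gender"], as_index=False)
           .agg(avg_invoice=("invoice", "mean"),
                clients=("invoice", "size"))
)
print(segments)
```

In a real agreement, segments with very few clients would also be suppressed, since a segment of one is effectively client-level data again.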

For example, a utility could share average invoice values, a telecommunications company mobile data usage, or a bank card transaction statistics. In return for this data, they could charge their partners. I have already encountered such agreements on the market, but only on a pilot basis.

What kind of data would you gladly share? And what kind of data would you pay for in return to make your work more effective?

Hiflylabs creates business value from data. The core of the team has been working together for 15 years, currently with more than 70 passionate employees.


Typical days in data science: data preparation

It is cool to be a data scientist.

What a short introduction does not reveal is that this attractive-looking job is, for the most part, donkey work: data preparation.


As a survey published in Forbes shows, 76% of data scientists enjoy these kinds of tasks the least, yet they spend around 80% of their time on them. Interestingly, this ratio was similar 10 years ago, when we wrote about it in the book we released back then.

While spending that much time on data cleansing, we have created special, unique expressions to describe it: data digging, massaging, play-doh-ing, plucking… Still, I know only a handful of people who left the profession, got bored, or started to hate working with data because of it. We love cooking, so we do the vegetable peeling as well. Besides, the quality of data cleansing matters a lot. The final result often depends on this phase: the withered parts have to be cut out while the delicious bites are being prepared.
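Cutting out the "withered parts" usually means a handful of unglamorous steps repeated over and over: deduplication, dropping missing values, filtering out impossible records. A tiny sketch with pandas, where the column names and cleaning rules are assumptions for illustration:

```python
import pandas as pd

# Hypothetical raw extract with typical blemishes: a duplicated
# client, a missing value, and an impossible negative amount.
raw = pd.DataFrame({
    "client_id": [1, 2, 2, 3, 4],
    "monthly_spend": [120.0, None, 95.0, -30.0, 80.0],
})

clean = (
    raw.drop_duplicates(subset="client_id", keep="last")  # keep latest record per client
       .dropna(subset=["monthly_spend"])                  # cut out missing values
       .query("monthly_spend >= 0")                       # cut out impossible negatives
)
print(clean)
```

Each rule here is a judgment call on the specific dataset; the point is that these calls, not the model, often decide the quality of the final result.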

Meanwhile, the technology that supports data analysis is developing at a dizzying pace. Many developments target data cleansing to make it less time-consuming and give data experts more time for the actual analysis. Venture capital flows to startups that concentrate on Big Data interpretation, and naturally the giants of the data industry are working on their own solutions as well.

So can we hope that the 80% data cleansing to 20% analysis ratio of working time will disappear? I doubt it, and I do not expect any significant change within the next few years.

Tools that support data cleansing will keep getting better. However, this will lead us to involve data sources we would not even consider using today. Faster road vehicles did not just mean less time spent traveling; they also allowed us to reach farther destinations.

There is one more area that promises to take over the data cleansing phase: initiatives to use artificial intelligence (AI) for data interpretation. AI is, of course, gaining ground in more and more fields; within a few years, for example, there will be less need for drivers.

There will always be chefs, even when machines help them with vegetable peeling. And there will be data analysts as well, with more and more useful tools to support their work.


The picture is owned by Cathy Scola.