Data enrichment


In practical data mining, it is a common experience that there is much more room to improve forecasts by introducing new aspects than by changing the type of model used. This is sometimes simplified to: “more data, better forecasts”. More data here does not mean more gigabytes, but new perspectives that let us describe clients’ behavior in depth. For example, if we want to predict expected purchases, we may include data from the customer service system rather than build only on past purchases.

This is data enrichment, and it is quite onerous work.

On the one hand, you have to forget the rule that says: “a data scientist’s job is to make the best forecast out of the given data”. Instead, you have to think about new data sources that bring in new aspects.

On the other hand, significant resources have to be invested in data preparation, because if valuable data were easy to get, we would probably already have it. Therefore, we must be careful with the data sources we intend to use. It can easily take a tremendous amount of time to acquire a data source, merge it with our present data, and finally adapt it for analysis.
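The merge step can be sketched in a few lines. This is only an illustration under assumed data: the client IDs, the field names and the `enrich` helper are hypothetical. It also reports the match rate, since an external source that covers only a fraction of our clients may not be worth the acquisition effort.

```python
# Hypothetical data: client IDs and field names are made up for illustration.
purchases = {
    "c1": {"past_purchases": 12},
    "c2": {"past_purchases": 3},
    "c3": {"past_purchases": 7},
}
# Hypothetical external source, e.g. an export from a customer service system.
service_tickets = {
    "c1": {"complaints": 2},
    "c3": {"complaints": 0},
    "c9": {"complaints": 5},  # no matching client in our present data
}


def enrich(base, extra):
    """Left-join `extra` onto `base` by key and report the share of matched keys."""
    enriched = {}
    matched = 0
    for key, row in base.items():
        new_row = dict(row)  # copy so the original data stays untouched
        if key in extra:
            new_row.update(extra[key])
            matched += 1
        enriched[key] = new_row
    match_rate = matched / len(base) if base else 0.0
    return enriched, match_rate


enriched, match_rate = enrich(purchases, service_tickets)
print(enriched["c1"])  # {'past_purchases': 12, 'complaints': 2}
print(f"{match_rate:.0%} of clients matched")  # 67% of clients matched
```

Clients without a match (here `c2`) keep only their original fields, so the adaptation step still has to decide how to treat the missing values.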

When it comes to data enrichment, getting data from the internet may seem the obvious solution at first sight. Why is this solution preferred? Surprisingly, internal company data is often much more difficult to access (it has to be requested from other units, order documents have to be written, legal department approvals gathered, etc.). Purchasing data from professional data services is often not possible because of the price, the difficulty of the procurement process, or simply the lack of appropriate data.

Public data, which can be obtained easily, is by contrast the Wild West itself. A separate post would be needed to discuss the challenges of choosing this solution.

I personally prefer the middle ground. I think the best form of data enrichment would be for different companies to share their data with each other. This is not impossible, even legally, while complying with data protection legislation. Personal data protection requirements can be met if the shared data is never client-level but refers to micro segments. Micro segments are formed from categories such as geo-demographic factors (age, gender, education, etc.), income, and social status.

For example, a utility could share the average invoice value, a telecommunications company the mobile data usage, or a bank the card transactions. In return for this data, they could charge their partners. I have already encountered such agreements on the market, but only on a pilot basis.
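A minimal sketch of what sharing at micro-segment level could look like for the utility example, assuming made-up records and hypothetical field names. The minimum segment size is my own assumption, added so that very small segments do not effectively expose individual clients. It is not a legal threshold, and the sketch says nothing about what a given jurisdiction actually requires.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical client-level records of a utility; all values are made up.
invoices = [
    {"age_band": "30-39", "gender": "F", "invoice": 40.0},
    {"age_band": "30-39", "gender": "F", "invoice": 50.0},
    {"age_band": "30-39", "gender": "F", "invoice": 60.0},
    {"age_band": "60-69", "gender": "M", "invoice": 80.0},
]

MIN_SEGMENT_SIZE = 3  # assumed threshold for illustration only


def share_by_segment(records, min_size=MIN_SEGMENT_SIZE):
    """Aggregate client-level invoices into micro segments, dropping tiny ones."""
    groups = defaultdict(list)
    for record in records:
        segment = (record["age_band"], record["gender"])
        groups[segment].append(record["invoice"])
    return {
        segment: {"clients": len(values), "avg_invoice": mean(values)}
        for segment, values in groups.items()
        if len(values) >= min_size
    }


shared = share_by_segment(invoices)
print(shared)  # {('30-39', 'F'): {'clients': 3, 'avg_invoice': 50.0}}
```

The single-client segment is suppressed, so only segment-level figures leave the company.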

What kind of data would you gladly share? And what kind of data, which could make your work more effective, would you pay for in return?

Hiflylabs creates business value from data. The core of the team has been working together for 15 years, currently with more than 70 passionate employees.


I don’t like average, I vote for median

Attention! Statistical topic coming! Attention! It’s so easy to interpret that there’s no excuse to skip it!


We live in the era of data: it is mass-produced, quite easy to access, and can be processed extremely fast. In this accelerated process, we should not lose sight of the purpose of the KPIs created from the data. Are we sure that the KPIs and statistical measures that looked practical 20-30 years ago are still the best choice?

Let’s have a look at the most popular and the trickiest statistical indicator, the simple average. The general view is that the average salary describes well the typical salary of a population. However, the average is usually much higher than the more representative median. To understand what the median means, think about a group of kids lined up by height. The height of the kid standing in the middle of the line is the median height of the group. By the same logic, the median salary in Hungary is the point at which half of the population earns more and half earns less. Based on EUROSTAT data, the median salary in Hungary in 2016 was 4772 EUR, while the average salary was 5397 EUR. We would like to believe that the “average customer” earns 5397 EUR, but unfortunately 4772 EUR is much more accurate. For skewed data such as salaries, the median tends to be lower than the average, but it also tends to be truer.
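The lining-up picture translates almost directly into code. The sketch below uses made-up heights, and the function name `median_by_lining_up` is only illustrative. For an even-sized group it takes the midpoint of the two middle values, which is the usual convention and also what Python's built-in `statistics.median` does.

```python
from statistics import median


def median_by_lining_up(values):
    """Sort the values (line up the kids) and pick the one in the middle."""
    ordered = sorted(values)
    n = len(ordered)
    if n == 0:
        raise ValueError("the median of an empty group is undefined")
    mid = n // 2
    if n % 2 == 1:
        return ordered[mid]  # odd group size: a single kid stands in the middle
    return (ordered[mid - 1] + ordered[mid]) / 2  # even: midpoint of the two middle kids


heights_cm = [128, 121, 135, 124, 131]  # made-up heights for illustration
print(median_by_lining_up(heights_cm))  # 128
print(median(heights_cm))               # 128, the standard library agrees
```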

We see much bigger differences in enterprise datasets, especially if the calculation is based on fewer clients or a wider range of values. For example, suppose we have 100 businesses in a database, each with a typical income of around 1 million USD. If a company with an income of 100 million USD enters the dataset, the average nearly doubles just because of that one firm! Still, the group is better represented by the median, which remains 1 million USD.
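The arithmetic is easy to check. The sketch below simplifies the example by assuming that every one of the 100 businesses has an income of exactly 1 million USD.

```python
from statistics import mean, median

# Simplifying assumption: all 100 businesses earn exactly 1 million USD.
incomes = [1_000_000] * 100
incomes_with_giant = incomes + [100_000_000]  # one 100 million USD company joins

mean_before = mean(incomes)
mean_after = mean(incomes_with_giant)
median_before = median(incomes)
median_after = median(incomes_with_giant)

print(mean_before, round(mean_after))  # 1000000 1980198
print(median_before, median_after)
```

The total income doubles from 100 to 200 million USD while the number of firms grows only from 100 to 101, so the average lands just under 2 million USD. The middle of the sorted list is still a 1 million USD business.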

Another example: how do you decrease the average call time of a call center employee who makes 100 calls of 30 seconds each plus a single one-hour call? That one call pushes the average to about 65 seconds. A crafty analyst who knows the nature of the average would suggest leaving out the single one-hour call, and so the target is reached! But did the employee’s performance change? No, because the median call time remained the same: 30 seconds.
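To see the trick in numbers, here is a small sketch that assumes exactly 100 calls of 30 seconds and one call of 3600 seconds. These figures are my reading of the example, not measured data.

```python
from statistics import mean, median

# Assumed call durations in seconds: 100 short calls and one hour-long call.
calls = [30] * 100 + [3600]
calls_without_long_one = [c for c in calls if c != 3600]  # the analyst's "fix"

mean_with = mean(calls)
mean_without = mean(calls_without_long_one)
median_with = median(calls)
median_without = median(calls_without_long_one)

print(round(mean_with, 1), mean_without)  # 65.3 30
print(median_with, median_without)        # 30 30
```

Dropping a single record more than halves the average, while the median does not move at all.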

Keep in mind how fragile the average is: a single data quality error or extreme value is enough to distort it.

The average became more popular than the median because it is easy to calculate from the sum of the values and the number of records, so it was easy to handle long before the era of computers and databases.
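The computational difference can be sketched as follows. A running average needs only two numbers of state, whereas the exact median in this sketch needs all values kept and sorted. The class name `RunningMean` and the sample values are hypothetical.

```python
from statistics import median


class RunningMean:
    """Tracks an average using only a running total and a record count."""

    def __init__(self):
        self.total = 0.0
        self.count = 0

    def add(self, value):
        self.total += value
        self.count += 1

    def value(self):
        if self.count == 0:
            raise ValueError("no records yet")
        return self.total / self.count


values = [4, 8, 6]  # made-up records arriving one by one
running = RunningMean()
for v in values:
    running.add(v)  # each record can be discarded right after this call

print(running.value())  # 6.0
print(median(values))   # 6, but computing it required the full list of values
```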

But we now live in the era of databases, when it is worth using the more representative median in business reports to describe our clients.

Hiflylabs creates business value from data. The core of the team has been working together for 15 years, currently with more than 50 passionate employees.