GDPR in real life – experiences of a GDPR-readiness project from a data scientist’s perspective


With slight exaggeration, alarm bells have been ringing for years at data management companies: 25 May 2018 is the date when GDPR, the General Data Protection Regulation of the EU, enters into force. After getting familiar with the topic, the focus should now be on implementation. Depending on company size, some enterprises have to work out major changes in both their data management and data analysis processes. So what kind of tasks and challenges does a company face to make its data warehouses GDPR-ready? I'm going to tell the story using one of our clients as an example, a company that started to prepare for GDPR more than a year ago.

What kind of tasks does GDPR-readiness generate from the data asset's point of view? First, the data elements that need to be protected should be assessed, and the related data management processes should be reviewed. It is not surprising in such cases that some processes turn out to need refinement, or do not exist yet and have to be established.

Only after settling these questions could we start the data warehousing work. We created a central meta database that describes where personal data can be found, which data management processes it is involved in and for what purposes it is used. In the meta database we also documented the above-mentioned data management processes, along with the parameters specified for each data element.

During GDPR preparation, what is the most obvious advice a lawyer specialized in data protection can give? Here is the answer of Dr. Gáspár Frivaldszky, a lead auditor at ABT: according to the principle of data minimization, the most obvious solution is to get rid of personal data. Anonymization is an excellent way to reach this goal, since GDPR does not apply to anonymized data – we call a dataset anonymized if the subjects are no longer identifiable. Anonymized data is not personal data. In this respect, the Personal Data Protection Commission of Singapore released a useful guide a couple of weeks ago.

The second big challenge was to figure out how to meet the GDPR requirements during anonymization without ruining our business goals at the same time. For example, masking a subject's mother's name may cause few problems in customer-based reports, but if the subject's transaction history or geo-demographic data becomes inaccessible, that might block an automated process, report or client segmentation.
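To make the masking step concrete, here is a minimal sketch (not our client's actual pipeline; the field names and the salted-hash approach are purely illustrative). Note that salted hashing is, strictly speaking, pseudonymization rather than full anonymization, so GDPR may still apply if re-identification remains possible:

```python
import hashlib

def mask_record(record, fields_to_mask, salt):
    """Replace direct identifiers with salted hashes; leave other fields intact.

    Illustrative only: salted hashing is pseudonymization, not full
    anonymization -- the data may still count as personal under GDPR.
    """
    masked = dict(record)
    for field in fields_to_mask:
        value = str(masked.get(field, ""))
        masked[field] = hashlib.sha256((salt + value).encode()).hexdigest()[:12]
    return masked

# Hypothetical customer record: identifiers are masked, the business
# attribute (monthly_spend) stays usable for reports and segmentation.
customer = {"name": "Jane Doe", "mothers_name": "Mary Doe", "monthly_spend": 120}
masked = mask_record(customer, ["name", "mothers_name"], salt="s3cret")
print(masked)
```

This is exactly the trade-off described above: the masked fields no longer identify the subject, while the fields the automated processes depend on remain intact.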

The next step is data cleansing. Here we also have to pay attention to business aspects. Data cleansing processes rely heavily on personal data, because we create client segments and joins based on it. Given anonymized and masked subjects, adjustments and modifications are needed in the data pairing algorithms. One possible solution is to "freeze" client segments and their attributes derived from subjects masked earlier. These groups and attributes then keep their previous state and no longer (or less frequently) participate in the daily data cleansing process.

To give a brief summary after several months of work: the GDPR process requires long preparation and an enormous amount of work at any company with at least thousands of customers. However, besides complying with the law, the outcome provides more transparent data management and processes, as well as more precise and structured data for the company. The data warehousing side of GDPR preparation has taught another important lesson: you have to take time to ensure that every participant speaks the same language – people on the business side, focusing on client service and sales targets; lawyers, dealing with new data management rules; and data scientists, responsible for the implementation.

Hiflylabs creates business value from data. The core of the team has been working together for 15 years, currently with more than 70 passionate employees.


Data enrichment


In practical data mining it is a common experience that there is much more room to improve forecasts by introducing new aspects of the data than by changing the type of model used. This is sometimes simplified as: "more data, better forecasts". More data here does not mean more gigabytes, but new perspectives that let us describe clients' behavior in depth. For example, if we want to predict expected purchases, we may include data from the customer service system, not just build on past purchases.

This is data enrichment. Quite an onerous job.

On the one hand, you have to forget the rule that says: "a data scientist's job is to make the best forecast out of the given data". Instead, you have to think about new data sources that open up new aspects.

On the other hand, significant resources must be invested in data preparation. After all, if valuable data were easy to get, we would probably already have it. Therefore, we must be careful with the data sources we intend to use: it can easily take a tremendous amount of time to acquire a data source, merge it with our existing data, and finally make it fit for analysis.

Regarding data enrichment, getting data from the internet may seem an obvious solution – at first sight. Why is this option often preferred? Surprisingly, internal company data is frequently much harder to access (it has to be requested from other units, order documents have to be written, legal approvals collected, etc.). Purchasing data from professional data services is often not possible because of the price, the difficulty of the procurement process – or simply the lack of appropriate data.

Public data, which can be obtained easily, is – on the contrary – the Wild West itself. A separate post is needed to discuss the challenges if we choose this solution.

I personally prefer the middle ground. I think the best form of data enrichment would be different companies sharing their data with each other. This is not impossible, even legally, as long as data protection legislation is complied with. Personal data protection can be ensured if the shared data is never client-level but refers to micro segments. Micro segments are formed along categories such as geo-demographic factors (age, gender, education, etc.), income, social status, and so on.

For example, a utility can share the average invoice value, a telecommunications company the mobile data usage, or a bank the card transactions. In exchange for this data, they could ask their partners for money. I have already encountered such agreements on the market, but only on a pilot basis.
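The aggregation behind such an exchange can be sketched in a few lines (the segment categories and invoice values below are made up for the illustration): client-level records go in, but only segment-level averages ever come out.

```python
from collections import defaultdict

# Hypothetical client-level records; in a real data exchange only the
# aggregates computed below would ever leave the company.
clients = [
    {"age_band": "30-39", "region": "North", "invoice": 52.0},
    {"age_band": "30-39", "region": "North", "invoice": 48.0},
    {"age_band": "40-49", "region": "South", "invoice": 71.0},
]

# Sum invoices and count clients per micro segment.
totals = defaultdict(lambda: [0.0, 0])
for c in clients:
    segment = (c["age_band"], c["region"])
    totals[segment][0] += c["invoice"]
    totals[segment][1] += 1

# Average invoice per micro segment -- no client-level data remains.
segment_avg = {seg: total / n for seg, (total, n) in totals.items()}
print(segment_avg)
```

In practice one would also enforce a minimum segment size before sharing, so that a "segment" of one client cannot leak an individual's value.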

What kind of data would you share gladly? What kind of data would you pay for in return, which could make your work more effective?



I don’t like average, I vote for median

Attention! Statistical topic coming! Attention! It's so easy to interpret that there's no excuse to skip it!


We live in the era of data: it is mass-produced, easy to access and can be processed extremely fast. In this accelerated process we should not lose focus on the purpose of the KPIs created from the data. Are we sure that the KPIs and statistical measures that looked practical 20-30 years ago are still the best fit?

Let's have a look at the most popular and trickiest statistical indicator, the simple average. The general view is that the average salary describes the typical salary of a population well. However, the average is usually much higher than the more representative median. To understand what the median means, think of a group of kids lined up by height. The height of the kid standing in the middle of the line is the median height of the group. Accordingly, the median salary in Hungary is the point at which half of the population earns more and half earns less. Based on EUROSTAT data, the median salary in Hungary in 2016 was 4772 EUR, while the average salary was 5397 EUR. We would like to believe that the "average customer" manages 5397 EUR, but unfortunately 4772 EUR is much more accurate. The median tends to be lower, but also truer, than the average.

We see much bigger differences in enterprise datasets, especially if the calculation is based on few clients or a wide value range. For example: we have 100 businesses in a database, each with an income around 1 million USD. If a company with an income of 100 million USD enters the dataset, the average almost doubles because of that one firm! Still, the group is better represented by the median, which is still 1 million USD.
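The effect is easy to verify with the standard library (using the illustrative numbers from the example above):

```python
import statistics

# 100 businesses, each with an income of 1 million USD.
incomes = [1_000_000] * 100
print(statistics.mean(incomes), statistics.median(incomes))  # both 1,000,000

# One 100-million-USD company joins the dataset.
incomes.append(100_000_000)
print(round(statistics.mean(incomes)))   # the mean almost doubles
print(statistics.median(incomes))        # the median is still 1,000,000
```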

Another example: how do you decrease the average call time of a call center employee who makes 100 calls of 30 seconds each, plus a single hour-long call that drags the average up? A crafty analyst who knows the nature of the average would suggest simply excluding that one hour-long call – and the target is reached! But did the employee's performance change? No, and the median call time shows it: it remained the same.
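The analyst's trick can be checked in a few lines (call durations are illustrative): dropping the single outlier moves the mean substantially while the median does not budge.

```python
import statistics

# 100 short calls of 30 seconds each, plus one hour-long call.
calls = [30] * 100 + [3600]

with_outlier = (statistics.mean(calls), statistics.median(calls))
without_outlier = (statistics.mean(calls[:100]), statistics.median(calls[:100]))

print("with outlier:   mean=%.1f  median=%d" % with_outlier)
print("without outlier: mean=%.1f  median=%d" % without_outlier)
# The mean drops sharply once the outlier is excluded; the median is 30s in both cases.
```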

It is worth remembering how fragile the average is to even a single data quality fault or extreme value.

The average became more popular than the median because it is easy to calculate from the sum of the values and the number of records; therefore it was easy to handle long before the era of computers and databases.

But we live in the era of databases, when it is worth using the more accurate median in business reports to describe our clients.


Data-Frankenstein: Data Quality horror stories


The hero of this story is a fictional character; any resemblance to reality is NOT accidental. The idea for this post came from a beer (well, beers) with friends, where we got onto the topic of data munging (data preparation by birth name) and how much misery it can cause. Laughing, we outbid each other with wilder and wilder data quality monsters that had made us tear our hair out during projects.

We reached the point where we could see the Data-Frankenstein materializing in front of our – slightly clouded – eyes. Here are the monster's most important attributes, to make it recognizable for those unfortunates who run into it:

  • Data tables split into multiple pieces, typically by year, but in the worst cases by month
  • Tables without headers, or with the header re-embedded every 50 rows
  • The number of columns in each extract is almost random
  • Judging by the value sets, the order of the columns is mixed up
  • The delimiter appears inside free text fields, so the content is shifted when importing the data into the analytical tool
  • The formats of obvious columns (like date or financial fields) are wildly inconsistent
  • The end of critically important fields (such as IDs) is cut off
  • As the result of a well-meaning join, records are duplicated
  • The free text fields contain characters we have never seen before
  • The apostrophe is missing from one end of a text value

Here comes the question: What to do when coming across a freak like this? There are multiple possibilities:

  1. Run! – It is the first thing that comes to mind, but a brave data fella does not flinch.
  2. Return it to sender! – Ask for a version that is at least acceptable. If possible, get the database dump file, or get close to the colleague doing the extraction – you might spare yourself a few unpleasant tasks.
  3. Fight it! – In the end you will have no choice but to roll up your sleeves and do it. It is not easy, but after a few nerve-racking data cleaning cases you will have scripts and procedures to turn the beast into a harmless kitty. Through this struggle you get to know the data, and your brain is already working on the reports and variables you will create with the domesticated Frankenstein.
  4. Do not overdo it! – It is surprisingly easy to sink into this kind of work, but always keep your eyes on the target and do not spend time fixing fields that are useless from the business point of view.
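One of the monsters above – the header re-embedded among the data rows – can, for instance, be tamed with a few lines (the sample extract is made up; real cases usually need a whole arsenal of such scripts):

```python
import csv
import io

# Hypothetical extract where the header row reappears among the data rows,
# as often happens with concatenated report exports.
raw = """id,name,amount
1,Alice,100
2,Bob,200
id,name,amount
3,Carol,300
"""

reader = csv.reader(io.StringIO(raw))
header = next(reader)
# Drop every row that is a verbatim repeat of the header.
rows = [row for row in reader if row != header]
print(rows)
```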

P.S.: We thought about creating a dataset like this artificially, but realized it is actually a complex task. After all, the systems creating these data monsters were developed over many years.


Sandboxes instead of walls


The mission of data warehouses (DWH) is to be the source of "undoubted" information. To maintain a "single version of truth", they operate under strict rules. However, this leads to the problem that they cannot adapt to a rapidly changing business environment. Therefore, most of the time it is essential to attach "explanations" or "corrections" to the reports created from the DWH, amending them with information from other sources. No wonder executive reports are usually created manually in MS Excel or PowerPoint: careful hands add important information from outside the regulated DWH to the DWH reports. The idea that "the DWH collects all information" is nothing but a utopia.

Therefore, it can be said that managing a company based only on "hard" data (i.e. standard reports) is not possible. However, decisions should not be based exclusively on "soft" data (i.e. data coming from unregulated channels) either.

The optimal solution is to take both sources into account.

If IT departments were willing to admit that both “hard” and “soft” worlds have solid grounds, they could support the union of the two by providing “sandboxes” for the various business units.

Business units would be able to load their own data into the sandbox without any regulation, and would also be able to access the DWH itself. No more need for data "downloads" or "exports" from the DWH, and no more need for makeshift tools – think of the vlookup function in Excel – to merge the two sources.
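A minimal sketch of the idea (table and column names are made up for the illustration): a DWH extract and a business unit's own data sit side by side in the sandbox and are merged with a server-side SQL join instead of desktop vlookups.

```python
import sqlite3

# An in-memory database stands in for the sandbox environment.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE dwh_clients (client_id INTEGER, revenue REAL)")
conn.execute("CREATE TABLE sandbox_notes (client_id INTEGER, campaign TEXT)")

# "Hard" data from the DWH and "soft" data loaded by the business unit.
conn.executemany("INSERT INTO dwh_clients VALUES (?, ?)",
                 [(1, 1200.0), (2, 800.0)])
conn.executemany("INSERT INTO sandbox_notes VALUES (?, ?)",
                 [(1, "spring_mailing")])

# One join replaces the manual vlookup between two exports.
rows = conn.execute("""
    SELECT d.client_id, d.revenue, s.campaign
    FROM dwh_clients d
    LEFT JOIN sandbox_notes s ON s.client_id = d.client_id
    ORDER BY d.client_id
""").fetchall()
print(rows)
```

The LEFT JOIN keeps every DWH client even when the business unit has no note for it, which is exactly the behavior a vlookup with missing matches only approximates.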

Clearly, sandboxes are server-side environments with better performance than desktop computers. Obviously, these environments have to be administered (CPU capacity, storage allocation, version upgrades...), but the IT department takes no responsibility for the content. Sandboxes could also include Big Data infrastructure, giving business users the opportunity to get familiar with unstructured data as well.

It is important to keep an eye on the work in the sandbox – both its technical and organizational aspects – to notice if different units are working on similar tasks. In such cases, coordination of these approaches is necessary.

There is also no harm in placing "professional" scheduler functions, similar to those of professional ETL tools, into the sandbox to support regularly executed commands. Running a process put together by business analysts every night or every Monday morning fits into the concept of sandboxes. Obviously, the maintenance and error handling of these processes are managed by the business users.

It must also be recognized that sandboxes come with data security and access management risks. As the data structure is not as strict as that of a DWH, access management is also looser. Ensuring, for example, that a user can access only the data relevant to his or her region is hard to resolve. On the other hand, there is less chance that data ends up outside the regulated areas – attached to emails, on shared drives, etc.

If the sandbox is created and operated well, the big moment will come. Success – as I interpret it – is when polished data manipulation processes are promoted by the business units into the DWH. This demonstrates that a pure purpose overcomes the power-oriented aspect: a business unit shares its results with the whole organization.

With this, corporate data assets are enriched by something really valuable. And explorers can take another step further.


The picture was created by Johan Eklund.