GDPR in real life – experiences of a GDPR-readiness project from a data scientist’s perspective


With slight exaggeration, alarm bells have been ringing for years at data management companies: 25 May 2018 is the date when the GDPR, the General Data Protection Regulation of the EU, enters into force. After getting familiar with the topic, the focus should now be on implementation. Depending on company size, some enterprises have to work out major changes in both their data management and data analysis processes. So what kinds of tasks and challenges does a company face to make its data warehouses GDPR-ready? I am going to tell the story using one of our clients as an example, a company that started to prepare for GDPR more than a year ago.

What kinds of tasks does GDPR readiness generate from the data asset’s point of view? First, the data elements that need to be protected should be assessed, and the related data management processes should be reviewed. It is not surprising in such cases that some processes turn out to need refining, or have to be established in the first place.

Only after settling these questions could we start the data warehousing work. We created a central meta database that describes where personal data can be found, which data management processes it is involved in and what purposes it is used for. In the meta database we also described the above-mentioned data management processes and recorded the parameters used for each data element.
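As an illustration only – the article does not describe the actual meta database schema – one record of such a catalog might look like the sketch below. Every class, field and table name is an assumption made for the example.

```python
from dataclasses import dataclass, field

@dataclass
class PersonalDataElement:
    """One entry of a (hypothetical) GDPR meta database: where a
    personal data element lives and why it is being processed."""
    table: str                  # physical location in the warehouse
    column: str
    category: str               # e.g. "identification", "transaction data"
    legal_basis: str            # e.g. "consent", "contract"
    purposes: list = field(default_factory=list)
    retention_days: int = 0     # how long the data may be kept

catalog = [
    PersonalDataElement("crm.customer", "mothers_name", "identification",
                        "legal obligation", ["customer identification"], 3650),
    PersonalDataElement("dwh.transactions", "account_id", "transaction data",
                        "contract", ["billing", "reporting"], 2920),
]

# A typical lookup: which columns are processed for a given purpose?
billing_cols = [(e.table, e.column) for e in catalog if "billing" in e.purposes]
```

Such a catalog makes questions like “where does this purpose touch personal data?” answerable with a simple query instead of a manual audit.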

During GDPR preparation, what is the most trivial piece of advice a lawyer specialized in data protection can give? Here is the answer of Dr. Gáspár Frivaldszky, a lead auditor at ABT: following the principle of data minimization, the most obvious solution is to get rid of personal data. Anonymization is an excellent way to reach this goal, since the GDPR does not apply to anonymized data. We call a dataset anonymized if the subjects are no longer identifiable; anonymized data is not personal data. In this respect, the Personal Data Protection Commission of Singapore released a useful guide a couple of weeks ago.

The second big challenge was figuring out how to meet the GDPR requirements during anonymization without ruining our business goals at the same time. For example, masking a subject’s mother’s name causes few problems in customer-based reports, but if the subject’s transaction history or geo-demographic data becomes inaccessible, that might break an automated process, report or client segmentation.
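As a sketch of this trade-off, the snippet below irreversibly masks a field that analytics do not need, while replacing the customer ID with a salted hash so that transaction history can still be joined. Note the hedge: salted hashing is pseudonymization rather than full anonymization in the GDPR sense, since the tokens remain linkable; all field names and the salt handling are illustrative assumptions, not our client’s actual method.

```python
import hashlib

SALT = b"rotate-and-store-separately"  # illustrative; keep it out of the warehouse

def pseudonymize(customer_id: str) -> str:
    """Salted hash: the same input always maps to the same token,
    so joins and segmentation keep working without the raw ID."""
    return hashlib.sha256(SALT + customer_id.encode()).hexdigest()[:16]

def mask(value: str) -> str:
    """Irreversible suppression for fields with no analytic value."""
    return "***"

record = {"customer_id": "C-1001", "mothers_name": "Jane Doe", "amount": 129.0}
anonymized = {
    "customer_id": pseudonymize(record["customer_id"]),  # still joinable
    "mothers_name": mask(record["mothers_name"]),        # not needed for analytics
    "amount": record["amount"],                          # reporting unaffected
}
```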

The next step is data cleansing, where we also have to pay attention to business aspects. Data cleansing processes rely heavily on personal data, because we create client segments and joins based on it. With anonymized and masked subjects, the data pairing algorithms need adjustments. One possible solution is to “freeze” the client segments and attributes derived from subjects masked earlier. These groups and attributes then keep their previous state and no longer (or less frequently) participate in the daily data cleansing process.
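One way the “freezing” idea could be sketched is below: masked subjects keep their last computed segment instead of re-entering the daily reclassification. The classifier, the customer data and all names are hypothetical.

```python
def refresh_segments(customers, frozen, classify):
    """Daily cleansing step: re-classify active customers, but keep
    the frozen (masked) ones at their last known segment."""
    segments = {}
    for cid, attrs in customers.items():
        if cid in frozen:
            segments[cid] = frozen[cid]   # frozen: reuse last segment, no recompute
        else:
            segments[cid] = classify(attrs)
    return segments

# Hypothetical classifier and data
classify = lambda attrs: "premium" if attrs["spend"] > 1000 else "standard"
customers = {"A": {"spend": 1500}, "B": {"spend": 200}, "M1": {"spend": 0}}
frozen = {"M1": "premium"}  # M1 was masked; its spend field is no longer meaningful

result = refresh_segments(customers, frozen, classify)
```

The point of the design: “M1” stays “premium” even though its masked attributes would now classify it as “standard”, so downstream reports built on the segment remain stable.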

To give a brief summary after several months of work: the GDPR process requires long preparation and enormous work at any company with at least thousands of customers. However, besides compliance with the law, the outcome is more transparent data management and processes, as well as more precise and structured data for the company. The data warehousing part of the GDPR preparation also taught us another important lesson: you have to take the time to ensure that every participant speaks the same language: people on the business side, focusing on correct client service and sales targets; lawyers, dealing with the new data management rules; and data scientists, responsible for the implementation.

Hiflylabs creates business value from data. The core of the team has been working together for 15 years, currently with more than 70 passionate employees.


Sandboxes instead of walls


The mission of data warehouses (DWH) is to be the source of “undoubted” information. To maintain a “single version of truth”, they operate under strict rules. However, this means they cannot adapt to a rapidly changing business environment, so most of the time it is essential to attach “explanations” or “corrections” to the reports created from the DWH, amending them with information from other sources. No wonder that executive reports are usually created manually in MS Excel or PowerPoint: careful hands add important information from outside the regulated DWH to the DWH reports. The concept that “the DWH collects all information” is nothing but a utopia.

Therefore, managing a company based only on “hard” data (i.e. standard reports) is not possible. However, decisions should not be based exclusively on “soft” data (i.e. data coming from unregulated channels) either.

The optimal solution is to take both sources into account.

If IT departments were willing to admit that both “hard” and “soft” worlds have solid grounds, they could support the union of the two by providing “sandboxes” for the various business units.

Business units would be able to load their own data into the sandbox without any regulation, and would also be able to access the DWH. No more need for data “downloads” or “exports” from the DWH, and no more need for makeshift tricks – think of the VLOOKUP function in Excel – to merge the two sources.
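Assuming, purely for illustration, a pandas-capable sandbox, the server-side equivalent of that Excel VLOOKUP could look like the sketch below; the table contents and column names are made up.

```python
import pandas as pd

# DWH extract: regulated data, read-only from the sandbox's point of view
dwh_sales = pd.DataFrame({
    "region_id": [1, 2, 3],
    "revenue": [120, 340, 90],
})

# The business unit's own, unregulated data loaded into the sandbox
own_targets = pd.DataFrame({
    "region_id": [1, 2, 3],
    "target": [100, 300, 150],
})

# A server-side join instead of exporting to Excel and doing a VLOOKUP
report = dwh_sales.merge(own_targets, on="region_id", how="left")
report["attainment"] = report["revenue"] / report["target"]
```

The merge happens where the data lives, so nothing has to leave the regulated environment to be combined with the unit’s own numbers.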

Clearly, sandboxes are server-side environments with better performance than desktop computers. These environments obviously have to be administered (CPU capacity, storage allocation, version upgrades…), but the IT department does not take responsibility for the content. Sandboxes could also include Big Data infrastructure, giving business users the opportunity to get familiar with unstructured data as well.

It is important to keep an eye on the work in the sandbox – both its technical and organizational aspects – to notice when different units are working on similar tasks. In such cases, coordinating these efforts is necessary.

It is also a perfectly reasonable idea to place “professional” scheduler functions, similar to those of professional ETL tools, into the sandbox to support regularly executed jobs. Running a process put together by business analysts every night or every Monday morning fits into the concept of sandboxes. Obviously, the maintenance and error handling of these processes are managed by the business users.
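A crude sketch of such a schedule, assuming a Python-based sandbox; a real ETL scheduler (or plain cron) would of course do this job in practice, and the function below only computes when the next run is due.

```python
import datetime as dt

def seconds_until(hour, weekday=None, now=None):
    """Seconds until the next run: daily at `hour`, or weekly if
    `weekday` (0 = Monday) is given. A stand-in for a real scheduler."""
    now = now or dt.datetime.now()
    run = now.replace(hour=hour, minute=0, second=0, microsecond=0)
    if weekday is None:
        if run <= now:                       # today's slot already passed
            run += dt.timedelta(days=1)
    else:
        run += dt.timedelta(days=(weekday - run.weekday()) % 7)
        if run <= now:                       # this week's slot already passed
            run += dt.timedelta(days=7)
    return (run - now).total_seconds()

# The sandbox "scheduler" is then just a loop around the analyst's process:
# while True:
#     time.sleep(seconds_until(6, weekday=0))   # every Monday at 06:00
#     run_business_unit_process()
```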

It must also be recognized that sandboxes come with data security and access management risks. As the data structure is not as strict as that of a DWH, access management is also looser; it is hard to guarantee, for instance, that a user only accesses data relevant to their own region. On the other hand, there is less chance that data ends up outside the regulated areas, attached to emails, on shared drives, etc.

If the sandbox is created and operated well, the big moment will come. Success – as I interpret it – is when polished data manipulation processes are promoted by the business units into the DWH. This demonstrates that a shared purpose overcomes the power-oriented attitude: a business unit shares its results with the whole organization.

With this, corporate data assets are enriched by something really valuable. And explorers can take another step further.


The picture was created by Johan Eklund. (www.flickr.com)

Data: Power or Democracy?


We are witnessing an interesting battle in huge enterprises.

Managers at different levels have long realized that if they are the ones who access the company’s data assets, then they can strengthen their positions best. Many have also recognized that it is worth taking the lead when it comes to data warehouses and analytical systems: by being the source of the initiative, they can shape the data structures according to their own perspectives. Not only does this help them manage the important processes, but they also gain an advantage, because sooner or later others will turn to them for information. (Few calculated the extra workload this means for their departments: regardless of their original role, they become data providers.)

Data has become a factor of power in the eyes of many managers.

On the other hand, workflows have become more data-intensive than ever before, even for those at the bottom of the organizational hierarchy. Most modern organizations have recognized that it is worthwhile to extend the rules of democracy to the usage of data and to allow their employees to access it. Thus, fewer steps are required in the workflows and decisions are made more quickly. What is more, employees become more motivated as they see the overall goals more clearly. Noticing this trend, all significant Business Intelligence (BI) vendors have moved towards self-service systems. Experts can get answers to more and more complex problems through an interface that is easy to understand and operate – data as a factor of power has started to slip out of the hands of overbearing managers. Interestingly, start-ups in the field of Big Data have set similar goals: they provide access to Big Data datasets through a simple interface, so no special knowledge is required to analyse data.

I believe that the democratization of data is irreversible. Although it is possible to use politics in a clever way within the rules of democracy, different tools are necessary than in the era of absolute monarchies…


The picture was taken by the Royal Navy.