On the 27th of April 2016, the GDPR was adopted, and it was then published in the Official Journal of the European Union on the 4th of May 2016. The clock started on the 24th of May 2016, counting down to the 25th of May 2018, when the regulation takes the full force of law in all Member States of the European Union (EU).
The first thing to clarify is that the GDPR is a European regulation, not a directive. The difference is as follows:
- Directive: defines what the EU Member States shall implement in their legislation. A directive is received and implemented in many different flavours by each Member State, as each government sees fit.
- Regulation: a legal instrument that replaces any conflicting legislation on the same matter in all EU Member States. It has the force of law without any government intervention by the EU Member States.
This difference is substantial, and yet, as late as March 2018, many Security Officers told me that the 25th of May was not that important, because “…anyway our government has to legislate to implement it so that it can take effect”. The feeling is that most companies are in denial about the GDPR coming into force and bringing fines of up to €20 million for those found non-compliant.
Once they understand that it is happening, and in just a month, the denial shifts to what is necessary to become compliant, with companies trying to find any possible excuse not to intervene properly.
This new form of denial is far more dangerous, because it introduces false confidence about how the legal text must be interpreted.
In this paper, we start a very practical and pragmatic journey through those parts of the legal instrument that affect software architecture. I will avoid all IT infrastructure issues, leaving them for a separate study, as they merit their own analysis due to the complexity of making an IT infrastructure GDPR compliant.
While reading this paper, there is one thing that must be in your mind every time you ask “do I really have to do this?”: imagine being in a courtroom, on the stand, being sued for having breached the GDPR, and the plaintiff’s attorney asks you the big question: “Did you or did you not implement the State of the Art in Data Protection? Could you or could you not, in your professional opinion, have done more to prevent the data leakage?”. Remember that in a court of law you cannot lie unless you want to bear heavy consequences, and the GDPR is all about implementing and maintaining the State of the Art in the field of Data Protection in our solutions. Also remember that the plaintiff will hire a consultant to prove that you did not implement that State of the Art in your solution, so you had better think twice every time you are tempted to dismiss a data protection standard.
2 Data Protection
When we talk about Data Protection, we often do not stop to think about the actual meaning of it, nor its relationship with Privacy. Moreover, there is a tendency to protect the data after a breach, and this approach is no longer acceptable under the GDPR.
2.1 Privacy Vs Data Protection
The first notion we must digest is the difference between Privacy and Data Protection. There is often some confusion about these concepts, so the best option I have is to start our journey by clarifying them both.
2.1.1 Privacy
We, Data Subjects, have a right to privacy, granted by various legal instruments in many different flavours depending on the jurisdiction, but the overall takeaway for us is that Privacy is a Right. All the legal instruments establishing this fundamental right, from International Conventions down to local by-laws, have tried to guarantee in some form the protection of our privacy.
2.1.2 Data Protection
There is only one way to guarantee our right to privacy: protecting our data. This is a simple concept, easy to digest: Data Protection is the means by which Privacy is guaranteed.
Now, a means is normally realised by using tools, and that applies in our case. Data Protection is implemented in software architecture and design by selecting tools, patterns and standards that facilitate Data Protection; where a tool is missing, we create one.
2.2 Data Protection by Design and by Default
This mandate established by Article 25 of the GDPR is the most crucial piece of legislation for us to understand and fully digest. This is also where almost all business actors I talked with (product owners, account managers, CEOs and so on) tend to look for some escape route. Let us dig into the first phrase of Art. 25(1):
Taking into account the state of the art, the cost of implementation and the nature, scope, context and purposes of processing as well as the risks of varying likelihood and severity for rights and freedoms of natural persons posed by the processing, the controller shall…
Do we all notice, in the same way, the part that reads: “…the cost of implementation…”? Now, the mantra business people keep reciting sounds like this: “It costs too much to protect these data using the state of the art of technology! The GDPR itself says that!” Ok, that is why we have recitals in all complex legislation: to help us understand those passages that could lead to misinterpretation of the law and consequent fines. Recital 26 clarifies this point about the costs to consider, and it makes sense from a technical perspective as well:
To ascertain whether means are reasonably likely to be used to identify the natural person, account should be taken of all objective factors, such as the costs of and the amount of time required for identification, taking into consideration the available technology at the time of the processing and technological developments.
As we can see, the costs to consider are those representing the effort for an attacker to “identify the natural person”. We, technical people, use this approach all the time, and our goal is always to ensure that breaking our security measures is so costly and requires so much effort that no person in his/her right mind will ever even start trying to break in. Recital 83 uses a simpler form, as do Art. 25(1), Art. 17(2) and Art. 32(1).
Naturally, as is clear to the reader, I am playing devil’s advocate here. Certainly, it makes sense to consider the actual cost of implementing the necessary measures to protect the data, but this applies to balancing the choices: reducing cost by choosing, among the adequate levels of security, one that can be implemented at the lowest possible cost.
Nowhere in the GDPR are we given the choice to opt out of protecting the data because, in our view, it costs too much: such an attitude would cost the company heavily should it be sued for non-compliance.
So, now that we have clarified that pain-point, let us understand what the meaning of Data Protection “by Design and by Default” is.
2.2.1 By Design
The wording of Art. 25 is all about establishing by design all the necessary measures to protect personal data. This clarifies that, before discussing any implementation, the design of the solution must preemptively consider how to protect the data, specifically by applying the Six Principles established in Article 5(1):
- Lawfulness, fairness and transparency
- Purpose Limitation
- Data minimisation
- Accuracy
- Storage limitation
- Integrity and confidentiality
The technical means to achieve such compliance most often mentioned in the GDPR are:
- pseudonymisation
- encryption
- security of processing
We, the Architects, are tasked with architecting and designing our solutions to implement those principles, not moving forward unless we are satisfied that we did all we could, given the context and the state of the art of technology.
2.2.2 By Default
Stating “by default” seems redundant, but it is another very important concept that the legislator has established.
In our design, all behaviours must default to the strictest rules of Data Protection. Quite often in the past, the default behaviour in software was represented by the minimum set, the basics. Now we are called to ensure that if we don’t know, or don’t have enough information, at any point in our solution and its processes, then we apply the strictest rules, so as to guarantee that no data will ever be processed or leaked by mistake.
The Privacy by Design approach is characterized by proactive rather than reactive measures. It anticipates and prevents privacy invasive events before they happen. PbD does not wait for privacy risks to materialize, nor does it offer remedies for resolving privacy infractions once they have occurred – it aims to prevent them from occurring. In short, Privacy by Design comes before-the-fact, not after. 
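The “by default” rule above can be sketched in code. This is a minimal illustration, assuming a hypothetical consent store and field names; the point is that the fallback for any unknown case is the strictest option, not a permissive “basic” set:

```python
# Sketch of "Data Protection by Default": when we have no information about a
# subject's consent, we default to processing nothing at all.
# CONSENTED_FIELDS and its keys are illustrative assumptions, not a real API.
CONSENTED_FIELDS = {
    "subject-1": {"first_name", "last_name"},
}

def fields_allowed(subject_id):
    # Unknown subject -> empty set: the strictest possible default,
    # rather than some permissive minimum set of fields.
    return CONSENTED_FIELDS.get(subject_id, set())

assert fields_allowed("subject-1") == {"first_name", "last_name"}
assert fields_allowed("unknown-subject") == set()  # default: process nothing
```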
3 Data Processing
A special mention is necessary for processing the data. Let us first make clear what processing means, especially when law interpretation comes into play.
If I note a person’s name and address on a piece of paper, I am processing that person’s data. If I take that note and put it in my wallet, I am again processing those data. When it comes to computing, the notion is easy, taking the definition from the Oxford dictionary:
Operate on data by means of a program.
Any operation involving the data in any way constitutes Data Processing, not just the use of those data in a computational algorithm. The GDPR stresses this in multiple passages: we must limit data processing to the strict minimum necessary to carry out the operations that the data subject has authorised.
In principle, if for a given operation I need just first name and last name, I am supposed to retrieve only those from the database, not a full record including email and address: I am supposed to create a data view that satisfies that minimalistic data access.
This is because the more data we move around in memory, the more data we put at risk should an attacker be watching our memory operations. This brings us to data encryption, but first it is worth clarifying the meaning of State of the Art.
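The minimalistic data access described above can be sketched as follows. This is an illustrative example using an in-memory SQLite database; the schema and names are assumptions, not from the original text:

```python
import sqlite3

# Sketch of data minimisation at the query level: retrieve only the columns the
# operation needs, never the full record. Schema and values are illustrative.
conn = sqlite3.connect(":memory:")
conn.execute("""CREATE TABLE persons
    (first_name TEXT, last_name TEXT, email TEXT, address TEXT)""")
conn.execute(
    "INSERT INTO persons VALUES ('Ada', 'Lovelace', 'ada@example.org', 'London')")

# A narrow "data view" for an operation that needs only the name: email and
# address never leave the database layer, so less data is exposed in memory.
row = conn.execute("SELECT first_name, last_name FROM persons").fetchone()
assert row == ("Ada", "Lovelace")
```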
3.1 State of The Art
We have seen that the GDPR mandates the use of the state of the art of current technology. This is mentioned in four places: Recitals 78 and 83, and Articles 25(1) and 32(1).
The definition of State of the Art is a moving target: it changes over time. The GDPR establishes that Data Protection must be guaranteed by implementing the State of the Art of technology at the time of processing. Therefore, we now have a legal obligation to keep our software solutions current and fully up to date. Implementing the State of the Art of Data Protection technology today does not guarantee that in three months’ time the solution will not be obsoleted by new findings.
3.2 Database encryption
The fact that we must architect, design, implement and maintain using the State of the Art of technology has a huge impact on the database we use.
There are two main levels of DB encryption: at rest and in memory. Encryption at rest was the state of the art until live DB encryption was introduced, which de facto obsoleted the simple at-rest option, because live encryption encompasses both at-rest and in-memory data encryption.
As I write this paper, there are only two databases I know of featuring both encryptions: Microsoft SQL Server, starting with version 2016, and Oracle Database Server with the Transparent Data Encryption module; so, at this point in time, any other database seems to be non-GDPR-compliant. This represents a big problem for us, because many solutions are based on non-compliant databases, and there is no known roadmap for those to become compliant. In truth, some free DBs will never have enough funding to become compliant, so they are necessarily destined to be dropped from any solution dealing with European citizens’ personal data.
The situation is different for NoSQL databases: none of them, to my knowledge, currently offers both at-rest and in-memory encryption. Therefore, encryption at rest is the state-of-the-art feature for now, and it is offered by many of the most commonly used NoSQL DBs, like MongoDB (Enterprise only, with the WiredTiger engine) and Cosmos DB.
3.3 Database structure and pseudonymization
Once we have selected the correct database, we have to think about its structure. When we talk security, we all know the old saying about not keeping all your eggs in one basket, and it applies quite well in this case.
Designing databases using the 3rd Normal Form (3NF) or Boyce-Codd Normal Form (BCNF) seems to be no longer enough. The principle of pseudonymization established in the GDPR cannot be easily implemented within our usual DB design practices.
The principle of pseudonymization is known to us by the more familiar term of steganography. What we must do is hide information by replacing it with a pseudonym, hiding the data in plain sight so that the attacker sees different but meaningful information in its place.
Let us take this number, imagining it is a valid social security number:
Storing this as steganographic information can be represented, for instance, as a list of US cities:
Rome, Denver, Washington, Arlington, Lebanon, Madison, Greenville
This results from applying the following translation table:
Obviously, we need more columns to handle repeated digits, but this gives you a practical example of using steganography to apply pseudonyms to data, to the end of making them useless to an intruder.
This implies that the column where we store the social security number should be given a DB name like VisitedPlaces, closing the circle of fooling the intruder.
We must also tackle the issue of where we store the translation table. Storing it in the same DB as the data it applies to would be an error. Therefore, we should put the translation tables in separate storage, possibly an encrypted NoSQL DB.
One more thing is needed to secure the steganographic system: remember to replace the translation table regularly. To make this operation efficient, you need a column that signifies which translation table was used to encode the value, and this can be any conventional value. For instance, if you replace the table every week, you can store the week number. This is necessary because replacing the translation table entails parsing all the records in the table, which can be a long operation executed at low priority. Therefore, if the table is large, at some point some values will still be encoded using last week’s table, and we have to know which ones, unless we lock the whole table while we replace the encoder; but that would be bad practice, especially for big tables parsed at low priority.
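The scheme above can be sketched as follows. The digit-to-city translation table is an illustrative assumption reconstructed from the city list given earlier, and the week-number table version is the conventional value discussed above:

```python
# Illustrative sketch of the digit-to-city pseudonymisation scheme.
# The translation table below is a made-up example, keyed by a table version
# (here, a week number) so that old encodings remain decodable during rotation.
TRANSLATION_TABLES = {
    17: {"0": "Rome", "1": "Denver", "2": "Washington", "3": "Arlington",
         "4": "Lebanon", "5": "Madison", "6": "Greenville", "7": "Austin",
         "8": "Boston", "9": "Salem"},
}

def encode_ssn(ssn, version):
    """Replace each digit with a city name; return the pseudonym together with
    the table version used, to be stored in a companion column."""
    table = TRANSLATION_TABLES[version]
    cities = ", ".join(table[d] for d in ssn if d.isdigit())
    return cities, version

def decode_ssn(cities, version):
    """Reverse lookup using the table version stored with the record."""
    reverse = {city: digit for digit, city in TRANSLATION_TABLES[version].items()}
    return "".join(reverse[c] for c in cities.split(", "))

pseudonym, ver = encode_ssn("0123456", 17)
# pseudonym == "Rome, Denver, Washington, Arlington, Lebanon, Madison, Greenville"
assert decode_ssn(pseudonym, ver) == "0123456"
```

As in the text, a production version would need extra columns to disambiguate repeated digits, and the tables themselves would live in separate, encrypted storage.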
3.4 Issues with in-memory data
We know that an intruder is often nothing but a memory observer: a program that monitors RAM looking for data it can recognise and snatch, often referred to as a memory sniffer or memory scraper. Against these attacks we have DB in-memory encryption, but at some point we do have to retrieve the data to use them computationally, and that is the moment we become vulnerable again. This happens even sooner when using a NoSQL DB because, as we have determined, they only offer encryption at rest.
It is evident that we cannot use the data if we do not have them in clear text, so at some point we will be vulnerable anyway if a sniffer is watching our process. The trick, therefore, is to keep the data in clear text for the shortest possible time. If we need those data for multiple processes, we must keep them in an encrypted memory area and decrypt them only for the microseconds necessary for our computation.
Today’s technology allows us to encrypt and decrypt on the fly at an insignificant time cost from a human user’s perspective. Therefore, this is the best approach we can apply using today’s state-of-the-art cryptographic libraries.
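The just-in-time decryption pattern described above can be sketched like this. The cipher here is a one-time pad built from the OS random source, purely to keep the example self-contained; a real system would use a vetted cryptographic library:

```python
import secrets

class SealedValue:
    """Keeps a sensitive value encrypted in memory and exposes it in clear
    text only for the duration of a single computation. The one-time-pad
    cipher is for illustration only; use a vetted library in production."""

    def __init__(self, plaintext):
        self._pad = secrets.token_bytes(len(plaintext))
        self._ciphertext = bytes(a ^ b for a, b in zip(plaintext, self._pad))

    def use(self, fn):
        # Decrypt just-in-time, hand the clear text to the computation,
        # then drop the local reference to minimise its lifetime.
        clear = bytes(a ^ b for a, b in zip(self._ciphertext, self._pad))
        try:
            return fn(clear)
        finally:
            del clear

sealed = SealedValue(b"123-45-6789")
length = sealed.use(len)  # clear text exists only inside this call
assert length == 11
```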
3.5 Sensitive data
Sensitive data are listed in Article 9(1):
“racial or ethnic origin, political opinions, religious or philosophical beliefs, or trade union membership, and the processing of genetic data, biometric data for the purpose of uniquely identifying a natural person, data concerning health or data concerning a natural person’s sex life or sexual orientation”
When we need to process this type of data, we must ensure better and stronger cryptography: RSA key pairs must be no less than 2048 bits, and symmetric block ciphers should use 256-bit keys (e.g. AES-256).
The most important thing about data, and especially about sensitive data, is to ask ourselves: do we need to process them? Are these data necessary to carry out the service? Do we have an explicit consent to process the data subject’s sensitive data?
Do remember that we can no longer collect data with generic or implicit consent: we must ensure that the data subject is fully aware that we are processing these data, and (s)he must explicitly consent to the processing using a form that also clarifies the scope and timeframe of the processing. Moreover, we must securely store the consent, and if we need to process the data for a longer period, a new consent is necessary, especially for sensitive data.
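A stored consent with an explicit scope and timeframe can be modelled as follows. The record structure and field names are illustrative assumptions; the key point is that anything outside the consented scope or timeframe defaults to “no processing”:

```python
from dataclasses import dataclass
from datetime import datetime, timedelta

@dataclass(frozen=True)
class ConsentRecord:
    """Hypothetical record of an explicit consent, capturing the scope and
    timeframe that the data subject agreed to."""
    subject_id: str
    scope: frozenset      # processing purposes explicitly consented to
    granted_at: datetime
    valid_for: timedelta

    def allows(self, purpose, at):
        # Strictest default: an unknown purpose or an expired timeframe
        # means the processing is not permitted.
        return purpose in self.scope and at < self.granted_at + self.valid_for

consent = ConsentRecord("subject-42", frozenset({"billing"}),
                        datetime(2018, 5, 25), timedelta(days=365))
assert consent.allows("billing", datetime(2018, 6, 1))
assert not consent.allows("marketing", datetime(2018, 6, 1))  # out of scope
assert not consent.allows("billing", datetime(2019, 6, 1))    # expired
```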
3.6 True random number generators
Every time we mention cryptography, we are implying the generation of random numbers, and those shall be cryptographically secure.
State of the art for cryptographically secure random numbers generation is hardware cards using shot noise, photoelectric effects and other electron or photon-based systems.
These cards should be compliant with NIST SP800-90B and SP800-90C (still in draft), like the Comscire PQ4000KS or the SwiftRNG Pro. Given the low cost and high availability reached by these true random number generators, there is no longer any justification for avoiding the technology: we can safely state that the era of pseudo-random numbers based on algorithms instead of dedicated hardware is over, hence we must adopt this technology to be compliant with the GDPR.
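On Linux, such hardware generators are typically exposed through the kernel’s hwrng interface. The sketch below reads from that device when present; the fallback to the OS CSPRNG is my own addition for systems without the hardware, and is not a true random source:

```python
import os

def true_random_bytes(n):
    """Read n bytes from a hardware RNG if the kernel exposes one (on Linux,
    the standard hwrng device path); otherwise fall back to the OS CSPRNG.
    Availability of /dev/hwrng depends on the installed hardware."""
    try:
        with open("/dev/hwrng", "rb") as dev:
            return dev.read(n)
    except OSError:
        # No hardware generator available: os.urandom is cryptographically
        # secure, but algorithmic, so it is only a fallback here.
        return os.urandom(n)

key_material = true_random_bytes(32)
assert len(key_material) == 32
```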
3.7 Integrity and security of processing
There are four principal elements that must be considered to guarantee data integrity and the security of its processing:
- Always implement referential integrity constraints at the database level
- Never delete physically any data unless they have reached the limit of their storage period: use logical delete only (a boolean column Is_Deleted)
- Define a clear backup policy and comply with it
- Use Test Driven Development and have a QA team working in isolation from the development team
None of the four points above is more important than the others: all have a considerable impact on our ability to guarantee that the system we architect, design and implement is as robust as the GDPR requires.
If, in a court of law, it is proven that the product has weak data integrity and that there was no standard QA during the phases of the software lifecycle, your company will most likely be found in breach of the GDPR.
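The logical-delete rule listed above can be sketched as follows, again with an in-memory SQLite database; the table and column names are illustrative:

```python
import sqlite3

# Sketch of logical delete: rows are flagged as deleted, never physically
# removed (until their storage period expires). Names are illustrative.
conn = sqlite3.connect(":memory:")
conn.execute("""CREATE TABLE customers (
    id INTEGER PRIMARY KEY,
    full_name TEXT NOT NULL,
    is_deleted BOOLEAN NOT NULL DEFAULT 0)""")
conn.execute("INSERT INTO customers (full_name) VALUES ('Jane Roe')")

# 'Deleting' a customer only sets the flag ...
conn.execute("UPDATE customers SET is_deleted = 1 WHERE full_name = 'Jane Roe'")

# ... so every active-record query must filter on it.
active = conn.execute(
    "SELECT count(*) FROM customers WHERE is_deleted = 0").fetchone()[0]
total = conn.execute("SELECT count(*) FROM customers").fetchone()[0]
assert active == 0  # hidden from normal processing
assert total == 1   # yet still physically present for the retention period
```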
3.8 Securely storing passwords