Have you taken a few minutes to think about the way you work with live databases? What about the ones in development environments? Organizations handle an enormous volume of personal data, both on their data platforms and in the digitized and physical documents they hold. About 90% of the documents that companies store contain some kind of personal information.

Are you acting appropriately to protect sensitive information, as required by law? Data obfuscation can help you comply with the GDPR. In this article, we tell you how.

How does the GDPR affect the integrity of development and testing departments?

From the moment the General Data Protection Regulation (GDPR) came into effect, its aim was to make Europe one of the safest areas in the world for trade built on trust, establishing data privacy as a fundamental right of individuals. Under this regulation, in force since May 2018, penalties for violating European citizens' fundamental right to the protection of personal data can reach up to 20 million euros or 4% of the annual turnover of companies and institutions, whichever is higher.

Organizations often need to debug production errors, and in the past this has frequently been done by pointing development applications directly at the production database. This is bad practice in itself, it is one of the things the GDPR regulates most forcefully, and it is precisely one of the things we should never do.

Another common bad practice is copying production data from production databases into non-production or test databases. This is done to test application functionality realistically and to cover real-world scenarios or test cases, minimizing production errors or defects. As with the previous case, the result is that a non-production environment can become an easy target for cybercriminals or malicious employees looking for sensitive data to expose or steal. Because a non-production environment is not as tightly controlled or managed as the production environment, the GDPR sets clear rules about these environments and how the data in them should be treated, prohibiting the use of real sensitive data in them.

NOTE: Accessing production data from non-production environments is completely prohibited by law.

Can we debug by connecting to our production database?

No, but we can debug against a suitably obfuscated copy of the production database. To all intents and purposes it is the production database (same records, keys, relationships and interdependencies…), but it is impossible to obtain sensitive data from it in any way.

As of today, SQL Server does not have a native mechanism to support GDPR compliance in the scenario we are discussing (debugging with production data). An attempt was made with “Static data masking” in SQL Server 2019, but it was eventually dropped, with no planned date for its return. None of the security and masking features that SQL Server offers covers this scenario:

  • Dynamic data masking
    • Not useful here: a developer with the appropriate permissions could disable the feature and read the information, which is stored unencrypted.
  • Encryption
    • Transparent data encryption
      • Only encrypts the data at the storage (I/O) level. The data therefore remains available and accessible to anyone with the appropriate credentials.
    • Always Encrypted
      • Leaving aside the many limitations that enabling Always Encrypted brings, the outcome is this: if we connect through an application that has the registered driver, the data is read normally (exactly what we do not want); if not, we see a very long binary value or NULLs in the encrypted cells, which is not what anyone wants to see when debugging data against “production”.
    • Encryption keys
      • A developer with sufficient credentials can access the key and decrypt the data. Otherwise we are in the same situation as above: the encrypted cells show a binary blob or NULLs, which is no help when debugging either.
  • Row-level security
    • Not useful here: a developer with access to the data and sufficient credentials could deactivate it and read the information, which is stored unencrypted.

What is data obfuscation and how is it applied?

In a technological environment, data obfuscation is the process of replacing the sensitive information in test/development environments with information that looks like real production information but contains no real data (real data there would violate the GDPR). Data obfuscation techniques protect data by randomizing the sensitive data held in non-production environments, or by masking identifiable information with realistic values, which lets companies mitigate the risk of data exposure and thus comply with the GDPR.
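As a rough illustration of masking with realistic values, the sketch below replaces a full name with one drawn from small dictionaries. It is a minimal sketch under stated assumptions, not a description of how any particular product works: the dictionaries, the `SECRET_KEY` value and the function names are all hypothetical. It also swaps pure randomness for a keyed hash (HMAC), so that the same real value always maps to the same substitute and rows that matched before obfuscation still match afterwards.

```python
import hashlib
import hmac

# Hypothetical dictionaries; a real deployment would load much larger ones.
FIRST_NAMES = ["Alejandro", "Lucía", "Marta", "Pablo", "Sofía", "Daniel"]
SURNAMES = ["Bernabeu", "Imaginario", "Navarro", "Serrano", "Molina", "Vidal"]

# Assumption: a secret that is never stored in the non-production environment.
SECRET_KEY = b"replace-with-a-real-secret"


def _pick(value: str, label: str, choices: list[str]) -> str:
    """Deterministically map `value` to one entry of `choices`."""
    message = f"{label}:{value}".encode("utf-8")
    digest = hmac.new(SECRET_KEY, message, hashlib.sha256).digest()
    index = int.from_bytes(digest[:8], "big") % len(choices)
    return choices[index]


def obfuscate_name(real_name: str) -> str:
    """Replace a real full name with a realistic one built from the dictionaries."""
    first = _pick(real_name, "first", FIRST_NAMES)
    surname_1 = _pick(real_name, "surname1", SURNAMES)
    surname_2 = _pick(real_name, "surname2", SURNAMES)
    return f"{first} {surname_1} {surname_2}"


masked = obfuscate_name("Enrique Catalá Bañuls")
# The same input always yields the same substitute.
assert masked == obfuscate_name("Enrique Catalá Bañuls")
```

With dictionaries this small, different people will often end up with the same substitute name; larger dictionaries make that less likely.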

For this technique to be genuinely useful in testing and development environments:

  • Randomizing the data should not produce confusing values, but values that are coherent with what is expected.
Real data                                Result  Obfuscated data
Name: Enrique Catalá Bañuls              error   Xfajavsñf asfjevstñs fjekscto00xx
Name: Enrique Catalá Bañuls              ok      Alejandro Bernabeu Imaginario
Street: Avenida pintor baeza, 12, 6ºA    error   fjffj fñarkf toff10, 100f, 5X
Street: Avenida pintor baeza, 12, 6ºA    ok      Calle andromeda, 14, 2ºA
ID: 53234565-F                           error   XXXXXXXX-X
ID: 53234565-F                           ok      456543456-F (an ID that passes ID validations, but is itself random)

That is, randomizing an ID, for example, should produce a valid ID that follows the ID format and passes the validation rules the application itself enforces. This is where DatabaseObfuscator, the solution we have developed at SolidQ, comes into play:

  • Randomization should be carried out within a reasonable amount of time.
    • Getting it done is of no use if the process takes 24 hours to complete and the data is no longer relevant.
  • Randomization should only be performed on sensitive data, which can change over time and can be scattered among tens of thousands of columns in thousands of tables.
    • Manual work should be avoided, and the human factor should be relied on as little as possible. The system should be self-sufficient.
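The coherence requirement can be sketched for the ID case as follows. This is only an illustration. It assumes the IDs follow the Spanish DNI scheme (eight digits plus a check letter selected by the number modulo 23), which the sample ID 53234565-F used above happens to satisfy. The function names are hypothetical.

```python
import random

# Assumption: IDs follow the Spanish DNI scheme (8 digits plus a check letter).
DNI_LETTERS = "TRWAGMYFPDXBNJZSQVHLCKE"


def is_valid_id(value: str) -> bool:
    """Validation rule an application might apply: format plus check letter."""
    digits, sep, letter = value.partition("-")
    if sep != "-" or len(digits) != 8 or len(letter) != 1:
        return False
    if not (digits.isascii() and digits.isdigit()):
        return False
    return DNI_LETTERS[int(digits) % 23] == letter


def random_valid_id(rng: random.Random) -> str:
    """Generate a random ID that still passes `is_valid_id`."""
    number = rng.randrange(100_000_000)  # any value from 0 to 99,999,999
    return f"{number:08d}-{DNI_LETTERS[number % 23]}"


rng = random.Random()
fake_id = random_valid_id(rng)
assert is_valid_id(fake_id)            # random, yet accepted by the validator
assert not is_valid_id("XXXXXXXX-X")   # a naive mask fails the same check
```

The point of generating the check letter instead of masking it is that application code which validates IDs keeps working against the obfuscated database.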

What is DatabaseObfuscator?

Database Obfuscator is a solution offered by SolidQ that performs the obfuscation process based on a set of rules that depend on the type of sensitive information being stored. To deliver this service, we have designed and implemented a database tool that detects which columns to obfuscate, obfuscates them following transparent anonymization patterns (nobody looking at the result will know that it has been anonymized), and does so as fast as the hardware it runs on allows.
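The article does not describe how Database Obfuscator decides which columns are sensitive, so the following is only a hypothetical sketch of the general idea of rule-based detection: each rule pairs a type of sensitive information with a pattern matched against column names. The rule names, the patterns and the `detect_sensitive_columns` function are assumptions made for illustration. In SQL Server, the list of (table, column) pairs could be read from the `INFORMATION_SCHEMA.COLUMNS` view.

```python
import re

# Hypothetical rules: type of sensitive information -> column-name pattern.
COLUMN_RULES = {
    "person_name": re.compile(r"(first|last|full)_?name|surname", re.IGNORECASE),
    "email": re.compile(r"e[-_]?mail", re.IGNORECASE),
    "address": re.compile(r"address|street|postal_?code", re.IGNORECASE),
    "national_id": re.compile(r"national_?id|passport|tax_?id", re.IGNORECASE),
    "phone": re.compile(r"phone|mobile", re.IGNORECASE),
}


def detect_sensitive_columns(columns):
    """Map each (table, column) pair that matches a rule to that rule's name."""
    findings = {}
    for table, column in columns:
        for rule_name, pattern in COLUMN_RULES.items():
            if pattern.search(column):
                findings[(table, column)] = rule_name
                break  # the first matching rule wins
    return findings


schema = [
    ("Customers", "CustomerId"),
    ("Customers", "FirstName"),
    ("Customers", "Email"),
    ("Orders", "ShippingAddress"),
    ("Orders", "Total"),
]
for (table, column), rule in detect_sensitive_columns(schema).items():
    print(f"{table}.{column}: {rule}")
```

Matching on names alone will miss badly named columns and flag some harmless ones, so a real analyzer would presumably also inspect sample values. The sketch only shows how detection can feed the choice of obfuscation rule.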

Main benefits of DatabaseObfuscator:

  • Automatic analyzer that detects which columns have to be obfuscated in order to obtain an obfuscated database that complies with the GDPR.
  • Fully configurable obfuscation dictionaries, to fit any scenario involving real data and to generate values that are meaningful for the client application.
  • Obfuscation at the maximum speed supported by the storage array that holds the data to be obfuscated.
  • Optional support for creating testing databases as a subset of the data that preserves referential integrity.
    • Example: create a database that is 5% of the production size, obfuscated and consistent, to be used as a docker image by development teams.
  • Pause and resume of the obfuscation process
  • Modular API, with Windows and Linux support
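The idea of a reduced testing database that still respects referential integrity can be sketched as below. This is a simplified, hypothetical example with two in-memory tables, not the tool's actual method: it samples a fraction of the parent rows and then keeps only the child rows whose foreign key points at a sampled parent. The table and column names are invented for the illustration.

```python
import random


def sample_consistent_subset(customers, orders, fraction, seed=0):
    """Keep a fraction of customers and only the orders that reference them."""
    rng = random.Random(seed)
    # Keep at least one customer, but never more than exist.
    keep_count = min(len(customers), max(1, round(len(customers) * fraction)))
    kept_customers = rng.sample(customers, keep_count)
    kept_ids = {customer["id"] for customer in kept_customers}
    # Child rows survive only if their parent row survived.
    kept_orders = [order for order in orders if order["customer_id"] in kept_ids]
    return kept_customers, kept_orders


customers = [{"id": i} for i in range(100)]
orders = [{"id": n, "customer_id": n % 100} for n in range(500)]

subset_customers, subset_orders = sample_consistent_subset(customers, orders, 0.05)
subset_ids = {customer["id"] for customer in subset_customers}
# No order in the subset points at a customer that was left out.
assert all(order["customer_id"] in subset_ids for order in subset_orders)
```

With more tables the same rule is applied along every foreign key, starting from the sampled root tables, so that no row in the subset references a row that was dropped.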

NOTE: Database Obfuscator randomizes data through a series of dictionary-based rules, using table-valued functions (TVFs) and special trace flags to get the maximum performance out of the tool when generating and updating data.

These are the times taken by the obfuscation process in some example deployments:

  • 2 h 38 min obfuscating a 1 TB database of OLTP information.
  • 32 minutes generating a 58 GB database from that 1 TB database, which ends up as a docker image ready to be consumed by the development team.
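To put those times in perspective, they can be converted into approximate sustained rates. The small calculation below is only a back-of-the-envelope sketch: it assumes decimal units (1 TB = 1,000 GB = 1,000,000 MB), and for the second figure it measures the size of the generated database, not the amount of source data read.

```python
def rate_mb_per_s(size_gb: float, hours: int, minutes: int) -> float:
    """Average rate in MB/s for processing `size_gb` in the given time."""
    seconds = hours * 3600 + minutes * 60
    return size_gb * 1000 / seconds


full_rate = rate_mb_per_s(1000, 2, 38)   # 1 TB obfuscated in 2 h 38 min
subset_rate = rate_mb_per_s(58, 0, 32)   # 58 GB generated in 32 min

print(f"Full obfuscation: about {full_rate:.0f} MB/s")
print(f"Subset generation: about {subset_rate:.0f} MB/s")
```

Under those assumptions the full obfuscation works out to roughly 105 MB/s and the subset generation to roughly 30 MB/s of output.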

Would you like to know more about our Database Obfuscator solution? Find more information in our dedicated section, GDPR & Compliance Audit.

Enrique Catalá

Microsoft Data Platform MVP & Mentor at SolidQ
I am a Mentor and Microsoft Data Platform MVP at SolidQ. I am a Microsoft Certified Trainer (MCT) focused on the SQL Server relational engine, where I have successfully led more than 100 projects, not just in Spain but also in the USA, Mexico, Austria, etc.

I am the main architect of the SolidQ solutions HealthCheck, SQL2Cloud, SCODA and SolidQ SSIS Generator. Apart from that, I am a regular speaker at SolidQ Summit.