Companies are increasingly choosing cloud services such as Azure or AWS, which usually offer a flexible, cost-effective and scalable way to run their operations without the restrictions imposed by on-premises technologies.
However, before uploading all our data to the cloud, we must be well aware of the GDPR (General Data Protection Regulation), a European regulation that aims to protect individuals with regard to the processing of their personal data and the free movement of such data.
According to Schedule 5, section 3.8.1 of the Anonymisation guide published by the Spanish Data Protection Agency (AEPD, by its Spanish acronym), hash algorithms are one of the anonymisation techniques that are compliant with the GDPR:
Encryption algorithms are undoubtedly useful in microdata anonymisation, particularly hash algorithms. Hash algorithms are a method that may be applied to specific data in order to generate a unique or almost unique key to represent data. For example, if we want to conceal or anonymise any data, we could use a hash algorithm such as SHA1 or MD5. By applying the algorithm to a specific data, we can obtain a key or digital fingerprint that can be used to replace the actual data. Hash algorithms create a digital fingerprint that makes it impossible to rebuild the original data from the digital fingerprint. Therefore, in computing terms, a single bit variation in the originally stored data would result in a completely different key or digital fingerprint.
They also allow users to generate the same identical fingerprint for the same identical data or microdata. However, it shall never be possible to obtain the original data using the digital fingerprint, thereby ensuring full confidentiality when processing data, as it is a one-way mathematical function. Any resulting keys obtained during the application of a hash algorithm are commonly known as “digital fingerprints” as these provide a unique identification for a specific data or microdata.
Therefore, if we needed, for example, to use a user's ID as a lookup key to retrieve sales data, platform access records, etc., we would not be able to use the ID as it appears in the original DB or datamart. We wouldn't be able to use anything that could reveal personal data if it were exposed in a data leak. That is why hash algorithms are one of the possible techniques available.
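To make the idea concrete, here is a minimal sketch in Python (not the SQL Server approach used later in this post). The `fingerprint` helper and the sample rows are hypothetical; the point is only that identical IDs produce identical fingerprints, so rows can still be grouped or joined after the real ID has been replaced.

```python
import hashlib


def fingerprint(user_id: str) -> str:
    """Return the SHA-256 hex digest of an identifier (hypothetical helper)."""
    return hashlib.sha256(user_id.encode("utf-8")).hexdigest()


# Hypothetical source rows: (user ID, sales amount)
sales = [("11000", 250.0), ("11001", 80.0), ("11000", 40.0)]

# Replace the real ID with its fingerprint before the data leaves the source system.
anonymised_sales = [(fingerprint(uid), amount) for uid, amount in sales]

# Rows that belonged to the same user still share a key, so totals can be computed.
totals = {}
for key, amount in anonymised_sales:
    totals[key] = totals.get(key, 0.0) + amount

print(len(totals), "distinct users in the anonymised data")
```

Note that an unsalted hash of a short numeric ID is easy to reverse by guessing, which is the problem the rest of this post deals with.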
What is a hash?
Usually known as a “hash”, a cryptographic hash function is a mathematical procedure that takes an arbitrary block of data and returns a fixed-size bit string. Regardless of the length of the input data, the output hash value will always have the same length.
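The fixed-length property is easy to check. The following sketch uses Python's standard `hashlib` module with SHA-256 as an arbitrary choice; the sample inputs are made up for illustration.

```python
import hashlib

# Three inputs of very different lengths (1, 6 and 10,000 characters).
inputs = ["a", "SolidQ", "x" * 10_000]

# Length in bytes of the SHA-256 digest of each input.
lengths = [len(hashlib.sha256(text.encode("utf-8")).digest()) for text in inputs]

print(lengths)  # every digest is 32 bytes (256 bits), whatever the input size
```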
This function is used for many purposes, such as password storage, file comparison, blockchain verification, detection of copyright infringement or concealing data that must not be revealed, as in our example.
Although there are different types of hash, some of the most common are MD2, MD5, SHA, SHA1, SHA2_256, SHA2_512.
Below is an example of the outputs offered by some of the hash algorithms for the input value ‘SolidQ’.
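Each algorithm returns a different output string for the same input, and the length of that string is fixed by the algorithm: 128 bits for MD2 and MD5, 160 bits for SHA1, 256 bits for SHA2_256 and 512 bits for SHA2_512.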
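A comparison of this kind can be reproduced with a few lines of Python. This is only a sketch: it uses `hashlib` rather than SQL Server, it hashes the UTF-8 bytes of 'SolidQ' (so the hex strings will not necessarily match what a database returns for the same text), and it leaves out MD2, which `hashlib` does not guarantee to provide.

```python
import hashlib

value = "SolidQ".encode("utf-8")

# Digest size in bits for each algorithm, collected while printing the hashes.
digest_bits = {}
for name in ("md5", "sha1", "sha256", "sha512"):
    digest = hashlib.new(name, value)
    digest_bits[name] = digest.digest_size * 8
    print(f"{name:8} {digest_bits[name]:4} bits  {digest.hexdigest()}")
```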
Hacking the hash
Among the most common ways of guessing passwords, which is every hacker’s favorite pastime, are dictionary and brute-force attacks.
A dictionary attack uses a file containing words, phrases, common passwords, and other strings that are likely to have been used as a password or, in our case, as the value behind a hash. Each word in the file is hashed, and its hash is compared to the password hash. If they match, that word is the password. These dictionary files are constructed by extracting words from large bodies of text, and even from real databases of passwords.
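The procedure is short enough to sketch in Python. The word list, the choice of SHA-1 and the function names below are all hypothetical; a real attack would read millions of candidates from a file.

```python
import hashlib


def sha1_hex(text: str) -> str:
    """SHA-1 hex digest of a string (UTF-8 encoded)."""
    return hashlib.sha1(text.encode("utf-8")).hexdigest()


def dictionary_attack(target_hash, wordlist):
    """Hash each candidate word and return the one whose hash matches, if any."""
    for word in wordlist:
        if sha1_hex(word) == target_hash:
            return word
    return None


# Tiny illustrative dictionary and a "leaked" hash to crack.
wordlist = ["123456", "password", "qwerty", "letmein"]
leaked_hash = sha1_hex("qwerty")

recovered = dictionary_attack(leaked_hash, wordlist)
print(recovered)
```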
A brute-force attack tries every possible combination of characters up to a given length. These attacks are very computationally expensive and are usually the least efficient in terms of hashes cracked per processor time.
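For short identifiers drawn from a small alphabet, such as numeric IDs, brute force is cheap. The sketch below assumes a four-digit ID hashed with plain MD5; the helper names and the sample ID are hypothetical.

```python
import hashlib
from itertools import product


def md5_hex(text: str) -> str:
    """MD5 hex digest of a string (UTF-8 encoded)."""
    return hashlib.md5(text.encode("utf-8")).hexdigest()


def brute_force(target_hash, alphabet, max_length):
    """Try every combination of `alphabet` characters up to `max_length`."""
    for length in range(1, max_length + 1):
        for candidate in product(alphabet, repeat=length):
            text = "".join(candidate)
            if md5_hex(text) == target_hash:
                return text
    return None


# At most 10 + 100 + 1,000 + 10,000 candidates are needed for a 4-digit ID.
found = brute_force(md5_hex("4271"), "0123456789", 4)
print(found)
```

The search space grows exponentially with the length and the size of the alphabet, which is why the attack becomes impractical for long, random inputs.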
There are also more effective methods than the previous ones, such as lookup tables and rainbow tables.
Lookup tables are an extremely effective method for cracking many hashes of the same type. They work by pre-computing the hashes of the passwords in a password dictionary and storing them, together with their corresponding passwords, in a lookup table data structure. A good implementation of a lookup table can process hundreds of hash lookups per second, even when it contains many billions of hashes.
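A Python dictionary is enough to illustrate the idea. In this sketch the "password dictionary" is a hypothetical range of numeric IDs, hashed once in advance; each leaked hash is then resolved with a single lookup instead of a new search.

```python
import hashlib


def sha256_hex(text: str) -> str:
    """SHA-256 hex digest of a string (UTF-8 encoded)."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


# Pre-compute hash -> original value for every candidate ID (done only once).
lookup = {sha256_hex(str(i)): str(i) for i in range(10000, 12000)}

# Hashes found in a leaked dataset; the last one is outside the table.
leaked = [sha256_hex("11000"), sha256_hex("11999"), sha256_hex("99999")]

# Each hash is resolved with one dictionary lookup.
recovered = [lookup.get(h) for h in leaked]
print(recovered)  # ['11000', '11999', None]
```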
Rainbow tables are like lookup tables, except that they sacrifice hash cracking speed to make the tables smaller. Because they are smaller, the solutions to more hashes fit in the same amount of space, which makes them more effective. It has never been so easy.
Lookup tables and rainbow tables only work because each password is hashed in exactly the same way. We can prevent these attacks by adding further entropy when the hashes are generated, i.e., by randomising each hash. This can be achieved by appending, prepending or interspersing a string with the identifier that we wish to anonymise before hashing. These strings are known as salts and need to be as random as possible. Below is the Spanish Data Protection Agency's recommendation on this type of operation:
If the AAA key is used to replace an individual’s name, we could apply a hash algorithm adding some random text such as AAA1k2j3j as a secret key for the AAA username (1k2j3j) in order to strengthen the confidentiality chain. In this case, it may be worth considering the possibility of having a secret key policy in order to carry out the anonymisation. If we applied an MD5 hash algorithm to the AAA string, it would be rather simple to reidentify using brute-force methods. However, by adding text or an extra key to the username, it would require further effort in order to reidentify the individual.
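The effect of the extra text can be sketched in Python using the AAA and 1k2j3j values from the recommendation. The pre-computed table and the choice of SHA-256 are assumptions made for the example.

```python
import hashlib


def sha256_hex(text: str) -> str:
    """SHA-256 hex digest of a string (UTF-8 encoded)."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


# An attacker's pre-computed table of unsalted hashes for likely usernames.
unsalted_table = {sha256_hex(name): name for name in ["AAA", "BBB", "CCC"]}

plain = sha256_hex("AAA")              # hash of the bare identifier
salted = sha256_hex("AAA" + "1k2j3j")  # hash of the identifier plus the secret text

print(unsalted_table.get(plain))   # AAA  -> the unsalted hash is reidentified
print(unsalted_table.get(salted))  # None -> the table is useless without the salt
```

The protection lasts only as long as the added text stays secret: anyone who knows it can rebuild the table with the salt included.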
As for salting techniques, these can be as complex as desired: the salt can be added before, after or in between the characters of the identifier. We must also try to obtain a salt that is as random as possible. In SQL Server, we have the RAND function, which can be given a different seed each time it is executed; note, however, that RAND is not a cryptographically secure generator, and CRYPT_GEN_RANDOM is the function SQL Server provides for that purpose. To make it even more complicated, the hash can be applied n times to its own output, always bearing in mind that the hashes that are hardest to crack are usually the slowest to compute, which is precisely what makes them not worth attacking. On the other hand, we need our data to load as fast as possible. Therefore, it is best to settle this trade-off as early as possible in the project's lifetime to avoid unpleasant surprises further down the line.
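The two ideas in this paragraph, a random salt and a hash applied many times, can be sketched in Python. This is an illustration rather than the SQL Server implementation: the salt comes from the `secrets` module, and the repeated hashing is done with PBKDF2 (`hashlib.pbkdf2_hmac`), a standard construction swapped in here for the "apply the hash n times" idea. The helper name and the iteration counts are assumptions.

```python
import hashlib
import secrets

# 128-bit salt from the operating system's cryptographically secure generator.
salt = secrets.token_bytes(16)


def stretched_hash(identifier: str, salt: bytes, iterations: int) -> str:
    """Derive a hex fingerprint by iterating HMAC-SHA256 `iterations` times."""
    derived = hashlib.pbkdf2_hmac("sha256", identifier.encode("utf-8"), salt, iterations)
    return derived.hex()


fast = stretched_hash("11000", salt, 1)        # cheap to compute, cheap to attack
slow = stretched_hash("11000", salt, 100_000)  # each guess now costs 100,000 rounds

print(fast != slow)
```

The iteration count is the dial between attack cost and load time: every extra round slows the attacker, but it also slows every row of our own data load by the same factor.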
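In SQL Server 2008 and later, the HASHBYTES function can be used to apply several algorithms (SHA2_256 and SHA2_512 require SQL Server 2012 or later):
MD2 | MD4 | MD5 | SHA | SHA1 | SHA2_256 | SHA2_512
Since SQL Server 2016, every algorithm in this list other than SHA2_256 and SHA2_512 is deprecated.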
Below is a simple example of salt hashing. As previously discussed, the salt should be random; the fixed value used here is for illustration only.
    -- Identifier to anonymise and a sample salt (a real salt should be random)
    DECLARE @Identificador nvarchar(4000);
    DECLARE @Salt nvarchar(4000);
    DECLARE @ID_Salt nvarchar(4000);
    DECLARE @Salt_ID nvarchar(4000);

    SET @Identificador = CONVERT(nvarchar(4000), '11000');
    SET @Salt = CONVERT(nvarchar(4000), '8AcK707H3Fu7UR3');
    SET @ID_Salt = @Identificador + @Salt;  -- salt appended to the identifier
    SET @Salt_ID = @Salt + @Identificador;  -- salt prepended to the identifier

    -- Hash both salted values with each algorithm
    SELECT @ID_Salt AS Value, 'MD2' AS Type_Of_Hash, HASHBYTES('MD2', @ID_Salt) AS Hash_Value
    UNION
    SELECT @Salt_ID AS Value, 'MD2' AS Type_Of_Hash, HASHBYTES('MD2', @Salt_ID) AS Hash_Value
    UNION
    SELECT @ID_Salt AS Value, 'MD5' AS Type_Of_Hash, HASHBYTES('MD5', @ID_Salt) AS Hash_Value
    UNION
    SELECT @Salt_ID AS Value, 'MD5' AS Type_Of_Hash, HASHBYTES('MD5', @Salt_ID) AS Hash_Value
    UNION
    SELECT @ID_Salt AS Value, 'SHA1' AS Type_Of_Hash, HASHBYTES('SHA1', @ID_Salt) AS Hash_Value
    UNION
    SELECT @Salt_ID AS Value, 'SHA1' AS Type_Of_Hash, HASHBYTES('SHA1', @Salt_ID) AS Hash_Value
    UNION
    SELECT @ID_Salt AS Value, 'SHA2_256' AS Type_Of_Hash, HASHBYTES('SHA2_256', @ID_Salt) AS Hash_Value
    UNION
    SELECT @Salt_ID AS Value, 'SHA2_256' AS Type_Of_Hash, HASHBYTES('SHA2_256', @Salt_ID) AS Hash_Value
    UNION
    SELECT @ID_Salt AS Value, 'SHA2_512' AS Type_Of_Hash, HASHBYTES('SHA2_512', @ID_Salt) AS Hash_Value
    UNION
    SELECT @Salt_ID AS Value, 'SHA2_512' AS Type_Of_Hash, HASHBYTES('SHA2_512', @Salt_ID) AS Hash_Value;
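If the same fingerprints have to be reproduced outside the database, for instance to validate a load, the text encoding matters. The Python sketch below works on the assumption that nvarchar values are hashed as UTF-16 little-endian bytes; the `hashbytes_like` helper is hypothetical and its results should be checked against a real HASHBYTES output before being relied upon.

```python
import hashlib

identifier = "11000"
salt = "8AcK707H3Fu7UR3"  # same sample salt as in the T-SQL example


def hashbytes_like(algorithm: str, text: str) -> str:
    """Approximate HASHBYTES(<algorithm>, <nvarchar value>) as upper-case hex.

    Assumption: the nvarchar value is hashed as UTF-16 little-endian bytes.
    """
    return hashlib.new(algorithm, text.encode("utf-16-le")).hexdigest().upper()


id_salt = hashbytes_like("sha256", identifier + salt)  # salt appended
salt_id = hashbytes_like("sha256", salt + identifier)  # salt prepended

# The same text hashed as UTF-8 bytes gives a different fingerprint.
as_utf8 = hashlib.sha256((identifier + salt).encode("utf-8")).hexdigest().upper()

print(id_salt != salt_id, id_salt != as_utf8)
```

The position of the salt and the encoding of the string both change the result, so they need to be fixed as part of the anonymisation policy.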
There is also a MultiHash method that can be found in this URL.
We can use multiple algorithms including MD5 and SHA. We can also select the columns to be included, one or more of which could contain the salt and also create any hashes that we wish.
The .NET System.Security.Cryptography namespace provides a wide range of cryptographic services, including hashing. Among them are the HMAC-based algorithms, which are especially recommended by the Spanish Data Protection Agency (AEPD):
HMAC algorithms based on RFC 2104 (Request for Comments 2104) are a good option that can be used in combination with several hash algorithms such as MD5 before applying a cryptographic algorithm to the digital fingerprint and the hash key in order to create a brand new digital fingerprint or key based upon a secret key.
Using HMAC in combination with non-trivial secret keys and a strict key destruction policy can be useful to ensure that the anonymisation process cannot be reversed. However, if the keys used in combination with HMAC are stored, these can also be used to generate pseudonymised data that will require a subsequent reidentification. Hash mechanisms with secret keys may be useful to mask data. However, there must be a procedure to allow for safe elimination of these keys, and it must be possible to confirm that the procedure has been followed in order to ensure that the process cannot be reversed.
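A keyed fingerprint of this kind can be sketched with Python's standard `hmac` module (the .NET classes mentioned earlier would be the natural choice in a SQL Server environment). The `pseudonymise` helper, the choice of SHA-256 and the 32-byte key are assumptions made for the example.

```python
import hashlib
import hmac
import secrets

# Secret key from a cryptographically secure generator (hypothetical key policy).
secret_key = secrets.token_bytes(32)


def pseudonymise(identifier: str, key: bytes) -> str:
    """Return the HMAC-SHA256 hex fingerprint of an identifier under `key`."""
    return hmac.new(key, identifier.encode("utf-8"), hashlib.sha256).hexdigest()


token_a = pseudonymise("11000", secret_key)
token_b = pseudonymise("11000", secret_key)

# Same identifier and same key always give the same fingerprint.
same = hmac.compare_digest(token_a, token_b)

# Without the original key, the fingerprint cannot be regenerated.
other_key_token = pseudonymise("11000", secrets.token_bytes(32))

print(same, token_a != other_key_token)
```

While the key exists, anyone holding it can recompute the fingerprint for a known identifier, which is pseudonymisation; once the key has been verifiably destroyed, that route is closed.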
Our purpose with this post is to provide an introduction to this subject. We expect that each company will formulate its own encryption and anonymisation strategy with the support of its legal team. DPOs will also be required to make the relevant decisions in order to ensure compliance with the organisation's own data protection regulations.
I hope that you have enjoyed reading it and, should you wish to find out more about any hot topics like these, please do not hesitate to contact us. Our team includes some of the most prominent worldwide experts in services such as Azure, Data Science, AI & Data Analytics, Business Intelligence, data protection, etc.
Over the last 4 years, after an initial stage in web design, I have focused my work as a BI consultant on different technologies. I have worked with different business areas such as Supply Chain, Financial, Health, Database Performance and Textile.