Anonymisation, De-identification and Pseudonymisation

TP Transcription Limited and University Transcriptions are expert academic transcribers and preferred suppliers to a large number of universities in the UK, Ireland and around the world. This article explains the differences between anonymisation, de-identification and pseudonymisation as applied to research interviews, focus groups, patient notes or other free-text data containing personal information.

For details of our options for anonymisation and pseudonymisation please click here.

Anonymisation

Anonymisation is the process of removing personal identifiers, both direct and indirect, that may lead to an individual being identified.

An individual may be directly identified from their name, address, postcode, telephone number, photograph, image, or other unique personal characteristic.

An individual may be indirectly identified when particular information is linked together with other sources of information. This can include their place of work, job title, salary, a particular diagnosis or condition, an event (eg a disaster) or presence at a location at a specific time.

The main reason given for most anonymisation relates to GDPR (data protection regulations). Once data is completely anonymised and individuals are no longer identifiable, the data will not fall within the scope of the GDPR and it becomes easier to use, hence the regular request by our academic clients for some element of anonymisation, depending on the project in question. There are of course plenty of other reasons for it, including a promise by researchers to interviewees that their identity will not be revealed at any point.

While there may be incentives for some organisations to process data in anonymised form, this technique may devalue the data, so that it is no longer useful for some purposes. Therefore, before anonymisation, consideration should be given to the purposes for which the data is to be used.

Free Anonymisation Example

We can highlight names and/or places so that you can decide which to remove or leave in, or we can anonymise them for you.

“Hello, my name is <Anna> and I live in <Manchester>”

or

“Hello, my name is <Name> and I live in <Place>”
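The two options above can be sketched in a few lines of Python. This is only an illustration, assuming the names and places to look for are already known. The `mark_identifiers` function and its `terms` mapping are hypothetical names made up for this example and do not describe our in-house process.

```python
import re


def mark_identifiers(text, terms, replace=False):
    """Highlight known identifiers, or swap them for a generic label.

    terms maps each identifier (e.g. "Anna") to its category (e.g. "Name").
    """
    if not terms:
        return text
    # Longest terms first, so a longer phrase wins over a shorter one it contains.
    ordered = sorted(terms, key=len, reverse=True)
    # \b keeps matches to whole words, so "Anna" does not match "Annabel".
    pattern = re.compile(r"\b(" + "|".join(map(re.escape, ordered)) + r")\b")

    def substitute(match):
        found = match.group(1)
        return f"<{terms[found]}>" if replace else f"<{found}>"

    return pattern.sub(substitute, text)


sentence = "Hello, my name is Anna and I live in Manchester"
terms = {"Anna": "Name", "Manchester": "Place"}

print(mark_identifiers(sentence, terms))                # highlighted for review
print(mark_identifiers(sentence, terms, replace=True))  # anonymised
```

A list-based approach like this only finds terms it has been told about, which is one reason a human read-through is still needed.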

Please contact us to discuss your requirements – we are very experienced at all forms of anonymisation. Usually we do not charge for the service, provided we do it as we transcribe, but other anonymisation will require an extra charge due to the time it takes for us to filter the text once we have transcribed the recording.

Why Anonymise?

Very often if researchers need to share participant notes or interview transcripts the data will need anonymising.

The UK Information Commissioner’s Office lists the following reasons for considering anonymisation:

  • developing greater public trust and confidence that data is being used for the public good, while privacy is protected
  • incentivising researchers and others to use anonymous information instead of personal data, where possible
  • economic and societal benefits deriving from the availability of rich data sources

The best way to protect your participants’ privacy may be not to collect certain identifiable information at all – easier said than done when interviewing of course! The second best is anonymisation, which allows data to be shared whilst protecting participants’ personal information.

Anonymisation should be considered in the context of the whole project and how it can be utilised alongside informed consent and control of access to data. Of course if a participant consents to their data being shared then the use of anonymisation may not be required. We strongly recommend asking the question – a blanket use of anonymisation without any reason can be time consuming for all concerned.

Anonymisation methods

The Consortium of European Social Science Data Archives has produced a best practice guide for anonymising quantitative and qualitative data. They have also generated a guide to other sources of guidance.

Summary of best practices for anonymising quantitative data (CESSDA)

  • Remove or aggregate variables, or reduce the precision or detailed textual meaning of a variable;
  • Aggregate or reduce the precision of a variable such as age or place of residence;
  • Generalise the meaning of a detailed text variable by replacing potentially disclosive free-text responses with more general text;
  • Restrict the upper or lower ranges of a continuous variable to hide outliers if the values for certain individuals are unusual or atypical within the wider group researched.
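As a rough illustration of the quantitative practices listed above, the following Python sketch bands ages, caps an unusually high salary and shortens a UK postcode to its outward part. The function names, the ten-year band width and the 100,000 cap are assumptions chosen for the example, not CESSDA recommendations.

```python
def age_band(age, width=10):
    """Aggregate an exact age into a band such as '30-39'."""
    lower = (age // width) * width
    return f"{lower}-{lower + width - 1}"


def top_code(value, cap):
    """Restrict the upper range: any value above the cap is reported as the cap."""
    return min(value, cap)


def outward_postcode(postcode):
    """Reduce the precision of a UK postcode by dropping the 3-character inward part."""
    compact = postcode.replace(" ", "").upper()
    return compact[:-3]


records = [
    {"age": 34, "salary": 28000, "postcode": "M1 4BT"},
    {"age": 61, "salary": 250000, "postcode": "cm1 1qh"},
]

anonymised = [
    {
        "age_band": age_band(r["age"]),
        "salary": top_code(r["salary"], cap=100000),
        "area": outward_postcode(r["postcode"]),
    }
    for r in records
]
print(anonymised)
```

How coarse the bands and caps need to be depends on the dataset: a value that is unremarkable in a large sample may still single someone out in a small one.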

Summary of best practices for anonymising qualitative data (CESSDA)

  • Use pseudonyms or generic descriptors to edit identifying information, rather than blanking out that information;
  • Plan anonymisation at the time of transcription or initial write-up;
  • Use pseudonyms or replacements that are consistent throughout the research team and the project;
  • Use ‘search and replace’ techniques carefully so that unintended changes are not made, and misspelt words are not missed;
  • Identify replacements in text clearly, for example with [brackets] or using XML tags such as <seg>word to be anonymised</seg>;
  • Create an anonymisation log (also known as a de-anonymisation key) of all replacements, aggregations or removals made and store such a log securely and separately from the anonymised data files.
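Several of the qualitative practices above (consistent pseudonyms, careful search and replace, bracketed replacements and an anonymisation log) can be combined in one short Python sketch. The `pseudonymise_transcript` function and the structure of its log are hypothetical, written for this article under the assumption that the research team has already agreed a list of replacements.

```python
import re


def pseudonymise_transcript(text, replacements):
    """Replace whole-word matches with bracketed pseudonyms and log each change.

    replacements maps an original term to its agreed pseudonym, so the same
    person or place receives the same replacement throughout the project.
    """
    if not replacements:
        return text, []
    lookup = {original.lower(): pseudonym for original, pseudonym in replacements.items()}
    ordered = sorted(lookup, key=len, reverse=True)
    # \b avoids unintended changes inside longer words; IGNORECASE catches "anna".
    pattern = re.compile(r"\b(" + "|".join(map(re.escape, ordered)) + r")\b", re.IGNORECASE)
    counts = dict.fromkeys(lookup, 0)

    def substitute(match):
        found = match.group(1).lower()
        counts[found] += 1
        return f"[{lookup[found]}]"

    new_text = pattern.sub(substitute, text)
    log = [
        {"original": original, "replacement": pseudonym, "count": counts[original.lower()]}
        for original, pseudonym in replacements.items()
    ]
    return new_text, log


text = "Anna said anna's manager in Leeds agreed. Annabel disagreed."
new_text, log = pseudonymise_transcript(text, {"Anna": "Participant 1", "Leeds": "City A"})
print(new_text)
print(log)
```

The returned log is the part to store securely and separately from the anonymised transcript. A misspelling such as "Ana" would not be caught by this kind of search, so the output still needs to be read through.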

GDPR – EU and UK

Recital 26 of the EU GDPR defines anonymous information as ‘…information which does not relate to an identified or identifiable natural person or to personal data rendered anonymous in such a manner that the data subject is not or no longer identifiable’.

The key point is that GDPR does not apply to anonymised information, which is why anonymisation is such an important function. Where GDPR does not apply, the data can be used in less restrictive ways.

The ICO’s Code of Practice on Anonymisation provides further guidance on anonymisation techniques – including the suggestion of applying a ‘motivated intruder’ test for ensuring the adequacy of de-identification techniques.

De-Identification

De-identification is the process of removing or obscuring personally identifiable information (PII) from a text or dataset. This data tends to include names, locations and contact details. The process can be approached in a number of ways, but the output is often along the lines of:

a. the masking of PII with labels (“my name is Anna” becomes “my name is <NAME>”)
b. the replacement of PII with dummy data (“my name is Anna” becomes “my name is Alan”)

Example of de-identification:

Original text:

Speaker 1: hi John, good to see you again. How was your weekend? Did you get up to much?

Speaker 2: Yes, all good thanks. I was with my mum in Chelmsford. We saw that Harry Potter film. What’s it called? Then got a couple of drinks at the Slug & Lettuce in Ilford.

Speaker 1: That’s close to your flat, right?

Speaker 2: Yes about ten minutes away from my flat in James Street. It was my mum’s birthday on Sunday, She’s got a new job at Aldi in Romford.

De-identified text:

Speaker 1: hi PER, good to see you again. How was your weekend? Did you get up to much?

Speaker 2: Yes, all good thanks. I was with my mum in LOC. We saw that Harry Potter film. What’s it called? Then got a couple of drinks at the Slug & Lettuce in LOC.

Speaker 1: That’s close to your flat, right?

Speaker 2: Yes about ten minutes away from my flat in ADD. It was my mum’s birthday on Sunday, She’s got a new job at PLA in LOC.

NLM-Scrubber – Free De-identification Tool

The NLM-Scrubber is a free clinical text de-identification tool designed and developed at the National Library of Medicine in the US. The aim of the tool is to enable clinical scientists in the US to access clinical health information that is not associated with the patient by following the Safe Harbor principles outlined in the HIPAA Privacy Rule. HIPAA stands for the Health Insurance Portability and Accountability Act of 1996, which is broadly the US counterpart to GDPR for clinical data.

The tool can be used by all researchers – link is here – https://lhncbc.nlm.nih.gov/scrubber/download.html

NB: outputs reliant on pre-trained models should always be checked for errors – and you may well need to apply de-identification manually to your texts. We can assist – please ask for de-identification when placing your order.
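One simple way to support that checking step is to scan the de-identified output for terms you already know appear in the original. The Python sketch below is a hypothetical helper written for this article, not a feature of NLM-Scrubber, and it assumes you can draw up a watch list of names and places from your own project notes.

```python
import re


def find_residual_identifiers(text, watch_list):
    """Return (line number, term) pairs for watch-list terms still in the text."""
    findings = []
    for line_number, line in enumerate(text.splitlines(), start=1):
        for term in watch_list:
            # Whole-word, case-insensitive match of the literal term.
            if re.search(r"\b" + re.escape(term) + r"\b", line, re.IGNORECASE):
                findings.append((line_number, term))
    return findings


deidentified = (
    "Speaker 2: Yes, all good thanks. I was with my mum in LOC.\n"
    "Then got a couple of drinks at the Slug & Lettuce in LOC."
)
watch_list = ["Chelmsford", "Ilford", "Slug & Lettuce", "John"]

print(find_residual_identifiers(deidentified, watch_list))
```

Here the venue name is flagged on the second line. Whether a venue name needs removing is a decision for the project, but a check like this makes sure the question is asked.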

Pseudonymisation

Pseudonymisation is not the same as anonymisation. The formal definition follows, but essentially it involves removing any personal data and replacing it with a code, so that the personal data can be re-attached at any point by someone who holds both the original data and the replacement code.

Pseudonymisation is defined within EU GDPR as “the processing of personal data in such a manner that the personal data can no longer be attributed to a specific data subject without the use of additional information, provided that such additional information is kept separately and is subject to technical and organisational measures to ensure that the personal data are not attributed to an identified or identifiable natural person” (Article 4(5)).

Example of Pseudonymisation of Data (taken from the Irish Data Protection Commission website):

                    Student Name   Student Number   Course of Study
Original Data       Joe Smith      12345678         History
Pseudonymised Data  Candidate 1    XXXXXXXX         History
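The student example above can be sketched in Python to show why pseudonymised data remains personal data: the identifiers are not destroyed, only moved into a separate key. The function names and field names below are assumptions made for this illustration, and the sketch drops the student number from the shared dataset rather than masking it with XXXXXXXX.

```python
def pseudonymise_records(records, id_fields=("student_name", "student_number")):
    """Split records into a pseudonymised dataset and a separate key.

    The key is the 'additional information' that allows re-attribution, so it
    should be stored apart from the pseudonymised data.
    """
    pseudonymised, key = [], {}
    for index, record in enumerate(records, start=1):
        code = f"Candidate {index}"
        key[code] = {field: record[field] for field in id_fields}
        kept = {field: value for field, value in record.items() if field not in id_fields}
        pseudonymised.append({"candidate": code, **kept})
    return pseudonymised, key


def reidentify(row, key):
    """Re-attach the identifiers; only possible for someone holding the key."""
    kept = {field: value for field, value in row.items() if field != "candidate"}
    return {**key[row["candidate"]], **kept}


records = [{"student_name": "Joe Smith", "student_number": "12345678", "course": "History"}]
pseudonymised, key = pseudonymise_records(records)

print(pseudonymised)
print(reidentify(pseudonymised[0], key) == records[0])
```

Anyone with only the pseudonymised list sees "Candidate 1" and the course. Anyone who also holds the key can restore the original record exactly.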

Pseudonymisation is not GDPR Exempt

Pseudonymisation essentially means that anyone who has access to the additional information (the key) can still identify the data subject by cross-referencing. Unsurprisingly, unlike anonymisation, pseudonymisation techniques will not exempt data controllers from the scope of GDPR. However, the process does help academic institutions meet their data protection obligations under UK and EU GDPR, particularly the principles of ‘data minimisation’ and ‘storage limitation’ (Articles 5(1)(c) and 5(1)(e)), and processing for research purposes for which ‘appropriate safeguards’ are required.

To ascertain whether means are reasonably likely to be used to identify the natural person, account should be taken of all objective factors, such as the costs of and the amount of time required for identification, taking into consideration the available technology at the time of the processing and technological developments (EU Recital 26).

Recital 26 provides that “Personal data which have undergone pseudonymisation, which could be attributed to a natural person by the use of additional information should be considered to be information on an identifiable natural person.”

Both of the above sections of Recital 26 mean that pseudonymised personal data can still fall within the scope of the GDPR. The UK Information Commissioner’s Office has given an example of pseudonymisation and refers to its use in these circumstances as good practice for the purpose of data protection.

Pseudonymisation Example

A delivery/courier firm processes personal data about its drivers’ mileage, journeys and driving frequency. It holds this personal data for two purposes:

  • to process expenses claims for mileage; and
  • to charge their customers for the service.

For both of these, identifying the individual couriers is crucial.

However, a second team within the organisation also uses the data to optimise the efficiency of the courier fleet. For this, the identification of the individual is unnecessary.

Therefore, the firm pseudonymises the data by replacing identifiers (drivers’ names, job titles, location data and driving history) with a non-identifying equivalent such as a reference number which, on its own, has no meaning.

The members of this second team can only access this pseudonymised information. The delivery firm can of course, as the data controller, link the material back to the identified individuals.
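The separation between the two teams in this example can be sketched as follows. This is a simplified Python illustration with made-up field names, and it keeps only mileage in the second team's view. The reference format and the use of random tokens are assumptions, not part of the ICO example.

```python
import secrets


def build_views(journeys):
    """Return (reference_table, analytics_view) for a list of journey records."""
    reference_table = {}  # driver name -> reference; held by the data controller only
    analytics_view = []   # what the fleet-optimisation team is allowed to see
    for journey in journeys:
        driver = journey["driver"]
        if driver not in reference_table:
            # A random reference has no meaning on its own.
            reference_table[driver] = "DRV-" + secrets.token_hex(4)
        analytics_view.append(
            {"driver_ref": reference_table[driver], "miles": journey["miles"]}
        )
    return reference_table, analytics_view


journeys = [
    {"driver": "A. Driver", "job_title": "Courier", "miles": 42},
    {"driver": "B. Driver", "job_title": "Senior Courier", "miles": 17},
    {"driver": "A. Driver", "job_title": "Courier", "miles": 30},
]
reference_table, analytics_view = build_views(journeys)

for row in analytics_view:
    print(row)
```

The same driver always receives the same reference, so mileage can still be analysed per driver, while the table linking references to names stays with the controller.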

The motivated intruder test

Where ‘de-identified’ or pseudonymised data is in use there is always a residual risk of re-identification, which is why GDPR still applies. The motivated intruder test can be used to assess the likelihood of this. Once assessed, a decision can be made on whether further steps to de-identify the data are necessary. By applying this test and documenting the decisions, the study will have evidence that the risk of disclosure has been properly considered; this may be a requirement if the study is audited.

Advice on applying the motivated intruder test (MIT) involves researchers thinking about who an intruder might be (internal or external) and what their motivations might be, for example a disgruntled employee, an individual attempting to discredit the research team or an investigative journalist. Researchers should then look at what measures are being taken to protect the data from these threats.

Further Article on Anonymisation – University Transcriptions (our sister site)

Our Accreditations

We are audited annually for Cyber Essentials Plus and hold both the Cyber Essentials and Cyber Essentials Plus certificates. We are UKAS ISO 27001:2022 audited and accredited, and we are an ISO 9001 and ISO 14001 systems accredited company. We are members of the American Translators Association and we are assessed for GDPR compliance annually by IASME (Cyber Assurance Level 1).

10% Profits to Charity

10% of our profits are donated to the Ten Percent Foundation, a charitable trust registered in the UK. Since 2000 over £150,000 has been donated to projects in Africa and the UK. Click here for details.