Information

Data linkage for research in Scotland

Data linkage allows for the true value of our existing data to be realised.


Additional safeguards

There are a number of data linkage safeguards that can be used during a project to ensure that privacy is considered at every step of the process. For example:

Proportionate risk management

Data users and data controllers must assess the level of risk associated with the data linkage they are undertaking and manage that risk responsibly. Matching the safeguards to the level of risk keeps the work efficient, reducing the time and money spent, while maintaining a responsible approach to data linkage in which privacy is considered at an individual level. This is termed ‘proportionate risk management’.
 
It is not in the public interest to undertake unnecessary anonymisation work if the risk is low, as it increases public spend with very little or no benefit. This is why ‘pseudonymisation’ is a common approach to ensuring privacy is maintained when undertaking data linkage for research or statistical purposes, as it takes the most identifying (but not all) fields within a dataset and replaces them with one or more artificial identifiers, or pseudonyms.
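As an illustration only, pseudonymisation of a single record could be sketched in Python as follows. The field names, the example record and the use of a keyed hash (HMAC-SHA256) to generate the pseudonym are all assumptions made for this sketch; the text above does not prescribe a particular method.

```python
import hashlib
import hmac

# Hypothetical choice of which fields count as the most identifying ones.
DIRECT_IDENTIFIERS = ("name", "address", "date_of_birth")


def pseudonymise(record, secret_key):
    """Return a copy of `record` with the direct identifiers replaced by a pseudonym."""
    # Combine only the identifying fields into one string.
    identity = "|".join(str(record[field]) for field in DIRECT_IDENTIFIERS)
    # A keyed hash gives the same pseudonym for the same person every time,
    # but it cannot be recomputed by someone who does not hold the key.
    digest = hmac.new(secret_key, identity.encode("utf-8"), hashlib.sha256).hexdigest()
    # Every other field is kept unchanged (most, but not all, fields are replaced).
    kept = {key: value for key, value in record.items() if key not in DIRECT_IDENTIFIERS}
    return {"pseudonym": digest[:16], **kept}


# Made-up example record.
record = {
    "name": "Jane Example",
    "address": "1 Example Street",
    "date_of_birth": "1980-01-01",
    "occupation": "Teacher",
    "income": 32000,
}
pseudonymised = pseudonymise(record, secret_key=b"example-key-held-by-the-data-controller")
print(pseudonymised)
```

The remaining fields (here, occupation and income) are left in place, which is why pseudonymised data still needs the other safeguards described on this page.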

Additional, unnecessary work is also time-consuming and can substantially delay the production of results. Those results are in the public interest and should therefore be produced and reported in a timely manner to support, for example, better service provision across Scotland. Where a risk is identified, however, it must be addressed adequately.

Statistical disclosure control

This forms part of the data linkage quality assurance process and ensures that the results of the research (termed ‘outputs’) do not include information that could be used to identify an individual. Before the researcher can remove any data or results from the secure environment of the safe haven, they are reviewed by a ‘Research Coordinator’ to ensure that they do not contain identifiable information.

For example, a data linkage project is established to identify whether people living in certain areas are more likely to contract a specific disease. This will allow for the targeted roll-out of a disease prevention programme across Scotland. The results of the data linkage project show that only one person in a remote rural location suffers from the disease. As this information could be used to identify the individual, it is removed from the results ahead of publication.
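The check in this example can be sketched as a simple rule that withholds any count below a minimum size. The threshold of 5, the function name and the area names are assumptions for illustration; the text above does not state the rule a Research Coordinator actually applies.

```python
def apply_disclosure_control(counts_by_area, threshold=5):
    """Split results into counts that can be published and areas that are withheld."""
    publishable = {}
    withheld = []
    for area, count in counts_by_area.items():
        if count < threshold:
            # Too few people: the figure could point to an individual.
            withheld.append(area)
        else:
            publishable[area] = count
    return publishable, withheld


# Made-up results: number of people with the disease in each area.
results = {"Area A": 120, "Area B": 48, "Remote rural area": 1}
publishable, withheld = apply_disclosure_control(results)
print(publishable)  # {'Area A': 120, 'Area B': 48}
print(withheld)     # ['Remote rural area']
```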

Separation of functions

The best way to explain ‘separation of functions’ is to work through an example. The following illustrative example demonstrates how and why having data controllers, indexers and linkers all separated helps protect privacy while achieving data linkage.
 
A researcher wants to study the relationship between health and employment. Her specific research question requires her to analyse four variables: occupation, income, blood count and medication. In order to do the research it is necessary to link data from two different organisations.
 
In the first organisation, the ‘Health Data Controller’ controls a dataset which contains information about blood count and medication. In the second organisation, the ‘Employment Data Controller’ controls a dataset which contains information about employment and income.
 
The ‘Health Data Controller’ takes a copy of his dataset. He attaches a number that he makes up for this purpose, called an Indexing Number, to each record and sends this along with the name, address and date of birth of the people in his dataset to an ‘Indexer’. He does not send any information about blood count or medication to the Indexer.
 
The ‘Employment Data Controller’ takes a copy of her dataset. She also creates and attaches an Indexing Number to each record and sends that along with the name, address and date of birth of the people in her dataset to the same ‘Indexer’. She does not send any information about employment or income to the Indexer.

Note that she makes up the Indexing Number for her data completely independently of the Health Data Controller – they are different indexing numbers.

The ‘Indexer’ links together the two datasets she has received, based on the names, addresses and dates of birth in the datasets. For each person she matches, she creates two ‘Linking Numbers’, one for the health data and one for the employment data, and keeps a look-up table of those linking numbers.

The ‘Indexer’ sends the health Indexing Number back to the ‘Health Data Controller’ with the health linking number attached.

The ‘Indexer’ also sends the employment Indexing Number back to the ‘Employment Data Controller’ with the employment linking number attached.

The ‘Indexer’ then sends the linking number look-up table to the ‘Linker’, and safely destroys the two copies of the names, addresses and dates of birth she originally received. She is left with nothing.
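The Indexer's part of the process can be sketched as follows. This is a simplified illustration: it assumes exact matching on name, address and date of birth, random tokens as linking numbers, and that only matched records receive linking numbers. The function name, field names and sample records are all hypothetical.

```python
import secrets


def index_datasets(health_ids, employment_ids):
    """Match two identifier-only files and issue a pair of linking numbers per match."""
    def identity(row):
        return (row["name"], row["address"], row["date_of_birth"])

    employment_by_identity = {identity(row): row for row in employment_ids}
    health_return, employment_return, lookup_table = [], [], []
    for health_row in health_ids:
        employment_row = employment_by_identity.get(identity(health_row))
        if employment_row is None:
            continue  # assumption: unmatched records get no linking number
        # Two independent linking numbers, one per data controller.
        health_ln = secrets.token_hex(8)
        employment_ln = secrets.token_hex(8)
        health_return.append(
            {"indexing_number": health_row["indexing_number"], "linking_number": health_ln})
        employment_return.append(
            {"indexing_number": employment_row["indexing_number"], "linking_number": employment_ln})
        # The look-up table holds linking numbers only, never names or addresses.
        lookup_table.append(
            {"health_linking_number": health_ln, "employment_linking_number": employment_ln})
    return health_return, employment_return, lookup_table


# Made-up identifier files, as sent by the two data controllers.
health_ids = [
    {"indexing_number": "H-001", "name": "Jane Example",
     "address": "1 Example Street", "date_of_birth": "1980-01-01"},
    {"indexing_number": "H-002", "name": "John Sample",
     "address": "2 Sample Road", "date_of_birth": "1975-06-30"},
]
employment_ids = [
    {"indexing_number": "E-417", "name": "Jane Example",
     "address": "1 Example Street", "date_of_birth": "1980-01-01"},
    {"indexing_number": "E-902", "name": "Ann Other",
     "address": "3 Other Lane", "date_of_birth": "1990-12-12"},
]
health_return, employment_return, lookup_table = index_datasets(health_ids, employment_ids)
print(len(lookup_table))
```

Each of the three outputs goes to a different party: `health_return` to the Health Data Controller, `employment_return` to the Employment Data Controller, and `lookup_table` to the Linker.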

The ‘Health Data Controller’ now attaches the health linking number he has been sent to a copy of his dataset, using the indexing number he had made up so he knows which linking number matches which record. He then sends blood count, medication and linking number to the ‘Linker’. He does not provide the ‘Linker’ with the names, addresses and dates of birth of the people in the dataset, or the indexing number. He destroys his copy of the linking number.

And the ‘Employment Data Controller’ does the same thing: she attaches the employment linking number she has been sent to a copy of her dataset, using the indexing number she had made up so she knows which linking number matches which record. She then sends employment, income and linking number to the ‘Linker’. She does not provide the ‘Linker’ with the names, addresses and dates of birth of the people in the dataset, or the indexing number. She destroys her copy of the linking number.
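The step each data controller carries out here can be sketched as below, using the employment data as the example. The function name, field names and values are hypothetical, and skipping records that have no linking number is an assumption of this sketch.

```python
def prepare_for_linker(dataset, indexer_return, payload_fields):
    """Attach linking numbers via the indexing number and keep only the payload fields."""
    linking_by_index = {
        row["indexing_number"]: row["linking_number"] for row in indexer_return
    }
    outgoing = []
    for row in dataset:
        linking_number = linking_by_index.get(row["indexing_number"])
        if linking_number is None:
            continue  # assumption: records without a linking number are not sent
        # Copy across only the research variables: no names, addresses,
        # dates of birth or indexing numbers leave the data controller.
        out = {field: row[field] for field in payload_fields}
        out["linking_number"] = linking_number
        outgoing.append(out)
    return outgoing


# Made-up employment dataset and the file returned by the Indexer.
employment_dataset = [
    {"indexing_number": "E-417", "name": "Jane Example", "address": "1 Example Street",
     "date_of_birth": "1980-01-01", "employment": "Teacher", "income": 32000},
    {"indexing_number": "E-902", "name": "Ann Other", "address": "3 Other Lane",
     "date_of_birth": "1990-12-12", "employment": "Nurse", "income": 35000},
]
indexer_return = [{"indexing_number": "E-417", "linking_number": "e-27c1"}]

to_linker = prepare_for_linker(employment_dataset, indexer_return, ("employment", "income"))
print(to_linker)  # [{'employment': 'Teacher', 'income': 32000, 'linking_number': 'e-27c1'}]
```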

The ‘Linker’ can now join blood count, medication, employment and income together using the linking number look-up table. He then deletes the linking numbers and the look-up table and adds a new unique and meaningless ‘Project Number’ to each record. He deposits the linked dataset in a safe haven for the researcher to access.
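The Linker's join can be sketched as follows. The function name, field names and sample values are hypothetical, and a random token stands in for the ‘Project Number’; this sketch does not check the tokens for the extremely unlikely case of a duplicate.

```python
import secrets


def link_records(health_rows, employment_rows, lookup_table):
    """Join the two payload files through the look-up table, then drop the linking numbers."""
    health_by_ln = {row["linking_number"]: row for row in health_rows}
    employment_by_ln = {row["linking_number"]: row for row in employment_rows}
    linked = []
    for pair in lookup_table:
        health = health_by_ln.get(pair["health_linking_number"])
        employment = employment_by_ln.get(pair["employment_linking_number"])
        if health is None or employment is None:
            continue  # one side did not arrive, so there is nothing to join
        # A random token carries no meaning and cannot be traced back.
        record = {"project_number": secrets.token_hex(8)}
        # Copy the research variables but leave the linking numbers behind.
        record.update({k: v for k, v in health.items() if k != "linking_number"})
        record.update({k: v for k, v in employment.items() if k != "linking_number"})
        linked.append(record)
    return linked


# Made-up files received by the Linker.
health_rows = [{"linking_number": "h-9f3a", "blood_count": 4.7, "medication": "Drug A"}]
employment_rows = [{"linking_number": "e-27c1", "employment": "Teacher", "income": 32000}]
lookup_table = [{"health_linking_number": "h-9f3a", "employment_linking_number": "e-27c1"}]

linked = link_records(health_rows, employment_rows, lookup_table)
print(sorted(linked[0]))
# ['blood_count', 'employment', 'income', 'medication', 'project_number']
```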

At the end of this process:

  • The ‘Health Data Controller’ has learned nothing new about the people in his dataset. He hasn’t seen any information about their employment or income.
  • The ‘Employment Data Controller’ has learned nothing new about the people in her dataset. She hasn’t seen any information about their blood count or medication.
  • The ‘Indexer’ saw a lot of names, addresses and dates of birth, but did not see any information about those people’s income, employment, blood count or medication.
  • The ‘Linker’ and the researcher see data about income, employment, blood count and medication but no names, addresses or dates of birth.