Long term survey strategy: mixed mode research report

Findings from research exploring mixed mode survey designs in the context of the Scottish Government’s general population surveys. The report details information on key issues, potential mitigations and remaining trade-offs, and includes 21 case studies on relevant surveys.


8. Impacts of changing or mixing modes on trends and time series

The ability to measure change over time was seen as a key component of the value of all three of the major Scottish Government general population surveys by stakeholders interviewed for this study. Each of the three surveys has been used to generate trends and time series that now stretch back over several decades. For example:

  • SHeS data is used to track changes in the health of the population over time, both to understand how health service needs are likely to change, and to understand the impact of external factors, such as policy interventions
  • SHS is used to look at changing satisfaction with public services, and SHCS is used to look at changes in the energy efficiency of housing stock, and
  • SCJS is used to look at the changing levels and nature of crime and victimisation over time.

Understanding the likely impact of changing or mixing modes on these trends and time series is therefore a key issue.

The first thing to note in this regard is that mode impacts on trends are a direct result of the issues discussed in the previous four chapters – and in particular, the mode-specific measurement effects discussed in chapter 6. Second, the scale of the impact on trends may vary between different items/questions in the same survey. The types of question most at risk of mode effects, as discussed in chapter 6, are also the items most at risk of discontinuity as a result of changing modes. Straightforward, factual questions are less susceptible to change as a result of mode transition, as long as the final weighted profile of respondents remains similar, while questions on sensitive topics and attitudinal questions may be more susceptible to mode effects and therefore to significant discontinuities in trends.

It is also important to remember that, as discussed in chapter 6, mode effects are not always predictable or consistent: there may be factual questions on the Scottish Government surveys that receive significantly different responses when asked via different modes, and attitudinal questions where response patterns remain similar. Moreover, other factors can affect observed trends even in the absence of changing modes. For example, falling response rates could alter sample composition in a way that affects trends, public trust in government could fall and reduce the honesty of reporting on certain trend questions, or public understanding of a concept could change in such a way that a question changes what it is measuring over time.

Understanding and addressing the impacts of mode change on trends

Parallel testing

There is a broad consensus in the literature and across experts interviewed for this study that ‘parallel testing’ is the ‘gold standard’ for understanding mode effects – and by extension the likely impact on trends – of alternative mode designs. Parallel testing, or ‘parallel runs’, involves running a survey with the new design at the same time as continuing to run the survey with the previous design so that the results can be directly compared. For example, if the Scottish Government were considering moving one of its three major general population surveys to a push-to-web survey with telephone follow-up of nonresponders, a parallel run would involve implementing this new design alongside continuing to deliver the survey using its current face-to-face approach. Detailed comparisons can then be conducted to understand the extent and nature of any differences in estimates.
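To illustrate the type of comparison a parallel run makes possible, the sketch below contrasts a single design-weighted estimate from an existing face-to-face sample with the same estimate from a hypothetical push-to-web parallel sample, using a simple two-proportion test based on Kish effective sample sizes. This is a minimal illustration only: all responses, weights and names are invented, and a real comparison would use the surveys' actual design weights and complex-sample variance estimation.

```python
# Illustrative sketch only: comparing one estimate from a parallel run.
# All figures and names are hypothetical; a real comparison would use the
# survey's design weights and complex-sample variance estimation.
import math

def weighted_proportion(values, weights):
    """Design-weighted proportion of a 0/1 outcome."""
    return sum(v * w for v, w in zip(values, weights)) / sum(weights)

def effective_n(weights):
    """Kish effective sample size: a rough allowance for unequal weights."""
    return sum(weights) ** 2 / sum(w * w for w in weights)

def compare_modes(vals_f2f, wts_f2f, vals_web, wts_web):
    """Two-proportion z-test for a face-to-face vs push-to-web parallel run."""
    p1 = weighted_proportion(vals_f2f, wts_f2f)
    p2 = weighted_proportion(vals_web, wts_web)
    n1, n2 = effective_n(wts_f2f), effective_n(wts_web)
    pooled = (p1 * n1 + p2 * n2) / (n1 + n2)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n1 + 1 / n2))
    z = (p1 - p2) / se if se > 0 else float("nan")
    return p1, p2, p1 - p2, z

# Hypothetical micro-example with made-up responses and weights.
f2f_vals = [1, 0, 1, 1, 0, 1]; f2f_wts = [1.2, 0.8, 1.0, 1.1, 0.9, 1.0]
web_vals = [1, 1, 1, 0, 1, 1]; web_wts = [1.0, 1.3, 0.7, 1.0, 1.1, 0.9]
print(compare_modes(f2f_vals, f2f_wts, web_vals, web_wts))
```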

In practice, the extent and nature of pre-testing that has been conducted for surveys that have changed mode in the UK has varied considerably. For example, the UK Labour Force Survey has run a full-scale parallel run of the Transformed Labour Force Survey alongside the main survey for over two years, and has announced that this parallel run will continue until at least early 2025 with potential changes to the TLFS methodology to follow.[46] In contrast, the Participation Survey, which replaced Taking Part, was introduced without any parallel testing (although the approach taken was very similar to that used on the Community Life Survey, which had involved parallel runs). In part, this was a result of timing (the new survey started during the pandemic), but it also reflected resourcing considerations. In general, however, most of the large-scale probability surveys reviewed for this study that had, or were considering, transitioning to a new mode or combination of modes had undertaken some form of parallel testing before changing mode (see case studies in Annex A).

Calibration

Parallel runs are arguably essential for fully understanding the extent to which estimates are likely to be different when collected via different modes, as well as helping researchers identify changes to survey design that could reduce differences (for example, using TDM or RCD approaches, as discussed in chapter 6). However, on their own they do not correct for differences between modes that, in turn, may create discontinuities in survey trends.

In order to attempt to control or correct for such differences, modelling to recalibrate the data is required. In principle, this could either attempt to reweight the ‘old’ data in a time series to match data collected via the new mode(s), or it could attempt to weight the new data to match the older time series.

However, in practice, recalibration of survey trend data to attempt to account for changes in mode has been very rare, although some European surveys, such as the Dutch Crime and Victimisation survey, use this approach (see case study in Annex A). Most of the calibration discussed in the literature on mixed mode surveys, such as that proposed by Pfeffermann (Pfeffermann and Preminger, 2021) or van den Brakel (Buelens and van den Brakel, 2014), is primarily intended to account for mode effects within mixed mode surveys (i.e. to account for differing distributions of responses across the different modes being used at the same time), rather than to account for transition to a new mode or modes.

On the rare occasions where methodologists have attempted to calibrate data to adjust for change in mode over time, the emphasis is normally on calibrating the historical data to match the ‘new’ data, rather than adjusting the current data. This is preferred to ensure that the data going forwards is as accurate as possible and to avoid having multiple estimates (i.e. calibrated and uncalibrated) in new data releases. ONS has done some work in this area, adjusting legacy Labour Force Survey estimates to newer estimates.[47] However, there were no other examples of such calibration techniques having been applied in practice on any of the UK-based cross-sectional surveys that had transitioned to a new mode or modes reviewed for this study.
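As a very simplified illustration of what calibrating the historical data to match the ‘new’ data can involve, the sketch below estimates a ratio between new-mode and old-mode estimates from a parallel-run overlap period and applies it to earlier points in the series, re-expressing them on the new-mode basis. The figures are entirely hypothetical, and the approach shown ignores both the sampling uncertainty in the adjustment factor and the item-level variation in mode effects discussed below; actual work in this area (such as the ONS adjustments to legacy Labour Force Survey estimates) is considerably more sophisticated and model-based.

```python
# Hypothetical sketch of re-expressing a historical series on a "new mode" basis
# using a ratio adjustment estimated from a parallel-run overlap period.
# All figures are invented and serve only to illustrate the principle.

old_series = {2016: 62.0, 2017: 61.5, 2018: 60.8, 2019: 60.2}   # old-mode estimates (%)
overlap = {"old_mode": 59.5, "new_mode": 55.0}                   # same period, both designs

# Ratio of the new-mode estimate to the old-mode estimate in the overlap period.
adjustment = overlap["new_mode"] / overlap["old_mode"]

# Historical estimates re-expressed on the new-mode basis.
calibrated_series = {year: round(est * adjustment, 1) for year, est in old_series.items()}

print(f"Adjustment factor: {adjustment:.3f}")
print(calibrated_series)
```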

The experts interviewed for this study emphasised the challenges associated with such approaches. There was a view that it is not appropriate to model old data based on a new mode design, as the initial participants would not have been given the option of completing the survey via a different mode. Attempting this type of calibration also involves various assumptions – including that the relationship between mode of participation and response would be exactly the same over time – which some experts considered highly questionable, since factors such as comfort with and patterns of internet use are changing over time. For example, older people and those from more deprived areas are more likely to have internet access than they were a few years ago, and therefore assumptions about how comfortable these groups are taking part online need to change over time. It was pointed out that it is not necessarily possible to say whether the old or new data is more ‘accurate’, so whichever direction you decide to calibrate in might be open to question. Finally, as mode effects may apply at an item level rather than a whole survey level, different calibrations might be required for different measures within a single survey. On a practical level, methodologists highlighted that this type of modelling is very complex and resource intensive, and might simply not be feasible at a large scale.

“It is undoable to adjust the full set of outputs. Usually, it is limited to a set of key statistics and at general population level or some main subdomains.” (Expert interviewee 2)

It also introduces undesirable additional complexity for data users, who would have to grapple with additional estimates and weights for analysing the ‘old’ data (assuming the approach adopted was to recalibrate previous estimates to ‘match’ the current data).

Overall, the literature and expert interviews both indicate that while calibration to try to avoid discontinuity in the time series is technically possible (albeit with some concerns about the assumptions underpinning this), it is widely viewed as practically infeasible. Rather, the consensus view across experts interviewed for this study appeared to be that “once you have transitioned, there is an effective break in the time series”. This is reflected in technical reports and advice provided on analysing data from before and after a mode transition. For example, researchers at the Food Standards Agency, writing about the transition from the face-to-face Food and You survey to the push-to-web and paper Food and You 2, noted:

“The change in methodology has allowed us to future proof the Food and You survey, but has also meant that there is a break in the data time-series, and findings between Food and You 2 and Food and You 1 are not directly comparable.”[48]

Maintaining time series vs other priorities?

If statistical solutions, like calibration, are not available or practical, there are two main alternatives for handling the discontinuity introduced by changing mode(s).

The first option is to avoid changing mode altogether. Concern about breaking trends was a key reason why the Childcare and Early Years survey has not transitioned after a push-to-web pilot (although a further push-to-web pilot is currently underway). A variation on this would be to maintain the current mode design alongside a new design – for example, the Crime Survey for England and Wales (CSEW) has recently transitioned to a panel design, maintaining the initial face-to-face interview at wave 1 (preserving time series) and adding a telephone follow-up at wave 2 (to achieve a bigger sample) (see case studies in Annex A).

The second option is to accept that changing mode is likely to result in a break in the time series and to communicate the implications of this clearly and transparently to stakeholders and data users. This was the approach favoured by most of the experts interviewed for this study. The abruptness of any break in time series can also be mitigated by parallel runs which, particularly if undertaken over several years, can provide the first few points in a new time series prior to a survey fully transitioning mode(s). Indeed, some have argued that if transition to an online or mixed mode design is likely at some point in the future, it may be better to move sooner rather than later in order to start developing future trends. For example, in his assessment of options for the Gambling Survey for Great Britain, Sturgis (2024) argues:

“A final factor currently pulling surveys to online self-completion is that this transition seems inevitable for most surveys at some point in the coming years anyway. Given the likely continuation and exacerbation of the problems hampering conventional modes of surveying, there is a strong case that transitioning from conventional to online modes should be implemented sooner rather than later. Another way of considering this is that, while moving surveys online will reduce backward comparability, it has the offsetting benefit of improving comparability with surveys that will be carried out in the future.”

Whether or not breaking trends is considered acceptable will of course depend on survey stakeholder priorities. In this context, it is worth noting that the Scottish Government survey stakeholders interviewed for this study were divided on whether they viewed a potential break in the time series as a ‘red line’ in terms of any change in mode, or whether they were prepared to accept this if it would help secure other priorities (in particular, a bigger sample size), provided it remained possible to track change in future years. A related stakeholder view was that some questions in the three surveys were maintained to retain a time series but were arguably of less relevance to current policy needs, given societal and policy changes over time. From this perspective, a break in trends could potentially allow the topics and question wording on the surveys to be reviewed, to ensure that the most useful questions for current needs are included.

Similar tensions and competing views were apparent across the expert interviews conducted for this study. While some, particularly those involved in delivering long-standing surveys, emphasised the importance of time series data to users, another view was that “the time series is important, but it shouldn’t be put on a pedestal.” It was pointed out that in social surveys there is often very little change from one year to the next, so time series are perhaps not always particularly useful in informing policy. Moreover, it was argued that in reality, many social surveys regularly break time series already by changing topics or question wording.

In practice, the approach survey researchers have taken to handling the impact of mode change on trends is perhaps not quite as stark as the two options above suggest. Across the expert interviews and case studies, there was also evidence of a third option, whereby if parallel testing suggests that there is little difference in particular estimates between old and new designs, researchers continue to use the time series, albeit with some caution and caveating to acknowledge the potential impacts of mode change. Of course, this is only an option for those measures where parallel testing does not suggest significant differences by mode. Where mode effects are apparent and cannot be mitigated by careful redesign or other measures, the best option appears to be clarity and transparency, accompanied by robust stakeholder engagement around the trade-offs involved – something we return to in chapter 12.

Summary framework to help guide consideration of future mode on SHS, SHeS and SCJS: Time series
Cross-cutting issues

Priority considerations / issues: Impacts on trends are a direct result of the issues discussed in previous sections – particularly mode effects, but also issues relating to sample composition and representativeness.
Potential mitigations: The more that can be done to mitigate mode effects and maintain sample representativeness, the greater the potential to reduce impacts on time series.
Remaining issues and trade-offs: As noted previously, mode effects can be unpredictable and cannot be completely avoided.

Priority considerations / issues: The scale of impact on trends may vary between questions in the same survey, with those most at risk of mode effects also at most risk of discontinuity.
Potential mitigations: Work to assess the likelihood of mode effects will also help in identifying those questions where maintaining time series is likely to be an issue.

Priority considerations / issues: Understanding the impact of changing mode on trends was seen as a crucial element in informing decisions about mode change.
Potential mitigations: Parallel testing is seen as the gold standard for understanding mode effects (and by extension, likely impacts on trends). This also has the advantage of starting to build a new time series (while the ‘old’ mode is still being used).
Remaining issues and trade-offs: The time and resource available will impact on the degree of parallel testing that is feasible before reaching a decision. For example, ONS have run a full-scale parallel run of the Transformed Labour Force Survey for over two years and have announced this will extend to at least early 2025. In contrast, the Participation Survey, which replaced Taking Part, was introduced without any parallel testing.

Priority considerations / issues: The consensus across experts is that it is not, in practice, possible to use statistical techniques to avoid a break in the time series once a survey transitions to a different mode design.
Potential mitigations: In theory, recalibration of survey trends to account for changes in modes is an option. In practice, this has rarely been done and was regarded by experts as infeasible at large scale, resource intensive, adding complexity for users, and resting on assumptions that are open to question.
Remaining issues and trade-offs: Recalibration of all measures to maintain time series is not likely to be a feasible practical option. Recalibration of even a small number of measures may not be practical (or desirable) given the issues identified. The consensus view was that changing mode means accepting an effective break in the time series.

Priority considerations / issues: Avoiding a break in the time series may be a reason not to change mode (or to maintain some data collection of key trends by the existing mode). For example, concern about breaking trends was cited as a reason why the Childcare and Early Years survey has not transitioned after a push-to-web pilot.
Potential mitigations: An alternative view (rather than a mitigation) is that it is better to transition sooner rather than later to start developing future trends.
Remaining issues and trade-offs: The trade-off between maintaining trends and other priorities – including a bigger sample size and adapting data collection to current policy needs – is a central issue for any decision on future mode, and is likely to be a priority area for stakeholder engagement.

Contact

Email: sscq@gov.scot
