Long term survey strategy: mixed mode research report

Findings from research exploring mixed mode survey designs in the context of the Scottish Government’s general population surveys. The report details information on key issues, potential mitigations and remaining trade-offs, and includes 21 case studies on relevant surveys.


9. Survey quality metrics

Introduction

The preceding chapters of this report have all focused on the impact of changing or mixing modes on different elements of survey quality, drawing heavily on those elements identified by the Total Survey Error framework (TSE). In this chapter, we consider approaches to assessing the overall quality of surveys, which might in turn be used in comparing the quality of different mode designs (including the current approach) as options for the three main Scottish Government general population surveys.

Components of survey quality

Assessing the quality of a survey is a complex task, requiring one to choose both which features relating to quality are important, and how each should be appraised. Complicating this further is the fact that a given survey may be of interest to different audiences – for instance academic researchers, policy makers, charitable institutions, and product designers – and each of these audiences may place a different weight on different measures of survey quality.

The subtle differences in how different audiences understand ‘quality’ were discussed by Scottish Government survey stakeholders interviewed for this study. It was suggested that:

  • For Ministers, ‘quality’ was seen as something that enables them to speak with confidence on the issues they are concerned with, having been reassured by others that the data is robust. Timeliness of data was also seen as a primary concern for Ministers.
  • For wider external stakeholders, such as academics and local authorities, quality was also associated with the level of granularity of the data, and the extent to which it can facilitate detailed geographic and other sub-group analysis (this was reflected both in comments from external stakeholders themselves and internal Scottish Government stakeholders’ views on their priorities).
  • For charities who use the survey data, it was suggested that they are often most concerned with the precise wording of specific questions and its relevance to their concerns rather than more general methodological issues.

The different emphases that different stakeholders apply underline the importance of thinking not only about statistical accuracy (as emphasised by the focus on minimising representation and measurement error in the TSE framework), but also about usability when assessing survey quality. Neglecting user-centric dimensions can lead to data that is unfit for use even if it is highly accurate. For instance, a methodological change on a repeated cross-sectional survey might increase or reduce data collection and processing times, with implications for the timeliness (and therefore usability) of the data.

The European Statistical System (ESS) and ONS both take greater account of these usability aspects than the TSE framework, through the five quality dimensions they use: ‘relevance’, ‘accuracy and reliability’, ‘timeliness and punctuality’, ‘coherence and comparability’ and ‘accessibility and clarity’.[49] Biemer (2010) presented the following nine dimensions of survey quality, which he identified as commonly used in national statistical offices and non-governmental organisations worldwide:

  • accuracy (survey error is minimised)
  • credibility (trust is placed in the data by users and survey methodologists)
  • comparability (valid comparisons can be made over time and within survey subgroups)
  • usability (clear technical documentation is available and the data are well organised and labelled)
  • relevance (the survey data satisfies the demands of users)
  • accessibility (the data are easily located and are user friendly)
  • timeliness (data is published to schedule and in good time for its intended applications)
  • completeness (data provides answers to research questions, while minimising respondent burden), and
  • coherence (data collected from different modes, sources or instruments are reliably combined).

Response rates as a measure of survey quality

In spite of the multiple dimensions of quality identified above, overall survey response rate has often been used as a key metric for assessing survey quality in practice (Biemer and Lyberg, 2003), albeit generally alongside other features like sample size and whether the survey used probability sampling.

The importance placed on response rates has led journals in some fields to place minimum response rate requirements of up to 60% on studies submitted for review (e.g. JAMA, 2024). However, a survey’s response rate contributes to only one of the many aspects of survey quality (the ‘accuracy’ dimension). Moreover, as discussed in detail in chapter 5, the relationship between response rates and non-response bias is fairly weak. This weak relationship has led survey methodologists to question whether response rates should continue to be used as an indicator of survey quality. Instead, the view has been expressed that other summary measures may be better suited as proxy indicators of nonresponse bias, such as R-indicators, which summarise the composition of the responding sample (Schouten et al., 2012), or the fraction of missing information (Wagner, 2010).
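
To illustrate the first of these, the R-indicator summarises how much estimated response propensities vary across the issued sample: values near 1 indicate that all sample members were roughly equally likely to respond, while lower values indicate less representative response. The Python sketch below is a minimal, hypothetical illustration of the calculation, assuming propensities are estimated by logistic regression on auxiliary frame variables; the variable names and data are invented, and this is not any specific survey’s implementation.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression

# Hypothetical issued sample: one row per case, with auxiliary variables
# known for respondents and non-respondents alike (names are invented).
rng = np.random.default_rng(42)
n = 5000
frame = pd.DataFrame({
    "urban": rng.integers(0, 2, n),          # 1 = urban address
    "flat": rng.integers(0, 2, n),           # 1 = flat / apartment
    "simd_quintile": rng.integers(1, 6, n),  # area deprivation quintile
})

# Simulated fieldwork outcome; in practice this comes from the survey itself.
true_propensity = 1 / (1 + np.exp(-(0.3 - 0.4 * frame["urban"]
                                    + 0.1 * frame["simd_quintile"])))
frame["responded"] = rng.binomial(1, true_propensity)

# Step 1: estimate response propensities from the auxiliary variables.
X = frame[["urban", "flat", "simd_quintile"]]
rho_hat = LogisticRegression().fit(X, frame["responded"]).predict_proba(X)[:, 1]

# Step 2: R-indicator = 1 - 2 * standard deviation of the estimated
# propensities (Schouten et al.); ranges from 0 (least representative) to 1.
r_indicator = 1 - 2 * rho_hat.std(ddof=1)

print(f"Response rate: {frame['responded'].mean():.1%}")
print(f"R-indicator:   {r_indicator:.3f}")
```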

The experts interviewed for this study shared the view that response rates are a “blunt tool” for assessing survey quality. It was felt that, particularly given the current context of challenging response rates across social surveys, alternative measures of data quality should be considered. There was also a perception that, since the survey landscape is shifting rapidly, it is difficult to know what a good response rate is, or will be in the next couple of years, even for a face-to-face survey (which has long been considered the ‘gold standard’ approach).

However, there was a sense among experts that response rates remain very important to some survey users. Interviewees discussed attempting to move away from this traditional approach to focus on a suite of quality indicators (potential examples of which are discussed below), but finding it difficult to convey alternative quality measures to others still concerned with overall response. This may reflect the relative simplicity of a single response rate compared with the complexity of interpreting a combined set of quality metrics.

Alternatives to response rate

An SRA Scotland event on the Future of Random Probability Surveys in 2017 posed the question, “If response rate is not a good indicator of survey quality, what is?”[50] It must be acknowledged that, in practice, quantifying quality for many of the dimensions discussed above is both challenging and contentious, and deriving a single aggregated metric of survey quality is unlikely to be feasible. Instead, where researchers have attempted to move away from a singular reliance on response rates, the dimensions included in frameworks like TSE, ESS or other examples have tended to serve as checklists for use in the design and evaluation of surveys.

An example of the use of these wider dimensions in quality evaluation is the quarterly ‘Survey performance and quality monitoring’ reports for the ONS’s Labour Force Survey. These include the total response rate, but also response rates for particular key measures, attrition rates, and a breakdown of sample composition by respondent characteristics. They also discuss (qualitatively) other aspects of the quality of the survey, including relevance, primary purpose, users and uses, accessibility and clarity, changes to the design that affect comparability over time, and coherence (focusing on comparisons between LFS data and other sources of data on the same topics), and include measures of timeliness and punctuality.

Another example of an alternative to response rates is the quality review and improvement system, ASPIRE, developed by Statistics Sweden, which again attempts to reflect different components of quality, similar to ESS.[51] The ASPIRE manual suggests an approach to assessing the accuracy components of quality (which draw strongly on TSE) whereby the most important risks to accuracy are identified and then each type of risk is assessed (by external reviewers) based on six evaluation criteria:

  • C1 Available expertise (members of the production team or other internal expertise)
  • C2 Compliance with standards and best practices
  • C3 Knowledge of requirements, achievements, and improvement needs
  • C4 Plans for improvement activities
  • C5 Results of improvement activities and findings from other evaluations
  • C6 Communication with users and data suppliers.

The ASPIRE manual then includes guidance on steps in rating different components and sub-components of accuracy, with the aid of various sets of checklists, in order to produce a table of ratings, such as that shown below.

Figure 9.1: Tabulation of ratings for a typical product, from the ASPIRE manual

Row | Sub- and sub-sub-components of accuracy | Average score | C1 | C2 | C3 | C4 | C5 | C6 | Importance to overall accuracy (single sources of uncertainty)
1 | Overall accuracy | 35 | 5 | 3 | 3 | 4 | 3 | 3 |
2 | Sources of uncertainty: | 35 | | | | | | |
3 | - Sampling | 32 | 5 | 3 | 3 | 5 | 1 | 2 | M
4 | - Frame coverage | 28 | 5 | 3 | 3 | 3 | 1 | 2 | M
5 | - Measurement | 40 | 7 | 6 | 3 | 5 | 1 | 2 | M
6 | - Non-response | 42 | 7 | 5 | 5 | 4 | 1 | 3 | L
7 | - Data processing | 43 | 6 | 4 | 3 | 6 | 5 | 2 | L
8 | - Model assumptions | 32 | 4 | 4 | 3 | 6 | 1 | 1 | H
9 | Preliminary statistics compared with final statistics | 35 | 6 | 3 | 4 | 5 | 1 | 2 |

(C1 to C6 are the six evaluation criteria listed above; the final column rates the importance of each single source of uncertainty to overall accuracy as H (high), M (medium) or L (low).)
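
The ‘Average score’ column in the example above appears to be the mean of the six criterion ratings (each on a ten-point scale) expressed as a percentage of the maximum: for instance, the Measurement row averages (7+6+3+5+1+2)/6 = 4.0, shown as 40. The short Python sketch below reproduces that calculation for the ‘sources of uncertainty’ rows; this scoring rule is inferred from the published example rather than quoted from the ASPIRE manual.

```python
# Reproduce the 'Average score' column of Figure 9.1 from the six criterion
# ratings, assuming each criterion is rated on a 1-10 scale and the score is
# the mean rating expressed as a percentage of the maximum (an inference).
ratings = {
    # component:         (C1, C2, C3, C4, C5, C6)
    "Sampling":          (5, 3, 3, 5, 1, 2),
    "Frame coverage":    (5, 3, 3, 3, 1, 2),
    "Measurement":       (7, 6, 3, 5, 1, 2),
    "Non-response":      (7, 5, 5, 4, 1, 3),
    "Data processing":   (6, 4, 3, 6, 5, 2),
    "Model assumptions": (4, 4, 3, 6, 1, 1),
}

for component, scores in ratings.items():
    average_score = round(sum(scores) / len(scores) * 10)  # % of the 10-point maximum
    print(f"{component:<18} {average_score}")
```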

While the ASPIRE manual also suggests translating these scores into symbols (reflecting weak to excellent ratings) to aid interpretation, a table summarising quality on the accuracy dimension with nine rows and eight columns is still likely to be more difficult to digest and interpret than a single metric. This, in part, explains the enduring appeal of response rates.

Experts interviewed for this study recognised the challenges of developing alternatives to response rates. Their suggestions focused on two main areas: nonresponse bias and item response patterns. On the former, the main suggestion was to pay more systematic attention to response distributions by sub-group in order to assess potential biases (along the lines of the detailed tables in the ONS LFS quality reports). This is also likely to include triangulating sub-group response distributions against other sources of data (e.g. administrative data or the census) to identify potential biases.
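
In practice, this kind of triangulation can be as simple as tabulating unweighted and weighted sample proportions for key sub-groups alongside an external benchmark (such as mid-year population estimates or the census) and flagging material gaps. The sketch below is a hypothetical Python illustration; the categories, percentages and one-percentage-point tolerance are invented placeholders, not real Scottish survey figures.

```python
import pandas as pd

# Hypothetical achieved-sample and benchmark distributions for one
# characteristic (age group); all figures are invented for illustration.
profile = pd.DataFrame({
    "age_group":      ["16-34", "35-54", "55-74", "75+"],
    "unweighted_pct": [22.0, 30.5, 33.0, 14.5],   # achieved sample
    "weighted_pct":   [28.5, 32.0, 28.0, 11.5],   # after calibration weighting
    "benchmark_pct":  [29.0, 32.5, 27.5, 11.0],   # e.g. mid-year estimates
})

# Gaps from the benchmark show where non-response may have skewed the sample
# and how far the weighting corrects for it.
profile["unweighted_gap"] = profile["unweighted_pct"] - profile["benchmark_pct"]
profile["weighted_gap"] = profile["weighted_pct"] - profile["benchmark_pct"]

# Flag sub-groups where even the weighted estimate remains noticeably off.
TOLERANCE_PP = 1.0  # percentage points (placeholder)
profile["flag"] = profile["weighted_gap"].abs() > TOLERANCE_PP

print(profile.to_string(index=False))
```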

In terms of item-level response patterns, experts suggested monitoring patterns of response over time for selected variables that can either be compared with administrative data and/or are not expected to change particularly quickly. This might also include monitoring the differences in response on key items between subgroups for any inconsistent or unexplained changes. It was acknowledged that it can be challenging to correctly disentangle changes in response patterns that reflect genuine change in society from those that reflect mode or design changes that might impact on quality. For example, several experts who had observed changes in response to key survey questions during the Covid period noted the difficulty of knowing if this change stemmed from the social upheaval associated with the pandemic, or reflected changes in mode necessitated by pandemic restrictions. However, patterns like higher levels of skips or ‘refused to answer’ at particular questions, alongside patterns of response that appear implausible in comparison with administrative or other data, can be indicators of quality issues. Cognitive testing can also help to support the process of assessing whether particular indicators are capturing the phenomenon they are designed to measure, and can therefore also support assessments of quality when considering a new mode (although as noted previously, if the question was not cognitively tested with the existing mode, it is also possible that it was not an accurate measure in the first place).
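
One way to operationalise this kind of monitoring is to track, for a small set of sentinel items, both the rate of ‘don’t know’ or ‘refused’ responses and the headline estimate by survey year, flagging movements that exceed an agreed tolerance. The Python sketch below is a hypothetical illustration; the item name, response codes and tolerance are assumptions for the example, not features of the SHS, SHeS or SCJS datasets.

```python
import pandas as pd

# Hypothetical respondent-level records for one sentinel item across three
# survey years. Codes: 1/0 = substantive answers, -8 = don't know, -9 = refused.
responses = pd.DataFrame({
    "year": [2021] * 5 + [2022] * 5 + [2023] * 5,
    "item": [1, 0, 0, -9, 0,
             0, 0, -8, -9, 1,
             -9, -8, 0, -9, 1],
})

by_year = responses.groupby("year")["item"].agg(
    item_nonresponse=lambda s: s.isin([-8, -9]).mean(),  # share of DK/refused
    estimate=lambda s: s[s >= 0].mean(),                 # mean of substantive answers
)

# Flag year-on-year movements larger than an agreed tolerance (placeholder).
TOLERANCE = 0.10
by_year["nonresponse_flag"] = by_year["item_nonresponse"].diff().abs() > TOLERANCE
by_year["estimate_flag"] = by_year["estimate"].diff().abs() > TOLERANCE

print(by_year.round(2))
```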

In conclusion, while there may be no simple, summary way of measuring all aspects of quality on surveys, experts interviewed for this study believed, more optimistically, that survey users and methodologists ultimately have a common interest in achieving better quality data, and therefore a shared interest in working together to achieve progress on this front.

Summary framework to help guide consideration of future mode on SHS, SHeS and SCJS: Quality metrics

Cross-cutting issues

  • Priority considerations / issues: A key concern in assessing the potential impact of changing mode design is the impact it will have on survey quality. Assessing quality is complex – there are multiple features of quality and different stakeholders prioritise different elements.
  • Potential mitigations: Various existing frameworks have been produced to help support consideration of survey quality across different dimensions of both statistical accuracy and usability. They tend to combine key measures (response rates, attrition rates, breakdowns of sample composition by key measures) with qualitative assessments of other elements, including relevance, accessibility and clarity. These can provide templates to help the Scottish Government consider the quality of different mode designs (see, for example, the quarterly ONS Labour Force Survey performance and quality monitoring reports).
  • Remaining issues and trade-offs: Frameworks with multiple quality dimensions may be time consuming to complete, are potentially difficult for non-expert users to digest, and can be open to disagreements on interpretation, so they need to be approached with care and an eye to effective communication.

  • Priority considerations / issues: Response rates have tended to be used as a key summary measure of quality. However, they are increasingly seen as a ‘blunt tool’: they contribute to only one element (‘accuracy’) and, as discussed earlier, have a questionable relationship with nonresponse bias.
  • Potential mitigations: Experts suggested focusing on nonresponse bias and item response patterns instead, including comparing weighted and unweighted sample profiles against external estimates (e.g. Census or housing statistics), and assessing changes in responses to items that are not expected to change quickly or where there is a robust external point of comparison.
  • Remaining issues and trade-offs: Response rates remain important to some users – moving away from them may be difficult to explain to stakeholders, especially as there is no single ‘metric’ that can be produced to measure nonresponse bias or item response. However, a shared interest in data quality provides a starting point for engaging with users on this issue.

Contact

Email: sscq@gov.scot
