Scottish household survey 2017: methodology and fieldwork outcomes

Methodology of the Scottish household survey 2017 and information on fieldwork targets and outcomes.


5 Data Processing

Summary

  • Data checks are important to maintain the high quality of the data. The main data file was subject to checks and editing involving range checks, simple logic checks and complex logic checks.
  • The data then underwent two additional processes: first, the calculation of derived variables such as the age and sex of the Highest Income Householder; and second, the imputation of household income.
  • Within the SHS, total net annual household income remains the main indicator of household income. A proportion of respondents either did not know how much they received or refused to say how much they received. In order to rectify this non-response, and produce an accurate measure of total net household income, missing values were imputed.
  • The edited data was delivered to the Scottish Government, who ran further checks on the data.
  • Physical survey data was also subject to a rigorous validation process to ensure the accuracy and validity of each item of data entered. This included range checks on all fields.

5.1 Social data processing

The social data processing routines are summarised in Figure 5.1.

Figure 5.1: SHS social survey data processing procedures


The raw data was initially split into 3 files. Data from the 'other (write in)' variables and open-ended data was extracted for coding separately. Additionally, the variables used to produce NS-SEC variables were extracted into a separate file for coding.[15]

The main data file was subject to checks and editing involving:

  • Range checks, confirming that all variables were within the acceptable limits established for the question concerned.
  • Simple logic checks, ensuring that the relationships between questions were logical. For example, the number of people answering a filtered question should equal the number of people giving the appropriate response at the filtering question: if 500 people say they smoke, then 500 people should give a response for the number of cigarettes they smoke.
  • Complex logic checks. These involved examining the relationships between variables and assessing the logic of combinations of responses. For example, combinations of age and working status, and of age and relationships to other household members, were checked to assess the logic of cases such as someone aged over 60 years who was coded as the child of another household member.
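
As an illustration only, the three kinds of check could be sketched as below. The variable names (`age`, `smokes`, `cigs_per_day`, `relationship`) and the numeric limits are hypothetical and are not taken from the SHS questionnaire or its processing routines.

```python
# Hypothetical variables and limits, used for illustration only.
RANGE_LIMITS = {"age": (0, 120), "cigs_per_day": (0, 200)}


def range_errors(record):
    """Range check: list the variables whose values fall outside their limits."""
    errors = []
    for name, (low, high) in RANGE_LIMITS.items():
        value = record.get(name)
        if value is not None and not (low <= value <= high):
            errors.append(name)
    return errors


def filter_counts_match(records):
    """Simple logic check: the number of people who say they smoke should
    equal the number who answered the follow-up question."""
    smokers = sum(1 for r in records if r.get("smokes") is True)
    answered = sum(1 for r in records if r.get("cigs_per_day") is not None)
    return smokers == answered


def implausible_child(record, max_child_age=60):
    """Complex logic check: flag someone over the age threshold who is coded
    as the child of another household member, so the case can be reviewed."""
    age = record.get("age")
    return (
        record.get("relationship") == "child"
        and age is not None
        and age > max_child_age
    )
```

A flagged case is not necessarily wrong; the check only identifies combinations that merit examination.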

The data then underwent two additional processes: first, the calculation of derived variables such as the age and sex of the Highest Income Householder; and second, the imputation of household income. The edited data was delivered to the Scottish Government, who ran further checks on the data. Any data issues identified by the Scottish Government were discussed and, where necessary, the data was corrected and the data processing routines were amended.

5.1.1 Imputation of income in social data

Within the SHS, total net annual household income remains the main indicator of household income. This was defined as the total income from earnings, benefits and a variety of miscellaneous sources of the Highest Income Householder and their spouse, where applicable, with each component of income collected separately.

A proportion of respondents either did not know how much they received or refused to say how much they received. In order to rectify this non-response, and produce an accurate measure of total net household income, missing values were imputed. The process used was based on the imputation process developed by Scottish Homes for the 1996 Scottish House Condition Survey, and has been employed on the Scottish Household Survey for a number of years.

Missing income data was imputed for each component of income separately:

  • 4 components of earnings (earnings from main jobs and all other jobs of Highest Income Householder and Spouse)
  • 26 different benefit components
  • 13 different components of miscellaneous income.

Before starting the imputation process, the raw data was fully cleaned. For income from benefits, the upper limit of entitlement for each benefit was calculated. Any cases which were above these thresholds were examined, and edited if necessary. It is possible that respondents over-estimate income from one source of benefit and under-estimate income from another. Therefore, in cases where the benefit level was marginally above the threshold, the amount was not edited, but the case was excluded from use as a donor case in the imputation process.
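
The treatment of benefit amounts relative to the entitlement ceiling can be sketched as a three-way classification. This is a minimal illustration: the function name, the labels and the 10 per cent tolerance used to define "marginally above" are assumptions, since the report does not state how the margin was set.

```python
def classify_benefit_amount(amount, upper_limit, tolerance=0.10):
    """Classify a reported benefit amount against its entitlement ceiling.

    tolerance is a hypothetical definition of 'marginally above' the limit;
    the actual margin used in SHS processing is not given in the text.
    """
    if amount <= upper_limit:
        return "ok"                 # usable as reported and as a donor
    if amount <= upper_limit * (1 + tolerance):
        return "exclude_as_donor"   # kept unedited, but not used as a donor
    return "review"                 # examined, and edited if necessary
```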

Unlike benefits, clear rules do not exist regarding upper and lower limits of earnings and sources of miscellaneous income. Reported amounts were examined against key indicators - such as tenure, NS-SEC, and description of employment - and, where necessary, were either edited or excluded from the imputation process.

Imputation of earnings has the largest effect on total net household income because of the proportion of cases with missing earnings data and the fact that earnings are commonly the main source of household income. For main jobs, imputed values were calculated from a regression model that related earnings to a set of explanatory variables, such as age and sex, full-time or part-time employment, car ownership, tenure, receipt of means tested benefits, and NS-SEC. For imputation of second and subsequent jobs, Hot Deck imputation was used. In Hot Deck imputation, respondents were sorted into imputation groups according to likely determinants. Cases with missing data were donated values from cases with data which were in their imputation groups, according to the characteristics chosen.
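
The Hot Deck step described above can be sketched as follows. This is an illustrative outline rather than the procedure actually used: the field names, the `donor_ok` marker for cases excluded as donors, the `_imputed` flag suffix and the random choice of donor within a group are all assumptions.

```python
import random
from collections import defaultdict


def hot_deck_impute(records, target, class_vars, seed=0):
    """Fill missing values of `target` with a value donated by a case in the
    same imputation group, where groups are defined by `class_vars`."""
    rng = random.Random(seed)

    # Build the donor pool for each imputation group.
    donors = defaultdict(list)
    for r in records:
        if r[target] is not None and r.get("donor_ok", True):
            donors[tuple(r[v] for v in class_vars)].append(r[target])

    result = []
    for r in records:
        new = dict(r)
        new[target + "_imputed"] = False   # flag variable for analysts
        if r[target] is None:
            pool = donors.get(tuple(r[v] for v in class_vars))
            if pool:                        # no donor: value stays missing
                new[target] = rng.choice(pool)
                new[target + "_imputed"] = True
        result.append(new)
    return result
```

Cases in a group with no eligible donor are left missing in this sketch; how such cases were handled in practice is not stated in the text.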

Imputation of income from benefits was undertaken for each benefit separately. For benefits which were received by only a few people, no modelling could be undertaken and the median value of receipt for these benefits was imputed. For non-means tested benefits which were received by a significant number of respondents, entitlement levels were approximated using variables collected in the rest of the social survey interview. For example, Child Benefit is dependent on the number of children, and whether the recipient is a lone parent. For these benefits, Hot Deck imputation was used, with the imputation classes reflecting the entitlement rules as closely as possible. For means tested benefits which were received by a significant number of respondents, Hot Deck imputation was also used, again with the imputation classes reflecting entitlement rules as closely as possible. Imputation of means tested benefits was undertaken after imputation of earnings and other sources of income, as these benefits were dependent on the income of the household.
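
For components received by too few respondents to model, median imputation amounts to the following. The function name and the representation of missing amounts as `None` are illustrative assumptions.

```python
from statistics import median


def impute_median(amounts):
    """Replace missing amounts (None) with the median of the reported amounts
    among recipients of the same component."""
    observed = [a for a in amounts if a is not None]
    if not observed:
        return list(amounts)   # nothing reported, so nothing to impute from
    fill = median(observed)
    return [fill if a is None else a for a in amounts]
```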

Income from miscellaneous sources was imputed separately. Most miscellaneous sources of income were received by a small number of respondents and no modelling could be undertaken. The median value of receipt was imputed for these components. For components where modelling could be undertaken - investment income, and income from non-state pensions - Hot Deck imputation was used, with the imputation classes based on the variables in the models that had the most explanatory power.

Following imputation, income from all components was summed to create a total net household income variable. Total net household income was set to 'missing' for any household where the computed figure was less than £25 a week. Although a small proportion of households will have had a lower income than this - and be living off savings or loans - it is likely that some households will have either under-reported receipt of benefits or earnings, or been given a low value by the imputation process.
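
The summation and the £25 a week rule can be sketched as below. This assumes that component amounts are held as annual figures, that a year is treated as 52 weeks, and that a household with any component still missing has no total; none of these details are stated in the text, and the names are hypothetical.

```python
WEEKS_PER_YEAR = 52        # assumption: conversion from annual to weekly income
MIN_WEEKLY_INCOME = 25.0   # threshold in pounds per week, as stated in the text


def total_net_income(components):
    """Sum annual component amounts into total net household income.

    Returns None ('missing') if any component is still missing or if the
    total is below the weekly threshold.
    """
    amounts = list(components.values())
    if any(a is None for a in amounts):
        return None
    total = sum(amounts)
    if total / WEEKS_PER_YEAR < MIN_WEEKLY_INCOME:
        return None
    return total
```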

Overall, imputation was undertaken for one or more components in 44 per cent of households. After imputation, household income was missing for 3.4 per cent of households.

With imputation, there is a danger that the donor groups may differ from those with missing information. While this factor can be minimised with careful specification, it can never be totally excluded. To guard against analyses that might be sensitive to the imputation procedures, a set of flag variables was created so that analysts could identify cases and components where income had been imputed.

5.2 Physical survey data validation

The data from the physical survey forms were uploaded into the physical survey validation system together with the photographs of each dwelling.

The validation system worked by applying a set of rules provided by the Scottish Government (the same rules as used in previous years) to the raw data, to ensure the accuracy and validity of each item of data entered. This included range checks on all fields, detailed consistency checks making use of the redundancy built into the survey schedule, and plausibility checks on all appropriate items. Rules cross-reference different parts of the survey form (e.g. if the dwelling is a house, then aspects of the common dwelling section should not be completed; if the dwelling is a flat, then details for common parts should be present).
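
A cross-referencing rule of the kind described can be sketched as follows. The field names `dwelling_type` and `common_parts` and the error messages are hypothetical; the actual rule set was supplied by the Scottish Government and is not reproduced here.

```python
def check_common_parts(form):
    """Consistency rule: houses should have no common parts data, while
    flats should have the common parts section completed."""
    errors = []
    common = form.get("common_parts")
    if form.get("dwelling_type") == "house" and common:
        errors.append("house: common parts section should be empty")
    if form.get("dwelling_type") == "flat" and not common:
        errors.append("flat: common parts section is missing")
    return errors
```

Returning a list of messages rather than a single pass or fail value mirrors the way failed edits were listed for the surveyor to work through.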

Surveyors were shown a list of all the errors picked up by the validation program. Additionally, they were shown a list of all the entered data, with a description of the variable next to each item of data, and with the data split into representations of each page of the form. The validation system showed the data and the failed edits alongside the photographs of the property.

Corrections were then made and each form rechecked until it passed all edits. Changes to the data were made simply by overtyping the incorrect data where it was displayed. Once a surveyor had completed validation, the data was forwarded to their Regional Manager for sign-off. Validation of each form was complete when all errors had been eliminated or a supervisor had determined that the dwelling genuinely fell outside the validation criteria. An audit trail of changes made to the data was kept.

Figure 5.2: Physical data validation screens


Contact

Email: shs@gov.scot
