Data screening

From Wikiversity
  1. Data screening should be conducted prior to data recoding and data analysis, to help ensure the integrity of the data.
  2. It is only necessary to screen the data for the variables and cases used for the analyses presented in the lab report.
  3. Data screening means checking data for errors and then fixing or removing them. The goal is to maximise "signal" and minimise "noise".
  4. Keep a record of the data screening steps undertaken and any changes made to the data. Summarise this record in a one- to two-paragraph section called "Data screening" at the beginning of the Results.
  5. Fixing or removing incorrect data:
    1. Erroneous data can be changed to missing data. Alternatively, if a correct value can be presumed, then this can be entered.
    2. To change data in a cell, open the data file and, in Data View, left-click on the cell. Either delete the value to make it system-missing (sysmis) or type in the new value. Repeat for each problematic value.
    3. It is probably best to make erroneous data missing unless the correct value is obvious (e.g., if 77 was entered, it might reasonably be deduced that 7 was intended), in which case the incorrect value can be replaced with a best-guess value.
    4. For cases with a lot of erroneous data, it is probably best to remove the entire case (i.e., delete the whole row).
  6. Out-of-range values:
    1. Out-of-range values are either below the minimum or above the maximum possible value.
    2. To find out what the in-range values are, check:
      1. the survey
      2. the SPSS Value Labels in Variable View (after downloading the data)
    3. Identify out-of-range values by obtaining descriptive statistics (in SPSS, use Analyze > Descriptive Statistics > Descriptives) and examining the minimum and maximum values for all variables of interest. Alternatively, use this syntax: DESCRIPTIVES VARIABLES=all /STATISTICS=MEAN STDDEV MIN MAX.
    4. In the SPSS Data View, sort each variable that has out-of-range values in ascending or descending order to help identify the case or cases with those values. Alternatively, use search and find to locate them.
    5. Decide whether to accept, replace, or remove out-of-range values.
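Outside SPSS, the same minimum/maximum check can be sketched in a few lines of Python. This is only an illustrative sketch, not part of the SPSS workflow described above: the variable names (`tm01`, `tm02`), the 1 to 7 response scale, and the sample cases are all hypothetical, and missing values are assumed to be stored as `None`.

```python
# Hypothetical survey data: one dict per case, None marks a missing value.
cases = [
    {"id": 1, "tm01": 4, "tm02": 5},
    {"id": 2, "tm01": 77, "tm02": 3},    # 77 is above the assumed maximum of 7
    {"id": 3, "tm01": 2, "tm02": 0},     # 0 is below the assumed minimum of 1
    {"id": 4, "tm01": None, "tm02": 6},  # missing values are not out of range
]

# Assumed valid range (minimum, maximum) for each variable of interest.
VALID_RANGES = {"tm01": (1, 7), "tm02": (1, 7)}


def find_out_of_range(cases, valid_ranges):
    """Return (case id, variable, value) for every value outside its valid range."""
    flagged = []
    for case in cases:
        for var, (low, high) in valid_ranges.items():
            value = case.get(var)
            if value is None:
                continue  # skip missing data
            if value < low or value > high:
                flagged.append((case["id"], var, value))
    return flagged


def set_to_missing(cases, flagged):
    """Return a copy of the cases with every flagged value replaced by None."""
    flagged_cells = {(case_id, var) for case_id, var, _ in flagged}
    return [
        {key: (None if (case["id"], key) in flagged_cells else value)
         for key, value in case.items()}
        for case in cases
    ]


flagged = find_out_of_range(cases, VALID_RANGES)
cleaned = set_to_missing(cases, flagged)
```

Here `flagged` lists the cells that need a decision, and `set_to_missing` shows only one of the three options (removal of the value); accepting or replacing a value remains a judgement call for the analyst.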
  7. Unusual cases:
    1. Unusual cases occur when a case's responses are very different from the pattern of responses by most other respondents.
    2. In SPSS, use Data > Identify Unusual Cases and enter several to many variables (e.g., all the Time Management Skill variables).
    3. The results will flag the top 10 anomalous cases.
    4. Look carefully at each of these cases' responses to the target variables. Do they appear to be legitimate? For example, are there out-of-range values, or signs that the data were fabricated (fabricated responses to reverse-scored items are often not in the expected direction)? If the responses do not appear legitimate, consider removing the case.
  8. Duplicate cases:
    1. Duplicate cases occur when two or more cases have identical or near-identical data.
    2. In SPSS, use Data > Identify Duplicate Cases and enter several to many variables.
    3. Consider whether to remove all copies of a duplicated case (e.g., if the integrity of the data is in doubt because it may have been fabricated and then duplicated) or to retain one copy of each case and delete the duplicates.
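As a rough illustration of what a duplicate check does, the sketch below groups cases that have identical values on a chosen set of variables. It is written in Python rather than SPSS, the data and variable names are hypothetical, and it only finds exact matches; near-identical cases would still need to be checked by eye.

```python
from collections import defaultdict

# Hypothetical survey data: one dict per case.
survey = [
    {"id": 1, "q1": 3, "q2": 4, "q3": 2},
    {"id": 2, "q1": 5, "q2": 1, "q3": 4},
    {"id": 3, "q1": 3, "q2": 4, "q3": 2},  # same answers as case 1
    {"id": 4, "q1": 3, "q2": 4, "q3": 1},  # near-identical, not an exact match
]


def find_duplicate_groups(cases, variables):
    """Group case ids that share identical values on all the given variables."""
    groups = defaultdict(list)
    for case in cases:
        key = tuple(case.get(var) for var in variables)
        groups[key].append(case["id"])
    # Keep only the groups that contain more than one case.
    return [ids for ids in groups.values() if len(ids) > 1]


duplicates = find_duplicate_groups(survey, ["q1", "q2", "q3"])
```

Each inner list in `duplicates` holds the ids of cases that match one another, which leaves the decision about keeping one copy or removing all copies to the analyst.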
  9. Manual check for other anomalies:
    1. Check carefully through the data file (case by case and variable by variable), looking for and addressing any oddities.
    2. Empty cases: cases with little or no data could be removed.
    3. Cases with responses which lack meaningful variation (e.g., 5 5 5 5 5 5 5 5 5) or which exhibit obviously arbitrary patterns (e.g., zig-zag: 1 2 3 4 5 4 3 2 1 2 3 4 5): such responses are unlikely to be valid and probably should be deleted.
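Some of these manual checks can be supported by simple flags. The Python sketch below is one possible approach, not a prescribed method: it flags cases that are mostly empty and cases whose answers show no variation. The thresholds (`max_missing=0.5`, `min_answered=5`) are arbitrary assumptions, missing values are assumed to be stored as `None`, and arbitrary patterns such as zig-zags are not detected, so visual inspection is still needed.

```python
def proportion_missing(responses):
    """Return the share of responses that are missing (None)."""
    if not responses:
        return 1.0
    return sum(r is None for r in responses) / len(responses)


def lacks_variation(responses, min_answered=5):
    """True if enough items were answered and every answer is the same value."""
    answered = [r for r in responses if r is not None]
    return len(answered) >= min_answered and len(set(answered)) == 1


def screen_case(responses, max_missing=0.5):
    """Return a list of warning flags for one case's item responses."""
    flags = []
    if proportion_missing(responses) > max_missing:
        flags.append("mostly empty")
    if lacks_variation(responses):
        flags.append("no variation")
    return flags


straight_line = screen_case([5, 5, 5, 5, 5, 5, 5, 5, 5])
nearly_empty = screen_case([None] * 8 + [3])
zig_zag = screen_case([1, 2, 3, 4, 5, 4, 3, 2, 1])  # not caught by these checks
```

A flag is only a prompt to look at the case; whether to delete it remains a decision to record in the "Data screening" section.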
