Data screening

From Wikiversity
  1. Data screening involves checking data for errors and fixing or removing these errors. Data screening should be conducted prior to data analysis to help ensure the integrity of the data.
  2. Keep a record of data screening steps undertaken and any changes made to the data. This should be summarised in the Results.
  3. Check through the data file (case by case and variable by variable), looking for and addressing any oddities, such as:
    1. Out-of-range values? A quick way to check for out-of-range values is to obtain descriptive statistics for all variables and examine the minimum and maximum values - are any out of range? If so, decide whether you are going to accept, replace or remove these values.
    2. Duplicate cases?
    3. Empty cases (e.g., cases with little or no data)?
    4. Cases with responses which lack meaningful variation (e.g., 5 5 5 5 5 5 5 5 5) or which exhibit obviously arbitrary patterns (e.g., zig-zag - 1 2 3 4 5 4 3 2 1 2 3 4 5)? Such responses may not be valid and probably should be deleted.
  4. Erroneous data can be changed to missing data (sysmis) by deleting the data in the cell. Alternatively, if the correct value is known (e.g., 88 was evidently meant to be 8, or the value can be checked against the hard copy), then the correct value can be entered instead. If a specific case has a lot of problematic values, then consider removing the entire case.
  5. Consider whether reverse coding and/or recoding of data might be necessary or appropriate.
  6. Where hard-copy surveys have been entered into electronic format:
  7. Check the data entry accuracy (e.g., if the data has been hand-entered from hard copy surveys, then double-check the electronic data against the hard copies)
  8. Check for missing data (e.g., using descriptives or missing data analysis) and fix it if possible (e.g., check against the hard copy survey). Consider possible replacement of missing data (e.g., mean replacement or regression-based replacement).
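
The screening steps above can be sketched in code. The following is a minimal illustration rather than a definitive procedure. It uses only the Python standard library and rests on several assumptions that are not part of the checklist: the item names (q1 to q5), the valid range of 1 to 5, the made-up cases, the "more than half missing" cut-off for empty cases, and the choice of q2 as a reverse-scored item are all hypothetical. None stands in for a system-missing value.

```python
from statistics import mean

ITEMS = ["q1", "q2", "q3", "q4", "q5"]  # hypothetical item names
LOW, HIGH = 1, 5                        # assumed valid response range
MISSING = None                          # stands in for system-missing

# Made-up cases, for illustration only.
cases = [
    {"id": 1, "q1": 4, "q2": 3, "q3": 5, "q4": 2, "q5": 4},
    {"id": 2, "q1": 5, "q2": 5, "q3": 5, "q4": 5, "q5": 5},    # no variation
    {"id": 3, "q1": 2, "q2": 4, "q3": 1, "q4": 4, "q5": 3},
    {"id": 3, "q1": 2, "q2": 4, "q3": 1, "q4": 4, "q5": 3},    # duplicate
    {"id": 4, "q1": None, "q2": None, "q3": None, "q4": None, "q5": 2},
    {"id": 5, "q1": 1, "q2": 88, "q3": 3, "q4": 4, "q5": None},  # out of range
    {"id": 6, "q1": 3, "q2": 2, "q3": 4, "q4": 1, "q5": 5},
]
log = []  # record of screening steps, to be summarised in the Results


def n_missing(case):
    """Count how many items are missing for one case."""
    return sum(case[item] is MISSING for item in ITEMS)


# 1. Out-of-range values: inspect min/max per item, set bad values to missing.
for item in ITEMS:
    values = [c[item] for c in cases if c[item] is not MISSING]
    if min(values) < LOW or max(values) > HIGH:
        log.append(f"{item}: min={min(values)}, max={max(values)}, out of range")
        for c in cases:
            if c[item] is not MISSING and not LOW <= c[item] <= HIGH:
                c[item] = MISSING

# 2. Duplicate cases: keep the first occurrence of each identical record.
seen, unique = set(), []
for c in cases:
    key = tuple(c.items())
    if key not in seen:
        seen.add(key)
        unique.append(c)
log.append(f"removed {len(cases) - len(unique)} duplicate case(s)")
cases = unique

# 3. Empty cases: drop cases missing more than half the items (assumed cut-off).
kept = [c for c in cases if n_missing(c) <= len(ITEMS) / 2]
log.append(f"removed {len(cases) - len(kept)} near-empty case(s)")
cases = kept

# 4. No meaningful variation: drop complete cases with one answer throughout.
kept = [c for c in cases
        if n_missing(c) > 0 or len({c[item] for item in ITEMS}) > 1]
log.append(f"removed {len(cases) - len(kept)} case(s) with no variation")
cases = kept

# 5. Reverse coding: q2 is assumed to be negatively worded (1<->5, 2<->4).
for c in cases:
    if c["q2"] is not MISSING:
        c["q2"] = LOW + HIGH - c["q2"]

# 6. Missing data: count gaps per item, then apply mean replacement.
for item in ITEMS:
    observed = [c[item] for c in cases if c[item] is not MISSING]
    gaps = len(cases) - len(observed)
    if gaps:
        item_mean = mean(observed)
        for c in cases:
            if c[item] is MISSING:
                c[item] = item_mean
        log.append(f"{item}: replaced {gaps} missing value(s) with the item mean")

print("\n".join(log))
```

Only exact duplicates and constant response strings are detected here; near-duplicates and zig-zag patterns would still need inspection by eye. Mean replacement is used because it is the simplest option named above, but it reduces the variance of the affected variable, so the number of replaced values should be reported along with the other logged steps.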

See also