
Survey research and design in psychology/Assessment/Lab report/Data screening

From Wikiversity
Lab report - Data screening

General steps

Data screening helps to ensure more accurate data analysis.
  1. Obtain a copy of the data: TSQFUS1_2018_All_Unscreened
  2. The data file is provided as received, with minimal screening. There are likely to be several errors in the data file (e.g., due to mistakes made during completing the survey and/or data entry). Thus, it is important to thoroughly screen the data (check, find, and correct errors) before conducting data analysis.
  3. Only the data you decide to use needs screening, although it may prove worthwhile to simply screen all data.
  4. Follow the How to screen steps below.
  5. Save and back up the screened data file as you would any other valuable electronic file - you don't want to have to redo these steps.
  6. Once data screening is complete, proceed to data recoding, then conduct the lab report data analyses.

How to screen

  1. Data screening should be conducted prior to data recoding and data analysis, to help ensure the integrity of the data.
  2. It is only necessary to screen the data for the variables and cases used for the analyses presented in the lab report.
  3. Data screening means checking data for errors and fixing or removing these errors. The goal is to maximise "signal" and minimise "noise" by identifying and fixing or removing errors.
  4. Keep a record of data screening steps undertaken and any changes made to the data. This should be summarised in a one to two paragraph section called "Data screening" at the beginning of the Results.
  5. Fixing or removing incorrect data:
    1. Erroneous data can be changed to missing data. Alternatively, if a correct value can be presumed, then this can be entered.
    2. To change data in a cell, open the data file in Data View and left-click on the cell. Delete the value to make it missing data (sysmis), or type in a new value. Repeat for each problematic value.
    3. It is probably best to make erroneous data missing unless the correct value is obvious (e.g., if 77 was entered, it might reasonably be deduced that 7 was intended) in which case the incorrect value can be replaced with a best guess correct value.
    4. For cases with a lot of erroneous data, it is probably best to remove the entire case (i.e., delete the whole row).
  6. Out-of-range values:
    1. Out of range values are either below the minimum or above the maximum possible value.
    2. To know what the in-range values are, check:
      1. the survey
      2. the SPSS Value Labels in Variable View of the downloaded data file
    3. Identify out-of-range values by obtaining descriptive statistics (in SPSS, use Analyze - Descriptive Statistics - Descriptives) to examine the minimum and maximum values for all variables of interest. Or try this syntax - DESCRIPTIVES VARIABLES=all /STATISTICS=MEAN STDDEV MIN MAX.
    4. In the SPSS Data View, sort variables with out-of-range values in ascending or descending order to help identify the case(s) that have the out-of-range values. Alternatively, use search and find to identify the case(s) with out-of-range values.
    5. Decide whether to accept, replace, or remove out-of-range values.
  7. Unusual cases:
    1. Unusual cases occur when a case's responses are very different from the pattern of responses by most other respondents.
    2. In SPSS, Data - Identify Unusual cases - Enter several to many variables (e.g., all the Time Management Skill variables).
    3. The results will flag the top 10 anomalous cases.
    4. Look carefully at each of these cases' responses to the target variables. Do they appear to be legitimate? For example, are there out-of-range values, or might the data have been fabricated (fabricated responses to reverse-scored items often do not run in the expected direction)? If the responses do not appear legitimate, consider removing the case.
  8. Duplicate cases:
    1. Duplicate cases occur when two or more cases have identical or near-identical data.
    2. In SPSS, Data - Identify Duplicate cases - Enter several to many variables.
    3. Consider whether to remove both cases (e.g., if the integrity of the data is in doubt because it may have been fabricated and then duplicated) or to retain one copy of each case and delete the duplicates.
  9. Manual check for other anomalies:
    1. Check carefully through the data file (case by case and variable by variable) looking for and addressing any oddities.
    2. Empty cases: cases with little or no data could be removed.
    3. Cases with responses that lack meaningful variation (e.g., 5 5 5 5 5 5 5 5 5) or that exhibit obviously arbitrary patterns (e.g., zig-zag - 1 2 3 4 5 4 3 2 1 2 3 4 5): such responses are unlikely to be valid and should probably be deleted.
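
The out-of-range, duplicate, and no-variation checks described above can also be sketched outside SPSS. The following is a minimal Python illustration, not part of the lab report instructions. It assumes each case is a dict keyed by variable name, with None for missing data. The function names, the toy cases, and the 1 to 8 valid range are hypothetical and used only for illustration.

```python
from collections import defaultdict


def out_of_range(cases, variables, minimum, maximum):
    """Return (case_id, variable, value) for every value outside [minimum, maximum]."""
    flagged = []
    for case in cases:
        for var in variables:
            value = case.get(var)
            if value is not None and not (minimum <= value <= maximum):
                flagged.append((case["id"], var, value))
    return flagged


def duplicate_groups(cases, variables):
    """Group case IDs whose responses on the given variables are identical."""
    groups = defaultdict(list)
    for case in cases:
        key = tuple(case.get(var) for var in variables)
        groups[key].append(case["id"])
    return [ids for ids in groups.values() if len(ids) > 1]


def flat_responders(cases, variables):
    """Return IDs of cases that gave the same non-missing answer to every variable."""
    flagged = []
    for case in cases:
        answers = [case.get(var) for var in variables]
        if None not in answers and len(set(answers)) == 1:
            flagged.append(case["id"])
    return flagged


# Toy data: three items on an assumed 1-8 scale (hypothetical values).
items = ["tm01", "tm02", "tm03"]
cases = [
    {"id": 1, "tm01": 3, "tm02": 9, "tm03": 4},  # 9 is out of range
    {"id": 2, "tm01": 5, "tm02": 5, "tm03": 5},  # no variation
    {"id": 3, "tm01": 2, "tm02": 6, "tm03": 7},
    {"id": 4, "tm01": 2, "tm02": 6, "tm03": 7},  # duplicate of case 3
]

print(out_of_range(cases, items, 1, 8))
print(duplicate_groups(cases, items))
print(flat_responders(cases, items))
```

This sketch only finds exact duplicates. Near-identical cases and zig-zag patterns still need the manual check described above, and a flagged case is a prompt to inspect the data, not an automatic deletion.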

Known issues for the 2018 data


This is a list of known data file errors and suggested possible corrections ... so far ...

Post about anomalies you find in the data to the discussion forum - many hands make light work - or Linus' Law "given enough eyeballs, all bugs are shallow".

Data enterer ID
  1. Data enterer 2 (Cases 6 - 10): Hard copies were not submitted, so unable to verify. Recommended action: Remove cases.
  2. Data enterer 13 (Cases 61 - 65): Other than the background variables, the data is missing for all cases. Hard copies were not submitted, so unable to verify. Recommended action: Remove cases.
  3. Data enterer 32 (Cases 156 - 160): Hard copies were not submitted, so unable to verify. Recommended action: Remove cases.
  4. Data enterer 38 (Cases 186 - 190): None of the 18 time management variables for any of the 5 cases are above 4 for an 8-point scale. Seems highly improbable and may indicate that the data was fabricated. Hard copies were not submitted, so unable to verify. Recommended action: Remove cases.
  5. Data enterer 78 (Cases 386 - 390): Hard copies were not submitted, so unable to verify. Recommended action: Remove cases.
  6. Data enterer 80 (Cases 396 - 400): Hard copies were not submitted, so unable to verify. Recommended action: Remove cases.
  7. Data enterer 109 (Cases 542 - 546): Several problems including: numeric responses appear in the open-ended variables; ages are single-digits; hours are unrealistically low. Recommended action: Remove cases.
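
Removing whole blocks of cases by case ID can be expressed as a simple filter. The sketch below is a hedged Python illustration that uses the case ranges listed above. The function names and the assumption that cases are numbered 1 to 550 are hypothetical, so check the actual case IDs in the data file before removing anything.

```python
# Inclusive case ID ranges recommended for removal in the list above.
REMOVE_RANGES = [(6, 10), (61, 65), (156, 160), (186, 190),
                 (386, 390), (396, 400), (542, 546)]


def should_remove(case_id, ranges=REMOVE_RANGES):
    """True if the case ID falls inside any inclusive (first, last) range."""
    return any(first <= case_id <= last for first, last in ranges)


def drop_cases(case_ids, ranges=REMOVE_RANGES):
    """Return the case IDs that survive removal, in their original order."""
    return [cid for cid in case_ids if not should_remove(cid, ranges)]


# Hypothetical file with cases numbered 1 to 550.
kept = drop_cases(range(1, 551))
print(len(kept))  # 7 ranges of 5 cases each are removed
```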
Case ID
  1. Case 17 - hours add up to less than 24. The estimated total hours seem unrealistically low. Recommended action: Change all hour estimates to missing data for this case.
  2. Case 20 - hours add up to less than 24. The estimated total hours seem unrealistically low. Recommended action: Change all hour estimates to missing data for this case.
  3. Case 32 - pss items are all 1. Recommended action: Change all pss items to missing data for this case.
  4. Case 33 - hours estimates are missing or very low. Recommended action: Change all hour estimates to missing data for this case.
  5. Case 98 - International04 is entered as 11 and enrol05 is missing. It was probably meant to be entered as 1 for each variable. Recommended action: Either change the 11 to missing data or replace both 04 and 05 with 1 each.
  6. Case 122 - hours add up to less than 24. The estimated total hours seem unrealistically low. Recommended action: Change all hour estimates to missing data for this case.
  7. Cases 126 and 251 - Virtually identical. Recommended action: Remove one or both cases.
  8. Case 129 - sleepinghours09 - 18 hours: Only one estimate of hours is provided - for sleeping. And 18 seems to be unrealistic/unlikely. Recommended action: Change sleepinghours09 for this case to missing data.
  9. Case 149 - workhours09 - 150 hours: This is over 21 hours per day, 7 days a week, for a full-time student. There are no estimates of hours for other activities. This seems to be an unlikely number of work hours. Recommended action: Change workhours09 for this case to missing data.
  10. Case 185 - hours estimates are missing or very low. Recommended action: Change all hour estimates to missing data for this case.
  11. Case 285 - hours add up to less than 24. The estimated total hours seem unrealistically low. Recommended action: Change all hour estimates to missing data for this case.
  12. Case 307 - hours add up to more than 168. Estimated hours are greater than the maximum hours in a week. Recommended action: Change all hour estimates to missing data for this case.
  13. Case 339 - hours add up to more than 168. Estimated hours are greater than the maximum hours in a week. Recommended action: Change all hour estimates to missing data for this case.
  14. Cases 347 and 358 - Virtually identical. Recommended action: Remove one or both cases.
  15. Case 363 - hours estimates are missing or very low. Recommended action: Change all hour estimates to missing data for this case.
  16. Case 437 - hours estimates are very low. Recommended action: Change all hour estimates to missing data for this case.
  17. Case 466 - tm variables from 06 to 18 are all 8. Seems likely that the respondent didn't think about the last dozen or so questions and just circled 8. Recommended action: Change all tm variables to missing data for this case.
  18. Case 470 - hours add up to more than 168. Estimated hours are greater than the maximum hours in a week. Recommended action: Change all hour estimates to missing data for this case.
  19. Case 475 - hours add up to more than 168. Estimated hours are greater than the maximum hours in a week. Recommended action: Change all hour estimates to missing data for this case.
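
Several of the cases above were flagged because their weekly hour estimates sum to less than 24 or more than 168 (7 x 24). A minimal Python sketch of that plausibility check follows. The function name, the return labels, and the example estimates are hypothetical. The two thresholds are taken from the list above.

```python
HOURS_IN_WEEK = 7 * 24  # 168


def check_weekly_hours(hours, low=24, high=HOURS_IN_WEEK):
    """Classify one case's weekly hour estimates.

    `hours` is a list of estimates, with None for missing data.
    Returns 'missing', 'too low', 'too high', or 'ok'.
    """
    reported = [h for h in hours if h is not None]
    if not reported:
        return "missing"
    total = sum(reported)
    if total < low:
        return "too low"
    if total > high:
        return "too high"
    return "ok"


# Hypothetical estimates for four cases.
print(check_weekly_hours([10, 5, 4]))        # total 19
print(check_weekly_hours([56, 150, 20]))     # total 226
print(check_weekly_hours([49, 20, 30, 15]))  # total 114
print(check_weekly_hours([None, None]))
```

A total inside the 24 to 168 band is not proof that the estimates are valid, so individual values (such as a single very large estimate) still deserve a look.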
Variables
  1. sleepinghours09 - estimates that seem too low (e.g., because they were estimated per day rather than per week):
    1. Case 7 - 7. Change to 49.
    2. Case 8 - 0. Change to missing data.
    3. Case 13 - 4. Change to 28.
    4. Case 32 - 6. Change to 42.
    5. Case 39 - 10. Change to 70.
    6. Case 84 - 12. Change to 84.
    7. Case 128 - 20. Change to missing.
    8. Case 160 - 7. Change to 49.
    9. Case 165 - 4. Change to 28.
    10. Case 197 - 7. Change to 49.
    11. Case 213 - 25. Change to missing.
    12. Case 219 - 7.5. Change to 52.5.
    13. Case 228 - 7. Change to 49.
    14. Case 303 - 4.5. Change to 31.5.
    15. Case 334 - 6. Change to 42.
    16. Case 407 - 7. Change to 49.
    17. Case 408 - 7. Change to 49.
    18. Case 409 - 11. Change to 77.
    19. Case 414 - 6.5. Change to 45.5.
    20. Case 424 - 7.5. Change to 52.5.
    21. Case 425 - 7.5. Change to 52.5.
    22. Case 433 - 20. Change to missing.
    23. Case 434 - 20. Change to missing.
    24. Case 500 - 25. Change to missing.
    25. Case 518 - 18. Change to missing.
  2. pss03
    1. Case 468 - 999. Recommended action: Change pss03 to missing data.
  3. tm10
    1. Case 7 - 0. Recommended action: Change tm10 to missing data.
  4. tm16
    1. Case 75 - 9. Recommended action: Change tm16 to missing data.
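
The sleepinghours09 corrections above follow a pattern: small values are treated as daily estimates and multiplied by 7, while zero and values that are implausible as either a daily or a weekly figure become missing. The Python sketch below is one way to express that pattern. The function name and both cut-offs are assumptions inferred from the list (12 is the largest value that was converted and 28 is the smallest corrected weekly value), not rules stated by the source.

```python
def fix_sleeping_hours(value, max_daily=12, min_weekly=28):
    """Return a corrected weekly sleep estimate, or None for missing data.

    Assumed rule (inferred, not stated in the source):
      - None or 0               -> missing
      - above 0 up to max_daily -> treated as hours per day, multiplied by 7
      - between the cut-offs    -> ambiguous, set to missing
      - min_weekly or more      -> accepted as a weekly estimate, unchanged
    """
    if value is None or value <= 0:
        return None
    if value <= max_daily:
        return value * 7
    if value < min_weekly:
        return None
    return value


# Values taken from the list above, plus one plausible weekly value (56).
for reported in [7, 0, 4.5, 12, 20, 56]:
    print(reported, "->", fix_sleeping_hours(reported))
```

Other out-of-range values, such as 999 on pss03 or 0 on tm10, are simply set to missing data as recommended above.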

Perceived Stress Scale

  1. pss04, pss05, pss07, and pss08 need recoding prior to analysis so that higher scores represent higher stress for all the pss variables.

Time Management Skills

  1. tm07, tm09, tm11, tm13, tm18 need recoding prior to analysis.
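
Recoding a reverse-scored item usually means replacing each score with (minimum + maximum - score), so that the ends of the scale swap. The Python sketch below illustrates this for the time management items listed above. It assumes the 8-point scale runs from 1 to 8. The function names and the example case are hypothetical. The same function could be applied to the pss items once their scale range has been confirmed from the survey.

```python
def reverse_score(value, minimum, maximum):
    """Reverse a score on a scale running from minimum to maximum."""
    if value is None:  # keep missing data missing
        return None
    return minimum + maximum - value


TM_REVERSED = ["tm07", "tm09", "tm11", "tm13", "tm18"]


def recode_case(case, items, minimum, maximum):
    """Return a copy of the case with the listed items reverse scored."""
    recoded = dict(case)
    for item in items:
        recoded[item] = reverse_score(case.get(item), minimum, maximum)
    return recoded


# Hypothetical case on an assumed 1-8 scale; tm06 is not reverse scored.
case = {"tm06": 3, "tm07": 2, "tm09": 8, "tm11": None, "tm13": 5, "tm18": 1}
recoded = recode_case(case, TM_REVERSED, 1, 8)
print(recoded)
```

Recoding should be done after screening, because an out-of-range value (such as 9 on a 1-8 scale) would otherwise be turned into another invalid value.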