Logistic regression/Section 2: Examples in SPSS
P308D - Categorical Data Analysis
Kruskal-Wallis statistic: An analogue to One-Way ANOVA
The null hypothesis is that these groups are equivalent. There are other alternatives in this case too: one new alternative that is very attractive is resampling. We’ll look at that after binary logistic regression. There’s a whole host of ___ for ordinary ANOVA that don’t assume a normal distribution of error, which ‘’is’’ assumed in ANOVA.
Repeated Measures
- If you have relationships with three graders
Non-parametric statistics - Kendall’s coefficient of concordance (W): with only two raters it is equivalent to a Spearman correlation, but if you have 3, 4, or 5 raters, it’s a composite statistic; kind of an average of the Spearman correlations between all the pairs of raters.
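To make the "kind of an average of the Spearman" remark concrete, here is a minimal Python sketch. The three raters and their rankings of four items are hypothetical (they are not from the handout), and the formulas assume no tied ranks. It computes Kendall's W and checks the standard identity that the average pairwise Spearman correlation equals (m*W - 1)/(m - 1) for m raters.

```python
from itertools import combinations

def kendalls_w(ranks):
    """Kendall's W for m raters (rows) ranking n items (columns), no ties."""
    m, n = len(ranks), len(ranks[0])
    # Sum of the ranks each item received across raters.
    totals = [sum(rater[j] for rater in ranks) for j in range(n)]
    mean_total = sum(totals) / n
    s = sum((t - mean_total) ** 2 for t in totals)
    return 12 * s / (m ** 2 * (n ** 3 - n))

def spearman(a, b):
    """Spearman rank correlation for two untied rankings."""
    n = len(a)
    d2 = sum((x - y) ** 2 for x, y in zip(a, b))
    return 1 - 6 * d2 / (n * (n ** 2 - 1))

# Hypothetical data: 3 raters each rank the same 4 items.
ranks = [
    [1, 2, 3, 4],
    [2, 1, 3, 4],
    [1, 3, 2, 4],
]
m = len(ranks)
w = kendalls_w(ranks)

pairs = list(combinations(ranks, 2))
mean_rho = sum(spearman(a, b) for a, b in pairs) / len(pairs)
mean_rho_from_w = (m * w - 1) / (m - 1)

print(round(w, 3), round(mean_rho, 3), round(mean_rho_from_w, 3))
```

With two raters the identity reduces to rho = 2W - 1, so W carries the same information as the single Spearman correlation.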
Now, I recognize that this can be kind of fuzzy; overwhelming, or whatever: There’s all these different tests. I’m not saying you should go look these things up and deal with them and so forth: what I’m saying is, if you look at your situation, and want to do a non-parametric test, these are some alternatives. And there are a lot of others as well.
A classic reference is Siegel’s ‘’Nonparametric Statistics for the Behavioral Sciences’’ (1956); very easy to read; tells you all about these non-parametric statistics.
Any comments or statistics; things that have come up as you’re working on the non-parametric homework?
Example 1
Turn to the handout (CD06, p. 5): We’re interested in seeing if we can predict who actually consumes alcohol. There are abstainers: people who don’t drink at all; for this, we’re just interested in, “Do people drink at all? Yes or no?”
Some things that might be relevant: ‘’Age’’; ‘’Gender’’; and maybe ‘’Marital Status’’
So, what’s the first thing you would do? You have the assignment: let’s do an analysis of this to see if we can predict who drinks.
First thing you do, new data set? A: LOOK AT THE DATA!
SPSS “Drivers” Dataset Example of Looking at the Data
We have the data itself up on the screen now: if you do the variable view; (driver.sav) - 1800 cases across the US
In SPSS:
- Look at Descriptive Statistics for:
- Age
- Marital Status
- Gender
- Did you Drink
- Select “Frequencies”
- *Note; it’s nice to see that SPSS is putting bootstrapping into most of their routines now.
‘’’SPSS Syntax:’’’
FREQUENCIES VARIABLES=age marst sex2 drink2
  /STATISTICS=STDDEV MINIMUM MAXIMUM MEAN MEDIAN SKEWNESS SESKEW KURTOSIS SEKURT
  /HISTOGRAM
  /FORMAT=LIMIT(20)
  /ORDER=ANALYSIS.
What do we see, does anything catch your eye?
- Starting point: Missing data: we have 1800 cases, and only 17 missing. Now, that’s not bad; 17 people didn’t tell us their age; 10 people didn’t tell us their marital status.
- We didn’t suppress tables; you’d have pages and pages and pages if you had a variable with a ton of data points.
- Looking at the table, we can get a rough estimate of the median: around 36-37 years for age. . .
- Under “Marital Status”, it shows only the labels, rather than the values: you might want to put both on here.
We see that the bulk of the sample is married, with reasonable percentages of the other categories.
Gender of respondent: 50.1% male to 49.9% female. See anything wrong here? A: It’s suspiciously close.
Did you Drink Alcohol in the last year?
How to deal with Missing Data (SPSS Example)
There are different ways to screen out the ones that are missing: I have one way that works, . . .
- I go up to ‘’Data/Select Cases’’
- Select cases “IF” a condition is satisfied
- Now, the missing ones don’t have a value. . . or do they? Go back to the data set and check how the variable “Age” has been coded.
- Missing is coded as “0” or “99”
- IF Age >= 0 & Marital Status >= 0
- RESULT: Some of these cases are missing;
- now we go up to Descriptives:
- Age, Marital Status, Gender; Run:
- Here we go: My descriptive result: 1776; so it worked; I told you it was superstitious! SPSS just did that. . .
- But if you use techniques like these, CHECK IT!
- (J: You think it’s not going to work, logically. . .) but it did work! So this is a bit awkward.
Apparently, 0 was not a legitimate value; now we’re taking all these cases that do have legitimate values.
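The same screening idea can be sketched outside SPSS. The Python below is only an illustration: the four cases are hypothetical, and treating both 0 and 99 as missing codes for both variables is an assumption based on the note above. (In SPSS, a comparison involving a value declared as missing evaluates to missing rather than true, which appears to be why the "Age >= 0" condition dropped those cases.)

```python
# Assumption: 0 and 99 are the codes used for "missing" on both variables.
MISSING_CODES = {0, 99}

# Hypothetical cases; the real file (driver.sav) has 1800.
cases = [
    {"age": 34, "marst": 1},
    {"age": 99, "marst": 2},   # age missing
    {"age": 51, "marst": 0},   # marital status missing
    {"age": 27, "marst": 3},
]

def is_valid(case, variables=("age", "marst")):
    """Keep a case only if none of the listed variables holds a missing code."""
    return all(case[v] not in MISSING_CODES for v in variables)

valid_cases = [c for c in cases if is_valid(c)]

# Always check the result: compare the N before and after screening.
print(len(cases), len(valid_cases))
```

As in the lecture, the point of printing both counts is to confirm the screen did what you expected before running any analysis.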
So, we’ll try to see whether people drink or not using a single dichotomous variable - ‘’sex’’, in this case.
Q: What analysis would you use? A: χ^2 test of independence.
That’s correct: that’s what I actually did first here then, is I ran just a simple cross-tab:
I’ve got the
Crosstabs (SPSS Report Section)
What statistic would you report? - Is there a statistically significant difference? A: Yes.
Can we conclude that men drink more than women? A: No. (It doesn’t say which way the difference goes)
If you ask, “What proportion of the men had a drink
Logistic Regression (SPSS Instructions)
- In SPSS go to menu item ‘’Analyze; Regression; Binary Logistic’’
- Dependent variable: “Did you drink alcohol…”
- Covariate (what SPSS calls Independent Variables here):
- If we have categorical data, we have to tell the program that it’s categorical.
- A dichotomy is a very special type of categorical variable: in a sense, it’s an interval variable. All the intervals are equal because there is only one interval! So we don’t have the problem of unequal intervals.
- Some analyses in SPSS won’t run if the program thinks you’re giving it the wrong kind of variable: it may not even tell you why, and if it does tell you, it’s kind of hidden in there; you have to look for it. Okay, so we don’t have to call it a categorical variable: run the analysis.
- OUTPUT: Logistic Regression, report from SPSS
Components of a Logistic Regression Report in SPSS
Components:
- Logistic Regression (Header)
- Case Processing Summary (Table)
- Dependent Variable Encoding (Table)
- Block 0: Beginning Block (Header)
- Classification Table
- Variables in the Equation (Table)
- Variables not in the Equation (Table)
- Block 1: Method = Enter (Header)
- Omnibus Tests of Model Coefficients
‘’’2.1 Classification Table’’’
If we know people’s gender, can that help us be more accurate? The baserate is that
Q: Can we go over 1) those numbers, and 2) how to interpret them? A: In our sample, we have 1122 drinkers and 654 non-drinkers. The prediction would be that everybody is a drinker, because the majority are. On the people who are not drinkers you are correct 0% of the time; on the drinkers, 100%; the total percent correct is 63.2% (1122/1776).
‘’’2.2 Variables in the Equation’’’
Horizontal:
- B = .540
- S.E. = .049
- Wald = 120.373
- df = 1
- Sig. = .000 (<- This means p is less than .001)
- Exp(B) = 1.716
- Exp(B) here is the odds of being a drinker: (# drinkers)/(# non-drinkers) = 1122/654 = 1.716. In other words, a little less than 2:1.
Now, how to use this: with no predictors in the model, the predicted score for any person is just the constant, B = .540, and e^.540 = 1.716, the odds of being a drinker.
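The Block 0 numbers can be checked by hand. This short Python sketch uses only the two counts from the notes (1122 drinkers, 654 non-drinkers); the variable names are mine. It reproduces the base-rate percent correct, the odds reported as Exp(B), and the constant B as the natural log of those odds.

```python
import math

drinkers, non_drinkers = 1122, 654
n = drinkers + non_drinkers

# With no predictors, predict the majority category for everyone.
base_rate_accuracy = max(drinkers, non_drinkers) / n

# Exp(B) for the constant is the odds of drinking; B is the log of those odds.
odds = drinkers / non_drinkers
b_constant = math.log(odds)

print(f"N = {n}")
print(f"percent correct = {base_rate_accuracy:.1%}")
print(f"odds = {odds:.3f}, B = {b_constant:.3f}")
```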
‘’’3.1 Omnibus Tests of Model Coefficients’’’
If you don’t have any . . .it’s 10.69; the likelihood ratio of χ^2
Now you have a capital R associated with them
‘’’3.2 Model Summary’’’
- Cox & Snell R Square = .006,
- Nagelkerke R Square = .008
Calculated Pearson R on 2x2 table and squared it; it comes out to .006; so these are kind of ball-park.
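As a rough check on these "ball-park" values, the Python sketch below recomputes them. It rests on an assumption: that the handout's 524 and 358 are female drinkers and female non-drinkers, so that the male cells (598 and 296) follow by subtraction from the totals of 1122 drinkers and 654 non-drinkers. Under that assumption the Pearson chi-square for the 2x2 table, the squared correlation (phi squared = chi-square/N), and the Cox & Snell value implied by the likelihood-ratio chi-square of 10.69 all land near .006.

```python
import math

# Rows: male, female. Columns: drinker, non-drinker.
# ASSUMPTION: female cells read from the handout; male cells derived by subtraction.
observed = [
    [1122 - 524, 654 - 358],  # male: 598 drinkers, 296 non-drinkers
    [524, 358],               # female
]

n = sum(sum(row) for row in observed)
row_totals = [sum(row) for row in observed]
col_totals = [sum(col) for col in zip(*observed)]

# Pearson chi-square: sum over cells of (observed - expected)^2 / expected.
chi_square = 0.0
for i, row in enumerate(observed):
    for j, count in enumerate(row):
        expected = row_totals[i] * col_totals[j] / n
        chi_square += (count - expected) ** 2 / expected

phi_squared = chi_square / n            # squared Pearson r for a 2x2 table
cox_snell = 1 - math.exp(-10.69 / n)    # from the likelihood-ratio chi-square in the notes

print(round(chi_square, 2), round(phi_squared, 3), round(cox_snell, 3))
```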
‘’’Variables in the Equation’’’ (p. ??)
- sex2
- B = -.322
- Exp(B) for variable sex2 is .725; that is, the odds ratio: the odds of drinking for the group coded “1” (female) divided by the odds for the group coded “0” (male)
- 524 female drinkers; 358 female non-drinkers (odds = 524/358 ≈ 1.464)
- Constant
- B = .703 (the log odds of drinking for the group coded “0”)
There is a bonus we get out of logistic regression that we don’t get out of χ^2.
We generate a value called “U sub i”:
- Ui = Constant + Bi*Xi
- = .703 + (-.322*Sex2) (<- For Sex2 0=Male; 1=Female)
- U_{males} = .703 + (-.322*0) = .703
- U_{females} = .703 + (-.322*1) = .381
Pi = “Probability for Any Given Case” =
- e^Ui/(1+e^Ui)
- = e^.381/(1+e^.381) = 1.4637/(1+1.4637) = .594
- .594 = The probability that a female is a drinker.
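The two steps above (compute U, then convert it to a probability) can be sketched in a few lines of Python. The coefficients .703 and -.322 are the ones in the notes; the function and variable names are mine.

```python
import math

def logistic(u):
    """Convert a logit U into a probability: e^U / (1 + e^U)."""
    return math.exp(u) / (1 + math.exp(u))

B_CONSTANT = 0.703
B_SEX2 = -0.322          # sex2 is coded 0 = male, 1 = female

u_male = B_CONSTANT + B_SEX2 * 0
u_female = B_CONSTANT + B_SEX2 * 1

p_male = logistic(u_male)
p_female = logistic(u_female)

print(f"U male = {u_male:.3f}, P male = {p_male:.3f}")
print(f"U female = {u_female:.3f}, P female = {p_female:.3f}")
```

Both probabilities are above .5, which is why the model predicts "drinker" for men and women alike.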
You can take a constant, apply an equation just like we did in multiple regression, and get a predicted value for a person with given characteristics.
You find on p. 6, 59.4%; but the value of this is that you can put in more predictors. You’re not constrained to just one predictor.
You can also ask the program to generate this little plot: You’ve got a plot here showing (J: It would be very helpful if the professor would display the handouts on the screen as he or she references them)
‘’’Plot for “Age of Respondents”’’’ (From SPSS Output - may not be included on Wikiversity!)
Everyone who is a male will line up in one spot; everyone who is a female lines up at about 60%; the N’s and Y’s are telling us, “was the person a drinker or not?” - (Y = yes, N = not) - the predicted probability of them being a drinker:
If you know whether someone is male or female, does that help you predict whether they’re a drinker?
- If you get a male, what do you predict? A: Drinker.
- If you get a female, what do you predict? A: Drinker.
So, for this question, you don’t get much difference here:)
Now, based on this model, if we want to predict whether someone is a drinker, given that they’re age 25, the formula for this is: (See Formula Sheet for Categorical Data Analysis)
Ui, for someone Age 25 =
- Ui = B + Bi*Xi
- B = 1.963; Bi = -.033; Xi = Age = 25
- Ui = 1.963 - .033*25 = 1.138
Pi = e^Ui/(1+e^Ui)
- = e^1.138/(1+e^1.138) = 3.12/(1+3.12) = 3.12/4.12 = .76
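Here is the same calculation as a small Python sketch, which also makes the "does this make sense?" question easy to explore. It assumes an age coefficient of -.033, the value implied by Exp(B) = .968 and by the result U = 1.138; the function name is mine.

```python
import math

B_CONSTANT = 1.963
B_AGE = -0.033   # assumption: ln(.968) is about -.033

def p_drink(age):
    """Predicted probability of drinking at a given age under the age-only model."""
    u = B_CONSTANT + B_AGE * age
    return math.exp(u) / (1 + math.exp(u))

u_25 = B_CONSTANT + B_AGE * 25
print(f"U at 25 = {u_25:.3f}, P at 25 = {p_drink(25):.2f}")

# Extrapolating to age 0 shows why the constant should be read with care.
print(f"P at 0 = {p_drink(0):.2f}")
```

The age 0 line is the "newborn" case from the lecture: the equation happily returns a high probability for an age far outside the sample.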
Now, does this make sense?
‘’’Variables in the Equation’’’ (p. 13)
- age
- B = -.033
- S.E.
- Wald = 111.580 (<- the Wald statistic is a χ^2 with 1 degree of freedom. The null hypothesis (“H0”) is that age is unrelated to drinking; if that were true we would expect a χ^2 of about 1, and here we have about 111.)
- df = 1
- Sig.
- Exp(B) = .968: the odds of drinking are multiplied by .968 for each additional year of age, a decrease of about 3% per year
- Constant
- B = 1.963
- S.E. = .146
- Wald = 179.677
- df
- Sig.
- Exp(B) for the constant is 7.118; those are the odds you would get if you were “zero” on the age variable (CD06, p. 13). According to this model, newborn babies have incredible odds of being a drinker: more likely than 25-year-olds.
We find that ‘’Age’’ is a significant predictor; we find that from the ‘’’Wald’’’ statistic.
Distribution of Predicted probability of drinking (p. 14)
Classification Table (p. 13)
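Of the people who did drink, how many did we predict would drink?
- 984
How many of the people who did drink did we predict would not drink?
- 138
(J: So, on this table, did you drink is: …?)
- Correct negatives (non-drinkers predicted not to drink) = 209; 209/N = 209/1776 = .12
- False negatives (beta error: drinkers predicted not to drink) = 138; 138/N = 138/1776 = .08
- Correct positives = 984; the remaining 654 - 209 = 445 non-drinkers were predicted to drink (false positives, alpha error)
So,
- when we said “Drink or not drink” and sorted by gender (“sex2”) we’d be right 63.2% of the time, the same as the base rate, because both men and women are predicted to drink;
- when we put in age (only) we get up to being correct 67.2% of the time: (984 + 209)/1776 = .672.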
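The percent correct can be rebuilt from the cells of the classification table. The Python sketch below assumes the cell layout that reproduces the reported 67.2%: 984 drinkers predicted to drink, 138 drinkers predicted not to drink, 209 non-drinkers predicted not to drink, and (by subtraction from 654 non-drinkers) 445 non-drinkers predicted to drink. The handout itself should be checked for the actual layout.

```python
# ASSUMED cell counts for the age-only model (see lead-in).
true_positive = 984    # drank, predicted to drink
false_negative = 138   # drank, predicted not to drink
true_negative = 209    # did not drink, predicted not to drink
false_positive = 654 - true_negative  # did not drink, predicted to drink

n = true_positive + false_negative + true_negative + false_positive

accuracy = (true_positive + true_negative) / n
sensitivity = true_positive / (true_positive + false_negative)  # drinkers caught
specificity = true_negative / (true_negative + false_positive)  # non-drinkers caught
base_rate = (true_positive + false_negative) / n                # predict "drinker" for all

print(f"N = {n}")
print(f"base rate = {base_rate:.1%}, accuracy with age = {accuracy:.1%}")
print(f"sensitivity = {sensitivity:.1%}, specificity = {specificity:.1%}")
```

Splitting accuracy into sensitivity and specificity shows where the gain comes from: under these counts the model is good at identifying drinkers and much weaker at identifying non-drinkers.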
Part 2
One categorical predictor: Chi-square compared to logistic regression
(CD06 p. 15)
This is a nice little 2x4 table, so a χ^2 test of independence would be the default. If you rejected that hypothesis, would you conclude that single people are more likely to drink?
I start on p. 15 by doing a Crosstab Table
- What % of single people drink? ≈ 70.4%
- Married people: 62.8%
- Widowed: 40.6%
Look at how odds are easily misinterpreted in this case by reading through the handout. (J: Hard if the handout is not uploaded!)
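One way to see how odds can be misread is to convert the percentages above into odds and compare the two scales. This Python sketch uses only the three percentages quoted in the notes; the comparison of single versus widowed people is my own choice of illustration.

```python
# Proportion who drink, by marital status (from the crosstab percentages above).
p_drink = {"single": 0.704, "married": 0.628, "widowed": 0.406}

# Odds = p / (1 - p).
odds = {group: p / (1 - p) for group, p in p_drink.items()}

for group in p_drink:
    print(f"{group:8s} proportion = {p_drink[group]:.3f}  odds = {odds[group]:.2f}")

# The same comparison looks very different on the two scales.
ratio_of_proportions = p_drink["single"] / p_drink["widowed"]
odds_ratio = odds["single"] / odds["widowed"]
print(f"single vs widowed: ratio of proportions = {ratio_of_proportions:.2f}, "
      f"odds ratio = {odds_ratio:.2f}")
```

An odds ratio near 3.5 does not mean single people are three and a half times as likely to drink; the ratio of the proportions themselves is about 1.7.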
We’d pre
SPSS Report:
Categorical Variable Codings (Table)
Variables in the Equation (Table)
Variables not in the Equation (Table)
Block 1: Method = Enter (Header)
Omnibus … (table)
Model Summary (table)
Classification table (Table) (p. 19)
Variables in the Equation (table)
Step 1 (a); a = “Variable(s) entered on step 1: marst.”
‘’’Variables:’’’ (Vertical: Rows)
- marst
- marst(1)
- marst(2)
- marst(3)
- Constant
‘’’___’’’ (Horizontal: Columns)
- B
- S.E.
- Wald
- df
- Sig.
- Exp(B)
p. 20: Step number 1: Graphic of Observed Groups and Predicted Probabilities https://www.evernote.com/shard/s95/sh/55fc6b1d-0c1b-4157-89c6-92bf3eb00418/328870b19b533663394c8f16f1a9dccd
- Widowed is the N down at the bottom
- Married or stable are the big group
- Divorced or separated the little “Y” just to the right of the Married/Stable group
- Single: another step to the right of that.
So now we’ll go on to Hierarchical logistic regression with continuous and categorical predictors
Hierarchical logistic regression with continuous and categorical predictors
‘’’Syntax:’’’
LOGISTIC REGRESSION VAR=drink2
  /METHOD=ENTER age
  /METHOD=ENTER sex2
  /METHOD=ENTER marst
  /CONTRAST (marst)=Indicator
  /CLASSPLOT
  /CRITERIA PIN(.05) POUT(.10) ITERATE(20) CUT(.5) .
Output:
- Logistic Regression: Case Processing Summary (Table), Dependent Variable Encoding (Table), Categorical Variables (Table) (p. 21)
- Block 0: Beginning Block & Block 1: Method = Enter (Block 0 = 2 tables) (Block 1 = 3 tables)
- Block 2 & Block 3 (https://www.evernote.com/shard/s95/sh/2891c586-7765-4be3-9934-defa2ae61329/e7354bf24e11ac6b7d6c0d77373d2092):
- Block 2: Method = Enter
- Table: Variables in the Equation
- Block 3: Method = Enter
- Table: Omnibus Tests of Model Coefficients
- Block 2: Method = Enter
For block 2 Variables in the Equation, Q: How do you interpret the Exp(B) for the Constant? - A: Those are the odds if the person is zero on both predictors; 8.188 is just the odds of being a drinker versus a non-drinker. The other two Exp(B) numbers are odds ratios.
Now, go down to the 3rd step: marital status predicting whether people drink or not when controlling for age and gender. In other words, does marital status predict whether people drink among people of the same age and gender?
Table: Model Summary
Exp(B) - Odds:
- age = .967
- sex2 = .744 = odds ratio: odds of drinking for females / odds of drinking for males (sex2 is coded 0 = male, 1 = female)
- marst: Wald = 3.135 for 3 df: not at all statistically significant. What does that mean? A: Marital status doesn’t help prediction beyond knowing sex and age. M: none of the coefficients here (Married, Single, and Divorced; marst(1) through marst(3)) approach statistical significance.
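To see just how far from significance a Wald value of 3.135 on 3 df is, the p-value can be computed directly. The Python sketch below uses closed-form upper-tail probabilities of the chi-square distribution for 1 and 3 degrees of freedom only (the general case needs an incomplete gamma function, which the standard library does not provide); the function name is mine.

```python
import math

def chi_square_upper_tail(x, df):
    """P(chi-square with df degrees of freedom > x), for df = 1 or 3 only."""
    root = math.sqrt(x / 2)
    if df == 1:
        return math.erfc(root)
    if df == 3:
        return math.erfc(root) + math.sqrt(2 * x / math.pi) * math.exp(-x / 2)
    raise ValueError("this sketch only handles df = 1 or df = 3")

p_marst = chi_square_upper_tail(3.135, 3)    # marital status, 3 df
p_age = chi_square_upper_tail(111.580, 1)    # age Wald from the age-only model

print(f"marst: p = {p_marst:.3f}")
print(f"age:   p = {p_age:.1e}")
```

The marital status p-value comes out around .37, while the age p-value is far below .001, which SPSS prints as .000.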
Based on the overall model patterns in the data, what’s the probability that a 21-year-old divorced man would drink? (I tried to pick a category where there might be some drinking going on:))
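Take the B values, where:
- X1 = Age = 21; B1 = -.034
- X2 = Sex (0 = male, 1 = female) = 0; B2 = -.296
- X3, X4, X5 = the three marital status indicators, marst(1) through marst(3); B3 = -.036, B4 = .170, B5 = .277. For a divorced person the calculation below sets the indicator with B = .277 to 1 and the other two to 0.
- Constant = 2.013
- U = Constant + B1*X1i + B2*X2i + B3*X3i + B4*X4i + B5*X5i
- = Constant + (B for age)*Age + (B for gender)*Gender + (B’s for marital status)*(marital status indicators)
- So we calculate U = (-.034)(21) + (-.296)(0) + (-.036)(0) + (.170)(0) + (.277)(1) + 2.013 = 1.576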
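The notes stop at U; the final step is to convert it into a probability. This Python sketch repeats the sum term by term and applies e^U/(1+e^U). The coefficient values are those in the calculation above; the labels attached to the three marital status indicators are placeholders of mine, since which indicator corresponds to which status should be confirmed against the Categorical Variables Codings table.

```python
import math

# (coefficient B, value X) for a 21-year-old divorced man.
terms = {
    "age":      (-0.034, 21),
    "sex2":     (-0.296, 0),   # 0 = male
    "marst_a":  (-0.036, 0),   # placeholder label
    "marst_b":  (0.170, 0),    # placeholder label
    "marst_c":  (0.277, 1),    # the indicator set to 1 in the notes' calculation
}
CONSTANT = 2.013

u = CONSTANT + sum(b * x for b, x in terms.values())
p = math.exp(u) / (1 + math.exp(u))

print(f"U = {u:.3f}")
print(f"P(drinks) = {p:.3f}")
```

Under this model the predicted probability is about .83.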
If you compare the Log likelihood from the last model to the previous model, what it gives you should
How to Graph Logistic Models with SPSS
One can use syntax to generate graphs in SPSS. In this example we will use the coefficients from the final model to generate a graph of modeled proportion of male and female drivers who drink alcohol as a function of age. We first create a new data file that contains the steps we wish to plot on the X axis (e.g., age 20 to 80 by steps of 5), then provide the equation that uses age to predict probability of drinking, and then create a graph.
- This syntax creates a new data file that has only “Age” in it from 20 to 80 in steps of 5.
input program.
loop #i = 20 to 80 by 5.
compute age = #i.
end case.
end loop.
end file.
end input program.
execute.
- Enter the B values that you will use from the logistic equation in the compute statement.
compute Umale = 2.013 - .036 - .034*age.
compute p_male = exp(Umale)/(1 + exp(Umale)).
compute Ufemale = 2.013 - .036 - .296 - .034*age.
compute p_female = exp(Ufemale)/(1 + exp(Ufemale)).
execute.
- Create a graph with probability on the Y axis and age on the X axis.
GRAPH /LINE(SIMPLE)=VALUE( p_male p_female ) BY age .
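For readers without SPSS, the values behind this graph can be generated with a short Python sketch. It mirrors the compute statements above (same coefficients, same age steps) and prints a table instead of drawing a chart; the function and variable names are mine, and any plotting tool could be used on the resulting rows.

```python
import math

def probability(u):
    """e^U / (1 + e^U)."""
    return math.exp(u) / (1 + math.exp(u))

def u_for(age, female):
    """U from the final model, using the same terms as the SPSS compute statements."""
    u = 2.013 - 0.036 - 0.034 * age
    if female:
        u -= 0.296
    return u

ages = list(range(20, 81, 5))   # 20 to 80 in steps of 5
rows = [(age, probability(u_for(age, False)), probability(u_for(age, True)))
        for age in ages]

print("age  p_male  p_female")
for age, p_male, p_female in rows:
    print(f"{age:3d}  {p_male:6.3f}  {p_female:8.3f}")
```

Both curves fall with age, and the female curve sits below the male curve at every age because the sex2 coefficient is negative.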
How to Graph Logistic Regression Models with Excel
A graph can be an excellent way to show data or a model. Here we demonstrate using the graphing capability of Excel to create a graph showing the predicted probability of drinking as a function of age for single men and women.