Monday, 1 June 2015

LOGISTIC REGRESSION using R


Before doing logistic regression, one should be clear about what "odds" are and how to interpret them.
When you see on a screen that the odds for a particular team A are 6 to 1, this means the team is expected to lose 6 times for every 1 win. In terms of probability, the probability of winning is 1/7, but the odds of winning are 1/6 ≈ 0.17. Odds are the ratio of two probabilities:

          p(one outcome)       p(success)    p
odds = -------------------- = ----------- = ---, where q = 1 - p
       p(the other outcome)    p(failure)    q

So, odds(winning) = (1/7)/(6/7) = 1/6

Suppose in the same match Team B is given odds of 2 to 1, which is to say, two expected losses for each expected win. Team B's odds of winning are 1/2, or 0.5. How much better is this than the winning odds for Team A?

The odds ratio tells us: 0.5 / 0.17 ≈ 3. The odds of Team B winning are three times the odds of Team A winning. Be careful not to say "three times as likely to win," which would not be correct. The probability (likelihood, chance) of Team B winning is 1/3 and of Team A is 1/7, giving a likelihood ratio of (1/3)/(1/7) = 7/3 ≈ 2.33. Team B is 2.33 times more likely to win than Team A.
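These relationships are easy to check directly in R; the snippet below simply reproduces the Team A / Team B arithmetic above:

```r
p_A <- 1/7                 # probability Team A wins
p_B <- 1/3                 # probability Team B wins

odds_A <- p_A / (1 - p_A)  # (1/7)/(6/7) = 1/6, about 0.17
odds_B <- p_B / (1 - p_B)  # (1/3)/(2/3) = 1/2

odds_B / odds_A            # odds ratio: 3
p_B / p_A                  # likelihood ratio: 7/3, about 2.33
```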

Logistic regression is a method for fitting a regression curve, y = f(x), when y consists of proportions or probabilities, or binary-coded (0/1 -- failure/success) data. When the response is a binary (dichotomous) variable and x is numerical, logistic regression fits a logistic curve to the relationship between x and y. Hence, logistic regression is linear regression on the logit transform of y, where y is the proportion (or probability) of success at each value of x.
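The logit transform and its inverse are available in base R as qlogis() and plogis(); a quick sketch:

```r
p <- 0.75
log(p / (1 - p))   # logit (log-odds) of p: log(3), about 1.0986
qlogis(p)          # the same logit, via the built-in quantile function
plogis(log(3))     # inverse logit: back to 0.75
```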

We will work on a dataset where we are interested in how variables such as GRE (Graduate Record Exam score), GPA (grade point average) and the prestige of the undergraduate institution affect admission into graduate school. The response variable, admit/don't admit, is a binary variable.

> Data <- read.csv("C:/Users/shubham/Downloads/binary.csv")
> head(Data)
  admit gre  gpa rank
1     0 380 3.61    3
2     1 660 3.67    3
3     1 800 4.00    1
4     1 640 3.19    4
5     0 520 2.93    4
6     1 760 3.00    2

> summary(Data)
     admit            gre             gpa             rank      
 Min.   :0.0000   Min.   :220.0   Min.   :2.260   Min.   :1.000  
 1st Qu.:0.0000   1st Qu.:520.0   1st Qu.:3.130   1st Qu.:2.000  
 Median :0.0000   Median :580.0   Median :3.395   Median :2.000  
 Mean   :0.3175   Mean   :587.7   Mean   :3.390   Mean   :2.485  
 3rd Qu.:1.0000   3rd Qu.:660.0   3rd Qu.:3.670   3rd Qu.:3.000  
 Max.   :1.0000   Max.   :800.0   Max.   :4.000   Max.   :4.000  

> sapply(Data,sd)
      admit         gre         gpa        rank 
  0.4660867 115.5165364   0.3805668   0.9444602 

> xtabs(~admit+rank, data=Data)
     rank
admit  1  2  3  4
    0 28 97 93 55
    1 33 54 28 12

The dataset has been loaded with "read.csv", and the "head()" function is used to see the first six observations. Our target variable is "admit", which is dichotomous in nature: 0 means not admitted and 1 means admitted. We are taking "gpa" and "gre" as continuous variables and "rank" as a categorical variable with 4 levels, from rank 1 to rank 4, where institutions with a rank of 1 have the highest prestige and those with a rank of 4 the lowest. We can also use "tail()" to see the last six observations. A summary of the dataset is obtained using the "summary()" command, the standard deviations of the variables using the "sapply()" function, and finally we have made a contingency table of admit against rank using "xtabs()".

> str(Data)
'data.frame': 400 obs. of  4 variables:
 $ admit: int  0 1 1 1 0 1 1 0 1 0 ...
 $ gre  : int  380 660 800 640 520 760 560 400 540 700 ...
 $ gpa  : num  3.61 3.67 4 3.19 2.93 3 2.98 3.08 3.39 3.92 ...
 $ rank : int  3 3 1 4 4 2 1 2 3 2 ...

> Data$rank <- as.factor(Data$rank)

If we take a look at the above output given by the "str()" function, we can see that rank is being treated as an integer. To make it categorical, it has been converted into a factor using the function "as.factor".
Now we are ready to estimate the logistic regression model. We can do it using the "glm()" function, specifying family = "binomial" so that a logistic (rather than linear) model is fitted.

> logit <- glm(admit ~ gre + gpa + rank, data = Data, family = "binomial")
> summary(logit)
Call:
glm(formula = admit ~ gre + gpa + rank, family = "binomial", data = Data)

Deviance Residuals: 
    Min       1Q   Median       3Q      Max  
-1.6268  -0.8662  -0.6388   1.1490   2.0790  

Coefficients:
             Estimate Std. Error z value Pr(>|z|)    
(Intercept) -3.989979   1.139951  -3.500 0.000465 ***
gre          0.002264   0.001094   2.070 0.038465 *  
gpa          0.804038   0.331819   2.423 0.015388 *  
rank2       -0.675443   0.316490  -2.134 0.032829 *  
rank3       -1.340204   0.345306  -3.881 0.000104 ***
rank4       -1.551464   0.417832  -3.713 0.000205 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 499.98  on 399  degrees of freedom
Residual deviance: 458.52  on 394  degrees of freedom
AIC: 470.52

Number of Fisher Scoring iterations: 4

Above, "glm()" is used to fit the model with "admit" as our response variable. We can see that both gre and gpa are statistically significant, as are the three terms for rank. The logistic regression coefficients give the change in the log odds of the outcome for a one-unit increase in the predictor variable.
  • For every one-unit change in gre, the log odds of admission (versus non-admission) increases by 0.002.
  • For a one-unit increase in gpa, the log odds of being admitted to graduate school increases by 0.804.
  • The indicator variables for rank have a slightly different interpretation. For example, having attended an undergraduate institution with a rank of 2, versus an institution with a rank of 1, changes the log odds of admission by -0.675.
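Since log odds are hard to think in, a common next step is to exponentiate the coefficients to get odds ratios. A minimal sketch, assuming the model object "logit" fitted above:

```r
## Assumes the model object `logit` fitted above.
## Exponentiating log-odds coefficients gives odds ratios, e.g. the
## exponentiated gpa coefficient is the multiplicative change in the
## odds of admission for a one-unit increase in gpa.
exp(coef(logit))
```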
Below the table of coefficients are fit indices, including the null and deviance residuals and the AIC.
But what do the null deviance and residual deviance actually mean?
Let LL = log-likelihood.
Here is a quick summary of what you see in the summary(glm.fit) output:
Null Deviance = 2(LL(Saturated Model) - LL(Null Model)), on df = df_Sat - df_Null
Residual Deviance = 2(LL(Saturated Model) - LL(Proposed Model)), on df = df_Sat - df_Res
The Saturated Model is a model that assumes each data point has its own parameter (which means you have n parameters to estimate). The Null Model assumes the exact "opposite": one parameter for all of the data points, which means you estimate only 1 parameter. The Proposed Model assumes you can explain your data points with p predictors plus an intercept term, so you have p + 1 parameters. If your Null Deviance is really small, it means that the Null Model explains the data pretty well. Likewise with your Residual Deviance.
What does "really small" mean? If your model is "good", then its deviance is approximately Chi^2 with (df_Sat - df_Model) degrees of freedom. If you want to compare your Null Model with your Proposed Model, you can look at
(Null Deviance - Residual Deviance), approximately Chi^2 with df = df_Null - df_Proposed = (n-1) - (n-(p+1)) = p
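As a sketch of this comparison in R (shown here on the built-in mtcars data purely for illustration; the idea is identical for any glm fit, including the admissions model above):

```r
## Drop in deviance from the null model to the fitted model,
## compared against a chi-squared distribution with p df.
fit <- glm(vs ~ wt + disp, data = mtcars, family = "binomial")

dev_drop <- fit$null.deviance - fit$deviance   # Null Deviance - Residual Deviance
df_drop  <- fit$df.null - fit$df.residual      # equals p, the number of predictors
pchisq(dev_drop, df = df_drop, lower.tail = FALSE)  # p-value of the comparison
```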

Now we would like to test the overall effect of rank. We can do it with a Wald test, using the "wald.test()" function. But before that, install and load the package "aod" in R.

> library(aod)
> wald.test(b = coef(logit), Sigma = vcov(logit), Terms = 4:6)
Wald test:
----------

Chi-squared test:
X2 = 20.9, df = 3, P(> X2) = 0.00011

In the wald.test function, "b" supplies the coefficients, "Sigma" supplies the variance-covariance matrix of the coefficient estimates, and finally "Terms" tells R which terms in the model are to be tested; in this case terms 4, 5 and 6 are the three terms for the levels of rank. The chi-squared test statistic of 20.9, with three degrees of freedom, is associated with a p-value of 0.00011, indicating that the overall effect of rank is statistically significant.
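wald.test() can also test a specific contrast between coefficients. As a sketch using the same aod interface (and the model "logit" from above), the L argument below picks out the difference between the rank 2 and rank 3 terms, i.e. it tests whether those two levels differ from each other:

```r
library(aod)  # provides wald.test()
## One entry per coefficient in the model above; the 1 and -1 in
## positions 4 and 5 select (rank2 coefficient) - (rank3 coefficient).
l <- cbind(0, 0, 0, 1, -1, 0)
wald.test(b = coef(logit), Sigma = vcov(logit), L = l)
```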

Now we will predict the probability of admission at each value of rank, holding gre and gpa at their means. First we create and view the data frame.

> newdata <- with(Data, data.frame(gre = mean(gre), gpa = mean(gpa), rank = factor(1:4)))
> newdata
    gre    gpa rank
1 587.7 3.3899    1
2 587.7 3.3899    2
3 587.7 3.3899    3
4 587.7 3.3899    4

> newdata$rankP <- predict(logit, newdata = newdata, type = "response")
> newdata
    gre    gpa rank     rankP
1 587.7 3.3899    1 0.5166016
2 587.7 3.3899    2 0.3522846
3 587.7 3.3899    3 0.2186120
4 587.7 3.3899    4 0.1846684

In the above output we see that the predicted probability of being accepted into a graduate program is 0.52 for students from the highest-prestige undergraduate institutions (rank = 1) and 0.18 for students from the lowest-ranked institutions (rank = 4), holding gre and gpa at their means. Likewise, we can predict the probability for any required gpa and gre.
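To see how the predicted probability varies with gre rather than at a single point, we can predict over a grid of values. A sketch, assuming the Data frame and "logit" model from above:

```r
## Predicted probability of admission over a grid of gre values,
## for each rank, holding gpa at its mean.
grid <- with(Data, data.frame(
  gre  = rep(seq(200, 800, by = 100), times = 4),  # 7 gre values x 4 ranks
  gpa  = mean(gpa),
  rank = factor(rep(1:4, each = 7))))
grid$prob <- predict(logit, newdata = grid, type = "response")
head(grid)
```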
