Monday, 1 June 2015

LOGISTIC REGRESSION using R


Before doing logistic regression, one should be clear about what "odds" are and how to interpret them.
When you see on a screen that the odds for a particular team A are 6 to 1, this means the team is expected to lose 6 times for every 1 win. In terms of probability, the probability of winning is 1/7, but the odds of winning are 1/6 ≈ 0.17. Odds are the ratio of two probabilities:

          p(one outcome)       p(success)    p
odds = -------------------- = ----------- = ---, where q = 1 - p
       p(the other outcome)    p(failure)    q

So, odds(winning) = (1/7)/(6/7) = 1/6

Suppose in the same match Team B is given odds of 2 to 1, which is to say, two expected losses for each expected win. Team B's odds of winning are 1/2, or 0.5. How much better is this than the winning odds for Team A?

The odds ratio tells us: 0.5 / 0.17 ≈ 3. The odds of Team B winning are three times the odds of Team A winning. Be careful not to say "three times as likely to win," which would not be correct. The probability (likelihood, chance) of Team B winning is 1/3 and of Team A is 1/7, giving a likelihood ratio of (1/3)/(1/7) = 7/3 ≈ 2.33. Team B is 2.33 times more likely to win than Team A.
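These relationships are easy to check directly in R; the snippet below simply reproduces the Team A / Team B arithmetic above:

```r
p_A <- 1/7                 # probability Team A wins
p_B <- 1/3                 # probability Team B wins

odds_A <- p_A / (1 - p_A)  # (1/7)/(6/7) = 1/6, about 0.17
odds_B <- p_B / (1 - p_B)  # (1/3)/(2/3) = 1/2

odds_B / odds_A            # odds ratio: 3
p_B / p_A                  # likelihood ratio: 7/3, about 2.33
```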

Logistic regression is a method for fitting a regression curve, y = f(x), when y consists of proportions or probabilities, or binary-coded (0/1 -- failure/success) data. When the response is a binary (dichotomous) variable and x is numerical, logistic regression fits a logistic curve to the relationship between x and y. Hence, logistic regression is linear regression on the logit transform of y, where y is the proportion (or probability) of success at each value of x.
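The logit transform and its inverse are available in base R as qlogis() and plogis(); a quick sketch:

```r
p <- 0.75
log(p / (1 - p))   # logit (log-odds) of p: log(3), about 1.0986
qlogis(p)          # the same logit, via the built-in quantile function
plogis(log(3))     # inverse logit: back to 0.75
```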

We will work on a dataset where we are interested in how variables such as GRE (Graduate Record Exam score), GPA (grade point average) and the prestige of the undergraduate institution affect admission into graduate school. The response variable, admit/don't admit, is a binary variable.

> Data <- read.csv("C:/Users/shubham/Downloads/binary.csv")
> head(Data)
  admit gre  gpa rank
1     0 380 3.61    3
2     1 660 3.67    3
3     1 800 4.00    1
4     1 640 3.19    4
5     0 520 2.93    4
6     1 760 3.00    2

> summary(Data)
     admit            gre             gpa             rank      
 Min.   :0.0000   Min.   :220.0   Min.   :2.260   Min.   :1.000  
 1st Qu.:0.0000   1st Qu.:520.0   1st Qu.:3.130   1st Qu.:2.000  
 Median :0.0000   Median :580.0   Median :3.395   Median :2.000  
 Mean   :0.3175   Mean   :587.7   Mean   :3.390   Mean   :2.485  
 3rd Qu.:1.0000   3rd Qu.:660.0   3rd Qu.:3.670   3rd Qu.:3.000  
 Max.   :1.0000   Max.   :800.0   Max.   :4.000   Max.   :4.000  

> sapply(Data,sd)
      admit         gre         gpa        rank 
  0.4660867 115.5165364   0.3805668   0.9444602 

> xtabs(~admit+rank, data=Data)
     rank
admit  1  2  3  4
    0 28 97 93 55
    1 33 54 28 12

The dataset has been loaded with "read.csv", and the "head()" function is used to see the first six observations. Our target variable is "admit", which is dichotomous in nature: 0 means not admitted and 1 means admitted. We are taking "gpa" and "gre" as continuous variables and "rank" as a categorical variable with 4 levels, from rank 1 to rank 4, where institutions with a rank of 1 have the highest prestige and those with a rank of 4 the lowest. We can also use "tail()" to see the last six observations. A summary of the dataset is obtained using the "summary()" command, the standard deviations of the variables using the "sapply()" function, and finally we have made a contingency table of admit against rank using "xtabs()".

> str(Data)
'data.frame': 400 obs. of  4 variables:
 $ admit: int  0 1 1 1 0 1 1 0 1 0 ...
 $ gre  : int  380 660 800 640 520 760 560 400 540 700 ...
 $ gpa  : num  3.61 3.67 4 3.19 2.93 3 2.98 3.08 3.39 3.92 ...
 $ rank : int  3 3 1 4 4 2 1 2 3 2 ...

> Data$rank <- as.factor(Data$rank)

If we take a look at the above output given by the "str()" function, we can see that rank is being treated as an integer. To make it categorical, it has been converted into a factor using the function "as.factor".
Now we are ready to estimate the logistic regression model. We can do it using the "glm()" function, specifying family = "binomial" so that a logistic (rather than linear) model is fitted.

> logit <- glm(admit ~ gre + gpa + rank, data = Data, family = "binomial")
> summary(logit)
Call:
glm(formula = admit ~ gre + gpa + rank, family = "binomial", data = Data)

Deviance Residuals: 
    Min       1Q   Median       3Q      Max  
-1.6268  -0.8662  -0.6388   1.1490   2.0790  

Coefficients:
             Estimate Std. Error z value Pr(>|z|)    
(Intercept) -3.989979   1.139951  -3.500 0.000465 ***
gre          0.002264   0.001094   2.070 0.038465 *  
gpa          0.804038   0.331819   2.423 0.015388 *  
rank2       -0.675443   0.316490  -2.134 0.032829 *  
rank3       -1.340204   0.345306  -3.881 0.000104 ***
rank4       -1.551464   0.417832  -3.713 0.000205 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 499.98  on 399  degrees of freedom
Residual deviance: 458.52  on 394  degrees of freedom
AIC: 470.52

Number of Fisher Scoring iterations: 4

Above, "glm()" is used to fit the model with "admit" as our response variable. We can see that both gre and gpa are statistically significant, as are the three terms for rank. The logistic regression coefficients give the change in the log odds of the outcome for a one-unit increase in the predictor variable.
  • For every one-unit change in gre, the log odds of admission (versus non-admission) increases by 0.002.
  • For a one-unit increase in gpa, the log odds of being admitted to graduate school increases by 0.804.
  • The indicator variables for rank have a slightly different interpretation. For example, having attended an undergraduate institution with a rank of 2, versus an institution with a rank of 1, changes the log odds of admission by -0.675.
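Since log odds are hard to think in, a common next step is to exponentiate the coefficients to get odds ratios. A minimal sketch, assuming the model object "logit" fitted above:

```r
## Assumes the model object `logit` fitted above.
## Exponentiating log-odds coefficients gives odds ratios, e.g. the
## exponentiated gpa coefficient is the multiplicative change in the
## odds of admission for a one-unit increase in gpa.
exp(coef(logit))
```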
Below the table of coefficients are fit indices, including the null and deviance residuals and the AIC.
But what do the null deviance and residual deviance actually mean?
Let LL = log-likelihood.
Here is a quick summary of what you see in the summary(glm.fit) output:
Null Deviance = 2(LL(Saturated Model) - LL(Null Model)), on df = df_Sat - df_Null
Residual Deviance = 2(LL(Saturated Model) - LL(Proposed Model)), on df = df_Sat - df_Res
The Saturated Model is a model that assumes each data point has its own parameter (which means you have n parameters to estimate). The Null Model assumes the exact "opposite": one parameter for all of the data points, which means you estimate only 1 parameter. The Proposed Model assumes you can explain your data points with p predictors plus an intercept term, so you have p + 1 parameters. If your Null Deviance is really small, it means that the Null Model explains the data pretty well. Likewise with your Residual Deviance.
What does "really small" mean? If your model is "good", then its deviance is approximately Chi^2 with (df_Sat - df_Model) degrees of freedom. If you want to compare your Null Model with your Proposed Model, you can look at
(Null Deviance - Residual Deviance), approximately Chi^2 with df = df_Null - df_Proposed = (n-1) - (n-(p+1)) = p
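As a sketch of this comparison in R (shown here on the built-in mtcars data purely for illustration; the idea is identical for any glm fit, including the admissions model above):

```r
## Drop in deviance from the null model to the fitted model,
## compared against a chi-squared distribution with p df.
fit <- glm(vs ~ wt + disp, data = mtcars, family = "binomial")

dev_drop <- fit$null.deviance - fit$deviance   # Null Deviance - Residual Deviance
df_drop  <- fit$df.null - fit$df.residual      # equals p, the number of predictors
pchisq(dev_drop, df = df_drop, lower.tail = FALSE)  # p-value of the comparison
```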

Now we would like to test the overall effect of rank. We can do it with a Wald test, using the "wald.test()" function. But before that, install and load the package "aod" in R.

> library(aod)
> wald.test(b = coef(logit), Sigma = vcov(logit), Terms = 4:6)
Wald test:
----------

Chi-squared test:
X2 = 20.9, df = 3, P(> X2) = 0.00011

In the wald.test function, "b" supplies the coefficients, "Sigma" supplies the variance-covariance matrix of the coefficient estimates, and finally "Terms" tells R which terms in the model are to be tested; in this case terms 4, 5 and 6 are the three terms for the levels of rank. The chi-squared test statistic of 20.9, with three degrees of freedom, is associated with a p-value of 0.00011, indicating that the overall effect of rank is statistically significant.
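wald.test() can also test a specific contrast between coefficients. As a sketch using the same aod interface (and the model "logit" from above), the L argument below picks out the difference between the rank 2 and rank 3 terms, i.e. it tests whether those two levels differ from each other:

```r
library(aod)  # provides wald.test()
## One entry per coefficient in the model above; the 1 and -1 in
## positions 4 and 5 select (rank2 coefficient) - (rank3 coefficient).
l <- cbind(0, 0, 0, 1, -1, 0)
wald.test(b = coef(logit), Sigma = vcov(logit), L = l)
```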

Now we will predict the probability of admission at each value of rank, holding gre and gpa at their means. First we create and view the data frame.

> newdata <- with(Data, data.frame(gre = mean(gre), gpa = mean(gpa), rank = factor(1:4)))
> newdata
    gre    gpa rank
1 587.7 3.3899    1
2 587.7 3.3899    2
3 587.7 3.3899    3
4 587.7 3.3899    4

> newdata$rankP <- predict(logit, newdata = newdata, type = "response")
> newdata
    gre    gpa rank     rankP
1 587.7 3.3899    1 0.5166016
2 587.7 3.3899    2 0.3522846
3 587.7 3.3899    3 0.2186120
4 587.7 3.3899    4 0.1846684

In the above output we see that the predicted probability of being accepted into a graduate program is 0.52 for students from the highest-prestige undergraduate institutions (rank = 1) and 0.18 for students from the lowest-ranked institutions (rank = 4), holding gre and gpa at their means. Likewise, we can predict the probability for any required gpa and gre.
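To see how the predicted probability varies with gre rather than at a single point, we can predict over a grid of values. A sketch, assuming the Data frame and "logit" model from above:

```r
## Predicted probability of admission over a grid of gre values,
## for each rank, holding gpa at its mean.
grid <- with(Data, data.frame(
  gre  = rep(seq(200, 800, by = 100), times = 4),  # 7 gre values x 4 ranks
  gpa  = mean(gpa),
  rank = factor(rep(1:4, each = 7))))
grid$prob <- predict(logit, newdata = grid, type = "response")
head(grid)
```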
