Diagnostics 1 (Reading: Faraway(2005), 4.2)


Residual

We'll use the data (with country name) as an example here. 

 

First fit the model and make an index plot of the residuals:

> savings <- read.table("savings.data") # read the data into R

> g <- lm(sav ~ p15 + p75 + inc + gro, data=savings) # fit the model with sav as the response and the rest variables as predictors

> plot(g$res, ylab="Residual", main="index plot of residuals") # draw the plot of residuals against index

 


Leverage

Now let us look at the leverage.

 

We first extract the X-matrix here using "model.matrix()" and then compute and plot the leverages (also called "hat" values)

> x <- model.matrix(g) # extract X-matrix
> lev <- hat(x) # calculate leverages. The command "hat" returns the leverage, another way is to use the function "lm.influence", e.g., lev <- lm.influence(g)$hat, see example in influential observation
> sum(lev) # examine the sum of leverages

[1] 5

 

Let's check which countries have large leverages. 

 


Studentized residuals

Now get the studentized residuals:

> gsum <- summary(g) # get the summary output of the fitted model g
> gsum$sig # The σ-hat

[1] 3.802648

> stud <- g$res/(gsum$sig*sqrt(1-lev)) # studentized residuals, an easy way is to use the function "rstandard()", e.g., stud <- rstandard(g)
> plot(stud,ylab="studentized Residuals",main="studentized residuals") # draw the plot of studentized residuals against index

 


Outlier and jacknife residuals

Let's compute the jacknife residuals:

> jack <- stud*sqrt((50-5-1)/(50-5-stud^2)) # jacknife residuals, an easy way is to use the function "rstudent()", e.g., jack <- rstudent(g)
> plot(jack, ylab="Jacknife residuals", main="Jacknife residuals") # draw the plot of jacknife residuals against index

> jack[abs(jack)==max(abs(jack))] # find out the country with largest jacknife residual

Zambia 
2.853581

> qt(0.05/2, 44) # compare with tn-p-1(a/2)

[1] -2.015368


Here is an example of a dataset with multiple outliers.

Read in and plot the data:

> star <- read.table("star.data",h=T) # read the data into R

> star # take a look of the data

   index temp light

1      1 4.37  5.23

2      2 4.56  5.74

...deleted...

47    47 4.42  4.50

> plot(star$temp,star$light) # draw the scatter plot of log light intensity against temperature

Now fit a linear regression and add the fitted line to the plot:

> gs <- lm(light ~ temp, data=star) # fit a model with light as the response and temp as the predictor
> abline(gs) # add the fitted line to the scatter plot

 

The four stars on the upper left of the plot are giants. See what happens if these are excluded:

> gs1 <- lm(light ~ temp, data=star, subset=(temp>3.6)) # fit a model without using the giant data
> plot(star$temp,star$light); abline(gs); abline(gs1, lty=2)  # add the fitted line of the last fitted model

 


Influential observation

Continuing with our study of the saving data. First we compute and plot the Cook statistics:

> cook <- stud^2*lev/(5*(1-lev)) # The Cook statistics, an easy way is to use the function "cooks.distance()", e.g., cook <- cooks.distance(g)
> plot(cook, ylab="Cooks distance") # draw the plot of Cook's statistics against index
> identify(1:50,cook,countries) # identify the three countries with largest Cook's statistics\

 

It would be rather tedious to do this for each country. There's is a quick way in R so that we don't have to fit 50 regression models to examine the coefficients found when each country is excluded:

> ginf <- lm.influence(g) # calculate some values useful for investigating influence
> names(ginf) # ginf has three components, the hat values (leverages), change in leave-out-one coefficients, change in leave-out-one sigmahat, and weighted residuals

[1] "hat"          "coefficients" "sigma"        "wt.res" 

> ginf # take a look of ginf

$hat

    Australia       Austria       Belgium       Bolivia        Brazil        Canada

   0.06771409    0.12039224    0.08748097    0.08947314    0.06956000    0.15840654

    ...deleted...

        Libya      Malaysia

   0.53145507    0.06523377

 

$coefficients

              (Intercept)           p15          p75           inc           gro

Australia      0.09156168 -1.525349e-03 -0.029047707  4.265293e-05 -3.166120e-05

Austria       -0.07467639  8.682630e-04  0.044733545 -3.456376e-05 -1.622880e-03

    ...deleted...

Malaysia       0.27201243 -8.876680e-03  0.035190516 -4.633154e-05 -1.423981e-02

 

$sigma

    Australia       Austria       Belgium       Bolivia        Brazil        Canada

     3.843255      3.844341      3.829643      3.844035      3.805321      3.845264

    ...deleted...

        Libya      Malaysia

     3.794791      3.817615

 

$wt.res

    Australia       Austria       Belgium       Bolivia        Brazil        Canada

    0.8632876     0.6162526     2.2187481    -0.6982426     3.5527272    -0.3172475

    ...deleted...

        Libya      Malaysia

   -2.8294673    -2.9708394

 

The following is a summary of the countries we identified in previous analysis.

  Zambia Chile Libya US Japan Ireland Sweden
large leverage     ** * * *  
studentized residual * *          
jacknife residual * *          
Cook stat. *   **   *    
diff. in LO1 coef. (p15)     *   *    
diff. in LO1 coef. (p75)         * *  
diff. in LO1 coef. (inc)       *   * *
diff. in LO1 coef. (gro)     *        
change in LO1 sigma * *