Solutions to regression book by Sheather, chapter 6

Problem 1

Well that is fairly easy! (I’m not sure why this exercise is here)

\[ Var(\hat{Y})=Var(HY)=H(Var(Y))H^\intercal=H\sigma^2IH^\intercal=\sigma^2H \]

Note that H is symmetric and idempotent.

Did you know that you can write your proofs in a proof assistant?! Yes, you heard that right, A PROOF ASSISTANT; A TRUE PATH TO SALVATION!

As you may already know, most people don’t care about foundations, but I do! Don’t you want to know your proof is based on what axioms? Whether your proof is logical? Don’t you want the machine to check your proofs? Don’t you want to find more proofs with the aid of AI? If you do, follow along…

I suggest you to use LEAN. It’s an open source proof assistant so it’s totally free! A lot of great theorems have been proven in lean like The Law of Large Numbers, BUT! you know what’s missing? The Cochran’s Theorem, Central Limit Theorem, concepts like sufficiency, completeness, etc.

Join us here on Zulip, we are building the future of FORMAL SCIENCE!

Did I mention the Holy Trinity of Formal Science?

The Holy Trinity of Formal Science

OK, enough advertisement. Let’s head back to regression, shall we?

Problem 2

What the hell is a honeymoon effect?! Anyways, I think this is a time series problem since a lot of the measurements depend on past years and so the first six variables must be highly correlated which breaks the uncorrelated assumption (which is \(Cov(\epsilon_i,\epsilon_j)=0\) I guess!)! Another thing is that some of these variables are related (which leads to multicollinearity) like \(x_8\) and \(x_{11}\) If I understood correctly since the high quality stadiums charge you more money thus they earn more income :) (I don’t know what I’m talking about!)

Anyways, I don’t think multicollinearity is necessarily a problem as long as you have enough data!

Problem 3

What if my answer wasn’t too impressive? Would you still be so impressed? Anyways, I hope these questions be the last ones in his mind!

a

In my opinion, it’s not a valid model! Let me tell you why:

Figure 6.53 shows correlation between independent variables, thus multicollinearity exists!
“The plot of residuals against fitted values produces a curved pattern” even the impressed analyst admits!
The right tail doesn’t look normal!
Bad leverage points?!(Don’t cry, transformation will fix it honey!)

b

A transformation MIGHT help!

c

The threshold for a leverage point is \[hii>\frac{2(p+1)}{n} \] which is

2*(7+1)/234

## [1] 0.06837607

Roughly speaking, I can confirm that the point 223 is a bad leverage point!

d

Figure 6.55 shows correlation between independent variables, thus multicollinearity exists!
Oops! More bad leverage points!(David Cox left the group!)
Almost no pattern in residuals (This is good BTW)

I don’t like any of these models! Something is wrong with those bad leverage points!

You might remove …(NOOOOO, don’t peel the orange, the doc said you may cut your hand!)

e

\[F=\frac{(RSS(reduced) - RSS(full))/(df_R - df_F)}{MSE(full)}=\frac{(RSS(reduced) - RSS(full))/k}{RSS(full)/(n - p - 1)}\]

First, let’s examine the outputs of the book!

library(car)

## Loading required package: carData

d = read.csv("cars04.csv",header=TRUE)
attach(d)
tSuggestedRetailPrice <- log(SuggestedRetailPrice)
tEngineSize <- EngineSize^0.25
tCylinders <- log(Cylinders)
tHorsepower <- log(Horsepower)
tHighwayMPG <- 1/HighwayMPG
tWheelBase <- log(WheelBase)
FullModel <- lm(tSuggestedRetailPrice~tEngineSize+tCylinders
+tHorsepower+tHighwayMPG+Weight+tWheelBase+Hybrid)
summary(FullModel)

## 
## Call:
## lm(formula = tSuggestedRetailPrice ~ tEngineSize + tCylinders + 
##     tHorsepower + tHighwayMPG + Weight + tWheelBase + Hybrid)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.42288 -0.10983 -0.00203  0.10279  0.70068 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  5.703e+00  2.010e+00   2.838  0.00496 ** 
## tEngineSize -1.575e+00  3.332e-01  -4.727 4.01e-06 ***
## tCylinders   2.335e-01  1.204e-01   1.940  0.05359 .  
## tHorsepower  8.992e-01  8.876e-02  10.130  < 2e-16 ***
## tHighwayMPG  8.029e-01  4.758e+00   0.169  0.86614    
## Weight       5.043e-04  6.367e-05   7.920 1.07e-13 ***
## tWheelBase  -6.385e-02  4.715e-01  -0.135  0.89240    
## Hybrid       6.422e-01  1.150e-01   5.582 6.78e-08 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.1789 on 226 degrees of freedom
## Multiple R-squared:  0.8621, Adjusted R-squared:  0.8578 
## F-statistic: 201.8 on 7 and 226 DF,  p-value: < 2.2e-16

ReducedModel <- lm(tSuggestedRetailPrice~tEngineSize+
tCylinders+tHorsepower+Weight+Hybrid)
summary(ReducedModel)

## 
## Call:
## lm(formula = tSuggestedRetailPrice ~ tEngineSize + tCylinders + 
##     tHorsepower + Weight + Hybrid)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.42224 -0.11001 -0.00099  0.10191  0.70205 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  5.422e+00  3.291e-01  16.474  < 2e-16 ***
## tEngineSize -1.591e+00  3.157e-01  -5.041 9.45e-07 ***
## tCylinders   2.375e-01  1.186e-01   2.003   0.0463 *  
## tHorsepower  9.049e-01  8.305e-02  10.896  < 2e-16 ***
## Weight       5.029e-04  5.203e-05   9.666  < 2e-16 ***
## Hybrid       6.340e-01  1.080e-01   5.870 1.53e-08 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.1781 on 228 degrees of freedom
## Multiple R-squared:  0.862,  Adjusted R-squared:  0.859 
## F-statistic: 284.9 on 5 and 228 DF,  p-value: < 2.2e-16

#I also tried this method but I got the same output as above
#d = cbind(basicPower(d[-8],c(0,0.25,0,0,-1,1,0)),d$Hybrid)
#lm(d$`log(SuggestedRetailPrice)`~.,d)

There’s a problem though, the output of full transformed regression model doesn’t match up with the output in the book, so I’m going to use the outputs produced here.

# Manual partial F-test
SSE_F = sum(FullModel$residuals^2)
SSE_R = sum(ReducedModel$residuals^2)
df_F = FullModel$df.residual
df_R = ReducedModel$df.residual
F_stat = ((SSE_R - SSE_F)/(df_R-df_F))/(SSE_F/df_F)

F_stat

## [1] 0.03400598

print("p-value = ")

## [1] "p-value = "

pf(F_stat,df_R-df_F,df_F,lower.tail = FALSE)

## [1] 0.9665707

# Easy partial F-test
anova(ReducedModel,FullModel)

Given the p-value equals 0.9666, there is little, if any, evidence to support the alternative hypothesis.

This means that we are happy to adopt the reduced model.

f

You can use regex!

brand = as.character(regmatches(d$Vehicle.Name,gregexpr("^[^ ]*",d$Vehicle.Name)))

NewModel <- lm(tSuggestedRetailPrice~tEngineSize+
tCylinders+tHorsepower+Weight+Hybrid+brand)
summary(NewModel)

## 
## Call:
## lm(formula = tSuggestedRetailPrice ~ tEngineSize + tCylinders + 
##     tHorsepower + Weight + Hybrid + brand)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.38766 -0.05957 -0.00215  0.06234  0.37516 
## 
## Coefficients:
##                      Estimate Std. Error t value Pr(>|t|)    
## (Intercept)         6.505e+00  2.777e-01  23.429  < 2e-16 ***
## tEngineSize         6.933e-01  2.871e-01   2.415 0.016654 *  
## tCylinders         -2.561e-02  9.576e-02  -0.267 0.789427    
## tHorsepower         3.457e-01  7.310e-02   4.730 4.29e-06 ***
## Weight              3.424e-04  4.222e-05   8.109 5.43e-14 ***
## Hybrid              4.253e-01  7.433e-02   5.723 3.88e-08 ***
## brandAudi           5.041e-02  6.119e-02   0.824 0.411016    
## brandBMW            1.544e-01  6.195e-02   2.493 0.013492 *  
## brandBuick         -2.086e-01  6.943e-02  -3.004 0.003012 ** 
## brandCadillac      -7.469e-02  7.806e-02  -0.957 0.339834    
## brandChevrolet     -3.182e-01  6.372e-02  -4.995 1.30e-06 ***
## brandChrvsler      -1.350e-01  1.246e-01  -1.083 0.279991    
## brandChrysler      -2.132e-01  6.261e-02  -3.405 0.000803 ***
## brandDodge         -3.145e-01  6.996e-02  -4.495 1.19e-05 ***
## brandFord          -3.489e-01  6.592e-02  -5.292 3.22e-07 ***
## brandHonda         -1.636e-01  6.498e-02  -2.517 0.012640 *  
## brandHyundai       -3.123e-01  6.669e-02  -4.683 5.27e-06 ***
## brandInfiniti      -1.416e-01  7.016e-02  -2.018 0.044906 *  
## brandJaguar         1.794e-01  6.663e-02   2.693 0.007706 ** 
## brandKia           -3.946e-01  7.091e-02  -5.566 8.51e-08 ***
## brandLexus          5.779e-02  7.024e-02   0.823 0.411629    
## brandLincoln       -8.570e-02  6.959e-02  -1.231 0.219645    
## brandMazda6        -2.536e-01  1.243e-01  -2.041 0.042558 *  
## brandMercedes-Benz  2.227e-01  6.007e-02   3.707 0.000273 ***
## brandMercury       -3.977e-01  7.387e-02  -5.384 2.07e-07 ***
## brandMini          -1.885e-02  9.729e-02  -0.194 0.846529    
## brandMitsubishi    -2.907e-01  9.607e-02  -3.026 0.002810 ** 
## brandNissan        -2.930e-01  6.781e-02  -4.321 2.46e-05 ***
## brandOldsmobile    -1.746e-01  9.785e-02  -1.785 0.075886 .  
## brandPontiac       -3.068e-01  7.559e-02  -4.059 7.11e-05 ***
## brandSaab           1.809e-01  7.187e-02   2.517 0.012622 *  
## brandSaturn        -3.678e-01  7.185e-02  -5.119 7.31e-07 ***
## brandScion         -1.872e-01  1.284e-01  -1.458 0.146538    
## brandSubaru        -1.701e-01  6.929e-02  -2.455 0.014965 *  
## brandSuzuki        -4.184e-01  7.475e-02  -5.597 7.28e-08 ***
## brandToyota        -2.488e-01  6.148e-02  -4.046 7.49e-05 ***
## brandVolkswagen    -6.937e-02  6.525e-02  -1.063 0.289033    
## brandVolvo          9.513e-02  6.471e-02   1.470 0.143127    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.1127 on 196 degrees of freedom
## Multiple R-squared:  0.9526, Adjusted R-squared:  0.9436 
## F-statistic: 106.4 on 37 and 196 DF,  p-value: < 2.2e-16

Increased the \(R^2_{adj}\) by 8 percentage points! Definitely worth it!

Problem 4

a

In my opinion, it’s a valid model! I see almost no pattern in residuals and the variance seems to be constant. No bad leverage points. High \(R^2_{adj}\).

There’s one thing to note here: It looks like that the data is grouped in someway since there are multiple different linear patterns in one scatter plot (See scatter plot between VTINV and HEAT in Figure 6.58 for instance). More investigation may help.

b

You may add quadratic terms, but I suggest you to try one of the approaches to see if there’s anything to learn from the data. I say it’s already good enough!

c

I see one problem here, if you watch Figure 6.58 carefully, you notice that there’s linear pattern between predictor variables which indicates multicollinearity! So why don’t you use the VIF score? Or any criteria which can detect multicollinearity?

Problem 5

Let’s solve this one quickly!

a

d = read.csv("pgatour2006.csv")
y = d$PrizeMoney
x1 = d$DrivingAccuracy
x2 = d$GIR
x3 = d$PuttingAverage
x4 = d$BirdieConversion
x5 = d$SandSaves
x6 = d$Scrambling
x7 = d$PuttsPerRound

Approach 1

ap1 = summary(powerTransform(cbind(x1,x2,x3,x4,x5,x6,x7)~1,data=d))
xtable(ap1$result)

xtable(ap1$tests)

Well, according to LR test, we can say that the predictor variables don’t need transformation!

invResPlot(lm(y~x1+x2+x3+x4+x5+x6+x7))

These results support this recommendation!

b

logy = log(y)
FullModel = lm(logy~x1+x2+x3+x4+x5+x6+x7)

pairs(cbind(logy,x1,x2,x3,x4,x5,x6,x7))

rs = rstandard(FullModel)

shapiro.test(rs)

## 
##  Shapiro-Wilk normality test
## 
## data:  rs
## W = 0.9881, p-value = 0.1005

par(mfrow=c(3,2),mar=c(5,4,0,0))
plot(x1,rs)
plot(x2,rs)
plot(x3,rs)
plot(x4,rs)
plot(x5,rs)
plot(x6,rs)

plot(x7,rs)

par(mfrow=c(2,2),mar=c(5,4,0,0))
plot(FullModel)

Almost everything looks good except the last plot which shows that some leverage points exist!

c

n = length(x1)
p = 7
ts = 2*(p+1)/n
ts

[1] 0.08163265

print('bad leverage points')

[1] “bad leverage points”

d2 = d[c(1,2)]
d2$ABS_SRES = abs(rstandard(FullModel))
d2$ABS_leverage = abs(hatvalues(FullModel))

xtable(d2[((abs(hatvalues(FullModel))>=ts) &
(abs(rstandard(FullModel))>=2)),])

xtable(d2[(abs(hatvalues(FullModel))>=ts),],caption = "'leverage points'")

xtable(d2[(abs(rstandard(FullModel))>=2),],caption = "outliers")

Tom Lehman is the imposter!

d

Outliers probably!

e

Don’t do it! You will cut your hand, kid!

Do it step by step since every time you remove a predictor, everything will change including p-values! Try stepwise regression!

Solutions to regression book by Sheather, chapter 6

Arshak Parsa

2025-01-11

Problem 1

Advertisement

Problem 2

Problem 3

a

b

c

d

e

f

Problem 4

a

b

c

Problem 5

a

b

c

d

e