Well that is fairly easy! (I’m not sure why this exercise is here)
\[ Var(\hat{Y})=Var(HY)=H(Var(Y))H^\intercal=H\sigma^2IH^\intercal=\sigma^2H \]
Note that H is symmetric and idempotent.
Did you know that you can write your proofs in a proof assistant?! Yes, you heard that right, A PROOF ASSISTANT; A TRUE PATH TO SALVATION!
As you may already know, most people don’t care about foundations, but I do! Don’t you want to know your proof is based on what axioms? Whether your proof is logical? Don’t you want the machine to check your proofs? Don’t you want to find more proofs with the aid of AI? If you do, follow along…
I suggest you to use LEAN. It’s an open source proof assistant so it’s totally free! A lot of great theorems have been proven in lean like The Law of Large Numbers, BUT! you know what’s missing? The Cochran’s Theorem, Central Limit Theorem, concepts like sufficiency, completeness, etc.
Join us here on Zulip, we are building the future of FORMAL SCIENCE!
Did I mention the Holy Trinity of Formal Science?
OK, enough advertisement. Let’s head back to regression, shall we?
What the hell is a honeymoon effect?! Anyways, I think this is a time series problem since a lot of the measurements depend on past years and so the first six variables must be highly correlated which breaks the uncorrelated assumption (which is \(Cov(\epsilon_i,\epsilon_j)=0\) I guess!)! Another thing is that some of these variables are related (which leads to multicollinearity) like \(x_8\) and \(x_{11}\) If I understood correctly since the high quality stadiums charge you more money thus they earn more income :) (I don’t know what I’m talking about!)
Anyways, I don’t think multicollinearity is necessarily a problem as long as you have enough data!
What if my answer wasn’t too impressive? Would you still be so impressed? Anyways, I hope these questions be the last ones in his mind!
In my opinion, it’s not a valid model! Let me tell you why:
A transformation MIGHT help!
The threshold for a leverage point is \[hii>\frac{2(p+1)}{n} \] which is
2*(7+1)/234
## [1] 0.06837607
Roughly speaking, I can confirm that the point 223 is a bad leverage point!
I don’t like any of these models! Something is wrong with those bad leverage points!
You might remove …(NOOOOO, don’t peel the orange, the doc said you may cut your hand!)
\[F=\frac{(RSS(reduced) - RSS(full))/(df_R - df_F)}{MSE(full)}=\frac{(RSS(reduced) - RSS(full))/k}{RSS(full)/(n - p - 1)}\]
First, let’s examine the outputs of the book!
library(car)
## Loading required package: carData
d = read.csv("cars04.csv",header=TRUE)
attach(d)
tSuggestedRetailPrice <- log(SuggestedRetailPrice)
tEngineSize <- EngineSize^0.25
tCylinders <- log(Cylinders)
tHorsepower <- log(Horsepower)
tHighwayMPG <- 1/HighwayMPG
tWheelBase <- log(WheelBase)
FullModel <- lm(tSuggestedRetailPrice~tEngineSize+tCylinders
+tHorsepower+tHighwayMPG+Weight+tWheelBase+Hybrid)
summary(FullModel)
##
## Call:
## lm(formula = tSuggestedRetailPrice ~ tEngineSize + tCylinders +
## tHorsepower + tHighwayMPG + Weight + tWheelBase + Hybrid)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.42288 -0.10983 -0.00203 0.10279 0.70068
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 5.703e+00 2.010e+00 2.838 0.00496 **
## tEngineSize -1.575e+00 3.332e-01 -4.727 4.01e-06 ***
## tCylinders 2.335e-01 1.204e-01 1.940 0.05359 .
## tHorsepower 8.992e-01 8.876e-02 10.130 < 2e-16 ***
## tHighwayMPG 8.029e-01 4.758e+00 0.169 0.86614
## Weight 5.043e-04 6.367e-05 7.920 1.07e-13 ***
## tWheelBase -6.385e-02 4.715e-01 -0.135 0.89240
## Hybrid 6.422e-01 1.150e-01 5.582 6.78e-08 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.1789 on 226 degrees of freedom
## Multiple R-squared: 0.8621, Adjusted R-squared: 0.8578
## F-statistic: 201.8 on 7 and 226 DF, p-value: < 2.2e-16
ReducedModel <- lm(tSuggestedRetailPrice~tEngineSize+
tCylinders+tHorsepower+Weight+Hybrid)
summary(ReducedModel)
##
## Call:
## lm(formula = tSuggestedRetailPrice ~ tEngineSize + tCylinders +
## tHorsepower + Weight + Hybrid)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.42224 -0.11001 -0.00099 0.10191 0.70205
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 5.422e+00 3.291e-01 16.474 < 2e-16 ***
## tEngineSize -1.591e+00 3.157e-01 -5.041 9.45e-07 ***
## tCylinders 2.375e-01 1.186e-01 2.003 0.0463 *
## tHorsepower 9.049e-01 8.305e-02 10.896 < 2e-16 ***
## Weight 5.029e-04 5.203e-05 9.666 < 2e-16 ***
## Hybrid 6.340e-01 1.080e-01 5.870 1.53e-08 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.1781 on 228 degrees of freedom
## Multiple R-squared: 0.862, Adjusted R-squared: 0.859
## F-statistic: 284.9 on 5 and 228 DF, p-value: < 2.2e-16
#I also tried this method but I got the same output as above
#d = cbind(basicPower(d[-8],c(0,0.25,0,0,-1,1,0)),d$Hybrid)
#lm(d$`log(SuggestedRetailPrice)`~.,d)
There’s a problem though, the output of full transformed regression model doesn’t match up with the output in the book, so I’m going to use the outputs produced here.
# Manual partial F-test
SSE_F = sum(FullModel$residuals^2)
SSE_R = sum(ReducedModel$residuals^2)
df_F = FullModel$df.residual
df_R = ReducedModel$df.residual
F_stat = ((SSE_R - SSE_F)/(df_R-df_F))/(SSE_F/df_F)
F_stat
## [1] 0.03400598
print("p-value = ")
## [1] "p-value = "
pf(F_stat,df_R-df_F,df_F,lower.tail = FALSE)
## [1] 0.9665707
# Easy partial F-test
anova(ReducedModel,FullModel)
Given the p-value equals 0.9666, there is little, if any, evidence to support the alternative hypothesis.
This means that we are happy to adopt the reduced model.
You can use regex!
brand = as.character(regmatches(d$Vehicle.Name,gregexpr("^[^ ]*",d$Vehicle.Name)))
NewModel <- lm(tSuggestedRetailPrice~tEngineSize+
tCylinders+tHorsepower+Weight+Hybrid+brand)
summary(NewModel)
##
## Call:
## lm(formula = tSuggestedRetailPrice ~ tEngineSize + tCylinders +
## tHorsepower + Weight + Hybrid + brand)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.38766 -0.05957 -0.00215 0.06234 0.37516
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 6.505e+00 2.777e-01 23.429 < 2e-16 ***
## tEngineSize 6.933e-01 2.871e-01 2.415 0.016654 *
## tCylinders -2.561e-02 9.576e-02 -0.267 0.789427
## tHorsepower 3.457e-01 7.310e-02 4.730 4.29e-06 ***
## Weight 3.424e-04 4.222e-05 8.109 5.43e-14 ***
## Hybrid 4.253e-01 7.433e-02 5.723 3.88e-08 ***
## brandAudi 5.041e-02 6.119e-02 0.824 0.411016
## brandBMW 1.544e-01 6.195e-02 2.493 0.013492 *
## brandBuick -2.086e-01 6.943e-02 -3.004 0.003012 **
## brandCadillac -7.469e-02 7.806e-02 -0.957 0.339834
## brandChevrolet -3.182e-01 6.372e-02 -4.995 1.30e-06 ***
## brandChrvsler -1.350e-01 1.246e-01 -1.083 0.279991
## brandChrysler -2.132e-01 6.261e-02 -3.405 0.000803 ***
## brandDodge -3.145e-01 6.996e-02 -4.495 1.19e-05 ***
## brandFord -3.489e-01 6.592e-02 -5.292 3.22e-07 ***
## brandHonda -1.636e-01 6.498e-02 -2.517 0.012640 *
## brandHyundai -3.123e-01 6.669e-02 -4.683 5.27e-06 ***
## brandInfiniti -1.416e-01 7.016e-02 -2.018 0.044906 *
## brandJaguar 1.794e-01 6.663e-02 2.693 0.007706 **
## brandKia -3.946e-01 7.091e-02 -5.566 8.51e-08 ***
## brandLexus 5.779e-02 7.024e-02 0.823 0.411629
## brandLincoln -8.570e-02 6.959e-02 -1.231 0.219645
## brandMazda6 -2.536e-01 1.243e-01 -2.041 0.042558 *
## brandMercedes-Benz 2.227e-01 6.007e-02 3.707 0.000273 ***
## brandMercury -3.977e-01 7.387e-02 -5.384 2.07e-07 ***
## brandMini -1.885e-02 9.729e-02 -0.194 0.846529
## brandMitsubishi -2.907e-01 9.607e-02 -3.026 0.002810 **
## brandNissan -2.930e-01 6.781e-02 -4.321 2.46e-05 ***
## brandOldsmobile -1.746e-01 9.785e-02 -1.785 0.075886 .
## brandPontiac -3.068e-01 7.559e-02 -4.059 7.11e-05 ***
## brandSaab 1.809e-01 7.187e-02 2.517 0.012622 *
## brandSaturn -3.678e-01 7.185e-02 -5.119 7.31e-07 ***
## brandScion -1.872e-01 1.284e-01 -1.458 0.146538
## brandSubaru -1.701e-01 6.929e-02 -2.455 0.014965 *
## brandSuzuki -4.184e-01 7.475e-02 -5.597 7.28e-08 ***
## brandToyota -2.488e-01 6.148e-02 -4.046 7.49e-05 ***
## brandVolkswagen -6.937e-02 6.525e-02 -1.063 0.289033
## brandVolvo 9.513e-02 6.471e-02 1.470 0.143127
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.1127 on 196 degrees of freedom
## Multiple R-squared: 0.9526, Adjusted R-squared: 0.9436
## F-statistic: 106.4 on 37 and 196 DF, p-value: < 2.2e-16
Increased the \(R^2_{adj}\) by 8 percentage points! Definitely worth it!
In my opinion, it’s a valid model! I see almost no pattern in residuals and the variance seems to be constant. No bad leverage points. High \(R^2_{adj}\).
There’s one thing to note here: It looks like that the data is grouped in someway since there are multiple different linear patterns in one scatter plot (See scatter plot between VTINV and HEAT in Figure 6.58 for instance). More investigation may help.
You may add quadratic terms, but I suggest you to try one of the approaches to see if there’s anything to learn from the data. I say it’s already good enough!
I see one problem here, if you watch Figure 6.58 carefully, you notice that there’s linear pattern between predictor variables which indicates multicollinearity! So why don’t you use the VIF score? Or any criteria which can detect multicollinearity?
Let’s solve this one quickly!
d = read.csv("pgatour2006.csv")
y = d$PrizeMoney
x1 = d$DrivingAccuracy
x2 = d$GIR
x3 = d$PuttingAverage
x4 = d$BirdieConversion
x5 = d$SandSaves
x6 = d$Scrambling
x7 = d$PuttsPerRound
Approach 1
ap1 = summary(powerTransform(cbind(x1,x2,x3,x4,x5,x6,x7)~1,data=d))
xtable(ap1$result)
xtable(ap1$tests)
Well, according to LR test, we can say that the predictor variables don’t need transformation!
invResPlot(lm(y~x1+x2+x3+x4+x5+x6+x7))
These results support this recommendation!
logy = log(y)
FullModel = lm(logy~x1+x2+x3+x4+x5+x6+x7)
pairs(cbind(logy,x1,x2,x3,x4,x5,x6,x7))
rs = rstandard(FullModel)
shapiro.test(rs)
##
## Shapiro-Wilk normality test
##
## data: rs
## W = 0.9881, p-value = 0.1005
par(mfrow=c(3,2),mar=c(5,4,0,0))
plot(x1,rs)
plot(x2,rs)
plot(x3,rs)
plot(x4,rs)
plot(x5,rs)
plot(x6,rs)
plot(x7,rs)
par(mfrow=c(2,2),mar=c(5,4,0,0))
plot(FullModel)
Almost everything looks good except the last plot which shows that some leverage points exist!
n = length(x1)
p = 7
ts = 2*(p+1)/n
ts
[1] 0.08163265
print('bad leverage points')
[1] “bad leverage points”
d2 = d[c(1,2)]
d2$ABS_SRES = abs(rstandard(FullModel))
d2$ABS_leverage = abs(hatvalues(FullModel))
xtable(d2[((abs(hatvalues(FullModel))>=ts) &
(abs(rstandard(FullModel))>=2)),])
xtable(d2[(abs(hatvalues(FullModel))>=ts),],caption = "'leverage points'")
xtable(d2[(abs(rstandard(FullModel))>=2),],caption = "outliers")
Tom Lehman is the imposter!
Outliers probably!
Don’t do it! You will cut your hand, kid!
Do it step by step since every time you remove a predictor, everything will change including p-values! Try stepwise regression!