Use scan() and dput() to read data!
xp = c(0, 2, 4, 6, 8, 12, 17, 22, 28, 34)
size = c(17, 33, 19, 25, 18, 60, 58, 31, 34, 19)
salary = c(101300, 111303, 98000, 124000, 128475, 117410, 115825, 134300,128066, 164700)
Let’s have a vision
plot(xp,salary)
Try a simple model
m = lm(salary~xp)
summary(m)
##
## Call:
## lm(formula = salary ~ xp)
##
## Residuals:
## Min 1Q Median 3Q Max
## -14150 -9430 -1428 9712 14370
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 104352.9 5619.4 18.570 7.29e-08 ***
## xp 1352.3 325.7 4.152 0.0032 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 11320 on 8 degrees of freedom
## Multiple R-squared: 0.683, Adjusted R-squared: 0.6434
## F-statistic: 17.24 on 1 and 8 DF, p-value: 0.0032
par(mfrow=c(2,2))
plot(m)
print("OLS Estimation :")
[1] “OLS Estimation :”
predict(m,data.frame(xp=c(6)))
1
112466.4
print("Error :")
[1] “Error :”
salary[4]-predict(m,data.frame(xp=c(6)))
1
11533.56
Now, let’s make a use of sample sizes
mw = lm(salary~xp,weights = size)
summary(mw)
##
## Call:
## lm(formula = salary ~ xp, weights = size)
##
## Weighted Residuals:
## Min 1Q Median 3Q Max
## -67520 -40994 4937 51648 87516
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 104759.0 5752.2 18.21 8.49e-08 ***
## xp 1172.5 336.9 3.48 0.00832 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 57620 on 8 degrees of freedom
## Multiple R-squared: 0.6022, Adjusted R-squared: 0.5524
## F-statistic: 12.11 on 1 and 8 DF, p-value: 0.008323
par(mfrow=c(2,2))
plot(mw)
The outputs don’t look good, I prefer not to use WLS in this situation!
print("WLS Estimation :")
## [1] "WLS Estimation :"
predict(mw,data.frame(xp=c(6)))
## 1
## 111793.8
print("Error :")
## [1] "Error :"
salary[4]-predict(mw,data.frame(xp=c(6)))
## 1
## 12206.21
Let’s do it with bare hands!
xwm = sum(size*xp)/sum(size)
ywm = sum(size*salary)/sum(size)
sxw = sum(size*(xp-xwm)^2)
sxyw = sum(size*(xp-xwm)*(salary-ywm))
b1 = sxyw/sxw
c(ywm - xwm*b1,b1)
## [1] 104759.048 1172.456
Fun fact: If you use the inverse of the sizes as your weights, you get a better model!
mw = lm(salary~xp,weights = 1/size)
summary(mw)
##
## Call:
## lm(formula = salary ~ xp, weights = 1/size)
##
## Weighted Residuals:
## Min 1Q Median 3Q Max
## -3148.9 -1528.5 -576.5 1766.0 2924.2
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 103925.6 5232.1 19.863 4.3e-08 ***
## xp 1517.9 305.8 4.964 0.0011 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2243 on 8 degrees of freedom
## Multiple R-squared: 0.7549, Adjusted R-squared: 0.7242
## F-statistic: 24.64 on 1 and 8 DF, p-value: 0.001102
par(mfrow=c(2,2))
plot(mw)
print("Weird WLS Estimation :")
## [1] "Weird WLS Estimation :"
predict(mw,data.frame(xp=c(6)))
## 1
## 113033
print("Error :")
## [1] "Error :"
salary[4]-predict(mw,data.frame(xp=c(6)))
## 1
## 10967
I suspect that the problem is related to \(Y_i\) since it’s neither average nor median; it’s a third quantile, so \(n_i=w_i\) is not appropriate! Anyways, I don’t like this WLS method at all!
Easy! Let’s assume \(Var(e_i)=\sigma^2/w_i\)
We multiply the model equation by \(\sqrt{w_i}\)
\[\sqrt{w_i}y_i = \sqrt{w_i}\hat{\beta}x_i + \sqrt{w_i}e_i \quad where \quad \sqrt{w_i}e_i \sim N(0,\sigma^2) \]
\[SSE = \sum_{i=1}^n \hat{e_i}^2 = \sum_{i=1}^n (\sqrt{w_i}y_i - \hat{y_i})^2 = \sum_{i=1}^n (\sqrt{w_i}y_i - \sqrt{w_i}\hat{\beta}x_i)^2\]
\[\frac{dSSE}{d\hat{\beta}} = \sum_{i=1}^n -2x_iw_i(y_i - \hat{\beta}x_i) = 0 \]
\[\rightarrow \sum_{i=1}^n x_iw_i(y_i - \hat{\beta}x_i) = 0 \rightarrow \hat{\beta} = \frac{\sum_{i=1}^n x_iy_iw_i}{\sum_{i=1}^n x_i^2w_i}\]
Now, since \(Var(e_i|x_i)=x_i^2\sigma^2=\sigma^2/w_i\) implies \(w_i=x_i^{-2}\), we get \(\hat{\beta} = \frac{\sum_{i=1}^n x_i^{-1}y_i}{n}\)
According to the last subsection of this chapter (4.1.5), when \(Y_i\) is the average or the median of \(n_i\) observations, we shall use \(n_i\) as model weights. Since \(Y_i\) is “2006 median price per square foot”, it’s reasonable to choose \(n_i\) as model weights.
Well, just like the first problem, it looks like WLS can’t do magic! (Sorry Tibshirani!) The plot of standardized residuals indicates non-constant variance and tons of outliers!
Try out the approach 1 or 2! It’s the only magic left!
One thing to note: the distribution of x1i and
x2i are highly skewed. Sounds familiar? A
transformation may definitely help!