Problem 1

Use scan() and dput() to read data!

xp = c(0, 2, 4, 6, 8, 12, 17, 22, 28, 34)
size = c(17, 33, 19, 25, 18, 60, 58, 31, 34, 19)
salary = c(101300, 111303, 98000, 124000, 128475, 117410, 115825, 134300,128066, 164700)

Let’s have a vision

plot(xp,salary)

Try a simple model

m = lm(salary~xp)
summary(m)
## 
## Call:
## lm(formula = salary ~ xp)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -14150  -9430  -1428   9712  14370 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 104352.9     5619.4  18.570 7.29e-08 ***
## xp            1352.3      325.7   4.152   0.0032 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 11320 on 8 degrees of freedom
## Multiple R-squared:  0.683,  Adjusted R-squared:  0.6434 
## F-statistic: 17.24 on 1 and 8 DF,  p-value: 0.0032
par(mfrow=c(2,2))
plot(m)

print("OLS Estimation :")

[1] “OLS Estimation :”

predict(m,data.frame(xp=c(6)))
   1 

112466.4

print("Error :")

[1] “Error :”

salary[4]-predict(m,data.frame(xp=c(6)))
   1 

11533.56

Now, let’s make a use of sample sizes

mw = lm(salary~xp,weights = size)
summary(mw)
## 
## Call:
## lm(formula = salary ~ xp, weights = size)
## 
## Weighted Residuals:
##    Min     1Q Median     3Q    Max 
## -67520 -40994   4937  51648  87516 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 104759.0     5752.2   18.21 8.49e-08 ***
## xp            1172.5      336.9    3.48  0.00832 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 57620 on 8 degrees of freedom
## Multiple R-squared:  0.6022, Adjusted R-squared:  0.5524 
## F-statistic: 12.11 on 1 and 8 DF,  p-value: 0.008323
par(mfrow=c(2,2))
plot(mw)

The outputs don’t look good, I prefer not to use WLS in this situation!

print("WLS Estimation :")
## [1] "WLS Estimation :"
predict(mw,data.frame(xp=c(6)))
##        1 
## 111793.8
print("Error :")
## [1] "Error :"
salary[4]-predict(mw,data.frame(xp=c(6)))
##        1 
## 12206.21

Let’s do it with bare hands!

xwm = sum(size*xp)/sum(size)
ywm = sum(size*salary)/sum(size)
sxw = sum(size*(xp-xwm)^2)
sxyw = sum(size*(xp-xwm)*(salary-ywm))
b1 = sxyw/sxw
c(ywm - xwm*b1,b1)
## [1] 104759.048   1172.456

Fun fact: If you use the inverse of the sizes as your weights, you get a better model!

mw = lm(salary~xp,weights = 1/size)
summary(mw)
## 
## Call:
## lm(formula = salary ~ xp, weights = 1/size)
## 
## Weighted Residuals:
##     Min      1Q  Median      3Q     Max 
## -3148.9 -1528.5  -576.5  1766.0  2924.2 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 103925.6     5232.1  19.863  4.3e-08 ***
## xp            1517.9      305.8   4.964   0.0011 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2243 on 8 degrees of freedom
## Multiple R-squared:  0.7549, Adjusted R-squared:  0.7242 
## F-statistic: 24.64 on 1 and 8 DF,  p-value: 0.001102
par(mfrow=c(2,2))
plot(mw)

print("Weird WLS Estimation :")
## [1] "Weird WLS Estimation :"
predict(mw,data.frame(xp=c(6)))
##      1 
## 113033
print("Error :")
## [1] "Error :"
salary[4]-predict(mw,data.frame(xp=c(6)))
##     1 
## 10967

I suspect that the problem is related to \(Y_i\) since it’s neither average nor median; it’s a third quantile, so \(n_i=w_i\) is not appropriate! Anyways, I don’t like this WLS method at all!

Problem 2

Easy! Let’s assume \(Var(e_i)=\sigma^2/w_i\)

We multiply the model equation by \(\sqrt{w_i}\)

\[\sqrt{w_i}y_i = \sqrt{w_i}\hat{\beta}x_i + \sqrt{w_i}e_i \quad where \quad \sqrt{w_i}e_i \sim N(0,\sigma^2) \]

\[SSE = \sum_{i=1}^n \hat{e_i}^2 = \sum_{i=1}^n (\sqrt{w_i}y_i - \hat{y_i})^2 = \sum_{i=1}^n (\sqrt{w_i}y_i - \sqrt{w_i}\hat{\beta}x_i)^2\]

\[\frac{dSSE}{d\hat{\beta}} = \sum_{i=1}^n -2x_iw_i(y_i - \hat{\beta}x_i) = 0 \]

\[\rightarrow \sum_{i=1}^n x_iw_i(y_i - \hat{\beta}x_i) = 0 \rightarrow \hat{\beta} = \frac{\sum_{i=1}^n x_iy_iw_i}{\sum_{i=1}^n x_i^2w_i}\]

Now, since \(Var(e_i|x_i)=x_i^2\sigma^2=\sigma^2/w_i\) implies \(w_i=x_i^{-2}\), we get \(\hat{\beta} = \frac{\sum_{i=1}^n x_i^{-1}y_i}{n}\)

Problem 3

a

According to the last subsection of this chapter (4.1.5), when \(Y_i\) is the average or the median of \(n_i\) observations, we shall use \(n_i\) as model weights. Since \(Y_i\) is “2006 median price per square foot”, it’s reasonable to choose \(n_i\) as model weights.

b

Well, just like the first problem, it looks like WLS can’t do magic! (Sorry Tibshirani!) The plot of standardized residuals indicates non-constant variance and tons of outliers!

c

Try out the approach 1 or 2! It’s the only magic left!

One thing to note: the distribution of x1i and x2i are highly skewed. Sounds familiar? A transformation may definitely help!