source(knitr::purl(
here::here("_Data_ Tab", "load_and_clean_data.Rmd"), quiet=TRUE),
echo = FALSE # Use echo=FALSE or omit it to avoid code output
)
This week, we shifted our focus to a microeconomic analysis. We analysed variables contained in our second dataset: “Data_Extract_From_Health_Nutrition_and_Population_Statistics”.
This dataset contains microeconomic variables such as mortality rates. We chose the following variables for the preliminary analysis:
# keep the micro covariates (plus GDP per capita, year, and country name)
micov <- c("Number of infant deaths",
           "Rural population (% of total population)",
           "School enrollment, primary (% gross)",
           "Unemployment, male (% of male labor force)",
           "GDP per capita (current US$)",
           "year",
           "Country Name")
# all_of() makes selection by an external character vector explicit
micro_econi <- select(econi_wide, all_of(micov))
Here we decided to use different response variables, because using GDP per capita as the response would not yield strong correlations with these covariates. In a microeconomic setting, changes in individual variables rarely carry through, via multiplier effects, to macroeconomic aggregates such as GDP.
We start with the HIPC countries. To find suitable covariates for male unemployment as the response, we used backward selection:
# backward selection
micro_econi_hipc <- micro_econi %>%
  filter(`Country Name` == "Heavily indebted poor countries (HIPC)")
micro_econi_back <- micro_econi_hipc %>%
  select(-c(year, `Country Name`))
models <- regsubsets(`Unemployment, male (% of male labor force)` ~ .,
                     data = micro_econi_back,
                     nvmax = 3,
                     method = "backward")
summary(models)$which
## (Intercept) `Number of infant deaths`
## 1 TRUE TRUE
## 2 TRUE TRUE
## 3 TRUE TRUE
## `Rural population (% of total population)`
## 1 FALSE
## 2 FALSE
## 3 TRUE
## `School enrollment, primary (% gross)` `GDP per capita (current US$)`
## 1 FALSE FALSE
## 2 FALSE TRUE
## 3 FALSE TRUE
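To make the selection step concrete, here is a minimal base-R sketch of backward elimination by residual sum of squares, which is, as we understand it, how `regsubsets(method = "backward")` moves from one model size to the next. It runs on the built-in `mtcars` data rather than our dataset, and the names `rss_of`, `remaining`, and `path` are illustrative helpers of our own, not part of `leaps`.

```r
# Toy data: built-in mtcars, response mpg, four candidate predictors.
dat <- mtcars[, c("mpg", "wt", "hp", "disp", "qsec")]
response <- "mpg"
remaining <- setdiff(names(dat), response)

# Residual sum of squares of an lm fit that uses the given predictors.
rss_of <- function(vars) {
  f <- reformulate(vars, response = response)
  sum(resid(lm(f, data = dat))^2)
}

# path[[k]] will hold the predictors kept in the model of size k.
path <- list()
while (length(remaining) > 1) {
  # Try dropping each predictor in turn; keep the drop that raises RSS least.
  rss_after_drop <- sapply(remaining, function(v) rss_of(setdiff(remaining, v)))
  drop <- names(which.min(rss_after_drop))
  remaining <- setdiff(remaining, drop)
  path[[length(remaining)]] <- remaining
}
path
```

Each element of `path` corresponds to one row of a `$which` matrix. Because variables are only ever removed, the models are nested: every smaller model is a subset of the next larger one.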
The output above lists the best model of each size; the best one-covariate model uses Number of Infant Deaths. To check whether one covariate is enough, we fit this model alongside a two-covariate model that adds Rural Population and compare their summaries:
micro_mod_1a <- lm(`Unemployment, male (% of male labor force)` ~
                     `Rural population (% of total population)` +
                     `Number of infant deaths`,
                   data = micro_econi_hipc)
# final model:
micro_mod_1b <- lm(`Unemployment, male (% of male labor force)` ~
                     `Number of infant deaths`,
                   data = micro_econi_hipc)
# model diagnostics:
modelSummary1a <- summary(micro_mod_1a)
modelSummary1b <- summary(micro_mod_1b)
modelSummary1a
##
## Call:
## lm(formula = `Unemployment, male (% of male labor force)` ~ `Rural population (% of total population)` +
## `Number of infant deaths`, data = micro_econi_hipc)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.19778 -0.06376 0.01488 0.07499 0.14476
##
## Coefficients:
## Estimate Std. Error t value
## (Intercept) 2.470e+00 8.951e-01 2.759
## `Rural population (% of total population)` -3.040e-02 2.196e-02 -1.385
## `Number of infant deaths` 2.804e-06 4.187e-07 6.696
## Pr(>|t|)
## (Intercept) 0.0105 *
## `Rural population (% of total population)` 0.1779
## `Number of infant deaths` 4.19e-07 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.09477 on 26 degrees of freedom
## Multiple R-squared: 0.9443, Adjusted R-squared: 0.94
## F-statistic: 220.4 on 2 and 26 DF, p-value: < 2.2e-16
modelSummary1b
##
## Call:
## lm(formula = `Unemployment, male (% of male labor force)` ~ `Number of infant deaths`,
## data = micro_econi_hipc)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.200910 -0.066810 0.007432 0.076564 0.163667
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.253e+00 1.734e-01 7.224 9.05e-08 ***
## `Number of infant deaths` 2.243e-06 1.089e-07 20.602 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.09637 on 27 degrees of freedom
## Multiple R-squared: 0.9402, Adjusted R-squared: 0.938
## F-statistic: 424.5 on 1 and 27 DF, p-value: < 2.2e-16
From this output, we concluded that the model with one covariate (Number of Infant Deaths) plus an intercept is appropriate for this data. In the first model, only one of the two covariates (Number of Infant Deaths) is significant at the 1% level, and the R-squared value is approximately 0.94. In contrast, in the second model both the intercept and the covariate are significant at the 0.1% level, and the R-squared value is almost identical (0.9402 versus 0.9443).
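A comparison of two nested models like this can also be made formally. The sketch below, run on the built-in `mtcars` data rather than our HIPC series, uses a partial F-test via `anova()` and compares adjusted R-squared values; the object names `small`, `big`, `cmp`, `p_extra`, and `adj` are illustrative only.

```r
# Toy data (built-in mtcars); not the HIPC series used in this post.
small <- lm(mpg ~ wt, data = mtcars)         # one covariate
big   <- lm(mpg ~ wt + qsec, data = mtcars)  # adds a second covariate

# Partial F-test: does the extra covariate significantly reduce residual variance?
cmp <- anova(small, big)
p_extra <- cmp[["Pr(>F)"]][2]

# Adjusted R-squared penalises the extra parameter, so it is comparable across sizes.
adj <- c(small = summary(small)$adj.r.squared,
         big   = summary(big)$adj.r.squared)

p_extra
adj
```

When the larger model adds a single covariate, this F-test is equivalent to the t-test on that covariate's coefficient. In our case that is the p-value of 0.1779 reported for Rural Population, while the adjusted R-squared barely moves (0.94 versus 0.938).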
There are several possible reasons why the Rural Population covariate, despite
being significantly correlated with Unemployment, fails to be a good predictor.
One of the main reasons might be multicollinearity, which occurs when one
predictor variable can be accurately predicted from the others. Number of
Infant Deaths may be closely related to the other covariates, so once it is in
the model the others add little information and appear non-significant. The
very low p-value of our final model indicates that the model as a whole is
significant, although a low p-value alone does not rule out omitted variable bias.
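One common way to check for multicollinearity is the variance inflation factor (VIF): regress each predictor on the remaining predictors and compute 1 / (1 - R²). The sketch below uses simulated data in which `x2` is deliberately built from `x1`; the helper `vif_of` and all variable names are hypothetical and are not taken from our analysis.

```r
# Simulated data: x2 is constructed to be strongly correlated with x1.
set.seed(1)
n  <- 29
x1 <- rnorm(n)
x2 <- 0.95 * x1 + rnorm(n, sd = 0.2)
y  <- 1 + 2 * x1 + rnorm(n)
sim <- data.frame(y, x1, x2)

# VIF for one predictor: regress it on the other predictors, then 1 / (1 - R^2).
vif_of <- function(target, data, predictors) {
  others <- setdiff(predictors, target)
  r2 <- summary(lm(reformulate(others, response = target), data = data))$r.squared
  1 / (1 - r2)
}

preds <- c("x1", "x2")
vifs <- sapply(preds, vif_of, data = sim, predictors = preds)
vifs
```

A VIF close to 1 means the predictor is nearly uncorrelated with the others; values above roughly 5 to 10 are often read as a warning sign, although that threshold is only a rule of thumb. With exactly two predictors the two VIFs are equal, because both reduce to 1 / (1 - r²) for their pairwise correlation r.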
We now conduct a similar regression analysis for OECD members:
# backward selection
micro_econi_oecd <- micro_econi %>%
  filter(`Country Name` == "OECD members")
micro_econi_back2 <- micro_econi_oecd %>%
  select(-c(year, `Country Name`))
models2 <- regsubsets(`Unemployment, male (% of male labor force)` ~ .,
                      data = micro_econi_back2,
                      nvmax = 3,
                      method = "backward")
summary(models2)$which
## (Intercept) `Number of infant deaths`
## 1 TRUE FALSE
## 2 TRUE FALSE
## 3 TRUE FALSE
## `Rural population (% of total population)`
## 1 FALSE
## 2 FALSE
## 3 TRUE
## `School enrollment, primary (% gross)` `GDP per capita (current US$)`
## 1 TRUE FALSE
## 2 TRUE TRUE
## 3 TRUE TRUE
micro_mod_2 <- lm(`Unemployment, male (% of male labor force)` ~
                    `School enrollment, primary (% gross)`,
                  data = micro_econi_oecd)
modelSummary <- summary(micro_mod_2)
modelSummary
##
## Call:
## lm(formula = `Unemployment, male (% of male labor force)` ~ `School enrollment, primary (% gross)`,
## data = micro_econi_oecd)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.53916 -0.48295 0.04195 0.40863 1.70238
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -25.4803 23.6736 -1.076 0.291
## `School enrollment, primary (% gross)` 0.3113 0.2287 1.361 0.185
##
## Residual standard error: 0.8518 on 27 degrees of freedom
## Multiple R-squared: 0.0642, Adjusted R-squared: 0.02954
## F-statistic: 1.852 on 1 and 27 DF, p-value: 0.1848
After conducting backward selection, we found that the best one-covariate model uses School Enrollment as the sole predictor. However, this model fits poorly. Firstly, neither the intercept nor the covariate is significant, which suggests that the model is misspecified, possibly because relevant variables are omitted. Additionally, the p-value for the model is 0.1848, above any conventional significance level. The R-squared supports this, at only 0.0642.
Therefore, we decided to use a different response variable, Number of Infant Deaths, for OECD members.
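The fit statistics quoted here can also be extracted from a model summary programmatically, which is convenient when screening several candidate models. This is a minimal sketch on the built-in `mtcars` data; `fit`, `s`, `r2`, and `p_model` are illustrative names. Note that `summary.lm` stores the F statistic and its degrees of freedom but not the overall p-value, so the p-value is recomputed with `pf()`.

```r
fit <- lm(mpg ~ qsec, data = mtcars)  # toy single-predictor model
s <- summary(fit)

r2 <- s$r.squared

# s$fstatistic is a named vector: value, numdf, dendf.
f <- s$fstatistic
p_model <- unname(pf(f["value"], f["numdf"], f["dendf"], lower.tail = FALSE))

c(r_squared = r2, model_p_value = p_model)
```

With a single predictor, the overall model p-value equals the p-value of the slope's t-test, which is why the OECD summary reports 0.185 for School Enrollment and 0.1848 for the model.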
models3 <- regsubsets(`Number of infant deaths` ~ .,
                      data = micro_econi_back2,
                      nvmax = 3,
                      method = "backward")
summary(models3)$which
## (Intercept) `Rural population (% of total population)`
## 1 TRUE TRUE
## 2 TRUE TRUE
## 3 TRUE TRUE
## `School enrollment, primary (% gross)`
## 1 FALSE
## 2 FALSE
## 3 TRUE
## `Unemployment, male (% of male labor force)` `GDP per capita (current US$)`
## 1 FALSE FALSE
## 2 FALSE TRUE
## 3 FALSE TRUE
We performed backward selection once again, this time with Infant Deaths as the response variable, and found that Rural Population was chosen as the sole covariate of the best one-covariate model.
# final model:
micro_mod_2b <- lm(`Number of infant deaths` ~
                     `Rural population (% of total population)`,
                   data = micro_econi_oecd)
# model diagnostics:
modelSummary2b <- summary(micro_mod_2b)
modelSummary2b
##
## Call:
## lm(formula = `Number of infant deaths` ~ `Rural population (% of total population)`,
## data = micro_econi_oecd)
##
## Residuals:
## Min 1Q Median 3Q Max
## -18353 -8662 1456 6233 29519
##
## Coefficients:
## Estimate Std. Error t value
## (Intercept) -378039.3 21821.2 -17.32
## `Rural population (% of total population)` 23864.5 956.2 24.96
## Pr(>|t|)
## (Intercept) 3.75e-16 ***
## `Rural population (% of total population)` < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 12060 on 27 degrees of freedom
## Multiple R-squared: 0.9585, Adjusted R-squared: 0.9569
## F-statistic: 622.9 on 1 and 27 DF, p-value: < 2.2e-16
As expected, both the intercept and the Rural Population coefficient are significant at the 0.1% level, the overall model p-value is below 2.2e-16, and the R-squared is 0.9585.