Blog Post Seven

2021-05-06

source(knitr::purl(
  here::here("_Data_ Tab", "load_and_clean_data.Rmd"), quiet=TRUE),
  echo = FALSE # Use echo=FALSE or omit it to avoid code output  
)

This week, we shifted our focus to a microeconomic analysis. We analysed variables contained in our second dataset: “Data_Extract_From_Health_Nutrition_and_Population_Statistics”.

This dataset contains microeconomic indicators such as mortality rates, school enrollment and unemployment. We chose the following variables for the preliminary analysis:

# keep only the micro covariates (plus year and country, needed for filtering)
micov <- c("Number of infant deaths",
           "Rural population (% of total population)",
           "School enrollment, primary (% gross)",
           "Unemployment, male (% of male labor force)",
           "GDP per capita (current US$)",
           "year",
           "Country Name")
# all_of() makes the selection by an external character vector explicit
micro_econi <- select(econi_wide, all_of(micov))

Here we decided to use different response variables, because GDP per capita correlated only weakly with these covariates when used as the response. This is plausible: microeconomic variables of this kind tend not to show visible multiplier effects on macroeconomic aggregates such as GDP.

We start with the heavily indebted poor countries (HIPC) aggregate. To find suitable covariates, we used backward selection with male unemployment as the response:

# regsubsets() comes from the leaps package
library(leaps)

#backward selection
micro_econi_hipc <- micro_econi %>% filter(`Country Name`=="Heavily indebted poor countries (HIPC)")

micro_econi_back<-micro_econi_hipc %>% 
  select(-c(year,`Country Name`))
models <- regsubsets(`Unemployment, male (% of male labor force)`~., 
                     data = micro_econi_back, 
                     nvmax = 3,
                     method="backward")
summary(models)$which

##   (Intercept) `Number of infant deaths`
## 1        TRUE                      TRUE
## 2        TRUE                      TRUE
## 3        TRUE                      TRUE
##   `Rural population (% of total population)`
## 1                                      FALSE
## 2                                      FALSE
## 3                                       TRUE
##   `School enrollment, primary (% gross)` `GDP per capita (current US$)`
## 1                                  FALSE                          FALSE
## 2                                  FALSE                           TRUE
## 3                                  FALSE                           TRUE
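
The table above lists the best model of each size, but it does not by itself say which size to prefer. One common way to decide is to compare the candidates on adjusted R-squared and BIC. The sketch below does this with base R on simulated data; the data frame `sim` and its columns `y`, `x1`, `x2` and `x3` are hypothetical stand-ins, not our World Bank series.

```r
# Simulated stand-in data (hypothetical): y depends on x1 only.
set.seed(1)
n   <- 29
sim <- data.frame(x1 = rnorm(n), x2 = rnorm(n), x3 = rnorm(n))
sim$y <- 1 + 2 * sim$x1 + rnorm(n, sd = 0.5)

# Candidate models of size 1, 2 and 3, nested as in a selection table.
candidates <- list(
  size1 = lm(y ~ x1, data = sim),
  size2 = lm(y ~ x1 + x2, data = sim),
  size3 = lm(y ~ x1 + x2 + x3, data = sim)
)

# Adjusted R-squared (higher is better) and BIC (lower is better).
comparison <- data.frame(
  adj_r2 = sapply(candidates, function(m) summary(m)$adj.r.squared),
  bic    = sapply(candidates, BIC)
)
print(comparison)

# Model size preferred by BIC.
best_size <- rownames(comparison)[which.min(comparison$bic)]
```

For a `regsubsets` fit, the same kind of comparison should be possible through `summary(models)$adjr2` and `summary(models)$bic`, which hold one value per model size.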

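The selection table shows which covariates enter the best model of each size: Number of Infant Deaths appears at every size, GDP per capita is added at size two, and Rural Population at size three. This suggests that a model with Number of Infant Deaths as the sole covariate may be adequate. We check this by fitting that model alongside a two-covariate model that also includes Rural Population, and comparing their diagnostics: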

micro_mod_1a<-lm(`Unemployment, male (% of male labor force)`~
                `Rural population (% of total population)` +
                 `Number of infant deaths`,
              data=micro_econi_hipc)

#final model:
micro_mod_1b<-lm(`Unemployment, male (% of male labor force)`~
                 `Number of infant deaths`,
              data=micro_econi_hipc)


#model diagnostics:
modelSummary1a <- summary(micro_mod_1a)
modelSummary1b <- summary(micro_mod_1b)
modelSummary1a

## 
## Call:
## lm(formula = `Unemployment, male (% of male labor force)` ~ `Rural population (% of total population)` + 
##     `Number of infant deaths`, data = micro_econi_hipc)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.19778 -0.06376  0.01488  0.07499  0.14476 
## 
## Coefficients:
##                                              Estimate Std. Error t value
## (Intercept)                                 2.470e+00  8.951e-01   2.759
## `Rural population (% of total population)` -3.040e-02  2.196e-02  -1.385
## `Number of infant deaths`                   2.804e-06  4.187e-07   6.696
##                                            Pr(>|t|)    
## (Intercept)                                  0.0105 *  
## `Rural population (% of total population)`   0.1779    
## `Number of infant deaths`                  4.19e-07 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.09477 on 26 degrees of freedom
## Multiple R-squared:  0.9443, Adjusted R-squared:   0.94 
## F-statistic: 220.4 on 2 and 26 DF,  p-value: < 2.2e-16

modelSummary1b

## 
## Call:
## lm(formula = `Unemployment, male (% of male labor force)` ~ `Number of infant deaths`, 
##     data = micro_econi_hipc)
## 
## Residuals:
##       Min        1Q    Median        3Q       Max 
## -0.200910 -0.066810  0.007432  0.076564  0.163667 
## 
## Coefficients:
##                            Estimate Std. Error t value Pr(>|t|)    
## (Intercept)               1.253e+00  1.734e-01   7.224 9.05e-08 ***
## `Number of infant deaths` 2.243e-06  1.089e-07  20.602  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.09637 on 27 degrees of freedom
## Multiple R-squared:  0.9402, Adjusted R-squared:  0.938 
## F-statistic: 424.5 on 1 and 27 DF,  p-value: < 2.2e-16

From this output, we concluded that the model with one covariate (Number of Infant Deaths) plus an intercept is appropriate for this data. In the first model, only Number of Infant Deaths is significant at the 1% level; Rural Population is not significant (p = 0.178), and the R-squared is approximately 0.94. In contrast, in the second model both the intercept and the covariate are significant at the 1% level, and the R-squared is nearly identical (0.9402 versus 0.9443).
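
A comparison between a smaller model and a larger one that nests it can also be made formally with a partial F-test. The sketch below uses base R's `anova()` on simulated data; `sim`, `y`, `x1` and `x2` are hypothetical names, not our series. When only one covariate is added, the F statistic equals the square of that covariate's t statistic, so the test gives the same p-value as the coefficient table (0.1779 for Rural Population in our first model).

```r
# Simulated stand-in data (hypothetical): x2 is irrelevant to y.
set.seed(2)
n   <- 29
sim <- data.frame(x1 = rnorm(n), x2 = rnorm(n))
sim$y <- 1 + 2 * sim$x1 + rnorm(n, sd = 0.5)

small <- lm(y ~ x1, data = sim)        # one covariate
large <- lm(y ~ x1 + x2, data = sim)   # adds a second covariate

# Partial F-test: does adding x2 significantly reduce the residual variance?
f_test <- anova(small, large)
print(f_test)

# With a single added term, F equals the squared t statistic of that term.
t_x2 <- summary(large)$coefficients["x2", "t value"]
f_x2 <- f_test$F[2]
```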

There are several possible reasons why the Rural Population covariate, despite its significant correlation with Unemployment, fails to be a good predictor. One of the main candidates is multicollinearity, which occurs when one predictor can be accurately predicted from the others. If Rural Population and Number of Infant Deaths move closely together, the second covariate carries little information beyond the first and its coefficient becomes non-significant. The output is consistent with this: the standard error on Number of Infant Deaths rises from 1.089e-07 to 4.187e-07 once Rural Population is added. The very low p-value of our final model indicates that the model as a whole is significant; it does not, by itself, rule out omitted variable bias.
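
Multicollinearity between two predictors can be checked directly with their correlation and a variance inflation factor (VIF). The following is a minimal sketch in base R on simulated data, where `x2` is deliberately constructed to track `x1`; all names are hypothetical and the numbers say nothing about our actual covariates.

```r
# Simulated stand-in data (hypothetical): x2 is built to track x1 closely.
set.seed(3)
n   <- 29
x1  <- rnorm(n)
x2  <- 0.95 * x1 + rnorm(n, sd = 0.2)
sim <- data.frame(x1 = x1, x2 = x2)
sim$y <- 1 + 2 * sim$x1 + rnorm(n, sd = 0.5)

# Pairwise correlation between the two predictors.
r12 <- cor(sim$x1, sim$x2)

# VIF for x1: 1 / (1 - R^2), with R^2 from regressing x1 on x2.
r2_x1  <- summary(lm(x1 ~ x2, data = sim))$r.squared
vif_x1 <- 1 / (1 - r2_x1)

# Standard error of the x1 coefficient without and with x2 in the model.
se_alone <- summary(lm(y ~ x1, data = sim))$coefficients["x1", "Std. Error"]
se_both  <- summary(lm(y ~ x1 + x2, data = sim))$coefficients["x1", "Std. Error"]

print(c(correlation = r12, vif = vif_x1, se_alone = se_alone, se_both = se_both))
```

A VIF well above 1 means the coefficient's variance is inflated by overlap with the other predictor, which shows up as a larger standard error once both are in the model.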

We now conduct a similar regression analysis for OECD members:

#backward selection
micro_econi_oecd <- micro_econi %>% filter(`Country Name`=="OECD members")

micro_econi_back2<-micro_econi_oecd %>% 
  select(-c(year,`Country Name`))

models2 <- regsubsets(`Unemployment, male (% of male labor force)`~., 
                      data = micro_econi_back2, 
                      nvmax = 3,
                      method="backward")
summary(models2)$which

##   (Intercept) `Number of infant deaths`
## 1        TRUE                     FALSE
## 2        TRUE                     FALSE
## 3        TRUE                     FALSE
##   `Rural population (% of total population)`
## 1                                      FALSE
## 2                                      FALSE
## 3                                       TRUE
##   `School enrollment, primary (% gross)` `GDP per capita (current US$)`
## 1                                   TRUE                          FALSE
## 2                                   TRUE                           TRUE
## 3                                   TRUE                           TRUE

micro_mod_2<-lm(`Unemployment, male (% of male labor force)`~
                  `School enrollment, primary (% gross)`,
              data=micro_econi_oecd)
modelSummary <- summary(micro_mod_2)
modelSummary

## 
## Call:
## lm(formula = `Unemployment, male (% of male labor force)` ~ `School enrollment, primary (% gross)`, 
##     data = micro_econi_oecd)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1.53916 -0.48295  0.04195  0.40863  1.70238 
## 
## Coefficients:
##                                        Estimate Std. Error t value Pr(>|t|)
## (Intercept)                            -25.4803    23.6736  -1.076    0.291
## `School enrollment, primary (% gross)`   0.3113     0.2287   1.361    0.185
## 
## Residual standard error: 0.8518 on 27 degrees of freedom
## Multiple R-squared:  0.0642, Adjusted R-squared:  0.02954 
## F-statistic: 1.852 on 1 and 27 DF,  p-value: 0.1848

After conducting backward selection, we found that a model with School Enrollment as the sole predictor was recommended. However, this model fits poorly. Firstly, neither the intercept nor the School Enrollment coefficient is significant (p = 0.291 and p = 0.185). Additionally, the p-value for the model as a whole is 0.1848, higher than any conventional significance level. The R-squared supports this pattern: at 0.0642, the model explains only about 6% of the variation in male unemployment. School Enrollment alone therefore does not explain unemployment among OECD members, and relevant predictors may be missing from the model.
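
The overall p-value and the R-squared in this summary are two views of the same F statistic. As a worked check in base R, using only the three numbers printed in the output (F = 1.852 on 1 and 27 degrees of freedom):

```r
# Values copied from the summary output for the School Enrollment model.
f_stat <- 1.852   # F-statistic
df1    <- 1       # numerator degrees of freedom
df2    <- 27      # denominator (residual) degrees of freedom

# R-squared implied by the F statistic: F*df1 / (F*df1 + df2).
r_squared <- (f_stat * df1) / (f_stat * df1 + df2)

# Overall p-value: upper tail of the F distribution.
p_value <- pf(f_stat, df1, df2, lower.tail = FALSE)

print(round(c(r_squared = r_squared, p_value = p_value), 4))
```

On a fitted model, the same quantities can be read from `summary(micro_mod_2)$r.squared` and `summary(micro_mod_2)$fstatistic`.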

Therefore, we decided to change the response variable for OECD members to Number of Infant Deaths.

models3 <- regsubsets(`Number of infant deaths`~., 
                      data = micro_econi_back2, 
                      nvmax = 3,
                      method="backward")
summary(models3)$which

##   (Intercept) `Rural population (% of total population)`
## 1        TRUE                                       TRUE
## 2        TRUE                                       TRUE
## 3        TRUE                                       TRUE
##   `School enrollment, primary (% gross)`
## 1                                  FALSE
## 2                                  FALSE
## 3                                   TRUE
##   `Unemployment, male (% of male labor force)` `GDP per capita (current US$)`
## 1                                        FALSE                          FALSE
## 2                                        FALSE                           TRUE
## 3                                        FALSE                           TRUE

We performed backward selection once again, this time with Number of Infant Deaths as the response variable. Rural Population is the covariate chosen for the one-covariate model, so we fit it along with an intercept:

#final model:
micro_mod_2b<-lm(`Number of infant deaths`~
                 `Rural population (% of total population)`,
              data=micro_econi_oecd)

#model diagnostics:
modelSummary2b <- summary(micro_mod_2b)
modelSummary2b

## 
## Call:
## lm(formula = `Number of infant deaths` ~ `Rural population (% of total population)`, 
##     data = micro_econi_oecd)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -18353  -8662   1456   6233  29519 
## 
## Coefficients:
##                                             Estimate Std. Error t value
## (Intercept)                                -378039.3    21821.2  -17.32
## `Rural population (% of total population)`   23864.5      956.2   24.96
##                                            Pr(>|t|)    
## (Intercept)                                3.75e-16 ***
## `Rural population (% of total population)`  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 12060 on 27 degrees of freedom
## Multiple R-squared:  0.9585, Adjusted R-squared:  0.9569 
## F-statistic: 622.9 on 1 and 27 DF,  p-value: < 2.2e-16

As expected, both the intercept and Rural Population are highly significant (p < 0.001), the overall F-test is significant (p < 2.2e-16), and the R-squared is about 0.96.
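
To give a sense of the precision of the slope, a 95% confidence interval can be computed from the estimate and standard error in the coefficient table. This is a small worked sketch in base R using only the printed values; it assumes the usual linear-model conditions hold.

```r
# Values copied from the coefficient table for Rural Population.
estimate  <- 23864.5   # slope on rural population share
std_error <- 956.2     # standard error of the slope
df_resid  <- 27        # residual degrees of freedom

# Two-sided 95% interval: estimate +/- t quantile * standard error.
t_crit <- qt(0.975, df_resid)
ci_95  <- estimate + c(-1, 1) * t_crit * std_error
print(ci_95)
```

The interval runs from roughly 21,900 to 25,800 and so excludes zero by a wide margin, in line with the t value of 24.96. On the fitted model, `confint(micro_mod_2b)` reports the same interval.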
