Data

Featured Image

Data Source – How can you find it?

The five of us came together with one common curiosity: what makes Developing Countries ‘tick’? What are the main drivers of growth in many low-income and underprivileged regions? In economics, Gross Domestic Product (GDP), a metric for the value of all final goods and services produced within a country’s borders in a given period of time, is widely used as an important measure of growth and development. But we wanted to investigate, what is it that makes GDP itself fluctuate?

Given that most of our lines of enquiries were focused on econometrics, we decided to search for a dataset on the World Bank website, a highly reputed database which collects an exhaustive and detailed list of development metrics for a range of countries. Specifically, we decided to utilize their widely used “World Development Index (WDI)” dataset. This is the World Bank’s primary collection of developmental measures, updated every quarter with data from 1960 up to 2020. The link to this dataset can be found here: https://datacatalog.worldbank.org/dataset/world-development-indicators

Given the depth and breadth of this dataset, we decided to use this as ou rprimary source. The URL to find this dataset can he found here. We imported the data as an excel file. This is a large dataset and therefore, as mentioned in our load_and_clean_data.Rmd file, we have tried to condense the information gained through this dataset into the ones relevant for our analysis.

What Data Did We Use?

Our questions didn’t just involve metrics on developing countries; we also wanted to compare the impact of different indicators on development in low-income and high-income countries. Our original plan was to create a basket of countries which would represent developed and developing regions in the world. However, we realised that creating such a basket would invite bias into our analysis, since geographic and economic conditions in neighbouring countries have proven to be correlated in the past. Additionally, we decided to use the dataset picked to our advantage.Contained within the dataset were two different sets of countries:

  1. Heavily Indebted Poor Countries (HIPC): a collection of 39 developing countries with histories of high indebtedness and poverty

  2. Organisation for Economic Co-Operation and Development (OECD): an organisation with 37 members promoting social policies with the goal of increasing welfare

We decided to use these two sets of countries as our areas of focus.

After much discussion on economic theory and past studies, we decided to use the following covariates in our analysis:

  1. GDP per capita - Our main response variable and a measure of output per individual in a country

  2. Gross Domestic Savings (current USD) - the total amount saved after consumption has been subtracted from total GDP. This metric gives us an indication of saving levels in the countries. Many developing countries rely on low savings rate and high investment levels, leading to a budget deficit. A lot of the investment made into a developing country is in the form of debt which, with low savings rates, is hard to repay forany developing country. Debt crises lead to rampant bankruptcy levels and overall hampers development. Therefore, savings rate proves to be an important metric in our analysis.

  3. GDP growth - GDP levels become harder to compare across countries with different income levels (high vs. low GDP levels, on different scales). GDP growth helps make comparable conclusions of GDP fluctuations in the two sets of countries.

  4. Foreign Direct Investment (FDI) - As mentioned before, many developing countries rely on investment from developed countries. A lot of this investment is in the form of FDI, which is made by firms with the intention of growing business in a particular location. High FDI levels traditionally suggest growth and stable economic conditions.

  5. Unemployment- An essential statistic for measuring the output of labour. High levels of unemployment are common traits of developing countries and creates lower participation in the labour force, leading to lower levels of output.

Merging Datasets

After reviewing this list, we realised these covariates were not enough to provide a holistic view of development indicators. We realised that with these metrics, we were taking a macroeconomic view which may not be appropriate for developing countries, which have multiple microeconomic variables which impact the daily lives of individuals. We realised that it would not be fair to compare low- and high-income countries based on the same statistics.

Therefore, we decided to incorporate a new dataset, one which focused onmicroeconomic developmental indicators. This dataset originates from the same source (World Bank). The dataset can be found in the “Data” folder titled “Data_Extract_From_Health_Nutrition_and_Population_Statistics”.

Similar to the first dataset, we decided to focus on the HIPC and OECD countries. We also chose new statistics from this dataset to use in our analysis:

  1. Number of Infant Deaths:- In a low-income setting, such a metric provides us insights into living conditions and how conducive environments are to the development of children health.

  2. People using at least basic drinking water services (% of population)- Water, an essential resource needed for survival, is scarce is many low-income settings. This causes many health and wellbeing problems for individuals, all impacting productivity levels.

  3. Mortality rate, under-5 (per 1,000)- Similar to the “Number of Infant Deaths” metrics, with a few additional years’ worth of data

  4. GNI per capita, Atlas method (current US$)- GNI is a similar measure to GDP. What makes this metric important is the method of collection, the Atlas Method. This method incorporates inflation into the statistic. This not only gives us insight on output levels in the economy, but also on how output is impacted by inflation and rising price levels.

  5. Rural population (% of total population)- Rural populations make up a high share of overall population in developing countries. Individuals in these communities lack access to services provided in urban environments, such as educational facilities or other quality of life enhancements, and therefore provide low human capital. Due to their lack of specialised skills, individuals become part of low-income jobs and eventually create poverty trap of low-skilled jobs and low wage rates, potentially hampering development.

  6. School enrollment, primary (% gross) -Education is an important development metric. Access to proper educational facilities leads to a skilled labour-force, eventually contributing to higher output and development levels. The enrolment in these facilities, therefore, can prove to be a key determinant of development.

R File Containing Data Cleaning Steps

The steps taken to clean and merge these datasets are shown in detail in the load_and_clean_data.r. We used the following packages:

  1. tidyverse
  2. readxl
  3. lubridate
  4. GGally
  5. corrplot
  6. leaps
  7. modelr

Apart from the tidyverse package that we have discussed in class, we used readxl. It was important as our data was in a .xlsx file and so this package was important for the following data loading step. Lubridate was a package required for formatting of datetime values. GGally and corrplot were essential for some visualizations (particularly the scatterplot matrices). Leaps and modelr were required for the modelling steps.

First Dataset:

In this step, we loaded our primary dataset into R. This is an exhaustive dataset, consisting of 378,576 rows and 65 columns. First and foremost, it was important to clean this data to select relevant information.

Data Wrangling (first dataset):

As a group, we picked a list of econometrics which we believed would allow us to perform in-depth analysis. Our primary objective was to compare these metrics in developed and developing countries. Conveniently, this dataset contained 2 values for the “Country Name” column which were: “Heavily indebted poor countries (HIPC)” and “OECD members”. Instead of aggregating statistics from a group of countries, we decided to use these previously created groups.

countries <-c("Heavily indebted poor countries (HIPC)","OECD members")

Next, we filtered the dataset to include only the metrics important to our study.

econ_var<-c("GDP per capita (current US$)",
            "Gross domestic savings (current US$)",
            "Government expenditure on education, total (% of GDP)",
            "Profit tax (% of commercial profits)",
            "GDP growth (annual %)",
            "Foreign direct investment, net inflows (BoP, current US$)",
            "Foreign direct investment, net outflows (BoP, current US$)",
            "Unemployment, female (% of female labor force) 
            (modeled ILO estimate)",
            "Unemployment, male (% of male labor force) (modeled ILO estimate)")


WDIEXCEL<-WDIEXCEL %>% 
  filter(`Indicator Name` %in% econ_var,`Country Name` %in% countries)

Second Dataset:

Our first dataset focused primarily on macroeconomic data pertaining to matters of Gross Domestic Product (GDP), Foreign Direct Investment (FDI) and Unemployment among others.

In order to gain a more holistic view of the different circumstances in the different sets fo countries, we decided to incorporate a dataset that focuses on individual health; matters such as: Mortality Rate, Education and others.

We first imported the .xlsx file and similar to the first dataset, it was long and exhaustive. We once again chose a list of covariates on which we could filter the dataset.

health_var<-c("Number of infant deaths",
              "People using at least basic drinking water services 
              (% of population)",
              "Mortality rate, under-5 (per 1,000)",
              "GNI per capita, Atlas method (current US$)",
              "Rural population (% of total population)",
              "School enrollment, primary (% gross)",
              "Unemployment, male (% of male labor force)")

Health_data<-Health_data %>% filter(`Series Name` %in% health_var)

Unlike the first dataset, the columns in this dataset were not easily accessible. They were in a format which made accessing yearly data difficult. In order to solve this, we used the column names from the first dataset (since the timespan for both datasets was identical) and copied it.

colnames(Health_data)[5:65]<-c(colnames(WDIEXCEL)[5:65])

To make the 1st and 2nd dataset more accessible, we decided to use pivot_longer and create a column for the years through which the dataset spans. We first gathered all column names corresponding to the years (columns 5 to 65) and then used the pivot_longer function:

years<-c(colnames(WDIEXCEL)[5:65])

WDIEXCEL<- WDIEXCEL %>% 
  pivot_longer(years,names_to="year",values_to="count")
Health_data<-Health_data %>% 
  pivot_longer(years,names_to="year",values_to="count")

Furthermore, to make the process of combining both datasets simpler, we made sure all identical columns had identical names.

colnames(Health_data)[which(names(Health_data)=="Series Name")] <- 
  "Indicator Name"
colnames(Health_data)[which(names(Health_data)=="Series Code")] <- 
  "Indicator Code"

Combining datasets:

Finally, we wanted to combine both datasets.Since we wanted to keep all metrics in both datasets, we used ‘rbind’.

econi<-rbind(WDIEXCEL,Health_data)
econi<-econi %>% 
  group_by(`year`) %>% 
  summarize(`Indicator Name`,`Country Name`,count) %>% 
  filter(!(year=="2020"))