Project 3: GDP per capita and life expectancy

by Michel Wermelinger and Dave Smith, 15 November 2015

This is an amended project notebook for Week 3 of The Open University's Learn to code for Data Analysis course.

Richer countries can afford to invest more in healthcare, in work and road safety, and in other measures that reduce mortality. On the other hand, richer countries may have less healthy lifestyles. Is there any relation between the wealth of a country and the life expectancy of its inhabitants?

The following analysis checks whether there is any correlation between the per capita gross domestic product (GDP) of a country in 2013 and the life expectancy of people born in that country in 2013.

Getting the data

Two World Bank datasets are considered. One dataset, available at http://data.worldbank.org/indicator/NY.GDP.PCAP.PP.CD, lists the per capita GDP of the world's countries in current US dollars, for various years. Strictly speaking, this indicator is adjusted for purchasing power parity and expressed in current international dollars, which this analysis treats as US dollars. The use of a common currency allows values to be compared across countries. The other dataset, available at http://data.worldbank.org/indicator/SP.DYN.LE00.IN, lists the life expectancy of the world's countries.

The datasets are downloaded directly, using the unique indicator name given in the URL.

In [23]:
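from pandas import *
# pandas.io.wb has been removed from pandas; download() is now provided
# by the separate pandas-datareader package.
from pandas_datareader.wb import download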

YEAR = 2013
PER_CAPITA_INDICATOR = 'NY.GDP.PCAP.PP.CD'
perCapita = download(indicator=PER_CAPITA_INDICATOR, country='all', start=YEAR, end=YEAR)
LIFE_INDICATOR = 'SP.DYN.LE00.IN'
life = download(indicator=LIFE_INDICATOR, country='all', start=YEAR, end=YEAR)

Cleaning the data

Inspecting the data with head() and tail() shows that:

  1. country names are the row indices, not column values;
  2. the first 34 rows are aggregated data, for the Arab World, the Caribbean small states, and other country groups used by the World Bank;
  3. GDP per capita and life expectancy values are missing for some countries.

The data is therefore cleaned by:

  1. transforming the dataframe index into columns and creating a new index 0, 1, 2, etc.;
  2. removing the first 34 rows;
  3. removing rows with unavailable values.
In [24]:
perCapita = perCapita.reset_index()[34:].dropna()
life = life.reset_index()[34:].dropna()
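
The three cleaning steps can be tried on a miniature table first. The sketch below is only an illustration: the values are invented, the table has 2 aggregate rows instead of 34, and the names toy, AGGREGATE_ROWS and cleaned are not part of the original notebook.

```python
import pandas as pd

# Invented miniature of the World Bank layout: country and year form the
# row index, the first rows are aggregates, and one value is missing.
toy = pd.DataFrame(
    {'SP.DYN.LE00.IN': [70.6, 72.1, 60.9, None, 77.5]},
    index=pd.MultiIndex.from_tuples(
        [('Arab World', '2013'), ('Caribbean small states', '2013'),
         ('Afghanistan', '2013'), ('American Samoa', '2013'),
         ('Albania', '2013')],
        names=['country', 'year']))

AGGREGATE_ROWS = 2  # 34 in the real data; 2 in this toy table

# 1. turn the index into columns, 2. skip the aggregate rows,
# 3. drop rows with missing values
cleaned = toy.reset_index()[AGGREGATE_ROWS:].dropna()
print(cleaned)
```

Dropping rows does not renumber the index, so the cleaned table keeps gaps in its row labels. The same effect explains why the real tables jump from row 36 to row 39.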

Transforming the data

The World Bank reports GDP per capita in US dollars and cents. Here, the value is converted to British pounds (the author's local currency) with the following auxiliary function, using the average 2013 dollar-to-pound conversion rate provided by http://www.ukforex.co.uk/forex-tools/historical-rate-tools/yearly-average-rates.

In [25]:
def usdToGBP (usd):
    return usd / 1.564768

PER_CAPITA = 'GDP per capita (£)'
perCapita[PER_CAPITA] = perCapita[PER_CAPITA_INDICATOR].apply(usdToGBP).apply(round)
perCapita.head()
Out[25]:
                country  year  NY.GDP.PCAP.PP.CD  GDP per capita (£)
34          Afghanistan  2013        1937.855965                1238
35              Albania  2013        9925.906623                6343
36              Algeria  2013       13676.471654                8740
39               Angola  2013        7083.903435                4527
40  Antigua and Barbuda  2013       21027.397641               13438
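
As a quick check of the arithmetic, the conversion can be reproduced by hand for a single row. This is a minimal sketch that reuses the rate and the Afghanistan and Albania figures from the table; the names USD_PER_GBP and usd_to_gbp are illustrative and not taken from the notebook.

```python
USD_PER_GBP = 1.564768  # average 2013 rate used in the notebook

def usd_to_gbp(usd):
    """Convert an amount in US dollars to British pounds."""
    return usd / USD_PER_GBP

afghanistan_usd = 1937.855965
albania_usd = 9925.906623

# Dividing by the rate and rounding to the nearest pound
print(round(usd_to_gbp(afghanistan_usd)))  # 1238, as in the table
print(round(usd_to_gbp(albania_usd)))      # 6343, as in the table
```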

The unnecessary columns can be dropped.

In [26]:
COUNTRY = 'country'
headings = [COUNTRY, PER_CAPITA]
perCapita = perCapita[headings]
perCapita.head()
Out[26]:
                country  GDP per capita (£)
34          Afghanistan                1238
35              Albania                6343
36              Algeria                8740
39               Angola                4527
40  Antigua and Barbuda               13438

The World Bank reports the life expectancy with several decimal places. After rounding, the original column is discarded.

In [27]:
LIFE = 'Life expectancy (years)'
life[LIFE] = life[LIFE_INDICATOR].apply(round)
headings = [COUNTRY, LIFE]
life = life[headings]
life.head()
Out[27]:
                country  Life expectancy (years)
34          Afghanistan                       61
35              Albania                       78
36              Algeria                       71
39               Angola                       52
40  Antigua and Barbuda                       76

Combining the data

The tables are combined through an inner join on the common 'country' column.

In [28]:
perCapitaVsLife = merge(perCapita, life, on=COUNTRY, how='inner')
perCapitaVsLife.head()
Out[28]:
               country  GDP per capita (£)  Life expectancy (years)
0          Afghanistan                1238                       61
1              Albania                6343                       78
2              Algeria                8740                       71
3               Angola                4527                       52
4  Antigua and Barbuda               13438                       76
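
An inner join keeps only the countries that appear in both tables. The sketch below illustrates this on two small tables built from the figures above; removing Angola from one table and Afghanistan from the other is a hypothetical scenario chosen for the example, and the variable names are not from the notebook.

```python
import pandas as pd

# Afghanistan has no life expectancy row and Angola has no GDP row
# (a made-up situation, to show what the inner join discards).
gdp = pd.DataFrame({
    'country': ['Afghanistan', 'Albania', 'Algeria'],
    'GDP per capita (£)': [1238, 6343, 8740]})
life_exp = pd.DataFrame({
    'country': ['Albania', 'Algeria', 'Angola'],
    'Life expectancy (years)': [78, 71, 52]})

joined = pd.merge(gdp, life_exp, on='country', how='inner')
print(joined)
```

Only Albania and Algeria survive the join, and the result gets a fresh index starting at 0.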

Calculating the correlation

To measure whether life expectancy and GDP per capita grow together, the Spearman rank correlation coefficient is used. It is a number from -1 (perfect inverse rank correlation: if one indicator increases, the other decreases) to 1 (perfect direct rank correlation: if one indicator increases, so does the other), with 0 meaning there is no rank correlation. Even a perfect correlation doesn't imply any cause-effect relation between the two indicators. A p-value below 0.05 means the correlation is statistically significant at the conventional 5% level.

In [29]:
from scipy.stats import spearmanr

(correlation, pValue) = spearmanr(perCapitaVsLife[PER_CAPITA], perCapitaVsLife[LIFE])
print('The correlation is', correlation)
print('The p-value is', pValue)
if pValue < 0.05:
    print('It is statistically significant.')
else:
    print('It is not statistically significant.')
The correlation is 0.835289947089
The p-value is 3.69070998447e-49
It is statistically significant.

The coefficient of about 0.84 shows a strong direct correlation, i.e. richer populations tend to have longer life expectancy.
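
To see what a rank correlation measures, the coefficient can be computed by hand for the five countries at the top of the combined table. This is a simplified sketch, not the method used by spearmanr: it relies on the textbook formula 1 - 6 Σd² / (n(n² - 1)), which is only valid when there are no tied values, and the helper names are illustrative.

```python
def ranks(values):
    """Return the rank of each value (1 = smallest); assumes no ties."""
    ordered = sorted(values)
    return [ordered.index(v) + 1 for v in values]

def spearman_no_ties(xs, ys):
    """Spearman coefficient from squared rank differences (no ties)."""
    n = len(xs)
    d_squared = sum((rx - ry) ** 2
                    for rx, ry in zip(ranks(xs), ranks(ys)))
    return 1 - 6 * d_squared / (n * (n ** 2 - 1))

# Afghanistan, Albania, Algeria, Angola, Antigua and Barbuda
gdp = [1238, 6343, 8740, 4527, 13438]
life_years = [61, 78, 71, 52, 76]

rho = spearman_no_ties(gdp, life_years)
print(rho)  # approximately 0.6
```

Only the ordering of the values matters: replacing each GDP figure by, say, its logarithm would leave the ranks, and therefore the coefficient, unchanged.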

Showing the data

Measures of correlation can be misleading, so it is best to see the overall picture with a scatterplot.

In [30]:
%matplotlib inline
perCapitaVsLife.plot(x=PER_CAPITA, y=LIFE, kind='scatter', grid=True, figsize=(15, 6))
Out[30]:
<matplotlib.axes._subplots.AxesSubplot at 0x9ac3208>

The GDP per capita axis may use a logarithmic scale to better display the large range of values.

In [31]:
perCapitaVsLife.plot(x=PER_CAPITA, y=LIFE, kind='scatter', grid=True, logx=True, figsize=(15, 6))
Out[31]:
<matplotlib.axes._subplots.AxesSubplot at 0x97f1f98>

The plot shows that there is a direct correlation. There are no countries with a per capita GDP below £1,000 and a life expectancy above 66 years. There are only 2 countries with a per capita GDP above £10,000 and a life expectancy below 69 years.

Populations of intermediate wealth (per capita GDP between £1,000 and £10,000) have a wider spread. The 4 countries with the lowest life expectancy fall within this band.

There are 2 obvious outliers: countries around the £10,000 and £20,000 per capita GDP marks, with life expectancies of 47 and 53 years respectively.

In [32]:
# notable outlier - Botswana
# sort_values() replaces the sort() method, which newer pandas versions no longer provide
perCapitaVsLife.sort_values(PER_CAPITA)[104:113]
Out[32]:
                country  GDP per capita (£)  Life expectancy (years)
169        Turkmenistan                8950                       65
113          Montenegro                9034                       75
162            Thailand                9138                       74
39           Costa Rica                9177                       80
21             Botswana                9905                       47
77                 Iraq               10037                       69
22               Brazil               10050                       74
24             Bulgaria               10054                       74
76   Iran, Islamic Rep.               10307                       74
In [33]:
# notable outlier - Equatorial Guinea
# sort_values() replaces the sort() method, which newer pandas versions no longer provide
perCapitaVsLife.sort_values(PER_CAPITA)[144:153]
Out[33]:
                 country  GDP per capita (£)  Life expectancy (years)
44        Czech Republic               18544                       78
106                Malta               18614                       81
166  Trinidad and Tobago               19355                       70
43                Cyprus               19938                       80
51     Equatorial Guinea               20475                       53
79                Israel               20764                       82
87           Korea, Rep.               21146                       81
151                Spain               21148                       82
133          Puerto Rico               22328                       79
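
Slicing the sorted table by position works, but the slice bounds have to be found by trial and error. A possible alternative, sketched here on six rows copied from the two tables above, is to select outliers with a boolean condition. The thresholds MIN_GDP and MAX_LIFE are hypothetical values picked for this example, not figures from the analysis.

```python
import pandas as pd

PER_CAPITA = 'GDP per capita (£)'
LIFE = 'Life expectancy (years)'

# Six rows taken from the two sorted slices shown above
sample = pd.DataFrame({
    'country': ['Costa Rica', 'Botswana', 'Iraq',
                'Cyprus', 'Equatorial Guinea', 'Israel'],
    PER_CAPITA: [9177, 9905, 10037, 19938, 20475, 20764],
    LIFE: [80, 47, 69, 80, 53, 82]})

MIN_GDP = 9000   # hypothetical lower bound for "medium to high" wealth
MAX_LIFE = 60    # hypothetical upper bound for "very low" life expectancy

# Keep rows that are both relatively wealthy and short-lived
outliers = sample[(sample[PER_CAPITA] >= MIN_GDP) & (sample[LIFE] < MAX_LIFE)]
print(outliers)
```

On this sample the condition selects Botswana and Equatorial Guinea, the same two countries picked out by eye from the scatterplot.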

Conclusions

To sum up, there is a strong correlation between a country's per capita wealth and the life expectancy of its inhabitants.

However, whilst a low per capita GDP limits the life expectancy of a population, an intermediate to high per capita GDP does not guarantee a higher life expectancy. The 4 countries with the lowest life expectancy fall within the band of intermediate wealth.

Two notable exceptions to the overall trend are Botswana and Equatorial Guinea, with medium to high per capita GDP figures but very low life expectancies. A little research reveals some possible contributory factors for these anomalies. Botswana has been hit particularly hard by HIV/AIDS, whilst Equatorial Guinea's oil wealth is distributed very unevenly.