by Michel Wermelinger and Dave Smith, 15 November 2015

This is an amended project notebook for Week 3 of The Open University's *Learn to code for Data Analysis* course.

Richer countries can afford to invest more in healthcare, in work and road safety, and in other measures that reduce mortality. On the other hand, richer countries may have less healthy lifestyles. Is there any relation between the wealth of a country and the life expectancy of its inhabitants?

The following analysis checks whether there is any correlation between the per capita gross domestic product (GDP) of a country in 2013 and the life expectancy of people born in that country in 2013.

Two World Bank datasets are considered. One dataset, available at http://data.worldbank.org/indicator/NY.GDP.PCAP.PP.CD, lists the per capita GDP of the world's countries in current US dollars, for various years. The use of a common currency allows values to be compared across countries. The other dataset, available at http://data.worldbank.org/indicator/SP.DYN.LE00.IN, lists the life expectancy of the world's countries.

The datasets are downloaded directly, using the unique indicator name given in the URL.

In [23]:

```
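from pandas import *
# pandas.io.wb has been removed from pandas; the same download() function
# is provided by the separate pandas-datareader package.
from pandas_datareader.wb import download

YEAR = 2013
# GDP per capita indicator, as named in the first URL above
PER_CAPITA_INDICATOR = 'NY.GDP.PCAP.PP.CD'
perCapita = download(indicator=PER_CAPITA_INDICATOR, country='all', start=YEAR, end=YEAR)
# life expectancy indicator, as named in the second URL above
LIFE_INDICATOR = 'SP.DYN.LE00.IN'
life = download(indicator=LIFE_INDICATOR, country='all', start=YEAR, end=YEAR)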
```

Inspecting the data with `head()` and `tail()` shows that:

- country names are the row indices, not column values;
- the first 34 rows are aggregated data, for the Arab World, the Caribbean small states, and other country groups used by the World Bank;
- GDP per capita and life expectancy values are missing for some countries.

The data is therefore cleaned by:

- transforming the dataframe index into columns and creating a new index 0, 1, 2, etc.;
- removing the first 34 rows;
- removing rows with unavailable values.

In [24]:

```
perCapita = perCapita.reset_index()[34:].dropna()
life = life.reset_index()[34:].dropna()
```
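
The effect of the three cleaning steps can be seen on a small made-up table. This is only a sketch: the country names, the values and the single aggregate row are assumptions for illustration, chosen to mimic the shape of the downloaded data (country and year as row indices).

```python
import pandas as pd

# Made-up values; only the shape mimics the downloaded data.
index = pd.MultiIndex.from_tuples(
    [('Arab World', '2013'), ('Country A', '2013'), ('Country B', '2013')],
    names=['country', 'year'])
toy = pd.DataFrame({'NY.GDP.PCAP.PP.CD': [15000.0, float('nan'), 2500.0]},
                   index=index)

AGGREGATE_ROWS = 1  # the toy table starts with a single country group

# Turn the index levels into ordinary columns; rows are renumbered 0, 1, 2.
flat = toy.reset_index()
# Skip the leading aggregate rows, then drop rows with missing values.
cleaned = flat[AGGREGATE_ROWS:].dropna()
print(cleaned)
```

Only 'Country B' survives: the first row is removed by position and 'Country A' is removed because its value is missing.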

The World Bank reports GDP per capita in US dollars and cents. Here, the value is converted to British pounds (the author's local currency) with the following auxiliary function, using the average 2013 dollar-to-pound conversion rate provided by http://www.ukforex.co.uk/forex-tools/historical-rate-tools/yearly-average-rates.

In [25]:

```
def usdToGBP(usd):
    return usd / 1.564768
PER_CAPITA = 'GDP per capita (£)'
perCapita[PER_CAPITA] = perCapita[PER_CAPITA_INDICATOR].apply(usdToGBP).apply(round)
perCapita.head()
```
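
As an aside, the same conversion could be written without an auxiliary function, because dividing a pandas column by a number divides every value in it. The sketch below uses made-up dollar amounts; the name `USD_PER_GBP` is introduced here for illustration only.

```python
import pandas as pd

USD_PER_GBP = 1.564768  # average 2013 rate quoted in the text

# Made-up dollar amounts, for illustration only.
usd = pd.Series([1564.768, 15647.68, 31295.36])

# The division and the rounding are applied to every element at once.
gbp = (usd / USD_PER_GBP).round()
print(gbp.tolist())
```

One difference from `apply(round)` is that `round()` on a series keeps the values as floating-point numbers rather than turning them into integers.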

The unnecessary columns can be dropped.

In [26]:

```
COUNTRY = 'country'
headings = [COUNTRY, PER_CAPITA]
perCapita = perCapita[headings]
perCapita.head()
```

In [27]:

```
LIFE = 'Life expectancy (years)'
life[LIFE] = life[LIFE_INDICATOR].apply(round)
headings = [COUNTRY, LIFE]
life = life[headings]
life.head()
```

The tables are combined through an inner join on the common 'country' column.

In [28]:

```
perCapitaVsLife = merge(perCapita, life, on=COUNTRY, how='inner')
perCapitaVsLife.head()
```
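
To see what an inner join keeps and discards, here is a minimal sketch on two made-up tables. The country letters and values are invented; only the column names follow the notebook.

```python
import pandas as pd

# Made-up tables: 'A' has no life expectancy, 'D' has no GDP figure.
gdp = pd.DataFrame({'country': ['A', 'B', 'C'],
                    'GDP per capita (£)': [1200, 8500, 23000]})
life_toy = pd.DataFrame({'country': ['B', 'C', 'D'],
                         'Life expectancy (years)': [61, 74, 80]})

# An inner join keeps only the countries present in both tables.
joined = pd.merge(gdp, life_toy, on='country', how='inner')
print(joined)
```

Countries 'A' and 'D' are absent from the result, so every remaining row has both a GDP and a life expectancy value.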

To measure whether the life expectancy and the GDP per capita grow together, the Spearman rank correlation coefficient is used. It is a number from -1 (perfect inverse rank correlation: if one indicator increases, the other decreases) to 1 (perfect direct rank correlation: if one indicator increases, so does the other), with 0 meaning there is no rank correlation. Even a perfect correlation doesn't imply any cause-effect relation between the two indicators. A p-value below 0.05 is conventionally taken to mean that the correlation is statistically significant.

In [29]:

```
from scipy.stats import spearmanr
(correlation, pValue) = spearmanr(perCapitaVsLife[PER_CAPITA], perCapitaVsLife[LIFE])
print('The correlation is', correlation)
print('The p-value is', pValue)
if pValue < 0.05:
    print('It is statistically significant.')
else:
    print('It is not statistically significant.')
```
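
The following sketch illustrates what 'rank correlation' means, on made-up numbers. It relies on the fact that the Spearman coefficient equals the ordinary (Pearson) correlation computed on the ranks of the values instead of the values themselves; it is not how the cell above computes the coefficient.

```python
import pandas as pd

# Made-up data: y always grows with x, but not along a straight line.
x = pd.Series([1, 2, 3, 4, 5])
y = x ** 3  # 1, 8, 27, 64, 125

pearson = x.corr(y)                 # linear correlation of the raw values
spearman = x.rank().corr(y.rank())  # linear correlation of the ranks

print('Pearson:', pearson)
print('Spearman:', spearman)
```

Because the two series are always in the same order, the rank correlation is perfect, whereas the linear correlation falls short of 1.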

The value shows a direct correlation, i.e. richer populations tend to have longer life expectancy.

Measures of correlation can be misleading, so it is best to see the overall picture with a scatterplot.

In [30]:

```
%matplotlib inline
perCapitaVsLife.plot(x=PER_CAPITA, y=LIFE, kind='scatter', grid=True, figsize=(15, 6))
```

The GDP per capita axis may use a logarithmic scale to better display the large range of values.

In [31]:

```
perCapitaVsLife.plot(x=PER_CAPITA, y=LIFE, kind='scatter', grid=True, logx=True, figsize=(15, 6))
```
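
A short sketch of why the logarithmic scale helps: on such an axis a point is placed according to the logarithm of its value, so each tenfold increase takes up the same distance. The round values below are chosen purely for illustration.

```python
import math

# Round per capita GDP values in pounds, chosen for illustration.
marks = [100, 1000, 10000, 100000]

# Position of each value along a base-10 logarithmic axis.
positions = [math.log10(value) for value in marks]

# Distance between neighbouring marks on the logarithmic axis.
gaps = [later - earlier for earlier, later in zip(positions, positions[1:])]
print(positions)
print(gaps)
```

On a linear axis the gap between £10,000 and £100,000 would be a hundred times wider than the gap between £100 and £1,000, squeezing the poorer countries together.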

The plot shows that there is a direct correlation. There are no countries with a per capita GDP below £1,000 and a life expectancy above 66 years. There are only 2 countries with a per capita GDP above £10,000 and a life expectancy below 69 years.

Populations of intermediate wealth (per capita GDP between £1,000 and £10,000) have a wider spread. The 4 countries with the lowest life expectancy fall within this band.
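
Observations like these are read off the plot, but they could also be checked against the table. The following is a minimal sketch on a made-up table (countries 'A' to 'F' and their values are invented), using the two lowest life expectancies rather than four to keep it small.

```python
import pandas as pd

PER_CAPITA = 'GDP per capita (£)'
LIFE = 'Life expectancy (years)'

# Made-up table with the same column names as the notebook.
toy = pd.DataFrame({
    'country': ['A', 'B', 'C', 'D', 'E', 'F'],
    PER_CAPITA: [600, 1500, 4000, 9000, 15000, 30000],
    LIFE: [58, 49, 72, 51, 76, 82],
})

# The rows with the lowest life expectancy, lowest first.
lowest = toy.nsmallest(2, LIFE)
# For each of those rows, is the GDP within the intermediate band?
in_band = lowest[PER_CAPITA].between(1000, 10000)

print(lowest)
print('All within the band:', in_band.all())
```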

There are 2 obvious outliers: countries around the £10,000 and £20,000 per capita GDP marks, with life expectancies of 47 and 53 years respectively.

In [32]:

```
# notable outlier - Botswana
perCapitaVsLife.sort_values(PER_CAPITA)[104:113]
```

In [33]:

```
# notable outlier - Equatorial Guinea
perCapitaVsLife.sort_values(PER_CAPITA)[144:153]
```
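
The row positions used in the slices above depend on which countries happen to be in the table. As an alternative sketch, outliers could be selected by a condition on the values themselves. The table below is made up, and the two thresholds are assumptions picked for illustration, not values taken from the analysis.

```python
import pandas as pd

PER_CAPITA = 'GDP per capita (£)'
LIFE = 'Life expectancy (years)'

# Made-up table with the same column names as the notebook.
toy = pd.DataFrame({
    'country': ['P', 'Q', 'R', 'S'],
    PER_CAPITA: [800, 9800, 20500, 21000],
    LIFE: [55, 47, 53, 79],
})

MIN_GDP = 5000  # assumed threshold for 'medium to high' wealth
MAX_LIFE = 55   # assumed threshold for 'very low' life expectancy

# Keep the rows that are comparatively wealthy yet short-lived.
outliers = toy[(toy[PER_CAPITA] > MIN_GDP) & (toy[LIFE] < MAX_LIFE)]
print(outliers)
```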

To sum up, there is a strong correlation between a country's per capita wealth and the life expectancy of its inhabitants.

However, whilst a low per capita GDP limits the life expectancy of a population, an intermediate to high per capita GDP does not guarantee a higher life expectancy. The 4 countries with the lowest life expectancy fall within the band of intermediate wealth.

Two notable exceptions to the overall trend are Botswana and Equatorial Guinea, with medium to high per capita GDP figures but very low life expectancies. A little research reveals some possible contributory factors for these anomalies. Botswana has been hit particularly hard by HIV/AIDS, whilst Equatorial Guinea's oil wealth is distributed very unevenly.