UN public health data/Plotting countries and aggregations of countries in UN public health data

From Wikiversity
Jump to navigation Jump to search

The plots included in the Wikiversity article on "UN public health data" can be reproduced by downloading the data from the appropriate UN website as described in the R Markdown code in the vignette below and "knitting" it in Rstudio.

Process[edit | edit source]

RStudio RMarkdown knit icon
  1. Open an instance of RStudio. If you don't already have it installed, you can go to "RStudio Cloud" and click "get started for free" (at least as of 2020-10-24).[1] Or you can download and install free and open-source versions of it and R on your local computer. R is available from The Comprehensive R Archive Network. For RStudio, go to its website > Products (top center) > RStudio > "RStudio Desktop" > "Open Source Edition" > "Download RStudio Desktop".
  2. Start RStudio. Then File > "New File" > "R Markdown..." > Title: "Measuring Worth" > HTML > OK.
  3. Replace the default code in this new RMarkdown vignette on "Measuring Worth" with the text from the section below entitled, 'RMarkdown vignette on "Measuring Worth"'.
  4. Save.
  5. Click the "Knit" icon; see the companion image. Or read the text and run the code chunks one at a time manually. The latter makes it relatively easy to look at intermediate computations carefully and experiment with changing things in different ways.

RMarkdown vignette on "Plotting countries and aggregations of countries in UN public health data"[edit | edit source]

---
title: "Plotting countries and aggregations of countries in UN public health data"
author: "Spencer Graves"
date: "11/22/2020"
output: html_document
---

```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE)
```

## Intro 

Desired:  Compare the historical evolution of public health between the US and other countries, especially advanced industrialized countries like the OECD.  

One data source is the [Mortality data](https://population.un.org/wpp/Download/Standard/Mortality/) that is part of [World Population Prospects](https://population.un.org/wpp/Publications/Files/WPP2015_Volume-I_Comprehensive-Tables.pdf) published by the United Nations.  These include several tables including the following:  
* Infant Mortality Rate (IMR) (XLSX, 122 KB)
* Life Expectancy at Birth (e0) - Both Sexes (XLSX, 122 KB)

We manually downloaded both these tables on 2020-11-22.  Both contain sheets for the following:  
* "Estimates" for 255 different countries and aggregations of countries for years 1950-1955, 1955-1960, ..., 2015-2020.
* "Medium Variant" for the same countries and aggregations as for "Estimates" forecasting for 2020-2025, 2025-2030, ..., 2095-2100.  
* Notes.  

For present purposes, we plan to ignore the "Medium Variant" and "Notes" sheets and consider the "Estimates".  We first identify the files, read them, then invent ways to produce the comparisons we want.  

## Files 

```{r dataFiles}
(datFiles <- dir(pattern='\\.xlsx$'))
```

We need "IMR" and "LIFE_EXPECTANCY" tables.  

```{r imrE0}
(imrFile <- grep('IMR', datFiles, value=TRUE))
(e0File <- grep('LIFE', datFiles, value=TRUE))
```

## Read 

R has no native capability to read `*.xlsx` files.  There are several packages that claim an ability to read `*.xlsx` files.  

```{r sos, eval=FALSE}
library(sos)
(xls <- findFn('read xlsx'))
```

On 2020-11-22 the first package this found was `XLConnect`. However, this requires `rJava`.  I've had trouble with `rJava`, so I'd prefer to avoid this if feasible.  

The second package on this list was `openxlsx`.  This was most recently updated 2020-10-26, less than a month ago (though the summary displayed in a browser gives an earlier version).  It includes `read.xlsx` that looks like it might do what we need.  

```{r imr}
library(openxlsx)
imr <- read.xlsx(imrFile, startRow=17)
imr[1,]
```

The value `startRow` = 17 was selected after opening the file in spreadsheet software and experimenting with different values for this parameter.  

The value of `imr[1, "2015-2020"]` matched the number in seen in the spreadsheet software:  approximately 29, represent 29 deaths per thousand live births;  [infant mortality rate](https://en.wikipedia.org/wiki/Infant_mortality) "is the probability of deaths of children under one year of age per 1000 live births."  

```{r e0}
e0 <- read.xlsx(e0File, startRow=17)
e0[1,]
```

The value of `e0[1, "2015-2020"]` matched the number in seen in the spreadsheet software.  

However, the "numbers" were all read as `character`.  We'll ignore that for the moment.  Let's first focus on selecting the countries and regions we want to compare.  

## Countries and regions

To start I want Canada and "United States of America".  A look at the `e0` in a spreadsheet for 2015-2020 suggests that life expectancy in "Eastern Europe" is lower than that in "Northern Europe", "Southern Europe", and "Western Europe".

In addition, many on the Left comment about the relatively high standard of public health in Cuba, while those on the Right complain about how terrible the Communist government of Cuban is.  We will therefore plot Cuba separate "Latin America and the Caribbean".  "LATIN AMERICA AND THE CARIBBEAN" appears as an "SDG region" with "Country code" 1830, while "Latin America and the Caribbean" appears as a "Region" with "Country code" 904.  We'll look at the two, compare them, and decide what to do with them.  Let's also look at Japan and the "WORLD".

```{r sel}
sel <- c("Canada", "United States of America", 
   "Eastern Europe", "Northern Europe", 
   "Southern Europe", "Western Europe",
   "Cuba", "Latin America and the Caribbean", 
   "LATIN AMERICA AND THE CARIBBEAN", 
   "Japan", "WORLD")
(Sel <- which(e0[, 3] %in% sel))
```

Both `sel` and `Sel` have length = 11, so we found all the countries and regions we wanted.  

We should extract the names of the entities (countries or regions) separate from the data:  

```{r Sel2}
(entities <- imr[Sel, 3])
imr. <- imr[Sel, -(1:7)]
e0. <- e0[Sel, -(1:7)]
```

Now we convert to numeric and transpose starting with `e0`:

```{r num}
str(e0m <- as.matrix(e0.))
str(e0n <- as.numeric(e0m))
dim(e0n) <- dim(e0.)
e0.[c(1,11), c(1, 14)]
e0n[c(1,11), c(1, 14)]
```

Visual inspection indicates we did that correctly. 

Let's repeat that to get `imrn`:

```{r }
str(imrn <- as.numeric(as.matrix(imr.)))
dim(imrn) <- dim(imr.)
imr.[c(1,11), c(1, 14)]
imrn[c(1,11), c(1, 14)]
```

Before we go further, let's compare rows 2 and 4, which are the two rows from the table for Latin America and the Caribbean.  

```{r com2.4}
quantile(e0n[2,]-e0n[4,])
quantile(imrn[2,]-imrn[4,])
```

Good:  They are identical to within roundoff error.  We'll drop row 4.  

Next, we need to transpose `e0n` and `imrn`, add column names for the entities (countries and regions) and convert the year ranges into single numbers.  

```{r t}
E0 <- t(e0n[-4,])
IMR <- t(imrn[-4,])

cbind(entities[-4], 
Entities <- c('World', 'LatinAmericaCaribbean', 
    'Japan', 'Cuba', 'EEurope', 'NEurope', 
    'SEurope', 'WEurope', 'Canada', 'US'))
colnames(E0) <- Entities
colnames(IMR) <- Entities
(Yr <- seq(1952.5, 2017.5, 5))
```

## Plot 

```{r plot}
cCodes <- c('w', 'la', 'jp', 'cu', 
    'ee', 'ne', 'se', 'we', 'ca', 'us')
names(cCodes) <- colnames(E0)
cCuscu <- c('', '', '', 'cu', 
    '', '', '', '', '', 'us')
names(cCuscu) <- colnames(E0)

MatPlot <- function(y=E0, x=Yr, 
    Labels=cCuscu, Legend=cCodes, 
    legendCex=.7, 
    legendXY='topleft', ...){
  op <- par(mar=c(3,3,4, 2)+.1)
  k <- ncol(y)
  matplot(x, y, type='l', las=1, 
      xlab='', ylab='', col=1:k, 
      lty=1:k, bty='n', ...)
  legend(legendXY, lty=1:k, col=1:k, 
      bty='n', legend=Legend, 
      text.col=1:k, cex=legendCex)
  cNames <- colnames(y)
  for(i in 1:k){
    text(x, y[, i], Labels[cNames[i]], col=i)
  }
  par(op)
}

MatPlot(E0)
yLim <- range(E0[, -(1:2)])
MatPlot(E0, ylim=yLim)

MatPlot(IMR, log='y', 
    legendXY='bottomleft', legendCex=.8)
```

These plots look like what we want.  

Next, we want to export them as `svg`.  

However, `cex` must be adjusted for `svg`.  

```{r svg}
Cex <- 1.5
svg('lifeExpectancy.svg')
MatPlot(E0, cex=Cex, 
  legendCex=0.7*Cex, cex.axis=Cex)
svg('lifeExpectancy2.svg')
MatPlot(E0, ylim=yLim, 
    cex=Cex, cex.axis=Cex,
    legendCex=0.7*Cex)

svg('infantMortality.svg')
MatPlot(IMR, log='y', 
    legendXY='bottomleft', 
    cex=Cex, legendCex=.8*Cex, 
    cex.axis=Cex)
```
  1. RStudio Cloud, Wikidata Q100799903.