Data + R

class: center, middle, inverse, title-slide

# Data + R
## Epi Lunch & Learn
### Alec Robitaille
### 2019-05-29

---

# Good practices with data

Keep raw data raw

Keep track of where the data comes from

No spaces, weird characters or symbols in file names or column names

Use a consistent folder structure

```{}
data
├── derived-data
│   └── 1-prep
│       └── cleaned-ed-visits.Rds
└── raw-data
    └── ED visits for Alec.xls
```

---
# Good practices with R

Use [RStudio projects](https://csgillespie.github.io/efficientR/set-up.html#project-management)

Use [relative not absolute paths](https://www.tidyverse.org/articles/2017/12/workflow-vs-script/)

Comment scripts throughout, [keep track of changes](http://github.com/)

Use a common project structure

```
project/
├── data/
│   ├── derived-data/
│   └── raw-data/
├── graphics/
├── R/
├── epi-lunch-learn.Rproj
└── README.md
```

[An example](https://github.com/wildlifeevoeco/SocCaribou)

---
# Tidy data

![https://r4ds.had.co.nz/tidy-data.html](https://d33wubrfki0l68.cloudfront.net/6f1ddb544fc5c69a2478e444ab8112fb0eea23f8/91adc/images/tidy-1.png)

* Packages are designed to work with tidy data (`dplyr`, `ggplot2`)
* Handles challenges like inconsistent number of diagnoses
* Data is unambiguous

---
# Functions
Do you find yourself copy+pasting code, repeating lines on subsets of a data set, across different projects...

.pull-left[
`WhitehorseMeans.R`

```r
a2015 <- abs(mean(b2015) - b2015)
a2016 <- abs(mean(b2016) - b2016)
a2017 <- abs(mean(b2017) - b2015)
a2018 <- abs(mean(b2018) - b2018)
```
]

.pull-right[
`DawsonMeans.R`

```r
c2015 <- abs(mean(d2015) - d2015)
c2016 <- abs(mean(d2016) - d2016)
c2016 <- abs(mean(d2017) - d2015)
c2018 <- abs(mean(d2018) - d2018)
```
]

Problems...

* a greater risk of typos = hidden errors
* more lines of code = lost in your scripts
* more typing = tiring, [carpal tunnel](https://www.youtube.com/watch?v=fhauC2TwgxI)
* can't use in other projects or scripts = not reusable
* any change you make has to be made everywhere

---
# Alternatively..

#### Write a function

```r
calc <- function(b) {
  abs(mean(b) - b)
}
a2015 <- calc(b2015)
a2016 <- calc(b2016)
# ...
```

### and apply that function over a list

```r
lsYears <- list(b2015, b2016, b2017, b2018)
lapply(lsYears, calc)
```

### Or across groups

```r
bees %>% 
	group_by(year) %>% 
	calc(val)
```

---
# Plotting

.pull-right[![](https://raw.githubusercontent.com/tidyverse/ggplot2/master/man/figures/logo.png)]

## `ggplot`

## `geom_*`

What kind of plot? (points, lines, histograms, ...)

## `aes`

Links the data to aesthetic properties.

* ID -> color
* Treatment -> linetype
* value -> point size

```r
ggplot(mtcars) + 
	geom_point(aes(mpg, cyl))
```

---
# Spatial methods

![](https://user-images.githubusercontent.com/520851/50280460-e35c1880-044c-11e9-9ed7-cc46754e49db.jpg)

* [`sf`](https://github.com/r-spatial/sf/)

---
# Resources
* [Advanced R](http://adv-r.had.co.nz/)
* [Efficient R](https://csgillespie.github.io/efficientR)
* [R for Data Science](https://r4ds.had.co.nz/)

## Extras
* [RMarkdown](http://rmarkdown.rstudio.com/)
* [data.table](https://cran.r-project.org/web/packages/data.table/)
* [GitHub](http://github.com/)