Data + R

Epi Lunch & Learn

Alec Robitaille

2019-05-29

Good practices with data

Keep raw data raw

Keep track of where the data comes from

Avoid spaces, special characters and symbols in file names and column names

Use a consistent folder structure

data
├── derived-data
│   └── 1-prep
│       └── cleaned-ed-visits.Rds
└── raw-data
    └── ED visits for Alec.xls
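
A minimal sketch of how this layout might be used from R. The data frame is made up, and a temporary directory stands in for the project root (both are assumptions) so the sketch runs anywhere.

```r
# Stand-in for the project root (assumption: a temporary directory).
root <- tempfile("project-")

# Build the derived-data folder from the tree above; raw-data is left untouched.
prep_dir <- file.path(root, "data", "derived-data", "1-prep")
dir.create(prep_dir, recursive = TRUE)

# Made-up cleaned data, for illustration only.
cleaned <- data.frame(visit_id = 1:3, year = c(2015, 2016, 2016))

# Save the cleaned copy under derived-data, never over the raw file.
out_path <- file.path(prep_dir, "cleaned-ed-visits.Rds")
saveRDS(cleaned, out_path)

# Read it back, as a later script would.
reloaded <- readRDS(out_path)
```

Writing only to derived-data means the raw file can always be used to regenerate everything else.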

Good practices with R

Use RStudio projects

Use relative not absolute paths

Comment scripts throughout and keep track of changes

Use a common project structure

project/
├── data/
│   ├── derived-data/
│   └── raw-data/
├── graphics/
├── R/
├── epi-lunch-learn.Rproj
└── README.md
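
A sketch of the relative-path advice. With an RStudio project open, the working directory is the project root, so paths can be built from there with file.path(). The file name visits.Rds and the absolute path in the comment are hypothetical.

```r
# Absolute path: breaks on any other computer (hypothetical example).
# visits <- readRDS("C:/Users/alec/Documents/project/data/derived-data/visits.Rds")

# Relative path: resolved from the project root, portable across machines.
visits_path <- file.path("data", "derived-data", "visits.Rds")

# Guard so the script reports clearly when the file is missing.
if (file.exists(visits_path)) {
  visits <- readRDS(visits_path)
} else {
  message("Not found: ", visits_path)
}
```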

An example

Tidy data

https://r4ds.had.co.nz/tidy-data.html

  • Packages such as dplyr and ggplot2 are designed to work with tidy data
  • Handles challenges like an inconsistent number of diagnoses
  • Data is unambiguous
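
A small sketch of the diagnoses point, using made-up data and tidyr's pivot_longer(). The column names are assumptions. In wide form, every visit needs as many diagnosis columns as the longest record; in tidy form, each diagnosis is one row.

```r
library(tidyr)

# Made-up wide data: visits have between one and three diagnoses.
wide <- data.frame(
  visit_id = c(1, 2, 3),
  dx_1 = c("asthma", "fracture", "influenza"),
  dx_2 = c(NA, "laceration", "pneumonia"),
  dx_3 = c(NA, NA, "dehydration")
)

# One row per visit and diagnosis; empty slots are dropped.
tidy <- pivot_longer(
  wide,
  cols = starts_with("dx_"),
  names_to = "dx_order",
  values_to = "diagnosis",
  values_drop_na = TRUE
)
```

The tidy table has no placeholder NA cells, and a visit with a fourth diagnosis would add a row rather than a new column.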

Functions

Do you find yourself copying and pasting code, repeating the same lines on subsets of a data set or across different projects?

WhitehorseMeans.R

a2015 <- abs(mean(b2015) - b2015)
a2016 <- abs(mean(b2016) - b2016)
a2017 <- abs(mean(b2017) - b2015)
a2018 <- abs(mean(b2018) - b2018)

DawsonMeans.R

c2015 <- abs(mean(d2015) - d2015)
c2016 <- abs(mean(d2016) - d2016)
c2016 <- abs(mean(d2017) - d2015)
c2018 <- abs(mean(d2018) - d2018)

Problems...

  • a greater risk of typos = hidden errors
  • more lines of code = lost in your scripts
  • more typing = tiring, carpal tunnel
  • can't use in other projects or scripts = not reusable
  • any change you make has to be made everywhere

Alternatively...

Write a function

# Absolute deviation of each value from the mean
calc <- function(b) {
  abs(mean(b) - b)
}
a2015 <- calc(b2015)
a2016 <- calc(b2016)
# ...

and apply that function over a list

lsYears <- list(b2015, b2016, b2017, b2018)
lapply(lsYears, calc)

Or across groups

library(dplyr)
bees %>%
  group_by(year) %>%
  mutate(deviation = calc(val))  # calc() runs within each year
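
For a version that runs end to end, here is a sketch with made-up numbers. The names calc, lsYears, bees, year and val follow the slides, but the values are invented, and base R's ave() is swapped in for the grouped step so no packages are needed.

```r
# Absolute deviation of each value from the mean of its vector.
calc <- function(b) {
  abs(mean(b) - b)
}

# Made-up yearly vectors, named so the results stay labelled.
lsYears <- list(b2015 = c(1, 2, 3), b2016 = c(10, 20, 30))
devs <- lapply(lsYears, calc)

# Made-up long-format data: one row per observation.
bees <- data.frame(
  year = c(2015, 2015, 2015, 2016, 2016, 2016),
  val = c(1, 2, 3, 10, 20, 30)
)

# ave() applies calc() within each year and returns values in row order.
bees$dev <- ave(bees$val, bees$year, FUN = calc)
```

Because the calculation lives in one place, a change to calc() is picked up by every year and every script that uses it.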

Plotting

ggplot

geom_*

What kind of plot? (points, lines, histograms, ...)

aes

Links the data to aesthetic properties.

  • ID -> color
  • Treatment -> linetype
  • value -> point size
library(ggplot2)
ggplot(mtcars) +
  geom_point(aes(mpg, cyl))
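
To sketch aesthetic mappings such as color and point size with the built-in mtcars data: the choice of columns is an assumption made for illustration, not taken from the talk.

```r
library(ggplot2)

# Map columns to aesthetics inside aes(): position, color and point size.
p <- ggplot(mtcars, aes(x = wt, y = mpg)) +
  geom_point(aes(color = factor(cyl), size = hp)) +
  labs(
    x = "Weight (1000 lbs)", y = "Miles per gallon",
    color = "Cylinders", size = "Horsepower"
  )

# Typing p at the console (or calling print(p)) draws the plot.
```

Wrapping cyl in factor() treats it as a category, so each cylinder count gets its own color instead of a continuous gradient.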

Spatial methods
