data.table or: How I Learned To Stop For-Looping and Love With I Do J By Group

class: center, middle, inverse, title-slide

# data.table or: How I Learned To Stop For-Looping and Love With I Do J By Group
### Alec Robitaille
### 2017-11-16

---

# What is wrong with my for loops?

for loops (*in R*) are inefficient, especially when growing objects iteratively.

they can be improved by pre-allocating the vectors or lists, if we know the expected size of the final objects.

regardless, there are more concise and efficient ways of writing R (both in number of keystrokes and processing efficiency)

---
class:middle, center, clear, review

`DT[i, j, by]`

which reads:

> with I do J by group

???
whenever you want to write a data.table expression, repeat these words...

---

# Modify by reference

One of data.table's main driving philosophies is to modify objects *by reference*.

This avoids the memory cost of assigning and reassigning objects by changing them in place.

This also means that methods on small tables translate better to large ones.

---

# Creating a data.table

.pull-left[
data.tables can go anywhere data.frames go because they share the data.frame class:

```r
diamonds <- data.frame(diamonds)

setDT(diamonds)

class(diamonds)
```

```
## [1] "data.table" "data.frame"
```
]

???
differentiate setDT and data.table and as.data.table
note that this isn't the only way
if it is used though, it only needs to be run
once. stick it on a line and move on

---

# data.table "i"

forget `base::subset` and its weird issues

```r
diamonds[cut == 'Very Good']
```

```
##    carat       cut color clarity depth table price    x    y    z
## 1:  0.24 Very Good     J    VVS2  62.8    57   336 3.94 3.96 2.48
## 2:  0.24 Very Good     I    VVS1  62.3    57   336 3.95 3.98 2.47
## 3:  0.26 Very Good     H     SI1  61.9    55   337 4.07 4.11 2.53
## 4:  0.23 Very Good     H     VS1  59.4    61   338 4.00 4.05 2.39
## 5:  0.30 Very Good     J     SI1  62.7    59   351 4.21 4.27 2.66
```

```r
diamonds[carat < 0.24 & color != 'J']
```

```
##    carat       cut color clarity depth table price    x    y    z
## 1:  0.23     Ideal     E     SI2  61.5    55   326 3.95 3.98 2.43
## 2:  0.21   Premium     E     SI1  59.8    61   326 3.89 3.84 2.31
## 3:  0.23      Good     E     VS1  56.9    65   327 4.05 4.07 2.31
## 4:  0.22      Fair     E     VS2  65.1    61   337 3.87 3.78 2.49
## 5:  0.23 Very Good     H     VS1  59.4    61   338 4.00 4.05 2.39
```

---
# data.table "i": ordering the rows

```r
diamonds[order(price)]
```

```
##    carat     cut color clarity depth table price    x    y    z
## 1:  0.23   Ideal     E     SI2  61.5    55   326 3.95 3.98 2.43
## 2:  0.21 Premium     E     SI1  59.8    61   326 3.89 3.84 2.31
## 3:  0.23    Good     E     VS1  56.9    65   327 4.05 4.07 2.31
## 4:  0.29 Premium     I     VS2  62.4    58   334 4.20 4.23 2.63
## 5:  0.31    Good     J     SI2  63.3    58   335 4.34 4.35 2.75
```

```r
diamonds[order(depth, price, carat)]
```

```
##    carat   cut color clarity depth table price    x    y    z
## 1:  1.00  Fair     G     SI1  43.0    59  3634 6.32 6.27 3.97
## 2:  1.09 Ideal     J     VS2  43.0    54  4778 6.53 6.55 4.12
## 3:  1.00  Fair     G     VS2  44.0    53  4032 6.31 6.24 4.12
## 4:  1.43  Fair     I     VS1  50.8    60  6727 7.73 7.25 3.93
## 5:  0.30  Fair     E    VVS2  51.0    67   945 4.67 4.62 2.37
```

???
a quick note about non-standard evaluation
fill more actual description about what non-standard eval is
notice how in the previous example, we used the column names directly?

so easy, so clean..

---
# data.table  "j"

this is where we can select, compute on, add or delete columns.

the only requirement for j is to return a list, list of lists or data.frame/data.table. ie: no SpatialPolygons, etc. [1]

```r
diamonds[, sum(price)]
```

```
## [1] 212135217
```

```r
diamonds[, .(sum(price))]
```

```
##           V1
## 1: 212135217
```

.footnote[
[1] You can store complex data in columns if they are within a list. See [here](http://brooksandrew.github.io/simpleblog/articles/advanced-data-table/#columns-of-lists) for an example.
]

---
# data.table  "j"

```r
diamonds[, .(sumPrice = sum(price))]
```

```
##     sumPrice
## 1: 212135217
```

```r
diamonds[, .(sumPrice = sum(price),
             meanCarat = mean(carat))]
```

```
##     sumPrice meanCarat
## 1: 212135217 0.7979397
```

???
j is the real workhorse.
without intermediate output changed my life
highlight the differences between these three lines

---
# :=  add/modify columns

For one column:

```r
diamonds[, meanPrice := mean(price)]
```

```
##    carat     cut color clarity depth meanPrice
## 1:  0.23   Ideal     E     SI2  61.5    3932.8
## 2:  0.21 Premium     E     SI1  59.8    3932.8
## 3:  0.23    Good     E     VS1  56.9    3932.8
```

---
# :=  add/modify columns

For multiple columns:

```r
diamonds[, c('meanPrice', 'meanCarat')  := .(mean(price), mean(carat))]
```

Or with a list of column names, wrap with brackets.

---

# := remove columns
Our example data:

Remove the meanPrice column

```r
diamonds[, meanPrice := NULL]
```

```
##    carat     cut color clarity depth
## 1:  0.23   Ideal     E     SI2  61.5
## 2:  0.21 Premium     E     SI1  59.8
## 3:  0.23    Good     E     VS1  56.9
```

---

# := remove columns

Remove multiple columns

```r
diamonds[, c('carat', 'cut') := NULL]
```

```
##    color clarity depth
## 1:     E     SI2  61.5
## 2:     E     SI1  59.8
## 3:     E     VS1  56.9
```

---

# data.table "by"

## Single column

```r
diamonds[, mean(price), by = color]
```

```
##    color       V1
## 1:     E 3076.752
## 2:     I 5091.875
## 3:     J 5323.818
## 4:     H 4486.669
## 5:     F 3724.886
## 6:     G 3999.136
## 7:     D 3169.954
```

## Multiple columns

```r
diamonds[, mean(price), by = .(color, clarity)]
diamonds[, mean(price), by = c('color', 'clarity')]
diamonds[, mean(price), by = color:clarity]
```

???
how do we fix this column name?
instant pop quiz review
alternatively, how can we add it as a column

highlight the clarity and unambiguous nature of this syntax
we aren't using any strange numbers that have no meaning

---
#  data.table "by"

## Group by a conditional statement

```r
diamonds[, mean(price), by = price > 500]
```

```
##    price > 500        V1
## 1:       FALSE  445.4197
## 2:        TRUE 4049.6671
```

## Group by a list of columns

```r
cols <- c('color', 'clarity')

diamonds[, mean(price), by = cols]
```

---

# Chaining expressions

```r
diamonds[, mnPrice := mean(price), by = cut][, dif := price - mnPrice]
```

```
##    price  mnPrice        dif
## 1:  5499 3981.760  1517.2401
## 2:   753 3981.760 -3228.7599
## 3:   873 3457.542 -2584.5420
## 4:  4760 3981.760   778.2401
## 5: 14719 4584.258 10134.7423
```

???
order locs by datetime then calculate step length

---

# data.table special symbols

`.SD`
- **S**ubset of the **D**ata.table

`.BY`
- list of items in the by

`.I`
- row numbers in the data.table

`.GRP`
- group counter

`.N`
- number of rows (in each group)

---
# data.table .SD

.left-tight[
.SD is the list of columns in the data.table and represents the subset of the data.table when grouping.

.SDcols informs data.table that only a subset of the columns are needed for a function.

(note that .SDcols doesn't accept the .(col1, col2) syntax)

]

.right-wide[

```r
diamonds[, lapply(.SD, mean), .SDcols = c('price', 'carat')]
```

```
##     price     carat
## 1: 3932.8 0.7979397
```

```r
diamonds[, lapply(.SD, mean), 
         .SDcols = c('price', 'carat'), 
         by = color]
```

```
##    color    price     carat
## 1:     E 3076.752 0.6578667
## 2:     I 5091.875 1.0269273
## 3:     J 5323.818 1.1621368
## 4:     H 4486.669 0.9117991
## 5:     F 3724.886 0.7365385
## 6:     G 3999.136 0.7711902
## 7:     D 3169.954 0.6577948
```
]

---

# data.table .N

.pull-left[
## How many rows?

```r
diamonds[, .N]
```

```
## [1] 53940
```

]

.pull-right[
## In a group?

```r
diamonds[, .N, by = .(color, cut)]
```

```
##    color     cut    N
## 1:     E   Ideal 3903
## 2:     E Premium 2337
## 3:     E    Good  933
## 4:     I Premium 1428
## 5:     J    Good  307
```

]

???
recently juliana and I were running into an issue with
multiple locs for the same individual at the same time
this caused a major issue for

---

# data.table .BY

## For plotting:

```r
diamonds[, print(qplot(price, main = as.character(.BY))), by = color]
```

## Conditional statements on the value of the group

```r
bestCuts <- c('Ideal', 'Premium', 'Very Good')

diamonds[, .(adjustedMeanPrice = if(.BY %in% bestCuts){
  mean(price) + 200
} else {
  mean(price) - 200
}), by = cut]
```

## If using multiple by columns, simply select the .BY of interest

```r
diamonds[1:3, paste('the carat is...', .BY[1]),  by = .(carat, color)]
```

```r
diamonds[1:3, paste0(.BY[1], .BY[2]),  by = .(carat, color)]
```

# data.table .I

```r
diamonds[, rowN := .I]
```

```
##        cut color depth price rowN
## 1:   Ideal     E  61.5   326    1
## 2: Premium     E  59.8   326    2
## 3:    Good     E  56.9   327    3
## 4: Premium     I  62.4   334    4
## 5:    Good     J  63.3   335    5
```

```r
diamonds[, .I[which.max(price)], by = color]
```

```
##    color    V1
## 1:     E 27721
## 2:     I 27750
## 3:     J 27685
## 4:     H 27746
## 5:     F 27741
## 6:     G 27749
## 7:     D 27677
```

# data.table .GRP
Group counter

```
##    color GRP
## 1:     E   1
## 2:     I   2
## 3:     J   3
## 4:     H   4
## 5:     F   5
## 6:     G   6
## 7:     D   7
```

???
add grp as grp then use on grp later

---
class: important
# Good practices with data.table
build your expressions stepwise, ensuring each works before adding any columns to the DT.

1. check that your subsetting is working
1. with subsetting check the j
1. with sub and j, check group 
1. then add the column, once you are sure the result is good.

---
# Advanced methods

## Functions in data.table's i or by

```r
diamonds[price == max(price), .(carat, color, price, clarity, depth)]
```

```
##    carat color price clarity depth
## 1:  2.29     I 18823     VS2  60.8
```

```r
diamonds[sample(.N, 5), .(carat, color, price, clarity, depth)]
```

```
##    carat color price clarity depth
## 1:  1.00     D  5880     SI1  59.6
## 2:  1.73     H  6007     SI2  65.9
## 3:  0.73     E  2661     SI1  60.1
## 4:  1.11     H  6061     VS1  63.2
## 5:  0.72     I  2326     VS2  62.1
```

---

# Plenty of other functionality
* the amazing fread, fwrite
* setnames
* setcolorder
* shift
* merge
* dcast
* melt
* IDate, ITime

my favorite combo: lapply -> rbindlist

???
why you shouldn't use subset:
https://stackoverflow.com/questions/9860090/why-is-better-than-subset

---
class: clear

```r
### data.table Step Length ----
# Set columns
time.col <- 'datetime'
coord.cols <- c('EASTING', 'NORTHING')

# Create lag and dif column names
lag.cols <- paste('lag', coord.cols, sep = '')
difference.cols <- c('difX', 'difY')

lag.time.col <- paste0('lag', time.col)
dif.time.col <- paste0('dif', time.col)

# Use shift  to create lagged cols
locs[order(get(time.col)), (lag.cols) := shift(.SD, 1, NA, 'lag'),
     by = .(ANIMAL_ID, year),
     .SDcols = coord.cols]

# Find the difference squared between all points in each x,y separately
locs[, (difference.cols) := .((get(coord.cols[1]) - get(lag.cols[1])) ^2,
                              (get(coord.cols[2]) - get(lag.cols[2])) ^2)]

# Square root the summed difference for a simple step length
locs[, simpleStep := sqrt(rowSums(.SD)),
     .SDcols = difference.cols]
```

---

```r
## Delta Time
locs[order(get(time.col)), (lag.time.col) := shift(.SD, 1, NA, 'lag'), 
     by = .(ANIMAL_ID, year),
     .SDcols = time.col]

# difference in time in hours
locs[, (dif.time.col) := as.numeric(get(time.col) - get(lag.time.col), units = 'hours')]

# Simple step length divided by time difference
locs[, moveRate := simpleStep / (get(dif.time.col))]
```

---

class: clear

```r
# Drop columns leaving only needed
cols <- c('EASTING', 'NORTHING', 'blockByIDYear', 'ANIMAL_ID', 'block', 'HERD')

# and set all locs as observed
observed.locs <- locs[, ..cols][, observed := 1]

# Create identical for random with observed == 0
random.locs <- locs[, ..cols][, observed := 0]

# Generate an equivalent number of random points in vertices as observed (with encamped state)
# (updating the observed locs EASTING, NORTHING columns)
random.locs[, c('EASTING', 'NORTHING') :=
 as.data.table(spsample(vertices.95[vertices.95@data$id == .BY[[1]],],
      .N, iter = 100, type = 'random')@coords),
            by = blockByIDYear]
```