class: center, middle, inverse, title-slide # data.table or: How I Learned To Stop For-Looping and Love With I Do J By Group ### Alec Robitaille ### 2017-11-16 --- # What is wrong with my for loops? for loops (*in R*) are inefficient, especially when growing objects iteratively. they can be improved by pre-allocating the vectors or lists, if we know the expected size of the final objects. regardless, there are more concise and efficient ways of writing R (both in number of keystrokes and processing efficiency) --- class:middle, center, clear, review `DT[i, j, by]` which reads: > with I do J by group ??? whenever you want to write a data.table expression, repeat these words... --- # Modify by reference One of data.table's main driving philosophies is to modify objects *by reference*. This avoids the memory cost of assigning and reassigning objects by changing them in place. This also means that methods on small tables translate better to large ones. --- # Creating a data.table .pull-left[ data.tables can go anywhere data.frames go because they share the data.frame class: ```r diamonds <- data.frame(diamonds) setDT(diamonds) class(diamonds) ``` ``` ## [1] "data.table" "data.frame" ``` ] .pull-right[ Function | Second Header :------------ | :------------ `setDT` | coerce by reference lists or DFs `as.data.table` | cast to data.table (requires assignment) `data.table` | general constructor ] ??? differentiate setDT and data.table and as.data.table note that this isn't the only way if it is used though, it only needs to be run once. stick it on a line and move on --- # data.table "i" forget `base::subset` and its weird issues ```r diamonds[cut == 'Very Good'] ``` ``` ## carat cut color clarity depth table price x y z ## 1: 0.24 Very Good J VVS2 62.8 57 336 3.94 3.96 2.48 ## 2: 0.24 Very Good I VVS1 62.3 57 336 3.95 3.98 2.47 ## 3: 0.26 Very Good H SI1 61.9 55 337 4.07 4.11 2.53 ## 4: 0.23 Very Good H VS1 59.4 61 338 4.00 4.05 2.39 ## 5: 0.30 Very Good J SI1 62.7 59 351 4.21 4.27 2.66 ``` ```r diamonds[carat < 0.24 & color != 'J'] ``` ``` ## carat cut color clarity depth table price x y z ## 1: 0.23 Ideal E SI2 61.5 55 326 3.95 3.98 2.43 ## 2: 0.21 Premium E SI1 59.8 61 326 3.89 3.84 2.31 ## 3: 0.23 Good E VS1 56.9 65 327 4.05 4.07 2.31 ## 4: 0.22 Fair E VS2 65.1 61 337 3.87 3.78 2.49 ## 5: 0.23 Very Good H VS1 59.4 61 338 4.00 4.05 2.39 ``` --- # data.table "i": ordering the rows ```r diamonds[order(price)] ``` ``` ## carat cut color clarity depth table price x y z ## 1: 0.23 Ideal E SI2 61.5 55 326 3.95 3.98 2.43 ## 2: 0.21 Premium E SI1 59.8 61 326 3.89 3.84 2.31 ## 3: 0.23 Good E VS1 56.9 65 327 4.05 4.07 2.31 ## 4: 0.29 Premium I VS2 62.4 58 334 4.20 4.23 2.63 ## 5: 0.31 Good J SI2 63.3 58 335 4.34 4.35 2.75 ``` ```r diamonds[order(depth, price, carat)] ``` ``` ## carat cut color clarity depth table price x y z ## 1: 1.00 Fair G SI1 43.0 59 3634 6.32 6.27 3.97 ## 2: 1.09 Ideal J VS2 43.0 54 4778 6.53 6.55 4.12 ## 3: 1.00 Fair G VS2 44.0 53 4032 6.31 6.24 4.12 ## 4: 1.43 Fair I VS1 50.8 60 6727 7.73 7.25 3.93 ## 5: 0.30 Fair E VVS2 51.0 67 945 4.67 4.62 2.37 ``` ??? a quick note about non-standard evaluation fill more actual description about what non-standard eval is notice how in the previous example, we used the column names directly? so easy, so clean.. --- # data.table "j" this is where we can select, compute on, add or delete columns. the only requirement for j is to return a list, list of lists or data.frame/data.table. ie: no SpatialPolygons, etc. [1] ```r diamonds[, sum(price)] ``` ``` ## [1] 212135217 ``` ```r diamonds[, .(sum(price))] ``` ``` ## V1 ## 1: 212135217 ``` .footnote[ [1] You can store complex data in columns if they are within a list. See [here](http://brooksandrew.github.io/simpleblog/articles/advanced-data-table/#columns-of-lists) for an example. ] --- # data.table "j" ```r diamonds[, .(sumPrice = sum(price))] ``` ``` ## sumPrice ## 1: 212135217 ``` ```r diamonds[, .(sumPrice = sum(price), meanCarat = mean(carat))] ``` ``` ## sumPrice meanCarat ## 1: 212135217 0.7979397 ``` ??? j is the real workhorse. without intermediate output changed my life highlight the differences between these three lines --- # := add/modify columns For one column: ```r diamonds[, meanPrice := mean(price)] ``` ``` ## carat cut color clarity depth meanPrice ## 1: 0.23 Ideal E SI2 61.5 3932.8 ## 2: 0.21 Premium E SI1 59.8 3932.8 ## 3: 0.23 Good E VS1 56.9 3932.8 ``` --- # := add/modify columns For multiple columns: ```r diamonds[, c('meanPrice', 'meanCarat') := .(mean(price), mean(carat))] ``` Or with a list of column names, wrap with brackets. --- # := remove columns Our example data: ``` ## carat cut color clarity depth meanPrice ## 1: 0.23 Ideal E SI2 61.5 3932.8 ## 2: 0.21 Premium E SI1 59.8 3932.8 ## 3: 0.23 Good E VS1 56.9 3932.8 ``` Remove the meanPrice column ```r diamonds[, meanPrice := NULL] ``` ``` ## carat cut color clarity depth ## 1: 0.23 Ideal E SI2 61.5 ## 2: 0.21 Premium E SI1 59.8 ## 3: 0.23 Good E VS1 56.9 ``` --- # := remove columns Remove multiple columns ```r diamonds[, c('carat', 'cut') := NULL] ``` ``` ## color clarity depth ## 1: E SI2 61.5 ## 2: E SI1 59.8 ## 3: E VS1 56.9 ``` --- # data.table "by" ## Single column ```r diamonds[, mean(price), by = color] ``` ``` ## color V1 ## 1: E 3076.752 ## 2: I 5091.875 ## 3: J 5323.818 ## 4: H 4486.669 ## 5: F 3724.886 ## 6: G 3999.136 ## 7: D 3169.954 ``` ## Multiple columns ```r diamonds[, mean(price), by = .(color, clarity)] diamonds[, mean(price), by = c('color', 'clarity')] diamonds[, mean(price), by = color:clarity] ``` ??? how do we fix this column name? instant pop quiz review alternatively, how can we add it as a column highlight the clarity and unambiguous nature of this syntax we aren't using any strange numbers that have no meaning --- # data.table "by" ## Group by a conditional statement ```r diamonds[, mean(price), by = price > 500] ``` ``` ## price > 500 V1 ## 1: FALSE 445.4197 ## 2: TRUE 4049.6671 ``` ## Group by a list of columns ```r cols <- c('color', 'clarity') diamonds[, mean(price), by = cols] ``` --- # Chaining expressions ```r diamonds[, mnPrice := mean(price), by = cut][, dif := price - mnPrice] ``` ``` ## price mnPrice dif ## 1: 5499 3981.760 1517.2401 ## 2: 753 3981.760 -3228.7599 ## 3: 873 3457.542 -2584.5420 ## 4: 4760 3981.760 778.2401 ## 5: 14719 4584.258 10134.7423 ``` ??? order locs by datetime then calculate step length --- # data.table special symbols `.SD` - **S**ubset of the **D**ata.table `.BY` - list of items in the by `.I` - row numbers in the data.table `.GRP` - group counter `.N` - number of rows (in each group) --- # data.table .SD .left-tight[ .SD is the list of columns in the data.table and represents the subset of the data.table when grouping. .SDcols informs data.table that only a subset of the columns are needed for a function. (note that .SDcols doesn't accept the .(col1, col2) syntax) ] .right-wide[ ```r diamonds[, lapply(.SD, mean), .SDcols = c('price', 'carat')] ``` ``` ## price carat ## 1: 3932.8 0.7979397 ``` ```r diamonds[, lapply(.SD, mean), .SDcols = c('price', 'carat'), by = color] ``` ``` ## color price carat ## 1: E 3076.752 0.6578667 ## 2: I 5091.875 1.0269273 ## 3: J 5323.818 1.1621368 ## 4: H 4486.669 0.9117991 ## 5: F 3724.886 0.7365385 ## 6: G 3999.136 0.7711902 ## 7: D 3169.954 0.6577948 ``` ] --- # data.table .N .pull-left[ ## How many rows? ```r diamonds[, .N] ``` ``` ## [1] 53940 ``` ] .pull-right[ ## In a group? ```r diamonds[, .N, by = .(color, cut)] ``` ``` ## color cut N ## 1: E Ideal 3903 ## 2: E Premium 2337 ## 3: E Good 933 ## 4: I Premium 1428 ## 5: J Good 307 ``` ] ??? recently juliana and I were running into an issue with multiple locs for the same individual at the same time this caused a major issue for --- # data.table .BY ## For plotting: ```r diamonds[, print(qplot(price, main = as.character(.BY))), by = color] ``` ## Conditional statements on the value of the group ```r bestCuts <- c('Ideal', 'Premium', 'Very Good') diamonds[, .(adjustedMeanPrice = if(.BY %in% bestCuts){ mean(price) + 200 } else { mean(price) - 200 }), by = cut] ``` ## If using multiple by columns, simply select the .BY of interest ```r diamonds[1:3, paste('the carat is...', .BY[1]), by = .(carat, color)] ``` ```r diamonds[1:3, paste0(.BY[1], .BY[2]), by = .(carat, color)] ``` # data.table .I ```r diamonds[, rowN := .I] ``` ``` ## cut color depth price rowN ## 1: Ideal E 61.5 326 1 ## 2: Premium E 59.8 326 2 ## 3: Good E 56.9 327 3 ## 4: Premium I 62.4 334 4 ## 5: Good J 63.3 335 5 ``` ```r diamonds[, .I[which.max(price)], by = color] ``` ``` ## color V1 ## 1: E 27721 ## 2: I 27750 ## 3: J 27685 ## 4: H 27746 ## 5: F 27741 ## 6: G 27749 ## 7: D 27677 ``` # data.table .GRP Group counter ``` ## color GRP ## 1: E 1 ## 2: I 2 ## 3: J 3 ## 4: H 4 ## 5: F 5 ## 6: G 6 ## 7: D 7 ``` ??? add grp as grp then use on grp later --- class: important # Good practices with data.table build your expressions stepwise, ensuring each works before adding any columns to the DT. 1. check that your subsetting is working 1. with subsetting check the j 1. with sub and j, check group 1. then add the column, once you are sure the result is good. --- # Advanced methods ## Functions in data.table's i or by ```r diamonds[price == max(price), .(carat, color, price, clarity, depth)] ``` ``` ## carat color price clarity depth ## 1: 2.29 I 18823 VS2 60.8 ``` ```r diamonds[sample(.N, 5), .(carat, color, price, clarity, depth)] ``` ``` ## carat color price clarity depth ## 1: 1.00 D 5880 SI1 59.6 ## 2: 1.73 H 6007 SI2 65.9 ## 3: 0.73 E 2661 SI1 60.1 ## 4: 1.11 H 6061 VS1 63.2 ## 5: 0.72 I 2326 VS2 62.1 ``` --- # Plenty of other functionality * the amazing fread, fwrite * setnames * setcolorder * shift * merge * dcast * melt * IDate, ITime my favorite combo: lapply -> rbindlist ??? why you shouldn't use subset: https://stackoverflow.com/questions/9860090/why-is-better-than-subset --- class: clear ```r ### data.table Step Length ---- # Set columns time.col <- 'datetime' coord.cols <- c('EASTING', 'NORTHING') # Create lag and dif column names lag.cols <- paste('lag', coord.cols, sep = '') difference.cols <- c('difX', 'difY') lag.time.col <- paste0('lag', time.col) dif.time.col <- paste0('dif', time.col) # Use shift to create lagged cols locs[order(get(time.col)), (lag.cols) := shift(.SD, 1, NA, 'lag'), by = .(ANIMAL_ID, year), .SDcols = coord.cols] # Find the difference squared between all points in each x,y separately locs[, (difference.cols) := .((get(coord.cols[1]) - get(lag.cols[1])) ^2, (get(coord.cols[2]) - get(lag.cols[2])) ^2)] # Square root the summed difference for a simple step length locs[, simpleStep := sqrt(rowSums(.SD)), .SDcols = difference.cols] ``` --- ```r ## Delta Time locs[order(get(time.col)), (lag.time.col) := shift(.SD, 1, NA, 'lag'), by = .(ANIMAL_ID, year), .SDcols = time.col] # difference in time in hours locs[, (dif.time.col) := as.numeric(get(time.col) - get(lag.time.col), units = 'hours')] # Simple step length divided by time difference locs[, moveRate := simpleStep / (get(dif.time.col))] ``` --- class: clear ```r # Drop columns leaving only needed cols <- c('EASTING', 'NORTHING', 'blockByIDYear', 'ANIMAL_ID', 'block', 'HERD') # and set all locs as observed observed.locs <- locs[, ..cols][, observed := 1] # Create identical for random with observed == 0 random.locs <- locs[, ..cols][, observed := 0] # Generate an equivalent number of random points in vertices as observed (with encamped state) # (updating the observed locs EASTING, NORTHING columns) random.locs[, c('EASTING', 'NORTHING') := as.data.table(spsample(vertices.95[vertices.95@data$id == .BY[[1]],], .N, iter = 100, type = 'random')@coords), by = blockByIDYear] ```