Motivation
Some days back I had a tought to create use cases of all the functions listed in the {dplyr} cheatsheet. Eventhough the cheatsheet shows the syntax on how to use a given function and provides a lucid one line (more or less) explanation along with excellent visual cues, I feel for new users (new to {dplyr} or R) it can be daunting to see all these functions at once and not knowing what to exactly look for to address the problem they might be facing.
What one might need to know beforehand
The said cheat sheet is available in the resources section of the RStudio website under the name Data Transformation Cheat Sheet
To demonstrate the use of the functions listed in the cheat sheet I will be using the palmerpenguis
data. Want to know more about this data set? Look at the github page of Allison Horst’s {palmerpenguin}.
Another point that I would like to mention is that this is not a comprehensive resource that necessarily documents use cases of all possible valid combinations of the functions listed in the {dplyr} cheat sheet. The reason for this is that I am lazy and not as skilled as I would like to think and pretend.
I will try and keep adding more functions overtime. I took this wise suggestion from Paul Brennan, hope he does not mind me mentioning him here.
Finally, all the mistake that I make here are mine, all, the ones that are stupid and especially the one that are very stupid. All mine.
Also I am assuming people will know about %>%
operator. Yes, not explaining it. Scroll up and look at the lazy panda gif.
I will use the explanations from the Cheat Sheet and reporduce for the readers’ benefit. These will appear verbatim as shown below.
This is how the explanation from the cheat sheet will be reproduced
Summarise Cases
These apply summary functions to columns to create a new table of summary statistics. Summary functions take vectors as input and return one value (see back).
summarise funciton
Let us use the summarise function to the obtain the mean bill length, bill depth and flipper length of the penguins.
penguins %>%
summarise(
average_bill_length = mean(bill_length_mm, na.rm = T),
average_bill_depth = mean(bill_depth_mm, na.rm = T),
average_flipper_length = mean(flipper_length_mm, na.rm = T),
)
## # A tibble: 1 x 3
## average_bill_length average_bill_depth average_flipper_length
## <dbl> <dbl> <dbl>
## 1 43.9 17.2 201.
The functions summarise_all
, summarise_at
and summarise_if
have been suspended after the introduction of the across
function in the {dplyr} release of 1.0.0. Though the across
function is not mentioned in the cheat sheet, I will try and attemp to demonstrate a use case.
across function
Since this function is not mentioned in the cheat sheet, I will reporduce the explanations from the documentation of the {dplyr} 1.0.0
across() makes it easy to apply the same transformation to multiple columns, allowing you to use select() semantics inside in summarise() and mutate().
I want average of bill lenght, bill depth and flipper length. I will attempt the across funciton to achieve this.
penguins %>%
summarise(
across(.cols = ends_with("mm"),
.fns = mean, na.rm = T,
.names = "average_{col}")
)
## # A tibble: 1 x 3
## average_bill_length_mm average_bill_depth_mm average_flipper_length_mm
## <dbl> <dbl> <dbl>
## 1 43.9 17.2 201.
across
function can also be used within mutate
function.
Group Cases
Use group_by() to create a “grouped” copy of a table. dplyr functions will manipulate each “group” separately and then combine the results.
Assume one want to get the same mean values but for all different specied of penguins. In such cases group_by
functions prove useful.
penguins %>%
group_by(species) %>%
summarise(
across(.cols = ends_with("mm"),
.fns = mean, na.rm = T,
.names = "average_{col}")
)
## `summarise()` ungrouping output (override with `.groups` argument)
## # A tibble: 3 x 4
## species average_bill_length_mm average_bill_depth_~ average_flipper_length_~
## <fct> <dbl> <dbl> <dbl>
## 1 Adelie 38.8 18.3 190.
## 2 Chinstrap 48.8 18.4 196.
## 3 Gentoo 47.5 15.0 217.
Notice the message summarise()
ungrouping output (override with .groups
argument). This is a feature of the new {dplyr} 1.0.0 where one doesnot have to explicitly call the ungroup()
Manipulate Cases
Row functions return a subset of rows as a new table.
filter function to find fluffy penguins
Extract rows that meet logical criteria. filter(iris, Sepal.Length > 7)
Say I want a data set that has observations of penguins that are more than 3.5 Kg. In such cases cases the filter function come handy.
penguins %>%
filter(body_mass_g > 3500)
## # A tibble: 264 x 8
## species island bill_length_mm bill_depth_mm flipper_length_~ body_mass_g
## <fct> <fct> <dbl> <dbl> <int> <int>
## 1 Adelie Torge~ 39.1 18.7 181 3750
## 2 Adelie Torge~ 39.5 17.4 186 3800
## 3 Adelie Torge~ 39.3 20.6 190 3650
## 4 Adelie Torge~ 38.9 17.8 181 3625
## 5 Adelie Torge~ 39.2 19.6 195 4675
## 6 Adelie Torge~ 42 20.2 190 4250
## 7 Adelie Torge~ 37.8 17.3 180 3700
## 8 Adelie Torge~ 38.6 21.2 191 3800
## 9 Adelie Torge~ 34.6 21.1 198 4400
## 10 Adelie Torge~ 36.6 17.8 185 3700
## # ... with 254 more rows, and 2 more variables: sex <fct>, year <int>
Say one is interested in penguins from a particular island (Torgersen) that are fluffy. In that case, multiple conditions can be provided to the filter function.
penguins %>%
filter(body_mass_g > 3500 & island == "Torgersen")
## # A tibble: 31 x 8
## species island bill_length_mm bill_depth_mm flipper_length_~ body_mass_g
## <fct> <fct> <dbl> <dbl> <int> <int>
## 1 Adelie Torge~ 39.1 18.7 181 3750
## 2 Adelie Torge~ 39.5 17.4 186 3800
## 3 Adelie Torge~ 39.3 20.6 190 3650
## 4 Adelie Torge~ 38.9 17.8 181 3625
## 5 Adelie Torge~ 39.2 19.6 195 4675
## 6 Adelie Torge~ 42 20.2 190 4250
## 7 Adelie Torge~ 37.8 17.3 180 3700
## 8 Adelie Torge~ 38.6 21.2 191 3800
## 9 Adelie Torge~ 34.6 21.1 198 4400
## 10 Adelie Torge~ 36.6 17.8 185 3700
## # ... with 21 more rows, and 2 more variables: sex <fct>, year <int>
distinct function
Remove rows with duplicate values.
This funciton can be used to remove duplicate rows from a table. Since penguins
data does not have duplicate rows I will use a dummy data to demonstrate a simple use case of this funciton.
Consider the following data
tibble(
Restaurant = c("Ayub's", "Saranga", "Saranga", "Kamath", "Ayub's", "Kamath",
"Ayub's"),
Dish = c("Mutton Baida Roti", "Surmai", "Prawns malwani", "Vada Sambhar",
"Chiken Sheekh Kebab", "Misal Pav", "Mutton Baida Roti"),
Flavour_rate = c(8,9,10,6,7,7,8)
) -> Flavours
Flavours
## # A tibble: 7 x 3
## Restaurant Dish Flavour_rate
## <chr> <chr> <dbl>
## 1 Ayub's Mutton Baida Roti 8
## 2 Saranga Surmai 9
## 3 Saranga Prawns malwani 10
## 4 Kamath Vada Sambhar 6
## 5 Ayub's Chiken Sheekh Kebab 7
## 6 Kamath Misal Pav 7
## 7 Ayub's Mutton Baida Roti 8
This table gives a list of dishes from different restaurants and the flavour rating for each dish. However, there is a data entry error, the first and the last dish are the same, from the same restaurant. Its a duplicate entry. I wish to remove the duplicate entry, here is how that can be done using the distinct()
function.
Flavours %>%
distinct()
## # A tibble: 6 x 3
## Restaurant Dish Flavour_rate
## <chr> <chr> <dbl>
## 1 Ayub's Mutton Baida Roti 8
## 2 Saranga Surmai 9
## 3 Saranga Prawns malwani 10
## 4 Kamath Vada Sambhar 6
## 5 Ayub's Chiken Sheekh Kebab 7
## 6 Kamath Misal Pav 7
The distint()
can be used to keep observations by using specific variable or column. Say, from the Flavours
data one only wants one observation from each restaurant along with all the variables. In that case, the distinct()
can be used as shown below.
Flavours %>%
distinct(Restaurant,.keep_all = T)
## # A tibble: 3 x 3
## Restaurant Dish Flavour_rate
## <chr> <chr> <dbl>
## 1 Ayub's Mutton Baida Roti 8
## 2 Saranga Surmai 9
## 3 Kamath Vada Sambhar 6
This would give us the first observations for each of the restaurants as they appear in the data.
sample_* functions
sample_frac function
Randomly select fraction of rows.
This function allows us to randomly sample fraction of observation from the data. We can also define if we want the sampling with replacement or without replacement.
In the code below, I randomly sample 50% of the observations without replacement.
penguins %>%
sample_frac(size = 0.5, replace = FALSE)
## # A tibble: 172 x 8
## species island bill_length_mm bill_depth_mm flipper_length_~ body_mass_g
## <fct> <fct> <dbl> <dbl> <int> <int>
## 1 Gentoo Biscoe 46.2 14.4 214 4650
## 2 Gentoo Biscoe 45.3 13.7 210 4300
## 3 Gentoo Biscoe 46.4 15 216 4700
## 4 Adelie Torge~ 42.9 17.6 196 4700
## 5 Chinst~ Dream 50.9 19.1 196 3550
## 6 Adelie Biscoe 40.5 17.9 187 3200
## 7 Adelie Dream 40.8 18.9 208 4300
## 8 Adelie Biscoe 35.3 18.9 187 3800
## 9 Gentoo Biscoe 50.4 15.7 222 5750
## 10 Adelie Biscoe 41.6 18 192 3950
## # ... with 162 more rows, and 2 more variables: sex <fct>, year <int>
sample_n function
Randomly select size rows.
This funcitons allows to select desired number of observations from the data with or without replacement.
Consider the following example where I select 50 rows with replacement.
penguins %>%
sample_n(size = 50,replace = TRUE)
## # A tibble: 50 x 8
## species island bill_length_mm bill_depth_mm flipper_length_~ body_mass_g
## <fct> <fct> <dbl> <dbl> <int> <int>
## 1 Gentoo Biscoe 45.4 14.6 211 4800
## 2 Adelie Dream 40.2 17.1 193 3400
## 3 Gentoo Biscoe 50 15.3 220 5550
## 4 Gentoo Biscoe 46.8 14.3 215 4850
## 5 Adelie Dream 37 16.5 185 3400
## 6 Gentoo Biscoe 43.3 13.4 209 4400
## 7 Gentoo Biscoe 43.5 15.2 213 4650
## 8 Adelie Biscoe 40.1 18.9 188 4300
## 9 Chinst~ Dream 52.8 20 205 4550
## 10 Gentoo Biscoe 49.1 14.8 220 5150
## # ... with 40 more rows, and 2 more variables: sex <fct>, year <int>
slice funciton
Select rows by position
This funciton gives us the abilty to select rows by the position in which they appear in the data.
Say, I want observations 25 to 32 from the penguins
data. Below code is how I would do it.
penguins %>%
slice(25:32)
## # A tibble: 8 x 8
## species island bill_length_mm bill_depth_mm flipper_length_~ body_mass_g sex
## <fct> <fct> <dbl> <dbl> <int> <int> <fct>
## 1 Adelie Biscoe 38.8 17.2 180 3800 male
## 2 Adelie Biscoe 35.3 18.9 187 3800 fema~
## 3 Adelie Biscoe 40.6 18.6 183 3550 male
## 4 Adelie Biscoe 40.5 17.9 187 3200 fema~
## 5 Adelie Biscoe 37.9 18.6 172 3150 fema~
## 6 Adelie Biscoe 40.5 18.9 180 3950 male
## 7 Adelie Dream 39.5 16.7 178 3250 fema~
## 8 Adelie Dream 37.2 18.1 178 3900 male
## # ... with 1 more variable: year <int>
top_n function
Select and order top n entries
This funciton lets one select observations that are the highest or top by a given variable.
Say, I want to select penguins that have mass in the range of the top 5 values that the body_mass_g
varible takes. This is how I would do it.
penguins %>%
top_n(n = 5, body_mass_g)
## # A tibble: 6 x 8
## species island bill_length_mm bill_depth_mm flipper_length_~ body_mass_g sex
## <fct> <fct> <dbl> <dbl> <int> <int> <fct>
## 1 Gentoo Biscoe 49.2 15.2 221 6300 male
## 2 Gentoo Biscoe 59.6 17 230 6050 male
## 3 Gentoo Biscoe 51.1 16.3 220 6000 male
## 4 Gentoo Biscoe 45.2 16.4 223 5950 male
## 5 Gentoo Biscoe 49.8 15.9 229 5950 male
## 6 Gentoo Biscoe 48.8 16.2 222 6000 male
## # ... with 1 more variable: year <int>
arrange funciton
Order rows by values of a column or columns (low to high), use with desc() to order from high to low
Say, I want to order the penguins by their body mass.
penguins %>%
arrange(body_mass_g)
## # A tibble: 344 x 8
## species island bill_length_mm bill_depth_mm flipper_length_~ body_mass_g
## <fct> <fct> <dbl> <dbl> <int> <int>
## 1 Chinst~ Dream 46.9 16.6 192 2700
## 2 Adelie Biscoe 36.5 16.6 181 2850
## 3 Adelie Biscoe 36.4 17.1 184 2850
## 4 Adelie Biscoe 34.5 18.1 187 2900
## 5 Adelie Dream 33.1 16.1 178 2900
## 6 Adelie Torge~ 38.6 17 188 2900
## 7 Chinst~ Dream 43.2 16.6 187 2900
## 8 Adelie Biscoe 37.9 18.6 193 2925
## 9 Adelie Dream 37.5 18.9 179 2975
## 10 Adelie Dream 37 16.9 185 3000
## # ... with 334 more rows, and 2 more variables: sex <fct>, year <int>
This arranges the penguins from least fluffy to most fluffy. I can arrange in decreasing order with the use of desc()
within the arrange()
.
penguins %>%
arrange(desc(body_mass_g))
## # A tibble: 344 x 8
## species island bill_length_mm bill_depth_mm flipper_length_~ body_mass_g
## <fct> <fct> <dbl> <dbl> <int> <int>
## 1 Gentoo Biscoe 49.2 15.2 221 6300
## 2 Gentoo Biscoe 59.6 17 230 6050
## 3 Gentoo Biscoe 51.1 16.3 220 6000
## 4 Gentoo Biscoe 48.8 16.2 222 6000
## 5 Gentoo Biscoe 45.2 16.4 223 5950
## 6 Gentoo Biscoe 49.8 15.9 229 5950
## 7 Gentoo Biscoe 48.4 14.6 213 5850
## 8 Gentoo Biscoe 49.3 15.7 217 5850
## 9 Gentoo Biscoe 55.1 16 230 5850
## 10 Gentoo Biscoe 49.5 16.2 229 5800
## # ... with 334 more rows, and 2 more variables: sex <fct>, year <int>
add_row funtion
Add one or more rows to a table
Say, I went to Ayub’s again and tried another dish that I want to add to the Flavours
data.
Flavours %>%
add_row(Restaurant = "Ayub's", Dish = "Biryani", Flavour_rate = 8)
## # A tibble: 8 x 3
## Restaurant Dish Flavour_rate
## <chr> <chr> <dbl>
## 1 Ayub's Mutton Baida Roti 8
## 2 Saranga Surmai 9
## 3 Saranga Prawns malwani 10
## 4 Kamath Vada Sambhar 6
## 5 Ayub's Chiken Sheekh Kebab 7
## 6 Kamath Misal Pav 7
## 7 Ayub's Mutton Baida Roti 8
## 8 Ayub's Biryani 8