Simple use cases of functions in {dplyr} cheatsheet

Motivation

Some days back I had a tought to create use cases of all the functions listed in the {dplyr} cheatsheet. Eventhough the cheatsheet shows the syntax on how to use a given function and provides a lucid one line (more or less) explanation along with excellent visual cues, I feel for new users (new to {dplyr} or R) it can be daunting to see all these functions at once and not knowing what to exactly look for to address the problem they might be facing.

What one might need to know beforehand

The said cheat sheet is available in the resources section of the RStudio website under the name Data Transformation Cheat Sheet

To demonstrate the use of the functions listed in the cheat sheet I will be using the palmerpenguis data. Want to know more about this data set? Look at the github page of Allison Horst’s {palmerpenguin}.

Artwork by @allison_horst

Another point that I would like to mention is that this is not a comprehensive resource that necessarily documents use cases of all possible valid combinations of the functions listed in the {dplyr} cheat sheet. The reason for this is that I am lazy and not as skilled as I would like to think and pretend.

Lazy GIF from Lazy GIFs

I will try and keep adding more functions overtime. I took this wise suggestion from Paul Brennan, hope he does not mind me mentioning him here.

Finally, all the mistake that I make here are mine, all, the ones that are stupid and especially the one that are very stupid. All mine.

Yes Its My Fault Daniel Espinoza GIF from Yesitsmyfault GIFs

Also I am assuming people will know about %>% operator. Yes, not explaining it. Scroll up and look at the lazy panda gif.

I will use the explanations from the Cheat Sheet and reporduce for the readers’ benefit. These will appear verbatim as shown below.

This is how the explanation from the cheat sheet will be reproduced

Summarise Cases

These apply summary functions to columns to create a new table of summary statistics. Summary functions take vectors as input and return one value (see back).

summarise funciton

Let us use the summarise function to the obtain the mean bill length, bill depth and flipper length of the penguins.

penguins %>% 
  summarise(
    average_bill_length = mean(bill_length_mm, na.rm = T),
    average_bill_depth = mean(bill_depth_mm, na.rm = T),
    average_flipper_length = mean(flipper_length_mm, na.rm = T),
  )

## # A tibble: 1 x 3
##   average_bill_length average_bill_depth average_flipper_length
##                 <dbl>              <dbl>                  <dbl>
## 1                43.9               17.2                   201.

The functions summarise_all , summarise_at and summarise_if have been suspended after the introduction of the across function in the {dplyr} release of 1.0.0. Though the across function is not mentioned in the cheat sheet, I will try and attemp to demonstrate a use case.

across function

Since this function is not mentioned in the cheat sheet, I will reporduce the explanations from the documentation of the {dplyr} 1.0.0

across() makes it easy to apply the same transformation to multiple columns, allowing you to use select() semantics inside in summarise() and mutate().

I want average of bill lenght, bill depth and flipper length. I will attempt the across funciton to achieve this.

penguins %>% 
  summarise(
    across(.cols = ends_with("mm"),
           .fns = mean, na.rm = T,
           .names = "average_{col}")
  )

## # A tibble: 1 x 3
##   average_bill_length_mm average_bill_depth_mm average_flipper_length_mm
##                    <dbl>                 <dbl>                     <dbl>
## 1                   43.9                  17.2                      201.

across function can also be used within mutate function.

Group Cases

Use group_by() to create a “grouped” copy of a table. dplyr functions will manipulate each “group” separately and then combine the results.

Assume one want to get the same mean values but for all different specied of penguins. In such cases group_by functions prove useful.

penguins %>% 
  group_by(species) %>% 
  summarise(
    across(.cols = ends_with("mm"),
           .fns = mean, na.rm = T,
           .names = "average_{col}")
  )

## `summarise()` ungrouping output (override with `.groups` argument)

## # A tibble: 3 x 4
##   species   average_bill_length_mm average_bill_depth_~ average_flipper_length_~
##   <fct>                      <dbl>                <dbl>                    <dbl>
## 1 Adelie                      38.8                 18.3                     190.
## 2 Chinstrap                   48.8                 18.4                     196.
## 3 Gentoo                      47.5                 15.0                     217.

Notice the message summarise() ungrouping output (override with .groups argument). This is a feature of the new {dplyr} 1.0.0 where one doesnot have to explicitly call the ungroup()

Manipulate Cases

Row functions return a subset of rows as a new table.

filter function to find fluffy penguins

Extract rows that meet logical criteria. filter(iris, Sepal.Length > 7)

Say I want a data set that has observations of penguins that are more than 3.5 Kg. In such cases cases the filter function come handy.

penguins %>% 
  filter(body_mass_g > 3500)

## # A tibble: 264 x 8
##    species island bill_length_mm bill_depth_mm flipper_length_~ body_mass_g
##    <fct>   <fct>           <dbl>         <dbl>            <int>       <int>
##  1 Adelie  Torge~           39.1          18.7              181        3750
##  2 Adelie  Torge~           39.5          17.4              186        3800
##  3 Adelie  Torge~           39.3          20.6              190        3650
##  4 Adelie  Torge~           38.9          17.8              181        3625
##  5 Adelie  Torge~           39.2          19.6              195        4675
##  6 Adelie  Torge~           42            20.2              190        4250
##  7 Adelie  Torge~           37.8          17.3              180        3700
##  8 Adelie  Torge~           38.6          21.2              191        3800
##  9 Adelie  Torge~           34.6          21.1              198        4400
## 10 Adelie  Torge~           36.6          17.8              185        3700
## # ... with 254 more rows, and 2 more variables: sex <fct>, year <int>

Say one is interested in penguins from a particular island (Torgersen) that are fluffy. In that case, multiple conditions can be provided to the filter function.

penguins %>% 
  filter(body_mass_g > 3500 & island == "Torgersen")

## # A tibble: 31 x 8
##    species island bill_length_mm bill_depth_mm flipper_length_~ body_mass_g
##    <fct>   <fct>           <dbl>         <dbl>            <int>       <int>
##  1 Adelie  Torge~           39.1          18.7              181        3750
##  2 Adelie  Torge~           39.5          17.4              186        3800
##  3 Adelie  Torge~           39.3          20.6              190        3650
##  4 Adelie  Torge~           38.9          17.8              181        3625
##  5 Adelie  Torge~           39.2          19.6              195        4675
##  6 Adelie  Torge~           42            20.2              190        4250
##  7 Adelie  Torge~           37.8          17.3              180        3700
##  8 Adelie  Torge~           38.6          21.2              191        3800
##  9 Adelie  Torge~           34.6          21.1              198        4400
## 10 Adelie  Torge~           36.6          17.8              185        3700
## # ... with 21 more rows, and 2 more variables: sex <fct>, year <int>

distinct function

Remove rows with duplicate values.

This funciton can be used to remove duplicate rows from a table. Since penguins data does not have duplicate rows I will use a dummy data to demonstrate a simple use case of this funciton.

Consider the following data

tibble(
  Restaurant = c("Ayub's", "Saranga", "Saranga", "Kamath", "Ayub's", "Kamath",
                 "Ayub's"),
  Dish = c("Mutton Baida Roti", "Surmai", "Prawns malwani", "Vada Sambhar",
           "Chiken Sheekh Kebab", "Misal Pav", "Mutton Baida Roti"),
  Flavour_rate = c(8,9,10,6,7,7,8)
) -> Flavours

Flavours

## # A tibble: 7 x 3
##   Restaurant Dish                Flavour_rate
##   <chr>      <chr>                      <dbl>
## 1 Ayub's     Mutton Baida Roti              8
## 2 Saranga    Surmai                         9
## 3 Saranga    Prawns malwani                10
## 4 Kamath     Vada Sambhar                   6
## 5 Ayub's     Chiken Sheekh Kebab            7
## 6 Kamath     Misal Pav                      7
## 7 Ayub's     Mutton Baida Roti              8

This table gives a list of dishes from different restaurants and the flavour rating for each dish. However, there is a data entry error, the first and the last dish are the same, from the same restaurant. Its a duplicate entry. I wish to remove the duplicate entry, here is how that can be done using the distinct() function.

Flavours %>% 
  distinct()

## # A tibble: 6 x 3
##   Restaurant Dish                Flavour_rate
##   <chr>      <chr>                      <dbl>
## 1 Ayub's     Mutton Baida Roti              8
## 2 Saranga    Surmai                         9
## 3 Saranga    Prawns malwani                10
## 4 Kamath     Vada Sambhar                   6
## 5 Ayub's     Chiken Sheekh Kebab            7
## 6 Kamath     Misal Pav                      7

The distint() can be used to keep observations by using specific variable or column. Say, from the Flavours data one only wants one observation from each restaurant along with all the variables. In that case, the distinct() can be used as shown below.

Flavours %>% 
  distinct(Restaurant,.keep_all = T)

## # A tibble: 3 x 3
##   Restaurant Dish              Flavour_rate
##   <chr>      <chr>                    <dbl>
## 1 Ayub's     Mutton Baida Roti            8
## 2 Saranga    Surmai                       9
## 3 Kamath     Vada Sambhar                 6

This would give us the first observations for each of the restaurants as they appear in the data.

sample_* functions

sample_frac function

Randomly select fraction of rows.

This function allows us to randomly sample fraction of observation from the data. We can also define if we want the sampling with replacement or without replacement.

In the code below, I randomly sample 50% of the observations without replacement.

penguins %>% 
  sample_frac(size = 0.5, replace = FALSE)

## # A tibble: 172 x 8
##    species island bill_length_mm bill_depth_mm flipper_length_~ body_mass_g
##    <fct>   <fct>           <dbl>         <dbl>            <int>       <int>
##  1 Gentoo  Biscoe           46.2          14.4              214        4650
##  2 Gentoo  Biscoe           45.3          13.7              210        4300
##  3 Gentoo  Biscoe           46.4          15                216        4700
##  4 Adelie  Torge~           42.9          17.6              196        4700
##  5 Chinst~ Dream            50.9          19.1              196        3550
##  6 Adelie  Biscoe           40.5          17.9              187        3200
##  7 Adelie  Dream            40.8          18.9              208        4300
##  8 Adelie  Biscoe           35.3          18.9              187        3800
##  9 Gentoo  Biscoe           50.4          15.7              222        5750
## 10 Adelie  Biscoe           41.6          18                192        3950
## # ... with 162 more rows, and 2 more variables: sex <fct>, year <int>

sample_n function

Randomly select size rows.

This funcitons allows to select desired number of observations from the data with or without replacement.

Consider the following example where I select 50 rows with replacement.

penguins %>% 
  sample_n(size = 50,replace = TRUE)

## # A tibble: 50 x 8
##    species island bill_length_mm bill_depth_mm flipper_length_~ body_mass_g
##    <fct>   <fct>           <dbl>         <dbl>            <int>       <int>
##  1 Gentoo  Biscoe           45.4          14.6              211        4800
##  2 Adelie  Dream            40.2          17.1              193        3400
##  3 Gentoo  Biscoe           50            15.3              220        5550
##  4 Gentoo  Biscoe           46.8          14.3              215        4850
##  5 Adelie  Dream            37            16.5              185        3400
##  6 Gentoo  Biscoe           43.3          13.4              209        4400
##  7 Gentoo  Biscoe           43.5          15.2              213        4650
##  8 Adelie  Biscoe           40.1          18.9              188        4300
##  9 Chinst~ Dream            52.8          20                205        4550
## 10 Gentoo  Biscoe           49.1          14.8              220        5150
## # ... with 40 more rows, and 2 more variables: sex <fct>, year <int>

slice funciton

Select rows by position

This funciton gives us the abilty to select rows by the position in which they appear in the data.

Say, I want observations 25 to 32 from the penguins data. Below code is how I would do it.

penguins %>% 
  slice(25:32)

## # A tibble: 8 x 8
##   species island bill_length_mm bill_depth_mm flipper_length_~ body_mass_g sex  
##   <fct>   <fct>           <dbl>         <dbl>            <int>       <int> <fct>
## 1 Adelie  Biscoe           38.8          17.2              180        3800 male 
## 2 Adelie  Biscoe           35.3          18.9              187        3800 fema~
## 3 Adelie  Biscoe           40.6          18.6              183        3550 male 
## 4 Adelie  Biscoe           40.5          17.9              187        3200 fema~
## 5 Adelie  Biscoe           37.9          18.6              172        3150 fema~
## 6 Adelie  Biscoe           40.5          18.9              180        3950 male 
## 7 Adelie  Dream            39.5          16.7              178        3250 fema~
## 8 Adelie  Dream            37.2          18.1              178        3900 male 
## # ... with 1 more variable: year <int>

top_n function

Select and order top n entries

This funciton lets one select observations that are the highest or top by a given variable.

Say, I want to select penguins that have mass in the range of the top 5 values that the body_mass_g varible takes. This is how I would do it.

penguins %>% 
  top_n(n = 5, body_mass_g)

## # A tibble: 6 x 8
##   species island bill_length_mm bill_depth_mm flipper_length_~ body_mass_g sex  
##   <fct>   <fct>           <dbl>         <dbl>            <int>       <int> <fct>
## 1 Gentoo  Biscoe           49.2          15.2              221        6300 male 
## 2 Gentoo  Biscoe           59.6          17                230        6050 male 
## 3 Gentoo  Biscoe           51.1          16.3              220        6000 male 
## 4 Gentoo  Biscoe           45.2          16.4              223        5950 male 
## 5 Gentoo  Biscoe           49.8          15.9              229        5950 male 
## 6 Gentoo  Biscoe           48.8          16.2              222        6000 male 
## # ... with 1 more variable: year <int>

arrange funciton

Order rows by values of a column or columns (low to high), use with desc() to order from high to low

Say, I want to order the penguins by their body mass.

penguins %>% 
  arrange(body_mass_g)

## # A tibble: 344 x 8
##    species island bill_length_mm bill_depth_mm flipper_length_~ body_mass_g
##    <fct>   <fct>           <dbl>         <dbl>            <int>       <int>
##  1 Chinst~ Dream            46.9          16.6              192        2700
##  2 Adelie  Biscoe           36.5          16.6              181        2850
##  3 Adelie  Biscoe           36.4          17.1              184        2850
##  4 Adelie  Biscoe           34.5          18.1              187        2900
##  5 Adelie  Dream            33.1          16.1              178        2900
##  6 Adelie  Torge~           38.6          17                188        2900
##  7 Chinst~ Dream            43.2          16.6              187        2900
##  8 Adelie  Biscoe           37.9          18.6              193        2925
##  9 Adelie  Dream            37.5          18.9              179        2975
## 10 Adelie  Dream            37            16.9              185        3000
## # ... with 334 more rows, and 2 more variables: sex <fct>, year <int>

This arranges the penguins from least fluffy to most fluffy. I can arrange in decreasing order with the use of desc() within the arrange().

penguins %>% 
  arrange(desc(body_mass_g))

## # A tibble: 344 x 8
##    species island bill_length_mm bill_depth_mm flipper_length_~ body_mass_g
##    <fct>   <fct>           <dbl>         <dbl>            <int>       <int>
##  1 Gentoo  Biscoe           49.2          15.2              221        6300
##  2 Gentoo  Biscoe           59.6          17                230        6050
##  3 Gentoo  Biscoe           51.1          16.3              220        6000
##  4 Gentoo  Biscoe           48.8          16.2              222        6000
##  5 Gentoo  Biscoe           45.2          16.4              223        5950
##  6 Gentoo  Biscoe           49.8          15.9              229        5950
##  7 Gentoo  Biscoe           48.4          14.6              213        5850
##  8 Gentoo  Biscoe           49.3          15.7              217        5850
##  9 Gentoo  Biscoe           55.1          16                230        5850
## 10 Gentoo  Biscoe           49.5          16.2              229        5800
## # ... with 334 more rows, and 2 more variables: sex <fct>, year <int>

add_row funtion

Add one or more rows to a table

Say, I went to Ayub’s again and tried another dish that I want to add to the Flavours data.

Flavours %>% 
  add_row(Restaurant = "Ayub's", Dish = "Biryani", Flavour_rate = 8)

## # A tibble: 8 x 3
##   Restaurant Dish                Flavour_rate
##   <chr>      <chr>                      <dbl>
## 1 Ayub's     Mutton Baida Roti              8
## 2 Saranga    Surmai                         9
## 3 Saranga    Prawns malwani                10
## 4 Kamath     Vada Sambhar                   6
## 5 Ayub's     Chiken Sheekh Kebab            7
## 6 Kamath     Misal Pav                      7
## 7 Ayub's     Mutton Baida Roti              8
## 8 Ayub's     Biryani                        8