group_by(): Group then summarize with group_by

A common operation in data exploration is to first split data into groups and then compute summaries for each group. For example, we may want to compute the average and standard deviation for men’s and women’s heights separately. The group_by function helps us do this.

If we type this:

heights %>% group_by(sex)
#> # A tibble: 1,050 x 2
#> # Groups:   sex [2]
#>   sex    height
#>   <fct>   <dbl>
#> 1 Male       75
#> 2 Male       70
#> 3 Male       68
#> 4 Male       74
#> 5 Male       61
#> 6 Female     65
#> # ... with 1,044 more rows

The result does not look very different from heights, except that we see Groups: sex [2] when we print the object. Although not immediately obvious from its appearance, this is now a special data frame called a grouped data frame, and dplyr functions, in particular summarize, behave differently when acting on it. Conceptually, you can think of this table as many tables, with the same columns but not necessarily the same number of rows, stacked together in one object. When we summarize the data after grouping, this is what happens:

heights %>% 
  group_by(sex) %>%
  summarize(average = mean(height), standard_deviation = sd(height))
#> # A tibble: 2 x 3
#>   sex    average standard_deviation
#>   <fct>    <dbl>              <dbl>
#> 1 Female    64.9               3.76
#> 2 Male      69.3               3.61

The summarize function applies the summarization to each group separately.
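To make the "many tables stacked together" idea concrete, here is a minimal sketch on a small made-up data frame. The object toy and its values are hypothetical, not the dslabs heights data. It checks the grouping metadata and then compares the grouped summary with one computed by hand on a single subset.

```r
library(dplyr)

# Hypothetical toy data, NOT the dslabs heights dataset
toy <- tibble(
  sex    = factor(c("Male", "Male", "Female", "Female", "Female")),
  height = c(70, 74, 64, 66, 62)
)

grouped <- toy %>% group_by(sex)

# Grouping adds metadata; the rows and columns are unchanged
is_grouped_df(grouped)  # is this a grouped data frame?
group_vars(grouped)     # name(s) of the grouping variable(s)
n_groups(grouped)       # number of groups

# Summarizing the grouped data frame gives one row per group
by_group <- grouped %>%
  summarize(average = mean(height), standard_deviation = sd(height))

# Summarizing one subset by hand should match that group's row
female_only <- toy %>%
  filter(sex == "Female") %>%
  summarize(average = mean(height), standard_deviation = sd(height))

by_group
female_only
```

The Female row of by_group should agree with female_only, which is what "applied to each group separately" means in practice.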

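For another example, let’s compute the median murder rate in each of the four regions of the United States. Note that rate is not a column of the raw murders data; it must have been added beforehand, for example with mutate: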

murders %>% 
  group_by(region) %>%
  summarize(median_rate = median(rate))
#> # A tibble: 4 x 2
#>   region        median_rate
#>   <fct>               <dbl>
#> 1 Northeast            1.80
#> 2 South                3.40
#> 3 North Central        1.97
#> 4 West                 1.29
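The murders example relies on a rate column that the raw dslabs data does not include. The sketch below shows one way to make it self-contained, assuming the rate is defined as murders per 100,000 people. The names murders_with_rate, region_summary, and n_states are introduced here only for illustration.

```r
library(dplyr)
library(dslabs)
data(murders)

# Assumption: rate means murders per 100,000 people
murders_with_rate <- murders %>%
  mutate(rate = total / population * 10^5)

# One row per region; n() counts the rows (states) in each group
region_summary <- murders_with_rate %>%
  group_by(region) %>%
  summarize(median_rate = median(rate), n_states = n())

region_summary
```

Adding n() next to the median is a small design choice: it shows how many rows each summary is based on, which helps when groups differ in size.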

Instruction

Run the sample code to see how the group_by() function works.

# load important packages
library(dplyr)
library(dslabs)
data(heights)

# Add group categorical information
heights %>% group_by(sex)

# Summarize by group
heights %>%
  group_by(sex) %>%
  summarize(average = mean(height), standard_deviation = sd(height))
