group_by(): Group then summarize with group_by
A common operation in data exploration is to first split data into groups and then compute summaries for each group. For example, we may want to compute the average and standard deviation for men’s and women’s heights separately. The group_by function helps us do this.
If we type this:
heights %>% group_by(sex)
#> # A tibble: 1,050 x 2
#> # Groups: sex [2]
#> sex height
#> <fct> <dbl>
#> 1 Male 75
#> 2 Male 70
#> 3 Male 68
#> 4 Male 74
#> 5 Male 61
#> 6 Female 65
#> # ... with 1,044 more rows
The result does not look very different from heights, except that we see Groups: sex [2] when we print the object. Although it is not obvious from its appearance, this is now a special data frame called a grouped data frame, and dplyr functions, in particular summarize, behave differently when acting on it. Conceptually, you can think of this table as many tables, with the same columns but not necessarily the same number of rows, stacked together in one object. When we summarize the data after grouping, this is what happens:
heights %>%
  group_by(sex) %>%
  summarize(average = mean(height), standard_deviation = sd(height))
#> # A tibble: 2 x 3
#> sex average standard_deviation
#> <fct> <dbl> <dbl>
#> 1 Female 64.9 3.76
#> 2 Male 69.3 3.61
The summarize function applies the summarization to each group separately.
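To make "each group separately" concrete, the following sketch computes the same summaries by hand, filtering to one sex at a time, and then compares them with the grouped version. The object names female_summary, male_summary, and by_group are illustrative and do not appear in the original text.

```r
library(dplyr)
library(dslabs)
data(heights)

# Summarize one group at a time: keep only one sex, then summarize
female_summary <- heights %>%
  filter(sex == "Female") %>%
  summarize(average = mean(height), standard_deviation = sd(height))

male_summary <- heights %>%
  filter(sex == "Male") %>%
  summarize(average = mean(height), standard_deviation = sd(height))

# The grouped version produces both rows in a single step
by_group <- heights %>%
  group_by(sex) %>%
  summarize(average = mean(height), standard_deviation = sd(height))

# Each row of by_group should match the corresponding manual summary
all.equal(by_group$average[by_group$sex == "Female"], female_summary$average)
all.equal(by_group$average[by_group$sex == "Male"], male_summary$average)
```

The manual approach needs one filter per group, which is why group_by scales better when a variable has many levels.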
For another example, let’s compute the median murder rate in the four regions of the country:
murders %>%
  group_by(region) %>%
  summarize(median_rate = median(rate))
#> # A tibble: 4 x 2
#> region median_rate
#> <fct> <dbl>
#> 1 Northeast 1.80
#> 2 South 3.40
#> 3 North Central 1.97
#> 4 West 1.29
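The murders example relies on a rate column that the raw murders table in dslabs does not include. The sketch below is self-contained and assumes that rate means murders per 100,000 people, computed from the total and population columns. That definition is an assumption here, since this section does not state it.

```r
library(dplyr)
library(dslabs)
data(murders)

# Assumption: rate = murders per 100,000 people (not defined in this section)
murders <- murders %>%
  mutate(rate = total / population * 10^5)

# Median murder rate within each region
region_medians <- murders %>%
  group_by(region) %>%
  summarize(median_rate = median(rate))

region_medians
```

If rate is missing, summarize stops with an error saying the object cannot be found, so the mutate step has to run first.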
Run the sample code to see how the group_by() function works.
# Load the required packages and data
library(dplyr)
library(dslabs)
data(heights)
# Group the data by sex (returns a grouped data frame)
heights %>% group_by(sex)
# Summarize each group separately
heights %>%
  group_by(sex) %>%
  summarize(average = mean(height), standard_deviation = sd(height))
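As a follow-up, the grouped data frame described earlier can be inspected directly. This sketch uses the dplyr helpers group_vars, n_groups, group_keys, ungroup, and is_grouped_df; the names grouped and ungrouped are illustrative only.

```r
library(dplyr)
library(dslabs)
data(heights)

grouped <- heights %>% group_by(sex)

# The grouping is stored as metadata on the object; the rows are unchanged
class(grouped)        # includes "grouped_df"
group_vars(grouped)   # names of the grouping columns
n_groups(grouped)     # number of groups
group_keys(grouped)   # one row per group

# ungroup() removes the grouping and returns an ordinary tibble
ungrouped <- grouped %>% ungroup()
is_grouped_df(ungrouped)
```

Calling ungroup after a grouped operation is a common precaution so that later steps are not applied per group by accident.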