An important part of exploratory data analysis is summarizing data. The average and standard deviation are two examples of widely used summary statistics. More informative summaries can often be achieved by first splitting data into groups. In this section, we cover two new dplyr verbs that make these computations easier: summarize and group_by. We learn to access resulting values using the pull function.
summarize
The summarize function in dplyr provides a way to compute summary statistics with intuitive and readable code. We start with a simple example based on heights. The heights dataset includes heights and sex reported by students in an in-class survey.
library(dplyr)
library(dslabs)
data(heights)
The following code computes the average and standard deviation for females:
s <- heights %>%
  filter(sex == "Female") %>%
  summarize(average = mean(height), standard_deviation = sd(height))
s
#>   average standard_deviation
#> 1    64.9               3.76
This takes our original data table as input, filters it to keep only females, and then produces a new summarized table with just the average and the standard deviation of heights. We get to choose the names of the columns of the resulting table. For example, above we decided to use average and standard_deviation, but we could have used other names just the same.
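As a minimal sketch of the naming point, the code below repeats the same computation with different column names. It runs on a small data frame called toy whose values are invented for illustration (it is not the dslabs heights data), so the result can be checked by hand: the three female heights average 64 with a standard deviation of 4.

```r
library(dplyr)

# Hypothetical toy data, invented for illustration (not the dslabs heights data)
toy <- data.frame(
  sex = c("Female", "Female", "Female", "Male", "Male"),
  height = c(60, 64, 68, 70, 72)
)

# Same computation as in the text, with column names of our own choosing
res <- toy %>%
  filter(sex == "Female") %>%
  summarize(avg = mean(height), stdev = sd(height))

res              # one row: avg = 64, stdev = 4
names(res)       # the names we chose: "avg" "stdev"
is.data.frame(res)  # TRUE: summarize returns a table, not a bare number
```

Only the names on the left of each = change; the summary functions on the right determine the values.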
Run the sample code to see how the summarize() function works.
library(dplyr)
library(dslabs)
data(heights)
data(murders)
# Add the rate column (murders per 100,000 people)
murders <- murders %>% mutate(rate = total / population * 100000)
# Median, minimum, and maximum height for females, one column each
heights %>%
  filter(sex == "Female") %>%
  summarize(median = median(height), minimum = min(height), maximum = max(height))
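# quantile() returns three values here (minimum, median, maximum), not one.
# How summarize() handles this depends on the dplyr version: releases before
# 1.0.0 raise an error, 1.0.x returns one row per value, and 1.1.0 and later
# warn and recommend reframe() for multi-value results.
heights %>%
  filter(sex == "Female") %>%
  summarize(range = quantile(height, c(0, 0.5, 1)))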
# Summarize mean of rate in murders dataset.
murders %>% summarize(mean(rate))
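The quantile call in the sample code puts three values under a single column name. One way to get a single row with one named column per value, sketched here under the assumption that this is the layout you want, is to request each quantile separately inside summarize. The data frame toy and its values are hypothetical, chosen so the result is easy to verify by hand.

```r
library(dplyr)

# Hypothetical toy data, invented for illustration
toy <- data.frame(height = c(60, 62, 64, 66, 68))

# One quantile per column; names = FALSE drops the "0%", "50%", "100%" labels
res <- toy %>%
  summarize(
    minimum = quantile(height, 0, names = FALSE),
    median  = quantile(height, 0.5, names = FALSE),
    maximum = quantile(height, 1, names = FALSE)
  )

res  # one row: minimum = 60, median = 64, maximum = 68
```

This gives the same three numbers as min, median, and max, and each summary returns exactly one value, so the result does not depend on how a given dplyr version treats multi-value summaries.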