summarize(): Compute summary statistics

An important part of exploratory data analysis is summarizing data. The average and standard deviation are two examples of widely used summary statistics. More informative summaries can often be achieved by first splitting data into groups. In this section, we cover two new dplyr verbs that make these computations easier: summarize and group_by. We also learn to access the resulting values using the pull function.

5.7.1 summarize

The summarize function in dplyr provides a way to compute summary statistics with intuitive and readable code. We start with a simple example based on heights. The heights dataset includes heights and sex reported by students in an in-class survey.

library(dplyr)
library(dslabs)
data(heights)

The following code computes the average and standard deviation for females:

s <- heights %>% 
  filter(sex == "Female") %>%
  summarize(average = mean(height), standard_deviation = sd(height))
s
#>   average standard_deviation
#> 1    64.9               3.76

This takes our original data table as input, filters it to keep only females, and then produces a new summarized table with just the average and the standard deviation of heights. We get to choose the names of the columns of the resulting table. For example, above we decided to use average and standard_deviation, but we could have used other names just the same.
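
To make the point about column names concrete, here is a minimal sketch using a small made-up data frame. The object toy, its values, and the column names avg and spread are assumptions chosen for illustration and are not part of the heights data. It shows that the names on the left of each = become the column names of the result, and that the result is itself a data frame.

```r
library(dplyr)

# Toy data frame (hypothetical values standing in for the heights data)
toy <- data.frame(
  sex    = c("Female", "Female", "Male", "Male"),
  height = c(62, 66, 68, 72)
)

# Same pattern as above, but with different names for the summary columns
s2 <- toy %>%
  filter(sex == "Female") %>%
  summarize(avg = mean(height), spread = sd(height))

is.data.frame(s2)  # the summary is a data frame with a single row
names(s2)          # "avg" "spread"
s2$avg             # columns can be accessed with $ like any data frame
```

Because the result is an ordinary data frame, it can be passed on to further dplyr verbs or inspected with the usual accessors.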

Instruction

Run the sample code to see how the summarize() function works.

library(dplyr)
library(dslabs)
data(heights)
data(murders)
murders <- murders %>% mutate(rate = total/population*100000) # Add the rate column

heights %>% 
  filter(sex == "Female") %>%
  summarize(median = median(height), minimum = min(height), maximum = max(height))

heights %>% 
  filter(sex == "Female") %>%
  summarize(range = quantile(height, c(0, 0.5, 1)))

# Summarize mean of rate in murders dataset.
murders %>% summarize(mean(rate))
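
The quantile call in the sample code returns three values from a single summarize expression. How dplyr handles that has changed across versions, and recent releases emit a deprecation warning for it. The following is a sketch of one alternative that requests each quantile as its own single-value column. The data frame toy and its values are made up for illustration.

```r
library(dplyr)

# Toy data (hypothetical values, not the dslabs heights data)
toy <- data.frame(height = c(60, 62, 64, 66, 70))

# Each expression returns exactly one value, so the result has one row.
# names = FALSE drops the "0%", "50%", "100%" labels that quantile() adds.
q <- toy %>%
  summarize(minimum = quantile(height, 0,   names = FALSE),
            median  = quantile(height, 0.5, names = FALSE),
            maximum = quantile(height, 1,   names = FALSE))
q
```

This keeps the one-row-per-summary shape, which makes the result easier to combine with other summaries.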
