23  Data Summarization

23.1 Descriptive Statistics

Here, we go over the basic syntax for obtaining common descriptive statistics. We will use the variables mpg and cyl from the mtcars data set. To view the first few rows of the data set, use head():

head(mtcars)
                   mpg cyl disp  hp drat    wt  qsec vs am gear carb
Mazda RX4         21.0   6  160 110 3.90 2.620 16.46  0  1    4    4
Mazda RX4 Wag     21.0   6  160 110 3.90 2.875 17.02  0  1    4    4
Datsun 710        22.8   4  108  93 3.85 2.320 18.61  1  1    4    1
Hornet 4 Drive    21.4   6  258 110 3.08 3.215 19.44  1  0    3    1
Hornet Sportabout 18.7   8  360 175 3.15 3.440 17.02  0  0    3    2
Valiant           18.1   6  225 105 2.76 3.460 20.22  1  0    3    1

We treat mpg as a continuous variable and cyl as a categorical variable.

23.1.1 Point Estimates

The first basic statistics you can compute are point estimates, such as means and medians. Here we will calculate these estimates.

23.1.1.1 Mean

To obtain the mean, use mean(). You only need to supply the data as the x= argument:

mean(mtcars$mpg)
[1] 20.09062
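
The mtcars data set has no missing values, but real data often does. The sketch below uses a small hypothetical vector (not part of mtcars) to show how the na.rm= argument of mean() changes the result when a value is missing.

```r
# Hypothetical vector with one missing value (an assumption for illustration)
x <- c(21.0, 22.8, NA, 18.7)

# Without na.rm, the missing value propagates and the result is NA
mean_with_na <- mean(x)

# With na.rm = TRUE, the mean is computed from the three observed values
mean_observed <- mean(x, na.rm = TRUE)

mean_with_na
mean_observed
```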

23.1.1.2 Median

To obtain the median, use median(). You only need to supply the data as the x= argument:

median(mtcars$mpg)
[1] 19.2
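
To see why it can be worth reporting both statistics, the following sketch compares the mean and the median on a hypothetical vector with one extreme value. The numbers are made up for illustration and do not come from mtcars.

```r
# Hypothetical values with one extreme observation
y <- c(1, 2, 3, 4, 100)

mean_y <- mean(y)      # 22: pulled upward by the extreme value
median_y <- median(y)  # 3: the middle value is unaffected

mean_y
median_y
```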

23.1.1.3 Frequency

To obtain a frequency table, use table(). You only need to supply the data as the first argument:

table(mtcars$cyl)

 4  6  8 
11  7 14 
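
A frequency table can be stored and then reused. The following is a minimal sketch; the object name cyl_counts is an assumption chosen for illustration.

```r
# Store the frequency table so it can be reused
cyl_counts <- table(mtcars$cyl)

# Order the groups from most to least frequent
sort(cyl_counts, decreasing = TRUE)

# Extract the count for a single group by its label (labels are character strings)
n_eight <- cyl_counts[["8"]]
n_eight

# The counts add up to the number of rows in the data set
sum(cyl_counts)
```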

23.1.1.4 Proportion

To obtain the proportions for a frequency table, use prop.table(). Its first argument must be the result of table(), so call table() inside prop.table():

prop.table(table(mtcars$cyl))

      4       6       8 
0.34375 0.21875 0.43750 
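
Proportions are often easier to read as percentages. One possible way to convert them is sketched below; the object name cyl_prop is an assumption chosen for illustration.

```r
# Proportions for each cyl group
cyl_prop <- prop.table(table(mtcars$cyl))

# The proportions of a one-way table sum to 1
total_prop <- sum(cyl_prop)
total_prop

# Convert to percentages, rounded to one decimal place
cyl_pct <- round(100 * cyl_prop, 1)
cyl_pct
```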

23.1.2 Variability

In addition to point estimates, measures of variability are important to report because they describe the spread of the data. Here we will calculate several variability statistics.

23.1.2.1 Variance

To obtain the variance, use var(). You only need to supply the data as the x= argument:

var(mtcars$mpg)
[1] 36.3241

23.1.2.2 Standard deviation

To obtain the standard deviation, use sd(). You only need to supply the data as the x= argument:

sd(mtcars$mpg)
[1] 6.026948
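
The variance and standard deviation are closely linked. As a check, the sketch below computes the sample variance by hand (dividing by n - 1, which is what var() does) and confirms that the standard deviation is its square root. The names x, n, and manual_var are assumptions chosen for illustration.

```r
x <- mtcars$mpg
n <- length(x)

# Sample variance: sum of squared deviations from the mean, divided by n - 1
manual_var <- sum((x - mean(x))^2) / (n - 1)

# Both comparisons should return TRUE
all.equal(manual_var, var(x))
all.equal(sqrt(var(x)), sd(x))
```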

23.1.2.3 Max and Min

To obtain the max and min, use max() and min(), respectively. You only need to supply the data as the first argument:

max(mtcars$mpg)
[1] 33.9
min(mtcars$mpg)
[1] 10.4
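
Base R also has a few related helpers that may be convenient here. The sketch below uses range(), diff(), and which.max(); the object names are assumptions chosen for illustration.

```r
# Minimum and maximum in a single call
mpg_range <- range(mtcars$mpg)
mpg_range

# Width of the range (maximum minus minimum)
mpg_width <- diff(mpg_range)
mpg_width

# Row name of the car with the largest mpg
best_car <- rownames(mtcars)[which.max(mtcars$mpg)]
best_car
```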

23.1.2.4 Q1 and Q3

To obtain Q1 and Q3, use quantile(). Supply the data as the first argument and the desired quantile, as a decimal, with probs=. In the calls below, the probs= value is passed by position as the second argument:

quantile(mtcars$mpg, .25)
   25% 
15.425 
quantile(mtcars$mpg, .75)
 75% 
22.8 
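
Both quartiles can also be requested in one call by passing a vector to probs=. The sketch below does that and additionally shows IQR() and summary(), two base R functions that are not covered above; the object names are assumptions chosen for illustration.

```r
# Q1 and Q3 in a single call
quartiles <- quantile(mtcars$mpg, probs = c(0.25, 0.75))
quartiles

# Interquartile range: Q3 minus Q1
mpg_iqr <- IQR(mtcars$mpg)
mpg_iqr

# Min, Q1, median, mean, Q3, and max together
summary(mtcars$mpg)
```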

23.1.3 Associations

In statistics, we may be interested in how different variables are related to each other. These associations can be summarized with a numerical value.

23.1.3.1 Continuous and Continuous

When we measure the association between two continuous variables, we tend to use a correlation statistic. This statistic tells us how strongly the variables are linearly associated with each other. Essentially, as one variable increases, what happens to the other variable? Does it increase (positive association) or decrease (negative association)? To find the correlation in R, use cor(). You will need to specify x= and y=, the vectors for each variable. Find the correlation between mpg and hp from the mtcars data set.

cor(mtcars$mpg, mtcars$hp)
[1] -0.7761684
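
By default, cor() computes the Pearson correlation, which is the covariance divided by the product of the two standard deviations. The sketch below verifies this by hand. It also shows the method= argument of cor(), which is not used elsewhere in this section, to request a rank-based (Spearman) correlation. The names x, y, and manual_r are assumptions chosen for illustration.

```r
x <- mtcars$mpg
y <- mtcars$hp

# Pearson correlation computed from its definition
manual_r <- cov(x, y) / (sd(x) * sd(y))

# Should return TRUE
all.equal(manual_r, cor(x, y))

# Rank-based alternative for monotonic but not necessarily linear associations
spearman_r <- cor(x, y, method = "spearman")
spearman_r
```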

23.1.3.2 Categorical and Continuous

When one variable is categorical and the other is continuous, reporting an association is more nuanced. Most of the time you will discuss key differences between groups. Here, we will talk about how to get the means for different groups of data. Our continuous variable is mpg, and our categorical variable is cyl; both are from the mtcars data set. tapply() splits the data into groups and then calculates a statistic for each group. We only need to specify X=, the R object to split; INDEX=, a factor (or list of factors) indicating how to split the data; and FUN=, the function to apply to each group. Use tapply() to find the mean mpg for each cyl group: 4, 6, and 8.

tapply(mtcars$mpg, list(mtcars$cyl), mean)
       4        6        8 
26.66364 19.74286 15.10000 
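
The same split-and-summarize idea can be written in other ways in base R. The sketch below shows two alternatives that are not used above: aggregate() with a formula, which returns a data frame, and split() combined with sapply() to compute several statistics per group. The object names are assumptions chosen for illustration.

```r
# Mean mpg per cyl group, returned as a data frame
mpg_by_cyl <- aggregate(mpg ~ cyl, data = mtcars, FUN = mean)
mpg_by_cyl

# Several statistics per group: split mpg by cyl, then summarize each piece
mpg_stats <- sapply(
  split(mtcars$mpg, mtcars$cyl),
  function(v) c(mean = mean(v), sd = sd(v), n = length(v))
)
mpg_stats
```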

23.1.3.3 Categorical and Categorical

Reporting the association between two categorical variables can be challenging. If you have a \(2\times 2\) table, you can report a ratio measure of association, such as an odds ratio. Any other case is harder to summarize with a single number. You can report a hypothesis test to indicate an association, but it does not provide much information about the effect of each variable. You can also report row, column, or table proportions. Here we will talk about creating cross tables and reporting these proportions. To create a cross table, use table() and pass the two categorical variables as the first two arguments. Create a cross tabulation between cyl and carb from the mtcars data set.

table(mtcars$cyl, mtcars$carb)
   
    1 2 3 4 6 8
  4 5 6 0 0 0 0
  6 2 0 0 4 1 0
  8 0 4 3 6 0 1

Notice how the first argument is represented in the rows and the second argument is in the columns. Now create table proportions using both of the variables. Pass the table to prop.table(); here the table() call is nested directly inside it.

prop.table(table(mtcars$cyl, mtcars$carb))
   
          1       2       3       4       6       8
  4 0.15625 0.18750 0.00000 0.00000 0.00000 0.00000
  6 0.06250 0.00000 0.00000 0.12500 0.03125 0.00000
  8 0.00000 0.12500 0.09375 0.18750 0.00000 0.03125

To get the row proportions, use the argument margin = 1 within prop.table().

prop.table(table(mtcars$cyl, mtcars$carb), 
           margin = 1)
   
             1          2          3          4          6          8
  4 0.45454545 0.54545455 0.00000000 0.00000000 0.00000000 0.00000000
  6 0.28571429 0.00000000 0.00000000 0.57142857 0.14285714 0.00000000
  8 0.00000000 0.28571429 0.21428571 0.42857143 0.00000000 0.07142857

To get the column proportions, use the argument margin = 2 within prop.table().

prop.table(table(mtcars$cyl, mtcars$carb), 
           margin = 2)
   
            1         2         3         4         6         8
  4 0.7142857 0.6000000 0.0000000 0.0000000 0.0000000 0.0000000
  6 0.2857143 0.0000000 0.0000000 0.4000000 1.0000000 0.0000000
  8 0.0000000 0.4000000 1.0000000 0.6000000 0.0000000 1.0000000
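
Storing the cross table in a variable avoids repeating the table() call. The sketch below does that, adds totals with addmargins(), and checks that row and column proportions each sum to 1. As a hedged illustration of the \(2\times 2\) case, it also computes an odds ratio by hand for vs and am, two binary variables in mtcars that are not analyzed above. All object names are assumptions chosen for illustration.

```r
# Store the cross table once; naming the arguments labels the dimensions
cyl_carb <- table(cyl = mtcars$cyl, carb = mtcars$carb)

# Add row and column totals to the counts
addmargins(cyl_carb)

# Row proportions: each row sums to 1
row_prop <- prop.table(cyl_carb, margin = 1)
rowSums(row_prop)

# Column proportions: each column sums to 1
col_prop <- prop.table(cyl_carb, margin = 2)
colSums(col_prop)

# 2 x 2 example: odds ratio as the cross-product ratio of the cell counts
vs_am <- table(vs = mtcars$vs, am = mtcars$am)
odds_ratio <- (vs_am[1, 1] * vs_am[2, 2]) / (vs_am[1, 2] * vs_am[2, 1])
odds_ratio
```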

23.2 Summarizing with Tidyverse

library(magrittr)
library(tidyverse)
── Attaching packages ─────────────────────────────────────── tidyverse 1.3.2 ──
✔ ggplot2 3.4.0      ✔ purrr   1.0.0 
✔ tibble  3.1.8      ✔ dplyr   1.0.10
✔ tidyr   1.2.1      ✔ stringr 1.5.0 
✔ readr   2.1.3      ✔ forcats 0.5.2 
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ tidyr::extract()   masks magrittr::extract()
✖ dplyr::filter()    masks stats::filter()
✖ dplyr::lag()       masks stats::lag()
✖ purrr::set_names() masks magrittr::set_names()
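# Approach f: split the data frame by cyl, then run shapiro.test() on mpg in each piece
f <- function(x){
  mtcars %>% split(~.$cyl) %>% map(~shapiro.test(.$mpg)) 
  return(1)}
# Approach g: nest the data by cyl and store each test result in a list-column
g <- function(x){
  mtcars %>% group_by(cyl) %>% nest() %>% mutate(shapiro = map(data, ~shapiro.test(.$mpg)))
  return(1)}
# Both functions discard the test results and return 1, so only time and memory are compared
bench::mark(f(1),g(1))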
# A tibble: 2 × 6
  expression      min   median `itr/sec` mem_alloc `gc/sec`
  <bch:expr> <bch:tm> <bch:tm>     <dbl> <bch:byt>    <dbl>
1 f(1)        404.4µs  434.7µs    2242.   134.23KB    16.9 
2 g(1)         11.7ms   11.8ms      82.8    3.65MB     8.95
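
The benchmark above compares two ways of running a test within each group. For plain summary statistics, a grouped summary with dplyr is one common tidyverse pattern. The following is a minimal sketch, assuming dplyr is installed; the object name mpg_summary and the column names n, mean_mpg, and sd_mpg are assumptions chosen for illustration.

```r
library(dplyr)

# Group by cyl, then compute one row of summary statistics per group
mpg_summary <- mtcars %>%
  group_by(cyl) %>%
  summarise(
    n = n(),               # number of cars in the group
    mean_mpg = mean(mpg),  # group mean
    sd_mpg = sd(mpg)       # group standard deviation
  )

mpg_summary
```

The group means should match the tapply() results in Section 23.1.3.2.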