Data Processing

Isaac Quintanilla Salinas

UC Riverside

4/21/2022

Presentation Online

Presentation:

www.inqs.info/files/hiss_3/hiss_3.html

RMD:

www.inqs.info/files/hiss_3/hiss_3.qmd

Website:

www.inqs.info

Email:

iquin002@ucr.edu

Data Cleaning

dplyr

dplyr Functions

  • mutate() adds new variables
  • select() selects variables
  • filter() keeps the observations that satisfy a condition
  • if_else() returns one of two values depending on a condition
  • group_by() groups a data set by one or more variables
  • summarise() provides summaries of data

tidyr

tidyr Functions

  • pivot_longer() (formerly gather()) transforms the data from wide to long

  • pivot_wider() (formerly spread()) transforms the data from long to wide

  • separate() splits one variable into multiple variables

  • unite() merges multiple variables into one variable
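
unite() is the only function in this list without an example later in the deck, so here is a minimal sketch. The dates data frame is a hypothetical toy data set, not part of the workshop files.

```r
library(tidyr)

# Hypothetical toy data (not from the workshop files)
dates <- data.frame(year = c(2021, 2022), month = c(4, 11), day = c(21, 3))

# unite() pastes the listed columns together into one new column
united <- unite(dates, col = "date", year, month, day, sep = "-")
united

# separate() reverses the operation (the pieces come back as character)
split_again <- separate(united, col = date,
                        into = c("year", "month", "day"), sep = "-")
split_again
```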

Pipe Operator %>%

  • The pipe operator is the real power of tidyverse.

  • It takes the output of a function and uses it as input for another function.

  • Tidyverse works best when data frames (tibbles) are used as inputs.
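
As a small illustration of the bullets above, the sketch below writes the same two steps once as nested calls and once with the pipe. The object names nested and piped are chosen here for illustration only.

```r
library(dplyr)

# Nested calls are read from the inside out
nested <- head(select(mtcars, mpg, hp), n = 3)

# Piped calls are read from top to bottom
piped <- mtcars %>%
  select(mpg, hp) %>%
  head(n = 3)

# Both approaches give the same data frame
identical(nested, piped)
```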

Data Set

  • We will work on manipulating the mtcars data set

  • The code below prints the first three rows:

mtcars %>% 
  head(n=3)
               mpg cyl disp  hp drat    wt  qsec vs am gear carb
Mazda RX4     21.0   6  160 110 3.90 2.620 16.46  0  1    4    4
Mazda RX4 Wag 21.0   6  160 110 3.90 2.875 17.02  0  1    4    4
Datsun 710    22.8   4  108  93 3.85 2.320 18.61  1  1    4    1

mutate()

  • Adds a new variable to a data frame

  • Example:

mtcars %>% 
  mutate(log_mpg=log(mpg)) %>% 
  head(n=3)
               mpg cyl disp  hp drat    wt  qsec vs am gear carb  log_mpg
Mazda RX4     21.0   6  160 110 3.90 2.620 16.46  0  1    4    4 3.044522
Mazda RX4 Wag 21.0   6  160 110 3.90 2.875 17.02  0  1    4    4 3.044522
Datsun 710    22.8   4  108  93 3.85 2.320 18.61  1  1    4    1 3.126761

mutate()

  • Each argument adds one new variable

  • Example:

mtcars %>% 
  mutate(log_mpg=log(mpg),log_hp=log(hp)) %>% 
  head(n=3)
               mpg cyl disp  hp drat    wt  qsec vs am gear carb  log_mpg
Mazda RX4     21.0   6  160 110 3.90 2.620 16.46  0  1    4    4 3.044522
Mazda RX4 Wag 21.0   6  160 110 3.90 2.875 17.02  0  1    4    4 3.044522
Datsun 710    22.8   4  108  93 3.85 2.320 18.61  1  1    4    1 3.126761
                log_hp
Mazda RX4     4.700480
Mazda RX4 Wag 4.700480
Datsun 710    4.532599

select()

  • Selects the variables to keep in the data frame

  • Example:

mtcars %>% 
  mutate(log_mpg=log(mpg),log_hp=log(hp)) %>%
  select(mpg,log_mpg,hp,log_hp) %>% 
  head(n=3)
               mpg  log_mpg  hp   log_hp
Mazda RX4     21.0 3.044522 110 4.700480
Mazda RX4 Wag 21.0 3.044522 110 4.700480
Datsun 710    22.8 3.126761  93 4.532599
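
Besides listing names, select() also accepts helper functions and negative selections. The sketch below is an addition to the slides and uses starts_with(), which comes with dplyr; the object names are illustrative.

```r
library(dplyr)

# Keep every column whose name starts with "d"
d_cols <- mtcars %>% select(starts_with("d"))

# Drop columns by prefixing them with a minus sign
no_mpg_cyl <- mtcars %>% select(-mpg, -cyl)

names(d_cols)
names(no_mpg_cyl)
```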

filter()

  • Selects observations that satisfy a condition

  • Example:

mtcars %>% 
  mutate(log_mpg=log(mpg),log_hp=log(hp)) %>%
  select(mpg,log_mpg,hp,log_hp) %>%
  filter(log_hp<5) %>% 
  head(n=3)
               mpg  log_mpg  hp   log_hp
Mazda RX4     21.0 3.044522 110 4.700480
Mazda RX4 Wag 21.0 3.044522 110 4.700480
Datsun 710    22.8 3.126761  93 4.532599
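
In the example above the first three rows all satisfy the condition, so the output looks unchanged. The sketch below, added for illustration, uses stricter conditions on mtcars so that the effect is visible: conditions separated by commas must all hold, while | keeps rows where at least one holds.

```r
library(dplyr)

# Comma-separated conditions are combined with AND
small_efficient <- mtcars %>%
  filter(cyl == 4, mpg > 30)

# Use | when either condition is enough
either <- mtcars %>%
  filter(cyl == 4 | mpg > 30)

nrow(small_efficient)
nrow(either)
```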

if_else()

  • Returns the first value (here 1) if the condition is met and the second value (here 0) otherwise

  • Example:

mtcars %>% 
  mutate(log_mpg=log(mpg),log_hp=log(hp)) %>%
  select(mpg,log_mpg,hp,log_hp) %>%
  filter(log_hp<5) %>%
  mutate(hilhp=if_else(log_hp>mean(log_hp),1,0)) %>%
  head(n=3)
               mpg  log_mpg  hp   log_hp hilhp
Mazda RX4     21.0 3.044522 110 4.700480     1
Mazda RX4 Wag 21.0 3.044522 110 4.700480     1
Datsun 710    22.8 3.126761  93 4.532599     1
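
The two return values do not have to be 1 and 0. As a sketch, the code below returns text labels instead; the variable name hp_level and the labels "high" and "low" are made up for this illustration. Both return values should be of the same type, since if_else() is stricter about this than base ifelse().

```r
library(dplyr)

labelled <- mtcars %>%
  mutate(hp_level = if_else(hp > mean(hp), "high", "low")) %>%
  select(hp, hp_level)

head(labelled, n = 3)
```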

group_by()

  • Groups the data frame by one or more variables, so that later functions are applied within each group

  • Example:

mtcars %>% 
  mutate(log_mpg=log(mpg),log_hp=log(hp)) %>%
  select(mpg,log_mpg,hp,log_hp) %>%
  filter(log_hp<5) %>%
  mutate(hilhp=if_else(log_hp>mean(log_hp),1,0)) %>%
  group_by(hilhp) %>% 
  head(n=3)
# A tibble: 3 × 5
# Groups:   hilhp [1]
    mpg log_mpg    hp log_hp hilhp
  <dbl>   <dbl> <dbl>  <dbl> <dbl>
1  21      3.04   110   4.70     1
2  21      3.04   110   4.70     1
3  22.8    3.13    93   4.53     1
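
group_by() does not change the rows by itself; it only changes how later functions behave. As an illustrative sketch (grouping by cyl rather than the hilhp variable used above), mutate() on grouped data computes the mean within each group:

```r
library(dplyr)

grouped <- mtcars %>%
  group_by(cyl) %>%
  mutate(mean_mpg_cyl = mean(mpg)) %>%  # mean of mpg within each cyl group
  ungroup() %>%                         # remove the grouping when done
  select(mpg, cyl, mean_mpg_cyl)

head(grouped, n = 3)
```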

summarise()

  • Creates summary statistics for variables; with grouped data, one row is returned per group

  • Example:

mtcars %>% 
  mutate(log_mpg=log(mpg),log_hp=log(hp)) %>%
  select(mpg,log_mpg,hp,log_hp) %>%
  filter(log_hp<5) %>%
  mutate(hilhp=if_else(log_hp>mean(log_hp),1,0)) %>%
  group_by(hilhp) %>%
  summarise(mean_mpg=mean(mpg),mean_lmpg=mean(log_mpg),
            sd_mpg=sd(mpg),sd_lmpg=sd(log_mpg)) %>%
  head(n=3)
# A tibble: 2 × 5
  hilhp mean_mpg mean_lmpg sd_mpg sd_lmpg
  <dbl>    <dbl>     <dbl>  <dbl>   <dbl>
1     0     29.7      3.38   3.85   0.133
2     1     22.0      3.08   3.46   0.148

Wide to Long Example

Wide to Long Data Example

We will convert data from wide to long using the functions in the tidyr package. Many statistical analyses require long data.

Load Data

Use the read_csv() function to read data_3_4.csv into an object called data1:

data1 <- read_csv(file="http://www.inqs.info/files/hiss_3/data_3_4.csv")

Wide Data

 [1] "ID1"       "v1/mean"   "v1/sd"     "v1/median" "v2/mean"   "v2/sd"    
 [7] "v2/median" "v3/mean"   "v3/sd"     "v3/median" "v4/mean"   "v4/sd"    
[13] "v4/median"
# A tibble: 6 × 13
  ID1   v1/me…¹ `v1/sd` v1/me…² v2/me…³ `v2/sd` v2/med…⁴ v3/me…⁵ `v3/sd` v3/me…⁶
  <chr>   <dbl>   <dbl>   <dbl>   <dbl>   <dbl>    <dbl>   <dbl>   <dbl>   <dbl>
1 Ad91…   3.11    2.86     4.50   1.93    3.21   3.27       2.65  -0.383    3.23
2 A9c5…   2.03    2.90     2.08   0.709   2.27   4.13       1.45   2.01     2.84
3 A28a…  -0.415   2.42     2.47   2.38   -0.820  1.22       3.44   1.63     2.10
4 Aaf5…   1.25    2.24     3.71   4.00    0.456  4.32       1.54   0.789    4.08
5 A370…  -0.984   0.972    3.73   2.19   -0.184  2.14       4.32  -0.804    5.38
6 Aea9…   1.42    1.34     2.35   2.77    4.16  -0.00874   -3.02   4.25     6.36
# … with 3 more variables: `v4/mean` <dbl>, `v4/sd` <dbl>, `v4/median` <dbl>,
#   and abbreviated variable names ¹​`v1/mean`, ²​`v1/median`, ³​`v2/mean`,
#   ⁴​`v2/median`, ⁵​`v3/mean`, ⁶​`v3/median`

Long Data

# A tibble: 10 × 5
   ID1       time    mean     sd  median
   <chr>     <chr>  <dbl>  <dbl>   <dbl>
 1 Ad9131ee9 v1     3.11   2.86   4.50  
 2 Ad9131ee9 v2     1.93   3.21   3.27  
 3 Ad9131ee9 v3     2.65  -0.383  3.23  
 4 Ad9131ee9 v4     0.605  0.883  4.65  
 5 A9c5988ea v1     2.03   2.90   2.08  
 6 A9c5988ea v2     0.709  2.27   4.13  
 7 A9c5988ea v3     1.45   2.01   2.84  
 8 A9c5988ea v4     0.710  3.03  -0.0898
 9 A28a5479d v1    -0.415  2.42   2.47  
10 A28a5479d v2     2.38  -0.820  1.22  

pivot_longer()

  • The pivot_longer() function takes the variables that are repeated within an observation and stacks them into one variable:

data1 %>% 
  pivot_longer(cols=`v1/mean`:`v4/median`,names_to = "measurement",values_to = "value") %>% 
  head()
# A tibble: 6 × 3
  ID1       measurement value
  <chr>     <chr>       <dbl>
1 Ad9131ee9 v1/mean      3.11
2 Ad9131ee9 v1/sd        2.86
3 Ad9131ee9 v1/median    4.50
4 Ad9131ee9 v2/mean      1.93
5 Ad9131ee9 v2/sd        3.21
6 Ad9131ee9 v2/median    3.27

separate()

  • The separate() function splits one variable into multiple variables:

data1 %>% 
  pivot_longer(cols=`v1/mean`:`v4/median`,names_to = "measurement",values_to = "value") %>% 
  separate(col=measurement,into=c("time","stat"),sep="/") %>% 
  head()
# A tibble: 6 × 4
  ID1       time  stat   value
  <chr>     <chr> <chr>  <dbl>
1 Ad9131ee9 v1    mean    3.11
2 Ad9131ee9 v1    sd      2.86
3 Ad9131ee9 v1    median  4.50
4 Ad9131ee9 v2    mean    1.93
5 Ad9131ee9 v2    sd      3.21
6 Ad9131ee9 v2    median  3.27

pivot_wider()

  • The pivot_wider() function then converts the long data back to wide data, with one column per statistic:

data1 %>% 
  pivot_longer(cols=`v1/mean`:`v4/median`,names_to = "measurement",values_to = "value") %>% 
  separate(col=measurement,into=c("time","stat"),sep="/") %>% 
  pivot_wider(names_from = stat,values_from = value) %>% 
  head()
# A tibble: 6 × 5
  ID1       time   mean     sd median
  <chr>     <chr> <dbl>  <dbl>  <dbl>
1 Ad9131ee9 v1    3.11   2.86    4.50
2 Ad9131ee9 v2    1.93   3.21    3.27
3 Ad9131ee9 v3    2.65  -0.383   3.23
4 Ad9131ee9 v4    0.605  0.883   4.65
5 A9c5988ea v1    2.03   2.90    2.08
6 A9c5988ea v2    0.709  2.27    4.13
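
As an alternative to the three steps above, pivot_longer() can do the whole reshaping in one call by using the special ".value" name together with names_sep. This is a sketch on a hypothetical two-row stand-in for data1 (the values are made up), not the workshop's own code.

```r
library(tidyr)

# Hypothetical stand-in for data1 with the same naming pattern
toy <- data.frame(
  ID1 = c("a", "b"),
  `v1/mean` = c(1.0, 2.0),
  `v1/sd`   = c(0.1, 0.2),
  `v2/mean` = c(3.0, 4.0),
  `v2/sd`   = c(0.3, 0.4),
  check.names = FALSE   # keep the "/" in the column names
)

# The part before "/" goes to "time"; the part after "/" becomes
# the name of a value column (that is what ".value" means)
long <- pivot_longer(
  toy,
  cols = -ID1,
  names_to = c("time", ".value"),
  names_sep = "/"
)
long
```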

Graphics

ggplot2

Basics

  • ggplot2 creates a plot by layering graphical elements on top of a base plot

  • A base plot is created with the data

    • The data must be a data frame or tibble
  • Additional layers are added to the base plot with the + sign

Using ggplot2

  • Create Base Plot

  • Add geometrical Elements

  • Customize Plot

  • Google

Base Plot

  • A base plot is created using ggplot()

    • data: specifies the data frame used to construct the base plot

    • mapping: specifies the aesthetic mapping for the plot

      • aes(): constructs the aesthetic mapping

base_plot <- ggplot(mtcars, aes(x=mpg))

Base Plot

base_plot

Univariate

  • Histograms
    • geom_histogram()
  • Density Plots
    • geom_density()
  • QQ Plots
    • geom_qq()
    • geom_qq_line()

Histograms

base_plot + geom_histogram()

Density Plot

base_plot + geom_density()

QQ Plot

ggplot(mtcars, aes(sample = mpg)) + 
  geom_qq() + 
  geom_qq_line()

Bivariate

  • Scatter Plot
    • geom_point()
  • Line Plot
    • geom_line()

Bivariate Base Plot

base_plot2 <- ggplot(mtcars, aes(x=mpg, y = hp))
base_plot2

Scatter Plot

base_plot2 + geom_point()

Line Plot

base_plot2 + geom_line()

Line & Scatter Plot

base_plot2 + 
  geom_point() +
  geom_line()

Special Cases

Bivariate

  • Heat Map
    • geom_bin2d()
  • Contour Map
    • geom_density_2d()

Trivariate

  • Heat Map
    • geom_contour_filled()
  • Contour Map
    • geom_contour()
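
The trivariate geoms are not demonstrated on later slides because mtcars has no gridded third variable. As a sketch, the code below uses faithfuld, a data set that ships with ggplot2 and holds a density estimate on a regular grid; the object names are illustrative.

```r
library(ggplot2)

# faithfuld: gridded 2D density estimate of the Old Faithful data
# x and y define the grid, z is the third variable
tri_base <- ggplot(faithfuld, aes(x = waiting, y = eruptions, z = density))

# Contour map: lines of equal density
contour_plot <- tri_base + geom_contour()

# Heat map: filled bands between contour levels
filled_plot <- tri_base + geom_contour_filled()

contour_plot
filled_plot
```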

Heat Map

base_plot2 + geom_bin2d()

Contour Map

base_plot2 + 
  geom_density_2d()

Trend Lines

  • Regression Line

    • geom_smooth(method = "lm")
  • LOESS

    • geom_smooth()

Regression Line

base_plot2 + 
  geom_point() +
  geom_smooth(method = "lm")

LOESS Line

base_plot2 + 
  geom_point() +
  geom_smooth()

Grouping Plots

  • Faceting: splits the plot into panels, one for each level of a categorical variable

    • facet_grid()

    • facet_wrap()

  • Grouping can be done within the mapping function: aes()

    • color

    • group

    • shape

Facet

ggplot(mtcars, aes(x = mpg, y = hp)) +
  geom_point() +
  facet_grid(vars(cyl))

Mapping

ggplot(mtcars, aes(x = mpg, y = hp, col = factor(cyl))) +
  geom_point() 
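
The two slides above show facet_grid() and color. As a complementary sketch, the code below uses facet_wrap() and the shape aesthetic; the choice of am for the shape is only for illustration.

```r
library(ggplot2)

wrapped <- ggplot(mtcars, aes(x = mpg, y = hp, shape = factor(am))) +
  geom_point() +
  facet_wrap(vars(cyl))   # one panel per value of cyl, wrapped into rows

wrapped
```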

Customization

  • Title
    • ggtitle()
  • Labels
    • X Label: xlab()
    • Y Label: ylab()
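
The title and both axis labels can also be set in a single call with labs(), which is part of ggplot2. The sketch below is an alternative to the three functions listed above; the label text is made up for illustration.

```r
library(ggplot2)

labelled_plot <- ggplot(mtcars, aes(x = mpg, y = hp)) +
  geom_point() +
  labs(title = "Horsepower by Fuel Economy",
       x = "Miles Per Gallon",
       y = "Horsepower")

labelled_plot
```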

Themes

  • The theme() function allows you to change any non-data component of the plot

  • ggplot2 has several prebuilt themes:

    • theme_bw()

    • theme_void()

  • Legends can be adjusted using the scale_XX_YY() functions

    • XX: the aesthetic used for grouping (for example, color or shape)

    • YY: the type of scale (for example, discrete or manual)
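
A minimal sketch combining the pieces above: scale_color_manual() is one instance of the scale_XX_YY() pattern (XX = color, YY = manual), followed by a prebuilt theme and one theme() adjustment. The legend title and the colors are arbitrary choices for this illustration.

```r
library(ggplot2)

themed <- ggplot(mtcars, aes(x = mpg, y = hp, color = factor(cyl))) +
  geom_point() +
  scale_color_manual(
    name = "Cylinders",                      # legend title
    values = c("4" = "steelblue",            # one color per level of cyl
               "6" = "darkorange",
               "8" = "firebrick")
  ) +
  theme_bw() +                               # prebuilt theme
  theme(legend.position = "bottom")          # adjust a single component

themed
```

Calling theme() after theme_bw() matters here, because a prebuilt theme resets the components it defines.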

Advanced Example

  • Base Plot

  • Scatter Plot

  • Add Regression Line

  • Split The Plot

  • Change the Labels

  • Adjust the Legend

  • Change the theme

Plot Code

ggplot(mtcars, 
       aes(mpg, hp, 
           color = factor(vs))) 

Plot Code

ggplot(mtcars, 
       aes(mpg, hp, 
           color = factor(vs))) +
  geom_point()

Plot Code

ggplot(mtcars, 
       aes(mpg, hp, 
           color = factor(vs))) +
  geom_point()+
  geom_smooth(method = "lm") 

Plot Code

ggplot(mtcars, 
       aes(mpg, hp, 
           color = factor(vs))) +
  geom_point()+
  geom_smooth(method = "lm") +
  facet_grid(cols = vars(am), 
    labeller = as_labeller(c(
      `1` = "Manual",
      `0` =  "Automatic"))) 

Plot Code

ggplot(mtcars, 
       aes(mpg, hp, 
           color = factor(vs))) +
  geom_point()+
  geom_smooth(method = "lm") +
  facet_grid(cols = vars(am), 
    labeller = as_labeller(c(
      `1` = "Manual",
      `0` =  "Automatic"))) +
  ggtitle("Mtcars Plot") + 
  xlab("Miles Per Gallon") +
  ylab("Horse Power") 

Plot Code

ggplot(mtcars, 
       aes(mpg, hp, 
           color = factor(vs))) +
  geom_point()+
  geom_smooth(method = "lm") +
  facet_grid(cols = vars(am), 
    labeller = as_labeller(c(
      `1` = "Manual",
      `0` =  "Automatic"))) + 
  ggtitle("Mtcars Plot") + 
  xlab("Miles Per Gallon") + 
  ylab("Horse Power") +
  scale_color_discrete(
    labels = c("V-Shaped", "Straight"),
    name = "")

Plot Code

ggplot(mtcars, 
       aes(mpg, hp, 
           color = factor(vs))) +
  geom_point()+
  geom_smooth(method = "lm") +
  facet_grid(cols = vars(am), 
    labeller = as_labeller(c(
      `1` = "Manual",
      `0` =  "Automatic"))) + 
  ggtitle("Mtcars Plot") + 
  xlab("Miles Per Gallon") + 
  ylab("Horse Power") +
  scale_color_discrete(
    labels = c("V-Shaped", "Straight"),
    name = "") +
  theme_bw()

Final Thoughts

  • Google is your friend!

  • Practice!

  • Read the documentation!

  • Utilize Cheatsheets!

Resources