Basic readr

Functions tibble and read_*

  • The read_* functions guess column types for you when reading data, and they always return a tibble.

  • A tibble is a “nicer” and more honest data.frame: it never changes the names or types of its inputs and only recycles inputs of length 1.

  • Will this turn out as you imagined?

data.frame(x=list("A"="a", "B"="b"), # length 2
           y=c(1,2,3)) # length 3
tibble(x=list("A"="a", "B"="b"),
       y=c(1,2,3))

data.frame(x=list("A"="a", "B"="b"),
           y=list("C"="c", "D"="d"))
tibble(x=list("A"="a", "B"="b"),
       y=list("C"="c", "D"="d"))
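
One way to check your guess is to run both versions and compare what comes back. The following is a minimal sketch, assuming the tibble package is installed; the helper try_build is made up for this example and is not part of any package.

```r
library(tibble)

# Hypothetical helper: evaluate an expression and report whether it errored.
try_build <- function(expr) {
  tryCatch({ force(expr); "ok" }, error = function(e) "error")
}

# data.frame() spreads the list into columns x.A and x.B
# and recycles the single row to match the 3 values of y.
df <- data.frame(x = list("A" = "a", "B" = "b"), y = c(1, 2, 3))
names(df)  # "x.A" "x.B" "y"
dim(df)    # 3 3

# tibble() keeps x as a list column of length 2,
# which is not compatible with y of length 3.
try_build(tibble(x = list("A" = "a", "B" = "b"), y = c(1, 2, 3)))  # "error"

# With equal lengths tibble() keeps two list columns of length 2.
tb <- tibble(x = list("A" = "a", "B" = "b"), y = list("C" = "c", "D" = "d"))
dim(tb)    # 2 2
```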

read_*

  • Faster than the base-R read.* functions and used in the same way.
  • Do not convert strings to factors.
  • Do not rewrite your column names to make them syntactic.
  • Do not depend on environment variables in your system, which helps reproducibility across machines and systems.
  • Useful arguments: col_types (the counterpart of colClasses in read.*) and guess_max.
  • We will complain if you use read.csv instead of read_csv.
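
The two arguments named above can be tried without any file on disk. This is a small sketch, assuming readr 2.0 or later (needed for literal data wrapped in I() and for show_col_types); the data and the column names id, name and score are invented for the example.

```r
library(readr)

# Literal CSV data wrapped in I(), so no file is needed.
csv_text <- I("id,name,score\n1,anna,3.5\n2,bo,4\n")

# col_types fixes the column types instead of letting readr guess them.
scores <- read_csv(
  csv_text,
  col_types = cols(
    id    = col_integer(),
    name  = col_character(),
    score = col_double()
  )
)

# guess_max sets how many rows readr inspects when it has to guess the types.
guessed <- read_csv(csv_text, guess_max = 1000, show_col_types = FALSE)

class(scores)       # includes "tbl_df", i.e. a tibble
class(scores$name)  # "character", not a factor
```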

readxl::read_excel

  • Excel files are special and need their own package, readxl.
  • The readxl::read_excel function reads .xls and .xlsx files.
  • The sheet argument is unique to read_excel.
  • Will be used in Homework 2.
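
To try the sheet argument before the homework, readxl ships with example workbooks. A short sketch, assuming readxl is installed; it uses the bundled file datasets.xlsx rather than any course data.

```r
library(readxl)

# readxl_example() returns the path to a workbook bundled with readxl.
path <- readxl_example("datasets.xlsx")

# List the sheets, then read one of them by name.
excel_sheets(path)
cars <- read_excel(path, sheet = "mtcars")
head(cars, n = 4)
```

The sheet can also be given by position, for example sheet = 2.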

Basic dplyr

arrange arranges the rows in the data

mtcars <- tibble::rownames_to_column(mtcars, var = "model")
kable(head(arrange(mtcars, mpg), n = 4))
model                 mpg  cyl  disp   hp  drat     wt   qsec  vs  am  gear  carb
Cadillac Fleetwood   10.4    8   472  205  2.93  5.250  17.98   0   0     3     4
Lincoln Continental  10.4    8   460  215  3.00  5.424  17.82   0   0     3     4
Camaro Z28           13.3    8   350  245  3.73  3.840  15.41   0   0     3     4
Duster 360           14.3    8   360  245  3.21  3.570  15.84   0   0     3     4

arrange arranges the rows in the data

kable(head(arrange(mtcars, mpg, disp), n = 4))
model                 mpg  cyl  disp   hp  drat     wt   qsec  vs  am  gear  carb
Lincoln Continental  10.4    8   460  215  3.00  5.424  17.82   0   0     3     4
Cadillac Fleetwood   10.4    8   472  205  2.93  5.250  17.98   0   0     3     4
Camaro Z28           13.3    8   350  245  3.73  3.840  15.41   0   0     3     4
Duster 360           14.3    8   360  245  3.21  3.570  15.84   0   0     3     4
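
The second sort key only matters for ties, as the two 10.4 mpg cars show. The sketch below combines that with descending order using dplyr's desc(); it assumes dplyr and tibble are installed and builds its own copy of the data (called cars here) so that it runs on its own.

```r
library(dplyr)

cars <- tibble::rownames_to_column(datasets::mtcars, var = "model")

# Highest mpg first; ties in mpg are broken by disp in ascending order.
top <- head(arrange(cars, desc(mpg), disp), n = 4)
top$model
# "Toyota Corolla" "Fiat 128" "Honda Civic" "Lotus Europa"
```

Honda Civic and Lotus Europa both have 30.4 mpg, so the smaller disp of the Honda Civic puts it first.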

filter selects rows (observations)

# only those with manual transmission (am == 1)
kable(head(filter(mtcars, am == 1), n=4))
model           mpg  cyl   disp   hp  drat     wt   qsec  vs  am  gear  carb
Mazda RX4      21.0    6  160.0  110  3.90  2.620  16.46   0   1     4     4
Mazda RX4 Wag  21.0    6  160.0  110  3.90  2.875  17.02   0   1     4     4
Datsun 710     22.8    4  108.0   93  3.85  2.320  18.61   1   1     4     1
Fiat 128       32.4    4   78.7   66  4.08  2.200  19.47   1   1     4     1

filter selects rows (observations)

kable(head(filter(mtcars, mpg < 30), n=4))
model            mpg  cyl  disp   hp  drat     wt   qsec  vs  am  gear  carb
Mazda RX4       21.0    6   160  110  3.90  2.620  16.46   0   1     4     4
Mazda RX4 Wag   21.0    6   160  110  3.90  2.875  17.02   0   1     4     4
Datsun 710      22.8    4   108   93  3.85  2.320  18.61   1   1     4     1
Hornet 4 Drive  21.4    6   258  110  3.08  3.215  19.44   1   0     3     1
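
The two conditions shown so far can also be used together. A minimal sketch, assuming dplyr and tibble are installed; the names cars, manual_under_30 and same are chosen only for this example.

```r
library(dplyr)

cars <- tibble::rownames_to_column(datasets::mtcars, var = "model")

# Comma-separated conditions are combined with AND.
manual_under_30 <- filter(cars, am == 1, mpg < 30)

# The same filter written with an explicit &.
same <- filter(cars, am == 1 & mpg < 30)

nrow(manual_under_30)                          # 9
identical(manual_under_30$model, same$model)   # TRUE
```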

mutate introduces new variables or transforms existing ones

kable(head(mutate(mtcars, lpm = 235 / mpg), n=4))
model            mpg  cyl  disp   hp  drat     wt   qsec  vs  am  gear  carb       lpm
Mazda RX4       21.0    6   160  110  3.90  2.620  16.46   0   1     4     4  11.19048
Mazda RX4 Wag   21.0    6   160  110  3.90  2.875  17.02   0   1     4     4  11.19048
Datsun 710      22.8    4   108   93  3.85  2.320  18.61   1   1     4     1  10.30702
Hornet 4 Drive  21.4    6   258  110  3.08  3.215  19.44   1   0     3     1  10.98131
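
The constant 235 is a rounded unit conversion. The sketch below shows where it comes from, assuming mpg means miles per US gallon (as ?mtcars states) and assuming lpm is meant as litres per 100 km, which the slides do not say explicitly. The variable names are made up for the example.

```r
# Exact definitions: 1 US gallon = 3.785411784 litres, 1 mile = 1.609344 km.
litres_per_gallon <- 3.785411784
km_per_mile <- 1.609344

# At x miles per gallon, 100 km takes (100 / km_per_mile) / x gallons,
# so litres per 100 km = (100 * litres_per_gallon / km_per_mile) / x.
mpg_to_l100km <- 100 * litres_per_gallon / km_per_mile
mpg_to_l100km        # about 235.2

# Fuel consumption of a 21 mpg car in litres per 100 km.
mpg_to_l100km / 21   # about 11.2
```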

select selects variables (columns)

kable(head(select(mtcars, model, mpg), n=4))
model            mpg
Mazda RX4       21.0
Mazda RX4 Wag   21.0
Datsun 710      22.8
Hornet 4 Drive  21.4

Pipe %>%

Which function receives the argument x=3?

df <- f1(f2(f3(f4(mtcars, x=2), x), x=3))

This is not easy to read! How can we solve it?

Pipe %>%

Determine \((h \circ g \circ f)(a) = h(g(f(a)))\)

Three different ways to calculate this in R:

# 1. Intermediate variables
b <- f(a)
c <- g(b)
h(c)

# 2. Nested calls
h(g(f(a)))

# 3. Pipe
a %>%
    f() %>%
    g() %>%
    h()
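
To see that the three forms agree, f, g and h need concrete definitions. The functions below are arbitrary stand-ins invented for this sketch, which assumes the magrittr package (it provides %>% and is also loaded by dplyr).

```r
library(magrittr)

# Hypothetical stand-ins for f, g and h.
f <- function(x) x + 1
g <- function(x) x * 2
h <- function(x) sqrt(x)
a <- 7

# 1. Intermediate variables
step1 <- f(a)
step2 <- g(step1)
way1 <- h(step2)

# 2. Nested calls
way2 <- h(g(f(a)))

# 3. Pipe
way3 <- a %>% f() %>% g() %>% h()

c(way1, way2, way3)  # 4 4 4
```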

Pipe %>%

# 1. Intermediate variables (new names, so mtcars itself is not overwritten)
cars_lpm <- mutate(mtcars, lpm = 235 / mpg)
cars_manual <- filter(cars_lpm, am == 1)
ggplot(cars_manual, aes(x = hp, y = lpm)) + geom_point()

# 2. Nested calls
ggplot(
    filter(
        mutate(mtcars, lpm = 235 / mpg),
        am == 1),
    aes(x = hp, y = lpm)) + geom_point()

# 3. Pipe
mtcars %>%
    mutate(lpm = 235 / mpg) %>%
    filter(am == 1) %>%
    ggplot(aes(x = hp, y = lpm)) + geom_point()

Basic ggplot2

ggplot2

A statistical plot has components

  • data
  • geom: type of geometric objects (points, lines, …)
  • coord: coordinate system
  • mapping: binds data to the dimensions/“aesthetics” of the coordinate system (position, color, shape, size, …)

ggplot2

A scatterplot

  • data: mpg and hp for a number of cars
  • geom: points
  • coord: Cartesian
  • mapping: binds hp to position on the x axis and mpg to the y axis

ggplot2

ggplot(data = mtcars, mapping = aes(x = hp, y = mpg)) + geom_point()
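
The four components can be labelled in the call itself. In this sketch coord_cartesian() is written out only to make the coordinate system visible; it is ggplot2's default, so the plot is the same without it.

```r
library(ggplot2)

p <- ggplot(data = datasets::mtcars,           # data
            mapping = aes(x = hp, y = mpg)) +  # mapping
  geom_point() +                               # geom
  coord_cartesian()                            # coord (the default, spelled out)
p
```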

ggplot2 with some color and sizes

ggplot(mtcars,
       aes(x = hp, y = mpg, size = wt, color = cyl)) +
    geom_point()

What is cyl?

ggplot2: the types in mtcars matter!

ggplot(mtcars,
       aes(x = hp, y = mpg, size = wt, color = as.factor(cyl))) +
    geom_point()
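
A quick check of the column type answers the question about cyl. In mtcars it is stored as a number, so ggplot2 picks a continuous colour scale for it; after as.factor() it has three levels and gets a discrete scale.

```r
class(datasets::mtcars$cyl)              # "numeric"
levels(as.factor(datasets::mtcars$cyl))  # "4" "6" "8"
```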

ggplot2 arguments outside of aes but in a geom

# ERROR: object 'cyl' not found
ggplot(mtcars,
       aes(x = hp, y = mpg, size = wt)) +
    geom_point(color = cyl)

# OK: a constant color is set outside aes
ggplot(mtcars,
       aes(x = hp, y = mpg, size = wt)) +
    geom_point(color = "red")

aes looks up names in mtcars, but arguments given directly to a geom are not looked up there!
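
If the color should depend on a column, the column has to go through aes, which may also be placed inside the geom. A small sketch, assuming ggplot2 is installed; layer_data() is used only to inspect what the layer will draw.

```r
library(ggplot2)

# Wrapping the column in aes() inside the geom makes ggplot2 look it up in the data.
p <- ggplot(datasets::mtcars, aes(x = hp, y = mpg, size = wt)) +
  geom_point(aes(color = as.factor(cyl)))

# layer_data() returns the data the first layer actually draws.
length(unique(layer_data(p)$colour))  # 3, one colour per cyl value
```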

ggplot2

Be careful about reusing names in columns, variables, etc.

cyl <- "blue"
ggplot(mtcars,
       aes(x = hp, y = mpg, size = wt)) +
    geom_point(color = cyl)

Error?
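
One way to find out is to build the plot and look at the colours the layer ends up with. A sketch under the same assumptions as the slide: cyl is defined in the global environment, so geom_point finds that variable and never looks at the mtcars column.

```r
library(ggplot2)

cyl <- "blue"
p <- ggplot(datasets::mtcars, aes(x = hp, y = mpg, size = wt)) +
  geom_point(color = cyl)

# Every point gets the constant colour taken from the global variable.
unique(layer_data(p)$colour)  # "blue"
```

There is no error, which is exactly why reused names are risky: the plot is drawn, just not with the colours you may have meant.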