readr
tibble and read_*
The read_* functions are clever about reading data (they guess column types, for example) and always return a tibble.
A tibble is a “nicer” and more honest data.frame.
Will this turn out as you imagined?
data.frame(x = list("=" = "=", "B" = "b"),  # length 2
           y = c(1, 2, 3))                   # length 3
tibble(x = list("=" = "=", "B" = "b"), y = c(1, 2, 3))
data.frame(x = list("A" = "a", "B" = "b"), y = list("C" = "c", "D" = "d"))
tibble(x = list("A" = "a", "B" = "b"), y = list("C" = "c", "D" = "d"))
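The difference can be checked directly. The following is a minimal sketch, assuming the tibble package is installed; it uses the names "A" and "B" throughout and wraps the failing call in tryCatch so the script keeps running.

```r
library(tibble)

# data.frame() spreads the list over several columns and recycles silently.
df1 <- data.frame(x = list(A = "a", B = "b"), y = c(1, 2, 3))
dim(df1)  # 3 rows, 3 columns

# tibble() keeps the list as one column and refuses to combine
# a length-2 column with a length-3 column.
res <- tryCatch(
  tibble(x = list(A = "a", B = "b"), y = c(1, 2, 3)),
  error = function(e) e
)
inherits(res, "error")  # TRUE

# Same input, very different shapes.
df2 <- data.frame(x = list(A = "a", B = "b"), y = list(C = "c", D = "d"))
tb2 <- tibble(x = list(A = "a", B = "b"), y = list(C = "c", D = "d"))
dim(df2)  # 1 row, 4 columns
dim(tb2)  # 2 rows, 2 list-columns
```

The tibble either does what the call literally says or stops with an error, which is what "more honest" refers to here.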
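read_*
The read_* functions are the counterparts of base R's read.* functions and are used in the same way.
Arguments worth knowing: col_types (colClasses from read.*) and guess_max.
Take care not to type read.csv instead of read_csv.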
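As a rough illustration of col_types and guess_max, the sketch below writes a small made-up CSV to a temporary file and reads it back. The column names id, value and label are hypothetical, and the value 5000 for guess_max is an arbitrary choice.

```r
library(readr)

# Write a tiny CSV to a temporary file so the example is self-contained.
path <- tempfile(fileext = ".csv")
writeLines(c("id,value,label", "1,2.5,a", "2,3.5,b"), path)

# Explicit column types: the readr counterpart of colClasses.
dat <- read_csv(
  path,
  col_types = cols(
    id    = col_integer(),
    value = col_double(),
    label = col_character()
  )
)

# Let readr guess the types, but from more rows than the default.
dat_guessed <- read_csv(path, col_types = cols(), guess_max = 5000)

# Base R version of the first call; it returns a plain data.frame.
base_dat <- read.csv(path, colClasses = c("integer", "numeric", "character"))

class(dat)       # includes "tbl_df", so this is a tibble
class(base_dat)  # "data.frame"
```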
readxl::read_excel
Excel files are special and are handled by a separate package, readxl.
Use the readxl::read_excel function to read .xls / .xlsx files.
read_excel selects the worksheet to read through the sheet argument.
dplyr
arrange
arranges the rows in the data
mtcars <- tibble::rownames_to_column(mtcars, var = "model")
kable(head(arrange(mtcars, mpg), n = 4))
model | mpg | cyl | disp | hp | drat | wt | qsec | vs | am | gear | carb |
---|---|---|---|---|---|---|---|---|---|---|---|
Cadillac Fleetwood | 10.4 | 8 | 472 | 205 | 2.93 | 5.250 | 17.98 | 0 | 0 | 3 | 4 |
Lincoln Continental | 10.4 | 8 | 460 | 215 | 3.00 | 5.424 | 17.82 | 0 | 0 | 3 | 4 |
Camaro Z28 | 13.3 | 8 | 350 | 245 | 3.73 | 3.840 | 15.41 | 0 | 0 | 3 | 4 |
Duster 360 | 14.3 | 8 | 360 | 245 | 3.21 | 3.570 | 15.84 | 0 | 0 | 3 | 4 |
arrange
arranges the rows in the data; ties in the first column are broken by the next one
kable(head(arrange(mtcars, mpg, disp), n = 4))
model | mpg | cyl | disp | hp | drat | wt | qsec | vs | am | gear | carb |
---|---|---|---|---|---|---|---|---|---|---|---|
Lincoln Continental | 10.4 | 8 | 460 | 215 | 3.00 | 5.424 | 17.82 | 0 | 0 | 3 | 4 |
Cadillac Fleetwood | 10.4 | 8 | 472 | 205 | 2.93 | 5.250 | 17.98 | 0 | 0 | 3 | 4 |
Camaro Z28 | 13.3 | 8 | 350 | 245 | 3.73 | 3.840 | 15.41 | 0 | 0 | 3 | 4 |
Duster 360 | 14.3 | 8 | 360 | 245 | 3.21 | 3.570 | 15.84 | 0 | 0 | 3 | 4 |
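
To sort in the other direction, dplyr provides desc(), which the examples above do not use. A minimal sketch, working on a copy called cars so that mtcars itself is left alone:

```r
library(dplyr)

cars <- tibble::rownames_to_column(mtcars, var = "model")

# Highest mpg first; ties in mpg are broken by ascending disp.
top <- head(arrange(cars, desc(mpg), disp), n = 4)
top[, c("model", "mpg", "disp")]
```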
filter
selects rows (observations)
kable(head(filter(mtcars, am == 1), n = 4))  # only cars with manual transmission
model | mpg | cyl | disp | hp | drat | wt | qsec | vs | am | gear | carb |
---|---|---|---|---|---|---|---|---|---|---|---|
Mazda RX4 | 21.0 | 6 | 160.0 | 110 | 3.90 | 2.620 | 16.46 | 0 | 1 | 4 | 4 |
Mazda RX4 Wag | 21.0 | 6 | 160.0 | 110 | 3.90 | 2.875 | 17.02 | 0 | 1 | 4 | 4 |
Datsun 710 | 22.8 | 4 | 108.0 | 93 | 3.85 | 2.320 | 18.61 | 1 | 1 | 4 | 1 |
Fiat 128 | 32.4 | 4 | 78.7 | 66 | 4.08 | 2.200 | 19.47 | 1 | 1 | 4 | 1 |
filter
selects rows (observations)
kable(head(filter(mtcars, mpg < 30), n = 4))
model | mpg | cyl | disp | hp | drat | wt | qsec | vs | am | gear | carb |
---|---|---|---|---|---|---|---|---|---|---|---|
Mazda RX4 | 21.0 | 6 | 160 | 110 | 3.90 | 2.620 | 16.46 | 0 | 1 | 4 | 4 |
Mazda RX4 Wag | 21.0 | 6 | 160 | 110 | 3.90 | 2.875 | 17.02 | 0 | 1 | 4 | 4 |
Datsun 710 | 22.8 | 4 | 108 | 93 | 3.85 | 2.320 | 18.61 | 1 | 1 | 4 | 1 |
Hornet 4 Drive | 21.4 | 6 | 258 | 110 | 3.08 | 3.215 | 19.44 | 1 | 0 | 3 | 1 |
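
Several conditions can be combined in one filter call. The sketch below is only an illustration on a copy called cars; the threshold mpg > 30 is an arbitrary choice.

```r
library(dplyr)

cars <- tibble::rownames_to_column(mtcars, var = "model")

# Comma-separated conditions are combined with AND.
manual_thrifty <- filter(cars, am == 1, mpg > 30)

# The same filter written with an explicit &.
same <- filter(cars, am == 1 & mpg > 30)

# Use | for OR.
either <- filter(cars, am == 1 | mpg > 30)

manual_thrifty$model
```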
mutate
introduces new variables or transforms existing ones
kable(head(mutate(mtcars, lpm = 235 / mpg), n = 4))  # 235 / mpg is roughly litres per 100 km
model | mpg | cyl | disp | hp | drat | wt | qsec | vs | am | gear | carb | lpm |
---|---|---|---|---|---|---|---|---|---|---|---|---|
Mazda RX4 | 21.0 | 6 | 160 | 110 | 3.90 | 2.620 | 16.46 | 0 | 1 | 4 | 4 | 11.19048 |
Mazda RX4 Wag | 21.0 | 6 | 160 | 110 | 3.90 | 2.875 | 17.02 | 0 | 1 | 4 | 4 | 11.19048 |
Datsun 710 | 22.8 | 4 | 108 | 93 | 3.85 | 2.320 | 18.61 | 1 | 1 | 4 | 1 | 10.30702 |
Hornet 4 Drive | 21.4 | 6 | 258 | 110 | 3.08 | 3.215 | 19.44 | 1 | 0 | 3 | 1 | 10.98131 |
select
selects variables (columns)
kable(head(select(mtcars, model, mpg), n = 4))
model | mpg |
---|---|
Mazda RX4 | 21.0 |
Mazda RX4 Wag | 21.0 |
Datsun 710 | 22.8 |
Hornet 4 Drive | 21.4 |
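
Besides listing columns by name, select also accepts ranges, negation and helper functions such as starts_with(). A short sketch on a copy called cars; the particular columns chosen are arbitrary.

```r
library(dplyr)

cars <- tibble::rownames_to_column(mtcars, var = "model")

by_range  <- select(cars, model, mpg:disp)   # a contiguous range of columns
dropped   <- select(cars, -carb)             # everything except carb
by_prefix <- select(cars, starts_with("d"))  # columns whose name starts with "d"

names(by_range)
names(by_prefix)
```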
%>%
Which function receives the argument x=3?
df <- f1(f2(f3(f4(mtcars, x=2), x), x=3))
This is not easy to read! How can we solve it?
%>%
Compute \((h \circ g \circ f)(a) = h(g(f(a)))\)
Three different ways to calculate this in R:
b <- f(a)
c <- g(b)
h(c)
h(g(f(a)))
a %>% f() %>% g() %>% h()
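To see that the three forms agree, here is a small runnable sketch. The definitions of f, g and h and the value of a are hypothetical stand-ins, and the pipe is loaded from magrittr (dplyr re-exports the same operator).

```r
library(magrittr)

# Hypothetical stand-ins for f, g and h.
f <- function(x) x + 1
g <- function(x) x * 2
h <- function(x) x^2
a <- 3

# 1. Intermediate variables
step1 <- f(a)
step2 <- g(step1)
r1 <- h(step2)

# 2. Nested calls, read from the inside out
r2 <- h(g(f(a)))

# 3. Pipe, read from left to right
r3 <- a %>% f() %>% g() %>% h()

c(r1, r2, r3)  # 64 64 64
```

The piped version lists the functions in the order they are applied, which is what makes longer chains easier to read.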
%>%
mtcars <- mutate(mtcars, lpm = 235 / mpg)
mtcars <- filter(mtcars, am == 1)
ggplot(mtcars, aes(x = hp, y = lpm)) + geom_point()

ggplot(filter(mutate(mtcars, lpm = 235 / mpg), am == 1), aes(x = hp, y = lpm)) + geom_point()

mtcars %>%
  mutate(lpm = 235 / mpg) %>%
  filter(am == 1) %>%
  ggplot(aes(x = hp, y = lpm)) + geom_point()
ggplot2
ggplot2
A statistical plot has these components:
data
geom: type of geometric objects (points, lines, …)
coord: coordinate system
mapping: binds data to the dimensions/“aesthetics” of the coordinate system (position, color, shape, size, …)
ggplot2
A scatterplot
data: mpg and hp for a number of cars
geom: points
coord: Cartesian
mapping: binds hp to position on the x axis and mpg to the y axis
ggplot2
ggplot(data = mtcars, mapping = aes(x = hp, y = mpg)) + geom_point()
ggplot2 with some color and size
ggplot(mtcars, aes(x = hp, y = mpg, size = wt, color = cyl)) + geom_point()
What is cyl?
ggplot2: the types in mtcars matter!
ggplot(mtcars, aes(x = hp, y = mpg, size = wt, color = as.factor(cyl))) + geom_point()
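The type of cyl can be inspected without plotting anything. A minimal base R sketch; the variable name cyl_factor is only for illustration.

```r
# cyl is stored as a number, so ggplot2 treats it as continuous
# and draws a color gradient.
class(mtcars$cyl)   # "numeric"

# Converted to a factor it has three levels, so each level
# gets its own discrete color.
cyl_factor <- as.factor(mtcars$cyl)
levels(cyl_factor)  # "4" "6" "8"
```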
ggplot2: arguments outside of aes but in a geom
ggplot(mtcars, aes(x = hp, y = mpg, size = wt)) + geom_point(color = cyl) # ERROR
ggplot(mtcars, aes(x = hp, y = mpg, size = wt)) + geom_point(color = "red") # OK
aes looks up names in mtcars, but arguments given directly to a geom do not!
ggplot2
Be careful about reusing names in columns, variables, etc.
cyl <- "blue"
ggplot(mtcars, aes(x = hp, y = mpg, size = wt)) + geom_point(color = cyl)
Error?
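One way to check what happens is to build the plot and inspect the data that ggplot2 computed for the layer. This sketch assumes ggplot2 is installed and uses its layer_data() helper; the object name p is only for illustration.

```r
library(ggplot2)

# A variable in the workspace that happens to share its name with a column.
cyl <- "blue"

# Outside aes(), cyl is looked up in the workspace, not in mtcars.
p <- ggplot(mtcars, aes(x = hp, y = mpg, size = wt)) +
  geom_point(color = cyl)

# Every point ends up with the constant color "blue".
unique(layer_data(p)$colour)
```

There is no error, but the plot silently shows something other than the cylinder counts, which is why reusing column names for other variables is risky.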