Day 3: Tidyverse: group_by and summarise, more ggplot2

Do this before class
During class
SR songs
- Exercises
Insurance claims from Kammarkollegiet
- Exercises
Election 2022
- Exercises
Systembolaget’s assortment
- Exercises

Do this before class

Read R4DS chapters 5.6-5.7, 3.7-3.10.

Solve the Grouping and Summarizing and Types of Visualizations chapters of the Introduction to the Tidyverse course at DataCamp.

During class

Open your class_files project and “Pull Branch” (under Tools > Version control in RStudio) to make sure your files are up to date.

SR songs

The script class_files/SR_music.R contains a simple function, get_SR_music, for grabbing music played on Swedish Radio channels from their open API. Load it with

source("class_files/SR_music.R")

and grab e.g. the songs played on P3 (channel 164) on 2022-10-31 by

get_SR_music(channel = 164, date = "2022-10-31") %>% select(title, artist, start_time) %>% head(n=2)

##              title          artist          start_time
## 1       Body Paint  Arctic Monkeys 2022-10-31 23:58:41
## 2 Take It Personal Ella Tiritiello 2022-10-31 23:53:56

If you want multiple dates, the map functions from the purrr package (included in the tidyverse) are convenient (more about these later in the course). Grabbing the music played in, e.g., the last week of October (2022-10-25 to 2022-10-31) into music is done by

days <- seq(as.Date("2022-10-25"), as.Date("2022-10-31"), "days")
music <- map_df(days, get_SR_music, channel = 164)
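To see what map_df does here without calling the API, the following minimal sketch uses a hypothetical stand-in function, fake_music, that returns a one-row data frame per call. map_df calls the function once for each element of the vector, passes channel = 164 on to every call, and binds the results row-wise. The dates are passed as character strings in this sketch, which is an assumption made to keep the toy function simple.

```r
library(dplyr)
library(purrr)

# Hypothetical stand-in for get_SR_music: returns one made-up row per call.
fake_music <- function(date, channel) {
  tibble(date = date, channel = channel, title = "Some song")
}

# Seven dates, stored as character strings for this sketch.
toy_days <- as.character(seq(as.Date("2022-10-25"), as.Date("2022-10-31"), "days"))

# One call per date; the seven one-row results are stacked into one data frame.
toy_music <- map_df(toy_days, fake_music, channel = 164)
toy_music
```

The result has one row per date, with the extra argument channel repeated on every row.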

Exercises

Note: Data is not entirely clean and the same artist/song may be coded in multiple ways (e.g. Cherrie & Z.E., Cherrie, Z.e and Cherrie, Z.E). You may ignore this for now.

Pick a date, or a sequence of dates, and list the 5 most played songs.
Which artist has the largest number of different songs played over some sequence of days?
Visualise the distribution of song durations.
Pick a sequence of dates and visualise how the songs' start times (start_time) are distributed over the day. Repeat for another channel, e.g. P2 (channel 163). You can grab components of a date-time (POSIXct) object with format, as in

as.POSIXct("2019-01-01 23:57:04 CET") %>% format("%H:%M")

## [1] "23:57"

for extracting the hour and minute; see ?format.POSIXct for more examples. Note that the above code returns a value of character type, so you may want to convert it to numeric format (e.g. minutes or hours after midnight) before plotting. The tidyverse package for working with dates is called lubridate and has functions for extracting hours and minutes as well, e.g.

as.POSIXct("2019-01-01 23:57:04 CET") %>% lubridate::hour()

## [1] 23

Visualise the number of hours spent playing music each day over a sequence of dates.
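The conversion from a date-time to a numeric time of day, mentioned in the exercise on start times, could be sketched as follows. The data frame toy_times and the column hours_after_midnight are made up for illustration; with real data you would use the start_time column returned by get_SR_music.

```r
library(dplyr)
library(ggplot2)
library(lubridate)

# Made-up start times, for illustration only.
toy_times <- tibble(
  start_time = as.POSIXct(
    c("2022-10-31 00:15:00", "2022-10-31 07:30:00", "2022-10-31 23:45:00"),
    tz = "UTC"
  )
)

# Hours after midnight as a numeric value, e.g. 07:30 becomes 7.5.
toy_times <- toy_times %>%
  mutate(hours_after_midnight = hour(start_time) + minute(start_time) / 60)

# A histogram with one-hour bins shows how start times spread over the day.
p <- ggplot(toy_times, aes(x = hours_after_midnight)) +
  geom_histogram(binwidth = 1, boundary = 0)
```

Printing p draws the plot; with only three made-up rows it is of course not informative, but the same code applies to a full week of songs.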

Insurance claims from Kammarkollegiet

Kammarkollegiet is a public agency that, among other things, issues insurance. The file class_files/claims.csv contains data on claims from one of their personal insurances. Each claim has a unique Claim id, a Claim date, a Closing date and a number of Payments disbursed at Payment dates. If the claim is not closed (there may be more payments coming), Closing date is given the value NA. Null claims, i.e. claims that have been closed without payment, are not included.

Read the data by

claim_data <- read_csv("class_files/claims.csv")

Exercises

Plot the number of claims per year (each Claim id should only be counted once!).

Actuaries are very fond of loss triangles. This is a table where the value on row \(i\), column \(j\) is the sum of all payments on claims with Claim date in year \(i\) that are disbursed until the \(j\):th calendar year after the year of the claim/accident. The table will be a triangle since future payments are not available.

For claims made since 2010, compute the loss triangle and print it with knitr::kable. Try to do it in a single sequence of pipes. If future payments are coded as NA, using options(knitr.kable.NA = '') will result in a nicer-looking table.
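To make the loss triangle definition concrete, here is a minimal sketch on made-up data. The data frame toy_payments and its column names (claim_year, payment_year, payment) are hypothetical and do not match the columns in claims.csv. The sketch reads "disbursed until" as cumulative, and it assumes that every claim year has at least one payment in each development year observed so far.

```r
library(dplyr)
library(tidyr)

# Made-up payments: one row per payment, with hypothetical column names.
toy_payments <- tibble(
  claim_year   = c(2020, 2020, 2020, 2021, 2021, 2022),
  payment_year = c(2020, 2021, 2022, 2021, 2022, 2022),
  payment      = c(100, 50, 25, 200, 80, 150)
)

toy_triangle <- toy_payments %>%
  # Development year j: calendar years elapsed since the claim year.
  mutate(dev_year = payment_year - claim_year) %>%
  # Total paid per claim year and development year.
  group_by(claim_year, dev_year) %>%
  summarise(paid = sum(payment), .groups = "drop") %>%
  # Accumulate over development years within each claim year.
  arrange(claim_year, dev_year) %>%
  group_by(claim_year) %>%
  mutate(paid = cumsum(paid)) %>%
  ungroup() %>%
  # One row per claim year, one column per development year.
  pivot_wider(names_from = dev_year, values_from = paid)

toy_triangle
```

Cells for development years that lie in the future are filled with NA by pivot_wider, which gives the table its triangular shape.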

Election 2022

A list of all political parties participating in the 2022 Swedish elections can be downloaded from Valmyndigheten by

party_url <- "https://www.val.se/download/18.75995f7b17f5a986a4eebb/1664362507785/deltagande-partier.csv"
parties_2022 <- read_delim(party_url, delim = ";", locale = locale("sv", encoding = "UTF-8"))

Exercises

There is a warning about parsing issues when reading the data set. Can you find out where the problem is coming from?
How many unique parties participated in each of the three elections (VALTYP equals RD for Riksdag, RF for Regionfullmäktige and KF for Kommunfullmäktige)? Note that the same party may appear multiple times (e.g. because of multiple reasons for inclusion in DELTAGANDEGRUND).
How many local parties (parties participating within only a single VALKRETSKOD) participated in the Kommunalval (VALTYP equals KF)?
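Counting unique values within groups, as in the question on unique parties per election, can be sketched with group_by, summarise and n_distinct. The data frame toy_parties is made up, and PARTI is a hypothetical name for the column identifying a party; check the actual column names in parties_2022 before adapting this.

```r
library(dplyr)

# Made-up rows; note that party "A" appears twice within RD.
toy_parties <- tibble(
  VALTYP = c("RD", "RD", "RD", "KF", "KF", "KF", "KF"),
  PARTI  = c("A", "A", "B", "A", "C", "C", "D")
)

# n_distinct counts each party once per election type, however often it is listed.
toy_counts <- toy_parties %>%
  group_by(VALTYP) %>%
  summarise(n_parties = n_distinct(PARTI))

toy_counts
```

Using n() instead of n_distinct(PARTI) would count rows rather than parties, which overstates the number whenever a party is listed more than once.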

Systembolaget’s assortment

As in the last class, load Systembolaget’s assortment and select the regular product range.

Exercises

How many beverages are there in each group of products (Varugrupp)? Use filter and is.na to filter out beverages where Varugrupp is not available.
Select red wines of vintage 2011-2018. Compute the mean PrisPerLiter for each vintage and visualise using ggplot.
List the cheapest beverage (by PrisPerLiter) in each Varugrupp.
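Picking one row per group, as in the question on the cheapest beverage per Varugrupp, is one possible use of slice_min on a grouped data frame. The sketch below uses a made-up data frame, toy_assortment; the column Namn and all values are assumptions for illustration.

```r
library(dplyr)

# Made-up assortment with two product groups.
toy_assortment <- tibble(
  Namn         = c("Cider A", "Cider B", "Whisky A", "Whisky B"),
  Varugrupp    = c("Cider", "Cider", "Whisky", "Whisky"),
  PrisPerLiter = c(45, 30, 900, 650)
)

# Within each Varugrupp, keep the row with the smallest PrisPerLiter.
toy_cheapest <- toy_assortment %>%
  group_by(Varugrupp) %>%
  slice_min(PrisPerLiter, n = 1) %>%
  ungroup()

toy_cheapest
```

By default slice_min keeps ties, so a group can contribute more than one row if several beverages share the lowest price.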