Solutions to the exercises of this homework 3 should, just as for
HW1-HW2, be written in an R-Markdown document with output:
github_document
. Both the R-Markdown document
(.Rmd-file) and the compiled Markdown document (.md
file), as well as any figures needed for properly rendering the Markdown
file on GitHub, should be uploaded to your Homework repository as part
of a HW3
folder. Code should be written clearly in a
consistent style, see in particular Hadley Wickham’s tidyverse style guide. As an
example, code should be easily readable and avoid unnecessary repetition
of variable names.
Open the project hw_data
and pull the most recent
changes. If this does not work, delete the folder and clone a new
version through an R project.
Note: In this homework you will get into contact with SQL databases for the first time. We will spent more time on this topic during next week(s). For now we will only load data of stored databases into R and corresponding code is given in the assignment. If you would like some background on SQL databases you can check out chapter 1, “Relational Databases” of the Introduction to SQL course on DataCamp.
In the file hw_data/seo_marketing_data.sqlite
we have an example of a database often seen in search engine marketing.
It contains data on who buys on that site (conversions). The data has
been simplified a lot, but you can imagine it being from a site that
rents out rooms for hotels. The database represents customers coming
from a search engine (e.g. google), pressing an ad (a “click”) and
whether or not they bought something (if a conversion occurred). There
are two tables, “Clicks” and “Conversion_value”. Each id
represent a unique customer. Use the following code snippet to read
tables from the database.
library(RSQLite)
con <- dbConnect(RSQLite::SQLite(), "../hw_data/seo_marketing_data.sqlite")
# List all available tables
dbListTables(con)
# Tables are "Conversion value" and "Clicks"
# Pull all data in a table to R.
df_clicks <- dbReadTable(con, "Clicks")
Visualize the number of clicks made on each day (date).
Given that there is a cost to each click, which day of the week has been most costly? Hint: there are many ways to solve to problem, but have a look at lubridate::wday.
Make a histogram of the conversion value customers provided.
The “Conversion lag” (CL) of a customer is the number of days
from the customer clicked on an ad to the customer bought
something/converted. An example is that the customer clicked an add on
the first on January and rented a room on the third of January. The CL
is then equal to 2. Plot its distribution in R, interpret and comment on
the results. Hint: Google on how to take differences of
dates. If you want to write a full SQL solution you can use
julianday
.
Present two tables on the following format
Date of click | CL - 25 days | CL - 26 days | CL - 27 days |
2020-.. | a number | number | number |
2020-.. | a number | number | number |
2020-.. | a number | number | number |
2020-.. | a number | number | number |
2020-.. | a number | number | number |
The two tables should contain the sum of clicks and average conversion value. Display the last five rows with knitr::kable. The tables will be sparse (many zeros).
Hint: Use a suitable function to make the data in wide format.
The file HW_data/sl-api.sqlite
is a SQLite database containing all SL’s current stops, lines, and the
stops on each line. The data are obtained from a call to TrafikLab’s SL
Hållplatser
och Linjer 2 API on 2019-11-17 using the httr
package.
Note: Trafiklab
has further documentation of the API and the variables.
List all available tables if the HW_data/sl-api.sqlite database using similar code as shown in Exercise 1. Figure out and describe how the tables relate to each other. Note: Answer the question by both studying the descriptive information on the linked Website and, to a smaller degree, by manually looking at the tables of the database. The latter could, e.g., be useful to confirm what is written in the description. At the end, your answers is supposed to consists purely of text and possibly a figure, but no code.
Pull all data from the SQLite database into R.
Important: The SiteId
,
StopPointNumber
, StopAreaNumber
,
JourneyPatternPointNumber
variables in the different data
sets are all of the type integer according to the data model, but have
not been properly converted as part of the API call. As a consequence,
convert them manually to numerical values after importing them into
R.
Present a table of the number of active unique rail traffic stops
(i.e. train, tram or metro stops in each ticket zone
(ZoneShortName
in
stopAreas
/stopPoints
). By “active” we mean
stops that are part of the journey pattern (as defined in
journeyPatterns
) of a line.
Choose a line, and plot the stops as points on a map with the name of each stop as a label. Write the code in such a way that it is easily reusable if you want to plot another line. In order to produce a map use the latitude and longitude coordinates and generate the plot using leaflet package. As an example:
library(leaflet)
kräftriket <- data.frame(lon = 18.055, lat = 59.357, name = "Kräftriket")
m <- leaflet(kräftriket) %>%
addTiles() %>% # Add default OpenStreetMap map tiles
addMarkers(lng=~lon, lat=~lat, popup = ~name)
m
For documents with output: html_output
or when working
with the code in interactive mode you can then just print m
to see the map. This might, however, not work with .md
files (i.e. github_document
)! Instead, you can store the
Leaflet input into a file and then read and display this again with
knitr::includegraphics
:
#Store leaflet widget as a PNG, see https://stackoverflow.com/questions/31336898/how-to-save-leaflet-in-r-map-as-png-or-jpg-file
library(mapview)
mapshot(m, file = "Leaflet-plot.png")
knitr::include_graphics("Leaflet-plot.png")
stopAreas
and stopPoints
tables and comment on the sparsity of this data presentation, e.g., are
there any (unnecessary) redundancies? Suggest a more sparse data model
for the stopAreas
table and perform the appropriate table
operations to obtain this sparser representation and store it in the
data.frame stopAreas_sparse
. Explain how one would get the
original stopAreas
data.frame using joins. Note: A very
operational way to check memory consumption for storing a variable in R
is to use the object.size
function.