Repo admin

From the first homework assignment, you should now have a Homework folder on your computer containing a subfolder HW1 with the first assignment. For the coming assignments, you will need data that is made available in the repo at https://github.com/mt4007-ht22/HW_data. Clone this repo (by creating a new R-project) into a subfolder HW_data of Homework. We recommend that you start a fresh R-project in Homework/HW2, but communicate with GitHub through the project in Homework (i.e. open this project when you need to commit/push).

Summary of repos and folders

  • The Homework folder is connected to your HW_username repository on GitHub. When you want to push your work to GitHub, open the R-project in this folder and commit-push. It has a subfolder Homework/HW_data and one subfolder for each homework (Homework/HW1, Homework/HW2, …). It also contains the README.md-file where you insert links to your homeworks.
  • The Homework/HW_data folder is connected to https://github.com/mt4007-ht22/HW_data. When a new homework is issued, you need to open the R-project in this folder and pull the new data from GitHub. You should never change the files in this folder. If you do so by mistake, delete it and make a new clone.
  • The Homework/HW[1-6] folders. This is where you keep your rmarkdown and markdown document for each homework. You should keep a separate R-project in each, but these need not be under version control.

Finally

You should not push the Homework/HW_data to GitHub. In order to avoid this:

  1. Open the R-project in Homework.
  2. Choose “Open file…” and open the file .gitignore (a list of files that git ignores).
  3. Add a new line containing HW_data and save the file.

HW instructions

Solutions to the following tasks should be presented in an R-Markdown document with output: github_document. Both the R-Markdown document (.Rmd-file) and the compiled Markdown document (.md file), as well as any figures needed for properly rendering the Markdown file on GitHub needs to be pushed as part of the HW2 subdirectory. Code should be written clearly in a consistent style, see in particular Hadley Wickham’s tidyverse style guide. As an example, code should be easily readable and avoid unnecessary repetition of variable names.

Your submitted code should be self-contained and results should reproducible for someone having access to the HW_data-repo directory. Once ou are ready to submit and before the deadline, use the same procedure as for HW1: open an issue in your HW_<username> repository with the title “HW2 ready for grading!”.

Exercise 1: Apartment prices

The file ../HW_data/booli_sold.csv contains sales data on 158 apartments in Ekhagen (next to Lappis).

Tasks

  1. Illustrate how Soldprice depends on Livingarea with a suitable figure.
  2. Illustrate trends in Soldprice / Livingarea over the period.
  3. Illustrate an aspect of data using a table.
  4. Illustrate an aspect of data using a histogram.
  5. Illustrate an aspect of data using a boxplot (geom_boxplot).

Exercise 2: Folkhälsomyndigheten COVID cases and why excel might not be your friend

The file ../HW_data/Folkhalsomyndigheten_Covid19.xlsx contains data on COVID-19 cases in Sweden. The data was obtained through Folkhälsomyndigheten’s webpage on the 1st of October 2020. Due to the fact that we downloaded it manually on a specific date, reproduceability might be an issue since COVID cases might be updated.

Tasks

Answer the listed questions below.

Data wrangling

  1. Open the .xlsx file in any way of your choosing and have a look at the numbers. What does the file contain and what does the data represent? Use the Folkhälsomyndigheten’s webpage to gather the necessary information. Declare what information is in which sheet. Depending on your OS and how you opened the file to begin with, you might get some info of the sheets using the function excel_sheets.
  2. From the readxl package, use an appropriate read_* function to read all sheets in the .xlsx file and store them as tibbles (data.frames). The read_* function will be simply referred to as the “read function” in the coming questions. When you read these sheets, you should see a lot of warning messages. We will investigate those in the coming questions.
  3. Display the first and the last five rows of the second sheet called “Antal avlidna per dag” using knitr::kable and head. What are the column names? Does anything seem strange? Using the argument n_max in the read function, remove the last row.
  4. In the sheet corresponding to “Veckodata Kommun_stadsdel” look at the columns and their types. What type has your read_* function parsed for the column Statsdel? Read the documentation and the appropriate function, give an explanation to why this happens and how to fix it.
  5. In the same sheet there are two columns called tot_antal_fall and nya_fall_vecka. What is the type of these variables and why has it been parsed as such? Correct these (in some way) such that these become numeric variables.

Statistics and plotting

  1. Using the summarise, and across function, reproduce the number of COVID-19 cases for each region as well as for the total (here named Totalt_antal_fall) based on sheet 1 of the excel file. Notice that this information can be found in another part of the excel file. What is the total number of cases? Which region has had most cases so far? Which has had the least? Argue why looking at counts might be misleading when comparing regions.
  2. Plot the daily number of reported cases (for whole Sweden) since the 15th of March in a line chart. Do the same for the cumulative number of reported cases (Hint: use the cumsum function inside of mutate).
  3. Derive the total number of cases per week in Sweden based on the column Antal_fall_vecka from sheet 7. Plot it as a barplot against the week number veckonummer using geom_col().