A course in “Data Science”

The Data Science Venn Diagram

Statistical data processing

We will not touch…

  • …special tools for things that are too big for the computer’s internal memory (Big Data?)…
  • …tools for regression and prediction (Machine learning, Deep learning, …)…
  • …but maybe Data mining??

Course structure

Course book

Details on DataCamp

Provides basic training and preparation for class. Not grading-relevant, but we assume that you worked through the material.

Course structure

  • Preparation: Tasks at DataCamp before each lesson.
  • Lessons: Programming tasks with tutorials. Small number of “regular” lessons!
  • Examination:
    • 6 assignments. (Felix + (Emilia + Vera))
    • Digital exam. (Felix (+ Michael))
    • Project. (Michael)

Assignments (3 credits, pass/fail)

  • Six tasks, solved individually and independently (limited supervision).
  • New task sheet available Monday morning. Provide a solution on the following Sunday.
  • Monday morning: mandatory peer review of a fellow student’s solution, assigned by us.
  • Final submission of your solution (after peer review): Tuesday 18:00.

Assignments (3 credits, pass/fail)

  • Solution will be checked and graded as pass/fail.
  • All 6 tasks have to be passed to obtain the 3 credits.
  • Missed deadline/failed assignment: Re-examination takes place in February at the earliest.
  • To qualify for the re-exam, 3 out of 6 assignments have to be passed.
  • Workload in re-exam depends on number of failed assignments.

E-examination (1.5 credits, graded A-F)

  • Note: the exam is remote!
  • Problem solving in RStudio. Some questions will be taken from lesson materials.
  • Aids: Relevant cheatsheets from RStudio (Posit).
  • Examination in January, re-examination in February, see timeedit.

Project (3 credits, A-F, Michael)

  • An entry in a data blog. The basis of your bachelor thesis?
  • Illuminates a question using a “unique” data material.
  • Think about the topic early! Less is more.
  • HW6 requires initial thoughts.
  • Tutorial sessions in December.
  • Short (5 min!) oral presentation 2023-01-12, see timeedit.

Tutorial

  • Most of the course is independent work, and everything will be available online. Therefore, we reduce the hours you need to be on site/available.
  • We offer 2 meetings per week on campus, where we can answer questions and help with coding, and where you can exchange ideas with others.

Reproducibility

Cut, paste, click load data and other problems

  • As a statistician/actuary/mathematician, you will do many analyses.
  • You will write many reports.
  • Many are tempted to create tables and figures by cutting, pasting, clicking in and between Excel sheets.

Problem

  • How did you handle NA/“missing values”?
  • What model did you use to get these results, with what parameters?
  • Nice figure, but can I get a table instead?
  • Next year, a colleague (or you yourself) gets the same or a similar task and wonders how you obtained your results from the available data.

Reproducible data analysis

Reproducibility is the ability to get the same research results or inferences, based on the raw data and computer programs provided by researchers. (Wikipedia)

  • Cf. Replicability, the ability to arrive at the same conclusion based on independent data/analysis.

  • You can never guarantee that you did it “right”, but you can at least document what you did.

  • It can also be difficult to guarantee that everything works the same between OSs.

Reproducible data analysis

  • Everything written in code (no clicking or cutting/pasting results/tables/figures)

  • Portable (the code must be executable, not just on your computer today)

  • Accessible (others should be able to easily access and reproduce your analysis)

  • Automated from raw data to report (a button press should be enough to generate the final product)
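As a minimal sketch of the first point, the handling of missing values can be written down in code instead of being done by hand in a spreadsheet. The example assumes the built-in airquality data set as a stand-in for raw data; all object names are illustrative only.

```r
# Stand-in for raw data: the built-in airquality data set (an assumption
# for illustration; a real analysis would read the raw file here).
raw <- airquality

# Record how many Ozone values are missing before doing anything with them.
n_missing <- sum(is.na(raw$Ozone))

# Explicit, documented decision: drop rows where Ozone is missing.
clean <- raw[!is.na(raw$Ozone), ]

# Mean ozone per month, computed from the cleaned data only.
monthly_ozone <- aggregate(Ozone ~ Month, data = clean, FUN = mean)
print(monthly_ozone)
```

Anyone rerunning this script sees both the number of dropped observations and the rule used to drop them, which answers the "how did you handle NA?" question without asking the author.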

Tools for reproducible data analysis

Code: R or Python?

Code: R or Python!

But everything can be done in many, a little too many, ways in R.

summary(mtcars$mpg)
summary(mtcars$"mpg")
summary(mtcars[, "mpg"])
summary(mtcars["mpg"])
summary(mtcars[["mpg"]])
summary(mtcars[1])
summary(mtcars[, 1])
summary(mtcars[[1]])
with(mtcars, summary(mpg))
attach(mtcars); summary(mpg)
summary(subset(mtcars, select=mpg))

From http://r4stats.com/articles/why-r-is-hard-to-learn/
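The variants above are not fully interchangeable: some extract a numeric vector, others keep a one-column data frame, and summary() treats the two differently. A small sketch that checks this on mtcars (the object names are illustrative):

```r
# Extracting with $, [[ ]] or [, "name"] returns a plain numeric vector.
mpg_vector <- mtcars[["mpg"]]

# Extracting with single brackets and no comma keeps a one-column data frame.
mpg_frame <- mtcars["mpg"]

is.numeric(mpg_vector)     # vector
is.data.frame(mpg_frame)   # data frame

# These three ways of getting the vector give identical objects.
same_a <- identical(mtcars$mpg, mtcars[, "mpg"])
same_b <- identical(mtcars$mpg, mtcars[[1]])

# summary() dispatches on the type, so the two results print differently.
summary(mpg_vector)
summary(mpg_frame)
```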

Code: Tidyverse (formerly “Hadleyverse”)

A collection of R packages from RStudio (Posit). Design philosophy: fast, consistent, purposeful functions. This is the focus of this course.
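For comparison, one possible tidyverse way to summarise the same column is sketched below. It assumes the dplyr package is installed and R 4.1 or later (for the native pipe); the column names in the result are chosen here for illustration.

```r
library(dplyr)

# One pipeline, one verb: summarise() collapses the data to a single row.
mpg_summary <- mtcars |>
  summarise(
    min_mpg    = min(mpg),
    median_mpg = median(mpg),
    mean_mpg   = mean(mpg),
    max_mpg    = max(mpg)
  )

print(mpg_summary)
```

The result is itself a data frame, so it can be passed on to further tidyverse functions or printed as a table in a report.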

Non-tidyverse solutions over time

Automatic report generation

Automatic report generation: Markdown

Automatic report generation: R Markdown

An evolution of Markdown that includes executable code.

knitr: .Rmd → .md
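The .Rmd → .md step can be tried out directly from R. The sketch below assumes the knitr package is installed; it writes a tiny R Markdown document to a temporary file and knits it. The file names and the document content are made up for illustration.

```r
library(knitr)

# Build the chunk delimiter programmatically to keep this example readable.
fence <- strrep("`", 3)

# A minimal R Markdown document: one heading, one sentence, one code chunk.
rmd_lines <- c(
  "# Fuel economy",
  "",
  "Summary of miles per gallon in mtcars:",
  "",
  paste0(fence, "{r}"),
  "summary(mtcars$mpg)",
  fence
)

rmd_file <- tempfile(fileext = ".Rmd")
md_file  <- tempfile(fileext = ".md")
writeLines(rmd_lines, rmd_file)

# knit() runs the chunk and writes Markdown containing code and output.
knit(rmd_file, output = md_file, quiet = TRUE)

md_lines <- readLines(md_file)
cat(md_lines, sep = "\n")
```

The resulting Markdown contains both the code and its printed output, so the numbers in the report can never drift away from the code that produced them.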

Availability and good programming style

  • An important aspect of making code accessible is making it readable

  • In this course we will use The tidyverse style guide by Hadley Wickham

  • The styler package has a convenient RStudio add-in that helps you reformat your code according to the style guide
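Besides the add-in, styler can also be called from code. A minimal sketch, assuming the styler package is installed; the badly formatted input string is invented for illustration:

```r
library(styler)

# Deliberately messy code, passed as text (hypothetical example input).
ugly_code <- "a=function( x){1+1}"

# style_text() returns the code reformatted according to the tidyverse style.
styled <- style_text(ugly_code)
print(styled)
```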

Availability and Versioning: Git and GitHub

  • A version management software
  • A web-based storage service for your history

Version management: Git

Availability: GitHub

All well integrated through RStudio.

Also provides .Rproj for increased portability.

How do I use X to Y? (1)

How do I use X to Y? (2)

Summary

  • All written in code: R

  • Portable: .Rproj (RStudio)

  • Available: GitHub

  • Automated: R Markdown