Statistisk databehandling MT4007-HT22

A course in “Data Science”

The Data Science Venn Diagram

From http://drewconway.com/zia/2013/3/26/the-data-science-venn-diagram

Statistical data processing

Matstat courses usually focus entirely on the Model stage; this course aims to touch on the other stages as well.

Pic from: http://r4ds.had.co.nz/introduction.html

We will not touch…

…special tools for things that are too big for the computer’s internal memory (Big Data?)…
…tools for regression and prediction (Machine learning, Deep learning, …)…
…but maybe Data mining??

Course structure

Course book

http://r4ds.had.co.nz/

Details on DataCamp

Provides basic training and preparation for class. Not grading-relevant, but we assume that you have worked through the material.

Course structure

Preparation: Tasks at DataCamp before each lesson.
Lessons: Programming tasks with tutorials. Small number of “regular” lessons!
Examination:
- 6 assignments. (Felix + (Emilia + Vera))
- Digital exam. (Felix (+ Michael))
- Project. (Michael)

Assignments (3 credits, pass/fail)

Six tasks, solved individually and independently (limited supervision).
New task sheet available Monday morning. Submit a solution by the following Sunday.
Monday morning: mandatory peer review of a fellow student's solution, assigned by us.
Final submission of your solution (after peer review): Tuesday 18:00.

Assignments (3 credits, pass/fail)

Solution will be checked and graded as pass/fail.
All 6 tasks have to be passed to obtain the 3 credits.
Missed deadline/failed assignment: Re-examination takes place in February at the earliest.
To qualify for the re-exam, 3 out of 6 assignments have to be passed.
Workload in re-exam depends on number of failed assignments.

E-examination (1.5 credits, graded A-F)

Note, the exam is remote!
Problem solving in RStudio. Some questions will be taken from lesson materials.
Aids: relevant cheat sheets from RStudio (Posit).
Examination in January, re-examination in February; see TimeEdit.

Project (3 credits, A-F, Michael)

An entry in a data blog. The basis of your bachelor thesis?
Illuminates a question using a “unique” data set.
Think about the topic early! Less is more.
HW6 requires initial thoughts.
Tutorial sessions in December.
Short (5 min!) oral presentation 2023-01-12; see TimeEdit.

Tutorial

Most of the course is work on your own, and everything will be available online. Therefore, we reduce the hours you need to be on site/available.
We offer 2 meetings per week on campus, where we can answer questions and help with coding, and where you can exchange ideas with others.

Reproducibility

Cutting, pasting, clicking to load data, and other problems

As a statistician/actuary/mathematician, you will do many analyses.
You will write many reports.
Many are tempted to create tables and figures by cutting, pasting, clicking in and between Excel sheets.

Problem

How did you handle NA/“missing values”?
What model did you use to get these results, with what parameters?
Nice figure, but can I get a table instead?
Next year, a colleague (or you yourself) gets the same or a similar task and wonders how you obtained your results from the available data.
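
One way to answer the first question is to write the handling of missing values into the code itself. The following is a minimal sketch in base R. It assumes the built-in `airquality` data set as a stand-in for real data; that data set is not part of the course material.

```r
# Built-in data set used only as a stand-in; Ozone contains missing values.
ozone <- airquality$Ozone

# Record how many values are missing instead of silently dropping them.
n_missing <- sum(is.na(ozone))

# By default, mean() returns NA as soon as one value is missing.
mean_default <- mean(ozone)

# Dropping missing values is an explicit, documented choice.
mean_complete <- mean(ozone, na.rm = TRUE)

cat("Missing values:", n_missing, "\n")
cat("Mean over complete cases:", round(mean_complete, 2), "\n")
```

Because the choice `na.rm = TRUE` is visible in the script, a reader can see how missing values were treated without having to ask.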

Reproducible data analysis

Reproducibility is the ability to get the same research results or inferences, based on the raw data and computer programs provided by researchers. (Wikipedia)

Cf. Replicability, the ability to arrive at the same conclusion based on independent data/analysis.
You can never guarantee that you did “right”, but you can at least document what you did.
It can also be difficult to guarantee that everything works the same between OSs.

Reproducible data analysis

Everything written in code (no clicking or cutting/pasting results/tables/figures)
Portable (the code must be executable, not just on your computer today)
Accessible (others should be able to easily access and reproduce your analysis)
Automated from raw data to report (a button press should be enough to generate the final product)
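
The four points above can be sketched as one short script. This is only an illustration in base R: the function name `run_analysis` and the temporary files are hypothetical, and the built-in `mtcars` data stands in for a real raw data file.

```r
# Hypothetical helper: goes from a raw CSV file to a summary table on disk.
run_analysis <- function(raw_path, out_path) {
  raw <- read.csv(raw_path)

  # Every processing step is code, so it can be rerun and inspected.
  complete <- raw[!is.na(raw$mpg), ]
  result <- aggregate(mpg ~ cyl, data = complete, FUN = mean)

  write.csv(result, out_path, row.names = FALSE)
  result
}

# Stand-in for raw data: write the built-in mtcars data to a temporary file.
raw_path <- tempfile(fileext = ".csv")
out_path <- tempfile(fileext = ".csv")
write.csv(mtcars, raw_path, row.names = FALSE)

# No hard-coded absolute paths, so the script also runs on other computers.
summary_table <- run_analysis(raw_path, out_path)
print(summary_table)
```

Running the whole script again reproduces the output file from the raw data with no manual steps in between.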

Tools for reproducible data analysis

Code: R or Python?

Code: R or Python!

But everything can be done in many, perhaps a little too many, ways in R.

summary(mtcars$mpg)
summary(mtcars$"mpg")
summary(mtcars[, "mpg"])
summary(mtcars["mpg"])
summary(mtcars[["mpg"]])
summary(mtcars[1])
summary(mtcars[, 1])
summary(mtcars[[1]])
with(mtcars, summary(mpg))
attach(mtcars); summary(mpg)
summary(subset(mtcars, select=mpg))

From http://r4stats.com/articles/why-r-is-hard-to-learn/
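
Note that the variants in the list do not all pass the same kind of object to `summary()`: some extract a numeric vector, while others keep a one-column data frame. A small base R check of a few of them, offered as an illustration only:

```r
# Three ways to extract the column as a numeric vector.
v1 <- mtcars$mpg
v2 <- mtcars[["mpg"]]
v3 <- mtcars[, 1]

# Single-bracket indexing without a comma keeps the data frame structure.
d1 <- mtcars["mpg"]

same_vectors <- identical(v1, v2) && identical(v1, v3)
print(same_vectors)
print(is.numeric(v1))
print(is.data.frame(d1))
```

This difference is why the printed summaries look different for, say, `mtcars$mpg` and `mtcars["mpg"]`, even though they describe the same numbers.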

Code: Tidyverse (formerly nicknamed “Hadleyverse”)

A series of R packages from RStudio (Posit). Design philosophy: Fast, consistent, purposeful functions. Focus in this course.
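
To give a first impression of the style, here is a minimal sketch using the dplyr package from the tidyverse. It assumes dplyr is installed; the column names `mean_mpg` and `n` are chosen for this illustration.

```r
library(dplyr)

# Mean fuel consumption per number of cylinders, using dplyr verbs and the pipe.
mpg_by_cyl <- mtcars %>%
  group_by(cyl) %>%
  summarise(mean_mpg = mean(mpg), n = n())

print(mpg_by_cyl)
```

Each verb does one thing, and the pipe lets the steps be read from top to bottom in the order they are carried out.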

Non-tidyverse solutions over time

Automatic report generation

We need to automatically combine text, results, tables and figures:

Literate programming

Image from https://rosannavanhespenresearch.files.wordpress.com/

Automatic report generation: Markdown

A lightweight markup language for writing formatted text.

http://writeme.mattstow.com/

Automatic report generation: R Markdown

An evolution of Markdown that includes executable code.

knitr: .Rmd → .md
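
The step from .Rmd to .md can be tried out directly from R. The sketch below assumes the knitr package is installed. The document content and the temporary file paths are made up for this illustration.

```r
library(knitr)

# Build the chunk delimiter (three backticks) programmatically.
fence <- strrep("`", 3)

# A minimal R Markdown document: text, an inline result, and one code chunk.
rmd_lines <- c(
  "# Fuel consumption",
  "",
  "The data set has `r nrow(mtcars)` cars.",
  "",
  paste0(fence, "{r}"),
  "summary(mtcars$mpg)",
  fence
)

rmd_path <- tempfile(fileext = ".Rmd")
md_path <- tempfile(fileext = ".md")
writeLines(rmd_lines, rmd_path)

# knit() runs the code and writes a plain Markdown file with the results.
knit(input = rmd_path, output = md_path, quiet = TRUE)

md_lines <- readLines(md_path)
cat(md_lines, sep = "\n")
```

The resulting Markdown file contains the text, the code, and the computed output together, so the numbers in the report cannot drift away from the code that produced them.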

Availability and good programming style

An important aspect of making code accessible is making it readable
In this course we will use The tidyverse style guide by Hadley Wickham
The styler package provides a convenient RStudio add-in that helps you reformat your code according to the style guide
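
Besides the add-in, styler can also be called from code. A minimal sketch, assuming the styler package is installed; the messy input line is invented for this illustration.

```r
library(styler)

# Deliberately unstyled code, passed as text.
messy <- "x=c(1,2,3)"

# style_text() returns the code reformatted according to the tidyverse style guide.
styled <- style_text(messy)
print(styled)
```

The restyled line uses `<-` for assignment and puts a space after each comma, as the style guide recommends.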

Availability and Versioning: Git and GitHub

Git: version management software
GitHub: a web-based storage service for your version history

Version management: Git

Image from http://phdcomics.com/comics/archive.php?comicid=1531

Not necessary for reproducibility, but a must for large projects over a long period of time.
Version management supports working with code projects in teams

Availability: GitHub

A web hosting service for sharing files, mainly program code
Builds on the Git version control program with advanced functionality to facilitate collaboration
https://github.com/Microsoft/malmo
https://github.com/tidyverse/dplyr

All well integrated through RStudio.

How do I use X to Y? (1)

How do I use X to Y? (2)

Summary