Lecture 1: Introduction to data science

Daniel Hammarström

Why data science and statistics?

Data literacy: “the ability to understand the principles behind learning from data, carry out basic data analyses, and critique the quality of claims made on the basis of data” (Spiegelhalter 2019)

Transferable skills

New York Post

Most research is false!

Replication

  • A lot of the scientific process is about confirming claims and refining what we think we know
  • Replication is a key component of scientific research → can a phenomenon be replicated in an independent setting?

Replication and Reproducibility

From (Peng 2011).

Replication and Reproducibility

  • Replication: Confirming scientific claims with independent data.
    • Essential for verifying research findings.
    • Not always feasible: study size, cost, or urgency can make a study hard to repeat.
  • Reproducibility: Minimum standard in scientific research.
    • Independent researchers can reproduce the results using the same data and code.
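
The distinction between the two terms can be sketched in a few lines of code. This is a minimal illustration, not part of the lecture: the course itself recommends R, Python is used here only for the sketch, and `simulate_study`, `analyse`, the seeds, and the effect size are all hypothetical stand-ins for a real study and its analysis script.

```python
import random
import statistics


def simulate_study(seed, n=50, true_effect=0.5):
    """Simulate one hypothetical study: n control and n treatment values."""
    rng = random.Random(seed)
    control = [rng.gauss(0.0, 1.0) for _ in range(n)]
    treatment = [rng.gauss(true_effect, 1.0) for _ in range(n)]
    return control, treatment


def analyse(control, treatment):
    """The 'analysis script': difference between the group means."""
    return statistics.mean(treatment) - statistics.mean(control)


# The original study: one data set, one analysis.
original_data = simulate_study(seed=1)
original_result = analyse(*original_data)

# Reproducibility: same data + same code should give exactly the same number.
reproduced_result = analyse(*original_data)

# Replication: same code, but new and independent data (a different seed
# stands in for a new sample). The estimate will usually differ somewhat.
replication_result = analyse(*simulate_study(seed=2))

print("reproduced exactly:", reproduced_result == original_result)
print("original estimate:   ", round(original_result, 3))
print("replication estimate:", round(replication_result, 3))
```

Reproducing only requires the shared data and code; replicating requires collecting new data, which is why it is the more demanding standard.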

Reproducibility

  • Available data
  • Computer code
  • Documentation
  • Data and code sharing methods

(Peng 2011)

How to enable reproducibility

What is a computer program?

“A computer program is a sequence or set of instructions in a programming language for a computer to execute. It is one component of software, which also includes documentation and other intangible components.”

A sequence of instructions

Coko, a programmable crocodile
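
To make "a sequence of instructions" concrete, here is a small sketch of a program for a toy robot that moves on a grid. It is an assumption-laden illustration: the instruction names (`forward`, `left`, `right`) and the function `run_program` are hypothetical and are not taken from the actual Coko toy. Python is used only for the sketch.

```python
def run_program(instructions, start=(0, 0)):
    """Execute instructions one at a time and return the final (x, y) position.

    The robot starts at `start`, facing north (the positive y direction).
    """
    x, y = start
    dx, dy = 0, 1  # current heading: north
    for instruction in instructions:
        if instruction == "forward":
            x, y = x + dx, y + dy      # move one step along the heading
        elif instruction == "left":
            dx, dy = -dy, dx           # rotate 90 degrees counter-clockwise
        elif instruction == "right":
            dx, dy = dy, -dx           # rotate 90 degrees clockwise
        else:
            raise ValueError(f"unknown instruction: {instruction!r}")
    return x, y


program = ["forward", "forward", "right", "forward"]
reordered = ["right", "forward", "forward", "forward"]

print(run_program(program))    # (1, 2)
print(run_program(reordered))  # (3, 0)
```

The two programs contain the same instructions in a different order and end in different places: the computer does exactly what the sequence says, nothing more.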

Data analysis as a computer program

(Wickham and Grolemund 2017)
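
A data analysis written as a program can follow the steps described by Wickham and Grolemund (2017): import, tidy, transform, then summarise or model. The sketch below is only an illustration under stated assumptions: the data values are made up, the column names and function names are hypothetical, and Python's standard library is used here although the course itself works in R.

```python
import csv
import io
import statistics

# Made-up example data; in a real analysis this would be a file on disk.
RAW_CSV = """participant,group,pre,post
1,training,50.0,58.0
2,training,47.5,53.5
3,control,52.0,52.5
4,control,49.0,
"""


def import_data(text):
    """Import: read CSV text into a list of dictionaries (all values are strings)."""
    return list(csv.DictReader(io.StringIO(text)))


def tidy(rows):
    """Tidy: drop rows with missing measurements and convert types."""
    tidy_rows = []
    for row in rows:
        if row["pre"] == "" or row["post"] == "":
            continue  # incomplete observation
        tidy_rows.append({
            "participant": int(row["participant"]),
            "group": row["group"],
            "pre": float(row["pre"]),
            "post": float(row["post"]),
        })
    return tidy_rows


def transform(rows):
    """Transform: add a derived variable, the change from pre to post."""
    return [{**row, "change": row["post"] - row["pre"]} for row in rows]


def summarise(rows):
    """Summarise: mean change per group."""
    by_group = {}
    for row in rows:
        by_group.setdefault(row["group"], []).append(row["change"])
    return {group: statistics.mean(values) for group, values in by_group.items()}


# The whole analysis is one sequence of steps that anyone can re-run.
summary = summarise(transform(tidy(import_data(RAW_CSV))))
for group, mean_change in summary.items():
    print(f"{group}: mean change = {mean_change:.1f}")
```

Because every step is written down, decisions such as dropping the incomplete row for participant 4 are visible and can be checked or changed, which is harder to achieve with point-and-click tools.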

Tools in Data Science

  • Microsoft Excel: Widely used, versatile tool.
  • SPSS, Stata, Jamovi: Specialized software for statistical analysis.
  • R: Preferred for reproducible data analysis:
    • Free, open-source, and script-based.
    • Steep learning curve.

Research software

Why R?

How to learn?

  • Practice by imitation
  • Understand that there are multiple solutions to problems
  • Use online resources like Stack Overflow and Google
  • Stay motivated by using what you learn on problems that interest you
  • Have patience: “the capacity to accept or tolerate delay, problems, or suffering without becoming annoyed or anxious.”

Thank you!

Peng, R. D. 2011. “Reproducible Research in Computational Science.” Science 334 (6060): 1226–27. https://doi.org/10.1126/science.1213847.
Spiegelhalter, D. J. 2019. The Art of Statistics: How to Learn from Data. First US edition. New York: Basic Books.
Wickham, Hadley, and Garrett Grolemund. 2017. R for Data Science: Import, Tidy, Transform, Visualize, and Model Data. 1st ed. O’Reilly Media. http://r4ds.had.co.nz/.