Introduction to data science

Data is everywhere

The PPDAC cycle (Problem, Plan, Data, Analysis, Conclusion)

  • A general approach to data-driven problem-solving
  • Data literacy and data science as opposed to statistics

Replication and Reproducibility

A fully reproducible study has

  • Available data.
  • Computer code (software) that produces the results of the study.
  • Documentation that describes the software and data used in the study.
  • Ways to share the data and code.

Reproducibility and transparency outside science

  • Data are valuable and often hard to get
  • Data may guide good decision-making, if understood
  • Efficient management and use of data require data literacy and data science skills

Tools in data science

We want software where analyses can be:

  • Human- and computer-readable, meaning that we can write scripts or computer programs that execute the analyses.
  • Documented, meaning that we can describe what the code does alongside the code itself.
  • Available and shareable, meaning that the analyses run on free and open software, so that as many people as possible can use them.
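
As a rough illustration of the three properties above, the sketch below shows an analysis written as a short, commented script. It is written in Python purely for illustration; the same idea carries over to R scripts and Quarto documents. The measurements and the function name `summarise` are hypothetical and not taken from any study.

```python
"""Minimal sketch of a scripted, documented analysis (hypothetical data)."""
from statistics import mean, stdev

# Hypothetical measurements; a real project would read these from a shared data file.
measurements = [4.0, 5.0, 6.0, 5.0]


def summarise(values):
    """Return the number of observations, the mean, and the sample standard deviation."""
    return {"n": len(values), "mean": mean(values), "sd": stdev(values)}


summary = summarise(measurements)
print(summary)
```

Because every step is written down as code, anyone with the script and the data can rerun the analysis and check the result, and the comments explain each step to a human reader.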

Tools in data science

  • R and RStudio
  • Quarto
  • Git and GitHub

How to learn how to code

  • There are (almost always) multiple solutions to a problem
  • Someone else has already had the same problem
  • Find your motivation
  • “Microdosing”

Data in practice - Storing data for everyday use

  • Spreadsheets can be used for efficient storage of data for everyday use
  • Spreadsheet software contains functions that add information on top of the data…


All happy families are alike; each unhappy family is unhappy in its own way.

Data Organization in Spreadsheets: Empty cells
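
One common source of empty cells is a value that is typed once and left blank in the rows below, meaning "same as above". The sketch below, in Python for illustration, fills such cells so that every row is complete. The data, the column names, and the helper `fill_down` are hypothetical.

```python
import csv
import io

# Hypothetical spreadsheet export: the date is typed once and left blank below.
raw = """date,subject,glucose
2024-01-10,A,5.4
,B,6.1
2024-01-11,A,5.0
,B,5.8
"""


def fill_down(rows, column):
    """Copy the last non-empty value of `column` into the empty cells below it."""
    filled, last = [], None
    for row in rows:
        row = dict(row)  # copy so the input rows are left unchanged
        if row[column] == "":
            row[column] = last  # stays None if no value has been seen yet
        else:
            last = row[column]
        filled.append(row)
    return filled


rows = list(csv.DictReader(io.StringIO(raw)))
tidy_rows = fill_down(rows, "date")
for row in tidy_rows:
    print(row)
```

Filling down is only appropriate when a blank really means "repeat the value above". Genuinely missing values are a different case and would need an explicit, consistent missing-value code instead.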

Data Organization in Spreadsheets: A tidy version (happy family)

Data Organization in Spreadsheets: Nonrectangular layouts

Data Organization in Spreadsheets: Data dictionary
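
A data dictionary can itself be stored as a small rectangular table with one row per variable. The sketch below, in Python for illustration, shows one possible layout and a simple check that every column in the data is documented. The variable names, descriptions, units, and the helper `undocumented` are all hypothetical.

```python
# Hypothetical data dictionary: one entry per column in the data set.
data_dictionary = [
    {"name": "date", "description": "Date of measurement (YYYY-MM-DD)", "unit": ""},
    {"name": "subject", "description": "Subject identifier", "unit": ""},
    {"name": "glucose", "description": "Blood glucose concentration", "unit": "mmol/L"},
]

# Column names as they appear in the (hypothetical) data file.
data_columns = ["date", "subject", "glucose"]


def undocumented(columns, dictionary):
    """Return the columns that have no entry in the data dictionary."""
    documented = {entry["name"] for entry in dictionary}
    return [col for col in columns if col not in documented]


missing = undocumented(data_columns, data_dictionary)
print(missing)
```

Running a check like this whenever the data change is one way to keep the documentation in step with the data.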

Data Organization in Spreadsheets: Plain text
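
Saving data as plain text, for example as a comma-separated (CSV) file, keeps the data readable by almost any software. The sketch below, in Python for illustration, writes a small hypothetical table to CSV text and reads it back. An in-memory buffer stands in for a file so that the example is self-contained.

```python
import csv
import io

# Hypothetical rows; values are kept as text, exactly as they appear in a CSV file.
rows = [
    {"subject": "A", "glucose": "5.4"},
    {"subject": "B", "glucose": "6.1"},
]

# Write to an in-memory text buffer; a real script would open a .csv file instead.
buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=["subject", "glucose"], lineterminator="\n")
writer.writeheader()
writer.writerows(rows)
csv_text = buffer.getvalue()
print(csv_text)

# Read the plain text back into a list of dictionaries.
restored = list(csv.DictReader(io.StringIO(csv_text)))
```

The round trip preserves the values but not formatting such as colours or cell comments, which is one reason not to store information in formatting alone.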

Minimize the risk of failure by adopting good habits!

References

Broman, Karl W., and Kara H. Woo. 2018. “Data Organization in Spreadsheets.” The American Statistician 72 (1): 2–10. https://doi.org/10.1080/00031305.2017.1375989.
Peng, R. D., F. Dominici, and S. L. Zeger. 2006. “Reproducible Epidemiologic Research.” American Journal of Epidemiology 163 (9): 783–89. https://doi.org/10.1093/aje/kwj093.
Spiegelhalter, D. J. 2019. The Art of Statistics: How to Learn from Data. First US edition. New York: Basic Books.
Wickham, Hadley, and Garrett Grolemund. 2017. R for Data Science: Import, Tidy, Transform, Visualize, and Model Data. 1st ed. O’Reilly Media. http://r4ds.had.co.nz/.