The goal of this course is to provide students with a foundational understanding of classical statistics and how it fits within the broader context of data science. Students will learn to apply the most common statistical procedures correctly, checking assumptions and responding appropriately when they appear violated. Emphasis is placed on different practices that constitute an effective analysis, including formulating research questions, operationalizing variables, exploring data, selecting hypothesis tests, and communicating results.
The course begins with an introduction to probability theory, with pencil-and-paper problem sets to build intuition for the mathematical objects that comprise statistical models. Next, we describe the use of estimators to learn about model parameters, emphasizing guarantees that hold as sample sizes increase to infinity. We then turn to the logic of hypothesis testing and survey a variety of tests used to compare two groups. Finally, we devote several weeks to a discussion of classical linear regression, stressing its flexibility as a tool. We devote special units to the building of regression models in the context of description and of causal inference. Throughout the course, students will practice analyzing real-world data using the open-source language R.
- Proficiency with calculus (including an ability to take simple derivatives and integrals)
- Familiarity with basic matrix operations
- Ability to write proofs
- Before Live Session - The first task for each unit is to complete the asynchronous videos and corresponding readings. We recommend that you begin by watching the first videos. As you work forward, you will encounter special "Reading" pages, which direct you to read specific pages in the textbooks. Where possible, we have further organized each week into two or three "sprints." You may want to take a break after each sprint, or tackle each sprint on a separate day. It is important that you complete all videos and readings before live session.
- During Live Session - In live session, students engage in activities that build upon the asynchronous videos and readings. These include discussions, collaborative problem solving, and programming exercises. We expect you to dial in from a quiet location, arrive on time, and connect over both audio and video. Most importantly, we ask that you treat all classmates with respect and help us create a supportive learning environment.
- After Live Session - After live session, students complete the homework for the corresponding unit, as well as any tests or lab assignments.
- The main text for the class is Foundations of Agnostic Statistics, written by Peter M. Aronow and Benjamin T. Miller. A physical copy of the book is available for approximately $30 (Amazon Link, Cambridge University Press Link). A digital copy is also available through the UC Library (link).
- There is a required course packet, which you may purchase at study.net. This includes select chapters from other textbooks, namely Probability and Statistics, 8th Edition by Devore.
Much of the work for this course can be completed on the UC Berkeley datahub. Students may also run the course materials with the stable rocker/tidyverse image, or through a local install of R and RStudio.
Grading for the course will follow the rubric below. All components will be graded out of 100 points.
Component | Percentage |
---|---|
All Homework Combined | 25% |
Test 1: Probability Theory | 10% |
Test 2: CE, BLP, & Sampling | 10% |
Hypothesis Testing Lab | 20% |
Linear Regression Lab | 25% |
Participation | 10% |