w203: Statistics for Data Science

UC Berkeley School of Information

The goal of this course is to provide students with a foundational understanding of classical statistics and how it fits within the broader context of data science. Students will learn to apply the most common statistical procedures correctly, checking assumptions and responding appropriately when they appear violated. Emphasis is placed on different practices that constitute an effective analysis, including formulating research questions, operationalizing variables, exploring data, selecting hypothesis tests, and communicating results.

The course begins with an introduction to probability theory, with pencil-and-paper problem sets to build intuition for the mathematical objects that make up statistical models. Next, we describe the use of estimators to learn about model parameters, emphasizing guarantees that hold as sample sizes increase to infinity. We then turn to the logic of hypothesis testing and survey a variety of tests used to compare two groups. Finally, we devote several weeks to classical linear regression, stressing its flexibility as a tool, with special units on building regression models for description and for causal inference. Throughout the course, students will practice analyzing real-world data using the open-source language R.

Course Prerequisites

Proficiency with calculus (including an ability to take simple derivatives and integrals)
Familiarity with basic matrix operations
Ability to write proofs

Weekly Workflow

Before Live Session - The first task for each unit is to complete the asynchronous videos and corresponding readings. We recommend that you begin by watching the first videos. As you work forward, you will encounter special "Reading" pages, which direct you to read specific pages in the textbooks. Where possible, we have further organized each week into two or three "sprints." You may want to take a break after each sprint, or tackle each sprint on a separate day. It is important that you complete all videos and readings before live session.
During Live Session - In live session, students engage in activities that build upon the asynchronous videos and readings. These include discussions, collaborative problem solving, and programming exercises. We expect you to dial in from a quiet location, arrive on time, and connect over both audio and video. Most importantly, we ask that you treat all classmates with respect and help us create a supportive learning environment.
After Live Session - After live session, students complete the homework for the corresponding unit, as well as any tests or lab assignments.

Required Textbooks

The main text for the class is Foundations of Agnostic Statistics, written by Peter M. Aronow and Benjamin T. Miller. A physical copy of the book is available for approximately $30 (Amazon Link, Cambridge University Press Link). A digital copy is also available through the UC Library (link).
There is a required course packet, which you may purchase at study.net. This includes select chapters from other textbooks, namely Probability and Statistics, 8th Edition by Devore.

Required Compute Resources

Much of the work for this course can be completed on the UC Berkeley datahub. Students may also run the course materials with the stable rocker/tidyverse image, or through a local install of R and RStudio.

Grading

Grading for the course will follow the rubric below. All components will be graded out of 100 points.

Component	Percentage
All Homework Combined	25%
Test 1: Probability Theory	10%
Test 2: CE, BLP, & Sampling	10%
Hypothesis Testing Lab	20%
Linear Regression Lab	25%
Participation	10%
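
As a rough illustration of how the percentages above combine, the sketch below computes an overall course score as a weighted average of the six components. It assumes the final score is a straight weighted average of the component scores; the syllabus gives the weights but does not spell out the formula. The function name `final_score` and the example scores are hypothetical.

```python
# Component weights copied from the grading table (percent of final grade).
WEIGHTS = {
    "All Homework Combined": 25,
    "Test 1: Probability Theory": 10,
    "Test 2: CE, BLP, & Sampling": 10,
    "Hypothesis Testing Lab": 20,
    "Linear Regression Lab": 25,
    "Participation": 10,
}


def final_score(scores):
    """Return the weighted average of component scores, each out of 100.

    Assumption: the course score is a plain weighted average of the
    components, using the percentages in the grading table.
    """
    missing = set(WEIGHTS) - set(scores)
    if missing:
        raise ValueError(f"missing scores for: {sorted(missing)}")
    total_weight = sum(WEIGHTS.values())  # 100 for the table above
    return sum(WEIGHTS[name] * scores[name] for name in WEIGHTS) / total_weight


# Hypothetical scores, for illustration only.
example = {
    "All Homework Combined": 90,
    "Test 1: Probability Theory": 80,
    "Test 2: CE, BLP, & Sampling": 85,
    "Hypothesis Testing Lab": 88,
    "Linear Regression Lab": 92,
    "Participation": 100,
}

print(round(final_score(example), 1))  # 89.6
```

Because the two labs together carry 45% of the weight, a few points gained or lost on a lab move the overall score more than the same change on either test.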
