w203: Statistics for Data Science

UC Berkeley School of Information

The goal of this course is to provide students with a foundational understanding of classical statistics and how it fits within the broader context of data science. Students will learn to apply the most common statistical procedures correctly, checking assumptions and responding appropriately when they appear violated. Emphasis is placed on different practices that constitute an effective analysis, including formulating research questions, operationalizing variables, exploring data, selecting hypothesis tests, and communicating results.

The course begins with an introduction to probability theory, with pencil-and-paper problem sets to build intuition for the mathematical objects that make up statistical models. Next, we describe the use of estimators to learn about model parameters, emphasizing guarantees that hold as sample sizes increase to infinity. We then turn to the logic of hypothesis testing and survey a variety of tests used to compare two groups. Finally, we devote several weeks to classical linear regression, stressing its flexibility as a tool, with special units on building regression models for description and for causal inference. Throughout the course, students will practice analyzing real-world data using the open-source language R.

Course Prerequisites

  • Proficiency with calculus (including an ability to take simple derivatives and integrals)
  • Familiarity with basic matrix operations
  • Ability to write proofs

Weekly Workflow

  • Before Live Session - The first task for each unit is to complete the asynchronous videos and corresponding readings. We recommend that you begin by watching the first videos. As you work forward, you will encounter special "Reading" pages, which direct you to read specific pages in the textbooks. Where possible, we have further organized each week into two or three "sprints." You may want to take a break after each sprint, or tackle each sprint on a separate day. It is important that you complete all videos and readings before live session.

  • During Live Session - In live session, students engage in activities that build upon the asynchronous videos and readings. These include discussions, collaborative problem solving, and programming exercises. We expect that you will dial in from a quiet location, arrive on time, and connect over both audio and video. Most importantly, we ask that you treat all classmates with respect and help us create a supportive learning environment.

  • After Live Session - After live session, students complete the homework for the corresponding unit, as well as any tests or lab assignments.

Required Textbooks

  • The main text for the class is Foundations of Agnostic Statistics, written by Peter M. Aronow and Benjamin T. Miller. A physical copy of the book is available for approximately $30 (Amazon Link, Cambridge University Press Link). A digital copy is also available through the UC Library (link).

  • There is a required course packet, which you may purchase at study.net. This includes select chapters from other textbooks, namely Probability and Statistics, 8th Edition by Devore.

Required Compute Resources

Much of the work for this course can be completed on the UC Berkeley datahub. Students may also run the course materials with the stable rocker/tidyverse Docker image, or through a local install of R and RStudio.

Grading

Grading for the course will follow the rubric below. All components will be graded out of 100 points.

Component                      Percentage
All Homework Combined          25%
Test 1: Probability Theory     10%
Test 2: CE, BLP, & Sampling    10%
Hypothesis Testing Lab         20%
Linear Regression Lab          25%
Participation                  10%
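
As a rough illustration of how the rubric above might combine into a single course score, here is a minimal Python sketch. It assumes the final score is a plain weighted sum of the six components, each scored out of 100. The syllabus does not spell out rounding, dropped assignments, or letter-grade cutoffs, so none of those are modeled. The function name `final_score` and the example scores are hypothetical.

```python
# Weights copied from the grading rubric; each component is scored out of 100.
WEIGHTS = {
    "All Homework Combined": 0.25,
    "Test 1: Probability Theory": 0.10,
    "Test 2: CE, BLP, & Sampling": 0.10,
    "Hypothesis Testing Lab": 0.20,
    "Linear Regression Lab": 0.25,
    "Participation": 0.10,
}


def final_score(scores):
    """Return the weighted course score (0-100).

    Assumption: the course score is a simple weighted sum of the
    per-component scores, with no rounding or dropped items.
    """
    missing = set(WEIGHTS) - set(scores)
    if missing:
        raise ValueError(f"missing scores for: {sorted(missing)}")
    return sum(WEIGHTS[name] * scores[name] for name in WEIGHTS)


# Hypothetical scores, for illustration only.
example_scores = {
    "All Homework Combined": 90,
    "Test 1: Probability Theory": 80,
    "Test 2: CE, BLP, & Sampling": 85,
    "Hypothesis Testing Lab": 88,
    "Linear Regression Lab": 92,
    "Participation": 100,
}

print(round(final_score(example_scores), 2))  # 89.6
```

Under this assumption, the two labs together carry 45% of the score, so a ten-point change on the Linear Regression Lab moves the total by 2.5 points, while the same change on either test moves it by 1 point.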