# Reproducible Research
## Learning Objectives
1.
2.
3.
## Class Announcements
## Roadmap
**Rearview Mirror**
**Today**
**Looking Ahead**
## What data science hopes to accomplish
- As data scientists, our goal is to learn about the world:
  - *Theorists* and *theologians* build systems of explanations that are consistent with themselves
  - *Analysts* build systems of explanations that are consistent with the past
  - *Scientists* build systems of explanations that usefully predict events, **or data**, that have not yet been seen
## Learning from Data
- As data scientists, we learn about the world through the streams of data that **real-world** events produce:
  - Machine processes
  - Political outcomes
  - Customer actions
- The watershed moment in our field has been the profusion of available data, from many sources, richer than at any other point in our past
- In 251 and 266, we place structure on data like audio, video, and text that are *transcendently* rich
- In 261, we bring together flows of data that are generated at massive scale
- In 209, we ask, "How can we take data and produce a *new* form of it that is most effectively understood by the human visual and interactive mind?"
## Data Science and Statistics
- So why statistics?
- And why the way we've chosen to approach statistics in 203?
## Why Statistics? A Closing Argument
- Business, policy, education, and medical decisions are made *by humans* based on data
- A central task, when we observe some pattern in data, is to **infer** whether the pattern will occur in some novel context
- Statistics, as we practice it in 203, allows us to characterize:
  - What we have seen
  - What we *could have seen*
  - Whether any guarantees exist about what we have seen
  - What we can infer about the population
- ... so that we can describe, explain, or predict behavior
## Course Goals
### Course Section III: Purpose-Driven Models
- Statistical models are unknowing transformations of data
  - Because they are built on the foundation of probability, we have certain guarantees about what a model "says"
  - Because they are unknowing, the models themselves know not what they say
  - As data scientists, we bring them to life to achieve our modeling goals
- In Lab 2, we expanded our ability to parse the world using regression, built a model that accomplishes our goals, and did so in a way that lets us test under a *"null"* scenario
- **Key insight**: regression is little more than conditional averages
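To make this concrete, here is a minimal sketch, using simulated data with hypothetical group means, showing that a regression on a binary group indicator recovers exactly the conditional group averages:

```{r}
# Simulated data: two groups with different (hypothetical) means
set.seed(203)
d <- data.frame(
  group = rep(c(0, 1), each = 100),
  y     = c(rnorm(100, mean = 2), rnorm(100, mean = 5))
)

# Conditional averages, computed directly
tapply(d$y, d$group, mean)

# The same quantities via regression: the intercept is the group-0 mean,
# and intercept + slope is the group-1 mean
fit <- lm(y ~ group, data = d)
coef(fit)[1]                 # group-0 mean
coef(fit)[1] + coef(fit)[2]  # group-1 mean
```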
### Course Section II: Sampling Theory and Testing
- Under **very** general assumptions, sample averages follow a predictable, known distribution -- the *Gaussian distribution*
  - This is true even when the underlying probability distribution is *very* complex, or unknown! (The simulation after this list illustrates the point.)
  - Because of this common distribution, we can produce reliable, general tests!
- In Lab 1, we computed simple statistics and used guarantees from sampling theory to **test** whether observed differences were likely to arise under a *"null"* scenario
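A quick simulation makes this concrete. Below is a minimal sketch, using a hypothetical, highly skewed population (exponential), showing that the means of repeated samples are approximately Gaussian even though the underlying distribution is not:

```{r}
# Draw 5,000 samples of size 50 from a skewed (exponential) population
# and look at the distribution of their sample means
set.seed(203)
sample_means <- replicate(5000, mean(rexp(n = 50, rate = 1)))

# The population is skewed, but the sample means are approximately Gaussian
hist(sample_means, breaks = 50,
     main = "Sampling distribution of the mean (n = 50)",
     xlab = "Sample mean")
```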
### Course Section I: Probability Theory
- Probability theory
  - Underlies modeling and regression (Part III);
  - Underlies sampling, inference, and testing (Part II);
  - Underlies **every** model built in **every** corner of data science
We can:
- Model the complex world that we live in using probability theory;
- Move from a probability density function defined in terms of a single variable to one defined in terms of many variables;
- Compute useful summaries -- e.g., the BLP, expected value, and covariance -- even with *highly* complex probability density functions (see the sketch below)
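As one concrete example, the BLP of $Y$ given $X$ is built from exactly these summaries: its slope is $\text{Cov}(X, Y) / V(X)$ and its intercept is $E[Y] - \text{slope} \cdot E[X]$. A minimal sketch with simulated data and hypothetical parameters:

```{r}
# Simulated data with a known linear relationship (hypothetical parameters)
set.seed(203)
x <- runif(10000, min = 0, max = 10)
y <- 3 + 2 * x + rnorm(10000, sd = 4)

# BLP coefficients from expectations alone:
#   slope     = Cov(X, Y) / V(X)
#   intercept = E[Y] - slope * E[X]
slope     <- cov(x, y) / var(x)
intercept <- mean(y) - slope * mean(x)
c(intercept = intercept, slope = slope)  # close to the true (3, 2)
```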
### Statistics as a Foundation for MIDS
- In w203, we hope to have laid a foundation in probability that can be used not only in statistical applications, but also in every other machine learning application you are likely to encounter
## Reproducibility Discussion
**Green Jelly Beans**
<img src="https://imgs.xkcd.com/comics/significant.png" alt="xkcd 882: Significant"/>
**What went wrong here?**
### Discussion
**Status Update**
You have a dataset of the number of Facebook status updates by day of the week. You run 7 different t-tests: one for posts on Monday (versus all other days), one for Tuesday (versus all other days), and so on. Only the test for Sunday is significant, with a p-value of .045, so you throw out the other tests.
Should you conclude that Sunday has a significant effect on number of posts? (How can you address this situation responsibly when you publish your results?)
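A quick simulation can inform the discussion. Here is a minimal sketch, using simulated data in which no day has any true effect, of what running seven tests does to the false-positive rate:

```{r}
# Simulate many "weeks" of pure-noise post counts (no true day effect)
# and check how often at least one of the seven tests is significant
set.seed(203)
one_dataset <- function() {
  posts <- rnorm(700)                  # no day has any real effect
  day   <- rep(1:7, length.out = 700)
  pvals <- sapply(1:7, function(d) {
    t.test(posts[day == d], posts[day != d])$p.value
  })
  any(pvals < .05)
}
mean(replicate(2000, one_dataset()))
# Well above .05 -- if the seven tests were independent, this would approach
# 1 - (1 - .05)^7, roughly .30. Adjusting the p-values, e.g. with
# p.adjust(pvals, method = "bonferroni"), is one responsible way to report.
```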
**Such Update**
As before, you have a dataset of the number of Facebook status updates by day of the week. You do a little EDA and notice that Sunday seems to have more "status updates" than all other days, so you recode your "day of the week" variable into a binary one: Sunday = 1, All other days = 0. You run a t-test and get a p-value of .045. Should you conclude that Sunday has a significant effect on number of posts?
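One way to probe this scenario: simulate data with no day effect at all, pick the most extreme day by EDA as described above, and then test it. A minimal sketch:

```{r}
# Simulate data with no true day effect, pick the "most extreme" day by
# looking at the data (EDA), then test that day -- as in the scenario above
set.seed(203)
peek_then_test <- function() {
  posts <- rnorm(700)
  day   <- rep(1:7, length.out = 700)
  best  <- which.max(tapply(posts, day, mean))  # chosen after peeking
  t.test(posts[day == best], posts[day != best])$p.value < .05
}
mean(replicate(2000, peek_then_test()))  # well above the nominal .05
```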
**Sunday Funday**
Suppose Researcher A tests whether Monday has an effect (versus all other days), Researcher B tests Tuesday (versus all other days), and so forth. Only Researcher G, who tests Sunday, finds a significant effect, with a p-value of .045. Only Researcher G gets to publish her work. If you read the paper, should you conclude that Sunday has a significant effect on number of posts?
**Sunday Repentance**
What if Researcher G above is a sociologist who chooses to measure the effect of Sunday based on years of observing the way people behave on weekends? Researcher G is not interested in the other tests, because Sunday is the interesting day from her perspective, and she would not expect any of the other tests to be significant.
**Decreasing Effect Sizes**
Many observers have noted that as studies yielding statistically significant results are repeated, estimated effect sizes go down and often become insignificant. Why is this the case?