Add flag_imputed_data function #141

tjburch · 2020-10-01T16:37:17Z

I've added a function to utils.py that adds a boolean column to any Statcast DataFrame, flagging rows whose values were possibly imputed as a result of the no-nulls approach used by Statcast in the TrackMan era. For info on the no-nulls approach see here. This boolean can then be used for filtering if desired. The following pictures show launch speed and launch angle before and after filtering on this flag.

The flag is based on three criteria: specific combinations of launch angle, launch speed, and bb_type. The boolean is set to True if and only if all three values of a known combination are matched exactly and simultaneously.
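A minimal sketch of the matching rule described above. The helper name `flag_exact_matches`, the sample rows, and the two parameter sets are illustrative assumptions; they are not the values that were eventually merged.

```python
from collections import namedtuple

import pandas as pd

# Field names follow the namedtuple shown in this PR's diff.
ParameterSet = namedtuple("ParameterSet", ["ev", "angle", "bb_type"])

# Illustrative combinations only, not the derived values used in the library.
impute_combinations = [
    ParameterSet(ev=71.4, angle=36.0, bb_type="fly_ball"),
    ParameterSet(ev=89.2, angle=39.0, bb_type="fly_ball"),
]


def flag_exact_matches(statcast_df: pd.DataFrame) -> pd.DataFrame:
    """Return a copy of the input with a boolean "possible_imputation" column."""
    df_return = statcast_df.copy()
    df_return["possible_imputation"] = False
    for param_set in impute_combinations:
        # All three values must match at once for a row to be flagged.
        matches = (
            (df_return["launch_speed"] == param_set.ev)
            & (df_return["launch_angle"] == param_set.angle)
            & (df_return["bb_type"] == param_set.bb_type)
        )
        df_return["possible_imputation"] |= matches
    return df_return


# Made-up rows: the second one matches on speed and angle but not on bb_type.
sample = pd.DataFrame({
    "launch_speed": [71.4, 71.4, 89.2, 101.3],
    "launch_angle": [36.0, 36.0, 39.0, 12.0],
    "bb_type": ["fly_ball", "popup", "fly_ball", "line_drive"],
})
flagged = flag_exact_matches(sample)

# Filtering on the flag keeps only rows that are not suspected imputations.
clean = flagged[~flagged["possible_imputation"]]
```

Because the comparison is exact, a row that matches only two of the three values (the `popup` row here) is left unflagged.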

schorrm · 2020-10-02T09:07:46Z

This is really good. Thank you very much!

schorrm · 2020-10-02T09:10:16Z

pybaseball/utils.py

+        pd.DataFrame: Copy of original dataframe with "possible_imputation" flag
+    """
+
+    ParameterSet = namedtuple('ParameterSet',"ev angle bb_type")


Can this be changed to: ['ev', 'angle', 'bb_type']? A bit clearer I think.

schorrm · 2020-10-02T09:15:40Z

pybaseball/utils.py

+    for param_set in impute_combinations:
+        bool_logic = (df_return["launch_speed"] == param_set.ev) & (df_return["launch_angle"] == param_set.angle) & (df_return["bb_type"] == param_set.bb_type)
+        df_return["possible_imputation"] = df_return["possible_imputation"] | bool_logic


This is essentially equivalent to a join on launch speed / launch angle / bb_type, except that nothing is actually joined; it only records whether a row would have been joined.
Would making the params into a DataFrame with a column imputed and a left join work?
Is that a better or worse way to do this?

I think a left join should be equivalent. In my pandas experience, joins are more efficient than for loops, so probably the better approach of the two, but I'll give it a test today to check that's the case.
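To make the comparison concrete, here is a sketch that computes the flag both ways on a few made-up rows and checks that the results agree. The lookup values and variable names are assumptions for illustration, not the code in this PR.

```python
import pandas as pd

# Hypothetical lookup table: one row per suspect combination.
df_imputations = pd.DataFrame({
    "ev": [71.4, 89.2],
    "angle": [36.0, 39.0],
    "bb_type": ["fly_ball", "fly_ball"],
})
df_imputations["possible_imputation"] = True

sample = pd.DataFrame({
    "launch_speed": [71.4, 71.4, 89.2, 101.3],
    "launch_angle": [36.0, 36.0, 39.0, 12.0],
    "bb_type": ["fly_ball", "popup", "fly_ball", "line_drive"],
})

# Join-based version: a left merge keeps every original row, and rows
# without a match get NaN in "possible_imputation".
merged = sample.merge(
    df_imputations,
    how="left",
    left_on=["launch_speed", "launch_angle", "bb_type"],
    right_on=["ev", "angle", "bb_type"],
)
flag_from_join = merged["possible_imputation"].notna()

# Loop-based version: OR together one exact-match mask per combination.
flag_from_loop = pd.Series(False, index=sample.index)
for row in df_imputations.itertuples(index=False):
    flag_from_loop |= (
        (sample["launch_speed"] == row.ev)
        & (sample["launch_angle"] == row.angle)
        & (sample["bb_type"] == row.bb_type)
    )

same_result = flag_from_join.tolist() == flag_from_loop.tolist()
```

One caveat with the join: if the lookup table contained the same combination twice, a left merge would duplicate the matching rows, whereas the loop would not.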

schorrm · 2020-10-02T09:16:11Z

Also, can I please encourage you to make a PR with that visualization?

tjburch · 2020-10-02T12:08:45Z

Sure, not a problem. Were you thinking just a short notebook in the EXAMPLES directory? Or something else?

tjburch · 2020-10-02T12:53:30Z

The new push has the changes as indicated. As for merging via a join, performance-wise the two ended up being basically equivalent, but the join is a bit more readable. This screenshot shows timing benchmarks between the original code (temporarily renamed flag_imputed_data_original for the test) and the join version (flag_imputed_data_join). In the commit, the name is the same as before.

tjburch · 2020-10-02T13:45:30Z

Ah - there's definitely a bug in that implementation. Don't merge just yet...

tjburch · 2020-10-02T13:52:06Z

Resolved.

bdilday · 2020-10-03T19:17:24Z

pybaseball/utils.py

+    # Flyout
+    impute_combinations.append(ParameterSet(ev=71.4, angle=36.0, bb_type="fly_ball"))
+    impute_combinations.append(ParameterSet(ev=89, angle=39, bb_type="fly_ball"))
+    impute_combinations.append(ParameterSet(ev=89.2, angle=39.3, bb_type="fly_ball"))


When I look at the Statcast data, I see a lot of values at ev=89.2, angle=39.0, and not at ev=89.2, angle=39.3. I'm wondering if this record with angle=39.3 might be a typo?

In fact, it looks like maybe all the imputed values have angles that are integers?

So that originally came from the tables found here, under "Stringer Fly Balls," then rounded to the precision in the DFs we get. I'll do some more investigating though to see if something was missed.

It looks like they've started rounding all angles to integers?

Ah yeah - so I guess it varies by year... pains :( Maybe this is a product of HawkEye? As I commented below, probably your thought of using a heuristic derivation is the right move.

bdilday · 2020-10-03T19:23:45Z

pybaseball/utils.py

+
+    df_imputations = pd.DataFrame(data=impute_combinations)
+    df_imputations["possible_imputation"] = True
+    df_return = statcast_df.merge(df_imputations, how="left",


Because of the merge here, df_return has columns ev and angle, which the original didn't have and which are either null or identical to launch_speed and launch_angle. It might be better to drop the ev and angle columns before returning, or just name them launch_speed and launch_angle in the impute_combinations data to begin with.
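A sketch of the second option mentioned above: if the lookup table uses the Statcast column names for its keys, the merge can use `on=` and the only new column is the flag. The values and variable names are illustrative assumptions.

```python
import pandas as pd

key_columns = ["launch_speed", "launch_angle", "bb_type"]

# Lookup keys named after the Statcast columns (values are illustrative).
df_imputations = pd.DataFrame(
    [(71.4, 36.0, "fly_ball"), (89.2, 39.0, "fly_ball")],
    columns=key_columns,
)
df_imputations["possible_imputation"] = True

sample = pd.DataFrame({
    "launch_speed": [71.4, 71.4, 89.2, 101.3],
    "launch_angle": [36.0, 36.0, 39.0, 12.0],
    "bb_type": ["fly_ball", "popup", "fly_ball", "line_drive"],
})

# Merging on shared column names adds no duplicate key columns.
df_return = sample.merge(df_imputations, how="left", on=key_columns)

# Unmatched rows come back as NaN; convert the column to a plain boolean.
df_return["possible_imputation"] = df_return["possible_imputation"].notna()

extra_columns = set(df_return.columns) - set(sample.columns)
```

With `left_on`/`right_on` and differently named keys, the `ev` and `angle` columns would also appear in the result and would need an explicit `drop`.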

schorrm · 2020-10-04T10:31:09Z

@tjburch maybe even a library function -- e.g. pybaseball.plot_bb_profile()?

tjburch · 2020-10-04T23:00:44Z

Working through these proposed edits now, thanks for the eyes on the code.

tjburch · 2020-10-04T23:02:51Z

> @tjburch maybe even a library function -- e.g. pybaseball.plot_bb_profile()?

Just committed a pretty generic mock-up of this. One question: are there any preferences about the mpl figure size? I set dpi=300 for now, but can edit it.

TheCleric · 2020-10-05T05:46:29Z

FYI Tests are failing here due to a bug that is fixed in PR #144

schorrm · 2020-10-05T11:54:35Z

@tjburch That looks good, but can you drop the visualization from this PR and submit it separately?

schorrm · 2020-10-05T11:55:38Z

I am thinking, by the way, that it makes sense to expose this primarily to the user as a boolean flag_imputed_data=True in the statcast batting calls, rather than as a separate thingy? Zero implementation change involved; the question is purely in terms of what the docs should have (plus the three LOC of handling the flag in the statcast_batting call).

tjburch · 2020-10-05T12:27:07Z

> @tjburch That looks good, but can you drop the visualization from this PR and submit it separately?

Done

> I am thinking, by the way, that it makes sense to expose this primarily to the user as a boolean flag_imputed_data=True in the statcast batting calls, rather than as a separate thingy?

Sure, we can do that. For what it's worth, the original implementation was loosely motivated from a similar function in baseballr, but we're not held to their approach by any means. I'll get that implementation asap.

bdilday · 2020-10-05T13:32:30Z

> Sure, we can do that. For what it's worth, the original implementation was loosely motivated from a similar function in baseballr, but we're not held to their approach by any means. I'll get that implementation asap.

I'm the person that implemented the baseballr version, BTW. I can say the reason it's a separate function instead of an argument to the statcast scrape function is that it was considered "experimental" and I don't think the baseballr core maintainer(s) wanted it in the core API at that time. Seems like it's no problem to include it as an option in pybaseball.

BTW I opened a PR at baseballr yesterday that adds the script I used to derive the values.

tjburch · 2020-10-05T13:38:47Z

@bdilday Thanks for that! In your code I see:

# use 5 here? some other number? 99.X percentile? this is why I referred to 
# it as a heuristic in the `label_statcast_imputed_data` documentation

And the value 5 was ultimately used. Do you find this heuristic works? I could also imagine using some percentage of the total rows of a given bb_type (e.g. if there are 500 fly balls, probably no more than 5 should share a given EV/LA, but if there are 10,000 maybe that number should go up?)

bdilday · 2020-10-05T13:54:42Z

> @bdilday Thanks for that! In your code I see:
> # use 5 here? some other number? 99.X percentile? this is why I referred to 
> # it as a heuristic in the `label_statcast_imputed_data` documentation
> And the 5 value ultimately used. Do you find this heuristic works? I could also think if you did some percentage of total rows of bb_type (e.g. if there's 500 fly balls, probably no more than 5 should be a given EV/LA, but if there's 10,000 maybe that number should go up?)

It worked for the data set I was using at the time, but yeah it's not appropriate for general usage. I think what you mentioned, setting some threshold of percentage of overall balls, is a great starting point, and it could be refined in subsequent PRs if there's a need.
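One way the percentage threshold discussed above could look in pandas. This is a sketch under stated assumptions, not the derivation used in PR #149 or in baseballr: the function name `find_suspect_combinations`, the `min_fraction` default of 1%, and the synthetic data are all hypothetical.

```python
import pandas as pd

KEYS = ["launch_speed", "launch_angle", "bb_type"]


def find_suspect_combinations(df: pd.DataFrame, min_fraction: float = 0.01) -> pd.DataFrame:
    """List (EV, LA, bb_type) combinations that are unusually common.

    A combination is reported when it accounts for at least `min_fraction`
    of all rows sharing its bb_type. The default is an arbitrary placeholder.
    """
    # Count rows per exact combination and per bb_type.
    combo_counts = df.groupby(KEYS).size().reset_index(name="n")
    bb_totals = df.groupby("bb_type").size().reset_index(name="n_bb_type")

    # Express each combination's count relative to its bb_type total.
    combo_counts = combo_counts.merge(bb_totals, on="bb_type")
    combo_counts["fraction"] = combo_counts["n"] / combo_counts["n_bb_type"]

    suspects = combo_counts[combo_counts["fraction"] >= min_fraction]
    return suspects.reset_index(drop=True)


# Synthetic data: 190 fly balls with distinct values, plus a spike of 10
# identical rows standing in for an imputed combination.
organic = pd.DataFrame({
    "launch_speed": [round(80 + i * 0.1, 1) for i in range(190)],
    "launch_angle": [round(25 + i * 0.1, 1) for i in range(190)],
    "bb_type": "fly_ball",
})
spike = pd.DataFrame({
    "launch_speed": [89.2] * 10,
    "launch_angle": [39.0] * 10,
    "bb_type": "fly_ball",
})
batted_balls = pd.concat([organic, spike], ignore_index=True)

suspects = find_suspect_combinations(batted_balls)
```

A relative threshold scales with the size of the data set, which is the concern raised about a fixed count of 5; the right fraction would still need to be tuned against real Statcast data.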

schorrm · 2020-10-05T14:53:22Z

I will defer to your judgment on this one.

schorrm · 2020-10-06T14:54:23Z

@tjburch @bdilday what's the status on this?

tjburch · 2020-10-06T14:57:24Z

I have a notebook with a more motivated derivation of values - will submit a PR with it in the next hour so the values can be agreed on, then update this PR with those values.

tjburch · 2020-10-07T12:19:46Z

Updated numbers in accordance with the derivation in PR #149

schorrm · 2020-10-09T14:31:38Z

@tjburch is this ready to go?
Also, check out the dev slack -- see #119

tjburch · 2020-10-09T14:32:45Z

@schorrm Yep! Good to go.

Add flag_imputed_data function

95f0a37

schorrm reviewed Oct 2, 2020

Change df joininig to pd.merge, set namedtuple with list

a748088

Make merge left

c43213a

Switch NaN on merge to False, catch a couple missing parameter sets

62acc95

bdilday reviewed Oct 3, 2020

Add plot_bb_profile

9d4abb5

Drop extraneous columns from return dataframe

7783c78

rm plot_bb_profile for seperate PR

58c8400

tjburch mentioned this pull request Oct 5, 2020

Add plot_bb_profile #145

Merged

Merge branch 'master' into flag-imputes

8c26843

tjburch mentioned this pull request Oct 6, 2020

Add derivation notebook for imputed values #149

Merged

tjburch added 2 commits October 7, 2020 07:16

Update to derived parameter sets

26a5557

Merge branch 'master' into flag-imputes

8ab67a5

Add missing type hinting

3567cad

schorrm merged commit f073e0e into jldbc:master Oct 9, 2020
