Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add flag_imputed_data function #141

Merged
merged 11 commits into from
Oct 9, 2020
Merged

Add flag_imputed_data function #141

merged 11 commits into from
Oct 9, 2020

Conversation

tjburch
Copy link
Collaborator

@tjburch tjburch commented Oct 1, 2020

I've added a function to utils.py that will add a flag to any Statcast DataFrame with a boolean flag if it's possibly imputed as a result of the no-nulls approach in StatCast in the TrackMan era. For info on the no-nulls approach see here. This boolean can then be used for filtering if desired, the following pictures are launch speed and launch angle before and after filtering on this flag.

Before filtering

After filtering

The requirements have 3 criteria, specific combinations of launch angle, speed and bb type. If and only if all three criteria are simultaneously exactly matched, the boolean is flipped to true.

@schorrm
Copy link
Collaborator

schorrm commented Oct 2, 2020

This is really good. Thank you very much!

pd.DataFrame: Copy of original dataframe with "possible_imputation" flag
"""

ParameterSet = namedtuple('ParameterSet',"ev angle bb_type")
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can this be changed to: ['ev', 'angle', 'bb_type']? A bit clearer I think.

Comment on lines 163 to 165
for param_set in impute_combinations:
bool_logic = (df_return["launch_speed"] == param_set.ev) & (df_return["launch_angle"] == param_set.angle) & (df_return["bb_type"] == param_set.bb_type)
df_return["possible_imputation"] = df_return["possible_imputation"] | bool_logic
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is essentially equivalent to a join on the LS / LA / BBType, just without joining, only whether it would be joined.
Would making the params into a DataFrame with a column imputed and a left join work?
Is that a better or worse way to do this?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think a left join should be equivalent. In my pandas experience, joins are more efficient than for loops, so probably the better approach of the two, but I'll give it a test today to check that's the case.

@schorrm
Copy link
Collaborator

schorrm commented Oct 2, 2020

Also, can I please encourage you to make a PR with that visualization?

@tjburch
Copy link
Collaborator Author

tjburch commented Oct 2, 2020

Sure, not a problem. Were you thinking just a short notebook in the EXAMPLES directory? Or something else?

@tjburch
Copy link
Collaborator Author

tjburch commented Oct 2, 2020

The new push has the changes as indicated. As far as the merging via join, performance-wise the two ended up being basically equivalent, but it's a bit more readable. This screenshot shows timing benchmarks between the original code (temporarily renamed flag_imputed_data_original for the test) and via join (flag_imputed_data_join). In the commit, the name is the same as before.

Screen Shot 2020-10-02 at 7 40 07 AM

@tjburch
Copy link
Collaborator Author

tjburch commented Oct 2, 2020

Ah - there's definitely a bug in that implementation. Don't merge just yet...

@tjburch
Copy link
Collaborator Author

tjburch commented Oct 2, 2020

Resolved.

# Flyout
impute_combinations.append(ParameterSet(ev=71.4, angle=36.0, bb_type="fly_ball"))
impute_combinations.append(ParameterSet(ev=89, angle=39, bb_type="fly_ball"))
impute_combinations.append(ParameterSet(ev=89.2, angle=39.3, bb_type="fly_ball"))
Copy link
Contributor

@bdilday bdilday Oct 3, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

when I look at the statcast data, I see a lot of values at ev=89.2, angle=39.0. and not ev=89.2, angle=39.3. I'm wondering if this record with angle=39.3 might be a typo?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

in fact it looks like maybe all the imputed values have angle that are integers?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So that originally came from the tables found here, under "Stringer Fly Balls," then rounded to the precision in the DFs we get. I'll do some more investigating though to see if something was missed.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

it looks like they've started rounding all angles to integer?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah yeah - so I guess it varies by year... pains :( Maybe this is a product of HawkEye? As I commented below, probably your thought of using a heuristic derivation is the right move.


df_imputations = pd.DataFrame(data=impute_combinations)
df_imputations["possible_imputation"] = True
df_return = statcast_df.merge(df_imputations, how="left",
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

because of the merge here, df_return has columns ev and angle, which the original didn't have and which are either null or identical to launch_speed, launch_angle. It might be better to drop the ev and angle columns before returning, or just name them launch_speed and launch_angle in the impute_combinations data to begin with.

@schorrm
Copy link
Collaborator

schorrm commented Oct 4, 2020

@tjburch maybe even a library function -- e.g. pybaseball.plot_bb_profile()?

@tjburch
Copy link
Collaborator Author

tjburch commented Oct 4, 2020

Working through these proposed edits now, thanks for the eyes on the code.

@tjburch
Copy link
Collaborator Author

tjburch commented Oct 4, 2020

@tjburch maybe even a library function -- e.g. pybaseball.plot_bb_profile()?

Just committed a pretty generic mock-up of this. One question was if there were any preferences about mpl figure size, I set it as dpi=300 for now, but can edit it.

@TheCleric
Copy link
Contributor

FYI Tests are failing here due to a bug that is fixed in PR #144

@schorrm
Copy link
Collaborator

schorrm commented Oct 5, 2020

@tjburch That looks good, but you can drop the visualization from this PR and submit separately?

@schorrm
Copy link
Collaborator

schorrm commented Oct 5, 2020

I am thinking, by the ways, that it makes sense to expose this primarily to the user as a boolean flag_imputed_data=True in the statcast batting calls, rather than as a separate thingy? Zero implementation change involved, the question is purely in terms of what the docs should have (plus the three LOC of handling the flag in the statcast_batting call).

@tjburch
Copy link
Collaborator Author

tjburch commented Oct 5, 2020

@tjburch That looks good, but you can drop the visualization from this PR and submit separately?

Done

I am thinking, by the ways, that it makes sense to expose this primarily to the user as a boolean flag_imputed_data=True in the statcast batting calls, rather than as a separate thingy?

Sure, we can do that. For what it's worth, the original implementation was loosely motivated from a similar function in baseballr, but we're not held to their approach by any means. I'll get that implementation asap.

@tjburch tjburch mentioned this pull request Oct 5, 2020
@bdilday
Copy link
Contributor

bdilday commented Oct 5, 2020

Sure, we can do that. For what it's worth, the original implementation was loosely motivated from a similar function in baseballr, but we're not held to their approach by any means. I'll get that implementation asap.

I'm the person that implemented the baseballr version, BTW. I can say the reason it's a separate function instead of an argument to the statcast scrape function because it was considered "experimental" and I don't think baseballr core maintainer(s) wanted it in the core API at that time. Seems like it's no problem to include it as option in pybaseball.

BTW I opened a PR at baseballr yesterday that adds the script I used to derive the values,

@tjburch
Copy link
Collaborator Author

tjburch commented Oct 5, 2020

@bdilday Thanks for that! In your code I see:

# use 5 here? some other number? 99.X percentile? this is why I referred to 
# it as a heuristic in the `label_statcast_imputed_data` documentation

And the 5 value ultimately used. Do you find this heuristic works? I could also think if you did some percentage of total rows of bb_type (e.g. if there's 500 fly balls, probably no more than 5 should be a given EV/LA, but if there's 10,000 maybe that number should go up?)

@bdilday
Copy link
Contributor

bdilday commented Oct 5, 2020

@bdilday Thanks for that! In your code I see:

# use 5 here? some other number? 99.X percentile? this is why I referred to 
# it as a heuristic in the `label_statcast_imputed_data` documentation

And the 5 value ultimately used. Do you find this heuristic works? I could also think if you did some percentage of total rows of bb_type (e.g. if there's 500 fly balls, probably no more than 5 should be a given EV/LA, but if there's 10,000 maybe that number should go up?)

It worked for the data set I was using at the time, but yeah it's not appropriate for general usage. I think what you mentioned, setting some threshold of percentage of overall balls, is a great starting point, and it could be refined in subsequent PRs if there's a need.

@schorrm
Copy link
Collaborator

schorrm commented Oct 5, 2020

I will defer to your judgment on this one.

@schorrm
Copy link
Collaborator

schorrm commented Oct 6, 2020

@tjburch @bdilday what's the status on this?

@tjburch
Copy link
Collaborator Author

tjburch commented Oct 6, 2020

I have a notebook with a more motivated derivation of values - will submit a PR with it in the next hour so the values can be agreed on, then update this PR with those values.

@tjburch
Copy link
Collaborator Author

tjburch commented Oct 7, 2020

Updated numbers in accordance with the derivation in PR #149

@schorrm
Copy link
Collaborator

schorrm commented Oct 9, 2020

@tjburch is this ready to go?
Also, check out the dev slack -- see #119

@tjburch
Copy link
Collaborator Author

tjburch commented Oct 9, 2020

@schorrm Yep! Good to go.

@schorrm schorrm merged commit f073e0e into jldbc:master Oct 9, 2020
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants