-
Notifications
You must be signed in to change notification settings - Fork 349
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add flag_imputed_data function #141
Conversation
This is really good. Thank you very much! |
pybaseball/utils.py
Outdated
pd.DataFrame: Copy of original dataframe with "possible_imputation" flag | ||
""" | ||
|
||
ParameterSet = namedtuple('ParameterSet',"ev angle bb_type") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can this be changed to: ['ev', 'angle', 'bb_type']
? A bit clearer I think.
pybaseball/utils.py
Outdated
for param_set in impute_combinations: | ||
bool_logic = (df_return["launch_speed"] == param_set.ev) & (df_return["launch_angle"] == param_set.angle) & (df_return["bb_type"] == param_set.bb_type) | ||
df_return["possible_imputation"] = df_return["possible_imputation"] | bool_logic |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is essentially equivalent to a join on the LS / LA / BBType, just without joining, only whether it would be joined.
Would making the params into a DataFrame with a column imputed
and a left join work?
Is that a better or worse way to do this?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think a left join should be equivalent. In my pandas experience, joins are more efficient than for
loops, so probably the better approach of the two, but I'll give it a test today to check that's the case.
Also, can I please encourage you to make a PR with that visualization? |
Sure, not a problem. Were you thinking just a short notebook in the |
The new push has the changes as indicated. As far as the merging via join, performance-wise the two ended up being basically equivalent, but it's a bit more readable. This screenshot shows timing benchmarks between the original code (temporarily renamed |
Ah - there's definitely a bug in that implementation. Don't merge just yet... |
Resolved. |
pybaseball/utils.py
Outdated
# Flyout | ||
impute_combinations.append(ParameterSet(ev=71.4, angle=36.0, bb_type="fly_ball")) | ||
impute_combinations.append(ParameterSet(ev=89, angle=39, bb_type="fly_ball")) | ||
impute_combinations.append(ParameterSet(ev=89.2, angle=39.3, bb_type="fly_ball")) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
when I look at the statcast data, I see a lot of values at ev=89.2
, angle=39.0
. and not ev=89.2
, angle=39.3
. I'm wondering if this record with angle=39.3
might be a typo?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
in fact it looks like maybe all the imputed values have angle that are integers?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
So that originally came from the tables found here, under "Stringer Fly Balls," then rounded to the precision in the DFs we get. I'll do some more investigating though to see if something was missed.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
it looks like they've started rounding all angles to integer?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ah yeah - so I guess it varies by year... pains :( Maybe this is a product of HawkEye? As I commented below, probably your thought of using a heuristic derivation is the right move.
|
||
df_imputations = pd.DataFrame(data=impute_combinations) | ||
df_imputations["possible_imputation"] = True | ||
df_return = statcast_df.merge(df_imputations, how="left", |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
because of the merge here, df_return
has columns ev
and angle
, which the original didn't have and which are either null or identical to launch_speed
, launch_angle
. It might be better to drop the ev
and angle
columns before returning, or just name them launch_speed
and launch_angle
in the impute_combinations
data to begin with.
@tjburch maybe even a library function -- e.g. |
Working through these proposed edits now, thanks for the eyes on the code. |
Just committed a pretty generic mock-up of this. One question was if there were any preferences about mpl figure size, I set it as dpi=300 for now, but can edit it. |
FYI Tests are failing here due to a bug that is fixed in PR #144 |
@tjburch That looks good, but you can drop the visualization from this PR and submit separately? |
I am thinking, by the ways, that it makes sense to expose this primarily to the user as a boolean |
Done
Sure, we can do that. For what it's worth, the original implementation was loosely motivated from a similar function in baseballr, but we're not held to their approach by any means. I'll get that implementation asap. |
I'm the person that implemented the BTW I opened a PR at |
@bdilday Thanks for that! In your code I see:
And the 5 value ultimately used. Do you find this heuristic works? I could also think if you did some percentage of total rows of bb_type (e.g. if there's 500 fly balls, probably no more than 5 should be a given EV/LA, but if there's 10,000 maybe that number should go up?) |
It worked for the data set I was using at the time, but yeah it's not appropriate for general usage. I think what you mentioned, setting some threshold of percentage of overall balls, is a great starting point, and it could be refined in subsequent PRs if there's a need. |
I will defer to your judgment on this one. |
I have a notebook with a more motivated derivation of values - will submit a PR with it in the next hour so the values can be agreed on, then update this PR with those values. |
Updated numbers in accordance with the derivation in PR #149 |
@schorrm Yep! Good to go. |
I've added a function to utils.py that will add a flag to any Statcast DataFrame with a boolean flag if it's possibly imputed as a result of the no-nulls approach in StatCast in the TrackMan era. For info on the no-nulls approach see here. This boolean can then be used for filtering if desired, the following pictures are launch speed and launch angle before and after filtering on this flag.
The requirements have 3 criteria, specific combinations of launch angle, speed and bb type. If and only if all three criteria are simultaneously exactly matched, the boolean is flipped to true.