Percentile qc - test #162
base: main
Conversation
…f percentiles.db exists
diff.fillna(method='ffill', inplace=True) # forward filling all NaNs!
diff = np.array(diff)
diff_period = np.ones_like(diff) * False
Suggested change:
diff_period = np.ones_like(diff) * False
diff_period = np.zeros_like(diff, dtype='bool')
if sum(abs(diff[i-diff_h:i])) < static_lim:
    diff_period[i-diff_h:i] = True
diff_period = np.array(diff_period).astype('bool')
It is unnecessary to cast to a boolean data type if the array is instantiated as suggested above.
diff_period = np.array(diff_period).astype('bool')
base_path = os.getcwd()
file_path1 = base_path + '/main/src/pypromice/qc/percentiles.db'
Environment-specific paths/variables should not be assigned statically in the function body. Use input parameters or a class instead.
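For illustration, a minimal sketch of what passing the location in as a parameter could look like; the function and parameter names here are placeholders, not the actual signature in this PR:

import os

def percentile_qc(ds, percentile_db_path):
    """Apply percentile-based QC, reading limits from the database at percentile_db_path."""
    if not os.path.isfile(percentile_db_path):
        raise FileNotFoundError(f'Percentile database not found: {percentile_db_path}')
    # ... QC logic using the database ...
    return ds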
Be very careful when adding binary files to a code repository, because it will increase the repository size while being hard to version control.
Alternatively, rely on the committed script to generate the database.
I believe all *.db files are included in .gitignore
else:
    print(f'percentiles.db does not exist running {script_path2}')
    subprocess.call(['python',script_path2])
    file_path = file_path2
All of the above should be implemented in a separate module, and maybe a Python class wrapping the database.
Why use a database instead of just a datafile like NetCDF?
@ladsmund I wrote this original code that uses SQLite. The general idea is to provide processing efficiency. If the percentile distributions (for each variable) are stored in an SQL database, we can extract the percentiles of interest very quickly with an SQL query, rather than opening/reading a static file for each station. SQLite is the simplest database solution, where we can have a small file on disk.
However, this is of more significant benefit for much larger datasets (e.g. tens or hundreds of thousands of stations). A solution using a static file could also be implemented if desired. Either way, you are persisting something on disk between runs that holds the percentile distributions.
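To illustrate the idea (and the earlier suggestion of a Python class wrapping the database), here is a minimal sketch. The table and column names (percentiles, stid, variable, p, value) are hypothetical and not the actual schema in this PR:

import sqlite3

class PercentileDB:
    # Thin wrapper around a SQLite file holding per-station percentile
    # distributions. Table/column names below are illustrative only.

    def __init__(self, db_path):
        self.conn = sqlite3.connect(db_path)

    def get_percentiles(self, stid, variable, p_lo, p_hi):
        # Return {percentile: value} for the requested lo/hi percentiles of
        # one station and one variable, via a single query.
        cur = self.conn.execute(
            'SELECT p, value FROM percentiles '
            'WHERE stid = ? AND variable = ? AND p IN (?, ?)',
            (stid, variable, p_lo, p_hi))
        return dict(cur.fetchall())

    def close(self):
        self.conn.close()

Usage would then be a single call per station and variable, e.g. PercentileDB('percentiles.db').get_percentiles('KAN_M', 't_u', 0.1, 99.9) (station ID and percentile values made up for the example).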
for i,d in enumerate(diff): # algorithm that ensures values can stay the same within the diff_period
    if i > (diff_h-1):
        if sum(abs(diff[i-diff_h:i])) < static_lim:
Consider using mean instead of sum to make the threshold invariant of window size.
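A possible sketch of that change, assuming static_lim would then be rescaled to a per-sample rather than per-window threshold:

# np.mean of the absolute differences is independent of the window length diff_h
if np.mean(np.abs(diff[i-diff_h:i])) < static_lim:
    diff_period[i-diff_h:i] = True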
@@ -141,6 +156,248 @@ def toL2(L1, T_0=273.15, ews=1013.246, ei0=6.1071, eps_overcast=1.,
                       T_0, T_100, ews, ei0)
    return ds

def differenceQC(ds):
It would make sense to add such functions to a separate module like pypromice.qc
Yes, it might make sense to take this out of L1toL2.py and have all qc-related functions in the qc directory? Not sure if there are any implications of moving this other than organizational, but worth considering.
Also, it would be great to add a short description at the top of the docstrings for both differenceQC and percentileQC that gives a basic description of what each function does.
It is definitely something we will do, probably in a future update. A better-described docstring is also a good idea.
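For illustration, short summary lines like the ones below would already help; the wording is a placeholder based on the discussion in this thread, not the final docstrings:

def differenceQC(ds):
    """Flag and remove values that stay static for longer than a threshold period (persistence check)."""
    ...

def percentileQC(ds):
    """Flag and remove outliers that fall outside station-specific percentile limits."""
    ...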
cc = calcCloudCoverage(ds['t_u'], T_0, eps_overcast, eps_clear, # Calculate cloud coverage
                       ds['dlr'], ds.attrs['station_id'])
ds['cc'] = (('time'), cc.data)
# Determine cloud cover for on-ice stations
Wondering why this change for cloud cover is here? This has nothing to do with percentiles I assume?
This is something Baptiste implemented in the meantime, so no, nothing to do with the qc.
@@ -141,6 +156,248 @@ def toL2(L1, T_0=273.15, ews=1013.246, ei0=6.1071, eps_overcast=1.,
                       T_0, T_100, ews, ei0)
    return ds

def differenceQC(ds):
What is differenceQC? Just glancing at the code, this appears to be a persistence check? That is, checking if a value has gotten stuck in a persisting state for a certain period of time. If so, I would rename it to persistenceQC.
Yep, it is a persistence/static check. We will rename it to staticQC.
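For readers of this thread, a rough sketch of what such a persistence/static check does; this is illustrative only (function name and exact windowing are placeholders), not a drop-in replacement for the loop in this PR:

import pandas as pd

def static_check_sketch(values, diff_h, static_lim):
    # Mark samples where the rolling sum of absolute changes over the
    # preceding diff_h samples stays below static_lim, i.e. the signal
    # looks 'stuck'.
    diff = pd.Series(values).diff().abs()
    rolled = diff.rolling(diff_h, min_periods=diff_h).sum()
    return (rolled < static_lim).to_numpy()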
# Define threshold dict to hold limit values, and 'hi' and 'lo' percentile.
# Limit values indicate how far we will go beyond the hi and lo percentiles to flag outliers.
# *_u are used to calculate and define all limits, which are then applied to *_u, *_l and *_i
var_threshold = {
I am noticing that the values in this var_threshold are different from the values in var_threshold in percentileQC. Is this intentional? Was this function used for testing and determining thresholds?
Thanks for spotting that, I will change it back! I tested some different thresholds and apparently did not change them back.
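For context, based on the comments in the snippet above, the dict roughly maps each *_u variable to a limit value plus 'hi' and 'lo' percentiles. The entries and numbers below are made-up placeholders, not the values under discussion in this PR:

var_threshold = {
    # variable: limit beyond the percentiles, plus hi and lo percentile
    't_u':  {'limit': 8.0,  'hi': 99.9,  'lo': 0.1},
    'p_u':  {'limit': 15.0, 'hi': 99.9,  'lo': 0.1},
    'rh_u': {'limit': 10.0, 'hi': 100.0, 'lo': 0.0},
}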
@RasmusBahbah I left a few comments, mostly minor. Overall, I am curious what kind of testing and plotting was done to determine the limit thresholds (the values in var_threshold).
I remember in our initial testing, we were certainly finding many outliers that could be removed with the percentiles check, but we were also finding instances where good data would be removed. I assume there is no tolerance for removing any good data from the historical dataset, so were you able to do enough testing for each station to determine that the limits in this PR are conservative enough to preserve all good data?
Was air temp the only variable that uses seasonal distributions? Do all other variables use the full dataset to calculate percentiles?
Due to my limited time for this review, I would also highly encourage you to get a review from @PennyHow as well. Great that this is being pushed forward!
Thanks for your feedback, @patrickjwright, much appreciated. And yes, air temp is the only variable with seasonal percentiles; the other variables use the whole time series. It will be interesting, in a future update, to test whether monthly percentiles or another configuration could improve the QC, and maybe also to use it on the other variables. I really appreciate your time on this! @ladsmund and I will make some improvements to the structure, the staticQC, and how we test it. After that, we will definitely include @PennyHow for the final review.
ds = differenceQC(ds) # Flag and Remove difference outliers |
Is ds the entire time series?
Can this PR be closed now that #183 is merged? @ladsmund @RasmusBahbah
No description provided.