[WIP] Fixed #606: DAMP Notebook Reproducer / Tutorial #872
base: main
Conversation
Codecov Report
All modified lines are covered by tests ✅

Additional details and impacted files:

@@           Coverage Diff           @@
##             main     #872   +/-   ##
=======================================
  Coverage   98.93%   98.93%
=======================================
  Files          84       84
  Lines       14292    14292
=======================================
  Hits        14140    14140
  Misses        152      152

☔ View full report in Codecov by Sentry.
[update] I improved the performance. I also simplified the code by removing the if/else and applying the necessary changes. I was able to find the top-1 discord in a time series of length ~250k in less than three minutes.
seanlaw commented on 2023-06-12T14:51:50Z:
"A subsequence is A DISCORD"
"left nearest neighbors so AS NOT TO miss, i.e., two rare subsequences that are also similar to each other"
I am purposely trying to avoid using "anomaly" where possible and to try and be precise.

NimaSarajpoor commented on 2023-06-12T16:13:21Z:
Just one question here: did you intentionally remove "catching twin-freak cases"? If yes, should we remove "i.e." as well?

NimaSarajpoor commented on 2023-06-12T16:14:04Z:
I see! Thanks for the input!

seanlaw commented on 2023-06-12T17:24:22Z:
Oops! How about: "left nearest neighbors so AS NOT TO miss the case where you have two rare subsequences that happen to also be similar to each other (i.e., a 'twin freak')"

NimaSarajpoor commented on 2023-06-12T20:30:42Z:
Sounds good. I will revise the code accordingly.
seanlaw commented on 2023-06-12T14:51:51Z:
"interested in finding A DISCORD in an offline setting"
My definition for a "discord" is a "potential anomaly", but one needs to verify whether it is an actual anomaly.

NimaSarajpoor commented on 2023-06-12T16:17:48Z:
Cool!
seanlaw commented on 2023-06-12T14:51:52Z:
"Getting Started" instead of "Import Libraries"

NimaSarajpoor commented on 2023-06-12T16:18:57Z:
Noted. Will revise it accordingly.
seanlaw commented on 2023-06-12T14:51:53Z:
"can compute THE (left) matrix profile"
"that is the start index of the DISCORD"

NimaSarajpoor commented on 2023-06-12T16:20:32Z:
Again, the missing article THE :) Shame on me :D Thanks for catching that.
seanlaw commented on 2023-06-12T14:51:54Z:
Line #27. stumpy.config.STUMPY_EXCL_ZONE_DENOM = 1
I think this should be left OUTSIDE of [...]
excl_zone = stumpy.config.STUMPY_EXCL_ZONE_DENOM

NimaSarajpoor commented on 2023-06-12T16:30:18Z:
I will revise the code accordingly.
Side note: I think I designed the code in a way that should work for any exclusion zone. I need to test it. (The algorithms in the paper are written in a way that only considers [...])
seanlaw commented on 2023-06-12T14:51:55Z:
Line #39. PL_modified = np.where(PL==np.inf, np.NINF, PL)
It feels like we should make this optional rather than only allowing discords to be finite values? Maybe add a parameter of [...]

NimaSarajpoor commented on 2023-06-12T16:36:17Z:
I think the only way one can get a discord with an infinite value (score) is to have at least one non-finite value in the time series. Wouldn't it be better to not add the proposed argument to the API and instead just let the user know regarding this matter? I mean, they can just do: [...]
What do you think?

seanlaw commented on 2023-06-12T17:29:11Z:
I agree. Let's just put a clear note in the description for PL (i.e., all infinite distances are ignored).
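For reference, a minimal self-contained sketch of the masking idea discussed in this thread (the toy PL values are ours; np.NINF is just -np.inf):

import numpy as np

# Toy approximate left matrix profile; one entry is infinite because the
# time series contained a non-finite value there
PL = np.array([1.2, np.inf, 3.4, 2.8])

# Replace infinite distances with -inf so that np.argmax never selects them
PL_modified = np.where(PL == np.inf, np.NINF, PL)
discord_idx = np.argmax(PL_modified)  # -> 2, the largest finite value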
seanlaw commented on 2023-06-12T14:51:55Z:
Line #9. discord_score,
Instead of [...]

NimaSarajpoor commented on 2023-06-12T16:36:53Z:
Will do so.
seanlaw commented on 2023-06-12T14:51:57Z:
Line #10. is_subseq_pruned,
Just call this [...]
seanlaw commented on 2023-06-12T14:51:58Z:
Line #36. query_idx : int
Why have separate [...]

NimaSarajpoor commented on 2023-06-12T20:59:06Z:
You are right! In [...]
seanlaw commented on 2023-06-12T14:51:58Z:
Line #52. lookahead = np.power(2, int(np.ceil(np.log(m) / np.log(2))))
I recommend creating a helper function: import math [...]
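The suggested helper is truncated above; a minimal sketch of what such a next-power-of-two helper might look like (the name next_pow2 is ours, not from the PR):

import math

def next_pow2(m):
    # Smallest power of two that is >= m; equivalent to the original
    # expression np.power(2, int(np.ceil(np.log(m) / np.log(2))))
    return 2 ** math.ceil(math.log2(m))

next_pow2(50)  # -> 64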
seanlaw commented on 2023-06-12T14:52:00Z:
Line #54. start = query_idx
In the DAMP paper, why do they have [...]

NimaSarajpoor commented on 2023-06-12T21:11:49Z:
The paper considers a hardcoded exclusion zone of size [...]. So, the paper computes the distance between [...]. So, I could do something like:
T[i+excl_zone+1 : i+excl_zone+1+lookahead]
My approach is not optimal, particularly if [...].
What do you think?

seanlaw commented on 2023-06-13T00:32:53Z:
I think that sounds reasonable.
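To make the chunk boundaries above concrete, here is a small illustrative sketch (the function name and the clamping to the series length are our assumptions, not the PR's code):

import numpy as np

def forward_chunk_bounds(i, excl_zone, lookahead, n):
    # The lookahead chunk starts just past the exclusion zone of the query
    # at index i, matching T[i+excl_zone+1 : i+excl_zone+1+lookahead] above,
    # and is clamped to the length n of the time series
    start = min(i + excl_zone + 1, n)
    stop = min(start + lookahead, n)
    return start, stop

forward_chunk_bounds(i=100, excl_zone=50, lookahead=64, n=1000)  # -> (151, 215)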
seanlaw commented on 2023-06-12T14:52:00Z:
Line #58. dist_profile = core._mass(
Just use [...]
seanlaw commented on 2023-06-12T14:52:01Z:
Line #72. IDX = start + np.flatnonzero(dist_profile < discord_score)
Replace [...]
mask = np.argwhere(D < bsf) + start

NimaSarajpoor commented on 2023-06-12T21:14:12Z:
Will revise the code accordingly.
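For reference, a toy example of what this pruning-index computation does (D, bsf, and start are made-up values standing in for dist_profile, discord_score, and the chunk offset):

import numpy as np

D = np.array([3.1, 0.9, 2.7, 0.4])  # distances from the query to the lookahead chunk
bsf = 1.0                           # best-so-far discord score
start = 10                          # offset of the chunk within T

# Absolute indices of subsequences that can never be the discord, because
# their left nearest-neighbor distance is already below the best-so-far score
pruned_idx = start + np.flatnonzero(D < bsf)  # -> array([11, 13])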
seanlaw commented on 2023-06-12T14:52:02Z:
Line #121. left_nn_distance : float
Just call it [...]
seanlaw commented on 2023-06-12T14:52:03Z:
Line #124. discord_score : float
Just call it [...]
@NimaSarajpoor This looks great. I've left some initial comments for you to consider.
@seanlaw
seanlaw commented:
Why is this considered "offline/batch"?
Separately, it feels too advanced to start off with splitting the data. You haven't really motivated why this splitting is necessary. Why not just take a time series, say that you are applying DAMP to all of it (without splitting) and then find the discord(s)?
NimaSarajpoor commented:
[I will read the paper to provide the answers to your questions. For now, I am going to provide some explanations that come to my mind as I do not want to forget them later. I will keep this series of comments open.]

I think the only reason for the term "offline" here is that we use the look-ahead (_forward_process) to prune some of the upcoming subsequences (so we know that those subsequences cannot be a discord and hence we do not need to do _backward_process for them). As mentioned in the MATLAB code (and I think in the paper too), when look-ahead is set to 0, i.e. no pruning, then the algorithm becomes online.

"You haven't really motivated why this splitting is necessary. Why not just take a time series, say that you are applying DAMP to all of it (without splitting) and then find the discord(s)?"

[Right! I need to go and check the paper to see what justification they have provided regarding this matter.]

I think one of the reasons is that we are looking for the LEFT discord! So, some of the first subsequences have no non-trivial LEFT neighbour. The very first subsequence that has a non-trivial neighbour cannot be a good candidate, as we do not yet have enough information to decide whether to choose that subsequence as the discord or not.

[I will elaborate more in the upcoming days.]
NimaSarajpoor commented:
Regarding "why offline":

As a follow-up to my previous comment, below I have provided three bullet points mentioned in the DAMP paper regarding lookahead:
- If lookahead is zero, DAMP is a fast true online algorithm.
- If lookahead is allowed to be arbitrarily large, DAMP is an ultrafast batch algorithm.
- Even if lookahead is a small (but non-zero) number, DAMP is effectively an online algorithm.

The paper also says:

"The algorithm as explained in Table 1 is a batch algorithm. To make it an online algorithm, we simply must reduce the size of the lookahead (Table 3, line 1) to the largest delay we are willing to accept (including possibly zero delay)."

The current implementation in the notebook follows the algorithms provided in Tables 1, 2, and 3 of the paper. The current implementation uses the next power of two of m as the value of lookahead and thus it is not small. Hence, it is an offline algorithm.

Regarding "why splitting is necessary":

Personally, I think the term train/test split, as used in the paper, may not make sense. The reason behind my opposition is that the train set (i.e. T[:split_idx]) is not fixed and it will change as we move towards the end of the time series. Note that for the query T[query_idx : query_idx+m], we compare it with all of its left neighbors, i.e. the subsequences in T[:query_idx], and not just the ones in T[:split_idx].

The paper also provides a discussion on the split_idx:

"Recall that Classic DAMP has a parameter called spIndex, which sets the location of the split point between the training and test data in the initial state. When Classic DAMP processes a time series, it assumes that the data before spIndex, T[1:spIndex-1], are normal, which may lead to three issues. First, this causes the algorithm to ignore the potential anomalous behavior present in T[1:spIndex-1], resulting in certain false-negative results. Second, this approach may have the algorithm wasting time searching redundant data. It is possible that the patterns in T[1:spIndex-1] are highly redundant, such as 1,000 heartbeats that are essentially identical. If the heartbeats all have the same pattern, it would suffice for the algorithm to take just one of them to learn; there is no need to consider the same pattern 1,000 times, which will waste a lot of time. Further, it may be difficult for T[1:spIndex-1] to contain every normal pattern, which can cause the algorithm to incorrectly identify normal behavior that does not appear in T[1:spIndex-1] as an anomaly. For example, if T[1:spIndex-1] only contains data on the solar zenith angle during the day, the algorithm may incorrectly identify normal solar zenith angles at night as anomalies. These potential problems can undermine the accuracy and efficiency of the algorithm. Golden DAMP is our proposed solution to the above three problems."

Two notes:
(1) As discussed above, split_idx is a timestamp before which ALL regular behaviors of the system appear with NO anomaly.
(2) As discussed above, the paper handles the split_idx with the algorithm called Golden DAMP. In this algorithm, we have a vector (a time series) called the Golden Batch that contains ALL regular behaviors (without any redundant patterns). We can start from the very first index of the main input T. Although it has no left neighbors, that is not an issue since we do NOT look at the left neighbors anymore! We just compare a query with the Golden Batch.
NimaSarajpoor commented:
Please let me know if further discussion is needed.
seanlaw commented:
I think we need to build up the intuition step-by-step. To me, split_idx seems like an advanced topic and not a parameter that is automatically intuitive to the average user. How can we build up to using split_idx?
NimaSarajpoor commented:
According to the paper, split_idx is problem dependent and the user needs to have some insight about the system and its usual / regular patterns. So, they can say "up to the timestamp split_idx, the data shows [all kinds of] regular behaviors and there is no anomaly."

I added some explanation regarding this matter to the notebook.
@seanlaw
@NimaSarajpoor Can you please merge the latest commits to this PR? I believe that it should fix the failed tests.
NimaSarajpoor commented:
Line #37. IL[:split_idx] = -1  # split_idx or split_idx-m?

I added this comment for our future reference. In the paper, a subsequence is compared with all subsequences in T[:split_idx]; hence, the latest subsequence in the aforementioned chunk is the one that starts at location split_idx - m. Although excl_zone is set to m in the paper, the value split_idx - m should still remain the same even with a different excl_zone [IMO].

Should we think about how the data arrives? [e.g. in batches of size m, etc.]

@NimaSarajpoor: Read the paper regarding this matter. [note to self]
NimaSarajpoor commented:
To improve readability, the manipulation of IL is avoided. Instead, we just start computing PL from split_idx.
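A small illustrative skeleton of that choice (toy values; not the PR's exact code): seed the approximate left matrix profile and only evaluate queries from split_idx onward.

import numpy as np

T = np.random.rand(1000)  # toy time series
m = 50                    # subsequence length
split_idx = 200           # user-supplied split point; T[:split_idx] is assumed normal

l = len(T) - m + 1
PL = np.full(l, np.NINF)  # approximate left matrix profile, seeded with -inf

for query_idx in range(split_idx, l):
    # DAMP's backward (and optional forward) processing would update
    # PL[query_idx] here; no manipulation of IL is needed
    pass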
NimaSarajpoor commented:
Line #60. for _ in range(n_chunks_upperbound):

Originally, there was a while-loop. I replaced it with a for-loop and added a couple of break statements to terminate the loop. Note that the number of iterations should be, at least, equal to the number of chunks in T[:query_idx]. So, one easy answer is query_idx, since it is definitely larger than the number of chunks. But, just to add some sense to the code and make it a bit more readable, I computed a tighter upper bound for the total number of chunks.
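One way such a tighter bound could be derived (this is our reconstruction, not necessarily the bound used in the notebook): chunk sizes double each backward iteration, so k chunks of sizes c, 2c, 4c, ... cover (2**k - 1) * c points in total.

import math

m = 50            # toy subsequence length
query_idx = 5000  # toy query index

c = 2 ** math.ceil(math.log2(m))  # initial chunk size (next power of two >= m)
# Smallest k with (2**k - 1) * c >= query_idx
n_chunks_upperbound = math.ceil(math.log2(query_idx / c + 1))  # -> 7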
NimaSarajpoor commented:
It is now changed to make the code closer to the proposed algorithm.
@NimaSarajpoor How does it feel to return to this code after 2 months? I think this is a good example where you get to critique your own code for readability.
Correct! I am going to put a time gap between my visits to the PR :) Next time, I may provide some other comments with a fresh eye.
NimaSarajpoor commented:
Line #79. chunk_stop = chunk_start + m

In the original backward-processing algorithm proposed in the paper, the chunk's stop index is always i, meaning each new iteration will have a chunk that contains the subsequences of the previous chunk (of the previous iteration). In the MATLAB code, however, they have the optimized version where the algorithm updates the stop index of the chunk to make sure each chunk contains only not-yet-considered subsequences.

% MATLAB code
X_start = i - X + 1 + (expansion_num * SubsequenceLength);
X_end = i - (X/2) + (expansion_num * SubsequenceLength);
approximate_distance = min(real(MASS_V2(T(X_start:X_end), query)));

% Update X and expansion_num at the end of each iteration in the while-loop
X = 2*X;
expansion_num = expansion_num + 1;

As observed in the snippet above, X_end is changing in each iteration.

[Note to self]
Further analysis is needed to understand the reason behind using expansion_num.
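To summarize the backward processing discussed in the last few threads, here is a simplified, self-contained Python sketch (our illustration, not the PR's implementation): chunks grow backward from the query with doubling sizes, non-overlapping as in the MATLAB variant above, and the query is pruned as soon as its best-so-far left-neighbor distance drops below the current discord score. The naive dist_profile helper only exists to keep the example runnable; the PR uses MASS via stumpy's core instead.

import numpy as np

def znorm(x):
    return (x - x.mean()) / x.std()

def dist_profile(Q, T):
    # Naive z-normalized Euclidean distance profile (illustration only)
    m = len(Q)
    Qz = znorm(Q)
    return np.array([np.linalg.norm(Qz - znorm(T[j:j + m]))
                     for j in range(len(T) - m + 1)])

def backward_process(T, m, query_idx, bsf):
    # Returns (left_nn_dist, pruned): pruned is True when the query's left
    # nearest-neighbor distance drops below the best-so-far discord score
    Q = T[query_idx:query_idx + m]
    chunk_size = 2 ** int(np.ceil(np.log2(m)))  # next power of two >= m
    chunk_stop = query_idx - m + 1              # exclusion zone of m, as in the paper
    best = np.inf
    while chunk_stop > 0:
        chunk_start = max(chunk_stop - chunk_size, 0)
        # Pad by m-1 points so the last subsequence ends right before Q
        D = dist_profile(Q, T[chunk_start:chunk_stop + m - 1])
        best = min(best, D.min())
        if best < bsf:
            return best, True     # Q cannot beat the best-so-far discord
        chunk_stop = chunk_start  # move left; chunks do not overlap
        chunk_size *= 2           # and double in size each iteration
    return best, False            # reached the start: best is the true left-NN distance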
I have tried a couple of times before to implement DAMP (see #606) but I noticed a couple of issues. Recently, I decided to give myself another chance and implement DAMP.

Tasks
- Implemented naive function for testing
- Implemented Tables 1, 2, 3 of the DAMP paper
- Reproduce a couple of figures [Data_to_Reproduce_Fig2_Fig3.zip]
- Enhance performance (partially done! Let's look for more opportunities and see the MATLAB code as well)
- Benchmarking
- Implement the remaining tables / different versions of DAMP
Data

The DAMP supporting webpage provides the data used in the paper:
see: https://sites.google.com/view/discord-aware-matrix-profile/dataset?authuser=0

The data used to reproduce Fig. 2 and Fig. 3 of the paper is provided here for convenience:
Data_to_Reproduce_Fig2_Fig3.zip
Some notes from MATLAB / powerpoint slides
- m is used instead of 16*m (Personal opinion: better to avoid 16 unless we find a good justification)
- [...] ignore_constant to API.)
- When subsequence i is pruned: Left_MP(i) = Left_MP(i-1) - 0.00001; Note that this is different than the initial version where we had Left_MP(i) = Left_MP(i-1) (The reason behind subtracting a small value: to prevent a pruned subsequence from having the same score as the real best-so-far discord)
- lookahead can be set as a parameter. Whenever it is 0, then DAMP becomes a pure online algorithm with no pruning
- If the top-1 discord is found at idx, then the second discord is the index of the global maximum in PL[:idx], where PL is the approximate left matrix profile obtained using the DAMP algorithm. (Personal opinion: I think, for a given approx. left matrix profile computed by DAMP, the maximum value for k in top-k discord can be computed after the computation of the approx. left matrix profile. Also, an exclusion zone should be applied as well. So, I am not sure how the user can set k and expect to see exactly k discords in the output. We should take another look at the top-k DAMP MATLAB code)
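Building on the last note, here is a hedged sketch of how top-k discord extraction from a DAMP left matrix profile could look (our illustration; the function name, the exclusion-zone width, and the infinite-value handling are assumptions, and the actual top-k MATLAB code should still be checked as noted above):

import numpy as np

def top_k_discords(PL, m, k, excl_zone=None):
    # Illustrative sketch: repeatedly take the global maximum of the
    # approximate left matrix profile, masking an exclusion zone around
    # each pick so overlapping (trivial) discords are not reported twice
    if excl_zone is None:
        excl_zone = m  # assumption: exclusion zone of m, as in the paper
    P = np.where(np.isinf(PL), np.NINF, PL.astype(float))  # ignore inf scores
    discords = []
    for _ in range(k):
        idx = np.argmax(P)
        if np.isneginf(P[idx]):
            break  # fewer than k discords are available
        discords.append(idx)
        lo, hi = max(idx - excl_zone, 0), min(idx + excl_zone + 1, len(P))
        P[lo:hi] = np.NINF
    return discords

Since pruned subsequences inherit Left_MP(i-1) - 0.00001, ties with the true best-so-far discord are avoided, which keeps this argmax-based extraction well defined.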