
[WIP] Fixed #606: DAMP Notebook Reproducer / Tutorial #872


Open
wants to merge 27 commits into main

Conversation


@NimaSarajpoor NimaSarajpoor commented Jun 11, 2023

I have tried a couple of times before to implement DAMP (see #606), but I ran into a couple of issues. Recently, I decided to give it another try and implement DAMP.

Tasks

  • Implemented a naive function for testing

  • Implemented Tables 1, 2, and 3 of the DAMP paper

  • Reproduce a couple of figures
    [Data_to_Reproduce_Fig2_Fig3.zip]

  • Enhance performance (partially done! Let's look for more opportunities and check the MATLAB code as well)

  • Benchmarking

  • Implement the remaining tables / different versions of DAMP


Data
DAMP's supporting webpage provides the data used in the paper:
see: https://sites.google.com/view/discord-aware-matrix-profile/dataset?authuser=0

The data used to reproduce Fig.2 and Fig.3 of the paper is provided here for convenience:
Data_to_Reproduce_Fig2_Fig3.zip


Some notes from MATLAB / powerpoint slides

  • In most cases, the subsequence length you should use is between about 50% and 90% of a typical period (as a rule of thumb, one may consider 70%).
  • A good initial value of lookahead is about 2^nearest_power_of_two(16 * m). Note that this is different from the initial version, where m is used instead of 16*m. (Personal opinion: better to avoid 16 unless we find a good justification.)
  • (Almost) constant subsequences should be handled with care. (Personal opinion: we may add the argument ignore_constant to the API.)
  • When the subsequence with start index i is pruned: Left_MP(i) = Left_MP(i-1) - 0.00001. Note that this is different from the initial version, where we had Left_MP(i) = Left_MP(i-1). (The reason behind subtracting a small value: to prevent a pruned subsequence from having the same score as the real best-so-far discord; see the sketch after this list.)
  • lookahead can be set as a parameter. When it is 0, DAMP becomes a pure online algorithm with no pruning.
  • Find top-k discords: If the top-1 discord is at index idx, then the second discord is the index of the global maximum in PL[:idx], where PL is the approximate left matrix profile obtained using the DAMP algorithm. (Personal opinion: I think, for a given approximate left matrix profile computed by DAMP, the maximum value of k in top-k discords can only be determined after the approximate left matrix profile has been computed. Also, an exclusion zone should be applied as well. So, I am not sure how the user can set k and expect to see exactly k discords in the output. We should take another look at the top-k DAMP MATLAB code.)
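
A minimal sketch of the lookahead initialization and the pruning bookkeeping described above (hypothetical names, not taken from the notebook):

```python
import numpy as np

def next_pow2(val):
    # Smallest power of two that is >= val
    return int(2 ** np.ceil(np.log2(val)))

m = 50                         # subsequence window length
lookahead = next_pow2(16 * m)  # initial lookahead per the MATLAB code / slides

left_mp = np.zeros(1000)       # stand-in for the approximate left matrix profile
i = 1                          # index of a subsequence that was just pruned
# Subtract a tiny offset so a pruned subsequence can never tie with the
# real best-so-far discord:
left_mp[i] = left_mp[i - 1] - 0.00001
```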



codecov-commenter commented Jun 11, 2023

Codecov Report

All modified lines are covered by tests ✅

Comparison is base (857063c) 98.93% compared to head (df6e201) 98.93%.

Additional details and impacted files
@@           Coverage Diff           @@
##             main     #872   +/-   ##
=======================================
  Coverage   98.93%   98.93%           
=======================================
  Files          84       84           
  Lines       14292    14292           
=======================================
  Hits        14140    14140           
  Misses        152      152           



NimaSarajpoor commented Jun 11, 2023

[update]

  • I made some minor improvements to the code, such as enhancing docstrings, text, etc.

  • Fig. 2 and Fig. 3 of the paper are reproduced.

  • To reproduce the figures and check the answers, we can use link1 and/or link2

  • DAMP's MATLAB code (in particular, the "BackProcessing" part) is a bit different from what is presented in the paper. As mentioned in one of the previous comments in this PR, the "BackProcessing" algorithm provided in the table is not optimized, as it recomputes distances for some previously-analyzed subsequences. The MATLAB code, however, accounts for this: see this MATLAB code

  • DAMP's performance needs to be improved. We can check whether the MATLAB code uses any other logic to improve performance, as it seems to be different from what is presented in the paper. The current notebook in this PR is based on the algorithm provided in the paper. We may also apply this method to a LONG time series, say 500k, and compare its performance against the naive version.

@NimaSarajpoor NimaSarajpoor changed the title Fixed #606: DAMP Notebook Reproducer / Tutorial [WIP] Fixed #606: DAMP Notebook Reproducer / Tutorial Jun 12, 2023

NimaSarajpoor commented Jun 12, 2023

[update]

I improved the performance by:

  • using core._mass instead of core.mass
  • avoiding recomputing distances for the already-analyzed subsequences (this is different from the paper but somewhat similar to the code provided in MATLAB)

Also, I simplified the code by removing an if/else block and applying the necessary changes.

I was able to find the top-1 discord in a time series of length ~250k in less than three minutes.


review-notebook-app bot commented Jun 12, 2023


seanlaw commented on 2023-06-12T14:51:50Z
----------------------------------------------------------------

"A subsequence is A DISCORD"

"left nearest neighbors so AS NOT TO miss, i.e., two rare subsequences that are also similar to each other"

I am purposely trying to avoid using "anomaly" where possible and to try and be precise


NimaSarajpoor commented on 2023-06-12T16:13:21Z
----------------------------------------------------------------

"left nearest neighbors so AS NOT TO miss, i.e., two rare subsequences that are also similar to each other"

just one question here: did you intentionally remove "catching twin-freak cases"? If yes, should we remove "i.e." as well?

NimaSarajpoor commented on 2023-06-12T16:14:04Z
----------------------------------------------------------------

I am purposely trying to avoid using "anomaly" where possible and to try and be precise

I see! Thanks for the input!

seanlaw commented on 2023-06-12T17:24:22Z
----------------------------------------------------------------

just one question here: Did you intentionally remove "catching twin-freak cases" ? If yes, should we remove "i.e." as well?

Oops! How about:

"left nearest neighbors so AS NOT TO miss the case where you have two rare subsequences that happen to also be similar to each other (i.e., a "twin freak")"

NimaSarajpoor commented on 2023-06-12T20:30:42Z
----------------------------------------------------------------

Sounds good. I will revise the code accordingly.


review-notebook-app bot commented Jun 12, 2023


seanlaw commented on 2023-06-12T14:51:51Z
----------------------------------------------------------------

"interested in finding A DISCORD in an offline setting"

My definition for a "discord" is a "potential anomaly" but one needs to verify whether it is an actual anomaly


NimaSarajpoor commented on 2023-06-12T16:17:48Z
----------------------------------------------------------------

My definition for a "discord" is a "potential anomaly" but one needs to verify whether it is an actual anomaly

Cool!


review-notebook-app bot commented Jun 12, 2023


seanlaw commented on 2023-06-12T14:51:52Z
----------------------------------------------------------------

"Getting Started" instead of "Import Libraries"


NimaSarajpoor commented on 2023-06-12T16:18:57Z
----------------------------------------------------------------

Noted. Will revise it accordingly.


review-notebook-app bot commented Jun 12, 2023


seanlaw commented on 2023-06-12T14:51:53Z
----------------------------------------------------------------

"can compute THE (left) matrix profile"

"that is the start index of the DISCORD"


NimaSarajpoor commented on 2023-06-12T16:20:32Z
----------------------------------------------------------------

Again, the missing article THE :) Shame on me :D Thanks for catching that.


review-notebook-app bot commented Jun 12, 2023


seanlaw commented on 2023-06-12T14:51:54Z
----------------------------------------------------------------

Line #27.        stumpy.config.STUMPY_EXCL_ZONE_DENOM = 1

I think this should be left OUTSIDE of naive_DAMP and, instead, in main, you'd do:

excl_zone = stumpy.config.STUMPY_EXCL_ZONE_DENOM

stumpy.config.STUMPY_EXCL_ZONE_DENOM = 1
naive_DAMP(T, m, split_idx)
stumpy.config.STUMPY_EXCL_ZONE_DENOM = excl_zone
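
A slightly more defensive variant of this save/restore pattern (a sketch of mine, not part of the suggestion above) would use try/finally so the global is restored even if naive_DAMP raises:

```python
orig_denom = stumpy.config.STUMPY_EXCL_ZONE_DENOM
stumpy.config.STUMPY_EXCL_ZONE_DENOM = 1
try:
    naive_DAMP(T, m, split_idx)
finally:
    # Always restore the global config, even on an exception
    stumpy.config.STUMPY_EXCL_ZONE_DENOM = orig_denom
```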



NimaSarajpoor commented on 2023-06-12T16:30:18Z
----------------------------------------------------------------

I will revise the code accordingly.

Side note:

I think I designed the code in a way that should work for any exclusion zone. I need to test it. (The algorithms in the paper are written in a way that only considers excl_zone == m. For now, I am trying to be as faithful to the paper as possible. That is why I am setting the excl_zone denom to 1, to reflect the paper's implementation.)


review-notebook-app bot commented Jun 12, 2023


seanlaw commented on 2023-06-12T14:51:55Z
----------------------------------------------------------------

Line #39.        PL_modified = np.where(PL==np.inf, np.NINF, PL)

It feels like we should make this optional rather than only allowing discords to be finite values? Maybe add a parameter of keep_infinite=False.


NimaSarajpoor commented on 2023-06-12T16:36:17Z
----------------------------------------------------------------

I think the only way one can get a discord with an infinite value (score) is to have at least one non-finite value in the time series. Wouldn't it be better to not add the proposed argument to the API and instead just let the user know about this?

I mean they can just do: core.rolling_finite(T[split_index:], m).

What do you think?

seanlaw commented on 2023-06-12T17:29:11Z
----------------------------------------------------------------

I agree. Let's just put a clear note in the description for PL (i.e., all infinite distances are ignored)


review-notebook-app bot commented Jun 12, 2023


seanlaw commented on 2023-06-12T14:51:55Z
----------------------------------------------------------------

Line #9.        discord_score, 

Instead of discord_score, just use bsf so it follows the paper


NimaSarajpoor commented on 2023-06-12T16:36:53Z
----------------------------------------------------------------

Will do so.


review-notebook-app bot commented Jun 12, 2023


seanlaw commented on 2023-06-12T14:51:57Z
----------------------------------------------------------------

Line #10.        is_subseq_pruned,

Just call this pruned



review-notebook-app bot commented Jun 12, 2023


seanlaw commented on 2023-06-12T14:51:58Z
----------------------------------------------------------------

Line #36.        query_idx : int

Why have separate query_idx and start ? Aren't they ALWAYS the same value?


NimaSarajpoor commented on 2023-06-12T20:59:06Z
----------------------------------------------------------------

You are right! In _forward_processing, they are always the same value. I will remove the intermediate variable start.


review-notebook-app bot commented Jun 12, 2023


seanlaw commented on 2023-06-12T14:51:58Z
----------------------------------------------------------------

Line #52.        lookahead = np.power(2, int(np.ceil(np.log(m) / np.log(2))))

I recommend creating a helper function:

import math

def next_pow2(val):
    return int(math.pow(2, math.ceil(math.log2(val))))
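
With this helper, the initial lookahead from the earlier notes could then be computed as (a usage sketch, assuming m is the window length):

```python
lookahead = next_pow2(16 * m)  # or next_pow2(m), as in the initial version
```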




review-notebook-app bot commented Jun 12, 2023


seanlaw commented on 2023-06-12T14:52:00Z
----------------------------------------------------------------

Line #54.        start = query_idx

In the DAMP paper, why do they have start = i + m, while yours is different?


NimaSarajpoor commented on 2023-06-12T21:11:49Z
----------------------------------------------------------------

The paper considers a hardcoded exclusion zone of size m. However, I am trying to be flexible here and consider excl_zone as a parameter/variable.

So, the paper computes the distance between Q = T[i : i+m] and T[i+m : i+m+lookahead]. This makes sense, as we do not want to compare Q with its trivial neighbors (note that the excl zone is m in the paper). As a side note, I think, in STUMPY, we should add +1 to the range of indices, i.e. T[i+m+1 : i+m+1+lookahead].

So, I could do something like T[i+excl_zone+1 : i+excl_zone+1+lookahead]. But I thought it might be better to just consider T[i : i+lookahead] (so Q is basically the first subsequence of this slice) and then, after computing the distance between Q and this slice, I can do:

core.apply_exclusion_zone(distance_profile, 0, ...)

My approach is not optimal, particularly if m is large, because in that case excl_zone will also be large and we would be computing the distance between Q and many of its neighbors that fall within excl_zone. So, I think we may just go with this:

T[i+excl_zone+1 : i+excl_zone+1+lookahead]

What do you think?

seanlaw commented on 2023-06-13T00:32:53Z
----------------------------------------------------------------

I think that sounds reasonable
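
For reference, a minimal sketch of the agreed-upon slicing (hypothetical names; core.mass is used here instead of core._mass so that no precomputed arrays are needed):

```python
import numpy as np
from stumpy import core

# Forward processing: compare Q = T[i : i+m] only against subsequences
# that start after the exclusion zone
start = i + excl_zone + 1
stop = min(start + lookahead, len(T) - m + 1)  # stay within valid start indices
if start < stop:
    D = core.mass(T[i : i + m], T[start : stop + m - 1])
    # Prune future subsequences that are already closer than the best-so-far
    pruned[start + np.flatnonzero(D < bsf)] = True
```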


review-notebook-app bot commented Jun 12, 2023


seanlaw commented on 2023-06-12T14:52:00Z
----------------------------------------------------------------

Line #58.            dist_profile = core._mass(

Just use D in place of dist_profile



review-notebook-app bot commented Jun 12, 2023


seanlaw commented on 2023-06-12T14:52:01Z
----------------------------------------------------------------

Line #72.            IDX = start + np.flatnonzero(dist_profile < discord_score)

Replace IDX with mask

mask = np.argwhere(D < bsf) + start
pruned[mask] = True



NimaSarajpoor commented on 2023-06-12T21:14:12Z
----------------------------------------------------------------

Will revise the code accordingly.


review-notebook-app bot commented Jun 12, 2023


seanlaw commented on 2023-06-12T14:52:02Z
----------------------------------------------------------------

Line #121.        left_nn_distance : float

just call it distance 



review-notebook-app bot commented Jun 12, 2023


seanlaw commented on 2023-06-12T14:52:03Z
----------------------------------------------------------------

Line #124.        discord_score : float

Just call it bsf 



seanlaw commented Jun 12, 2023

@NimaSarajpoor This looks great. I've left some initial comments for you to consider


@NimaSarajpoor
Collaborator Author

@seanlaw
I have added a few notes to the initial post of this PR at the top of this page. These notes are based on the MATLAB code and the slides that are available on the supporting webpage. While there are more notes to add regarding DAMP, I tried to avoid putting them all here.


@seanlaw seanlaw Jun 20, 2023


Why is this considered "offline/batch"?

Separately, it feels too advanced to start off with splitting the data. You haven't really motivated why this splitting is necessary. Why not just take a time series, say that you are applying DAMP to all of it (without splitting) and then find the discord(s)?



Collaborator Author


[I will read the paper to provide the answers to your questions. For now, I am going to provide some explanations that come to my mind as I do not want to forget them later. I will keep this series of comments open]

I think the only reason for the term "offline" here is that we use the look-ahead (_forward_process) to prune some of the upcoming subsequences (so we know that those subsequences cannot be a discord and hence we do not need to run _backward_process for them). As mentioned in the MATLAB code (and I think in the paper too), when the look-ahead is set to 0, i.e. no pruning, the algorithm becomes online.

You haven't really motivated why this splitting is necessary. Why not just take a time series, say that you are applying DAMP to all of it (without splitting) and then find the discord(s)?

[Right! I need to go and check the paper to see what justification they have provided regarding this matter.]

I think one of the reasons is that we are looking for a LEFT discord! So, some of the first subsequences have no non-trivial LEFT neighbour. The very first subsequence that has one non-trivial neighbour cannot be a good candidate, as we haven't got enough information to decide whether to choose that subsequence as a discord or not.

[I will elaborate more in the upcoming days]


@NimaSarajpoor NimaSarajpoor Jun 21, 2023


Regarding "why offline ":

As a follow up to my previous comment, below I have provided three bullet points mentioned in the DAMP paper regarding lookahead

  • If lookahead is zero, DAMP is a fast, true online algorithm.
  • If lookahead is allowed to be arbitrarily large, DAMP is an ultrafast batch algorithm.
  • Even if lookahead is a small (but non-zero) number, DAMP is effectively an online algorithm.

The paper also says:

The algorithm as explained in Table 1 is a batch algorithm. To make it an online algorithm, we simply must reduce the size of the lookahead (Table 3, line 1) to the largest delay we are willing to accept (including possibly zero delay)

The current implementation in the notebook follows the algorithm provided in Tables 1, 2, and 3 of the paper. The current implementation uses the next power of two of m as the value of lookahead, and thus it is not small. Hence, it is an offline algorithm.

Regarding "why splitting is neccary":

Personally, I think the term train/test split, as used in the paper, may not make sense. The reason behind my opposition is that the train set (i.e. T[:split_idx]) is not fixed and it will change as we move towards the end of the time series. Note that for the query T[query_idx : query_idx+m], we compare it with all of its left neighbors, i.e. the subsequences in T[:query_idx], and not just the ones in T[:split_idx].

The paper also provides a discussion of split_idx:

Recall that Classic DAMP has a parameter called spIndex, which sets the location of the split point between the training and test data in the initial state. When Classic DAMP processes a time series, it assumes that the data before spIndex, T[1:spIndex-1] are normal, which may lead to three issues. First, this causes the algorithm to ignore the potential anomalous behavior present in T[1:spIndex-1], resulting in certain false-negative results. Second, this approach may have the algorithm wasting time searching redundant data. It is possible that the patterns in T[1:spIndex-1] are highly redundant, such as 1,000 heartbeats that are essentially identical. If the heartbeats all have the same pattern, it would suffice for the algorithm to take just one of them to learn; there is no need to consider the same pattern 1,000 times, which will waste a lot of time. Further, it may be difficult for T[1:spIndex-1] to contain every normal pattern, which can cause the algorithm to incorrectly identify normal behavior that does not appear in T[1:spIndex-1] as an anomaly. For example, if T[1:spIndex-1] only contains data on the solar zenith angle during the day, the algorithm may incorrectly identify normal solar zenith angles at night as anomalies. These potential problems can undermine the accuracy and efficiency of the algorithm.

Golden DAMP is our proposed solution to the above three problems

Two notes:

(1) As discussed above, split_idx is a timestamp before which ALL regular behaviors of the system appear, with NO anomaly.

(2) As discussed above, the paper handles split_idx with the algorithm called Golden DAMP. In this algorithm, we have a vector (a time series) called the Golden Batch that contains ALL regular behaviors (without any redundant patterns). We can start from the very first index of the main input T. Although it has no left neighbors, that is not an issue since we do NOT look at the left neighbors anymore! We just compare a query with the Golden Batch.
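
In code terms, the Golden DAMP scoring described above might look like this minimal sketch (hypothetical names; golden_batch stands for the curated time series of all regular behaviors):

```python
from stumpy import core

# Score the query against the golden batch instead of its left neighbors
Q = T[i : i + m]
PL[i] = core.mass(Q, golden_batch).min()
```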

Collaborator Author


Please let me know if further discussion is needed.

Contributor


I think we need to build up the intuition step-by-step. To me split_idx seems like an advanced topic and not a parameter that is automatically intuitive to the average user. How can we build up to using split_idx?

Collaborator Author


According to the paper, split_idx is problem-dependent, and the user needs to have some insight about the system and its usual/regular patterns. So, they can say "up to the timestamp split_idx, the data shows [all kinds of] regular behaviors and there is no anomaly".

I added some explanation regarding this matter to the notebook.


NimaSarajpoor commented Jun 23, 2023

@seanlaw
I need to read the notebook one more time. I will then answer some of the remaining comments according to the new notebook and see how you feel.


seanlaw commented Jun 30, 2023

@NimaSarajpoor Can you please merge the latest commits to this PR? I believe that it should fix the failed tests


@NimaSarajpoor NimaSarajpoor Aug 15, 2023


Line #37.        IL[:split_idx] = -1  # split_idx or split_idx-m?

I added this comment for our future reference. In the paper, a subsequence is compared with all subsequences in T[:split_idx]; hence, the latest subsequence in the aforementioned chunk is the one that starts at location split_idx - m. Although excl_zone is set to m in the paper, the value split_idx - m should still remain the same even with a different excl_zone [IMO].

Should we think about how the data arrives? [e.g., in batches of size m, etc.]

@NimaSarajpoor: Read the paper regarding this matter. [note to self]




@NimaSarajpoor NimaSarajpoor Oct 1, 2023


To improve readability, the manipulation of IL is avoided. Instead, we just start computing PL from split_idx.


@NimaSarajpoor NimaSarajpoor Aug 15, 2023


Line #60.        for _ in range(n_chunks_upperbound):    

Originally, there was a while-loop. I replaced it with a for-loop and added a couple of break statements to terminate the loop. Note that the number of iterations should be at least equal to the number of chunks in T[:query_idx]. So, one easy answer is query_idx, since it is definitely larger than the number of chunks. But, just to make the code a bit more readable, I computed a tighter upper bound for the total number of chunks (see the sketch below).
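
For context, since the chunks roughly double in size, one possible tighter upper bound (a sketch of the idea, not the exact expression used in the notebook) is:

```python
import numpy as np

# Chunks roughly double in size, so the number of chunks needed to cover
# T[:query_idx] grows logarithmically rather than linearly
n_chunks_upperbound = int(np.ceil(np.log2(max(query_idx, 2)))) + 1
```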



Collaborator Author


It has now been changed to make the code closer to the proposed algorithm.


seanlaw commented Aug 16, 2023

@NimaSarajpoor How does it feel to return to this code after 2 months? I think this is a good example where you get to critique your own code for readability.

@NimaSarajpoor
Collaborator Author

Correct! I am going to put a time gap between my visits to the PR :) Next time, I may provide some other comments with fresh eyes.


@NimaSarajpoor NimaSarajpoor Oct 1, 2023


Line #79.            chunk_stop = chunk_start + m

In the original backward-processing algorithm proposed in the paper, the chunk's stop index is always i, meaning each new iteration works with a chunk that contains the subsequences of the previous iteration's chunk. In the MATLAB code, however, they have an optimized version where the algorithm updates the chunk's stop index to make sure each chunk contains only not-yet-considered subsequences.

```matlab
% MATLAB code
X_start = i - X + 1 + (expansion_num * SubsequenceLength);
X_end = i - (X/2) + (expansion_num * SubsequenceLength);
approximate_distance = min(real(MASS_V2(T(X_start:X_end), query)));

% Update X and expansion_num at the end of each iteration in the while-loop
X = 2 * X;
expansion_num = expansion_num + 1;
```

As observed in the snippet above, X_end changes in each iteration.

[Note to self]

Further analysis is needed to understand the reason behind using expansion_num
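
A rough, 0-based Python transliteration of the MATLAB expansion logic above (hypothetical sketch; core.mass stands in for MASS_V2, and the initial chunk width, loop condition, and indexing details are assumptions):

```python
import numpy as np
from stumpy import core

def backward_approx_distance(T, m, i, bsf):
    X = 2 * m                  # initial chunk width (assumption)
    expansion_num = 0
    approx_dist = np.inf
    while approx_dist >= bsf:  # keep expanding while i could still be a discord
        X_start = max(i - X + 1 + expansion_num * m, 0)
        X_end = i - (X // 2) + expansion_num * m
        D = core.mass(T[i : i + m], T[X_start : X_end + m])
        approx_dist = D.min()
        if X_start == 0:       # the chunk covers the entire prefix: exact distance
            break
        X = 2 * X              # double the chunk size and shift it so that
        expansion_num += 1     # already-scanned subsequences are skipped
    return approx_dist
```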



