This repository has been archived by the owner on Dec 18, 2019. It is now read-only.

skyline.analyzer.metrics #90

Open
wants to merge 2 commits into master

Conversation

earthgecko

Added functionality to analyzer.py for skyline to feed all of its own metrics
back to graphite. This results in skyline analyzing its own metrics for free :)
The resultant graphite metrics and carbon files (if using whisper and not ceres)
that this provides are (e.g):
skyline/
├── analyzer
│   ├── anomaly_breakdown
│   │   ├── first_hour_average.wsp
│   │   ├── grubbs.wsp
│   │   ├── histogram_bins.wsp
│   │   ├── ks_test.wsp
│   │   ├── least_squares.wsp
│   │   ├── mean_subtraction_cumulation.wsp
│   │   ├── median_absolute_deviation.wsp
│   │   ├── stddev_from_average.wsp
│   │   └── stddev_from_moving_average.wsp
│   ├── duration.wsp
│   ├── exceptions
│   │   ├── Boring.wsp
│   │   └── Stale.wsp
│   ├── projected.wsp
│   ├── run_time.wsp
│   ├── total_analyzed.wsp
│   ├── total_anomalies.wsp
│   └── total_metrics.wsp
└── horizon
    └── queue_size.wsp
There will be more for other exceptions and any further added algorithms; this
is, however, quite trivial in terms of whisper storage and new metric additions.
Modified:
src/analyzer/analyzer.py

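For reference, a minimal sketch of what a send_graphite_metric helper could look like, assuming carbon's plaintext protocol (one line of '<metric path> <value> <timestamp>') on its default TCP port 2003. The function names, host and defaults here are illustrative assumptions, not skyline's actual implementation:

```python
import socket
import time


def format_plaintext_metric(name, value, timestamp=None):
    """Build one line of carbon's plaintext protocol:
    '<metric path> <value> <timestamp>\n'."""
    if timestamp is None:
        timestamp = int(time.time())
    return '%s %s %d\n' % (name, value, timestamp)


def send_graphite_metric(name, value, host='localhost', port=2003):
    """Open a TCP connection to carbon and send a single metric."""
    line = format_plaintext_metric(name, value)
    with socket.create_connection((host, port), timeout=5) as sock:
        sock.sendall(line.encode('utf-8'))
```

Sending one datapoint per metric per analyzer run in this fashion is what keeps the whisper storage cost trivial.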
@earthgecko
Author

This should probably close #89

self.send_graphite_metric('skyline.analyzer.total_metrics', '%d' % len(unique_metrics))
for key, value in exceptions.items():
    send_metric = 'skyline.analyzer.exceptions.%s' % key
    self.send_graphite_metric(send_metric, '%d' % value)
Contributor
Won't this give crazy tracebacks? You don't really want to put that in Graphite as a metric name.

@earthgecko
Author

Hi Abe

No, it seems fine... I hope, although the fact that you raised the question makes me wonder :)

Crazy feedback loop: I do see what you are saying; however, although it may appear so, it does not result in that, because these metrics are actually expected to become anomalous.
So if lots of metrics suddenly went anomalous and stddev_from_moving_average, etc. peaked, then these metrics would in most cases be expected to become anomalous as well, but that is expected and desired.
For instance, for skyline.analyzer.anomaly_breakdown.least_squares to become anomalous, a number of least_squares metrics needed to become anomalous first.

This is not for alerting on per se, but to keep a timeseries set for each of them.

So basically it is taking what the logger is reporting and turning it into metrics:

2014-06-09 15:38:45 :: 26880 :: exception stats   :: {'Boring': 4501, 'Stale': 17}
2014-06-09 15:38:45 :: 26880 :: anomaly breakdown :: {'least_squares': 23, 'histogram_bins': 10, 'stddev_from_average': 25, 'stddev_from_moving_average': 4, 'first_hour_average': 26, 'ks_test': 1, 'median_absolute_deviation': 18, 'grubbs': 20, 'mean_subtraction_cumulation': 26}

So what the logger wrote there gets sent to graphite as:

skyline.analyzer.exceptions.Boring 4501
skyline.analyzer.exceptions.Stale 17
skyline.analyzer.anomaly_breakdown.least_squares 23
skyline.analyzer.anomaly_breakdown.histogram_bins 10

etc
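The dict-to-metric step above can be sketched as a small helper (the function name and prefix argument are illustrative, not the PR's actual code, which builds the names inline):

```python
def breakdown_to_metrics(prefix, counts):
    """Turn a logged dict like {'Boring': 4501, 'Stale': 17} into
    (metric_name, value_string) pairs under the given namespace prefix."""
    return [('%s.%s' % (prefix, key), '%d' % value)
            for key, value in counts.items()]


# What the logger reported, as logged above:
exceptions = {'Boring': 4501, 'Stale': 17}
for name, value in breakdown_to_metrics('skyline.analyzer.exceptions', exceptions):
    print(name, value)
```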

Now, seeing as the exceptions are classes and the anomaly_breakdown keys are defined names, the namespaces should not be messed up by spaces, etc.

And this results in:

[image: skyline analyzer metrics namespace]

And the first crazy skyline feedback graph in the world? :)

[image: skyline analyzer metrics]

@earthgecko
Author

Hi Abe

Having assessed this further, there is some crazy feedback, although I am not certain the feedback exaggeration is any worse than:
+ NUMBER_OF_ALGORITHMS + 1
I was thinking it might be:
+ (NUMBER_OF_ALGORITHMS * NUMBER_OF_ALGORITHMS) + 1, but it does not seem like it. Overlaying total_anomalies over the above graph image and comparing against the past total_anomalies from before the skyline.analyzer.anomaly_breakdown metrics were added, there does not appear to be a great exaggeration in total_anomalies going up. Albeit there is an exaggeration of each skyline.analyzer.anomaly_breakdown metric becoming anomalous when it triggers.

The + 1 is the total_anomalies value itself becoming anomalous, as each anomaly_breakdown item is only one metric sent every analyzer run, as is the total_anomalies count. This would be fairly simple to unexaggerate by adding "skyline.analyzer.anomaly_breakdown" to the SKIP_LIST as a default.
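To make the bound concrete, a small worked example (the algorithm count mirrors the anomaly breakdown logged above; the real number depends on the ALGORITHMS configured in settings):

```python
# Nine algorithms appear in the anomaly_breakdown log excerpt above.
NUMBER_OF_ALGORITHMS = 9

# Each analyzer run feeds back at most one datapoint per
# anomaly_breakdown metric, plus the total_anomalies metric itself,
# so the worst-case extra feedback per run is linear in the number
# of algorithms, not quadratic:
extra_per_run = NUMBER_OF_ALGORITHMS + 1
quadratic_guess = (NUMBER_OF_ALGORITHMS * NUMBER_OF_ALGORITHMS) + 1

print(extra_per_run)    # 10
print(quadratic_guess)  # 82
```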

I know this is quite a fork and not necessarily in the direction of anomaly detection (or is it?); however, it does give one a better operational view and visualisation of what skyline is doing and how things are triggered. It is interesting to see that ks_test appears to be the least CONSENSUSual algorithm, followed by what would appear to be stddev_from_moving_average, with the others all appearing prominently CONSENSUSual. Not sure what that means, but I am damn sure a data scientist, data engineer or R enthusiast may be able to do something with it. by_greatest_surface_area: I wish I could do that simply :) But I am fairly certain someone else somewhere can :)

Maybe good for testing new algorithms as well, if you were not using crucible to do that for some reason and just wanted to visually determine how anomalous an algorithm might be.

Going to add "skyline.analyzer.anomaly_breakdown" to the SKIP_LIST and later do a visual diff, let us see. Can you pause a pull request? There is a feature request... @jnewland ;)

Added skyline.analyzer.anomaly_breakdown. to SKIP_LIST
Modified:
src/settings.py.example
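A rough sketch of how a SKIP_LIST entry like this might be applied, assuming entries are matched as simple substrings of the metric name (an assumption for illustration; the function name here is hypothetical and the real matching lives in horizon, as noted below):

```python
# The default entry this PR adds to src/settings.py.example:
SKIP_LIST = ['skyline.analyzer.anomaly_breakdown.']


def should_skip(metric_name, skip_list=SKIP_LIST):
    """Return True if the metric matches any SKIP_LIST entry
    (simple substring matching is assumed here)."""
    return any(skip in metric_name for skip in skip_list)


should_skip('skyline.analyzer.anomaly_breakdown.grubbs')  # True: recorded but not analyzed
should_skip('skyline.analyzer.total_anomalies')           # False: still analyzed
```

The trailing dot in the entry keeps the match scoped to the anomaly_breakdown sub-namespace, so total_anomalies and the other skyline.analyzer metrics are still analyzed.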
@earthgecko
Author

Hi Abe

The results are in and it appears that there is a small but noticeable difference in total_anomalies when skyline.analyzer.anomaly_breakdown is not in the SKIP_LIST.

When analyzing the skyline.analyzer.anomaly_breakdown metrics, the total_anomalies graph does seem a bit more peaky, but it is very hard to say when you look at longer retention periods.

That said, I see no reason not to have 'skyline.analyzer.anomaly_breakdown.' as a default in the SKIP_LIST in the context of this pull request. Not monitoring the anomaly_breakdown items does not remove the value, as they are all effectively aggregated into the one metric that matters most, total_anomalies, which is monitored/analyzed, and we still have timeseries data on the algorithms, so win/win.

@earthgecko
Author

The plot thickens. There was me thinking that my SKIP_LIST had applied; I forgot that SKIP_LIST does not get applied in analyzer but rather in horizon (which I did not restart), so the visual peaky diff must have just been in the actual anomalies. I just got a load of anomaly_breakdown alerts that I was no longer expecting from skyline (rkhunter is running, so it is actually normal).

So I will update with more info when we have a total_anomalies graph without skyline.analyzer.anomaly_breakdown feedback.

@earthgecko
Author

Hi Abe

The status quo is maintained: in the context of the skyline feedback, with 'skyline.analyzer.anomaly_breakdown.' in the SKIP_LIST applied, the graphs look visually very similar, with no exaggerated differences. Still, skyline not analyzing the skyline.analyzer.anomaly_breakdown timeseries data but just recording it makes more sense.

The total_anomalies timeseries and graph have already helped us identify and redistribute some process-intensive tasks over a longer time period, so as to spread the load out more, so it has added some value.

Therefore in terms of this pull request, that is it :)
