
[Algorithm] Implement incremental clustering of streaming logs based on Drain #5

Open
4 of 6 tasks
Superskyyy opened this issue Jun 27, 2022 · 12 comments
@Superskyyy
Member

Superskyyy commented Jun 27, 2022

After the initial evaluation phase, we would start our log analysis feature by implementing the Drain method.

The goal of this algorithm is to ingest a raw log record stream and produce the most likely match for each log among the learnt template clusters.

  • Drain initial evaluation

  • Drain implementation

    • Base implementation via Drain3
    • DAG implementation (no-go)
  • Drain tuning & experiments

    • Adaptive parameter tuning (no-go)
      Based on the second paper, auto parameter tuning would definitely be a huge plus for our use case; we should pursue implementing it.
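A minimal sketch of the intended ingestion flow, assuming the Drain3 TemplateMiner API from https://github.com/IBM/Drain3 (the result dict keys may differ slightly between Drain3 versions):

```python
# Minimal sketch of the intended flow on top of Drain3's TemplateMiner;
# result keys follow the Drain3 README and may differ between versions.
from drain3 import TemplateMiner

template_miner = TemplateMiner()

def ingest(log_stream):
    """Feed raw log lines into Drain and yield (cluster_id, template) pairs."""
    for line in log_stream:
        result = template_miner.add_log_message(line.rstrip())
        # "change_type" is "cluster_created", "cluster_template_changed" or "none"
        yield result["cluster_id"], result["template_mined"]
```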
@Superskyyy Superskyyy added the "type: feature" (A feature to be implemented) and "Algorithm" (The work is on the algorithm side) labels Jun 27, 2022
@Superskyyy Superskyyy added this to the 0.1.0 milestone Jun 27, 2022
@Superskyyy Superskyyy assigned Superskyyy and unassigned Superskyyy Jun 27, 2022
@Superskyyy
Member Author

FYI @Liangshumin

@Liangshumin
Collaborator

Drain initial

@wu-sheng
Member

Has the Drain method been implemented already? Or are we going to implement it?

@Superskyyy
Member Author

Superskyyy commented Jun 27, 2022

Has the Drain method been implemented already? Or are we going to implement it?

@wu-sheng
There's a neat implementation from IBM (MIT licensed): https://github.com/IBM/Drain3, but we need to do some customization for our integration and enhancements (including an extra cache mechanism and state persistence to improve its performance).

The modification could be more than minimal given the original code base's LOC.

I'm thinking of keeping a fork here in the org for these features, then sometime in the future we can cherry-pick and contribute them back upstream. It seems unnecessary to rewrite it entirely, given the core parts are already very well written and continuously receiving updates.
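For the state-persistence part, a rough sketch of what reusing Drain3's built-in persistence handlers could look like (the file-based handler, state file name, and config path below are placeholders, not the final design):

```python
# Rough sketch of persisting/restoring the Drain state across restarts using
# Drain3's built-in persistence handlers; file names and paths are placeholders.
from drain3 import TemplateMiner
from drain3.file_persistence import FilePersistence
from drain3.template_miner_config import TemplateMinerConfig

config = TemplateMinerConfig()
config.load("drain3.ini")                      # tuning knobs live in the ini file
persistence = FilePersistence("drain3_state.bin")

# The miner snapshots its parse tree through the persistence handler,
# so learnt templates survive a process restart.
template_miner = TemplateMiner(persistence, config)
```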

@Superskyyy
Member Author

Superskyyy commented Jun 27, 2022

An interesting thing I found earlier: Drain has another, updated version (in its journal paper) called DAG Drain. This more recent paper describes an auto parameter tuning method with a DAG implementation; do take a deep look. @Liangshumin
https://arxiv.org/pdf/1806.04356.pdf
There's no publicly available implementation that I know of; it might be a great contribution to the open-source and research communities if we can reproduce it.

@wu-sheng
Member

There's a neat implementation from IBM (MIT licensed): https://github.com/IBM/Drain3, but we need to do some customization for our integration and enhancements (including an extra cache mechanism and state persistence to improve its performance).
The modification could be more than minimal given the original code base's LOC.

Could you share from what perspective we are going to change it? Could we try a git submodule (or a git commit-id lock) to import this repo and apply minimal changes in our repo to rebuild it in the compiling process?
I don't like a fork, as it is hard to control the boundaries of changes.

@wu-sheng
Member

About DAG Drain, it would be great if we implement it under the MIT license too.

@Superskyyy
Member Author

Superskyyy commented Jun 28, 2022

Could you share from what perspective we are going to change it? Could we try a git submodule (or a git commit-id lock) to import this repo and apply minimal changes in our repo to rebuild it in the compiling process? I don't like a fork, as it is hard to control the boundaries of changes.

In the short run (before 0.1.0) I will only ask @Liangshumin to optimize the input layer to use a cache lookup so we can speed it up a bit more. That should be doable by wrapping the library, without needing to submodule it. Anyway, anything can be overridden, so it is not a problem.
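Roughly what that cache layer could look like (a hypothetical sketch; the class and parameter names are placeholders, and a plain hit-cache like this skips Drain3's cluster-size bookkeeping on hits, which the real wrapper would need to handle):

```python
# Hypothetical wrapper illustrating the cache-lookup idea: exact-duplicate log
# lines bypass the Drain parse tree entirely. Names and sizes are placeholders.
from drain3 import TemplateMiner

class CachedTemplateMiner:
    def __init__(self, miner: TemplateMiner, max_cache_size: int = 100_000):
        self.miner = miner
        self.max_cache_size = max_cache_size
        self._cache: dict[str, int] = {}       # raw line -> cluster_id

    def ingest(self, line: str) -> int:
        cluster_id = self._cache.get(line)
        if cluster_id is not None:
            return cluster_id                  # cache hit: no tree traversal
        result = self.miner.add_log_message(line)
        cluster_id = result["cluster_id"]
        if len(self._cache) < self.max_cache_size:
            self._cache[line] = cluster_id
        return cluster_id
```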

In the longer run, I or someone else will work on rewriting some key methods in Cython or parallelizing the tree states with multi-processing. Considering the original algorithm's core is only < 500 lines of code, that amounts to a major rewrite, and I'd like to contribute these things back upstream.

@wu-sheng Btw, please advise how many logs per second a medium-scale SkyWalking deployment typically receives. I don't have a good sense of it, but I'd like to understand the requirements so we can avoid under- or over-optimizing. It's already a very fast algorithm, but it can be extremely fast if we do the above optimizations.

About DAG Drain, it would be great if we implement it under the MIT license too.

Fair enough, since most research reproduction code projects are under MIT.

@wu-sheng
Member

Don't worry about performance. All open source starts from an MVP, and then we run benchmarks to see its capability.
Performance should only be considered from an architecture perspective, such as deployment bottlenecks and scaling capability.

@Superskyyy
Member Author

Superskyyy commented Jun 28, 2022

Don't worry about performance. All open source starts from an MVP, and then we run benchmarks to see its capability. Performance should only be considered from an architecture perspective, such as deployment bottlenecks and scaling capability.

That's true, thank you for the advice. Anyway, I just tested it: the algorithm itself can handle at least 10k+ raw logs per second on a 5-million-record dataset with 200+ patterns, which seems stable enough.

@wu-sheng
Member

BTW, for exporting logs to this engine, we could provide a throughput-limit sampling mechanism, such as 10k/s as the maximum export rate, which could make the AI Engine's payload predictable.
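A sketch of what such a cap could look like on the exporting side (purely illustrative; the class name and the 10k/s figure are placeholders, not a decided design):

```python
# Sketch of a throughput-limit sampler: at most `max_per_second` records are
# forwarded per one-second window; the rest are dropped. Values are examples.
import time

class ThroughputLimiter:
    def __init__(self, max_per_second: int = 10_000):
        self.max_per_second = max_per_second
        self._window_start = time.monotonic()
        self._sent_in_window = 0

    def allow(self) -> bool:
        now = time.monotonic()
        if now - self._window_start >= 1.0:    # start a new one-second window
            self._window_start = now
            self._sent_in_window = 0
        if self._sent_in_window < self.max_per_second:
            self._sent_in_window += 1
            return True
        return False                           # over the cap: sample the record out
```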

@Superskyyy
Member Author

The DAG version of Drain seems unnecessary and would not provide much enhancement.
After extensive experimentation, the Drain auto threshold derivation heuristic does not seem generally applicable (unless I did it wrong), so we stick with the default threshold for now.

I'm formalizing the algorithm implementation in our code base.
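For reference, if we ever revisit the thresholds, they stay exposed through Drain3's config rather than being derived automatically; a hedged sketch (attribute names follow Drain3's TemplateMinerConfig and may differ between versions; the values below are examples, not guaranteed defaults):

```python
# Sketch of where the (currently default) Drain thresholds would be overridden
# if we ever revisit tuning. Attribute names follow Drain3's TemplateMinerConfig
# and may differ between versions; values are examples only.
from drain3 import TemplateMiner
from drain3.template_miner_config import TemplateMinerConfig

config = TemplateMinerConfig()
config.drain_sim_th = 0.4         # similarity threshold for joining a cluster
config.drain_depth = 4            # fixed depth of the Drain parse tree
config.drain_max_children = 100   # max children per internal tree node

template_miner = TemplateMiner(config=config)
```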
