
[Algorithm] Implement incremental clustering of streaming logs based on Drain #5

Open
4 of 6 tasks
Superskyyy opened this issue Jun 27, 2022 · 12 comments
@Superskyyy
Member

Superskyyy commented Jun 27, 2022

After the initial evaluation phase, we would start our log analysis feature by implementing the Drain method.

The goal of this algorithm is to ingest a raw log record stream and produce the most likely match for each log among the learnt template clusters.

  • Drain initial evaluation

  • Drain implementation

    • Base implementation via Drain3
    • DAG implementation (no-go)
  • Drain tuning & experiments

    • Adaptive parameter tuning (no-go)
      Based on the second paper, auto parameter tuning would definitely be a huge plus for our use case; we should pursue implementing it.
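A minimal sketch of the intended ingestion flow, assuming the Drain3 TemplateMiner API from https://github.com/IBM/Drain3 (the result dict keys may differ slightly between Drain3 versions):

```python
# Minimal sketch of the intended flow on top of Drain3's TemplateMiner;
# result keys follow the Drain3 README and may differ between versions.
from drain3 import TemplateMiner

template_miner = TemplateMiner()

def ingest(log_stream):
    """Feed raw log lines into Drain and yield (cluster_id, template) pairs."""
    for line in log_stream:
        result = template_miner.add_log_message(line.rstrip())
        # "change_type" is "cluster_created", "cluster_template_changed" or "none"
        yield result["cluster_id"], result["template_mined"]
```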
@Superskyyy Superskyyy added the "type: feature" (A feature to be implemented) and "Algorithm" (The work is on the algorithm side) labels Jun 27, 2022
@Superskyyy Superskyyy added this to the 0.1.0 milestone Jun 27, 2022
@Superskyyy Superskyyy assigned Superskyyy and unassigned Superskyyy Jun 27, 2022
@Superskyyy
Member Author

FYI @Liangshumin

@Liangshumin
Collaborator

Drain initial

@wu-sheng
Member

Has the Drain method been implemented already? Or are we going to implement it?

@Superskyyy
Member Author

Superskyyy commented Jun 27, 2022

Has the Drain method been implemented already? Or are we going to implement it?

@wu-sheng
There's a neat implementation from IBM (MIT licensed): https://github.com/IBM/Drain3, but we need to do some customization for our integration and enhancements (including an extra cache mechanism and state persistence to improve its performance).

The modification could be more than minimal given the original code base's LOC.

I'm thinking of keeping a fork here in the org for these features, then sometime in the future we can cherry-pick and contribute them back upstream. It seems unnecessary to rewrite it entirely, given the core parts are already very well written and continuously receiving updates.
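For the state-persistence part, a rough sketch of what reusing Drain3's built-in persistence handlers could look like (the file-based handler, state file name, and config path below are placeholders, not the final design):

```python
# Rough sketch of persisting/restoring the Drain state across restarts using
# Drain3's built-in persistence handlers; file names and paths are placeholders.
from drain3 import TemplateMiner
from drain3.file_persistence import FilePersistence
from drain3.template_miner_config import TemplateMinerConfig

config = TemplateMinerConfig()
config.load("drain3.ini")                      # tuning knobs live in the ini file
persistence = FilePersistence("drain3_state.bin")

# The miner snapshots its parse tree through the persistence handler,
# so learnt templates survive a process restart.
template_miner = TemplateMiner(persistence, config)
```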

@Superskyyy
Member Author

Superskyyy commented Jun 27, 2022

An interesting thing I found earlier: Drain has another, updated version (in its journal paper) called DAG Drain. This more recent paper describes an auto parameter tuning method with a DAG implementation; do take a deep look. @Liangshumin
https://arxiv.org/pdf/1806.04356.pdf
There's no publicly available implementation that I know of; it might be a great contribution to the open-source and research communities if we can reproduce it.

@wu-sheng
Member

There's a neat implementation from IBM (MIT licensed): https://github.com/IBM/Drain3, but we need to do some customization for our integration and enhancements (including an extra cache mechanism and state persistence to improve its performance).
The modification could be more than minimal given the original code base's LOC.

Could you share from what perspective we are going to change it? Could we try a git submodule (or a git commit-id lock) to import this repo and apply minimal changes in our repo to rebuild it in the compiling process?
I don't like a fork, as it is hard to control the boundaries of changes.

@wu-sheng
Member

About DAG Drain, it would be great if we implement it under the MIT license too.

@Superskyyy
Member Author

Superskyyy commented Jun 28, 2022

Could you share from what perspective we are going to change it? Could we try a git submodule (or a git commit-id lock) to import this repo and apply minimal changes in our repo to rebuild it in the compiling process? I don't like a fork, as it is hard to control the boundaries of changes.

In the short run (before 0.1.0) I will only ask @Liangshumin to optimize the input layer to use a cache lookup so we can speed it up a bit more. That should be doable by wrapping the library, without needing to submodule it. Anyway, anything can be overridden, so it is not a problem.
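Roughly what that cache layer could look like (a hypothetical sketch; the class and parameter names are placeholders, and a plain hit-cache like this skips Drain3's cluster-size bookkeeping on hits, which the real wrapper would need to handle):

```python
# Hypothetical wrapper illustrating the cache-lookup idea: exact-duplicate log
# lines bypass the Drain parse tree entirely. Names and sizes are placeholders.
from drain3 import TemplateMiner

class CachedTemplateMiner:
    def __init__(self, miner: TemplateMiner, max_cache_size: int = 100_000):
        self.miner = miner
        self.max_cache_size = max_cache_size
        self._cache: dict[str, int] = {}       # raw line -> cluster_id

    def ingest(self, line: str) -> int:
        cluster_id = self._cache.get(line)
        if cluster_id is not None:
            return cluster_id                  # cache hit: no tree traversal
        result = self.miner.add_log_message(line)
        cluster_id = result["cluster_id"]
        if len(self._cache) < self.max_cache_size:
            self._cache[line] = cluster_id
        return cluster_id
```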

In the longer run, I or someone else will work on rewriting some key methods in Cython or parallelizing the tree states with multi-processing. Considering the original algorithm's core is only < 500 lines of code, that amounts to a major rewrite, and I'd like to contribute these things back upstream.

@wu-sheng Btw, please advise how many logs per second a medium-scale SkyWalking deployment typically receives. I don't have a good sense of it, but I'd like to understand the requirements so we can avoid under- or over-optimizing. It's already a very fast algorithm, but it can be extremely fast if we do the above optimizations.

About DAG Drain, it would be great if we implement it under the MIT license too.

Fair enough, since most research reproduction code projects are under MIT.

@wu-sheng
Member

Don't worry about performance. All open source starts from an MVP, and then we run benchmarks to see its capability.
Performance should only be considered from an architecture perspective, such as deployment bottlenecks and scaling capability.

@Superskyyy
Member Author

Superskyyy commented Jun 28, 2022

Don't worry about performance. All open source starts from an MVP, and then we run benchmarks to see its capability. Performance should only be considered from an architecture perspective, such as deployment bottlenecks and scaling capability.

That's true, thank you for the advice. Anyway, I just tested it: the algorithm itself can handle at least 10k+ raw logs per second on a 5-million-record dataset with 200+ patterns, which seems stable enough.

@wu-sheng
Member

BTW, for exporting logs to this engine, we could provide a throughput-limit sampling mechanism, such as 10k/s as the maximum export rate, which could make the AI Engine's payload predictable.
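A sketch of what such a cap could look like on the exporting side (purely illustrative; the class name and the 10k/s figure are placeholders, not a decided design):

```python
# Sketch of a throughput-limit sampler: at most `max_per_second` records are
# forwarded per one-second window; the rest are dropped. Values are examples.
import time

class ThroughputLimiter:
    def __init__(self, max_per_second: int = 10_000):
        self.max_per_second = max_per_second
        self._window_start = time.monotonic()
        self._sent_in_window = 0

    def allow(self) -> bool:
        now = time.monotonic()
        if now - self._window_start >= 1.0:    # start a new one-second window
            self._window_start = now
            self._sent_in_window = 0
        if self._sent_in_window < self.max_per_second:
            self._sent_in_window += 1
            return True
        return False                           # over the cap: sample the record out
```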

@Superskyyy
Member Author

The DAG version of Drain seems unnecessary and would not provide much enhancement.
After extensive experimentation, the Drain auto threshold derivation heuristic does not seem generally applicable (unless I did it wrong), so we stick with the default threshold for now.

I'm formalizing the algorithm implementation in our code base.
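For reference, if we ever revisit the thresholds, they stay exposed through Drain3's config rather than being derived automatically; a hedged sketch (attribute names follow Drain3's TemplateMinerConfig and may differ between versions; the values below are examples, not guaranteed defaults):

```python
# Sketch of where the (currently default) Drain thresholds would be overridden
# if we ever revisit tuning. Attribute names follow Drain3's TemplateMinerConfig
# and may differ between versions; values are examples only.
from drain3 import TemplateMiner
from drain3.template_miner_config import TemplateMinerConfig

config = TemplateMinerConfig()
config.drain_sim_th = 0.4         # similarity threshold for joining a cluster
config.drain_depth = 4            # fixed depth of the Drain parse tree
config.drain_max_children = 100   # max children per internal tree node

template_miner = TemplateMiner(config=config)
```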
