added support for file undeleted when tails input is paused #2254

Open
wants to merge 1 commit into
base: win32-next

sachinmsft

When the tail input is paused for any reason, we destroy the timer that fires tail_fs_check(), and as a result we no longer check whether any file has been deleted.
When Docker deletes a pod, it tries to delete the log file associated with that pod. Fluent Bit still holds an open handle to that log file, and because tail is paused and the tail_fs_check() timer has been destroyed, nothing calls tail_fs_check() to close the log file FD. The pod then gets stuck in the Terminating state.
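
The situation above can be modelled with a minimal sketch. All names here (`tailed_file`, `tail_input`, `fs_check`, `tick`, and the two pause helpers) are hypothetical stand-ins for illustration, not Fluent Bit's actual structures or functions; the sketch only assumes that the deleted-file check is driven by a periodic timer and that pausing removes that timer.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical, simplified model of one tailed file
 * (not Fluent Bit's real struct). */
struct tailed_file {
    bool handle_open;   /* we still hold an open handle / FD   */
    bool deleted;       /* deletion of the file was requested  */
};

/* Hypothetical model of the tail input state. */
struct tail_input {
    bool paused;        /* input is paused                      */
    bool timer_active;  /* timer driving the deleted-file check */
};

/* Stand-in for the periodic check: close the handle of every
 * deleted file. Returns the number of handles closed. */
int fs_check(struct tailed_file *files, size_t n)
{
    int closed = 0;
    for (size_t i = 0; i < n; i++) {
        if (files[i].handle_open && files[i].deleted) {
            files[i].handle_open = false;
            closed++;
        }
    }
    return closed;
}

/* Behaviour described above: pausing destroys the timer. */
void pause_destroying_timer(struct tail_input *in)
{
    in->paused = true;
    in->timer_active = false;
}

/* Assumed alternative: the check keeps running while paused. */
void pause_keeping_timer(struct tail_input *in)
{
    in->paused = true;
    in->timer_active = true;
}

/* One timer period: the check runs only if the timer exists. */
int tick(const struct tail_input *in, struct tailed_file *files, size_t n)
{
    return in->timer_active ? fs_check(files, n) : 0;
}
```

With `pause_destroying_timer()`, a call to `tick()` never reaches `fs_check()`, so a deleted file keeps its handle open; with `pause_keeping_timer()`, the same `tick()` closes it.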

@sachinmsft
Author

@fujimotos Could you please take a look at this PR?

@fujimotos
Member

@sachinmsft I was thinking about this patch last night, but I'm not
convinced by this modification.

The basic problem is that this patch attempts to tune in_tail
to a very special situation where:

  1. Fluent Bit is running on Kubernetes, and
  2. its output plugin is not working at all.

In other cases, this modification makes little sense. Especially,
if your output plugin is working, you do not want this behaviour
at all, because:

  • If Fluent Bit lets a deleted file go while in_tail is paused,
    it results in unrecoverable data loss.

  • In this situation, "locking the file until the output plugin
    catches up" is totally legitimate behaviour, since it is better
    to defer the termination than losing a chunk of data in an
    unrecoverable manner.

The problem occurs (as you describe) when the output plugin is broken,
and we expect it to never catch up. But in that case, we really
should resolve the problem by fixing that broken output plugin,
not by making in_tail lossy.
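
To make the data-loss argument concrete, here is a minimal sketch under stated assumptions. The `tail_progress` struct and `unread_bytes()` are hypothetical names for illustration, not part of in_tail; the sketch assumes only that in_tail tracks how far it has read into a file.

```c
#include <stddef.h>

/* Hypothetical model: bytes present in the log file versus
 * bytes in_tail has already consumed. */
struct tail_progress {
    size_t file_size;    /* total bytes in the log file       */
    size_t read_offset;  /* bytes in_tail has consumed so far */
};

/* Bytes that would be lost for good if the handle to a deleted
 * file were released right now. */
size_t unread_bytes(const struct tail_progress *p)
{
    if (p->file_size > p->read_offset) {
        return p->file_size - p->read_offset;
    }
    return 0;
}
```

If in_tail is paused with `read_offset` at 400 of 1000 bytes, releasing the file at that point drops the remaining 600 bytes; holding the handle until `read_offset` reaches `file_size` drops nothing.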

@sachinmsft
Author

Hi @fujimotos, thanks for taking a look at it.
Yes, I agree with you on this, though in my opinion we should still look for a more appropriate way to fix it.
The problem is that the target service can go down in a cluster for any number of reasons, and as a result Fluent Bit will pause the input plugin. We should not hold the handles to log files indefinitely, because that breaks other scenarios: for example, we cannot delete pods whose log file handles are held by Fluent Bit while the tail input is paused.

We have seen this happen multiple times in our testing when the Elasticsearch instance was unreachable or down.
