Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[WIP] Cleanup Part 2 - Rewrite #1000

Open
wants to merge 22 commits into
base: main
Choose a base branch
from

Conversation

evrardjp
Copy link
Collaborator

@evrardjp evrardjp commented Oct 18, 2024

This is the follow up PR to #990 .
It tackles deeper changes to help on the long term rewrite.

What it does:

  • The main goroutine, cleaning up the code for easier maintenance
  • Remove the usage of logrus, as we can simply use slog
  • Makes the system more unit testable.

It opens the door to the following:

  • Use different locks in the future
  • Move the state to annotations

@evrardjp
Copy link
Collaborator Author

evrardjp commented Oct 27, 2024

This will need a rebase, and reimplementation. But the idea is there. I will update this next week if I have time

@evrardjp
Copy link
Collaborator Author

evrardjp commented Nov 1, 2024

I need to reimplement this from scratch, please ignore my changes from now, until I mention this is ready for review (and [WIP] flag will be removed)

@evrardjp evrardjp force-pushed the second_rewrites branch 6 times, most recently from c73888b to a0d6cec Compare November 7, 2024 18:21
This will help make sure the big refactoring does not break
the main features.

Signed-off-by: Jean-Philippe Evrard <[email protected]>
Extends test coverage to ensure nothing breaks

Signed-off-by: Jean-Philippe Evrard <[email protected]>
For tests not running in different kubernetes versions,
but have different tests subcases/variants, rephrase the wording
"versions" as it is confusing.

Signed-off-by: Jean-Philippe Evrard <[email protected]>
This will replace trim, taking a cutset, with Replace.

This clarifies the intent to remove a substring.

Signed-off-by: Jean-Philippe Evrard <[email protected]>
According to staticcheck, Error strings should not be capitalized (ST1005).

This changes the cases for our errors.

Signed-off-by: Jean-Philippe Evrard <[email protected]>
A few strings have evolved to eventually remove all the templating
part of their strings, yet kept the formatting features.

This is incorrect, and will not pass staticcheck SA1006 and S1039.

Signed-off-by: Jean-Philippe Evrard <[email protected]>
Without this, people like myself will forget to run staticcheck.

This fixes it by making it part of make tests, which will run
with all the fast tests in CI.

Signed-off-by: Jean-Philippe Evrard <[email protected]>
Without this patch, one metric could say "reboot is required"
while the rebootAsRequired tick did not run (long period for
example).

This is a problem, as it leads to misexpectations: "Why
did the system not reboot, while the metrics indicate a reboot
was required".

This solves it by inlining the metrics management within the
rebootAsRequired goroutine.

Closes: kubereboot#725

Signed-off-by: Jean-Philippe Evrard <[email protected]>
No reason to have all the internal tools (timewindows, taints,
locks) as public interfaces.

Should someone be crazy to rebuild its own kured, then they would
need to load our high level tools: the checkers/blockers/rebooters.

Signed-off-by: Jean-Philippe Evrard <[email protected]>
This patches simplifies the principal loop: rebootAsRequired

- Previous patch "Fix discrepency metrics <> rebootAsRequired"
  changed the behaviour of the metrics, ensuring the metrics
  and the checks for reboots are aligned. However, this meant
  that the metrics would not be correct for long periods of
  time if the period is high.

  This patches therefore clarifies the ticking purpose to
  avoid users to set large values, as we have seen in
  multiple questions in the past (ppl were missing that part
  of the documentation, and could find the name "period" a
  misnomer).

- With the ticking clarified, it was the opportunity to remove
  the randomization bit. It was very unlikely, even on large
  clusters, that calls would trigger a storm on the API server.
  Similarly, the delay was not bringing any value.
  By removing those two items, the code can be simplified futher
  by removing the delaytick and its random seed.

- However, by ticking more frequently, we now have more calls
  on the API server, which might have an impact on large
  clusters.

  In order to fix that, I reschuffled the loop to
  quickly know if a reboot is required at each tick,
  which helps a bit reducing some of the calls.

  This can be further improved by sharing node/ds info
  through the calls in the tick in order to make
  the process more efficient.

- A new struct GenericRebooter was introduced to keep data
  all the rebooters need. For example, here, it keeps the
  rebootDelay imposed to the reboot function. Each rebooter
  can implement its own reboot method, but to simplify there
  the generic rebooter implements a DelayReboot() method,
  which is then called from all the Reboot methods of the
  concrete types.

- On the way, this commit also changed the RebootBlocked function,
  to also return the list of the names of (types) blocking the
  reboot.

  This is useful to debug, but also can be useful to later
  expose the content of the blocker into specific metrics.

Signed-off-by: Jean-Philippe Evrard <[email protected]>
Without this, there is no way to tell, by looking at the metrics
how much reboot attempts were done, or why they were blocked.

This is a problem, as only logs are currently usable to
expose the blockers.

This fixes it by introducing a new vector holding as labels
the node names and the reason of the block.

For this, all the blockers have to implement another method
in the RebootBlocker interface, namely MetricLabel(), to
generate a string containing the block reason.

To avoid high cardinality, this is basically a static name
based on the struct.

I did not choose to use a Stringer interface, as I feel
the Stringer interface would make more sense for the blocker
data itself (including its specific details, for example
pod selector string, etc.).

Signed-off-by: Jean-Philippe Evrard <[email protected]>
Wthout this, it will not be possible to use unprivileged
commands.

This is a problem, as one might want to run a command
without the nsenter to pid 1.

This fixes it by exposing this to main, the only thing
remaining is to use a boolean to expose the feature in
a later commit.

Signed-off-by: Jean-Philippe Evrard <[email protected]>
delaytick is not used anymore.

Signed-off-by: Jean-Philippe Evrard <[email protected]>
Without this, we are using logrus and the logs are inconsistent.

This fixes it by moving to slog, and reviewing all the logs.

I added multiple comments to clarify the intent behind the logs.

With structured logging, extra data, such as the node, is exposed
whenever possible, which makes the logging more consistent.

We are not yet using contextes for ticks, it should be done in
a later commit.

Signed-off-by: Jean-Philippe Evrard <[email protected]>
signal is named SIGRTMIN not SIGTRMIN. It can be confusing
for people.

As this could also lead to questions, the code clarifies a
TODO item: We are still relying on hardcoded ints, instead
of evaluating the signal number at run time, as recommended
by man (7) signal. Please read man (7) signal for further
details.

Signed-off-by: Jean-Philippe Evrard <[email protected]>
The current code repeats itself and relies on global vars.
This is a problem, as changing the notifications will move code
all over, and actually prevents us to use a simple Send() method
for all our notifications. This opens the door to a simple sender
Sending events to multiple places (webhooks, logs, and a
notification service)

Signed-off-by: Jean-Philippe Evrard <[email protected]>
Without this, we will rely on old urls forever. It has been
multiple cycles we removed it, we are now in a good place to
remove the vars.

Signed-off-by: Jean-Philippe Evrard <[email protected]>
Now that kured is a community project under the CNCF, it might
be appropriate to use the official kured name, instead of the
old weave.works name. As this is intrusive, there was no
occasion to do it before.

Because other commits in this branch are intrusive (change
of main loop, removing old flags, ...) this is the occasion
to clean up the cruft.

Signed-off-by: Jean-Philippe Evrard <[email protected]>
Without this, the flags are becoming a mess.
This cleans up the order. Not very important.

Signed-off-by: Jean-Philippe Evrard <[email protected]>
Without this, we use the daemonset lock for annotating the state
of the node. This is unnecessary, as we have rights on the node.

This is a problem, as it makes the whole lock model more complex,
and prevents other implementations for lock.

This fixes it by adapting the drain/uncordon functions to rely
on the node annotations, by introducing a new annotation, named
"kured.dev/node-unschedulable-before-drain"

Signed-off-by: Jean-Philippe Evrard <[email protected]>
@evrardjp evrardjp force-pushed the second_rewrites branch 2 times, most recently from a670d69 to b6d3ba5 Compare November 11, 2024 19:58
This reduces the global variables, and regroups all the
operations in a single place. This will allow further
refactor to represent all the k8s operations kured needs
on a single node.

Signed-off-by: Jean-Philippe Evrard <[email protected]>
Without this patch, while developping (on uncommitted work),
the image with previous SHA gets overriden, which leads to
mishaps.

This fixes it by ensuring only the kured:dev tags (full path
and short one) are used everywhere.

At the same time, it cleans up the main_test to be more
flexible by passing more of the main features as options.

Signed-off-by: Jean-Philippe Evrard <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

1 participant