Behavior when dqlite's raft node enters the RAFT_UNAVAILABLE state.
#213
Comments
This will happen more frequently now due to canonical/dqlite#434, which unmasked a class of unrecoverable errors that were ignored in the past (failed applications of raft log entries).
What do you think @freeekanayaka?
First of all, do we have an idea of what errors are being triggered by the FSM? I understand that with the disk-mode feature the error surface area is larger (I/O-related errors, such as out-of-space), but for the in-memory one I don't quite see reasons for failure (perhaps except the
That being said, I'm not entirely sure we should handle this transparently (i.e. perform automatic restarts, either in the
In the in-memory mode case, from dqlite's and raft's perspective this should pretty much be an unrecoverable error: the FSM is supposed to be deterministic, so restarting the node should trigger the exact same problem again. For the disk-mode case, there might be transient I/O errors like disk full, and for those a retry might help. In both cases I'd say that the error is a show stopper that requires a human to look at the situation (for example, you might need to upgrade the dqlite version if there is an unknown command, or free disk space if the disk is full).
So I'm thinking that perhaps the best course of action would be to propagate the error up the stack to the
It's indeed related to disk mode: when the disk is full, SQLite can return an error when opening a database connection. I think what you propose makes more sense than retrying, thank you.
Could I request that we don't panic in the library, as that would prevent the app from taking any remedial or notification action?
As mentioned, perhaps making all wire protocol requests fail is enough then. If error propagation works all the way through, you should see the error with a reasonable explanation (e.g. "out of disk", or "out-of-date dqlite engine") in, say, LXD logs or command line output.
When an unrecoverable error occurs, a raft node can enter the RAFT_UNAVAILABLE state and will never leave it unless the process running the raft node is restarted. From my understanding, the app package does not yet handle this case. Ideally we should detect this unrecoverable state and restart the node.