Quorum commit feature #672
base: master
Are you sure you want to change the base?
Conversation
Generalizes synchronous_mode to multiple nodes being in sync. Trade-off between synchronization overhead and fault tolerance is user selectable via `replication_factor` configuration parameter, specifying how many nodes must contain a commit before it is acknowledged. Cluster will retain self-healing capability if within one HA loop interval up to `replication_factor - 1` nodes fail. The old `synchronous_commit` corresponds to `replication_factor = 2`. On PostgreSQL 10 the quorum commit `ANY k (n)` feature is used. On 9.6 the `k (n)` variant is used picking first nodes. On even older versions maximum `replication_factor` is 2. However, this is still an improvement from status quo as it delegates sync standby selection to PostgreSQL reducing commit latency on sync standby failure. DCS sync state is changed from `{"leader": .., "sync_standby": ..}` to `{"leader": .., "quorum": .., "members": [..]}`, where leader is kept mainly as an informative field.
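As a sketch of the version switch described above (illustrative only: the helper name is hypothetical, quoting of standby names is omitted, and `replication_factor` counts the leader, so PostgreSQL waits for `replication_factor - 1` listed standbys):

```python
def synchronous_standby_names_value(standbys, replication_factor, major_version):
    """Illustrative: build a synchronous_standby_names value per PG version.

    replication_factor counts the leader, so PostgreSQL must wait for
    acknowledgement from replication_factor - 1 listed standbys.
    """
    k = replication_factor - 1
    names = ", ".join(sorted(standbys))
    if major_version >= 100000:
        # PostgreSQL 10+: quorum commit, any k out of the listed standbys
        return "ANY %d (%s)" % (k, names)
    if major_version >= 90600:
        # 9.6: priority-based, the first k listed standbys are synchronous
        return "%d (%s)" % (k, names)
    # Older versions support a single sync standby, so replication_factor
    # is capped at 2 and only one name is meaningful
    return names
```

With `replication_factor = 2` and a single standby this degenerates to the old single-sync-standby behaviour on all versions.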
I've spent a few hours trying to understand how QuorumStateResolver works. Not much success :(
It definitely deserves a few sentences explaining how it is supposed to work.
patroni/postgresql.py
Outdated
if name != self._synchronous_standby_names:
    if name is None:
if state == 'sync' or (state == 'potential' and current is None):
    current = member.name
This branch seems to be unused.
Oops. That should be sync_state and it looks like I forgot to finish the code to make sure that if we can only do one sync standby (version <= 9.5) we don't switch the standby around. Will fix.
patroni/postgresql.py
Outdated
else:
    assert self.name in sync
sync_standbys = sync.difference([self.name])
standby_list = ", ".join(sorted(sync_standbys)) if sync_standbys else "*"
Lost quote_ident.
Thanks for catching this. I will add it back in.
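For context on why the quoting matters, a minimal sketch of SQL identifier quoting (a hypothetical helper, not the actual one from Patroni): standby names appear in `synchronous_standby_names` as identifiers, so names containing spaces, commas, uppercase or quote characters must be double-quoted:

```python
def quote_ident(name):
    # Standard SQL identifier quoting: wrap in double quotes and double any
    # embedded double quotes. Real code may skip quoting for plain names.
    return '"' + name.replace('"', '""') + '"'

standby_list = ", ".join(quote_ident(s) for s in ["node 1", 'odd"name'])
```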
patroni/postgresql.py
Outdated
current = cluster.sync.sync_standby
current = current.lower() if current else current
sync_standby_names = self.query("SHOW synchronous_standby_names")
result = parse_sync_standby_names(sync_standby_names)
Why do we need to parse it all the time? Is it some kind of protection from unexpected changes from outside?
Basically yes. It helps us automatically resolve the situation if due to a bug or administrator intervention our understanding of the state goes out of sync. If we aren't too worried about that we could cache the state and avoid spamming the database with unnecessary queries. What do you think?
The problem is that it could be changed with 'ALTER SYSTEM' which takes precedence over postgresql.conf :(
We could detect that from pg_settings. Now the question is what should the system do in that situation? Override the administrator with `ALTER SYSTEM RESET synchronous_standby_names`, make the quorum state match the synchronization state, just complain in the logs where nobody is looking anyway, or panic with `sys.exit(1)`?
Looks like executing `ALTER SYSTEM RESET synchronous_standby_names` is the best we can do. In addition, we can write a warning with the previous value to the log.
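A sketch of the detection discussed above, split into a pure decision helper so the idea is testable (the column names come from the standard `pg_settings` view; the SQL in the comments is the reset proposed above, and the helper name is hypothetical):

```python
def overridden_by_alter_system(setting, expected, sourcefile):
    # True when the effective value differs from what Patroni configured and
    # it comes from postgresql.auto.conf, i.e. was set via ALTER SYSTEM
    # (which takes precedence over postgresql.conf).
    return (setting != expected and bool(sourcefile)
            and sourcefile.endswith('postgresql.auto.conf'))

# Against a live connection this would be driven by something like:
#   SELECT setting, sourcefile FROM pg_settings
#    WHERE name = 'synchronous_standby_names';
# and, when the helper returns True, log a warning with the old value and run:
#   ALTER SYSTEM RESET synchronous_standby_names;
#   SELECT pg_reload_conf();
```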
# First get to requested replication factor
increase_numsync_by = self.sync_wanted - self.numsync
if increase_numsync_by:
    add = set(sorted(to_add)[:increase_numsync_by])
This one I don't really understand. It is responsible for https://github.com/zalando/patroni/pull/672/files#diff-57e7f41a6358e844d5245ea7b8409311R12

self.assertEquals(list(QuorumStateResolver(1, set("a"), 1, set("a"), active=set("abcde"), sync_wanted=3)), [
    ('sync', 3, set('abc')),
    ('quorum', 3, set('abcde')),
    ('sync', 3, set('abcde')),
])

Why does it first add only 'b' and 'c'? Why not all four at the same time? Something like:

    ('sync', 3, set('abcde')),
    ('quorum', 3, set('abcde')),
Yes, that would be a correct optimization. Currently the algorithm doesn't take into account that the master is special and always guaranteed to be one of the synced nodes. This may lead to extra steps when adding multiple nodes at the same time. On the plus side this allows for promotion before we update sync state in DCS. I need to think a bit about how the algorithm would look like if we consider that master to be special. If it makes it considerably more complicated than it already is, then it might not be worth the minor improvement in efficiency. I will also write down how and why the QuorumStateResolver works.
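To make this concrete, the transitions from the quoted test can be replayed against a restatement of the overlap invariant from `check_invariants` (`overlap_holds` is a hypothetical helper, not Patroni code). Every intermediate step keeps any quorum of voters intersecting any numsync synced nodes, while the shorter sequence suggested in the review would fail this check as written (after `('sync', 3, set('abcde'))` we would have 1 + 3 = 4 <= 5); treating the master as special is what would make the shortcut safe:

```python
def overlap_holds(quorum, voters, numsync, sync):
    # Any quorum-sized subset of voters intersects any numsync-sized subset
    # of sync iff the two counts together exceed the size of the pool.
    return quorum + numsync > len(voters | sync)

# Start state from the test: quorum=1, voters={'a'}, numsync=1, sync={'a'}
quorum, voters, numsync, sync = 1, set('a'), 1, set('a')
for kind, num, nodes in [('sync', 3, set('abc')),
                         ('quorum', 3, set('abcde')),
                         ('sync', 3, set('abcde'))]:
    if kind == 'sync':
        numsync, sync = num, nodes
    else:
        quorum, voters = num, nodes
    assert overlap_holds(quorum, voters, numsync, sync)

# The proposed two-step shortcut violates the invariant on its first step:
assert not overlap_holds(1, set('a'), 3, set('abcde'))
```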
    return
logger.info("Synchronous standby status assigned to %s", picked)
elif transition == 'sync':
    logger.info("Setting synchronous replication to %d of %d (%s)",
This message is misleading when `num < min_sync`.
Will correct.
sync_wanted=sync_wanted):
if transition == 'quorum':
    logger.info("Setting quorum to %d of %d (%s)", num, len(nodes), ", ".join(sorted(nodes)))
    if not self.dcs.write_sync_state(leader=self.state_handler.name, quorum=num, members=list(nodes),
`write_sync_state`: this method should also be changed, no? Here you are passing `leader`, `quorum`, `members` and `index` as parameters, but the signature of the method wasn't changed in this pull request:

def write_sync_state(self, leader, sync_standby, index=None):

@ants Is there any chance you are getting back to work on this feature?
patroni/postgresql.py
Outdated
""" | ||
current = cluster.sync.sync_standby | ||
current = current.lower() if current else current | ||
sync_standby_names = self.query("SHOW synchronous_standby_names") |
As I can see here https://github.com/zalando/patroni/blob/master/patroni/postgresql.py#L1509, `self.query` will return a cursor, not the result of the query execution. And after that you are passing the cursor to `sync_rep_parser_re.finditer`; that's not acceptable.
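A sketch of the point being made, with a stand-in cursor (hypothetical classes; `self.query` returns a DB-API cursor, so the value must be fetched from it before any regex parsing):

```python
class FakeCursor:
    # Stand-in for the DB-API cursor returned by self.query
    def fetchone(self):
        return ('2 (node1, node2)',)

def current_synchronous_standby_names(query):
    # Fetch the actual GUC value from the cursor instead of passing the
    # cursor object itself to sync_rep_parser_re.finditer
    return query("SHOW synchronous_standby_names").fetchone()[0]

value = current_synchronous_standby_names(lambda sql: FakeCursor())
```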
Thanks for reviewing. I plan to resume work on this in about 1-2 weeks time
and any review effort on the code will be helpful, as will be testing once
the code is in a testable state.
I have done some changes and this feature is working as a prototype, but without usage of the new variables. Also it's pretty hard to understand the purpose of the … Is my solution acceptable? Or should I just fix everything in the current PR?
Getting closer to something I am happy with. Work still to be done:
I should be able to pick this up sometime next week, publishing what I have so far to solicit feedback.
Guys, any updates on this PR?
I also found a small corner case: when I started a new single-node cluster with `{synchronous_mode: true, replication_factor: 2, minimum_replication_factor: 2}` in `bootstrap.dcs`, it didn't set `synchronous_standby_names`.
@@ -309,15 +309,15 @@ def is_failover_possible(self, cluster, leader, candidate, action):
 if leader and (not cluster.leader or cluster.leader.name != leader):
     return 'leader name does not match'
 if candidate:
-    if action == 'switchover' and cluster.is_synchronous_mode() and cluster.sync.sync_standby != candidate:
-        return 'candidate name does not match with sync_standby'
+    if action == 'switchover' and cluster.is_synchronous_mode() and cluster.sync.matches(candidate):
and not cluster.sync.matches(candidate):
@@ -47,6 +47,8 @@ class Config(object):
'master_start_timeout': 300,
'synchronous_mode': False,
'synchronous_mode_strict': False,
'replication_factor': 2,
I don't really understand the status of `synchronous_mode` and `synchronous_mode_strict`. On one hand they seem redundant, but on the other they are still used in ha.py and postgresql.py.
patroni/postgresql.py
Outdated
@property
def use_multiple_sync(self):
    return self._major_version >= 90600
Line 304 references `synchronous_mode_strict`, which has more or less become deprecated.
#
# Regardless of voting, if we observe a node that can become a leader and is ahead, we defer to that node.
# This can lead to failure to act on quorum if there is asymmetric connectivity.
quorum_votes = 1 if self.state_handler.name in voting_set else 0
It would also be good to add some info output here about the current quorum and voting set. It would make potential investigations easier.
Right now such info output is only available on the master when it processes topology changes, but it might happen that logs from the master are unrecoverable.
def quorum_update(self, quorum, voters):
    if quorum < 1:
        raise QuorumError("Quorum %d < 0 of (%s)" % (quorum, voters))
The exception message doesn't match the condition: the check is `quorum < 1` but the message says `%d < 0`.
def check_invariants(self):
    if self.quorum and not (len(self.voters | self.sync) < self.quorum + self.numsync):
        raise QuorumError("Quorum and sync not guaranteed to overlap: nodes %d >= quorum %d + sync %d" %
When I removed the sync key from the DCS, the master crashed:
$ etcdctl rm /service/batman/sync
PrevNode.Value: {"leader":"postgresql2","members":["postgresql1","postgresql0","postgresql2"],"quorum":2}
2019-02-10 09:51:33,719 INFO: Lock owner: postgresql2; I am postgresql2
2019-02-10 09:51:33.720 CET [10143] LOG: received fast shutdown request
2019-02-10 09:51:33.725 CET [10143] LOG: aborting any active transactions
2019-02-10 09:51:33.725 CET [10178] FATAL: terminating connection due to administrator command
2019-02-10 09:51:33.730 CET [10143] LOG: background worker "logical replication launcher" (PID 10193) exited with exit code 1
2019-02-10 09:51:33.731 CET [10146] LOG: shutting down
2019-02-10 09:51:33.794 CET [10143] LOG: database system is shut down
2019-02-10 09:51:33,813 INFO: Lock owner: postgresql2; I am postgresql2
Traceback (most recent call last):
File "./patroni.py", line 6, in <module>
main()
File "/home/akukushkin/git/ants/patroni/patroni/__init__.py", line 182, in main
return patroni_main()
File "/home/akukushkin/git/ants/patroni/patroni/__init__.py", line 148, in patroni_main
patroni.run()
File "/home/akukushkin/git/ants/patroni/patroni/__init__.py", line 119, in run
logger.info(self.ha.run_cycle())
File "/home/akukushkin/git/ants/patroni/patroni/ha.py", line 1370, in run_cycle
info = self._run_cycle()
File "/home/akukushkin/git/ants/patroni/patroni/ha.py", line 1343, in _run_cycle
msg = self.process_healthy_cluster()
File "/home/akukushkin/git/ants/patroni/patroni/ha.py", line 966, in process_healthy_cluster
'promoted self to leader because i had the session lock'
File "/home/akukushkin/git/ants/patroni/patroni/ha.py", line 566, in enforce_master_role
self.process_sync_replication()
File "/home/akukushkin/git/ants/patroni/patroni/ha.py", line 426, in process_sync_replication
sync_wanted=sync_wanted):
File "/home/akukushkin/git/ants/patroni/patroni/quorum.py", line 84, in __iter__
transitions = list(self._generate_transitions())
File "/home/akukushkin/git/ants/patroni/patroni/quorum.py", line 95, in _generate_transitions
self.check_invariants()
File "/home/akukushkin/git/ants/patroni/patroni/quorum.py", line 62, in check_invariants
(len(self.voters | self.sync), self.quorum, self.numsync))
patroni.quorum.QuorumError: Quorum and sync not guaranteed to overlap: nodes 3 >= quorum 1 + sync 2
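The failing check can be restated numerically (a sketch, with a hypothetical helper name): after the sync key was removed, state degraded to quorum=1 and numsync=2 over 3 nodes, and 1 + 2 = 3 is not greater than 3, so quorum and sync are no longer guaranteed to share a node:

```python
def quorum_and_sync_overlap(quorum, voters, numsync, sync):
    # check_invariants requires len(voters | sync) < quorum + numsync
    return len(voters | sync) < quorum + numsync

nodes = {'postgresql0', 'postgresql1', 'postgresql2'}
# The state that produced the QuorumError in the traceback:
broken = quorum_and_sync_overlap(1, nodes, 2, nodes)
# A consistent state for the same three-node cluster:
ok = quorum_and_sync_overlap(2, nodes, 2, nodes)
```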
Looks like a lot of corner cases are not covered in the QuorumStateResolver :(
# First get to requested replication factor
logger.debug("Adding nodes: %s", to_add)
increase_numsync_by = self.sync_wanted - self.numsync
if increase_numsync_by:
`increase_numsync_by` could become negative. Example:

QuorumStateResolver(quorum=2, voters=set('abcdef'),
                    numsync=5, sync=set('abcdef'),
                    active=set('abcdefg'), sync_wanted=4)
safety_margin = self.quorum + self.numsync - len(self.voters | self.sync)
if safety_margin > 1:
    logger.debug("Case 3: replication factor is bigger than needed")
This branch is not covered by tests, so I tried to cover it:

QuorumStateResolver(quorum=3, voters=set('abcdef'),
                    numsync=5, sync=set('abcdef'),
                    active=set('abcdef'), sync_wanted=3)

and got an exception `QuorumError: Quorum and sync not guaranteed to overlap: nodes 6 >= quorum 3 + sync 3` from `sync_update()`.
This PR suddenly is more interesting. Could this pull request be made simpler by not supporting PG 9.6? I expected this to be less complicated, given that Postgres supports quorum mode now.
… sync > sync_wanted
We can then handle any node that is ahead but unable to promote. The downside is that we might potentially need to rewind more and/or lose a couple more unsynchronized transactions.
if SyncState is empty.
:param quorum: number of servers from synchronous set we need to see to know we see the latest
    commit.
:param members: set of member names that participate in determining quorum.
`SyncState` uses the word "members" while everywhere else in the algorithm the term `voters` is used. Can you please name it consistently?
from patroni.quorum import QuorumStateResolver, QuorumError


class QuorumTest(unittest.TestCase):
    def test_1111(self):
It is better to explicitly spell out how test names map to the content of the `state` variable; it is not obvious at first glance.
logger.info("Removing synchronous privilege from %s", current) | ||
if not self.dcs.write_sync_state(self.state_handler.name, None, index=self.cluster.sync.index): | ||
sync_state = self.state_handler.current_sync_state(self.cluster) | ||
numsync = min(sync_state['numsync'], len(sync_state['sync'])) |
Can this line be moved to `self.state_handler.current_sync_state(self.cluster)`?
# Active set matches state
self.assertEqual(list(QuorumStateResolver(*state, active=set("ab"), sync_wanted=3)), [
])
# Add node by increasing quorum
I find this description a bit too short; better would be to have something like: "add a sync standby without increasing the replication factor; must increase quorum to avoid losing commits on failover".
    ('quorum', 2, set('abc')),
    ('sync', 2, set('abc')),
])
# Add node by increasing sync
Same here: "add a sync standby and increase the replication factor, in the corner case where both before and after adding the sync standby all nodes of a PG cluster must ack the commit".
    ('quorum', 3, set('abcde')),
    ('sync', 3, set('abcde')),
])
# Master is alone
Is that removing a sync standby from a 2-node cluster?
logger.debug("Case 1: synchronous_standby_names subset of DCS state") | ||
# Case 1: quorum is superset of sync nodes. In the middle of changing quorum. | ||
# Evict from quorum dead nodes that are not being synced. | ||
remove_from_quorum = self.voters - (self.sync | self.active) |
Can it happen at this point that `self.sync` still contains some node that ceased to be active and therefore must be evicted from `voters`? In other words, why do we subtract the union here and not `self.sync & self.active`, or simply `self.active`?
candidates = []
# Pick candidates based on who has flushed WAL farthest.
# TODO: for synchronous_commit = remote_write we actually want to order on write_location

for app_name, state, sync_state in self.query(
`sync_state` is no longer used in the proposed change.
if candidates:
    return candidates[0], False
return None, False
active.append(member.name)
Just to double check: is a standby with `pg_stat_replication.sync_state in ('async', 'potential')` still considered `active` for the purposes of `QuorumStateResolver`?
if remove_from_sync:
    yield self.sync_update(
        numsync=min(self.sync_wanted, len(self.sync) - len(remove_from_sync)),
        sync=self.sync - remove_from_sync)
Why not:

if remove_from_sync:
    remaining_sync = self.sync - remove_from_sync
    yield self.sync_update(
        numsync=min(self.sync_wanted, len(remaining_sync)),
        sync=remaining_sync)
# Case 3: quorum or replication factor is bigger than needed. In the middle of changing replication factor.
if self.numsync > self.sync_wanted:
    # Reduce replication factor
    new_sync = clamp(self.sync_wanted, min=len(self.voters) - self.quorum + 1, max=len(self.sync))
`sync` denotes a set everywhere else, so here, just for the sake of consistency, `new_sync` should be `new_numsync`.
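For reference, a sketch of the `clamp` helper used in the quoted diff (assumed semantics only: restrict a value to an inclusive range; the keyword names shadow the builtins to match the call site):

```python
def clamp(value, min, max):
    # Restrict value to the inclusive range [min, max]
    if value < min:
        return min
    if value > max:
        return max
    return value
```

With these semantics, `clamp(self.sync_wanted, min=len(self.voters) - self.quorum + 1, max=len(self.sync))` presumably keeps the reduced replication factor high enough that the overlap invariant still holds for the current quorum.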
- A node that is not the leader or current synchronous standby is not allowed to promote itself automatically.

Patroni will only ever assign one standby to ``synchronous_standby_names`` because with multiple candidates it is not possible to know which node was acting as synchronous during the failure.
When in synchronous mode Patroni maintains synchronization state in the DCS, containing the latest primary, the number of nodes required for quorum, and the nodes currently eligible to vote on quorum. In steady state the nodes voting on quorum are the leader and all synchronous standbys. This state is updated with strict ordering constraints with regards to node promotion and ``synchronous_standby_names`` to ensure that at all times any subset of voters that can achieve quorum is guaranteed to contain at least one node having the latest successful commit.
"In steady state the nodes voting on quorum are the leader and all synchronous standbys."

Judging from this line, steady state also means updates to quorum and/or replication factor have completed (as opposed to simply having `self.voters == self.sync`).
We are looking into implementing Patroni in-house and this feature is one of the shortfalls holding us back right now. We are very much interested in this enhancement and wondering if we can be of any help in getting this PR moving. We are open to collaborating and contributing. Thanks.
We utilize a mix of synchronous hot standbys (with synchronous_commit set to remote_apply …

With the Quorum based (ANY) implementation, the databases that receive the change …

My suggestion would be to extend the current one synchronous standby implementation to …

This functionality can be further extended to make it generic using new parameter(s) for providing …

Please let me know your thoughts. Thanks
When the `synchronous_standby_names` GUC is changed, PostgreSQL almost immediately starts reporting the corresponding walsenders as synchronous, while in fact they may not have reached that state yet. To mitigate this problem we memorize the current flush LSN on the primary right after the change of `synchronous_standby_names` becomes visible, and use it as an additional check for walsenders. A walsender will be counted as truly "sync" only when its write/flush/replay_lsn has reached the memorized LSN and the `application_name` is known to be a part of `synchronous_standby_names`. The size of the PR is mostly related to refactoring: moving the code responsible for working with `synchronous_standby_names` and `pg_stat_replication` to a dedicated file. The `parse_sync_standby_names()` function was mostly copied from #672.
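The gating rule described above can be sketched as follows (hypothetical names and row shape, with LSNs simplified to integers; the real implementation lives in the dedicated module mentioned): a walsender counts as truly sync only once its flush LSN has caught up with the LSN memorized when `synchronous_standby_names` changed, and its name is actually listed in the GUC:

```python
def truly_sync(replication_rows, memorized_lsn, listed_names):
    # replication_rows mimics pg_stat_replication entries; a walsender is
    # counted only if it is listed in synchronous_standby_names AND its
    # flush LSN has reached the LSN memorized when the GUC changed.
    return {
        row['application_name']
        for row in replication_rows
        if row['application_name'] in listed_names
        and row['flush_lsn'] is not None
        and row['flush_lsn'] >= memorized_lsn
    }

rows = [
    {'application_name': 'node1', 'flush_lsn': 1000},  # caught up
    {'application_name': 'node2', 'flush_lsn': 400},   # reported sync too early
    {'application_name': 'node3', 'flush_lsn': 2000},  # not listed in the GUC
]
sync_now = truly_sync(rows, memorized_lsn=1000, listed_names={'node1', 'node2'})
```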
I did some work on the subject. I did not have enough time to get it finished, but it's far enough along to solicit some feedback.
Description

The feature generalizes synchronous_mode to multiple nodes being in sync. The trade-off between synchronization overhead and fault tolerance is user selectable via the `replication_factor` configuration parameter, specifying how many nodes must contain a commit before it is acknowledged. The cluster will retain self-healing capability if within one HA loop interval up to `replication_factor - 1` nodes fail. The old `synchronous_commit` corresponds to `replication_factor = 2`.

On PostgreSQL 10 the quorum commit `ANY k (n)` feature is used. On 9.6 the `k (n)` variant is used, picking the first nodes. On even older versions the maximum `replication_factor` is 2. However, this is still an improvement over the status quo as it delegates sync standby selection to PostgreSQL, reducing commit latency on sync standby failure.

DCS sync state is changed from `{"leader": .., "sync_standby": ..}` to `{"leader": .., "quorum": .., "members": [..]}`, where leader is kept mainly as an informative field.

State

The main quorum management engine should be in a reasonable state, although I haven't tested it yet.

Open issues:
- Is `replication_factor` a reasonable UI for this? Do we want to keep `synchronous_commit` as the main switch for this functionality?
- Detecting when `synchronous_standby_names` becomes active: currently there is a small race condition there. I'm not even sure if the previous check of pg_stat_replication showing a walsender as sync was enough, as a cursory review of the postgres code hints that a walsender might think it is sync while a backend hasn't gotten the message that sync replication was enabled. More study of the code is needed.
- `synchronous_standby_names`, in that the quorum engine thinks of the leader as just one additional synchronous replica. Currently the boundary transition between the two worlds is in postgresql.py. Maybe some other place would work better.