What happened?
While trying out some ways to break Patroni and restore from backup, I ended up in a state where an old PostgreSQL process from a previous Patroni instance was still running, so Patroni could not start its own Postgres:
Jul 31 18:35:03 ld-pg-06 patroni@15-test[24131]: 2023-07-31 18:35:03,687 INFO: Lock owner: ld-pg-04; I am ld-pg-06
Jul 31 18:35:03 ld-pg-06 patroni@15-test[24131]: 2023-07-31 18:35:03,792 INFO: starting as a secondary
Jul 31 18:35:03 ld-pg-06 patroni@15-test[24131]: 2023-07-31 18:35:03,794 INFO: Process 16496 is not postmaster, too much difference between PID file start time 1690476426.67 and process start time 1690476423
Jul 31 18:35:03 ld-pg-06 patroni@15-test[24131]: 2023-07-31 18:35:03,794 INFO: closed patroni connection to the postgresql cluster
Jul 31 18:35:03 ld-pg-06 patroni@15-test[24131]: 2023-07-31 18:35:03,795 INFO: Process 16496 is not postmaster, too much difference between PID file start time 1690476426.67 and process start time 1690476423
Jul 31 18:35:04 ld-pg-06 patroni@15-test[24131]: 2023-07-31 18:35:04,097 INFO: Process 16496 is not postmaster, too much difference between PID file start time 1690476426.67 and process start time 1690476423
Jul 31 18:35:04 ld-pg-06 patroni@15-test[24131]: 2023-07-31 18:35:04,098 INFO: Telling pg_ctl that it is safe to ignore postmaster.pid for process 16496
Jul 31 18:35:04 ld-pg-06 patroni@15-test[26419]: 2023-07-31 18:35:04.247 CEST [26419] FATAL: pre-existing shared memory block (key 21902726, ID 11) is still in use
Jul 31 18:35:04 ld-pg-06 patroni@15-test[26419]: 2023-07-31 18:35:04.247 CEST [26419] HINT: Terminate any old server processes associated with data directory "/var/lib/postgresql/15/test".
The patronictl list output oscillates between crashed and streaming, the latter because Patroni is able to log in to the rogue Postgres instance locally via the Unix socket and query pg_stat_get_wal_receiver().
This is probably a rare corner case, but I wanted to mention it in case something can be done about it, e.g. via some global state that records that Postgres was not started successfully, so streaming cannot be true?
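For illustration, a minimal sketch of how such a streaming state can be derived from whatever Postgres answers on the local socket; this is not Patroni's actual code, and the connection parameters (peer auth as the postgres OS user, default port) and the exact query are my assumptions:

# Hedged sketch, not Patroni's actual code: ask whatever Postgres answers on the
# local socket about its WAL receiver. A leftover rogue postmaster will happily
# answer here and can therefore look like it is streaming.
import psycopg2

conn = psycopg2.connect(host="/var/run/postgresql", dbname="postgres", user="postgres")
with conn.cursor() as cur:
    # pg_stat_get_wal_receiver() returns one row; status is NULL if no receiver is active
    cur.execute("SELECT status FROM pg_stat_get_wal_receiver()")
    row = cur.fetchone()
conn.close()
print("streaming" if row and row[0] == "streaming" else "not streaming")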
Note that the second standby did not recover from the torture described below, so it might be timing-related:
Jul 31 18:21:02 ld-pg-05 patroni@15-test[22620]: 2023-07-31 18:21:02,605 ERROR: No initialize key in DCS and PostgreSQL is running as replica, aborting start
Jul 31 18:21:02 ld-pg-05 patroni@15-test[22620]: 2023-07-31 18:21:02,605 ERROR: Please first start Patroni on the node running as primary
How can we reproduce it (as minimally and precisely as possible)?
1. Put Patroni into maintenance mode
2. Remove the Patroni cluster from the DCS via patronictl remove
3. Stop Patroni on all nodes
4. Remove the dynamic JSON Patroni cache from the data directory of the primary
5. Start Patroni on all nodes
What did you expect to happen?
Not sure, but I would expect that the new streaming state is only shown if we are positively sure that our own Postgres instance is, in fact, streaming.
Patroni/PostgreSQL/DCS version
Patroni version: 3.0.4
PostgreSQL version: 15
DCS (and its version): etcd 3.3.25
Patroni configuration file
.
patronictl show-config
loop_wait: 10
maximum_lag_on_failover: 1048576
postgresql:
  create_replica_methods:
  - pgbackrest
  - basebackup
  parameters:
    max_connections: 100
    max_wal_senders: 10
  pg_hba:
  - local all all peer
  - host all all 127.0.0.1/32 scram-sha-256
  - host all all ::1/128 scram-sha-256
  - host all all 10.0.3.216/24 scram-sha-256
  - local replication all peer
  - host replication all 127.0.0.1/32 scram-sha-256
  - host replication all ::1/128 scram-sha-256
  - host replication all 10.0.3.216/24 scram-sha-256
  - host all postgres_rewind 10.0.3.216/24 scram-sha-256
  pgbackrest:
    command: /usr/bin/pgbackrest --stanza=15-test --log-level-file=detail --delta restore
    keep_data: true
    no_params: true
  recovery_conf:
    restore_command: pgbackrest --stanza=15-test archive-get %f %p
  remove_data_directory_on_diverged_timelines: true
  remove_data_directory_on_rewind_failure: true
  use_pg_rewind: true
retry_timeout: 10
ttl: 30
Patroni log files
see above
PostgreSQL log files
see above
Have you tried to use GitHub issue search?
Yes
Anything else we need to know?
No response
Jul 31 18:35:03 ld-pg-06 patroni@15-test[24131]: 2023-07-31 18:35:03,794 INFO: Process 16496 is not postmaster, too much difference between PID file start time 1690476426.67 and process start time 1690476423
I think this behavior was already discussed a few times (for example #811).
Patroni needs to know whether Postgres is running for a given PGDATA.
There are two different cases:
When Patroni starts Postgres on its own, it doesn't use pg_ctl but starts the postgres binary directly, so it knows the postmaster PID and later simply checks whether exactly this process is still alive.
If Postgres is already running, things get more complex:
- it needs to check whether $PGDATA/postmaster.pid exists,
- from that file it gets the PID of the postmaster process,
- it checks whether there is a process in the system with this PID,
- and, most importantly, it needs to identify whether this process really is a running Postgres.
For the last check we compare the start time from postmaster.pid (written there by postgres) with the creation time of the process.
If the difference is bigger than 3 seconds, we assume that it might not be Postgres but something else (very possible in containerized environments), and as a result Patroni attempts to start PG, which obviously fails.
But right after starting Postgres, Patroni already tries to open a connection and run monitoring queries, which obviously succeeds (against the still-running rogue instance), and therefore you sometimes get streaming.
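For reference, a rough sketch of the identification check described above (a simplified illustration, not Patroni's actual code; psutil is used here for the process lookup, and the 3-second tolerance is the value mentioned above):

# Minimal sketch: decide whether the PID recorded in postmaster.pid still belongs
# to the postmaster that wrote the file, by comparing the start time stored on the
# third line of the file with the creation time of the live process.
import psutil

def looks_like_postmaster(pgdata, tolerance=3.0):
    try:
        with open(f"{pgdata}/postmaster.pid") as f:
            lines = f.read().splitlines()
        pid = int(lines[0])                 # first line: postmaster PID
        file_start_time = float(lines[2])   # third line: start time (epoch seconds)
    except (OSError, ValueError, IndexError):
        return False                        # no usable PID file

    try:
        proc = psutil.Process(pid)
    except psutil.NoSuchProcess:
        return False                        # nothing running under that PID

    # If the two times differ by more than a few seconds, the PID was most likely
    # recycled by an unrelated process (quite possible in containers).
    return abs(proc.create_time() - file_start_time) <= tolerance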
Maybe you have a good idea how we could better identify a running PG (the solution should be cross-platform)?
@mbanck it seems that in your environment it is fairly easy to reproduce the problem.
Could you please compare the postmaster start time from the postmaster.pid file (it is on the third line) with the value returned by the pg_catalog.pg_postmaster_start_time() function?
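Something like the following sketch could be used for that comparison (the data directory path and connection parameters are assumptions for the example; it expects to run as the postgres OS user on the affected node):

# Compare the start time from line 3 of postmaster.pid with pg_postmaster_start_time().
import psycopg2

PGDATA = "/var/lib/postgresql/15/test"

with open(f"{PGDATA}/postmaster.pid") as f:
    file_start_time = float(f.read().splitlines()[2])   # epoch seconds

conn = psycopg2.connect(host="/var/run/postgresql", dbname="postgres", user="postgres")
with conn.cursor() as cur:
    cur.execute("SELECT extract(epoch FROM pg_catalog.pg_postmaster_start_time())::float8")
    db_start_time = cur.fetchone()[0]
conn.close()

print(f"postmaster.pid:             {file_start_time}")
print(f"pg_postmaster_start_time(): {db_start_time}")
print(f"difference:                 {abs(db_start_time - file_start_time):.2f}s")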