Cascaded secondaries occasionally drop out from another secondary using Wal-E #399

Open
stanhu opened this issue Jul 4, 2018 · 0 comments

stanhu commented Jul 4, 2018

I'm trying to use WAL-E in this way:

                                     /--- Cascaded Secondary (replication slot postgres-03)  
Primary  ------ postgres-01 (Wal-E) ---
                                     \--- Cascaded Secondary (replication slot postgres-02)

One secondary (postgres-01) is configured to restore WAL with WAL-E from an S3 bucket. Several other secondaries connect to this node as cascaded standbys via replication slots.
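For reference, the wiring looks roughly like this. This is a sketch, not my exact configuration: hostnames and slot names are illustrative (note that slot names can only contain lowercase letters, digits, and underscores, so the hyphens in the diagram above are shorthand):

```shell
# On postgres-01 (the WAL-E restoring secondary), create one physical
# slot per cascaded secondary:
psql -c "SELECT pg_create_physical_replication_slot('postgres_02');"
psql -c "SELECT pg_create_physical_replication_slot('postgres_03');"

# recovery.conf on a cascaded secondary (e.g. postgres-03):
cat > recovery.conf <<'EOF'
standby_mode = 'on'
primary_conninfo = 'host=postgres-01 user=replicator'
primary_slot_name = 'postgres_03'
EOF
```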

This works fine for days at a time. However, about once a week, I find that one of the cascaded secondaries stops replicating with a message such as:

2018-07-03_03:25:54.20606 2018-07-03 03:25:54 GMT [25710]: [17278-1] LOG:  restartpoint complete: wrote 97213 buffers (0.7%); 0 transaction log file(s) added, 0 removed, 33 recycled; write=209.835 s, sync=0.098 s, total=210.013 s; sync files=1000, longest=0.007 s, average=0.000 s; distance=701541 kB, estimate=859799 kB
2018-07-03_03:25:54.20615 2018-07-03 03:25:54 GMT [25710]: [17279-1] LOG:  recovery restart point at 7086/5FDD5748
2018-07-03_03:25:54.20616 2018-07-03 03:25:54 GMT [25710]: [17280-1] DETAIL:  last completed transaction was at log time 2018-07-03 03:25:31.678877+00
2018-07-03_03:27:24.29645 2018-07-03 03:27:24 GMT [25710]: [17281-1] LOG:  restartpoint starting: time
2018-07-03_03:30:43.58933 2018-07-03 03:30:43 GMT [25077]: [2-1] FATAL:  could not receive data from WAL stream: ERROR:  requested WAL segment 0000000500007086000000BC has already been removed
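For my own reference, the failing filename decodes as follows, using the standard 24-hex-digit WAL segment naming (8 digits each for timeline, log, and segment, assuming the default 16 MiB segment size):

```shell
# Split the WAL filename into timeline / log / segment and print the
# first LSN the segment covers (16 MiB segments => offset = segno << 24).
seg=0000000500007086000000BC
tli=$(( 16#${seg:0:8} ))
log=$(( 16#${seg:8:8} ))
segno=$(( 16#${seg:16:8} ))
printf 'timeline %d, first LSN %X/%08X\n' "$tli" "$log" $(( segno << 24 ))
# timeline 5, first LSN 7086/BC000000
```

So the segment the cascaded secondary asked for starts at 7086/BC000000, i.e. after the recovery restart point at 7086/5FDD5748 logged just before the failure.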

On the cascaded secondary (postgres-03), this segment has an earlier timestamp and a different MD5 sum:

root@postgres-03:/var/opt/postgresql/data/pg_xlog# ls -al 0000000500007086000000BC
-rw------- 1 postgres postgres 16777216 Jul  2 23:28 0000000500007086000000BC
root@postgres-03:/var/opt/postgresql/data/pg_xlog# md5sum 0000000500007086000000BC
276e4a1f6fa3b13b3fc0e85718c0d994  0000000500007086000000BC

On the WAL-E secondary (postgres-01), I see:

root@postgres-01:/var/opt/postgresql/data/pg_xlog# ls -al 0000000500007086000000BC
-rw------- 1 postgres postgres 16777216 Jul  3 03:31 0000000500007086000000BC
root@postgres-01:/var/opt/postgresql/data/pg_xlog# md5sum 0000000500007086000000BC
296d68dbfd0f8d6ceb91de8639efab70  0000000500007086000000BC

I'm not sure why the timestamps are roughly four hours apart, but based on the events above, I'm wondering if the following race condition occurred:

2018-07-03 03:30:43 GMT: postgres-03 attempts to get the WAL segment from postgres-01 but fails
2018-07-03 03:31:xx GMT: postgres-01 downloads the WAL segment into the filesystem

Does this mean WAL-E shouldn't be used this way, or is there tuning (short of adding a delay) that would help?
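One knob I'm considering, assuming nothing better exists: raising wal_keep_segments on postgres-01 so recycled segments survive longer than the window in which a cascaded secondary might re-request them. Back-of-the-envelope sizing for the default 16 MiB segment size (the retention target here is an arbitrary example):

```shell
# Hypothetical sizing: segments needed to keep N GiB of WAL on
# postgres-01, at the default 16 MiB per segment.
retain_gib=2
seg_bytes=$(( 16 * 1024 * 1024 ))
echo "wal_keep_segments = $(( retain_gib * 1024 * 1024 * 1024 / seg_bytes ))"
# wal_keep_segments = 128
```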
