Cascaded secondaries occasionally drop out from another secondary using Wal-E #399

Open
stanhu opened this issue Jul 4, 2018 · 0 comments

stanhu commented Jul 4, 2018

I'm trying to use WAL-E in this way:

                                     /--- Cascaded Secondary (replication slot postgres-03)  
Primary  ------ postgres-01 (Wal-E) ---
                                     \--- Cascaded Secondary (replication slot postgres-02)

One secondary (postgres-01) is configured to restore WAL with WAL-E from an S3 bucket. Several other secondaries connect to this node as cascaded standbys via replication slots.
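For reference, the wiring looks roughly like this. This is a sketch, not my exact configuration: hostnames and slot names are illustrative (note that slot names can only contain lowercase letters, digits, and underscores, so the hyphens in the diagram above are shorthand):

```shell
# On postgres-01 (the WAL-E restoring secondary), create one physical
# slot per cascaded secondary:
psql -c "SELECT pg_create_physical_replication_slot('postgres_02');"
psql -c "SELECT pg_create_physical_replication_slot('postgres_03');"

# recovery.conf on a cascaded secondary (e.g. postgres-03):
cat > recovery.conf <<'EOF'
standby_mode = 'on'
primary_conninfo = 'host=postgres-01 user=replicator'
primary_slot_name = 'postgres_03'
EOF
```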

This works fine for days at a time. However, about once a week, I find that one of the cascaded secondaries stops replicating with a message such as:

2018-07-03_03:25:54.20606 2018-07-03 03:25:54 GMT [25710]: [17278-1] LOG:  restartpoint complete: wrote 97213 buffers (0.7%); 0 transaction log file(s) added, 0 removed, 33 recycled; write=209.835 s, sync=0.098 s, total=210.013 s; sync files=1000, longest=0.007 s, average=0.000 s; distance=701541 kB, estimate=859799 kB
2018-07-03_03:25:54.20615 2018-07-03 03:25:54 GMT [25710]: [17279-1] LOG:  recovery restart point at 7086/5FDD5748
2018-07-03_03:25:54.20616 2018-07-03 03:25:54 GMT [25710]: [17280-1] DETAIL:  last completed transaction was at log time 2018-07-03 03:25:31.678877+00
2018-07-03_03:27:24.29645 2018-07-03 03:27:24 GMT [25710]: [17281-1] LOG:  restartpoint starting: time
2018-07-03_03:30:43.58933 2018-07-03 03:30:43 GMT [25077]: [2-1] FATAL:  could not receive data from WAL stream: ERROR:  requested WAL segment 0000000500007086000000BC has already been removed
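For my own reference, the failing filename decodes as follows, using the standard 24-hex-digit WAL segment naming (8 digits each for timeline, log, and segment, assuming the default 16 MiB segment size):

```shell
# Split the WAL filename into timeline / log / segment and print the
# first LSN the segment covers (16 MiB segments => offset = segno << 24).
seg=0000000500007086000000BC
tli=$(( 16#${seg:0:8} ))
log=$(( 16#${seg:8:8} ))
segno=$(( 16#${seg:16:8} ))
printf 'timeline %d, first LSN %X/%08X\n' "$tli" "$log" $(( segno << 24 ))
# timeline 5, first LSN 7086/BC000000
```

So the segment the cascaded secondary asked for starts at 7086/BC000000, i.e. after the recovery restart point at 7086/5FDD5748 logged just before the failure.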

On the cascaded secondary (postgres-03), this segment has an earlier timestamp and a different MD5 sum:

root@postgres-03:/var/opt/postgresql/data/pg_xlog# ls -al 0000000500007086000000BC
-rw------- 1 postgres postgres 16777216 Jul  2 23:28 0000000500007086000000BC
root@postgres-03:/var/opt/postgresql/data/pg_xlog# md5sum 0000000500007086000000BC
276e4a1f6fa3b13b3fc0e85718c0d994  0000000500007086000000BC

On the WAL-E secondary (postgres-01), I see:

root@postgres-01:/var/opt/postgresql/data/pg_xlog# ls -al 0000000500007086000000BC
-rw------- 1 postgres postgres 16777216 Jul  3 03:31 0000000500007086000000BC
root@postgres-01:/var/opt/postgresql/data/pg_xlog# md5sum 0000000500007086000000BC
296d68dbfd0f8d6ceb91de8639efab70  0000000500007086000000BC

I'm not sure why the timestamps are roughly four hours apart, but based on the events above, I'm wondering if the following race condition occurred:

2018-07-03 03:30:43 GMT: postgres-03 attempts to get the WAL segment from postgres-01 but fails
2018-07-03 03:31:xx GMT: postgres-01 downloads the WAL segment into the filesystem

Does this mean WAL-E shouldn't be used this way, or is there tuning (short of adding a delay) that would help?
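One knob I'm considering, assuming nothing better exists: raising wal_keep_segments on postgres-01 so recycled segments survive longer than the window in which a cascaded secondary might re-request them. Back-of-the-envelope sizing for the default 16 MiB segment size (the retention target here is an arbitrary example):

```shell
# Hypothetical sizing: segments needed to keep N GiB of WAL on
# postgres-01, at the default 16 MiB per segment.
retain_gib=2
seg_bytes=$(( 16 * 1024 * 1024 ))
echo "wal_keep_segments = $(( retain_gib * 1024 * 1024 * 1024 / seg_bytes ))"
# wal_keep_segments = 128
```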
