Urgent: Oplog Resolver Intermittently doesn't Complete Resolving #277

corey-hammerton · 2018-09-17T15:27:26Z

On our backup server we experience cases where backups fail to complete successfully because the Oplog Resolver from a previous backup didn't complete. This happens on many of our backup configurations using default oplog configuration settings.

The ResolverThreads to complete successfully, as displayed by the logs below with hostnames redacted. The resolver, however, doesn't free the threads and continues with the backup. Manually killing the defunct processes releases the locks and the MainProcess thread logs that Oplog resolving completed.
Killing the processes on the server with high Virtual Memory Size achieves this.

[2018-09-15 00:23:33,726] [INFO] [PoolWorker-11] [ResolverThread:run:36] Resolving oplog for XXXXXXXXXXXXXXXXXXXXX:10000 to max ts: Timestamp(1536970998, 0)
[2018-09-15 00:23:33,796] [INFO] [PoolWorker-10] [ResolverThread:run:36] Resolving oplog for XXXXXXXXXXXXXXXXXXXXX:10001 to max ts: Timestamp(1536970998, 0)
[2018-09-15 00:23:33,828] [INFO] [PoolWorker-11] [ResolverThread:run:60] Applied 56 oplog changes to XXXXXXXXXXXXXXXXXXXXX:10000 oplog, end ts: Timestamp(1536970999, 1)
[2018-09-15 00:23:34,029] [INFO] [PoolWorker-10] [ResolverThread:run:60] Applied 205 oplog changes to XXXXXXXXXXXXXXXXXXXXX:10001 oplog, end ts: Timestamp(1536970997, 50)
[2018-09-17 02:17:23,739] [INFO] [MainProcess] [Resolver:run:142] Oplog resolving completed in 0.00 seconds

In our testing and debugging we have narrowed it down to https://github.com/Percona-Lab/mongodb_consistent_backup/blob/master/mongodb_consistent_backup/Oplog/Resolver/Resolver.py#L105. Please Advise.

Generic server information:

OS: CentOS 7.5.1804
YUM Package Version: 1.3.0
Python Version: 2.7.5
CPU Count: 8
RAM Size: 15G

The text was updated successfully, but these errors were encountered:

corey-hammerton · 2018-09-24T13:36:08Z

@timvaillancourt ^^

corey-hammerton · 2018-09-26T13:59:12Z

Using the pystack debugger I managed to use gdb to get a thread dump of some of our hanging processes. I've attached the dump of one of our hung threads into a file in this comment.
mongodb_consistent_backup_threaddump.txt

The short version is the thread dump points to https://github.com/Percona-Lab/mongodb_consistent_backup/blob/1.3.0/mongodb_consistent_backup/Oplog/Resolver/Resolver.py#L45

corey-hammerton · 2018-09-26T14:36:23Z

Got a dump of the parent PID of the defunct processes. the results are attached below.
mongodb_consistent_backup_threaddump.txt

This one points to https://github.com/Percona-Lab/mongodb_consistent_backup/blob/master/mongodb_consistent_backup/Oplog/Resolver/Resolver.py#L105

timvaillancourt · 2018-12-07T13:36:08Z

Hi @corey-hammerton I suspect this is related to the same stalls seen in #165. I've also traced it down to thread dumps, strace, etc and cannot explain why 'multiprocessing' and/or threading is stalling.

There's a few issues on GitHub for multiprocessing that sound related to this problem but I haven't seen a clear solutions yet (that don't mean large changes like Python 3+, etc). I'm hoping someone with more familiarity with Python internals can help take this investigation further.

corey-hammerton changed the title ~~Oplog Resolver Intermittently doesn't Complete Resolving~~ Urgent: Oplog Resolver Intermittently doesn't Complete Resolving Sep 24, 2018

This was referenced Oct 2, 2018

New Bugfix Release #279

Closed

Oplog/Resolver/Resolver.py: De-duplicating code closing the pool #280

Merged

timvaillancourt added the bug label Dec 7, 2018

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Urgent: Oplog Resolver Intermittently doesn't Complete Resolving #277

Urgent: Oplog Resolver Intermittently doesn't Complete Resolving #277

corey-hammerton commented Sep 17, 2018 •

edited

corey-hammerton commented Sep 24, 2018

corey-hammerton commented Sep 26, 2018

corey-hammerton commented Sep 26, 2018

timvaillancourt commented Dec 7, 2018 •

edited

Urgent: Oplog Resolver Intermittently doesn't Complete Resolving #277

Urgent: Oplog Resolver Intermittently doesn't Complete Resolving #277

Comments

corey-hammerton commented Sep 17, 2018 • edited

corey-hammerton commented Sep 24, 2018

corey-hammerton commented Sep 26, 2018

corey-hammerton commented Sep 26, 2018

timvaillancourt commented Dec 7, 2018 • edited

corey-hammerton commented Sep 17, 2018 •

edited

timvaillancourt commented Dec 7, 2018 •

edited