Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Urgent: Oplog Resolver Intermittently doesn't Complete Resolving #277

Open
corey-hammerton opened this issue Sep 17, 2018 · 4 comments
Open
Labels

Comments

@corey-hammerton
Copy link
Contributor

corey-hammerton commented Sep 17, 2018

On our backup server we experience cases where backups fail to complete successfully because the Oplog Resolver from a previous backup didn't complete. This happens on many of our backup configurations using default oplog configuration settings.

The ResolverThreads to complete successfully, as displayed by the logs below with hostnames redacted. The resolver, however, doesn't free the threads and continues with the backup. Manually killing the defunct processes releases the locks and the MainProcess thread logs that Oplog resolving completed.
Killing the processes on the server with high Virtual Memory Size achieves this.

[2018-09-15 00:23:33,726] [INFO] [PoolWorker-11] [ResolverThread:run:36] Resolving oplog for XXXXXXXXXXXXXXXXXXXXX:10000 to max ts: Timestamp(1536970998, 0)
[2018-09-15 00:23:33,796] [INFO] [PoolWorker-10] [ResolverThread:run:36] Resolving oplog for XXXXXXXXXXXXXXXXXXXXX:10001 to max ts: Timestamp(1536970998, 0)
[2018-09-15 00:23:33,828] [INFO] [PoolWorker-11] [ResolverThread:run:60] Applied 56 oplog changes to XXXXXXXXXXXXXXXXXXXXX:10000 oplog, end ts: Timestamp(1536970999, 1)
[2018-09-15 00:23:34,029] [INFO] [PoolWorker-10] [ResolverThread:run:60] Applied 205 oplog changes to XXXXXXXXXXXXXXXXXXXXX:10001 oplog, end ts: Timestamp(1536970997, 50)
[2018-09-17 02:17:23,739] [INFO] [MainProcess] [Resolver:run:142] Oplog resolving completed in 0.00 seconds

In our testing and debugging we have narrowed it down to https://github.com/Percona-Lab/mongodb_consistent_backup/blob/master/mongodb_consistent_backup/Oplog/Resolver/Resolver.py#L105. Please Advise.

Generic server information:

OS: CentOS 7.5.1804
YUM Package Version: 1.3.0
Python Version: 2.7.5
CPU Count: 8
RAM Size: 15G
@corey-hammerton corey-hammerton changed the title Oplog Resolver Intermittently doesn't Complete Resolving Urgent: Oplog Resolver Intermittently doesn't Complete Resolving Sep 24, 2018
@corey-hammerton
Copy link
Contributor Author

@timvaillancourt ^^

@corey-hammerton
Copy link
Contributor Author

Using the pystack debugger I managed to use gdb to get a thread dump of some of our hanging processes. I've attached the dump of one of our hung threads into a file in this comment.
mongodb_consistent_backup_threaddump.txt

The short version is the thread dump points to https://github.com/Percona-Lab/mongodb_consistent_backup/blob/1.3.0/mongodb_consistent_backup/Oplog/Resolver/Resolver.py#L45

@corey-hammerton
Copy link
Contributor Author

Got a dump of the parent PID of the defunct processes. the results are attached below.
mongodb_consistent_backup_threaddump.txt

This one points to https://github.com/Percona-Lab/mongodb_consistent_backup/blob/master/mongodb_consistent_backup/Oplog/Resolver/Resolver.py#L105

@timvaillancourt
Copy link
Contributor

timvaillancourt commented Dec 7, 2018

Hi @corey-hammerton I suspect this is related to the same stalls seen in #165. I've also traced it down to thread dumps, strace, etc and cannot explain why 'multiprocessing' and/or threading is stalling.

There's a few issues on GitHub for multiprocessing that sound related to this problem but I haven't seen a clear solutions yet (that don't mean large changes like Python 3+, etc). I'm hoping someone with more familiarity with Python internals can help take this investigation further.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

No branches or pull requests

2 participants