Bug 1418627
| Summary: | Mongo cursor times out during task pulp.server.managers.content.orphan.delete_all_orphans | | |
|---|---|---|---|
| Product: | Red Hat Satellite | Reporter: | Ben <ben.argyle> |
| Component: | Pulp | Assignee: | satellite6-bugs <satellite6-bugs> |
| Status: | CLOSED ERRATA | QA Contact: | Peter Ondrejka <pondrejk> |
| Severity: | medium | Docs Contact: | |
| Priority: | medium | | |
| Version: | 6.2.7 | CC: | bbuckingham, ben.argyle, bkearney, bmbouter, cwelton, daviddavis, dkliban, ggainey, ipanova, jcallaha, lzap, mhrivnak, mmccune, pcreech, peter.vreman, pgozart, pmoravec, rchan, ttereshc |
| Target Milestone: | Unspecified | Keywords: | Regression, Triaged |
| Target Release: | Unused | | |
| Hardware: | All | | |
| OS: | Linux | | |
| Whiteboard: | | | |
| Fixed In Version: | | Doc Type: | If docs needed, set a value |
| Doc Text: | | Story Points: | --- |
| Clone Of: | | Environment: | |
| Last Closed: | 2018-02-21 16:54:17 UTC | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | | | |
| Bug Depends On: | | | |
| Bug Blocks: | 1533259 | | |
Description

Ben 2017-02-02 11:23:03 UTC
Looking at the code, it's in a pretty tight loop there. The only ways I can think this might have happened: mongod restarted while the orphan purge task was running; someone accidentally deleted an index on that collection; or someone ran pulp-manage-db, which temporarily removes indexes, while the orphan purge task was running. https://github.com/pulp/pulp/blob/2.8-release/server/pulp/server/managers/content/orphan.py#L106

Is this problem reproducible?

The only way I've been able to reproduce it is to load Satellite up with a number of jobs at the same time. The best way is to publish three or more Content Views concurrently (say RHEL6 plus Optional plus Sat. Tools plus Supplementary, and at the same time RHEL7 plus the same). On an 8-CPU, 24 GB VMware guest running RHEL 7.3 and doing nothing but Satellite, eventually the box gets so busy that a cursor times out.

Thank you, this feedback is very helpful. I have one more question: are you seeing cursor timeouts mostly from this specific task, or do you also see them in other places when the system gets heavily loaded? Is there a short list of those places, or is it unpredictable? I'm trying to determine where we should focus our effort.

I think the only time I've seen cursor timeouts has been when trying to publish multiple Content Views concurrently. It's _possible_ I saw one when restarting Satellite ("katello-service restart") while it was in the middle of a number of bulk errata pushes (more than 15 Content Hosts in a Host Collection); as the individual services came back up I _might_ have seen one then. Or what I saw may have been the "UserWarning: MongoClient open before fork." messages that litter Satellite's log output when restarting. I'm sorry I can't be more specific.

Oh, having had a look at all of the /var/log/messages files I have, it appears there has been a cursor timeout at these timestamps:

Thu Jan 19 04:23:57
Thu Jan 26 04:39:19
Thu Feb 2 04:28:43
...

but not this morning, oddly. I think it may be something to do with the aftermath of the nightly syncing of all repos.

The Pulp upstream bug status is at NEW. Updating the external tracker on this bug.
The Pulp upstream bug priority is at Normal. Updating the external tracker on this bug.
The Pulp upstream bug status is at POST. Updating the external tracker on this bug.
The Pulp upstream bug status is at MODIFIED. Updating the external tracker on this bug.

All upstream Pulp bugs are at MODIFIED+. Moving this bug to POST.

The Pulp upstream bug status is at ON_QA. Updating the external tracker on this bug.
The Pulp upstream bug status is at CLOSED - CURRENTRELEASE. Updating the external tracker on this bug.

Just a note that I'm seeing this on 6.2.9 as well when doing a manual invocation of "foreman-rake katello:delete_orphaned_content RAILS_ENV=production" with no other appreciable load.

I have the same issue: the weekly Katello cron job that calls "pulp-admin orphan remove --all" has not run for several months. Since the time needed for orphan removal grows in proportion to the number of orphans, the next successful run will take many hours. On another system with a fixed Pulp I have seen the task take 12+ hours to recover from all the months of missed orphan cleanup. After applying the patch that adds '.batch_size(100)', it works.

Peter

(In reply to Peter Vreman from comment #15)
> [...]
> After apply the patch adding '.batch_size(100)' it works

What patch is that?

Ben, in the External Trackers you can see the associated Pulp issue: https://pulp.plan.io/issues/2584. From that issue you can click through to the GitHub patch: https://github.com/pulp/pulp/pull/2952/files. You can download it as a patch (wget https://github.com/pulp/pulp/pull/2952.patch) and apply it.
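For context on why the patch works: the fix referenced above (pulp PR #2952) bounds the cursor's batch size so the server-side cursor is revisited often enough to stay alive while each orphan is deleted. Below is a minimal sketch of that idea, assuming a pymongo-style collection API; the function name and callback here are illustrative, not Pulp's actual code.

```python
def delete_orphans(content_collection, delete_one):
    """Iterate orphaned content units in small batches and delete each one.

    `content_collection` is any object whose find() returns a cursor with a
    batch_size() method, as pymongo's Collection does; `delete_one` is the
    per-unit cleanup callback. Both names are hypothetical, for illustration.
    """
    cursor = content_collection.find({}, projection=["_id"])
    # Without a small batch size, the driver fetches large batches; if
    # processing one batch takes longer than the server's idle-cursor
    # timeout (10 minutes by default in MongoDB), the next getMore fails
    # with a CursorNotFound error -- the timeout seen in this bug.
    cursor.batch_size(100)
    deleted = 0
    for unit in cursor:
        delete_one(unit["_id"])
        deleted += 1
    return deleted
```

Smaller batches cost more round trips, but each getMore resets the server's idle timer, which is what was expiring during slow orphan deletion.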
Verified on Sat 6.3 snap 35; no timeout experienced on removing 50k+ orphans.

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA.
For information on the advisory, and where to find the updated files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2018:0336