Bug 1418627

Summary: Mongo cursor times out during task pulp.server.managers.content.orphan.delete_all_orphans
Product: Red Hat Satellite
Reporter: Ben <ben.argyle>
Component: Pulp
Assignee: satellite6-bugs <satellite6-bugs>
Status: CLOSED ERRATA
QA Contact: Peter Ondrejka <pondrejk>
Severity: medium
Priority: medium
Version: 6.2.7
CC: bbuckingham, ben.argyle, bkearney, bmbouter, cwelton, daviddavis, dkliban, ggainey, ipanova, jcallaha, lzap, mhrivnak, mmccune, pcreech, peter.vreman, pgozart, pmoravec, rchan, ttereshc
Target Milestone: Unspecified
Keywords: Regression, Triaged
Target Release: Unused
Hardware: All
OS: Linux
Doc Type: If docs needed, set a value
Last Closed: 2018-02-21 16:54:17 UTC
Type: Bug
Bug Blocks: 1533259

Description Ben 2017-02-02 11:23:03 UTC
Description of problem:
When Pulp tries to delete orphaned content, Mongo's cursor times out.


Version-Release number of selected component (if applicable):
Satellite 6.2.7, fully updated as of 30 Jan 2017.

Actual results:

Feb  2 04:13:52 satellite1 pulp: py.warnings:WARNING: (1955-62752) /usr/lib/python2.7/site-packages/pulp/server/managers/content/orphan.py:228: DeprecationWarning: remove is deprecated. Use delete_one or delete_many instead.
Feb  2 04:13:52 satellite1 pulp: py.warnings:WARNING: (1955-62752)   content_units_collection.remove(content_unit['_id'])
Feb  2 04:13:52 satellite1 pulp: py.warnings:WARNING: (1955-62752)
Feb  2 04:28:43 satellite1 pulp: celery.worker.job:ERROR: (1358-62752) Task pulp.server.managers.content.orphan.delete_all_orphans[ece93a94-907b-4c1a-b4a4-474a99893be2] raised unexpected: CursorNotFound('Cursor not found, cursor id: 163452381930',)
Feb  2 04:28:43 satellite1 pulp: celery.worker.job:ERROR: (1358-62752) Traceback (most recent call last):
Feb  2 04:28:43 satellite1 pulp: celery.worker.job:ERROR: (1358-62752)   File "/usr/lib/python2.7/site-packages/celery/app/trace.py", line 240, in trace_task
Feb  2 04:28:43 satellite1 pulp: celery.worker.job:ERROR: (1358-62752)     R = retval = fun(*args, **kwargs)
Feb  2 04:28:43 satellite1 pulp: celery.worker.job:ERROR: (1358-62752)   File "/usr/lib/python2.7/site-packages/pulp/server/async/tasks.py", line 473, in __call__
Feb  2 04:28:43 satellite1 pulp: celery.worker.job:ERROR: (1358-62752)     return super(Task, self).__call__(*args, **kwargs)
Feb  2 04:28:43 satellite1 pulp: celery.worker.job:ERROR: (1358-62752)   File "/usr/lib/python2.7/site-packages/pulp/server/async/tasks.py", line 103, in __call__
Feb  2 04:28:43 satellite1 pulp: celery.worker.job:ERROR: (1358-62752)     return super(PulpTask, self).__call__(*args, **kwargs)
Feb  2 04:28:43 satellite1 pulp: celery.worker.job:ERROR: (1358-62752)   File "/usr/lib/python2.7/site-packages/celery/app/trace.py", line 437, in __protected_call__
Feb  2 04:28:43 satellite1 pulp: celery.worker.job:ERROR: (1358-62752)     return self.run(*args, **kwargs)
Feb  2 04:28:43 satellite1 pulp: celery.worker.job:ERROR: (1358-62752)   File "/usr/lib/python2.7/site-packages/pulp/server/managers/content/orphan.py", line 170, in delete_all_orphans
Feb  2 04:28:43 satellite1 pulp: celery.worker.job:ERROR: (1358-62752)     OrphanManager.delete_orphans_by_type(content_type_id)
Feb  2 04:28:43 satellite1 pulp: celery.worker.job:ERROR: (1358-62752)   File "/usr/lib/python2.7/site-packages/pulp/server/managers/content/orphan.py", line 219, in delete_orphans_by_type
Feb  2 04:28:43 satellite1 pulp: celery.worker.job:ERROR: (1358-62752)     fields=['_id', '_storage_path']):
Feb  2 04:28:43 satellite1 pulp: celery.worker.job:ERROR: (1358-62752)   File "/usr/lib/python2.7/site-packages/pulp/server/managers/content/orphan.py", line 106, in generate_orphans_by_type
Feb  2 04:28:43 satellite1 pulp: celery.worker.job:ERROR: (1358-62752)     for content_unit in content_units_collection.find({}, projection=fields):
Feb  2 04:28:43 satellite1 pulp: celery.worker.job:ERROR: (1358-62752)   File "/usr/lib64/python2.7/site-packages/pymongo/cursor.py", line 1097, in next
Feb  2 04:28:43 satellite1 pulp: celery.worker.job:ERROR: (1358-62752)     if len(self.__data) or self._refresh():
Feb  2 04:28:43 satellite1 pulp: celery.worker.job:ERROR: (1358-62752)   File "/usr/lib64/python2.7/site-packages/pymongo/cursor.py", line 1039, in _refresh
Feb  2 04:28:43 satellite1 pulp: celery.worker.job:ERROR: (1358-62752)     self.__max_await_time_ms))
Feb  2 04:28:43 satellite1 pulp: celery.worker.job:ERROR: (1358-62752)   File "/usr/lib64/python2.7/site-packages/pymongo/cursor.py", line 903, in __send_message
Feb  2 04:28:43 satellite1 pulp: celery.worker.job:ERROR: (1358-62752)     codec_options=self.__codec_options)
Feb  2 04:28:43 satellite1 pulp: celery.worker.job:ERROR: (1358-62752)   File "/usr/lib64/python2.7/site-packages/pymongo/helpers.py", line 114, in _unpack_response
Feb  2 04:28:43 satellite1 pulp: celery.worker.job:ERROR: (1358-62752)     raise CursorNotFound(msg, 43, errobj)
Feb  2 04:28:43 satellite1 pulp: celery.worker.job:ERROR: (1358-62752) CursorNotFound: Cursor not found, cursor id: 163452381930

Expected results:
No error.

Additional info:

Comment 3 Michael Hrivnak 2017-02-06 22:50:54 UTC
Looking at the code, it's in a pretty tight loop there. The only way I could think this might have happened is if mongod restarted while the orphan purge task was running. Or if someone accidentally deleted an index on that collection. Or if someone ran pulp-manage-db while the orphan purge task was running, which temporarily removes indexes.

https://github.com/pulp/pulp/blob/2.8-release/server/pulp/server/managers/content/orphan.py#L106
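
For context, the loop the traceback points at looks roughly like the sketch below. This is a minimal illustration with made-up collection names and a hypothetical is_orphaned() helper, not the upstream Pulp code:

from pymongo import MongoClient

client = MongoClient()                   # assumes a local mongod, as on a Satellite server
db = client['pulp_database']
content_units = db['units_rpm']          # hypothetical per-content-type collection

def is_orphaned(unit):
    # Hypothetical stand-in for Pulp's real check against repo/unit associations.
    return db['repo_content_units'].find_one({'unit_id': str(unit['_id'])}) is None

# A single server-side cursor stays open for the whole purge. If the per-document
# work (reference checks, file deletion, the remove/delete call) keeps the client
# from issuing a getMore for roughly 10 minutes, mongod reaps the idle cursor and
# the next iteration raises CursorNotFound, as in the traceback above.
for content_unit in content_units.find({}, projection=['_id', '_storage_path']):
    if is_orphaned(content_unit):
        content_units.delete_one({'_id': content_unit['_id']})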

Is this problem reproducible?

Comment 4 Ben 2017-02-07 10:00:24 UTC
The only way I've been able to reproduce it is to load Satellite up with a number of jobs at the same time.  The best way is to publish three or more Content Views concurrently (say RHEL6 plus Optional plus Sat. Tools plus Supplementary, and at the same time RHEL7 plus the same).  On an 8 CPU, 24 GB VMware guest running RHEL 7.3 and doing nothing but Satellite, eventually the box gets so busy that a cursor times out.

Comment 5 Michael Hrivnak 2017-02-09 14:13:19 UTC
Thank you. This feedback is very helpful. I have one more question: are you seeing cursor timeouts mostly from this specific task? Or when the system gets heavily loaded, do you see them in other places also? Is there a short list of those places, or is it unpredictable? I'm trying to determine where we should focus our effort.

Comment 6 Ben 2017-02-09 14:57:14 UTC
I think the only time I've seen cursor timeouts has been when trying to publish multiple Content Views concurrently.  It's _possible_ I might have seen one when restarting Satellite ("katello-service restart") while it was in the middle of doing a number of bulk errata pushes (more than 15 Content Hosts in a Host Collection).  As all the individual services came back up I _might_ have seen one then.  Or what I saw may have been the "UserWarning: MongoClient open before fork." messages that litter Satellite's log output when restarting.  I'm sorry I can't be more specific.

Oh, having had a look at all of the /var/log/messages files I have, it appears that there's been a cursor timeout at these timestamps:

Thu Jan 19 04:23:57
Thu Jan 26 04:39:19
Thu Feb  2 04:28:43

... but not this morning, oddly.  I think it may be something to do with the aftermath/finishing up after the nightly syncing of all repos.

Comment 7 pulp-infra@redhat.com 2017-02-10 15:31:14 UTC
The Pulp upstream bug status is at NEW. Updating the external tracker on this bug.

Comment 8 pulp-infra@redhat.com 2017-02-10 15:31:17 UTC
The Pulp upstream bug priority is at Normal. Updating the external tracker on this bug.

Comment 9 pulp-infra@redhat.com 2017-02-14 21:01:38 UTC
The Pulp upstream bug status is at POST. Updating the external tracker on this bug.

Comment 10 pulp-infra@redhat.com 2017-02-17 01:01:58 UTC
The Pulp upstream bug status is at MODIFIED. Updating the external tracker on this bug.

Comment 11 pulp-infra@redhat.com 2017-02-17 01:32:13 UTC
All upstream Pulp bugs are at MODIFIED+. Moving this bug to POST.

Comment 12 Dennis Kliban 2017-03-14 02:25:21 UTC
The Pulp upstream bug status is at ON_QA. Updating the external tracker on this bug.

Comment 13 pulp-infra@redhat.com 2017-04-12 14:34:00 UTC
The Pulp upstream bug status is at CLOSED - CURRENTRELEASE. Updating the external tracker on this bug.

Comment 14 Ben 2017-06-07 08:52:13 UTC
Just a note that I'm seeing this on 6.2.9 as well when doing a manual invocation of "foreman-rake katello:delete_orphaned_content RAILS_ENV=production" with no other appreciable load.

Comment 15 Peter Vreman 2017-12-07 07:44:32 UTC
I have the same issue; the Katello weekly cron job that calls "pulp-admin orphan remove --all" has not been running for several months.

Because the time needed for orphan removal grows in proportion to the number of orphans, the next successful run will take many hours. On another system with a fixed Pulp I have seen the task take 12+ hours to 'recover' from all the months of skipped orphan cleanup.

After applying the patch that adds '.batch_size(100)', it works.
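
For anyone finding this later, the change being referred to is roughly of the following shape. This sketch reuses the illustrative names from the sketch in comment 3 and is not the literal upstream diff:

# Capping the batch size makes the driver go back to mongod for a new batch more
# often, so the server-side cursor sees activity well inside its idle timeout.
for content_unit in content_units.find(
        {}, projection=['_id', '_storage_path']).batch_size(100):
    if is_orphaned(content_unit):
        content_units.delete_one({'_id': content_unit['_id']})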


Peter

Comment 16 Ben 2017-12-07 11:37:22 UTC
(In reply to Peter Vreman from comment #15)
> [...]
> After applying the patch that adds '.batch_size(100)', it works.

What patch is that?

Comment 17 Lukas Zapletal 2017-12-07 12:25:00 UTC
Ben,

in the External Trackers section you can see the associated Pulp issue:

https://pulp.plan.io/issues/2584

In that issue you can click through to the GitHub patch:

https://github.com/pulp/pulp/pull/2952/files

You can download it as a patch:

wget https://github.com/pulp/pulp/pull/2952.patch

and apply it.

Comment 24 Peter Ondrejka 2018-02-07 12:06:26 UTC
Verified on Satellite 6.3 snap 35; no timeout experienced when removing 50k+ orphans.

Comment 25 Satellite Program 2018-02-21 16:54:17 UTC
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA.

For information on the advisory, and where to find the updated files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2018:0336