Bug 1889795

Summary: uncaught exception after a celery worker process abruptly terminates, resulting in a hung task
Product: Red Hat Satellite
Reporter: Pavel Moravec <pmoravec>
Component: Pulp
Assignee: satellite6-bugs <satellite6-bugs>
Status: CLOSED ERRATA
QA Contact: Tasos Papaioannou <tpapaioa>
Severity: medium
Docs Contact:
Priority: medium
Version: 6.7.0
CC: ahumbe, bmbouter, dalley, ggainey, hyu, ipanova, jkrajice, kkinge, ktordeur, mfalz, mmccune, mn.albeschenko, patalber, rchan, satellite6-bugs, saydas, ttereshc, zhunting
Target Milestone: 6.9.0
Keywords: PrioBumpGSS, Triaged
Target Release: Unused
Hardware: x86_64
OS: Linux
Whiteboard:
Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of:
: 1919405 (view as bug list)
Environment:
Last Closed: 2021-04-21 13:18:20 UTC
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Embargoed:

Description Pavel Moravec 2020-10-20 15:16:28 UTC
Description of problem:
When a celery worker process is processing a task and the process terminates unexpectedly (a bit artificial, but it can happen), then:
- parent process hits uncaught exception
- the processed task is hung forever, as the uncaught exception happens just before updating the task status

This leaves a Content View Publish/Promote or Capsule Sync task hung forever, waiting on a pulp task that stays "running" forever.

Moreover, the original cause of the failure is not recorded anywhere, since the uncaught exception is raised before "task_status['traceback'] = einfo.traceback" would store it. This bug therefore prevents investigating the root cause of "why did the pulp worker terminate?".


Version-Release number of selected component (if applicable):
Any Sat 6.7, e.g.:
- pulp-server-2.21.0.4-1
- satellite-6.7.4-1


How reproducible:
100%


Steps to Reproduce:
1. Invoke a bigger CV publish/promote (one with several repos inside)
2. While pulp celery workers are processing the sync/publish tasks, kill some of them via "kill -SIGHUP <pid>"
3. Check /var/log/messages
4. Check CV publish/promote task


Actual results:
3. The log shows an error like:

Oct 20 16:47:12 pmoravec-caps67-rhev pulp: py.warnings:WARNING: [989d3c32] (10316-57088) /usr/lib64/python2.7/site-packages/pymongo/topology.py:74: UserWarning: MongoClient opened before fork. Create MongoClient with connect=False, or create client after forking. See PyMongo's documentation for details: http://api.mongodb.org/python/current/faq.html#using-pymongo-with-multiprocessing>
Oct 20 16:47:12 pmoravec-caps67-rhev pulp: py.warnings:WARNING: [989d3c32] (10316-57088)   "MongoClient opened before fork. Create MongoClient "
Oct 20 16:47:12 pmoravec-caps67-rhev pulp: py.warnings:WARNING: [989d3c32] (10316-57088)
Oct 20 16:47:12 pmoravec-caps67-rhev pulp: pulp.server.async.tasks:INFO: [989d3c32] Task failed : [b2851544-6a94-4e5d-bd6a-5bd9cde01ac0] : Worker terminated abnormally while processing task b2851544-6a94-4e5d-bd6a-5bd9cde01ac0.  Check the logs for details
Oct 20 16:47:12 pmoravec-caps67-rhev pulp: celery.app.trace:ERROR: [989d3c32] (10316-57088) Task pulp.server.async.tasks._release_resource[989d3c32-12c8-4fd7-9ee0-53e62880a156] raised unexpected: AttributeError("'NoneType' object has no attribute 'top'",)
Oct 20 16:47:12 pmoravec-caps67-rhev pulp: celery.app.trace:ERROR: [989d3c32] (10316-57088) Traceback (most recent call last):
Oct 20 16:47:12 pmoravec-caps67-rhev pulp: celery.app.trace:ERROR: [989d3c32] (10316-57088)   File "/usr/lib/python2.7/site-packages/celery/app/trace.py", line 367, in trace_task
Oct 20 16:47:12 pmoravec-caps67-rhev pulp: celery.app.trace:ERROR: [989d3c32] (10316-57088)     R = retval = fun(*args, **kwargs)
Oct 20 16:47:12 pmoravec-caps67-rhev pulp: celery.app.trace:ERROR: [989d3c32] (10316-57088)   File "/usr/lib/python2.7/site-packages/pulp/server/async/tasks.py", line 108, in __call__
Oct 20 16:47:12 pmoravec-caps67-rhev pulp: celery.app.trace:ERROR: [989d3c32] (10316-57088)     return super(PulpTask, self).__call__(*args, **kwargs)
Oct 20 16:47:12 pmoravec-caps67-rhev pulp: celery.app.trace:ERROR: [989d3c32] (10316-57088)   File "/usr/lib/python2.7/site-packages/celery/app/trace.py", line 622, in __protected_call__
Oct 20 16:47:12 pmoravec-caps67-rhev pulp: celery.app.trace:ERROR: [989d3c32] (10316-57088)     return self.run(*args, **kwargs)
Oct 20 16:47:12 pmoravec-caps67-rhev pulp: celery.app.trace:ERROR: [989d3c32] (10316-57088)   File "/usr/lib/python2.7/site-packages/pulp/server/async/tasks.py", line 376, in _release_resource
Oct 20 16:47:12 pmoravec-caps67-rhev pulp: celery.app.trace:ERROR: [989d3c32] (10316-57088)     new_task.on_failure(exception, task_id, (), {}, MyEinfo)
Oct 20 16:47:12 pmoravec-caps67-rhev pulp: celery.app.trace:ERROR: [989d3c32] (10316-57088)   File "/usr/lib/python2.7/site-packages/pulp/server/async/tasks.py", line 759, in on_failure
Oct 20 16:47:12 pmoravec-caps67-rhev pulp: celery.app.trace:ERROR: [989d3c32] (10316-57088)     if not self.request.called_directly:
Oct 20 16:47:12 pmoravec-caps67-rhev pulp: celery.app.trace:ERROR: [989d3c32] (10316-57088)   File "/usr/lib/python2.7/site-packages/celery/app/task.py", line 978, in _get_request
Oct 20 16:47:12 pmoravec-caps67-rhev pulp: celery.app.trace:ERROR: [989d3c32] (10316-57088)     req = self.request_stack.top
Oct 20 16:47:12 pmoravec-caps67-rhev pulp: celery.app.trace:ERROR: [989d3c32] (10316-57088) AttributeError: 'NoneType' object has no attribute 'top'

4. The foreman task never terminates, as the pulp task is "running" forever.


Expected results:
3. No such uncaught exception.
4. The foreman task terminates soon after the pulp worker is gone.


Additional info:
This might be a pulp-2-specific bug that does not happen in pulp-3. If pulp-3 is robust in this scenario, I am OK with closing it as NEXTRELEASE.
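
For illustration: the AttributeError in the "Actual results" traceback means that request_stack on the Task instance used in _release_resource is None (its class-level default), not just an empty stack with no current request. A minimal, self-contained sketch of that mechanism (FakeTask below is a stand-in for the example only, not celery's real class):

class FakeTask(object):
    # celery normally replaces this class-level default with a real stack when
    # the task class gets bound to an app; a bare, unbound instance keeps None
    request_stack = None

    @property
    def request(self):
        # mirrors the failing line from celery/app/task.py in the traceback:
        #     req = self.request_stack.top
        return self.request_stack.top

task = FakeTask()
try:
    task.request
except AttributeError as e:
    print(e)  # 'NoneType' object has no attribute 'top'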

Comment 1 Pavel Moravec 2020-11-11 11:29:42 UTC
Indirect reproducer:

- invoke several Capsule syncs
- once many tasks are running or pending, restart pulp services on the Capsule:

for i in pulp_celerybeat pulp_resource_manager pulp_streamer pulp_workers; do service $i restart; done

- /var/log/messages will log:

Nov 11 12:21:09 pmoravec-caps67-rhev pulp: pulp.server.async.tasks:INFO: [2e9962f1] Task failed : [a2745159-81ab-42ee-907c-f265b618e00c] : Worker terminated abnormally while processing task a2745159-81ab-42ee-907c-f265b618e00c.  Check the logs for details
Nov 11 12:21:09 pmoravec-caps67-rhev pulp: celery.app.trace:ERROR: [2e9962f1] (20590-62752) Task pulp.server.async.tasks._release_resource[2e9962f1-3f24-4b28-80ba-9396f58b5f26] raised unexpected: AttributeError("'NoneType' object has no attribute 'top'",)
Nov 11 12:21:09 pmoravec-caps67-rhev pulp: celery.app.trace:ERROR: [2e9962f1] (20590-62752) Traceback (most recent call last):
Nov 11 12:21:09 pmoravec-caps67-rhev pulp: celery.app.trace:ERROR: [2e9962f1] (20590-62752)   File "/usr/lib/python2.7/site-packages/celery/app/trace.py", line 367, in trace_task
Nov 11 12:21:09 pmoravec-caps67-rhev pulp: celery.app.trace:ERROR: [2e9962f1] (20590-62752)     R = retval = fun(*args, **kwargs)
Nov 11 12:21:09 pmoravec-caps67-rhev pulp: celery.app.trace:ERROR: [2e9962f1] (20590-62752)   File "/usr/lib/python2.7/site-packages/pulp/server/async/tasks.py", line 108, in __call__
Nov 11 12:21:09 pmoravec-caps67-rhev pulp: celery.app.trace:ERROR: [2e9962f1] (20590-62752)     return super(PulpTask, self).__call__(*args, **kwargs)
Nov 11 12:21:09 pmoravec-caps67-rhev pulp: celery.app.trace:ERROR: [2e9962f1] (20590-62752)   File "/usr/lib/python2.7/site-packages/celery/app/trace.py", line 622, in __protected_call__
Nov 11 12:21:09 pmoravec-caps67-rhev pulp: celery.app.trace:ERROR: [2e9962f1] (20590-62752)     return self.run(*args, **kwargs)
Nov 11 12:21:09 pmoravec-caps67-rhev pulp: celery.app.trace:ERROR: [2e9962f1] (20590-62752)   File "/usr/lib/python2.7/site-packages/pulp/server/async/tasks.py", line 378, in _release_resource
Nov 11 12:21:09 pmoravec-caps67-rhev pulp: celery.app.trace:ERROR: [2e9962f1] (20590-62752)     new_task.on_failure(exception, task_id, (), {}, MyEinfo)
Nov 11 12:21:09 pmoravec-caps67-rhev pulp: celery.app.trace:ERROR: [2e9962f1] (20590-62752)   File "/usr/lib/python2.7/site-packages/pulp/server/async/tasks.py", line 761, in on_failure
Nov 11 12:21:09 pmoravec-caps67-rhev pulp: celery.app.trace:ERROR: [2e9962f1] (20590-62752)     if not self.request.called_directly:
Nov 11 12:21:09 pmoravec-caps67-rhev pulp: celery.app.trace:ERROR: [2e9962f1] (20590-62752)   File "/usr/lib/python2.7/site-packages/celery/app/task.py", line 983, in _get_request
Nov 11 12:21:09 pmoravec-caps67-rhev pulp: celery.app.trace:ERROR: [2e9962f1] (20590-62752)     req = self.request_stack.top
Nov 11 12:21:09 pmoravec-caps67-rhev pulp: celery.app.trace:ERROR: [2e9962f1] (20590-62752) AttributeError: 'NoneType' object has no attribute 'top'


Only the cause of the error is different (and known here), but apart from that, the same backtrace occurs.

Anyway, there are in fact 2 issues:
1) the uncaught exception raised while trying to log "this step in the task hasn't finished while the task is about to finish / release its resource" (the 'self.request_stack.top' access on an empty request_stack)
  - this is easy to reproduce via the above steps
2) the situation where a task is about to finish (release resources) but some of its steps are still running
  - no idea when or why this happens..

Comment 2 Pavel Moravec 2020-11-11 12:24:39 UTC
.. or a slightly simplified reproducer outside Satellite, on a standalone pulp server (it reproduces JUST the backtrace above, though):


1) forcefully republish all repos (just to generate some tasks):

pulpAdminPassword=$(grep ^default_password /etc/pulp/server.conf | cut -d' ' -f2)

for repo in $(ls /var/lib/pulp/published/yum/master/yum_distributor/); do
    curl -i -H "Content-Type: application/json" -X POST \
        -d "{\"id\":\"$repo\",\"override_config\":{\"force_full\":true}}" \
        -u admin:$pulpAdminPassword \
        https://$(hostname -f)/pulp/api/v2/repositories/$repo/actions/publish/
done


2) restart workers:

service pulp_workers restart


3) check that a random repo publish task raises that backtrace during worker shutdown (recall: the shutdown just triggers the backtrace, which can also occur during a regular run)

Comment 3 Brian Bouterse 2020-11-12 22:26:14 UTC
Based on the stack trace, the original error could be logged if we did it here: https://github.com/pulp/pulp/blob/2-master/server/pulp/server/async/tasks.py#L757-L758

The significant impact, though, isn't that an error occurred; it's that pulp's on_failure() handler fails to handle it, because pulp calls into celery's broken internal state here: https://github.com/pulp/pulp/blob/2-master/server/pulp/server/async/tasks.py#L761

So what I think we should explore is updating the on_failure() routine of Pulp to be more defensive, so that even if Celery is very broken underneath, we make sure Pulp doesn't halt the other tasks. Generally speaking @pmoravec, what do you think about this as an approach?
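
To make that concrete, here is a minimal sketch of the defensive pattern being proposed (illustrative only, not the actual upstream patch; defensive_on_failure and its exact signature are assumptions for the example):

import logging
import traceback

_logger = logging.getLogger(__name__)

def defensive_on_failure(task, exc, task_id, einfo):
    # 1) Record the root cause unconditionally, before touching any celery
    #    internals, so an abnormal worker death is never silently lost.
    tb = getattr(einfo, 'traceback', None) or traceback.format_exc()
    _logger.error('Task %s failed: %r\n%s', task_id, exc, tb)
    # 2) Only then consult celery's request state, and tolerate it being
    #    broken (request_stack is None on an unbound Task instance).
    try:
        return task.request.called_directly
    except AttributeError:
        _logger.warning('celery request state unavailable for task %s; '
                        'continuing so other tasks are not blocked', task_id)
        return False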

Also FYI, I analyzed the pulp/celery/kombu/amqp/billiard stack from 6.5 onward; the last meaningful change was in pyamqp, and celery got some release patches in 6.6, but I don't see any changes in 6.7+, for example.

Comment 5 Pavel Moravec 2020-11-13 08:27:11 UTC
(In reply to Brian Bouterse from comment #3)
> Based on the stack trace the original error could be logged if we did it
> here:
> https://github.com/pulp/pulp/blob/2-master/server/pulp/server/async/tasks.
> py#L757-L758
> 
> The significant impact though isn't that an error occured it's that pulp's
> on_falure() handler fails to handle due to the pulp calling in to celery's
> broken internal state here
> https://github.com/pulp/pulp/blob/2-master/server/pulp/server/async/tasks.
> py#L761
> 
> So what I think we should explore is updating the on_failure() routine of
> Pulp to be more defensive, and if Celery is very broken underneath, we make
> sure Pulp doesn't halt the other tasks. Generally speaking @pmoravec, what
> do you think about that this as an approach?
> 
> Also FYI, I analyzed the pulp/celery/kombu/amqp/billiard stack from 6.5 and
> onward and the last meaningful change was in pyamqp and celery got some
> release patches in 6.6, but I don't see any changes in 6.7+ for example.

This sounds like a good approach; it is worth printing the traceback _and_ especially exc (the original exception) at tasks.py#L757-L758, instead of relying on celery. That workaround patch will allow us to see the original exception. Plus, make the on_failure() routine robust against celery failures by catching exceptions.

I haven't found any relevant diff that would cause a 6.6->6.7 regression, but we started to see this after 6.7 upgrades. Maybe some other subtle change (e.g. katello using a different pulp option that triggers different pulp behaviour) is behind it, but I don't think that is worth investigating.

Comment 6 Pavel Moravec 2020-11-13 09:13:14 UTC
OK, the scenario in one support case is "the celery worker child process was OOM-killed(*), and celery's response is the uncaught exception". That is, so far we have 3 scenarios of an abruptly terminated process (OOM kill, manual kill, no free disk space) that lead to the uncaught celery exception.

(*) Why does a celery worker consume 5GB(!), as in this case? That is another concern I will investigate.

Comment 7 Hao Chang Yu 2020-11-13 11:04:08 UTC
(In reply to Pavel Moravec from comment #6)
> OK, the scenario in one support case is "OOM killed celery worker child
> process(*), celery response is the uncaught exception". That is, so far we
> have 3 scenarios of abruptly terminated process (OOM kill, manual kill, no
> free disk space) that leads to the celery uncaught exception.
> 
> (*) why celery worker consumes 5GB(!) like in this case..? That will be
> another concern I will investigate.

You mean 5GB of RAM? That could be bug 1876782.

Comment 8 Pavel Moravec 2020-11-13 12:22:14 UTC
(In reply to Hao Chang Yu from comment #7)
> (In reply to Pavel Moravec from comment #6)
> > OK, the scenario in one support case is "OOM killed celery worker child
> > process(*), celery response is the uncaught exception". That is, so far we
> > have 3 scenarios of abruptly terminated process (OOM kill, manual kill, no
> > free disk space) that leads to the celery uncaught exception.
> > 
> > (*) why celery worker consumes 5GB(!) like in this case..? That will be
> > another concern I will investigate.
> 
> You mean 5GB RAM? Could be this bug 1876782

Yes, 5GB of RAM in RSS (not VSZ). Too much..

Good point with the BZ! Though it's probably not applicable to this particular case, where I see these task details:

        "progress_report" : {
                "1-obfuscated-data-4cd727c9-8780-4ff1-af6a-ca2e047df849" : [
                        {
                                "num_success" : 1,
                                "description" : "Copying files",
                                "step_type" : "save_tar",
                                "items_total" : 1,
                                "state" : "FINISHED",
                                "error_details" : [ ],
                                "details" : "",
                                "num_failures" : 0,
                                "step_id" : "b4c59cf1-0614-41b8-b9c1-8fc9dcfdacac",
                                "num_processed" : 1
                        },
..
                        {
                                "num_success" : 0,
                                "description" : "Publishing RPMs",
                                "step_type" : "rpms",
                                "items_total" : 0,
                                "state" : "IN_PROGRESS",
                                "error_details" : [ ],
                                "details" : "",
                                "num_failures" : 0,
                                "step_id" : "6995bdda-38a9-4d3f-84f4-a4ff76982bae",
                                "num_processed" : 0
                        },
                        {
                                "num_success" : 0,
                                "description" : "Publishing Errata",
                                "step_type" : "errata",
                                "items_total" : 1,
                                "state" : "NOT_STARTED",
                                "error_details" : [ ],
                                "details" : "",
                                "num_failures" : 0,
                                "step_id" : "3fa09604-8972-476b-82df-8fdf8d9249a6",
                                "num_processed" : 0
                        },

I.e. it should have happened while publishing RPMs, not (yet) errata.

Or maybe the task details are not up-to-date..
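
For completeness, a hedged sketch of re-fetching such task details from the pulp 2 task API to check whether the progress_report above is stale (the host name and task UUID are placeholders, the credentials are read the same way as in the comment 2 reproducer, and certificate verification is skipped only to keep the sketch short):

import requests

def pulp_admin_password(conf='/etc/pulp/server.conf'):
    # same credential source as the comment 2 reproducer
    with open(conf) as f:
        for line in f:
            if line.startswith('default_password'):
                return line.split()[-1]

def task_progress(host, task_id):
    # GET /pulp/api/v2/tasks/<task_id>/ returns the task document,
    # including its progress_report
    url = 'https://%s/pulp/api/v2/tasks/%s/' % (host, task_id)
    resp = requests.get(url, auth=('admin', pulp_admin_password()), verify=False)
    resp.raise_for_status()
    return resp.json().get('progress_report', {})

# placeholder host and task UUID
for steps in task_progress('satellite.example.com',
                           'b2851544-6a94-4e5d-bd6a-5bd9cde01ac0').values():
    for step in steps:
        print('%s: %s' % (step['step_type'], step['state']))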

Comment 9 pulp-infra@redhat.com 2020-11-13 16:07:42 UTC
The Pulp upstream bug status is at NEW. Updating the external tracker on this bug.

Comment 10 pulp-infra@redhat.com 2020-11-13 16:07:43 UTC
The Pulp upstream bug priority is at Normal. Updating the external tracker on this bug.

Comment 15 pulp-infra@redhat.com 2020-11-25 22:06:34 UTC
The Pulp upstream bug status is at ASSIGNED. Updating the external tracker on this bug.

Comment 16 pulp-infra@redhat.com 2020-12-04 15:06:06 UTC
The Pulp upstream bug status is at POST. Updating the external tracker on this bug.

Comment 17 pulp-infra@redhat.com 2020-12-09 14:07:30 UTC
The Pulp upstream bug status is at MODIFIED. Updating the external tracker on this bug.

Comment 18 pulp-infra@redhat.com 2020-12-09 15:06:28 UTC
All upstream Pulp bugs are at MODIFIED+. Moving this bug to POST.

Comment 20 pulp-infra@redhat.com 2021-01-25 20:32:07 UTC
The Pulp upstream bug status is at CLOSED - CURRENTRELEASE. Updating the external tracker on this bug.

Comment 24 Pavel Moravec 2021-03-09 11:56:10 UTC
*** Bug 1926627 has been marked as a duplicate of this bug. ***

Comment 26 Tasos Papaioannou 2021-03-17 00:59:55 UTC
Verified on 6.9.0 snap 17.

Comment 31 errata-xmlrpc 2021-04-21 13:18:20 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Moderate: Satellite 6.9 Release), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2021:1313

Comment 32 Red Hat Bugzilla 2023-09-15 00:49:56 UTC
The needinfo request[s] on this closed bug have been removed as they have been unresolved for 500 days