Bug 1889795 - uncaught exception after a celery worker process abruptly terminates, resulting in a hung task [NEEDINFO]
Keywords:
Status: NEW
Alias: None
Product: Red Hat Satellite
Classification: Red Hat
Component: Pulp
Version: 6.7.0
Hardware: x86_64
OS: Linux
Priority: medium
Severity: medium
Target Milestone: Unspecified
Assignee: satellite6-bugs
QA Contact: Lai
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2020-10-20 15:16 UTC by Pavel Moravec
Modified: 2020-11-25 22:06 UTC (History)
9 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed:
Target Upstream Version:
patalber: needinfo? (satellite6-bugs)


Attachments (Terms of Use)


Links
System ID Priority Status Summary Last Updated
Pulp Redmine 5502 Normal ASSIGNED worker fails to mark a task as failed when critical conditions are encountered 2020-11-25 22:06:33 UTC
Red Hat Knowledge Base (Solution) 5569881 None None None 2020-11-11 11:48:22 UTC

Description Pavel Moravec 2020-10-20 15:16:28 UTC
Description of problem:
When a celery worker process is processing a task and the process terminates unexpectedly (a bit artificial, but it can happen), then:
- the parent process hits an uncaught exception
- the processed task hangs forever, as the uncaught exception happens just before the task status would be updated

This leaves a Content View Publish/Promote or Capsule Sync task hung forever, waiting on a pulp task that stays "running" forever.

Moreover, the original cause of the failure is not recorded anywhere, since the uncaught exception is raised before "task_status['traceback'] = einfo.traceback" would store it. This bug therefore prevents investigating the root cause of "why did the pulp worker terminate?".
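
The failure mode can be illustrated in isolation: the handler touches the torn-down celery request stack before it ever stores the traceback, so the AttributeError masks the root cause. A toy sketch (hypothetical names, not the actual Pulp code):

```python
import traceback

class BrokenTask:
    """Toy stand-in for a celery Task whose request stack was torn down
    after the worker child process died abnormally."""

    def __init__(self):
        self.request_stack = None  # celery tore this down on abnormal exit

    def on_failure(self, exc):
        task_status = {}
        # The handler touches celery internals first...
        req = self.request_stack.top  # AttributeError: 'NoneType' has no attribute 'top'
        # ...so this line, which would have preserved the root cause, is never reached:
        task_status['traceback'] = traceback.format_exc()
        return task_status

try:
    BrokenTask().on_failure(RuntimeError("worker terminated abnormally"))
except AttributeError as e:
    print(e)  # 'NoneType' object has no attribute 'top'
```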


Version-Release number of selected component (if applicable):
Any Sat 6.7, e.g.:
- pulp-server-2.21.0.4-1
- satellite-6.7.4-1


How reproducible:
100%


Steps to Reproduce:
1. Invoke a bigger CV publish/promote (with several repos inside)
2. While there are pulp celery workers processing the sync/publish tasks, kill some of them via "kill -SIGHUP <pid>"
3. Check /var/log/messages
4. Check CV publish/promote task


Actual results:
3. logs error like:

Oct 20 16:47:12 pmoravec-caps67-rhev pulp: py.warnings:WARNING: [989d3c32] (10316-57088) /usr/lib64/python2.7/site-packages/pymongo/topology.py:74: UserWarning: MongoClient opened before fork. Create MongoClient with connect=False, or create client after forking. See PyMongo's documentation for details: http://api.mongodb.org/python/current/faq.html#using-pymongo-with-multiprocessing>
Oct 20 16:47:12 pmoravec-caps67-rhev pulp: py.warnings:WARNING: [989d3c32] (10316-57088)   "MongoClient opened before fork. Create MongoClient "
Oct 20 16:47:12 pmoravec-caps67-rhev pulp: py.warnings:WARNING: [989d3c32] (10316-57088)
Oct 20 16:47:12 pmoravec-caps67-rhev pulp: pulp.server.async.tasks:INFO: [989d3c32] Task failed : [b2851544-6a94-4e5d-bd6a-5bd9cde01ac0] : Worker terminated abnormally while processing task b2851544-6a94-4e5d-bd6a-5bd9cde01ac0.  Check the logs for details
Oct 20 16:47:12 pmoravec-caps67-rhev pulp: celery.app.trace:ERROR: [989d3c32] (10316-57088) Task pulp.server.async.tasks._release_resource[989d3c32-12c8-4fd7-9ee0-53e62880a156] raised unexpected: AttributeError("'NoneType' object has no attribute 'top'",)
Oct 20 16:47:12 pmoravec-caps67-rhev pulp: celery.app.trace:ERROR: [989d3c32] (10316-57088) Traceback (most recent call last):
Oct 20 16:47:12 pmoravec-caps67-rhev pulp: celery.app.trace:ERROR: [989d3c32] (10316-57088)   File "/usr/lib/python2.7/site-packages/celery/app/trace.py", line 367, in trace_task
Oct 20 16:47:12 pmoravec-caps67-rhev pulp: celery.app.trace:ERROR: [989d3c32] (10316-57088)     R = retval = fun(*args, **kwargs)
Oct 20 16:47:12 pmoravec-caps67-rhev pulp: celery.app.trace:ERROR: [989d3c32] (10316-57088)   File "/usr/lib/python2.7/site-packages/pulp/server/async/tasks.py", line 108, in __call__
Oct 20 16:47:12 pmoravec-caps67-rhev pulp: celery.app.trace:ERROR: [989d3c32] (10316-57088)     return super(PulpTask, self).__call__(*args, **kwargs)
Oct 20 16:47:12 pmoravec-caps67-rhev pulp: celery.app.trace:ERROR: [989d3c32] (10316-57088)   File "/usr/lib/python2.7/site-packages/celery/app/trace.py", line 622, in __protected_call__
Oct 20 16:47:12 pmoravec-caps67-rhev pulp: celery.app.trace:ERROR: [989d3c32] (10316-57088)     return self.run(*args, **kwargs)
Oct 20 16:47:12 pmoravec-caps67-rhev pulp: celery.app.trace:ERROR: [989d3c32] (10316-57088)   File "/usr/lib/python2.7/site-packages/pulp/server/async/tasks.py", line 376, in _release_resource
Oct 20 16:47:12 pmoravec-caps67-rhev pulp: celery.app.trace:ERROR: [989d3c32] (10316-57088)     new_task.on_failure(exception, task_id, (), {}, MyEinfo)
Oct 20 16:47:12 pmoravec-caps67-rhev pulp: celery.app.trace:ERROR: [989d3c32] (10316-57088)   File "/usr/lib/python2.7/site-packages/pulp/server/async/tasks.py", line 759, in on_failure
Oct 20 16:47:12 pmoravec-caps67-rhev pulp: celery.app.trace:ERROR: [989d3c32] (10316-57088)     if not self.request.called_directly:
Oct 20 16:47:12 pmoravec-caps67-rhev pulp: celery.app.trace:ERROR: [989d3c32] (10316-57088)   File "/usr/lib/python2.7/site-packages/celery/app/task.py", line 978, in _get_request
Oct 20 16:47:12 pmoravec-caps67-rhev pulp: celery.app.trace:ERROR: [989d3c32] (10316-57088)     req = self.request_stack.top
Oct 20 16:47:12 pmoravec-caps67-rhev pulp: celery.app.trace:ERROR: [989d3c32] (10316-57088) AttributeError: 'NoneType' object has no attribute 'top'

4. The foreman task never terminates, as the pulp task is "running" forever.


Expected results:
3. No such uncaught exception.
4. The foreman task terminates soon after the pulp worker is gone.


Additional info:
This might be a pulp-2 specific bug that does not happen in pulp-3. If pulp-3 is robust in this scenario, I am OK with closing it as NEXTRELEASE.

Comment 1 Pavel Moravec 2020-11-11 11:29:42 UTC
Indirect reproducer:

- invoke several Capsule syncs
- once many tasks are running or pending, restart pulp services on the Capsule:

for i in pulp_celerybeat pulp_resource_manager pulp_streamer pulp_workers; do service $i restart; done

- /var/log/messages will log:

Nov 11 12:21:09 pmoravec-caps67-rhev pulp: pulp.server.async.tasks:INFO: [2e9962f1] Task failed : [a2745159-81ab-42ee-907c-f265b618e00c] : Worker terminated abnormally while processing task a2745159-81ab-42ee-907c-f265b618e00c.  Check the logs for details
Nov 11 12:21:09 pmoravec-caps67-rhev pulp: celery.app.trace:ERROR: [2e9962f1] (20590-62752) Task pulp.server.async.tasks._release_resource[2e9962f1-3f24-4b28-80ba-9396f58b5f26] raised unexpected: AttributeError("'NoneType' object has no attribute 'top'",)
Nov 11 12:21:09 pmoravec-caps67-rhev pulp: celery.app.trace:ERROR: [2e9962f1] (20590-62752) Traceback (most recent call last):
Nov 11 12:21:09 pmoravec-caps67-rhev pulp: celery.app.trace:ERROR: [2e9962f1] (20590-62752)   File "/usr/lib/python2.7/site-packages/celery/app/trace.py", line 367, in trace_task
Nov 11 12:21:09 pmoravec-caps67-rhev pulp: celery.app.trace:ERROR: [2e9962f1] (20590-62752)     R = retval = fun(*args, **kwargs)
Nov 11 12:21:09 pmoravec-caps67-rhev pulp: celery.app.trace:ERROR: [2e9962f1] (20590-62752)   File "/usr/lib/python2.7/site-packages/pulp/server/async/tasks.py", line 108, in __call__
Nov 11 12:21:09 pmoravec-caps67-rhev pulp: celery.app.trace:ERROR: [2e9962f1] (20590-62752)     return super(PulpTask, self).__call__(*args, **kwargs)
Nov 11 12:21:09 pmoravec-caps67-rhev pulp: celery.app.trace:ERROR: [2e9962f1] (20590-62752)   File "/usr/lib/python2.7/site-packages/celery/app/trace.py", line 622, in __protected_call__
Nov 11 12:21:09 pmoravec-caps67-rhev pulp: celery.app.trace:ERROR: [2e9962f1] (20590-62752)     return self.run(*args, **kwargs)
Nov 11 12:21:09 pmoravec-caps67-rhev pulp: celery.app.trace:ERROR: [2e9962f1] (20590-62752)   File "/usr/lib/python2.7/site-packages/pulp/server/async/tasks.py", line 378, in _release_resource
Nov 11 12:21:09 pmoravec-caps67-rhev pulp: celery.app.trace:ERROR: [2e9962f1] (20590-62752)     new_task.on_failure(exception, task_id, (), {}, MyEinfo)
Nov 11 12:21:09 pmoravec-caps67-rhev pulp: celery.app.trace:ERROR: [2e9962f1] (20590-62752)   File "/usr/lib/python2.7/site-packages/pulp/server/async/tasks.py", line 761, in on_failure
Nov 11 12:21:09 pmoravec-caps67-rhev pulp: celery.app.trace:ERROR: [2e9962f1] (20590-62752)     if not self.request.called_directly:
Nov 11 12:21:09 pmoravec-caps67-rhev pulp: celery.app.trace:ERROR: [2e9962f1] (20590-62752)   File "/usr/lib/python2.7/site-packages/celery/app/task.py", line 983, in _get_request
Nov 11 12:21:09 pmoravec-caps67-rhev pulp: celery.app.trace:ERROR: [2e9962f1] (20590-62752)     req = self.request_stack.top
Nov 11 12:21:09 pmoravec-caps67-rhev pulp: celery.app.trace:ERROR: [2e9962f1] (20590-62752) AttributeError: 'NoneType' object has no attribute 'top'


Only the cause of the error is different (and known here); apart from that, the same backtrace occurs.

Anyway, there are in fact 2 issues:
1) the uncaught exception when trying to log "this step ('self.request_stack.top' of an empty request_stack) in the task hasn't finished while the task is about to finish / release resources"
  - this is easy to reproduce via the steps above
2) the situation where a task is about to finish (release resources) but some of its steps are still running
  - no idea when or why this happens..

Comment 2 Pavel Moravec 2020-11-11 12:24:39 UTC
.. or a slightly simplified reproducer outside Satellite / on a standalone pulp server (JUST the backtrace above, though):


1) forcefully republish all repos (just to generate some tasks):

pulpAdminPassword=$(grep ^default_password /etc/pulp/server.conf | cut -d' ' -f2)

for repo in $(ls /var/lib/pulp/published/yum/master/yum_distributor/); do curl -i -H "Content-Type: application/json" -X POST -d "{\"id\":\"$repo\",\"override_config\":{\"force_full\":true}}" -u admin:$pulpAdminPassword https://$(hostname -f)/pulp/api/v2/repositories/$repo/actions/publish/; done


2) restart workers:

service pulp_workers restart


3) check that a random repo publish task raises that backtrace during worker shutdown (recall: the shutdown merely triggers a backtrace that can also occur during a regular run)

Comment 3 Brian Bouterse 2020-11-12 22:26:14 UTC
Based on the stack trace, the original error could be logged if we did it here: https://github.com/pulp/pulp/blob/2-master/server/pulp/server/async/tasks.py#L757-L758

The significant impact, though, isn't that an error occurred; it's that pulp's on_failure() handler fails to handle it, because pulp calls into celery's broken internal state here: https://github.com/pulp/pulp/blob/2-master/server/pulp/server/async/tasks.py#L761

So what I think we should explore is updating Pulp's on_failure() routine to be more defensive: if Celery is badly broken underneath, we make sure Pulp doesn't halt the other tasks. Generally speaking @pmoravec, what do you think about this as an approach?

Also FYI, I analyzed the pulp/celery/kombu/amqp/billiard stack from 6.5 onward; the last meaningful change was in pyamqp, and celery got some release patches in 6.6, but I don't see any changes in 6.7+, for example.
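
A minimal sketch of that defensive approach (hypothetical helper names, not the actual patch): record the failure state and original traceback first, and only then consult celery internals, tolerating a torn-down request stack:

```python
import logging

logging.basicConfig(level=logging.ERROR)
log = logging.getLogger("pulp.server.async.tasks")

def defensive_on_failure(task, exc, einfo_traceback, task_status):
    """Persist the failure even when celery's internal state is unusable.

    `task_status` is a dict standing in for the stored task document.
    """
    # 1. Record the original error FIRST, so the root cause is never lost.
    task_status["state"] = "error"
    task_status["traceback"] = einfo_traceback
    log.error("Task failed: %r", exc)

    # 2. Only then consult celery internals, tolerating broken state.
    try:
        called_directly = task.request.called_directly
    except AttributeError:
        called_directly = None  # celery state is gone; carry on anyway
    task_status["called_directly"] = called_directly
    return task_status

class BrokenTask:
    """Stand-in whose request stack has been torn down."""
    request_stack = None

    @property
    def request(self):
        return self.request_stack.top  # AttributeError, as in the logs above

status = defensive_on_failure(
    BrokenTask(), RuntimeError("worker terminated"),
    "Traceback (most recent call last): ...", {})
print(status["state"])  # error
```

The point of the ordering is that the status update no longer depends on celery's request stack being intact, so a dead worker can no longer leave the task "running" forever.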

Comment 5 Pavel Moravec 2020-11-13 08:27:11 UTC
(In reply to Brian Bouterse from comment #3)
> Based on the stack trace the original error could be logged if we did it
> here:
> https://github.com/pulp/pulp/blob/2-master/server/pulp/server/async/tasks.
> py#L757-L758
> 
> The significant impact though isn't that an error occured it's that pulp's
> on_falure() handler fails to handle due to the pulp calling in to celery's
> broken internal state here
> https://github.com/pulp/pulp/blob/2-master/server/pulp/server/async/tasks.
> py#L761
> 
> So what I think we should explore is updating the on_failure() routine of
> Pulp to be more defensive, and if Celery is very broken underneath, we make
> sure Pulp doesn't halt the other tasks. Generally speaking @pmoravec, what
> do you think about that this as an approach?
> 
> Also FYI, I analyzed the pulp/celery/kombu/amqp/billiard stack from 6.5 and
> onward and the last meaningful change was in pyamqp and celery got some
> release patches in 6.6, but I don't see any changes in 6.7+ for example.

This sounds like a good approach; it's worth printing the traceback _and_ notably exc (the original exception) at tasks.py#L757-L758, instead of relying on celery. This workaround patch will allow us to see the original exception. Plus, make the on_failure routine robust against celery failures by catching exceptions.

I haven't found any relevant diff causing a 6.6->6.7 regression, but we started to see this since the 6.7 upgrades. Maybe some other subtle change (e.g. katello using a different pulp option, triggering different pulp behaviour) is behind it, but I don't think that is worth investigating.

Comment 6 Pavel Moravec 2020-11-13 09:13:14 UTC
OK, the scenario in one support case is "OOM killed a celery worker child process(*); celery's response is the uncaught exception". That is, so far we have 3 scenarios of an abruptly terminated process (OOM kill, manual kill, no free disk space) that lead to the celery uncaught exception.

(*) why does a celery worker consume 5GB(!) as in this case..? That will be another concern I will investigate.

Comment 7 Hao Chang Yu 2020-11-13 11:04:08 UTC
(In reply to Pavel Moravec from comment #6)
> OK, the scenario in one support case is "OOM killed celery worker child
> process(*), celery response is the uncaught exception". That is, so far we
> have 3 scenarios of abruptly terminated process (OOM kill, manual kill, no
> free disk space) that leads to the celery uncaught exception.
> 
> (*) why celery worker consumes 5GB(!) like in this case..? That will be
> another concern I will investigate.

You mean 5GB of RAM? It could be this bug: 1876782

Comment 8 Pavel Moravec 2020-11-13 12:22:14 UTC
(In reply to Hao Chang Yu from comment #7)
> (In reply to Pavel Moravec from comment #6)
> > OK, the scenario in one support case is "OOM killed celery worker child
> > process(*), celery response is the uncaught exception". That is, so far we
> > have 3 scenarios of abruptly terminated process (OOM kill, manual kill, no
> > free disk space) that leads to the celery uncaught exception.
> > 
> > (*) why celery worker consumes 5GB(!) like in this case..? That will be
> > another concern I will investigate.
> 
> You mean 5GB RAM? Could be this bug 1876782

Yes, 5GB RAM in RSS (not VSZ). Too much..

Good point about that BZ! Though it is probably not applicable to this particular case, where I see these task details:

        "progress_report" : {
                "1-obfuscated-data-4cd727c9-8780-4ff1-af6a-ca2e047df849" : [
                        {
                                "num_success" : 1,
                                "description" : "Copying files",
                                "step_type" : "save_tar",
                                "items_total" : 1,
                                "state" : "FINISHED",
                                "error_details" : [ ],
                                "details" : "",
                                "num_failures" : 0,
                                "step_id" : "b4c59cf1-0614-41b8-b9c1-8fc9dcfdacac",
                                "num_processed" : 1
                        },
..
                        {
                                "num_success" : 0,
                                "description" : "Publishing RPMs",
                                "step_type" : "rpms",
                                "items_total" : 0,
                                "state" : "IN_PROGRESS",
                                "error_details" : [ ],
                                "details" : "",
                                "num_failures" : 0,
                                "step_id" : "6995bdda-38a9-4d3f-84f4-a4ff76982bae",
                                "num_processed" : 0
                        },
                        {
                                "num_success" : 0,
                                "description" : "Publishing Errata",
                                "step_type" : "errata",
                                "items_total" : 1,
                                "state" : "NOT_STARTED",
                                "error_details" : [ ],
                                "details" : "",
                                "num_failures" : 0,
                                "step_id" : "3fa09604-8972-476b-82df-8fdf8d9249a6",
                                "num_processed" : 0
                        },

I.e. it should have happened while publishing RPMs, not (yet) errata.

Or maybe the task details are not up-to-date..
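
When triaging such a hung task, the step left IN_PROGRESS in the progress_report points at where the worker died. A small helper to pull that out (hypothetical code, run against a simplified report):

```python
def in_progress_steps(progress_report):
    """Return (description, step_type) pairs for steps still IN_PROGRESS."""
    hung = []
    for steps in progress_report.values():
        for step in steps:
            if step.get("state") == "IN_PROGRESS":
                hung.append((step["description"], step["step_type"]))
    return hung

# Simplified version of the progress_report pasted above.
report = {
    "repo-group": [
        {"description": "Copying files", "step_type": "save_tar", "state": "FINISHED"},
        {"description": "Publishing RPMs", "step_type": "rpms", "state": "IN_PROGRESS"},
        {"description": "Publishing Errata", "step_type": "errata", "state": "NOT_STARTED"},
    ],
}
print(in_progress_steps(report))  # [('Publishing RPMs', 'rpms')]
```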

Comment 9 pulp-infra@redhat.com 2020-11-13 16:07:42 UTC
The Pulp upstream bug status is at NEW. Updating the external tracker on this bug.

Comment 10 pulp-infra@redhat.com 2020-11-13 16:07:43 UTC
The Pulp upstream bug priority is at Normal. Updating the external tracker on this bug.

Comment 15 pulp-infra@redhat.com 2020-11-25 22:06:34 UTC
The Pulp upstream bug status is at ASSIGNED. Updating the external tracker on this bug.

