Description of problem:

When a celery worker process is processing a task and the process terminates unexpectedly (a bit artificial, but it can happen), then:

- the parent process hits an uncaught exception
- the task being processed hangs forever, because the uncaught exception is raised just before the task status would be updated

This leads to a Content View Publish/Promote or Capsule Sync task hanging forever, waiting on a pulp task that stays "running" forever. Moreover, the original cause of the failure is not recorded anywhere, since the uncaught exception is raised before "task_status['traceback'] = einfo.traceback" would store it. This bug therefore prevents investigation of the root cause of "why did the pulp worker terminate?".

Version-Release number of selected component (if applicable):
Any Sat 6.7, e.g.:
- pulp-server-2.21.0.4-1
- satellite-6.7.4-1

How reproducible:
100%

Steps to Reproduce:
1. Invoke some bigger CV publish/promote (with more repos inside).
2. While pulp celery workers are processing the sync/publish tasks, kill some of them via "kill -SIGHUP <pid>".
3. Check /var/log/messages.
4. Check the CV publish/promote task.

Actual results:
3. The log contains an error like:

Oct 20 16:47:12 pmoravec-caps67-rhev pulp: py.warnings:WARNING: [989d3c32] (10316-57088) /usr/lib64/python2.7/site-packages/pymongo/topology.py:74: UserWarning: MongoClient opened before fork. Create MongoClient with connect=False, or create client after forking. See PyMongo's documentation for details: http://api.mongodb.org/python/current/faq.html#using-pymongo-with-multiprocessing
Oct 20 16:47:12 pmoravec-caps67-rhev pulp: py.warnings:WARNING: [989d3c32] (10316-57088)   "MongoClient opened before fork. Create MongoClient "
Oct 20 16:47:12 pmoravec-caps67-rhev pulp: py.warnings:WARNING: [989d3c32] (10316-57088)
Oct 20 16:47:12 pmoravec-caps67-rhev pulp: pulp.server.async.tasks:INFO: [989d3c32] Task failed : [b2851544-6a94-4e5d-bd6a-5bd9cde01ac0] : Worker terminated abnormally while processing task b2851544-6a94-4e5d-bd6a-5bd9cde01ac0. Check the logs for details
Oct 20 16:47:12 pmoravec-caps67-rhev pulp: celery.app.trace:ERROR: [989d3c32] (10316-57088) Task pulp.server.async.tasks._release_resource[989d3c32-12c8-4fd7-9ee0-53e62880a156] raised unexpected: AttributeError("'NoneType' object has no attribute 'top'",)
Oct 20 16:47:12 pmoravec-caps67-rhev pulp: celery.app.trace:ERROR: [989d3c32] (10316-57088) Traceback (most recent call last):
Oct 20 16:47:12 pmoravec-caps67-rhev pulp: celery.app.trace:ERROR: [989d3c32] (10316-57088)   File "/usr/lib/python2.7/site-packages/celery/app/trace.py", line 367, in trace_task
Oct 20 16:47:12 pmoravec-caps67-rhev pulp: celery.app.trace:ERROR: [989d3c32] (10316-57088)     R = retval = fun(*args, **kwargs)
Oct 20 16:47:12 pmoravec-caps67-rhev pulp: celery.app.trace:ERROR: [989d3c32] (10316-57088)   File "/usr/lib/python2.7/site-packages/pulp/server/async/tasks.py", line 108, in __call__
Oct 20 16:47:12 pmoravec-caps67-rhev pulp: celery.app.trace:ERROR: [989d3c32] (10316-57088)     return super(PulpTask, self).__call__(*args, **kwargs)
Oct 20 16:47:12 pmoravec-caps67-rhev pulp: celery.app.trace:ERROR: [989d3c32] (10316-57088)   File "/usr/lib/python2.7/site-packages/celery/app/trace.py", line 622, in __protected_call__
Oct 20 16:47:12 pmoravec-caps67-rhev pulp: celery.app.trace:ERROR: [989d3c32] (10316-57088)     return self.run(*args, **kwargs)
Oct 20 16:47:12 pmoravec-caps67-rhev pulp: celery.app.trace:ERROR: [989d3c32] (10316-57088)   File "/usr/lib/python2.7/site-packages/pulp/server/async/tasks.py", line 376, in _release_resource
Oct 20 16:47:12 pmoravec-caps67-rhev pulp: celery.app.trace:ERROR: [989d3c32] (10316-57088)     new_task.on_failure(exception, task_id, (), {}, MyEinfo)
Oct 20 16:47:12 pmoravec-caps67-rhev pulp: celery.app.trace:ERROR: [989d3c32] (10316-57088)   File "/usr/lib/python2.7/site-packages/pulp/server/async/tasks.py", line 759, in on_failure
Oct 20 16:47:12 pmoravec-caps67-rhev pulp: celery.app.trace:ERROR: [989d3c32] (10316-57088)     if not self.request.called_directly:
Oct 20 16:47:12 pmoravec-caps67-rhev pulp: celery.app.trace:ERROR: [989d3c32] (10316-57088)   File "/usr/lib/python2.7/site-packages/celery/app/task.py", line 978, in _get_request
Oct 20 16:47:12 pmoravec-caps67-rhev pulp: celery.app.trace:ERROR: [989d3c32] (10316-57088)     req = self.request_stack.top
Oct 20 16:47:12 pmoravec-caps67-rhev pulp: celery.app.trace:ERROR: [989d3c32] (10316-57088) AttributeError: 'NoneType' object has no attribute 'top'

4. The foreman task never terminates, as the pulp task stays "running" forever.

Expected results:
3. No such uncaught exception.
4. The foreman task terminates soon after the pulp worker is gone.

Additional info:
This might be a pulp-2 specific bug that does not happen in pulp-3. If pulp-3 is robust in this scenario, I am OK with closing it as NEXTRELEASE. A minimal illustration of why the AttributeError above is raised follows at the end of this comment.
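For illustration only, here is a stripped-down model of the failure path from the traceback above. This is not the real celery or pulp code; the class and attribute names merely mimic it. The point is that the Task instance built inside _release_resource() has no request stack, so merely touching self.request raises AttributeError before the code that would persist the original traceback ever runs.

# Simplified, runnable model of the call chain seen in the traceback above.
# FakeTask is an illustration, not celery's or pulp's actual implementation.


class FakeTask(object):
    # In celery, a task's request_stack is only populated when the task runs
    # through the worker machinery; here it stays None, as it effectively does
    # for the Task() instance built inside pulp's _release_resource().
    request_stack = None

    @property
    def request(self):
        # Mirrors the line from celery/app/task.py in the traceback:
        #     req = self.request_stack.top
        # With request_stack == None this raises AttributeError immediately.
        return self.request_stack.top

    def on_failure(self, exc, task_id, args, kwargs, einfo):
        # Pulp's on_failure() consults self.request first, so the line that
        # would persist the original traceback is never reached.
        if not self.request.called_directly:
            pass
        task_status = {}
        task_status['traceback'] = einfo.traceback  # never executed


if __name__ == '__main__':
    try:
        FakeTask().on_failure(RuntimeError('worker died'), 'some-task-id', (), {}, None)
    except AttributeError as e:
        # prints: 'NoneType' object has no attribute 'top' - the original
        # RuntimeError (the real root cause) is lost along the way.
        print('uncaught, original error lost: %s' % e)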
Indirect reproducer:

- invoke several Capsule syncs
- once many tasks are running or pending, restart the pulp services on the Capsule:

  for i in pulp_celerybeat pulp_resource_manager pulp_streamer pulp_workers; do service $i restart; done

- /var/log/messages will then log:

Nov 11 12:21:09 pmoravec-caps67-rhev pulp: pulp.server.async.tasks:INFO: [2e9962f1] Task failed : [a2745159-81ab-42ee-907c-f265b618e00c] : Worker terminated abnormally while processing task a2745159-81ab-42ee-907c-f265b618e00c. Check the logs for details
Nov 11 12:21:09 pmoravec-caps67-rhev pulp: celery.app.trace:ERROR: [2e9962f1] (20590-62752) Task pulp.server.async.tasks._release_resource[2e9962f1-3f24-4b28-80ba-9396f58b5f26] raised unexpected: AttributeError("'NoneType' object has no attribute 'top'",)
Nov 11 12:21:09 pmoravec-caps67-rhev pulp: celery.app.trace:ERROR: [2e9962f1] (20590-62752) Traceback (most recent call last):
Nov 11 12:21:09 pmoravec-caps67-rhev pulp: celery.app.trace:ERROR: [2e9962f1] (20590-62752)   File "/usr/lib/python2.7/site-packages/celery/app/trace.py", line 367, in trace_task
Nov 11 12:21:09 pmoravec-caps67-rhev pulp: celery.app.trace:ERROR: [2e9962f1] (20590-62752)     R = retval = fun(*args, **kwargs)
Nov 11 12:21:09 pmoravec-caps67-rhev pulp: celery.app.trace:ERROR: [2e9962f1] (20590-62752)   File "/usr/lib/python2.7/site-packages/pulp/server/async/tasks.py", line 108, in __call__
Nov 11 12:21:09 pmoravec-caps67-rhev pulp: celery.app.trace:ERROR: [2e9962f1] (20590-62752)     return super(PulpTask, self).__call__(*args, **kwargs)
Nov 11 12:21:09 pmoravec-caps67-rhev pulp: celery.app.trace:ERROR: [2e9962f1] (20590-62752)   File "/usr/lib/python2.7/site-packages/celery/app/trace.py", line 622, in __protected_call__
Nov 11 12:21:09 pmoravec-caps67-rhev pulp: celery.app.trace:ERROR: [2e9962f1] (20590-62752)     return self.run(*args, **kwargs)
Nov 11 12:21:09 pmoravec-caps67-rhev pulp: celery.app.trace:ERROR: [2e9962f1] (20590-62752)   File "/usr/lib/python2.7/site-packages/pulp/server/async/tasks.py", line 378, in _release_resource
Nov 11 12:21:09 pmoravec-caps67-rhev pulp: celery.app.trace:ERROR: [2e9962f1] (20590-62752)     new_task.on_failure(exception, task_id, (), {}, MyEinfo)
Nov 11 12:21:09 pmoravec-caps67-rhev pulp: celery.app.trace:ERROR: [2e9962f1] (20590-62752)   File "/usr/lib/python2.7/site-packages/pulp/server/async/tasks.py", line 761, in on_failure
Nov 11 12:21:09 pmoravec-caps67-rhev pulp: celery.app.trace:ERROR: [2e9962f1] (20590-62752)     if not self.request.called_directly:
Nov 11 12:21:09 pmoravec-caps67-rhev pulp: celery.app.trace:ERROR: [2e9962f1] (20590-62752)   File "/usr/lib/python2.7/site-packages/celery/app/task.py", line 983, in _get_request
Nov 11 12:21:09 pmoravec-caps67-rhev pulp: celery.app.trace:ERROR: [2e9962f1] (20590-62752)     req = self.request_stack.top
Nov 11 12:21:09 pmoravec-caps67-rhev pulp: celery.app.trace:ERROR: [2e9962f1] (20590-62752) AttributeError: 'NoneType' object has no attribute 'top'

The cause of the error is different here (and known in this case), but apart from that, the same backtrace appears. In fact there are 2 issues:

1) the uncaught exception when trying to log "this step ('self.request_stack.top' of an empty request_stack) in the task hasn't finished while the task is about to finish / release the resource" - this is easy to reproduce via the steps above
2) the situation when a task is about to finish (release resources) while some of its steps are still running - no idea when or why this happens..
.. or a somewhat simplified reproducer outside Satellite / on a standalone pulp server (it triggers JUST the backtrace above, though):

1) Forcefully republish all repos (just to generate some tasks):

pulpAdminPassword=$(grep ^default_password /etc/pulp/server.conf | cut -d' ' -f2)
for repo in $(ls /var/lib/pulp/published/yum/master/yum_distributor/); do
  curl -i -H "Content-Type: application/json" -X POST -d "{\"id\":\"$repo\",\"override_config\":{\"force_full\":true}}" -u admin:$pulpAdminPassword https://$(hostname -f)/pulp/api/v2/repositories/$repo/actions/publish/
done

2) Restart the workers:

service pulp_workers restart

3) Check that a random repo publish task raises that backtrace during the worker shutdown (recall: the shutdown just triggers a backtrace that can also occur during a regular run). A small helper sketch for polling the affected pulp task's state follows below.
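To confirm the "stuck forever" symptom from the pulp side, a small helper like this sketch can poll the task that the killed worker was processing. The endpoint /pulp/api/v2/tasks/<task_id>/ and the "state" field are my assumptions based on the Pulp 2 task REST API (the repositories API is used above, so the same base path is assumed); adjust if your version differs.

# Hypothetical helper to watch a (possibly stuck) Pulp 2 task.
# Assumptions: GET /pulp/api/v2/tasks/<task_id>/ exists and returns a JSON
# document with a "state" field; neither is taken from this bug report.
import socket
import requests


def get_task_state(task_id, password, host=None):
    host = host or socket.getfqdn()
    url = 'https://%s/pulp/api/v2/tasks/%s/' % (host, task_id)
    # verify=False because a default Satellite/Pulp install uses its own CA;
    # point verify at the CA bundle instead if it is available.
    resp = requests.get(url, auth=('admin', password), verify=False)
    resp.raise_for_status()
    return resp.json().get('state')


if __name__ == '__main__':
    # With the bug present, a task hit by the worker kill stays "running"
    # forever instead of moving to an error state.
    print(get_task_state('b2851544-6a94-4e5d-bd6a-5bd9cde01ac0',
                         'the-admin-password-from-/etc/pulp/server.conf'))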
Based on the stack trace, the original error could be logged if we did it here:
https://github.com/pulp/pulp/blob/2-master/server/pulp/server/async/tasks.py#L757-L758

The significant impact, though, isn't that an error occurred; it's that pulp's on_failure() handler fails to handle it, because pulp calls into celery's broken internal state here:
https://github.com/pulp/pulp/blob/2-master/server/pulp/server/async/tasks.py#L761

So what I think we should explore is updating pulp's on_failure() routine to be more defensive, so that if celery is very broken underneath, we make sure pulp doesn't halt the other tasks. Generally speaking @pmoravec, what do you think about this as an approach?

Also FYI, I analyzed the pulp/celery/kombu/amqp/billiard stack from 6.5 onward; the last meaningful change was in pyamqp, and celery got some release patches in 6.6, but I don't see any changes in 6.7+, for example.
(In reply to Brian Bouterse from comment #3)
> Based on the stack trace, the original error could be logged if we did it here:
> https://github.com/pulp/pulp/blob/2-master/server/pulp/server/async/tasks.py#L757-L758
>
> The significant impact, though, isn't that an error occurred; it's that pulp's on_failure() handler fails to handle it, because pulp calls into celery's broken internal state here:
> https://github.com/pulp/pulp/blob/2-master/server/pulp/server/async/tasks.py#L761
>
> So what I think we should explore is updating pulp's on_failure() routine to be more defensive, so that if celery is very broken underneath, we make sure pulp doesn't halt the other tasks. Generally speaking @pmoravec, what do you think about this as an approach?
>
> Also FYI, I analyzed the pulp/celery/kombu/amqp/billiard stack from 6.5 onward; the last meaningful change was in pyamqp, and celery got some release patches in 6.6, but I don't see any changes in 6.7+, for example.

This sounds like a good approach; it is worth printing the traceback _and_ namely exc (the original exception) at tasks.py#L757-L758, instead of relying on celery. That workaround patch will let us see the original exception. Plus, make the on_failure routine robust against celery failures by catching the exceptions (a minimal sketch of both changes follows below).

I haven't found any relevant diff causing a 6.6->6.7 regression, but we started to see this since the 6.7 upgrades. Maybe some other subtle change (i.e. katello using a different pulp option triggering different pulp behaviour) is behind it, but I don't think that is worth investigating.
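A minimal sketch of the kind of defensive change discussed above, not the actual upstream patch: log the original exception and traceback first, then guard the call into celery's request machinery so a broken celery state cannot prevent the status update. The _persist_task_status() helper is a placeholder I made up for whatever pulp really uses to update the TaskStatus document.

# Rough sketch of a more defensive on_failure(); placeholder names, not
# pulp's real code or the upstream fix.
import logging

_logger = logging.getLogger(__name__)


def _persist_task_status(task_id, task_status):
    # Placeholder for pulp's real TaskStatus update in MongoDB.
    _logger.info('would persist status of %s: %s', task_id, task_status)


class DefensiveTaskMixin(object):

    def on_failure(self, exc, task_id, args, kwargs, einfo):
        # 1) Record the root cause immediately, so it is never lost even if
        #    the rest of this handler blows up (the situation in this BZ).
        _logger.error('Task %s failed: %r\n%s', task_id, exc,
                      getattr(einfo, 'traceback', 'no traceback available'))

        # 2) Only consult celery's request if its internal state is usable;
        #    self.request raises AttributeError when request_stack is None
        #    (the uncaught exception seen in the logs above).
        try:
            called_directly = self.request.called_directly
        except AttributeError:
            called_directly = False

        # 3) Persist the error state regardless, so the katello/foreman side
        #    sees the pulp task finish instead of waiting forever.
        if not called_directly:
            _persist_task_status(task_id, {
                'state': 'error',
                'traceback': getattr(einfo, 'traceback', None),
            })

In the real code, something like this would live on the Task class in server/pulp/server/async/tasks.py, so both the normal failure path and the _release_resource() path from the traceback go through the same guarded handler.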
OK, the scenario in one support case is "OOM killed a celery worker child process(*), and celery's response is the uncaught exception". That is, so far we have 3 scenarios of an abruptly terminated process (OOM kill, manual kill, no free disk space) that lead to the uncaught celery exception.

(*) why a celery worker consumes 5GB(!) like in this case..? That will be another concern I will investigate.
(In reply to Pavel Moravec from comment #6)
> OK, the scenario in one support case is "OOM killed a celery worker child process(*), and celery's response is the uncaught exception". That is, so far we have 3 scenarios of an abruptly terminated process (OOM kill, manual kill, no free disk space) that lead to the uncaught celery exception.
>
> (*) why a celery worker consumes 5GB(!) like in this case..? That will be another concern I will investigate.

You mean 5GB RAM? Could be this bug 1876782.
(In reply to Hao Chang Yu from comment #7)
> (In reply to Pavel Moravec from comment #6)
> > OK, the scenario in one support case is "OOM killed a celery worker child process(*), and celery's response is the uncaught exception". That is, so far we have 3 scenarios of an abruptly terminated process (OOM kill, manual kill, no free disk space) that lead to the uncaught celery exception.
> >
> > (*) why a celery worker consumes 5GB(!) like in this case..? That will be another concern I will investigate.
>
> You mean 5GB RAM? Could be this bug 1876782.

Yes, 5GB RAM in RSS (not VSZ). Too much..

Good point with the BZ! Though it is probably not applicable to this particular case, where I see these task details:

"progress_report" : {
    "1-obfuscated-data-4cd727c9-8780-4ff1-af6a-ca2e047df849" : [
        {
            "num_success" : 1,
            "description" : "Copying files",
            "step_type" : "save_tar",
            "items_total" : 1,
            "state" : "FINISHED",
            "error_details" : [ ],
            "details" : "",
            "num_failures" : 0,
            "step_id" : "b4c59cf1-0614-41b8-b9c1-8fc9dcfdacac",
            "num_processed" : 1
        },
..
        {
            "num_success" : 0,
            "description" : "Publishing RPMs",
            "step_type" : "rpms",
            "items_total" : 0,
            "state" : "IN_PROGRESS",
            "error_details" : [ ],
            "details" : "",
            "num_failures" : 0,
            "step_id" : "6995bdda-38a9-4d3f-84f4-a4ff76982bae",
            "num_processed" : 0
        },
        {
            "num_success" : 0,
            "description" : "Publishing Errata",
            "step_type" : "errata",
            "items_total" : 1,
            "state" : "NOT_STARTED",
            "error_details" : [ ],
            "details" : "",
            "num_failures" : 0,
            "step_id" : "3fa09604-8972-476b-82df-8fdf8d9249a6",
            "num_processed" : 0
        },

I.e. the failure should have happened during publishing RPMs, not (yet) errata. Or maybe the task details are not up to date..
The Pulp upstream bug status is at NEW. Updating the external tracker on this bug.
The Pulp upstream bug priority is at Normal. Updating the external tracker on this bug.
The Pulp upstream bug status is at ASSIGNED. Updating the external tracker on this bug.
The Pulp upstream bug status is at POST. Updating the external tracker on this bug.
The Pulp upstream bug status is at MODIFIED. Updating the external tracker on this bug.
All upstream Pulp bugs are at MODIFIED+. Moving this bug to POST.
The Pulp upstream bug status is at CLOSED - CURRENTRELEASE. Updating the external tracker on this bug.
*** Bug 1926627 has been marked as a duplicate of this bug. ***
Verified on 6.9.0 snap 17.
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (Moderate: Satellite 6.9 Release), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHSA-2021:1313
The needinfo request[s] on this closed bug have been removed as they have been unresolved for 500 days