Red Hat Bugzilla – 1692885 – Pulp scheduler cancels tasks even with a high worker_timeout specified - Worker has gone missing, removing from list of workers
We are seeing environments where Pulp is removing workers from the system and canceling tasks that are executing:
pulp: pulp.server.async.scheduler:ERROR: Worker 'reserved_resource_worker-0.com' has gone missing, removing from list of workers
pulp: pulp.server.async.tasks:ERROR: The worker named reserved_resource_worker-0.com is missing. Canceling the tasks in its queue.
pulp.server.async.tasks:INFO: Task canceled: 4658bdd0-8f23-482a-a965-d2dbd313cf12
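To gauge how often this happens, the scheduler message quoted above can be counted per worker. The following is only a rough sketch: the helper name count_missing_workers is hypothetical, and the pattern assumes the message text is exactly as it appears in the log lines in this report.

```python
import re
from collections import Counter

# Matches the scheduler error quoted in this report and captures the worker name.
MISSING_RE = re.compile(r"Worker '([^']+)' has gone missing")


def count_missing_workers(lines):
    """Return a Counter mapping worker name -> number of 'gone missing' events."""
    counts = Counter()
    for line in lines:
        match = MISSING_RE.search(line)
        if match:
            counts[match.group(1)] += 1
    return counts


# Sample input taken from the log excerpt in this report.
sample = [
    "pulp: pulp.server.async.scheduler:ERROR: Worker 'reserved_resource_worker-0.com' "
    "has gone missing, removing from list of workers",
    "pulp.server.async.tasks:INFO: Task canceled: 4658bdd0-8f23-482a-a965-d2dbd313cf12",
]

if __name__ == "__main__":
    for worker, hits in count_missing_workers(sample).items():
        print(worker, hits)
```

In practice the lines would come from whichever file or journal the Pulp messages are written to on the affected system.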
As a workaround, we attempted to raise the worker_timeout setting in /etc/pulp/server.conf.
We have tried values of 60, 300, 600, 900+ yet still see workers being removed.
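Before concluding that a larger value has no effect, it may be worth confirming which value is actually present in the file. The sketch below is written for Python 3 and rests on two assumptions that are not stated in this report: that worker_timeout lives in a [tasks] section of server.conf, and that the default is 30 seconds when the option is absent. The helper name read_worker_timeout is hypothetical.

```python
import configparser


def read_worker_timeout(path="/etc/pulp/server.conf", default=30):
    """Return worker_timeout (seconds) from an INI-style Pulp config file.

    Assumptions: the option is in a [tasks] section, and `default` stands in
    for whatever Pulp uses when the option is not set.
    """
    # interpolation=None keeps literal '%' characters from raising errors;
    # strict=False tolerates duplicated options in hand-edited files.
    parser = configparser.ConfigParser(interpolation=None, strict=False)
    parser.read(path)  # a missing or unreadable file is silently skipped
    return parser.getint("tasks", "worker_timeout", fallback=default)


if __name__ == "__main__":
    print("worker_timeout:", read_worker_timeout())
```

This only shows what is written in the file; the services still have to be restarted for a changed value to take effect.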
As a result, Content View publishes/promotes, repository syncs and other operations fail, and the only option is to restart the process. This can be business impacting if there is automation in place that expects these operations to succeed. These operations continue to fail even after retries, currently leaving the user no option other than code modification.
Will attach to this BZ a patch that we have applied to /usr/lib/python2.7/site-packages/pulp/server/async/scheduler.py; it comments out the delete calls:
    if worker.last_heartbeat < oldest_heartbeat_time:
        msg = _("Worker '%s' has gone missing, removing from list of workers") % worker.name
        _logger.error(msg)
        if worker.name.startswith(constants.SCHEDULER_WORKER_NAME):
            # worker.delete()
            _logger.error("Not deleting SCHEDULER worker - continuing on")
        else:
            # _delete_worker(worker.name)
            _logger.error("Not deleting worker - continuing on")
Will also try increasingly large values of worker_timeout (24h+) to see if there really is a bug in the time calculations in this code.
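For reference, the comparison in the patched block can be reproduced in isolation to see how the timeout is expected to affect it. This is a minimal sketch, not the Pulp implementation: it assumes oldest_heartbeat_time is computed as the current time minus worker_timeout seconds, which the excerpt above does not show. The function name find_missing_workers, the worker names and the timestamp are all made up for illustration.

```python
from datetime import datetime, timedelta


def find_missing_workers(heartbeats, worker_timeout, now):
    """Return the sorted names of workers whose last heartbeat is too old.

    heartbeats: dict mapping worker name -> datetime of its last heartbeat.
    worker_timeout: allowed silence in seconds.
    now: the reference time for the check (passed in so the check is repeatable).
    """
    # Assumption: the cutoff is simply "now minus the timeout".
    oldest_heartbeat_time = now - timedelta(seconds=worker_timeout)
    return sorted(
        name for name, last in heartbeats.items() if last < oldest_heartbeat_time
    )


# Arbitrary reference time and hypothetical workers.
now = datetime(2019, 3, 26, 12, 0, 0)
heartbeats = {
    "worker-a": now - timedelta(seconds=45),  # silent for 45 seconds
    "worker-b": now - timedelta(seconds=5),   # silent for 5 seconds
}

if __name__ == "__main__":
    print(find_missing_workers(heartbeats, 30, now))
    print(find_missing_workers(heartbeats, 3600, now))
```

Under this assumption, a worker silent for 45 seconds is flagged with a 30 second timeout but not with 3600. If workers are still removed with a very large timeout, either the configured value is not reaching this check or the heartbeats themselves are not being recorded.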
Created attachment 1548133
scheduler.py patch
*** WORKAROUND ***
Patch that can be used as a workaround to disable worker deletion. To apply it:
1) make backup copy of scheduler.py
cp /usr/lib/python2.7/site-packages/pulp/server/async/scheduler.py /usr/lib/python2.7/site-packages/pulp/server/async/scheduler.bak-1692885
2) copy attached scheduler.py from case to your Satellite in location:
/usr/lib/python2.7/site-packages/pulp/server/async/scheduler.py
3) restart
foreman-maintain service restart
4) resume operations
We are going to try increasingly higher values of worker_timeout (e.g. 3600) to rule out any issues with the algorithm for checking heartbeats.
This condition is likely a result of overtaxed I/O and slow storage, but we will continue to diagnose.
The timeout can be configured via the installer; you must use both parameters:
satellite-installer --foreman-proxy-content-pulp-worker-timeout 3600 --katello-pulp-worker-timeout 3600
*** WORKAROUND 2 RECOMMENDED ***
Increase the timeout to a large value, as mentioned in comment 7:
# satellite-installer --foreman-proxy-content-pulp-worker-timeout 3600 --katello-pulp-worker-timeout 3600
This requires no code modification.
Comment 9 Tanya Tereshchenko
2020-05-01 12:47:53 UTC
Pulp 2 is in maintenance mode and currently accepts only critical/security issues. The main focus is on Pulp 3 and some of the requests will be satisfied in the newer version.
We have evaluated this request, and while we recognize that it is a valid request, we do not expect this to be implemented in Pulp 2. As this issue is not relevant for Pulp 3, we are therefore closing this out as WONTFIX.