
Bug 1692885

Summary: Pulp scheduler cancels tasks even with a high worker_timeout specified - Worker has gone missing, removing from list of workers
Product: Red Hat Satellite
Component: Pulp
Version: 6.5.0
Status: CLOSED WONTFIX
Severity: high
Priority: high
Reporter: Mike McCune <mmccune>
Assignee: satellite6-bugs <satellite6-bugs>
QA Contact: Kersom <koliveir>
CC: alsouza, bmbouter, hmore, jbhatia, jdickers, jjansky, ktordeur, ltran, mcasabur, pdwyer, saydas, ttereshc
Keywords: Patch, PrioBumpField, Triaged
Target Milestone: Unspecified
Target Release: Unused
Hardware: Unspecified
OS: Unspecified
Doc Type: If docs needed, set a value
Type: Bug
Last Closed: 2020-05-01 12:47:53 UTC
Attachments:
scheduler.py patch (flags: none)

Description Mike McCune 2019-03-26 16:11:03 UTC
We are seeing environments where Pulp removes workers from the system and cancels the tasks they are executing:

pulp: pulp.server.async.scheduler:ERROR: Worker 'reserved_resource_worker-0.com' has gone missing, removing from list of workers
pulp: pulp.server.async.tasks:ERROR: The worker named reserved_resource_worker-0.com is missing. Canceling the tasks in its queue.
pulp.server.async.tasks:INFO: Task canceled: 4658bdd0-8f23-482a-a965-d2dbd313cf12

As a workaround, we have tried specifying high values for the worker_timeout setting in /etc/pulp/server.conf.

We have tried values of 60, 300, 600, 900+ yet still see workers being removed.
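
When trying different values, it can help to confirm which worker_timeout the configuration file actually yields. The following is a minimal sketch, not part of Pulp: the function name read_worker_timeout is hypothetical, and it assumes the setting lives in a [tasks] section and that the default is 30 seconds when the option is absent or commented out. Check both assumptions against the comments in your own server.conf.

```python
import configparser

# Assumption: Pulp 2 falls back to 30 seconds when worker_timeout is not set.
DEFAULT_WORKER_TIMEOUT = 30


def read_worker_timeout(conf_text, section="tasks"):
    """Return worker_timeout (seconds) from server.conf-style text.

    The section name "tasks" is an assumption; adjust it if your file differs.
    """
    parser = configparser.ConfigParser(interpolation=None)
    parser.read_string(conf_text)
    # A commented-out or missing option yields the assumed default.
    return parser.getint(section, "worker_timeout",
                         fallback=DEFAULT_WORKER_TIMEOUT)


sample = """
[tasks]
# worker_timeout: 30
worker_timeout: 900
"""
print(read_worker_timeout(sample))  # 900
```

To run it against a real system, pass the contents of /etc/pulp/server.conf as conf_text. A result equal to the default would suggest the edited line is still commented out or sits in a different section.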

As a result, Content View publishes/promotes, repository syncs and other operations fail, and the only option is to restart the process. This can be business impacting if there is automation in place that expects these operations to succeed. The operations continue to fail even after repeated attempts, which currently leaves the user with no option short of code modification.

Will attach a patch to this BZ that we have applied to /usr/lib/python2.7/site-packages/pulp/server/async/scheduler.py. It comments out the delete calls:

  if worker.last_heartbeat < oldest_heartbeat_time:
      msg = _("Worker '%s' has gone missing, removing from list of workers") % worker.name
      _logger.error(msg)

      if worker.name.startswith(constants.SCHEDULER_WORKER_NAME):
          # worker.delete()
          _logger.error("Not deleting SCHEDULER worker - continuing on")
      else:
          # _delete_worker(worker.name)
          _logger.error("Not deleting worker - continuing on")

Will also try increasingly large values of worker_timeout (24h+) to see if there really is a bug in the time calculations in this code.
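
To make the time calculation under suspicion concrete, here is a simplified sketch of the comparison in the patched condition. It is not Pulp's code: the function missing_workers is hypothetical, and it assumes oldest_heartbeat_time is simply the current time minus worker_timeout, which this report does not show.

```python
from datetime import datetime, timedelta


def missing_workers(heartbeats, now, worker_timeout):
    """Return names whose last heartbeat is older than now - worker_timeout.

    heartbeats maps worker name -> datetime of its last recorded heartbeat.
    Assumption: the cutoff is computed as now minus worker_timeout seconds.
    """
    oldest_heartbeat_time = now - timedelta(seconds=worker_timeout)
    return sorted(name for name, last in heartbeats.items()
                  if last < oldest_heartbeat_time)


now = datetime(2019, 3, 26, 16, 0, 0)
heartbeats = {
    "reserved_resource_worker-0": now - timedelta(seconds=1200),
    "reserved_resource_worker-1": now - timedelta(seconds=20),
}
print(missing_workers(heartbeats, now, 900))    # ['reserved_resource_worker-0']
print(missing_workers(heartbeats, now, 86400))  # []
```

Under that assumption, a worker removed with a 24h+ timeout would have to have gone a full day without a recorded heartbeat, which would point either at the calculation or at heartbeats not being written at all.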

Comment 4 Mike McCune 2019-03-26 16:16:53 UTC
Created attachment 1548133
scheduler.py patch

*** WORKAROUND ***

Patch that can be used as a workaround to disable worker deletion. To apply it:

1) make backup copy of scheduler.py

cp /usr/lib/python2.7/site-packages/pulp/server/async/scheduler.py /usr/lib/python2.7/site-packages/pulp/server/async/scheduler.bak-1692885

2) copy the attached scheduler.py from the case to your Satellite at this location:

/usr/lib/python2.7/site-packages/pulp/server/async/scheduler.py 

3) restart services

foreman-maintain service restart

4) resume operations

Comment 5 Brian Bouterse 2019-03-26 16:33:00 UTC
With so many timeout values tried, and with clock drift not a possible cause due to workers and celerybeat running on the same machine, I deduce that the workers are not able to write their timestamps to the db as expected. Are we sure that workers are writing their heartbeats to the db?

The workers do that as part of their HeartbeatStep here https://github.com/pulp/pulp/blob/e5a22e13ae46fe86dccedc5bf214537c2b90ad0d/server/pulp/server/async/app.py#L119
That calls the handle_worker_heartbeat() method here:  https://github.com/pulp/pulp/blob/2-master/server/pulp/server/async/worker_watcher.py#L28-L57
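
One way to approach that question is to sample a worker's recorded heartbeat timestamp repeatedly and look at the largest gap between consecutive values. The sketch below only covers the gap calculation; the function largest_heartbeat_gap is hypothetical, the sample timestamps are made up for illustration, and collecting the samples from the database is left out.

```python
from datetime import datetime, timedelta


def largest_heartbeat_gap(timestamps):
    """Return the largest interval between consecutive heartbeat timestamps."""
    ordered = sorted(timestamps)
    if len(ordered) < 2:
        # Not enough samples to measure a gap.
        return timedelta(0)
    return max(later - earlier
               for earlier, later in zip(ordered, ordered[1:]))


start = datetime(2019, 3, 27, 19, 0, 0)
# Illustrative samples only: regular 5 s heartbeats with one long stall.
samples = [start + timedelta(seconds=s) for s in (0, 5, 10, 400, 405)]
print(largest_heartbeat_gap(samples).total_seconds())  # 390.0
```

A largest gap that exceeds the configured worker_timeout would be consistent with heartbeat writes stalling rather than with a fault in the timeout check itself.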

Comment 6 Mike McCune 2019-03-27 19:20:23 UTC
We are going to try increasingly higher values of worker_timeout (e.g. 3600) to rule out any issues with the algorithm for checking heartbeats.

This condition is likely a result of over-taxed IO and slow storage, but we will continue to diagnose.

Comment 7 Mike McCune 2019-03-27 19:48:29 UTC
The timeout can be configured via the installer; you must use both parameters:

 satellite-installer --foreman-proxy-content-pulp-worker-timeout 3600 --katello-pulp-worker-timeout 3600

Comment 8 Mike McCune 2019-04-17 21:10:49 UTC
*** WORKAROUND 2 RECOMMENDED ***

Increase the timeout to a large value, as mentioned in comment 7:

# satellite-installer --foreman-proxy-content-pulp-worker-timeout 3600 --katello-pulp-worker-timeout 3600

This requires no code modification.

Comment 9 Tanya Tereshchenko 2020-05-01 12:47:53 UTC
Pulp 2 is in maintenance mode and currently accepts only critical/security issues. The main focus is on Pulp 3 and some of the requests will be satisfied in the newer version.
We have evaluated this request, and while we recognize that it is a valid request, we do not expect this to be implemented in Pulp 2. As this issue is not relevant for Pulp 3, we are therefore closing this out as WONTFIX.