Under heavy disk I/O load, upstream Pulp's database read/write times can become very large. This creates a symptom where workers that are still online are incorrectly reported as offline. This is tracked upstream as https://pulp.plan.io/issues/3135
Two things need to happen downstream:
1. The upstream fix from issue 3135 needs to be cherry-picked into downstream
2. The worker_timeout setting should be set to 300
With that value, workers will not be reported missing for roughly 5 minutes, which matches the timing behavior of 6.2.z and earlier.
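As a minimal sketch of that downstream change (the [tasks] section and 'worker_timeout' key come from the upstream patch description; surrounding settings are omitted), the edit to /etc/pulp/server.conf would look something like:

```
[tasks]
# Seconds a worker may go without a heartbeat before it is considered missing.
# The Pulp default is 30; 300 restores roughly the 6.2.z timing behavior.
worker_timeout: 300
```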
The Pulp upstream bug status is at POST. Updating the external tracker on this bug.
The Pulp upstream bug priority is at High. Updating the external tracker on this bug.
I looked at the charts, but I think all they show anecdotally is that there is a large I/O load when these workloads run. What they don't tell us is whether the reads and writes of the worker heartbeat records are running very slowly (tens of seconds to several minutes). If those read/write queries take that long, workers will disappear. Unfortunately this data doesn't confirm those query run times.
The patch being prepared for 6.3 will not backport easily to 6.2.z. Another patch could be made that specifically adds warnings about the heartbeat read and write times. That way, any 6.2.z environment that suspects this is the root cause of its disappearing workers could apply the patch and gather actual data on whether the missing workers are due to slow db reads and writes.
We need downstream to drive this. All upstream Pulp can do is make a patch like ^ if requested. Please talk to @ttereshc or @dkliban (Pulp Satellite leads) or Rchan to coordinate such a patch request.
I just merged a patch to fix this issue upstream:
This patch adds a config variable named 'worker_timeout' to the tasks section of /etc/pulp/server.conf; it sets the maximum time a worker may run without checking in before it's killed. It also adds warnings that are raised before that point to indicate that heartbeats are taking too long.
The one thing I think Katello/Satellite should do is raise the worker_timeout setting. Since installations typically run multiple apps/databases/processes, they'll probably need a higher timeout than Pulp alone. The default is 30; I'd recommend at least 60. If you plan to support MongoDB running on spinning disks (probably not a good idea), I'd go with 300.
Let me know if you have any questions.
The Pulp upstream bug status is at MODIFIED. Updating the external tracker on this bug.
All upstream Pulp bugs are at MODIFIED+. Moving this bug to POST.
Moving this back to ASSIGNED, as completing it will require changes to puppet-pulp, puppet-katello, and puppet-foreman_proxy_content.
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA.
For information on the advisory, and where to find the updated files, follow the link below.
If the solution does not work for you, open a new bug report.
The Pulp upstream bug status is at ON_QA. Updating the external tracker on this bug.
Can we have reproduction steps for this bug to test with 6.3.1?
This issue is largely environmental, which makes it hard to reproduce. It typically presents itself in environments with slow disks, or with a large mongodb under heavy workload. One option would be to:
1) Load as much content as you can into the system
2) Lower the worker timeout to something more like 10 or 20
3) Kick off multiple simultaneous content view publishes
4) Monitor /var/log/messages for the error
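Steps 2 and 4 can be sketched as below. This is only an illustration: it edits a local sample file rather than the live config, and on a real system you would edit /etc/pulp/server.conf itself and restart the Pulp services afterward (service names vary by release).

```shell
# Create a local sample mimicking the relevant part of /etc/pulp/server.conf
cat > server.conf.sample <<'EOF'
[tasks]
# worker_timeout: 30
EOF

# Step 2: uncomment the setting and lower it to 10 seconds
sed -i 's/^# worker_timeout: .*/worker_timeout: 10/' server.conf.sample
grep '^worker_timeout' server.conf.sample

# Step 4: while publishes run, watch for the heartbeat warning, e.g.:
#   tail -f /var/log/messages | grep "Pulp will not operate correctly"
```

The grep at the end confirms the edit took effect before restarting services.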
Verified in Satellite 6.3.1 Snap 1.
I followed the steps outlined in comment #18.
I synced 45 repositories from the RH CDN, totaling over 110k packages.
I then lowered the worker timeout in /etc/pulp/server.conf to 10.
-bash-4.2# grep timeout /etc/pulp/server.conf
# worker_timeout: The amount of time (in seconds) before considering a worker as missing. If Pulp's
Following that, I spread those repositories across 10 content views and published them all at the same time (see attached). Simultaneously, I monitored /var/log/messages for the error message.
As seen below, no errors were produced during the publishes.
-bash-4.2# tail -f /var/log/messages | grep "Pulp will not operate correctly"
Created attachment 1416492 [details]
The Pulp upstream bug status is at CLOSED - CURRENTRELEASE. Updating the external tracker on this bug.