Bug 1532348 - Workers go missing under heavy load
Status: CLOSED ERRATA
Product: Red Hat Satellite 6
Classification: Red Hat
Component: Pulp
Version: 6.3.0
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: high
Target Release: 6.3.1
Target Milestone: Unused
Assigned To: Chris Roberts
QA Contact: jcallaha
URL: https://projects.theforeman.org/issue...
Keywords: Reopened, Triaged
Reported: 2018-01-08 12:34 EST by Brian Bouterse
Modified: 2018-05-25 11:05 EDT
CC List: 18 users
Fixed In Version: katello-installer-base-3.4.5.22-1, pulp-2.13.4.8-1
Doc Type: If docs needed, set a value
Last Closed: 2018-04-13 09:29:48 EDT
Type: Bug


Attachments
multiple publishes (99.17 KB, image/png), attached 2018-04-02 16:59 EDT by jcallaha


External Trackers
Foreman Issue Tracker 22338 (last updated 2018-01-25 12:39 EST)
Pulp Redmine 3135, Priority High, CLOSED - CURRENTRELEASE: As a user, I have a setting to mitigate when workers go missing under heavy loads (last updated 2018-04-03 17:36 EDT)
Red Hat Product Errata RHBA-2018:1126 (last updated 2018-04-13 09:31 EDT)

Description Brian Bouterse 2018-01-08 12:34:20 EST
Upstream Pulp's timing checks have a problem under load: when disk I/O load is high, database read/write times become large. This creates a symptom where workers that are still online are incorrectly shown as offline. This is upstream issue: https://pulp.plan.io/issues/3135

The following things need to happen downstream:

1. The upstream fix from 3135 needs to be cherry-picked into downstream.
2. The worker_timeout setting must be set to 300.

This will cause missing workers not to be noticed for roughly 5 minutes, which matches the timing behavior of 6.2.z.
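
In practice, per comment 7 below, that is a single entry in the tasks section of /etc/pulp/server.conf; a minimal sketch (only the worker_timeout line comes from this thread, the comment line is illustrative):

[tasks]
# Seconds a worker may go without a recorded heartbeat before it is considered missing.
worker_timeout: 300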
Comment 1 pulp-infra@redhat.com 2018-01-08 13:33:44 EST
The Pulp upstream bug status is at POST. Updating the external tracker on this bug.
Comment 2 pulp-infra@redhat.com 2018-01-08 13:33:47 EST
The Pulp upstream bug priority is at High. Updating the external tracker on this bug.
Comment 6 Brian Bouterse 2018-01-11 10:48:57 EST
I looked at the charts, but I think all they show, anecdotally, is that there is a large I/O load when their workloads run. What they don't tell us is whether the reads and writes of the worker heartbeat records are running extremely slow (tens of seconds to several minutes). If those read/write queries take a long time, then workers will disappear. Unfortunately, this data doesn't confirm those query run times.

The patch being prepared for 6.3 will not backport easily to 6.2.z. Another patch could be made that specifically adds warnings on the heartbeat read and write times. That way, any 6.2.z environment that suspects this is the root cause of its disappearing workers can apply the patch and have actual data on whether the missing workers are due to slow database reads and writes.

We need downstream to drive this. All upstream Pulp can do is make a patch like the one described above, if requested. Please talk to @ttereshc or @dkliban (Pulp satellite leads) or Rchan to coordinate such a patch request.
Comment 7 David Davis 2018-01-16 11:59:36 EST
I just merged a patch to fix this issue upstream:

https://github.com/pulp/pulp/pull/3245

This patch adds a config variable called 'worker_timeout' in the tasks section of /etc/pulp/server.conf that sets the maximum time a worker can go without checking in before it's considered missing. It also adds some warnings that will be raised before that point to indicate that heartbeats are taking too long.

The one thing I think Katello/Satellite should do is raise the worker_timeout setting. Since installations typically run multiple apps/dbs/processes, they'll probably need a higher timeout than Pulp alone. The default is 30; I'd recommend at least 60. If you plan to support MongoDB running on spinning disks (probably not a good idea), I'd go with 300.

Let me know if you have any questions.
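
(A changed worker_timeout only takes effect after the Pulp services are restarted; a minimal sketch, assuming the standard Pulp 2 systemd unit names, which are not confirmed anywhere in this thread:

systemctl restart pulp_workers pulp_resource_manager pulp_celerybeat

On a Satellite host, katello-service restart should cover the same set of services.)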
Comment 8 pulp-infra@redhat.com 2018-01-16 12:01:27 EST
The Pulp upstream bug status is at MODIFIED. Updating the external tracker on this bug.
Comment 9 pulp-infra@redhat.com 2018-01-16 12:31:19 EST
All upstream Pulp bugs are at MODIFIED+. Moving this bug to POST.
Comment 10 Eric Helms 2018-01-16 14:30:06 EST
Moving this back to assigned as it will require a change to puppet-pulp, puppet-katello and puppet-foreman_proxy_content to complete it.
Comment 11 pulp-infra@redhat.com 2018-01-16 14:31:16 EST
All upstream Pulp bugs are at MODIFIED+. Moving this bug to POST.
Comment 13 pm-sat@redhat.com 2018-02-21 11:54:17 EST
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA.

For information on the advisory, and where to find the updated files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2018:0336
Comment 16 pulp-infra@redhat.com 2018-03-20 11:31:51 EDT
The Pulp upstream bug status is at ON_QA. Updating the external tracker on this bug.
Comment 17 Jitendra Yejare 2018-03-27 07:48:02 EDT
Can we have reproduction steps for this bug to test with 6.3.1?
Comment 18 Eric Helms 2018-04-02 12:19:46 EDT
This issue is largely environmental, which makes it hard to reproduce. It typically presents itself in environments with slow disks, or with a large MongoDB under heavy workload. One option (sketched below) would be to:

 1) Load as much content as you can into the system
 2) Lower the worker timeout to something more like 10 or 20
 3) Kick off multiple simultaneous content view publishes
 4) Monitor /var/log/messages for the error
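
A rough shell sketch of steps 2-4, assuming the [tasks] stanza from comment 7 and the warning string grepped in comment 19 below; the hammer invocation and content view names are illustrative:

# 2) in /etc/pulp/server.conf, under [tasks], set:
#      worker_timeout: 10
#    then restart the Pulp services

# 3) kick off several content view publishes at once (names are hypothetical)
for cv in CV1 CV2 CV3; do
  hammer content-view publish --name "$cv" --organization "Default Organization" &
done

# 4) in a second terminal, watch for the worker-missing warning while the publishes run
tail -f /var/log/messages | grep "Pulp will not operate correctly"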
Comment 19 jcallaha 2018-04-02 16:59:04 EDT
Verified in Satellite 6.3.1 Snap 1.

I followed the steps outlined in comment #18.

I synced 45 repositories from the RH CDN, totaling over 110k packages.


I then lowered the worker timeout in /etc/pulp/server.conf to 10.

-bash-4.2# grep timeout /etc/pulp/server.conf
# worker_timeout: The amount of time (in seconds) before considering a worker as missing. If Pulp's
worker_timeout: 10


Following that, I spread those repositories across 10 content views and published them all at the same time (see attached). Simultaneously, I monitored /var/log/messages for the error message.
As seen below, no errors were produced during the publishes.

-bash-4.2# tail -f /var/log/messages | grep "Pulp will not operate correctly"
^C
-bash-4.2#
Comment 20 jcallaha 2018-04-02 16:59 EDT
Created attachment 1416492: multiple publishes
Comment 21 pulp-infra@redhat.com 2018-04-03 17:36:17 EDT
The Pulp upstream bug status is at CLOSED - CURRENTRELEASE. Updating the external tracker on this bug.
Comment 23 errata-xmlrpc 2018-04-13 09:29:48 EDT
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2018:1126
