Bug 1532348

Summary: Workers go missing under heavy load
Product: Red Hat Satellite 6
Reporter: Brian Bouterse <bmbouter>
Component: Pulp
Assignee: Chris Roberts <chrobert>
Status: CLOSED ERRATA
QA Contact: jcallaha
Severity: high
Docs Contact:
Priority: unspecified
Version: 6.3.0
CC: andrew.schofield, bbuckingham, bkearney, chrobert, daviddavis, dkliban, egolov, ehelms, jyejare, kabbott, ktordeur, mhrivnak, mmccune, mtenheuv, mvanderw, pcreech, rchan, ttereshc
Target Milestone: Unspecified
Keywords: Reopened, Triaged
Target Release: Unused
Hardware: Unspecified
OS: Unspecified
URL: https://projects.theforeman.org/issues/22338
Whiteboard:
Fixed In Version: katello-installer-base-3.4.5.22-1, pulp-2.13.4.8-1
Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of:
Environment:
Last Closed: 2018-04-13 13:29:48 UTC
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Attachments: multiple publishes (flags: none)

Description Brian Bouterse 2018-01-08 17:34:20 UTC
Upstream Pulp's timing checks have a problem under load: when disk I/O load is high, database read/write times become large. This creates a symptom where workers that are still online are incorrectly shown as offline. This is upstream issue: https://pulp.plan.io/issues/3135

The following things need to happen downstream:

1. The upstream fix from 3135 needs to be cherry-picked into downstream.
2. The worker_timeout setting must be set to 300.

This means workers will not be noticed as missing until roughly 5 minutes have passed, which matches the timing behavior of 6.2.z.
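
The symptom described above boils down to a staleness check on worker heartbeat records: if slow database I/O delays a heartbeat write past the timeout, an online worker looks offline. A minimal sketch of that logic (hypothetical names, not Pulp's actual code):

```python
from datetime import datetime, timedelta

# Hypothetical sketch of the missing-worker check discussed in this bug.
# 300 is the value recommended downstream; the upstream default is 30.
WORKER_TIMEOUT = 300  # seconds

def is_missing(last_heartbeat: datetime, now: datetime,
               timeout: int = WORKER_TIMEOUT) -> bool:
    """A worker is treated as offline once its last recorded heartbeat
    is older than the configured timeout."""
    return now - last_heartbeat > timedelta(seconds=timeout)

now = datetime(2018, 1, 8, 12, 0, 0)
print(is_missing(now - timedelta(seconds=30), now))   # recent heartbeat: healthy
print(is_missing(now - timedelta(seconds=600), now))  # delayed write: looks offline
```

With a 30-second default, a heartbeat write delayed by slow disk I/O for a minute is enough to mark the worker missing; raising the timeout to 300 tolerates those delays.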

Comment 1 pulp-infra@redhat.com 2018-01-08 18:33:44 UTC
The Pulp upstream bug status is at POST. Updating the external tracker on this bug.

Comment 2 pulp-infra@redhat.com 2018-01-08 18:33:47 UTC
The Pulp upstream bug priority is at High. Updating the external tracker on this bug.

Comment 6 Brian Bouterse 2018-01-11 15:48:57 UTC
I looked at the charts, but I think all they show anecdotally is that there is a large I/O load when their workloads run. What they don't tell us is whether the reads and writes of the worker heartbeat records are running extremely slowly (tens of seconds to several minutes). If those read/write queries take a long time, then workers will disappear. Unfortunately this data doesn't confirm those query run times.

The patch being prepared for 6.3 will not backport easily to 6.2.z. Another patch could be made that specifically adds warnings on the heartbeat read and write times. That way, any 6.2.z environment that suspects this is the root cause of its disappearing workers could apply the patch and have actual data on whether the missing workers are due to slow db reads and writes.

We need downstream to drive this. All upstream Pulp can do is make a patch like the above if requested. Please talk to @ttereshc or @dkliban (Pulp satellite leads) or Rchan to coordinate such a patch request.

Comment 7 David Davis 2018-01-16 16:59:36 UTC
I just merged a patch to fix this issue upstream:

https://github.com/pulp/pulp/pull/3245

This patch adds a config variable called 'worker_timeout' in the tasks section of /etc/pulp/server.conf that sets the maximum time a worker can run without checking in before it's killed. It also adds warnings that get raised before that point to indicate that heartbeats are taking too long.

The one thing I think Katello/Satellite should do is raise the worker_timeout setting. Since installations typically run multiple apps/dbs/processes, they'll probably need a higher timeout than Pulp alone. The default is 30; I'd recommend at least 60. If you plan to support MongoDB running on spinning disks (probably not a good idea), then I'd go with 300.
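
Taken together with comment 0, the downstream change amounts to a single setting in /etc/pulp/server.conf, roughly like this (a sketch; surrounding keys in the tasks section are omitted):

```ini
[tasks]
# Raised from the upstream default of 30 so that slow database
# reads/writes on loaded Satellite installs don't mark live
# workers as missing (value recommended in this bug).
worker_timeout: 300
```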

Let me know if you have any questions.

Comment 8 pulp-infra@redhat.com 2018-01-16 17:01:27 UTC
The Pulp upstream bug status is at MODIFIED. Updating the external tracker on this bug.

Comment 9 pulp-infra@redhat.com 2018-01-16 17:31:19 UTC
All upstream Pulp bugs are at MODIFIED+. Moving this bug to POST.

Comment 10 Eric Helms 2018-01-16 19:30:06 UTC
Moving this back to assigned as it will require a change to puppet-pulp, puppet-katello and puppet-foreman_proxy_content to complete it.

Comment 11 pulp-infra@redhat.com 2018-01-16 19:31:16 UTC
All upstream Pulp bugs are at MODIFIED+. Moving this bug to POST.

Comment 13 pm-sat@redhat.com 2018-02-21 16:54:17 UTC
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA.

For information on the advisory, and where to find the updated files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2018:0336

Comment 16 pulp-infra@redhat.com 2018-03-20 15:31:51 UTC
The Pulp upstream bug status is at ON_QA. Updating the external tracker on this bug.

Comment 17 Jitendra Yejare 2018-03-27 11:48:02 UTC
Can we have repro steps for this bug to test with 6.3.1?

Comment 18 Eric Helms 2018-04-02 16:19:46 UTC
This issue is largely environmental, which makes it hard to reproduce. It typically presents itself in environments with slow disk speeds or a large mongodb under heavy workload. One option would be to:

 1) Load as much content as you can into the system
 2) Lower the worker timeout to something like 10 or 20
 3) Kickoff multiple simultaneous content view publishes
 4) Monitor /var/log/messages for the error
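
Step 4 can be sketched as a simple grep check. This is a hypothetical snippet: only the warning substring "Pulp will not operate correctly" is taken from the log output quoted in the verification below, and a temporary file stands in for /var/log/messages so the snippet is self-contained; on a real Satellite host point LOG at /var/log/messages instead.

```shell
# Hypothetical sketch of step 4: scan the system log for the worker-timeout
# warning. The temp file simulates /var/log/messages for illustration.
LOG=$(mktemp)
echo "hypothetical entry: Pulp will not operate correctly" >> "$LOG"
if grep -q "Pulp will not operate correctly" "$LOG"; then
    echo "worker timeout warnings present"
else
    echo "no worker timeout warnings"
fi
rm -f "$LOG"
```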

Comment 19 jcallaha 2018-04-02 20:59:04 UTC
Verified in Satellite 6.3.1 Snap 1.

I followed the steps outlined in comment #18.

I synced 45 repositories from the RH CDN, totaling over 110k packages.

I then lowered the worker timeout in /etc/pulp/server.conf to 10.

-bash-4.2# grep timeout /etc/pulp/server.conf
# worker_timeout: The amount of time (in seconds) before considering a worker as missing. If Pulp's
worker_timeout: 10

Following that, I spread those repositories across 10 content views and published them all at the same time (see attached). Simultaneously, I monitored /var/log/messages for the error message.
As seen below, no errors were produced during the publishes.

-bash-4.2# tail -f /var/log/messages | grep "Pulp will not operate correctly"
^C
-bash-4.2#

Comment 20 jcallaha 2018-04-02 20:59:25 UTC
Created attachment 1416492 [details]
multiple publishes

Comment 21 pulp-infra@redhat.com 2018-04-03 21:36:17 UTC
The Pulp upstream bug status is at CLOSED - CURRENTRELEASE. Updating the external tracker on this bug.

Comment 23 errata-xmlrpc 2018-04-13 13:29:48 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2018:1126