Under heavy disk I/O load, upstream Pulp's database read/write times can become very large. This creates a symptom where workers that are still online are incorrectly reported as offline. This is tracked upstream as https://pulp.plan.io/issues/3135
Two things need to happen downstream:
1. The upstream fix from issue 3135 needs to be cherry-picked into downstream
2. The worker_timeout setting should be set to 300
With that value, workers will not be reported missing for roughly 5 minutes, which matches the timing behavior of 6.2.z and earlier.
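As a minimal sketch of that downstream change (the [tasks] section and 'worker_timeout' key come from the upstream patch description; surrounding settings are omitted), the edit to /etc/pulp/server.conf would look something like:

```
[tasks]
# Seconds a worker may go without a heartbeat before it is considered missing.
# The Pulp default is 30; 300 restores roughly the 6.2.z timing behavior.
worker_timeout: 300
```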
The Pulp upstream bug status is at POST. Updating the external tracker on this bug.
The Pulp upstream bug priority is at High. Updating the external tracker on this bug.
I looked at the charts, but I think all they show anecdotally is that there is a large I/O load when these workloads run. What they don't tell us is whether the reads and writes of the worker heartbeat records are running very slowly (tens of seconds to several minutes). If those read/write queries take that long, workers will disappear. Unfortunately this data doesn't confirm those query run times.
The patch being prepared for 6.3 will not backport easily to 6.2.z. Another patch could be made that specifically adds warnings about the heartbeat read and write times. That way, any 6.2.z environment that suspects this is the root cause of its disappearing workers could apply the patch and gather actual data on whether the missing workers are due to slow db reads and writes.
We need downstream to drive this. All upstream Pulp can do is make a patch like ^ if requested. Please talk to @ttereshc or @dkliban (Pulp Satellite leads) or Rchan to coordinate such a patch request.
I just merged a patch to fix this issue upstream:
This patch adds a config variable named 'worker_timeout' to the tasks section of /etc/pulp/server.conf; it sets the maximum time a worker may run without checking in before it's killed. It also adds warnings that are raised before that point to indicate that heartbeats are taking too long.
The one thing I think Katello/Satellite should do is raise the worker_timeout setting. Since installations typically run multiple apps/databases/processes, they'll probably need a higher timeout than Pulp alone. The default is 30; I'd recommend at least 60. If you plan to support MongoDB running on spinning disks (probably not a good idea), I'd go with 300.
Let me know if you have any questions.
The Pulp upstream bug status is at MODIFIED. Updating the external tracker on this bug.
All upstream Pulp bugs are at MODIFIED+. Moving this bug to POST.
Moving this back to ASSIGNED, as completing it will require changes to puppet-pulp, puppet-katello, and puppet-foreman_proxy_content.
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA.
For information on the advisory, and where to find the updated files, follow the link below.
If the solution does not work for you, open a new bug report.
The Pulp upstream bug status is at ON_QA. Updating the external tracker on this bug.
Can we have reproduction steps for this bug to test with 6.3.1?
This issue is largely environmental, which makes it hard to reproduce. It typically presents itself in environments with slow disks, or with a large mongodb under heavy workload. One option would be to:
1) Load as much content as you can into the system
2) Lower the worker timeout to something more like 10 or 20
3) Kick off multiple simultaneous content view publishes
4) Monitor /var/log/messages for the error
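Steps 2 and 4 can be sketched as below. This is only an illustration: it edits a local sample file rather than the live config, and on a real system you would edit /etc/pulp/server.conf itself and restart the Pulp services afterward (service names vary by release).

```shell
# Create a local sample mimicking the relevant part of /etc/pulp/server.conf
cat > server.conf.sample <<'EOF'
[tasks]
# worker_timeout: 30
EOF

# Step 2: uncomment the setting and lower it to 10 seconds
sed -i 's/^# worker_timeout: .*/worker_timeout: 10/' server.conf.sample
grep '^worker_timeout' server.conf.sample

# Step 4: while publishes run, watch for the heartbeat warning, e.g.:
#   tail -f /var/log/messages | grep "Pulp will not operate correctly"
```

The grep at the end confirms the edit took effect before restarting services.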
Verified in Satellite 6.3.1 Snap 1.
I followed the steps outlined in comment #18.
I synced 45 repositories from the RH CDN, totaling over 110k packages.
I then lowered the worker timeout in /etc/pulp/server.conf to 10.
-bash-4.2# grep timeout /etc/pulp/server.conf
# worker_timeout: The amount of time (in seconds) before considering a worker as missing. If Pulp's
Following that, I spread those repositories across 10 content views and published them all at the same time (see attached). Simultaneously, I monitored /var/log/messages for the error message.
As seen below, no errors were produced during the publishes.
-bash-4.2# tail -f /var/log/messages | grep "Pulp will not operate correctly"
Created attachment 1416492 [details]
The Pulp upstream bug status is at CLOSED - CURRENTRELEASE. Updating the external tracker on this bug.