Bug 1929344
| Summary: | pulp3: Worker has gone missing during migration (likely due to I/O load) | | |
|---|---|---|---|
| Product: | Red Hat Satellite | Reporter: | Tanya Tereshchenko <ttereshc> |
| Component: | Pulp | Assignee: | satellite6-bugs <satellite6-bugs> |
| Status: | CLOSED ERRATA | QA Contact: | Lai <ltran> |
| Severity: | high | Docs Contact: | |
| Priority: | unspecified | | |
| Version: | 6.9.0 | CC: | bmbouter, ggainey, ipanova, jsherril, pcreech, rchan, ttereshc, wclark |
| Target Milestone: | 6.9.0 | Keywords: | Triaged |
| Target Release: | Unused | | |
| Hardware: | Unspecified | | |
| OS: | Unspecified | | |
| Whiteboard: | | | |
| Fixed In Version: | python-pulp_2to3_migration-0.9.0-1 | Doc Type: | If docs needed, set a value |
| Doc Text: | | Story Points: | --- |
| Clone Of: | | Environment: | |
| Last Closed: | 2021-04-21 13:10:34 UTC | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | | | |
Description
Tanya Tereshchenko
2021-02-16 17:27:06 UTC
The Pulp upstream bug status is at ASSIGNED. Updating the external tracker on this bug.

The Pulp upstream bug priority is at High. Updating the external tracker on this bug.

*** Bug 1931584 has been marked as a duplicate of this bug. ***

Note: for 6.9 we will hard-code this. To check, run `cat /usr/lib/python3.6/site-packages/pulpcore/tasking/constants.py` and confirm WORKER_TTL=XYZ, where XYZ should be 300.

The Pulp upstream bug status is at POST. Updating the external tracker on this bug.

The Pulp upstream bug status is at MODIFIED. Updating the external tracker on this bug.

All upstream Pulp bugs are at MODIFIED+. Moving this bug to POST.

The Pulp upstream bug status is at CLOSED - CURRENTRELEASE. Updating the external tracker on this bug.

Based on comment #4, I ran the command and got this:

from types import SimpleNamespace
TASKING_CONSTANTS = SimpleNamespace(
    # The name of resource manager entries in the workers table
    RESOURCE_MANAGER_WORKER_NAME="resource-manager",
    # The amount of time (in seconds) after which a worker process is considered missing.
    WORKER_TTL=30,
    # The amount of time (in seconds) between checks
    JOB_MONITORING_INTERVAL=5,
    # The Redis key used to force-kill a job
    KILL_KEY="rq:jobs:kill",
)

As you can see, WORKER_TTL is not 300. This is on python3-pulp-2to3-migration-0.9.1-1.el7pc.noarch with 6.9.0_017.

Requesting needinfo from upstream developer bmbouter because the 'FailedQA' flag is set.

Opened a downstream MR with these patches.

I have applied the changes from https://gitlab.sat.engineering.redhat.com/satellite6/pulpcore-packaging/-/merge_requests/69/diffs and I am still facing this issue when syncing large repositories, in this example rhel8 BaseOS. The foreman-task is in status stopped/warning with:

```
Errors:
  Error message: the server returns an error HTTP status code: 502
  Response headers: {"date"=>"Tue, 16 Mar 2021 22:25:07 GMT", "server"=>"Apache", "content-length"=>"445", "connection"=>"close", "content-type"=>"text/html; charset=iso-8859-1"}
  Response body:
    <!DOCTYPE HTML PUBLIC "-//IETF//DTD HTML 2.0//EN">
    <html><head>
    <title>502 Proxy Error</title>
    </head><body>
    <h1>Proxy Error</h1>
    <p>The proxy server received an invalid response from an upstream server.<br />
    The proxy server could not handle the request <em><a href="/pulp/api/v3/content/rpm/packages/">GET /pulp/api/v3/content/rpm/packages/</a></em>.<p>
    Reason: <strong>Error reading from remote server</strong></p></p>
    </body></html>
```

In the messages log, I see the worker timeout:

```
pulpcore-api: [2021-03-16 18:25:37 -0400] [10699] [CRITICAL] WORKER TIMEOUT (pid:10746)
pulpcore-api: [2021-03-16 22:25:38 +0000] [10746] [INFO] Worker exiting (pid: 10746)
pulpcore-api: [2021-03-16 18:25:39 -0400] [12522] [INFO] Booting worker with pid: 12522
```

In fact, I have also tried increasing the timeout as far as WORKER_TTL=1800, and I still see the same issue. A few details that may be relevant:

1. It's occurring when testing an ansible role that enables and syncs rhel7 and rhel8 (BaseOS and AppStream) repositories, so I'm syncing these 3 repositories simultaneously.
2. In my latest attempt, the timeout occurred ~12 minutes into the sync task (which is odd to see after I configured a 30-minute worker timeout).
3. Between attempts I'm dropping DBs and synced content with `foreman-installer --reset-data`, which resets pulpcore content and the DB in postgresql but doesn't touch redis... could this be causing my issue?
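For context: the CRITICAL WORKER TIMEOUT lines above are emitted by gunicorn, whose per-request timeout (30 seconds by default) is separate from pulpcore's WORKER_TTL. A minimal sketch of a gunicorn invocation with a longer request timeout is below; the bind address, worker count, and how the pulpcore-api service actually invokes gunicorn on Satellite are assumptions here and may differ.

```
# Sketch only: gunicorn's per-request timeout is independent of WORKER_TTL.
# The bind address and worker count below are assumptions, not the actual
# pulpcore-api service options shipped with Satellite.
gunicorn pulpcore.app.wsgi:application \
    --bind 127.0.0.1:24817 \
    --timeout 300 \
    --workers 2
```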
It looks like my issue is related to a gunicorn worker timeout, not the rq worker timeout, so it seems unrelated to this BZ.

I just read through comment 15 and comment 16 and reached the same conclusion: it's the gunicorn worker timeout, not the rq worker timeout. Good conclusion.

Steps to test:
1. Create the bash script below, named `useless_hard_drive_io.sh`, to drive disk I/O load up:
#!/bin/bash
# Generate continuous disk I/O load against /var/lib/pulp.
while true
do
    FILE="/var/lib/pulp/$RANDOM"
    sync
    # Write 1 GiB of zeroes, then read it back, then delete it.
    dd if=/dev/zero of="$FILE" bs=1M count=1024
    sync
    dd if="$FILE" of=/dev/null bs=1M count=1024
    rm -f "$FILE"
done
2. Run the script in a couple of terminals.
3. Sync a repo (I used a random RH repo).
4. Perform the migration (a command sketch follows this list).
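A rough shell sketch of steps 2 and 4, assuming the script above was saved as `useless_hard_drive_io.sh` and that the migration is started with `satellite-maintain content prepare` (the exact migration command on your setup may differ):

```
# Sketch only: start two I/O load generators in the background, run the
# migration, then stop the load. `satellite-maintain content prepare` is an
# assumption for how the 2to3 migration is kicked off; adjust as needed.
chmod +x useless_hard_drive_io.sh
./useless_hard_drive_io.sh &
./useless_hard_drive_io.sh &

satellite-maintain content prepare

# Stop the background load generators when the migration is done.
kill %1 %2
```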
Expected:
Migration should finish successfully.
Actual:
Migration finished successfully.
I also checked constants.py and got WORKER_TTL = 300:
# cat /usr/lib/python3.6/site-packages/pulpcore/tasking/constants.py
from types import SimpleNamespace
TASKING_CONSTANTS = SimpleNamespace(
# The name of resource manager entries in the workers table
RESOURCE_MANAGER_WORKER_NAME="resource-manager",
# The amount of time (in seconds) after which a worker process is considered missing.
WORKER_TTL=300,
# The amount of time (in seconds) between checks
JOB_MONITORING_INTERVAL=5,
# The Redis key used to force-kill a job
KILL_KEY="rq:jobs:kill",
)
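As an extra, optional check (not part of the original verification), the messages log can be searched for the missing-worker message after the migration; the exact log wording and location are assumptions based on the log excerpt earlier in this bug:

```
# Sketch only: confirm no pulp worker was declared missing during the run.
# The "gone missing" pattern is an assumption about the exact log wording.
grep -i "gone missing" /var/log/messages \
    && echo "worker(s) reported missing" \
    || echo "no missing workers logged"
```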
Verified on 6.9.0_19.1 with python3-pulp-2to3-migration-0.10.0-1.el7pc.noarch
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (Moderate: Satellite 6.9 Release), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2021:1313