Bug 1929344
| Summary: | pulp3: Worker has gone missing during migration (likely due to I/O load) | | |
|---|---|---|---|
| Product: | Red Hat Satellite | Reporter: | Tanya Tereshchenko <ttereshc> |
| Component: | Pulp | Assignee: | satellite6-bugs <satellite6-bugs> |
| Status: | CLOSED ERRATA | QA Contact: | Lai <ltran> |
| Severity: | high | Docs Contact: | |
| Priority: | unspecified | | |
| Version: | 6.9.0 | CC: | bmbouter, ggainey, ipanova, jsherril, pcreech, rchan, ttereshc, wclark |
| Target Milestone: | 6.9.0 | Keywords: | Triaged |
| Target Release: | Unused | | |
| Hardware: | Unspecified | | |
| OS: | Unspecified | | |
| Whiteboard: | | | |
| Fixed In Version: | python-pulp_2to3_migration-0.9.0-1 | Doc Type: | If docs needed, set a value |
| Doc Text: | | Story Points: | --- |
| Clone Of: | | Environment: | |
| Last Closed: | 2021-04-21 13:10:34 UTC | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | | | |
Description
Tanya Tereshchenko
2021-02-16 17:27:06 UTC
The Pulp upstream bug status is at ASSIGNED. Updating the external tracker on this bug.

The Pulp upstream bug priority is at High. Updating the external tracker on this bug.

*** Bug 1931584 has been marked as a duplicate of this bug. ***

Note: for 6.9 we will hard-code this. To check, run `cat /usr/lib/python3.6/site-packages/pulpcore/tasking/constants.py` and confirm WORKER_TTL=XYZ, where XYZ should be 300.

The Pulp upstream bug status is at POST. Updating the external tracker on this bug.

The Pulp upstream bug status is at MODIFIED. Updating the external tracker on this bug.

All upstream Pulp bugs are at MODIFIED+. Moving this bug to POST.

The Pulp upstream bug status is at CLOSED - CURRENTRELEASE. Updating the external tracker on this bug.

Based on comment #4, I ran the command and got this:

from types import SimpleNamespace
TASKING_CONSTANTS = SimpleNamespace(
    # The name of resource manager entries in the workers table
    RESOURCE_MANAGER_WORKER_NAME="resource-manager",
    # The amount of time (in seconds) after which a worker process is considered missing.
    WORKER_TTL=30,
    # The amount of time (in seconds) between checks
    JOB_MONITORING_INTERVAL=5,
    # The Redis key used to force-kill a job
    KILL_KEY="rq:jobs:kill",
)

As you can see, WORKER_TTL is not 300. This is on python3-pulp-2to3-migration-0.9.1-1.el7pc.noarch with 6.9.0_017.

Requesting needinfo from upstream developer bmbouter because the 'FailedQA' flag is set.

Opened a downstream MR with these patches.

I have applied the changes from https://gitlab.sat.engineering.redhat.com/satellite6/pulpcore-packaging/-/merge_requests/69/diffs and I am still facing this issue when syncing large repositories, in this example rhel8 BaseOS. The foreman-task is in status stopped/warning with:

```
Errors:
  Error message: the server returns an error HTTP status code: 502
  Response headers: {"date"=>"Tue, 16 Mar 2021 22:25:07 GMT", "server"=>"Apache", "content-length"=>"445", "connection"=>"close", "content-type"=>"text/html; charset=iso-8859-1"}
  Response body:
    <!DOCTYPE HTML PUBLIC "-//IETF//DTD HTML 2.0//EN">
    <html><head>
    <title>502 Proxy Error</title>
    </head><body>
    <h1>Proxy Error</h1>
    <p>The proxy server received an invalid response from an upstream server.<br />
    The proxy server could not handle the request <em><a href="/pulp/api/v3/content/rpm/packages/">GET /pulp/api/v3/content/rpm/packages/</a></em>.<p>
    Reason: <strong>Error reading from remote server</strong></p></p>
    </body></html>
```

In the messages log, I see the worker timeout:

```
pulpcore-api: [2021-03-16 18:25:37 -0400] [10699] [CRITICAL] WORKER TIMEOUT (pid:10746)
pulpcore-api: [2021-03-16 22:25:38 +0000] [10746] [INFO] Worker exiting (pid: 10746)
pulpcore-api: [2021-03-16 18:25:39 -0400] [12522] [INFO] Booting worker with pid: 12522
```

In fact, I have also tried increasing the timeout as far as WORKER_TTL=1800, and I still see the same issue. A few details that may be relevant:

1. It's occurring when testing an ansible role that enables and syncs rhel7 and rhel8 (BaseOS and AppStream) repositories, so I'm syncing these 3 repositories simultaneously.
2. In my latest attempt, the timeout occurred ~12 minutes into the sync task (which is odd to see after I configured a 30-minute worker timeout).
3. Between attempts I'm dropping DBs and synced content with `foreman-installer --reset-data`, which resets pulpcore content and the DB in postgresql but doesn't touch redis... could this be causing my issue?
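For context: the CRITICAL WORKER TIMEOUT lines above are emitted by gunicorn, whose per-request timeout (30 seconds by default) is separate from pulpcore's WORKER_TTL. A minimal sketch of a gunicorn invocation with a longer request timeout is below; the bind address, worker count, and how the pulpcore-api service actually invokes gunicorn on Satellite are assumptions here and may differ.

```
# Sketch only: gunicorn's per-request timeout is independent of WORKER_TTL.
# The bind address and worker count below are assumptions, not the actual
# pulpcore-api service options shipped with Satellite.
gunicorn pulpcore.app.wsgi:application \
    --bind 127.0.0.1:24817 \
    --timeout 300 \
    --workers 2
```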
It looks like my issue is related to a gunicorn worker timeout, not the rq worker timeout, so it seems unrelated to this BZ.

I just read through comment 15 and comment 16 and reached the same conclusion: it's the gunicorn worker timeout, not the rq worker timeout. Good conclusion.

Steps to test:
1. Create the bash script below, named `useless_hard_drive_io.sh`, to drive disk I/O load up:
#!/bin/bash
# Generate continuous disk I/O load against /var/lib/pulp.
while true
do
    FILE="/var/lib/pulp/$RANDOM"
    sync
    # Write 1 GiB of zeroes, then read it back, then delete it.
    dd if=/dev/zero of="$FILE" bs=1M count=1024
    sync
    dd if="$FILE" of=/dev/null bs=1M count=1024
    rm -f "$FILE"
done
2. Run the script in a couple of terminals.
3. Sync a repo (I used a random RH repo).
4. Perform the migration (a command sketch follows this list).
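A rough shell sketch of steps 2 and 4, assuming the script above was saved as `useless_hard_drive_io.sh` and that the migration is started with `satellite-maintain content prepare` (the exact migration command on your setup may differ):

```
# Sketch only: start two I/O load generators in the background, run the
# migration, then stop the load. `satellite-maintain content prepare` is an
# assumption for how the 2to3 migration is kicked off; adjust as needed.
chmod +x useless_hard_drive_io.sh
./useless_hard_drive_io.sh &
./useless_hard_drive_io.sh &

satellite-maintain content prepare

# Stop the background load generators when the migration is done.
kill %1 %2
```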
Expected:
Migration should finish successfully.
Actual:
Migration finished successfully.
I also checked constants.py and got WORKER_TTL = 300:
# cat /usr/lib/python3.6/site-packages/pulpcore/tasking/constants.py
from types import SimpleNamespace
TASKING_CONSTANTS = SimpleNamespace(
# The name of resource manager entries in the workers table
RESOURCE_MANAGER_WORKER_NAME="resource-manager",
# The amount of time (in seconds) after which a worker process is considered missing.
WORKER_TTL=300,
# The amount of time (in seconds) between checks
JOB_MONITORING_INTERVAL=5,
# The Redis key used to force-kill a job
KILL_KEY="rq:jobs:kill",
)
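As an extra, optional check (not part of the original verification), the messages log can be searched for the missing-worker message after the migration; the exact log wording and location are assumptions based on the log excerpt earlier in this bug:

```
# Sketch only: confirm no pulp worker was declared missing during the run.
# The "gone missing" pattern is an assumption about the exact log wording.
grep -i "gone missing" /var/log/messages \
    && echo "worker(s) reported missing" \
    || echo "no missing workers logged"
```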
Verified on 6.9.0_19.1 with python3-pulp-2to3-migration-0.10.0-1.el7pc.noarch
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (Moderate: Satellite 6.9 Release), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2021:1313