Bug 1962462
| Summary: | pulp3: Sat6.9 with pulpcore tasking system assigns task to a removed worker | | |
|---|---|---|---|
| Product: | Red Hat Satellite | Reporter: | Pavel Moravec <pmoravec> |
| Component: | Pulp | Assignee: | satellite6-bugs <satellite6-bugs> |
| Status: | CLOSED ERRATA | QA Contact: | Lai <ltran> |
| Severity: | high | Docs Contact: | |
| Priority: | high | | |
| Version: | 6.9.0 | CC: | desingh, ggainey, jjeffers, jyejare, osousa, pcreech, peter.vreman, pmendezh, rchan, ttereshc |
| Target Milestone: | 6.9.6 | Keywords: | AutomationBlocker, PrioBumpQA, Triaged, UpgradeBlocker |
| Target Release: | Unused | | |
| Hardware: | x86_64 | | |
| OS: | Linux | | |
| Whiteboard: | | | |
| Fixed In Version: | python-pulpcore-3.7.8 | Doc Type: | If docs needed, set a value |
| Doc Text: | | Story Points: | --- |
| Clone Of: | | Environment: | |
| Last Closed: | 2021-09-21 14:37:26 UTC | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | | | |
Description (Pavel Moravec, 2021-05-20 06:29:31 UTC)
I forgot to specify the 5 repos in the CV; here they are:

- Red Hat Ansible Engine 2.9 RPMs for Red Hat Enterprise Linux 7 Server x86_64
- Red Hat Enterprise Linux 7 Server RPMs x86_64 7Server
- Red Hat Satellite Capsule 6.9 for RHEL 7 Server RPMs x86_64
- Red Hat Satellite Maintenance 6 for RHEL 7 Server RPMs x86_64
- Red Hat Software Collections RPMs for Red Hat Enterprise Linux 7 Server x86_64 7Server

Re-reproduced with a CV filter "include everything older than 2021-01-01".

Per https://hackmd.io/@pulp/reserved_resource_debugging#Gimme-answer-RIGHT-NOW :

Worker 207894a5-0b0d-4138-b7e9-ee9de1906414 owns ReservedResource 71274740-970d-427a-90df-65e8d0f6b4bf and is not in online_workers!!

while core_workers shows:

~~~
               pulp_id                |         pulp_created          |       pulp_last_updated       |      name       |        last_heartbeat         | gracefully_stopped | cleaned_up
--------------------------------------+-------------------------------+-------------------------------+-----------------+-------------------------------+--------------------+------------
 207894a5-0b0d-4138-b7e9-ee9de1906414 | 2021-05-17 09:46:15.450749+02 | 2021-05-17 09:54:57.859257+02 | 1070.redhat.com | 2021-05-18 08:31:59.962307+02 | f                  | t
~~~

last_heartbeat was seen 2 days ago, but the worker got assigned a new task..? (I smell some ungraceful shutdown behind it; I will rather start with a clean table from scratch. A query sketch for spotting such stale reservations follows this comment block.)

The problem isn't with a specific CV (I can't re-reproduce it on a fresh install), but in the way I stopped tasks previously. I *think* I rebooted the system as it started to swap, and that abrupt termination of the tasks caused the tasks and reservations to be wrongly loaded after the start. I will try to reproduce *this* behaviour in the coming days.

The Pulp upstream bug status is at NEW. Updating the external tracker on this bug.
The Pulp upstream bug priority is at Normal. Updating the external tracker on this bug.
The Pulp upstream bug priority is at High. Updating the external tracker on this bug.

I might have a reproducer for this:
- put the system under a bigger load
- so big that workers send their keepalives too late
(I mimicked these two steps by stopping chronyd, moving time back a bit(*), then starting chronyd and letting it move the clocks back to the right time.)
(*) 12 times in a row, I moved time 1s back and slept for 10s.
- restart all services

Then the symptoms match exactly. I will re-run this to see whether it is a more deterministic reproducer; a shell sketch of the sequence follows below. (I tested this on Sat6.10 batch4 / python3-pulpcore-3.11.2-2.el7pc.noarch / python3-pulp-rpm-3.11.0-1.el7pc.noarch.)

Let me put this bug on hold: I can't reproduce it. I know some sequence (with some randomness) among:
- modifying clocks (just to force workers to be marked as inactive due to lost heartbeats)
- having some running + waiting tasks
- restarting services or rebooting the system

Some such sequence did the trick for me twice, among a few tens of attempts. So there is some issue with assigning tasks to inactive workers, but with no known reproducer :( . I will give it some new trials after some time, again..
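The hackmd check referenced above boils down to finding ReservedResource rows whose owning worker has stopped heartbeating. Below is a minimal psql sketch of that check, under several assumptions to verify on your system: the pre-3.14 pulpcore schema visible in the output above (core_worker, plus a core_reservedresource table with a worker_id foreign key), the database name "pulpcore", and a 30-second heartbeat window.

~~~
# Sketch only: list reservations held by workers that have missed their
# heartbeat window. Any row returned matches the "owns ReservedResource
# ... and is not in online_workers" condition reported above.
sudo -u postgres psql pulpcore -c "
SELECT rr.pulp_id AS reservation, w.name AS worker, w.last_heartbeat
  FROM core_reservedresource rr
  JOIN core_worker w ON w.pulp_id = rr.worker_id
 WHERE w.last_heartbeat < now() - interval '30 seconds';"
~~~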
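For the record, here is the clock-manipulation sequence from the reproducer attempt above written out as a shell sketch. The 1-second steps and 10-second sleeps come from the comment itself; the chronyd service name and the satellite-maintain invocation are assumptions based on a standard Satellite box.

~~~
# stop time synchronization so the clock stays where we put it
systemctl stop chronyd
# step the clock back 1 second, 12 times, pausing 10s between steps,
# so workers record their heartbeats "late"
for i in $(seq 1 12); do
    date -s '1 second ago'
    sleep 10
done
# let chronyd pull the clock forward to the right time again
systemctl start chronyd
# restart all services
satellite-maintain service restart
~~~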
This should go away with the new tasking system in pulpcore 3.14.

*** Bug 1975858 has been marked as a duplicate of this bug. ***

The Pulp upstream bug status is at CLOSED - WONTFIX. Updating the external tracker on this bug.

I have the issue also when testing pulp3. Waiting tasks are seen for various task types:
- normal CV publish
- composite CV publish
- Red Hat repo sync (daily sync of ~60 repos)

The bug is persistent and cannot be solved/fixed/worked around by me as a user. With pulp2, a reboot would just restart the work and finish things. But now, even after a reboot, the pending/waiting work is not rescheduled to the new workers. For me as a beta tester this bug is a showstopper for doing any real-world testing (e.g. functional/performance testing that all my use cases work) with pulp3. I really hope (also for Red Hat support) that there will be a 2nd HTB test program with pulpcore 3.14 before it goes public.

Also, another thing I noticed is that concurrency is not handled correctly either. Just like in the RQ redis output shown in the description above, where 1 worker has 600+ tasks. For me it is similar: 1 worker got 80% of the tasks assigned, and therefore the parallel execution of the content handling is almost serialized.

~~~
[crash/LI] root@li-lc-2222:~# curl -k --cert /etc/pki/katello/certs/pulp-client.crt --key /etc/pki/katello/private/pulp-client.key "https://localhost/pulp/api/v3/tasks/?state=waiting" | jq '.results[].worker'
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100  100k  100  100k    0     0  47584      0  0:00:02  0:00:02 --:--:-- 47590
"/pulp/api/v3/workers/2a224ade-3fc8-497e-a0d3-dc56666ce606/"
"/pulp/api/v3/workers/2a224ade-3fc8-497e-a0d3-dc56666ce606/"
"/pulp/api/v3/workers/2a224ade-3fc8-497e-a0d3-dc56666ce606/"
"/pulp/api/v3/workers/2a224ade-3fc8-497e-a0d3-dc56666ce606/"
"/pulp/api/v3/workers/2a224ade-3fc8-497e-a0d3-dc56666ce606/"
"/pulp/api/v3/workers/2a224ade-3fc8-497e-a0d3-dc56666ce606/"
"/pulp/api/v3/workers/2a224ade-3fc8-497e-a0d3-dc56666ce606/"
"/pulp/api/v3/workers/2a224ade-3fc8-497e-a0d3-dc56666ce606/"
"/pulp/api/v3/workers/2a224ade-3fc8-497e-a0d3-dc56666ce606/"
"/pulp/api/v3/workers/2a224ade-3fc8-497e-a0d3-dc56666ce606/"
"/pulp/api/v3/workers/2a224ade-3fc8-497e-a0d3-dc56666ce606/"
"/pulp/api/v3/workers/2a224ade-3fc8-497e-a0d3-dc56666ce606/"
"/pulp/api/v3/workers/2a224ade-3fc8-497e-a0d3-dc56666ce606/"
"/pulp/api/v3/workers/2a224ade-3fc8-497e-a0d3-dc56666ce606/"
"/pulp/api/v3/workers/2a224ade-3fc8-497e-a0d3-dc56666ce606/"
"/pulp/api/v3/workers/2a224ade-3fc8-497e-a0d3-dc56666ce606/"
"/pulp/api/v3/workers/2a224ade-3fc8-497e-a0d3-dc56666ce606/"
"/pulp/api/v3/workers/2a224ade-3fc8-497e-a0d3-dc56666ce606/"
"/pulp/api/v3/workers/2a224ade-3fc8-497e-a0d3-dc56666ce606/"
"/pulp/api/v3/workers/2a224ade-3fc8-497e-a0d3-dc56666ce606/"
"/pulp/api/v3/workers/2a224ade-3fc8-497e-a0d3-dc56666ce606/"
"/pulp/api/v3/workers/2a224ade-3fc8-497e-a0d3-dc56666ce606/"
"/pulp/api/v3/workers/2a224ade-3fc8-497e-a0d3-dc56666ce606/"
"/pulp/api/v3/workers/2a224ade-3fc8-497e-a0d3-dc56666ce606/"
"/pulp/api/v3/workers/2a224ade-3fc8-497e-a0d3-dc56666ce606/"
"/pulp/api/v3/workers/2a224ade-3fc8-497e-a0d3-dc56666ce606/"
"/pulp/api/v3/workers/2a224ade-3fc8-497e-a0d3-dc56666ce606/"
"/pulp/api/v3/workers/2a224ade-3fc8-497e-a0d3-dc56666ce606/"
"/pulp/api/v3/workers/fbe0b7a9-ae4c-4c38-98b7-d89a300b926c/"
"/pulp/api/v3/workers/fbe0b7a9-ae4c-4c38-98b7-d89a300b926c/"
"/pulp/api/v3/workers/fbe0b7a9-ae4c-4c38-98b7-d89a300b926c/"
"/pulp/api/v3/workers/fbe0b7a9-ae4c-4c38-98b7-d89a300b926c/"
"/pulp/api/v3/workers/2a224ade-3fc8-497e-a0d3-dc56666ce606/"
"/pulp/api/v3/workers/2a224ade-3fc8-497e-a0d3-dc56666ce606/"
"/pulp/api/v3/workers/2a224ade-3fc8-497e-a0d3-dc56666ce606/"
"/pulp/api/v3/workers/2a224ade-3fc8-497e-a0d3-dc56666ce606/"
"/pulp/api/v3/workers/2a224ade-3fc8-497e-a0d3-dc56666ce606/"
"/pulp/api/v3/workers/2a224ade-3fc8-497e-a0d3-dc56666ce606/"
"/pulp/api/v3/workers/2a224ade-3fc8-497e-a0d3-dc56666ce606/"
"/pulp/api/v3/workers/2a224ade-3fc8-497e-a0d3-dc56666ce606/"
"/pulp/api/v3/workers/2a224ade-3fc8-497e-a0d3-dc56666ce606/"
"/pulp/api/v3/workers/2a224ade-3fc8-497e-a0d3-dc56666ce606/"
"/pulp/api/v3/workers/2a224ade-3fc8-497e-a0d3-dc56666ce606/"
"/pulp/api/v3/workers/2a224ade-3fc8-497e-a0d3-dc56666ce606/"
"/pulp/api/v3/workers/2a224ade-3fc8-497e-a0d3-dc56666ce606/"
"/pulp/api/v3/workers/2a224ade-3fc8-497e-a0d3-dc56666ce606/"
"/pulp/api/v3/workers/2a224ade-3fc8-497e-a0d3-dc56666ce606/"
"/pulp/api/v3/workers/2a224ade-3fc8-497e-a0d3-dc56666ce606/"
"/pulp/api/v3/workers/2a224ade-3fc8-497e-a0d3-dc56666ce606/"
"/pulp/api/v3/workers/2a224ade-3fc8-497e-a0d3-dc56666ce606/"
"/pulp/api/v3/workers/2a224ade-3fc8-497e-a0d3-dc56666ce606/"
"/pulp/api/v3/workers/2a224ade-3fc8-497e-a0d3-dc56666ce606/"
"/pulp/api/v3/workers/2a224ade-3fc8-497e-a0d3-dc56666ce606/"
"/pulp/api/v3/workers/2a224ade-3fc8-497e-a0d3-dc56666ce606/"
"/pulp/api/v3/workers/2a224ade-3fc8-497e-a0d3-dc56666ce606/"
"/pulp/api/v3/workers/2a224ade-3fc8-497e-a0d3-dc56666ce606/"
"/pulp/api/v3/workers/2a224ade-3fc8-497e-a0d3-dc56666ce606/"
"/pulp/api/v3/workers/03c106c6-4eb9-4487-b104-abfaee42b157/"
"/pulp/api/v3/workers/2a224ade-3fc8-497e-a0d3-dc56666ce606/"
"/pulp/api/v3/workers/5703421a-7fcd-4e55-afe7-15499e86a07e/"
null
null
[crash/LI] root@li-lc-2222:~#
~~~
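Given output like the above, the bug's core symptom (waiting tasks pointing at a worker that is no longer online) can be checked in one pass. A hedged sketch, assuming the same client certificate paths as above, limit/offset pagination on the Pulp 3 API, and bash process substitution:

~~~
# Print workers that waiting tasks reference but that are NOT in the online
# worker list; empty output means every assigned worker is still online.
CERT="--cert /etc/pki/katello/certs/pulp-client.crt --key /etc/pki/katello/private/pulp-client.key"
comm -23 \
  <(curl -sk $CERT "https://localhost/pulp/api/v3/tasks/?state=waiting&limit=1000" \
      | jq -r '.results[].worker // empty' | sort -u) \
  <(curl -sk $CERT "https://localhost/pulp/api/v3/workers/?online=true&limit=100" \
      | jq -r '.results[].pulp_href' | sort -u)
~~~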
"/pulp/api/v3/workers/2a224ade-3fc8-497e-a0d3-dc56666ce606/" "/pulp/api/v3/workers/2a224ade-3fc8-497e-a0d3-dc56666ce606/" "/pulp/api/v3/workers/2a224ade-3fc8-497e-a0d3-dc56666ce606/" "/pulp/api/v3/workers/2a224ade-3fc8-497e-a0d3-dc56666ce606/" "/pulp/api/v3/workers/2a224ade-3fc8-497e-a0d3-dc56666ce606/" "/pulp/api/v3/workers/2a224ade-3fc8-497e-a0d3-dc56666ce606/" "/pulp/api/v3/workers/2a224ade-3fc8-497e-a0d3-dc56666ce606/" "/pulp/api/v3/workers/2a224ade-3fc8-497e-a0d3-dc56666ce606/" "/pulp/api/v3/workers/2a224ade-3fc8-497e-a0d3-dc56666ce606/" "/pulp/api/v3/workers/2a224ade-3fc8-497e-a0d3-dc56666ce606/" "/pulp/api/v3/workers/2a224ade-3fc8-497e-a0d3-dc56666ce606/" "/pulp/api/v3/workers/2a224ade-3fc8-497e-a0d3-dc56666ce606/" "/pulp/api/v3/workers/2a224ade-3fc8-497e-a0d3-dc56666ce606/" "/pulp/api/v3/workers/2a224ade-3fc8-497e-a0d3-dc56666ce606/" "/pulp/api/v3/workers/2a224ade-3fc8-497e-a0d3-dc56666ce606/" "/pulp/api/v3/workers/2a224ade-3fc8-497e-a0d3-dc56666ce606/" "/pulp/api/v3/workers/2a224ade-3fc8-497e-a0d3-dc56666ce606/" "/pulp/api/v3/workers/2a224ade-3fc8-497e-a0d3-dc56666ce606/" "/pulp/api/v3/workers/2a224ade-3fc8-497e-a0d3-dc56666ce606/" "/pulp/api/v3/workers/2a224ade-3fc8-497e-a0d3-dc56666ce606/" "/pulp/api/v3/workers/2a224ade-3fc8-497e-a0d3-dc56666ce606/" "/pulp/api/v3/workers/03c106c6-4eb9-4487-b104-abfaee42b157/" "/pulp/api/v3/workers/2a224ade-3fc8-497e-a0d3-dc56666ce606/" "/pulp/api/v3/workers/5703421a-7fcd-4e55-afe7-15499e86a07e/" null null [crash/LI] root@li-lc-2222:~# ~~~ i did another test after cancling all tasks in pulp3 and katello Now i restarted the publishing of 8x CV with each ~7 Redhat repos. The result is all tasks are waiting on a single worker: ~~~ [crash/LI] root@li-lc-2222:~# curl -k --cert /etc/pki/katello/certs/pulp-client.crt --key /etc/pki/katello/private/pulp-client.key "https://localhost/pulp/api/v3/workers/?online=true" | jq '.results[].pulp_href' % Total % Received % Xferd Average Speed Time Time Time Current Dload Upload Total Spent Left Speed 100 1892 100 1892 0 0 9144 0 --:--:-- --:--:-- --:--:-- 9229 "/pulp/api/v3/workers/3d45473d-bd72-4011-bb5a-46b018345965/" "/pulp/api/v3/workers/b8a96a16-6670-4970-ba55-a7d24503ab3d/" "/pulp/api/v3/workers/c44a28a9-94fa-4100-9ec8-7e7f2dafad5f/" "/pulp/api/v3/workers/774326e7-6378-476b-aa49-20b32b5a9bc4/" "/pulp/api/v3/workers/ca44f583-6168-47a2-a559-ab70ef28f01e/" "/pulp/api/v3/workers/aece7a92-f3df-493a-a483-9414b60368fb/" "/pulp/api/v3/workers/45b2dd0c-dfdd-43aa-864d-9871d2e47fa9/" "/pulp/api/v3/workers/b0e00463-256e-4d8b-ba66-8c87230f9cc1/" "/pulp/api/v3/workers/5930b6a0-b38b-4b6a-a8c5-94df09f35f24/" [crash/LI] root@li-lc-2222:~# curl -k --cert /etc/pki/katello/certs/pulp-client.crt --key /etc/pki/katello/private/pulp-client.key "https://localhost/pulp/api/v3/tasks/?state=running" | jq '.results[].worker' % Total % Received % Xferd Average Speed Time Time Time Current Dload Upload Total Spent Left Speed 100 1691 100 1691 0 0 6963 0 --:--:-- --:--:-- --:--:-- 6987 "/pulp/api/v3/workers/45b2dd0c-dfdd-43aa-864d-9871d2e47fa9/" [crash/LI] root@li-lc-2222:~# curl -k --cert /etc/pki/katello/certs/pulp-client.crt --key /etc/pki/katello/private/pulp-client.key "https://localhost/pulp/api/v3/tasks/?state=waiting" | jq '.results[].worker' | sort | uniq -c % Total % Received % Xferd Average Speed Time Time Time Current Dload Upload Total Spent Left Speed 100 28987 100 28987 0 0 100k 0 --:--:-- --:--:-- --:--:-- 100k 17 "/pulp/api/v3/workers/45b2dd0c-dfdd-43aa-864d-9871d2e47fa9/" [crash/LI] root@li-lc-2222:~# ~~~ The Pulp 
The Pulp upstream bug status is at MODIFIED. Updating the external tracker on this bug.

All upstream Pulp bugs are at MODIFIED+. Moving this bug to POST.

The Pulp upstream bug status is at CLOSED - CURRENTRELEASE. Updating the external tracker on this bug.

As mentioned in https://bugzilla.redhat.com/show_bug.cgi?id=1962462#c25, I have created a dedicated BZ https://bugzilla.redhat.com/show_bug.cgi?id=1986356 regarding the concurrency of pulp3 tasks (which looks to be caused by an exclusive lock always being used instead of reader/writer locks).

Removing this bugzilla from Satellite 6.10. This bugzilla only applies to the old tasking system, which will be included only in 6.9.z to support migrations. In Satellite 6.10, a new tasking system is used.

Steps to retest:

1. On a 6.9 sat, sync the following repos:
   - Red Hat Ansible Engine 2.9 RPMs for Red Hat Enterprise Linux 7 Server x86_64
   - Red Hat Enterprise Linux 7 Server RPMs x86_64 7Server
   - Red Hat Satellite Capsule 6.9 for RHEL 7 Server RPMs x86_64
   - Red Hat Satellite Maintenance 6 for RHEL 7 Server RPMs x86_64
   - Red Hat Software Collections RPMs for Red Hat Enterprise Linux 7 Server x86_64 7Server
2. Create a CV and add the repos from step 1.
3. Publish the CV.
4. Perform the migration.
5. Perform the switchover but do not upgrade to 6.10 (this will enable us to use pulp3 on 6.9; see the command sketch below).
6. Restart services.
7. Republish the CV created in step 2.

Expected:
3) Publish should be successful.
4) Migration should be successful.
5) Switchover should be successful.
7) Republish should be successful.

Actual:
3) Publish is successful.
4) Migration is successful.
5) Switchover is successful.
7) Republish is successful.

In both instances, I tried isolating it: first to just publishing in 6.9 with pulp2, and then, on another new instance, doing a switchover to pulp3 and publishing those same repos. Both instances are successful and without errors.

Verified on sat 6.9.6_02 with python3-pulpcore-3.7.8-1.el7pc.noarch.
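For reference, a minimal command sketch of steps 4 to 6 above, assuming the satellite-maintain content tooling shipped with Satellite 6.9 for the pulp2-to-pulp3 migration (verify the subcommand names against your installed version):

~~~
satellite-maintain content prepare      # step 4: migrate pulp2 content to pulp3
satellite-maintain content switchover   # step 5: switch Katello over to pulp3
satellite-maintain service restart      # step 6: restart services
~~~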
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA.

For information on the advisory (Satellite 6.9.6 Async Bug Fix Update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2021:3628

The Pulp upstream bug status is at CLOSED - CURRENTRELEASE. Updating the external tracker on this bug.
The Pulp upstream bug priority is at Normal. Updating the external tracker on this bug.