We previously switched from a sync task to an async task for host update API calls: https://github.com/Katello/katello/pull/6709/files

This temporarily reduced a bottleneck at the Passenger layer but introduced a new issue: during high-traffic situations it is common for clients to issue sequential calls, e.g.:

subscription-manager repos --disable ..
subscription-manager repos --enable ..

Each of these calls fires off an async task, and it is common for the second task to fail because the Host is still locked by the first. With our new ability to run multiple Dynflow executors, we should explore moving this back to a sync task so we can avoid repeatedly failed Tasks calling host update.

Longer term, we need to:
1) Move this out of Foreman Tasks into a direct API call, like we did for registration.
2) Implement a better retry method.
3) Queue per host, so multiple calls for the same host can be processed in order.
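Until item 2 lands server-side, the retry idea can be approximated on the client. A minimal sketch, not part of any fix in this bug: the `retry` and `flaky` helper names are hypothetical, and `flaky` is a stand-in for a call such as `subscription-manager repos --disable ..` that may fail while the host is locked by a running Actions::Katello::Host::Update task.

```shell
#!/bin/sh
# Hypothetical client-side workaround: retry a command that may fail
# while the host is locked by a running host-update task.
retry() {
  max=$1; shift
  attempt=1
  until "$@"; do
    if [ "$attempt" -ge "$max" ]; then
      echo "giving up after $attempt attempts" >&2
      return 1
    fi
    attempt=$((attempt + 1))
    sleep 1   # fixed back-off; real code would likely back off exponentially
  done
}

# Demo: a stand-in command that fails twice, then succeeds on the 3rd try
# (stands in for e.g. "subscription-manager repos --disable ...").
counter_file=$(mktemp)
echo 0 > "$counter_file"
flaky() {
  count=$(($(cat "$counter_file") + 1))
  echo "$count" > "$counter_file"
  [ "$count" -ge 3 ]       # succeed on the 3rd attempt
}

retry 5 flaky && echo "succeeded after $(cat "$counter_file") attempts"
# prints: succeeded after 3 attempts
```

The same wrapper could be dropped around any of the subscription-manager calls shown later in this bug's verification scripts.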
*** Bug 1644342 has been marked as a duplicate of this bug. ***
Comments from another scenario that can cause this, from the dupe we closed: https://bugzilla.redhat.com/show_bug.cgi?id=1644342

"""
Description of problem:
When a hypervisor is registered as a Host and is running virt-who, a task to update the Host can fail on an acquired lock. The reason is that both rhsmcertd-worker AND virt-who can fire "PUT /rhsm/consumers/${UUID}" requests concurrently and independently, which leads to concurrent Actions::Katello::Host::Update tasks, where one of them fails with a "Required lock is already taken by other running tasks" error. Our API should handle this better: e.g., hold the 2nd request, wait until the 1st is completed, and then continue with the 2nd one.

Version-Release number of selected component (if applicable):
Sat 6.3, expected in 6.4 as well.

How reproducible:
100%, with some effort / proper timing.

Steps to Reproduce:
1. Have a hypervisor registered as a Host, with virt-who installed & configured.
2. Populate or update some of the host's facts, just so the Host update spends longer crunching the facts.
3. Enable rhsm debug logging and restart the rhsmcertd service.
4. Watch the rhsm logs to spot when "PUT /rhsm/consumers/${UUID}" is triggered.
5. At that moment, run virt-who -o.
6. After a while, check the status of the two Host update tasks for this Host.

Actual results:
The 2nd Host update fails with a "Required lock is already taken by other running tasks" error.

Expected results:
The 2nd Host update task completes successfully.
"""
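The race in the reproduction steps above only needs the two writers' PUTs to overlap in time. A minimal shell sketch of that timing pattern, with no Satellite involved: `put_consumer` is a hypothetical placeholder (a plain `echo`) for the real "PUT /rhsm/consumers/${UUID}" HTTP request that rhsmcertd-worker and virt-who each issue independently.

```shell
#!/bin/sh
# Placeholder for the "PUT /rhsm/consumers/${UUID}" request that both
# rhsmcertd-worker and virt-who issue independently; echo stands in
# for the real HTTP call.
put_consumer() {
  echo "$1: PUT /rhsm/consumers/\${UUID}"
}

# Fire both writers at (nearly) the same moment, as in steps 4-5 above:
put_consumer "rhsmcertd-worker" &   # periodic rhsm check-in
put_consumer "virt-who -o" &        # one-shot virt-who run
wait                                # both requests were in flight concurrently
echo "both PUTs issued; two Actions::Katello::Host::Update tasks would now race for the host lock"
```

Each concurrent PUT maps to its own Actions::Katello::Host::Update task on the Satellite side, which is exactly the lock contention described above.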
Created redmine issue https://projects.theforeman.org/issues/25422 from this bug
Upstream bug assigned to mmccune
Moving this bug to POST for triage into Satellite 6 since the upstream issue https://projects.theforeman.org/issues/25422 has been resolved.
Tested! @Satellite 6.6 Snap 16

Steps:
-------------
1. Registered 5 RHEL7 systems to the Satellite.
2. Ran the disable/enable cycle 100 times for all 5 hosts asynchronously, using script [1].
3. Monitored the clients and the Satellite for failures.

Scripts:
--------------
[1]:
```
for run in {1..100}
do
  for c in c72588fe5db6 31ce842fcf7c 41c5a036ef47 1549600b5af5 4ebacb14a9b5
  do
    echo -e "\nResult for container: ${c}:\n"
    docker exec -d "$c" subscription-manager repos --disable rhel-7-server-rpms
  done
  for c in c72588fe5db6 31ce842fcf7c 41c5a036ef47 1549600b5af5 4ebacb14a9b5
  do
    echo -e "\nResult for container: ${c}:\n"
    docker exec -d "$c" subscription-manager repos --enable rhel-7-server-rpms
  done
done
time docker exec 4ebacb14a9b5 subscription-manager repos --disable rhel-7-server-rpms
```

Observation:
---------------
1. During execution, I never saw the 'Required lock is already taken' error.
2. On the 1001st call, when I executed 'subscription-manager repos --disable' while the Satellite still had a good chunk of 'Host Update' tasks left to execute, the call waited 1 minute 48 seconds but did not fail with the 'required lock' error.

Script stdout:
---------------
```
.
.
.
Result for container: 41c5a036ef47:

Result for container: 1549600b5af5:

Result for container: 4ebacb14a9b5:

Repository 'rhel-7-server-rpms' is disabled for this system.  <---------------- The 1001st request, which waited 1m48s and executed without error.

real 1m48.293s
user 0m0.021s
sys 0m0.015s
```

Is this what we expect, to verify this bug?
Did you change Administer -> Settings -> Content -> "Allow multiple concurrent Actions::Katello::Host::Update calls for one host to be processed at the same time" to 'True'? If it was left at 'False', then you did not have a concurrency level high enough to introduce the lock contention. If it was set to 'True', then yes, this bug is VERIFIED.
Verified! @Satellite 6.6 Snap 17

Steps:
-------------
1. As per comment 14, set the setting "Allow multiple concurrent Actions::Katello::Host::Update calls for one host to be processed at the same time" to 'True'.
2. Registered 5 RHEL7 systems to the Satellite.
3. Ran the disable/enable cycle 100 times for all 5 hosts asynchronously, using script [1].
4. Monitored the clients and the Satellite for failures.

Scripts:
--------------
[1]:
```
for run in {1..100}
do
  for c in c72588fe5db6 31ce842fcf7c 41c5a036ef47 1549600b5af5 4ebacb14a9b5
  do
    echo -e "\nResult for container: ${c}:\n"
    docker exec -d "$c" subscription-manager repos --disable rhel-7-server-rpms
  done
  for c in c72588fe5db6 31ce842fcf7c 41c5a036ef47 1549600b5af5 4ebacb14a9b5
  do
    echo -e "\nResult for container: ${c}:\n"
    docker exec -d "$c" subscription-manager repos --enable rhel-7-server-rpms
  done
done
time docker exec 4ebacb14a9b5 subscription-manager repos --disable rhel-7-server-rpms
```

Observation:
---------------
1. During execution, I never saw the 'Required lock is already taken' error on the Satellite.
2. On the 1001st call, when I executed 'subscription-manager repos --disable' while the Satellite still had a good chunk of 'Host Update' tasks left to execute, the call waited 5 minutes 44 seconds but did not fail with the 'required lock' error.

Script stdout:
---------------
```
.
.
.
Result for container: 41c5a036ef47:

Result for container: 1549600b5af5:

Result for container: 4ebacb14a9b5:

Repository 'rhel-7-server-rpms' is disabled for this system.  <---------------- The 1001st request, which waited 5m44s and executed without error.

real 5m44.023s
user 0m0.021s
sys 0m0.015s
```
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHSA-2019:3172