Bug 1643649 - Sequential Actions::Katello::Host::Update calls from subscription-manager can fail under load
Summary: Sequential Actions::Katello::Host::Update calls from subscription-manager can fail under load
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat Satellite 6
Classification: Red Hat
Component: API - Content
Version: 6.3.4
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: high
Target Milestone: 6.6.0
Assignee: Mike McCune
QA Contact: Jitendra Yejare
URL:
Whiteboard:
Duplicates: 1644342 (view as bug list)
Depends On:
Blocks:
 
Reported: 2018-10-26 21:22 UTC by Mike McCune
Modified: 2019-11-15 13:33 UTC (History)
11 users (show)

Fixed In Version: tfm-rubygem-katello-3.12.0.9-1
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
: 1731277 (view as bug list)
Environment:
Last Closed: 2019-10-22 12:46:44 UTC
Target Upstream Version:




Links
System ID Priority Status Summary Last Updated
Foreman Issue Tracker 25422 Normal Closed Sequential Actions::Katello::Host::Update calls from subscription-manager can fail under load 2020-01-07 00:13:47 UTC
Red Hat Knowledge Base (Solution) 3671011 None None None 2018-10-30 15:57:17 UTC
Red Hat Product Errata RHSA-2019:3172 None None None 2019-10-22 12:47:07 UTC

Description Mike McCune 2018-10-26 21:22:41 UTC
We previously switched from a sync task to an async task for host update API calls:

https://github.com/Katello/katello/pull/6709/files

This temporarily reduced a bottleneck at the Passenger layer but introduced a new issue:

During high-traffic situations it is common for clients to issue sequential calls, e.g.:

subscription-manager repos disable ..
subscription-manager repos enable ..

These calls each fire off an async task, and it is common for the second task to fail because the Host is still locked by the first.

With our new ability to run multiple Dynflow executors, we should explore moving this back to a sync task so we can avoid the repeated failed Tasks from host update calls.

Longer term, we need to:

1) Move this out of Foreman Tasks into a direct API call, as we did for registration.
2) Implement a better retry method.
3) Queue per-host, so multiple calls for the same host can be processed in order.
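Item 3 (per-host queuing) can be illustrated with a small sketch. This is not Katello's implementation — the class and names below are hypothetical — but it shows the idea: a per-host lock table, so a second update for the same host waits for the first to finish instead of failing on a taken lock, while updates for different hosts still run in parallel.

```python
import threading
from collections import defaultdict

class PerHostQueue:
    """Serialize host-update calls per host: two calls for the same
    host never overlap, while different hosts proceed in parallel."""

    def __init__(self):
        self._guard = threading.Lock()
        self._locks = defaultdict(threading.Lock)

    def submit(self, host_id, fn, *args):
        with self._guard:        # protect the lock table itself
            host_lock = self._locks[host_id]
        with host_lock:          # a second caller waits here instead of failing
            return fn(*args)

# Demo with hypothetical update actions for one host.
queue = PerHostQueue()
log = []
queue.submit("host-1", log.append, "repos --disable rhel-7-server-rpms")
queue.submit("host-1", log.append, "repos --enable rhel-7-server-rpms")
print(log)
```

Because each host gets its own lock, the disable/enable pair above is applied in submission order rather than racing for a single host-wide task lock.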

Comment 1 Mike McCune 2018-10-30 15:57:17 UTC
*** Bug 1644342 has been marked as a duplicate of this bug. ***

Comment 2 Mike McCune 2018-10-30 15:58:44 UTC
Comments describing another scenario that can cause this, taken from the duplicate we closed:

https://bugzilla.redhat.com/show_bug.cgi?id=1644342

"""
Description of problem:
In case a hypervisor is registered as a Host and is running virt-who, a task to update the Host can fail on an acquired lock. The reason is that both rhsmcertd-worker AND virt-who can fire "PUT /rhsm/consumers/${UUID}" requests concurrently and independently, which leads to concurrent Actions::Katello::Host::Update tasks where either of them fails with a "Required lock is already taken by other running tasks" error.

Our API should handle this better: e.g. accept the 2nd request, wait until the 1st is completed, and then continue with the 2nd one.


Version-Release number of selected component (if applicable):
Sat 6.3, expected in 6.4 as well.


How reproducible:
100% with some effort / proper timing


Steps to Reproduce:
1. Have a hypervisor registered as a Host, with virt-who installed & configured
2. Populate or update some of the host's facts, just to make the Host update spend longer crunching facts.
3. Enable rhsm debug logs and restart the rhsmcertd service.
4. Check the rhsm logs to spot the time when "PUT /rhsm/consumers/${UUID}" is triggered.
5. At that time, run virt-who -o.
6. After a while, check the status of the 2 Host update tasks for this Host.


Actual results:
The 2nd Host update fails with a "Required lock is already taken by other running tasks" error.


Expected results:
The 2nd Host update task proceeds successfully.


"""

Comment 4 Mike McCune 2018-11-09 00:15:52 UTC
Created redmine issue https://projects.theforeman.org/issues/25422 from this bug

Comment 5 pm-sat@redhat.com 2018-11-09 01:09:30 UTC
Upstream bug assigned to mmccune@redhat.com

Comment 6 pm-sat@redhat.com 2018-11-09 01:09:32 UTC
Upstream bug assigned to mmccune@redhat.com

Comment 11 Bryan Kearney 2019-05-09 18:06:41 UTC
Moving this bug to POST for triage into Satellite 6 since the upstream issue https://projects.theforeman.org/issues/25422 has been resolved.

Comment 13 Jitendra Yejare 2019-08-21 07:00:19 UTC
Tested!

@Satellite 6.6 Snap 16


Steps:
-------------
1. Registered 5 RHEL7 systems to the satellite.
2. Ran disable and enable 100 times for all 5 hosts asynchronously, using script [1].
3. Monitored clients and the satellite for failures.


Scripts:
--------------

[1] :
```
# Containers acting as the 5 registered clients (IDs specific to this run)
containers="c72588fe5db6 31ce842fcf7c 41c5a036ef47 1549600b5af5 4ebacb14a9b5"

for run in {1..100}
do
        for c in $containers
        do
                echo -e "\nResult for container: ${c}:\n"
                # -d detaches, so all disable calls are fired asynchronously
                docker exec -d $c subscription-manager repos --disable rhel-7-server-rpms
        done

        for c in $containers
        do
                echo -e "\nResult for container: ${c}:\n"
                docker exec -d $c subscription-manager repos --enable rhel-7-server-rpms
        done
done

# 1001st request, run synchronously to measure how long it waits
time docker exec 4ebacb14a9b5 subscription-manager repos --disable rhel-7-server-rpms

```


Observation:
---------------

1. During execution, I never saw the 'Required lock is already taken' error.
2. On the 1001st request, when I ran 'subscription-manager repos --disable' while the satellite still had a large backlog of 'Host Update' tasks to execute, the call waited for 1 minute 48 seconds but did not fail with the 'required lock' error.


Script stdout:
---------------

```
.

.

.

Result for container: 41c5a036ef47:


Result for container: 1549600b5af5:


Result for container: 4ebacb14a9b5:

Repository 'rhel-7-server-rpms' is disabled for this system. <---------------- The 1001st request; it waited 1m48s and completed without error.

real	1m48.293s
user	0m0.021s
sys	0m0.015s

```

Is this what we expect, to verify this bug?

Comment 14 Mike McCune 2019-08-27 21:02:22 UTC
Did you change the Administer -> Settings -> Content -> "Allow multiple concurrent Actions::Katello::Host::Update calls for one host to be processed at the same time" to 'TRUE'?

If not, and it was false, then you did not have a concurrency level high enough to introduce the lock contention.

If you had that set to 'TRUE' then yes, this bug is VERIFIED.

Comment 15 Jitendra Yejare 2019-08-28 07:53:36 UTC
Verified!

@Satellite 6.6 Snap 17


Steps:
-------------
1. As per comment 14, I set the setting "Allow multiple concurrent Actions::Katello::Host::Update calls for one host to be processed at the same time" to True/Yes.
2. Registered 5 RHEL7 systems to the satellite.
3. Ran disable and enable 100 times for all 5 hosts asynchronously, using script [1].
4. Monitored clients and the satellite for failures.


Scripts:
--------------

[1] :
```
# Containers acting as the 5 registered clients (IDs specific to this run)
containers="c72588fe5db6 31ce842fcf7c 41c5a036ef47 1549600b5af5 4ebacb14a9b5"

for run in {1..100}
do
        for c in $containers
        do
                echo -e "\nResult for container: ${c}:\n"
                # -d detaches, so all disable calls are fired asynchronously
                docker exec -d $c subscription-manager repos --disable rhel-7-server-rpms
        done

        for c in $containers
        do
                echo -e "\nResult for container: ${c}:\n"
                docker exec -d $c subscription-manager repos --enable rhel-7-server-rpms
        done
done

# 1001st request, run synchronously to measure how long it waits
time docker exec 4ebacb14a9b5 subscription-manager repos --disable rhel-7-server-rpms

```


Observation:
---------------

1. During execution, I never saw the 'Required lock is already taken' error on the satellite.
2. On the 1001st request, when I ran 'subscription-manager repos --disable' while the satellite still had a large backlog of 'Host Update' tasks to execute, the call waited for 5 minutes 44 seconds but did not fail with the 'required lock' error.


Script stdout:
---------------

```
.

.

.

Result for container: 41c5a036ef47:


Result for container: 1549600b5af5:


Result for container: 4ebacb14a9b5:

Repository 'rhel-7-server-rpms' is disabled for this system. <---------------- The 1001st request; it waited 5m44s and completed without error.

real	5m44.023s
user	0m0.021s
sys	0m0.015s

```

Comment 17 errata-xmlrpc 2019-10-22 12:46:44 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2019:3172

