Bug 1643649 - Sequential Actions::Katello::Host::Update calls from subscription-manager can fail under load
Summary: Sequential Actions::Katello::Host::Update calls from subscription-manager can fail under load
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat Satellite 6
Classification: Red Hat
Component: API - Content
Version: 6.3.4
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: high
Target Milestone: 6.6.0
Assignee: Mike McCune
QA Contact: Jitendra Yejare
URL:
Whiteboard:
Duplicates: 1644342 (view as bug list)
Depends On:
Blocks:
 
Reported: 2018-10-26 21:22 UTC by Mike McCune
Modified: 2019-11-15 13:33 UTC (History)
11 users (show)

Fixed In Version: tfm-rubygem-katello-3.12.0.9-1
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
: 1731277 (view as bug list)
Environment:
Last Closed: 2019-10-22 12:46:44 UTC
Target Upstream Version:




Links
System ID Priority Status Summary Last Updated
Foreman Issue Tracker 25422 Normal Closed Sequential Actions::Katello::Host::Update calls from subscription-manager can fail under load 2020-01-07 00:13:47 UTC
Red Hat Knowledge Base (Solution) 3671011 None None None 2018-10-30 15:57:17 UTC
Red Hat Product Errata RHSA-2019:3172 None None None 2019-10-22 12:47:07 UTC

Description Mike McCune 2018-10-26 21:22:41 UTC
We previously switched from a sync task to an async task for host update API calls:

https://github.com/Katello/katello/pull/6709/files

This temporarily reduced a bottleneck at the Passenger layer but introduced a new issue:

During high-traffic situations it is common for clients to issue sequential calls, e.g.:

subscription-manager repos disable ..
subscription-manager repos enable ..

These calls each fire off an async task, and it is common for the second task to fail because the Host is still locked by the first.

With our new ability to run multiple Dynflow executors, we should explore moving this back to a sync task so we can avoid the repeated failed Tasks from host update calls.

Longer term, we need to:

1) Move this out of Foreman Tasks into a direct API call, as we did for registration.
2) Implement a better retry method.
3) Queue per-host, so multiple calls for the same host can be processed in order.
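Item 3 (per-host queuing) can be illustrated with a small sketch. This is not Katello's implementation — the class and names below are hypothetical — but it shows the idea: a per-host lock table, so a second update for the same host waits for the first to finish instead of failing on a taken lock, while updates for different hosts still run in parallel.

```python
import threading
from collections import defaultdict

class PerHostQueue:
    """Serialize host-update calls per host: two calls for the same
    host never overlap, while different hosts proceed in parallel."""

    def __init__(self):
        self._guard = threading.Lock()
        self._locks = defaultdict(threading.Lock)

    def submit(self, host_id, fn, *args):
        with self._guard:        # protect the lock table itself
            host_lock = self._locks[host_id]
        with host_lock:          # a second caller waits here instead of failing
            return fn(*args)

# Demo with hypothetical update actions for one host.
queue = PerHostQueue()
log = []
queue.submit("host-1", log.append, "repos --disable rhel-7-server-rpms")
queue.submit("host-1", log.append, "repos --enable rhel-7-server-rpms")
print(log)
```

Because each host gets its own lock, the disable/enable pair above is applied in submission order rather than racing for a single host-wide task lock.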

Comment 1 Mike McCune 2018-10-30 15:57:17 UTC
*** Bug 1644342 has been marked as a duplicate of this bug. ***

Comment 2 Mike McCune 2018-10-30 15:58:44 UTC
Comments describing another scenario that can cause this, taken from the duplicate we closed:

https://bugzilla.redhat.com/show_bug.cgi?id=1644342

"""
Description of problem:
In case a hypervisor is registered as a Host and is running virt-who, a task to update the Host can fail on an acquired lock. The reason is that both rhsmcertd-worker AND virt-who can fire "PUT /rhsm/consumers/${UUID}" requests concurrently and independently, which leads to concurrent Actions::Katello::Host::Update tasks where either of them fails with a "Required lock is already taken by other running tasks" error.

Our API should handle this better: e.g. accept the 2nd request, wait until the 1st is completed, and then continue with the 2nd one.


Version-Release number of selected component (if applicable):
Sat 6.3, expected in 6.4 as well.


How reproducible:
100% with some effort / proper timing


Steps to Reproduce:
1. Have a hypervisor registered as a Host, with virt-who installed & configured
2. Populate or update some of the host's facts, just to make the Host update spend longer crunching facts.
3. Enable rhsm debug logs and restart the rhsmcertd service.
4. Check the rhsm logs to spot the time when "PUT /rhsm/consumers/${UUID}" is triggered.
5. At that time, run virt-who -o.
6. After a while, check the status of the 2 Host update tasks for this Host.


Actual results:
The 2nd Host update fails with a "Required lock is already taken by other running tasks" error.


Expected results:
The 2nd Host update task proceeds successfully.


"""

Comment 4 Mike McCune 2018-11-09 00:15:52 UTC
Created redmine issue https://projects.theforeman.org/issues/25422 from this bug

Comment 5 pm-sat@redhat.com 2018-11-09 01:09:30 UTC
Upstream bug assigned to mmccune@redhat.com

Comment 6 pm-sat@redhat.com 2018-11-09 01:09:32 UTC
Upstream bug assigned to mmccune@redhat.com

Comment 11 Bryan Kearney 2019-05-09 18:06:41 UTC
Moving this bug to POST for triage into Satellite 6 since the upstream issue https://projects.theforeman.org/issues/25422 has been resolved.

Comment 13 Jitendra Yejare 2019-08-21 07:00:19 UTC
Tested!

@Satellite 6.6 Snap 16


Steps:
-------------
1. Registered 5 RHEL7 systems to the satellite.
2. Ran disable and enable 100 times for all 5 hosts asynchronously, using script [1].
3. Monitored clients and the satellite for failures.


Scripts:
--------------

[1] :
```
# Containers acting as the 5 registered clients (IDs specific to this run)
containers="c72588fe5db6 31ce842fcf7c 41c5a036ef47 1549600b5af5 4ebacb14a9b5"

for run in {1..100}
do
        for c in $containers
        do
                echo -e "\nResult for container: ${c}:\n"
                # -d detaches, so all disable calls are fired asynchronously
                docker exec -d $c subscription-manager repos --disable rhel-7-server-rpms
        done

        for c in $containers
        do
                echo -e "\nResult for container: ${c}:\n"
                docker exec -d $c subscription-manager repos --enable rhel-7-server-rpms
        done
done

# 1001st request, run synchronously to measure how long it waits
time docker exec 4ebacb14a9b5 subscription-manager repos --disable rhel-7-server-rpms

```


Observation:
---------------

1. During execution, I never saw the 'Required lock is already taken' error.
2. On the 1001st request, when I ran 'subscription-manager repos --disable' while the satellite still had a large backlog of 'Host Update' tasks to execute, the call waited for 1 minute 48 seconds but did not fail with the 'required lock' error.


Script stdout:
---------------

```
.

.

.

Result for container: 41c5a036ef47:


Result for container: 1549600b5af5:


Result for container: 4ebacb14a9b5:

Repository 'rhel-7-server-rpms' is disabled for this system. <---------------- The 1001st request; it waited 1m48s and completed without error.

real	1m48.293s
user	0m0.021s
sys	0m0.015s

```

Is this what we expect, to verify this bug?

Comment 14 Mike McCune 2019-08-27 21:02:22 UTC
Did you change the Administer -> Settings -> Content -> "Allow multiple concurrent Actions::Katello::Host::Update calls for one host to be processed at the same time" to 'TRUE'?

If not, and it was false, then you did not have a concurrency level high enough to introduce the lock contention.

If you had that set to 'TRUE' then yes, this bug is VERIFIED.

Comment 15 Jitendra Yejare 2019-08-28 07:53:36 UTC
Verified!

@Satellite 6.6 Snap 17


Steps:
-------------
1. As per comment 14, I set the setting "Allow multiple concurrent Actions::Katello::Host::Update calls for one host to be processed at the same time" to True/Yes.
2. Registered 5 RHEL7 systems to the satellite.
3. Ran disable and enable 100 times for all 5 hosts asynchronously, using script [1].
4. Monitored clients and the satellite for failures.


Scripts:
--------------

[1] :
```
# Containers acting as the 5 registered clients (IDs specific to this run)
containers="c72588fe5db6 31ce842fcf7c 41c5a036ef47 1549600b5af5 4ebacb14a9b5"

for run in {1..100}
do
        for c in $containers
        do
                echo -e "\nResult for container: ${c}:\n"
                # -d detaches, so all disable calls are fired asynchronously
                docker exec -d $c subscription-manager repos --disable rhel-7-server-rpms
        done

        for c in $containers
        do
                echo -e "\nResult for container: ${c}:\n"
                docker exec -d $c subscription-manager repos --enable rhel-7-server-rpms
        done
done

# 1001st request, run synchronously to measure how long it waits
time docker exec 4ebacb14a9b5 subscription-manager repos --disable rhel-7-server-rpms

```


Observation:
---------------

1. During execution, I never saw the 'Required lock is already taken' error on the satellite.
2. On the 1001st request, when I ran 'subscription-manager repos --disable' while the satellite still had a large backlog of 'Host Update' tasks to execute, the call waited for 5 minutes 44 seconds but did not fail with the 'required lock' error.


Script stdout:
---------------

```
.

.

.

Result for container: 41c5a036ef47:


Result for container: 1549600b5af5:


Result for container: 4ebacb14a9b5:

Repository 'rhel-7-server-rpms' is disabled for this system. <---------------- The 1001st request; it waited 5m44s and completed without error.

real	5m44.023s
user	0m0.021s
sys	0m0.015s

```

Comment 17 errata-xmlrpc 2019-10-22 12:46:44 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2019:3172

