1624045 – processing virt-who report blocks RHSM certs checks what can lead to 503 errors

Keywords:
Status:	CLOSED ERRATA
Alias:	None
Product:	Red Hat Satellite
Classification:	Red Hat
Component:	Subscription Management
Sub Component:
Version:	6.3.1
Hardware:	x86_64
OS:	Linux
Priority:	urgent
Severity:	high
Target Milestone:	Unspecified
Assignee:	satellite6-bugs
QA Contact:	jcallaha
Docs Contact:
URL:
Whiteboard:
Depends On:
Blocks:

Reported:	2018-08-30 18:10 UTC by Mike McCune
Modified:	2023-09-07 19:21 UTC
CC List:	19 users
Fixed In Version:	tfm-rubygem-katello-3.4.5.83-1
Doc Type:	If docs needed, set a value
Doc Text:
Clone Of:	1586210
Environment:
Last Closed:	2018-10-11 15:18:06 UTC
Target Upstream Version:
Embargoed:

Links
System	ID	Priority	Status	Summary	Last Updated
Foreman Issue Tracker	23995	High	Closed	processing virt-who report blocks RHSM certs checks what can lead to 503 errors	2020-05-14 13:19:31 UTC
Red Hat Knowledge Base (Solution)	3481501	None	None	None	2018-08-30 18:11:07 UTC
Red Hat Product Errata	RHBA-2018:2915	None	None	None	2018-10-11 15:18:30 UTC

Comment 4 jcallaha 2018-09-25 20:12:10 UTC

Verified in Satellite 6.3.4 Snap 2

Used this loop to send 20 virt-who reports with 25 hypervisors each to the Satellite. 

for i in {1..20}; do docker run --rm -d -e "SATHOST=<my.sat.host>" -e "COUNT=25" jacobcallahan/genvirt; done

This flooded the passenger queue, which is set up with the default 12 workers.

Then I ran a watch on the cert check (every 2s) and watched for a 503.

watch "curl -s -u admin:changeme -X GET https://$(hostname -f)/rhsm/consumers/a5832c5a-964d-48ee-9b3a-ece4d79d7fed/certificates/serials"

At no point was a 503 returned.

Additionally, grepping the httpd log didn't give any results for the expected error.

-bash-4.2# grep "queue is full" /var/log/httpd/error_log
-bash-4.2#
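
The manual watch above could also be scripted so that 503 responses are counted rather than spotted by eye. This is only a rough sketch, not the check that was actually run for verification: SAT_URL, CONSUMER_UUID, SAT_USER, SAT_PASS and PROBES are placeholder variables introduced here, and the 2 second pause simply mirrors the watch interval.

```shell
# Sketch only: SAT_URL, CONSUMER_UUID, SAT_USER, SAT_PASS and PROBES are
# placeholder variables, not values taken from this bug.

# Print just the HTTP status code of one serials request.
probe_status() {
    curl -s -o /dev/null -w '%{http_code}\n' \
        -u "${SAT_USER:-admin}:${SAT_PASS:-changeme}" -X GET \
        "${SAT_URL}/rhsm/consumers/${CONSUMER_UUID}/certificates/serials"
}

# Count how many status lines on stdin are exactly 503.
count_503() {
    grep -c '^503$' || true
}

# Only talk to the network when a Satellite URL has been supplied.
if [ -n "${SAT_URL:-}" ]; then
    i=0
    while [ "$i" -lt "${PROBES:-30}" ]; do
        probe_status
        sleep 2
        i=$((i + 1))
    done | count_503
fi
```

A final count of 0 would correspond to the "no 503" observation; any other number means some probes hit a full request queue.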

Comment 8 errata-xmlrpc 2018-10-11 15:18:06 UTC

Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2018:2915

Comment 10 Mike McCune 2018-11-29 14:35:25 UTC

** Public Bug Summary - This was resolved in 6.3.4 update to Satellite 6 **


Description of problem:
Processing a virt-who report blocks one specific type of RHSM request for some time. Since these requests are fired frequently, they can fill the whole passenger request queue, and passenger then starts to return 503 errors.

Once the virt-who report processing is completed, the RHSM requests are unblocked. Even so, the 503 errors should not happen in the meantime.


Version-Release number of selected component (if applicable):
Sat 6.3.1


How reproducible:
100% on customer data; a generic reproducer should not be hard to develop

Steps to Reproduce:
(generic reproducer)
1. Have a few thousand systems registered, with the default certCheckInterval = 240 in rhsm.conf (the lower the value, the better for the reproducer)
2. Send a virt-who report with a mapping of several hundred systems
3. While the report is being processed, check the WebUI status or the httpd error logs

Specific reproducer that does not require a single real host; the client traffic is mimicked by curl requests:

A) To mimic the RHSM certs check: a GET request to just one particular URI is sufficient:

curl -s -u admin:changeme -X GET https://$(hostname -f)/rhsm/consumers/${uuid}/certificates/serials

(set uuid to the UUIDs of various hosts, i.e. their candlepin consumer IDs, and run these requests concurrently, several times)
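
One possible way to run the requests from A) concurrently is sketched below. It is an illustration under assumptions, not the exact commands used on the customer data: uuids.txt (one consumer UUID per line), SAT_URL, UUID_FILE and PARALLEL are hypothetical names, and xargs -P is used for the parallelism.

```shell
# Sketch only: SAT_URL, UUID_FILE (default uuids.txt) and PARALLEL are
# hypothetical names introduced for this example.

# Build the serials URI for one consumer: $1 = base URL, $2 = consumer UUID.
serials_url() {
    printf '%s/rhsm/consumers/%s/certificates/serials\n' "$1" "$2"
}

# Fire one GET per UUID, $3 requests at a time, printing only status codes.
# $1 = base URL, $2 = file with one UUID per line, $3 = concurrency.
fire_requests() {
    while read -r uuid; do
        serials_url "$1" "$uuid"
    done < "$2" | xargs -n 1 -P "$3" \
        curl -s -o /dev/null -w '%{http_code}\n' -u admin:changeme -X GET
}

# Only talk to the network when a Satellite URL and a UUID list are supplied.
if [ -n "${SAT_URL:-}" ] && [ -r "${UUID_FILE:-uuids.txt}" ]; then
    fire_requests "$SAT_URL" "${UUID_FILE:-uuids.txt}" "${PARALLEL:-20}" \
        | sort | uniq -c
fi
```

The final sort | uniq -c prints a tally per status code, so any 503 responses show up as their own line.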

B) To mimic a virt-who report: prepare virt-who-report.json with the HV<->VMs mappings, and run:

time curl -s -u admin:changeme -X POST -H "Content-Type: application/json" -d @virt-who-report.json 'https://your.satellite/rhsm/hypervisors?owner=Owner&env=Library'



Actual results:
3. shows 503 errors in the WebUI, and /var/log/httpd/error_log contains "Request queue is full. Returning an error" messages.


Expected results:
3. WebUI accessible, no such errors in httpd logs.


Additional info:
Technical explanation of what goes wrong (to some extent):
- virt-who report processing updates the katello_subscription_facets postgres table in a lengthy transaction (*)
- as a result, Katello::Api::Rhsm::CandlepinProxiesController#serials requests are stuck at the step:
 @host.subscription_facet.save!
for tens(!) of seconds, until the virt-who report is finished
- these requests come from the RHSM certs check, i.e. GET requests to the URI /rhsm/consumers/${uuid}/certificates/serials
- these requests accumulate over those few tens of seconds, and under higher load they can fill the whole passenger request queue
- that in turn triggers the 503 errors
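
To get a feel for how quickly such requests pile up, a back-of-the-envelope estimate can help. The sketch below is only a simplified model, not a measurement: it assumes each host sends exactly one serials request per certCheckInterval (which rhsm.conf expresses in minutes), that the requests are spread evenly over time, and that every request arriving during the blocked window has to wait. The function name and the sample numbers are hypothetical.

```shell
# Sketch only: a simplified model with hypothetical numbers.
# estimate_backlog HOSTS INTERVAL_MIN BLOCKED_SEC
#   HOSTS        - number of registered systems
#   INTERVAL_MIN - certCheckInterval in minutes
#   BLOCKED_SEC  - how long the serials requests stay blocked
# Prints the approximate number of requests that arrive while blocked
# (integer division, so the result is rounded down).
estimate_backlog() {
    echo $(( $1 * $3 / ($2 * 60) ))
}

# Hypothetical example: 5000 hosts, a 10 minute interval, 30 seconds blocked.
estimate_backlog 5000 10 30
```

The result can then be compared against the number of passenger workers plus the configured request queue size. Real traffic may well be heavier than this model suggests, since clients can issue more than one request per check.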