Bug 1600201 - [candlepin] processing virt-who report blocks RHSM certs checks, which can lead to 503 errors
Keywords:
Status: CLOSED WONTFIX
Alias: None
Product: Red Hat Satellite
Classification: Red Hat
Component: Candlepin
Version: 6.3.2
Hardware: Unspecified
OS: Unspecified
Priority: high
Severity: high
Target Milestone: Unspecified
Assignee: satellite6-bugs
QA Contact: jcallaha
URL:
Whiteboard:
Depends On: 1600594
Blocks:
 
Reported: 2018-07-11 16:38 UTC by Mike McCune
Modified: 2024-01-06 04:25 UTC
CC: 12 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Clones: 1600592, 1600593, 1600594
Environment:
Last Closed: 2019-10-29 13:27:22 UTC
Target Upstream Version:
Embargoed:



Description Mike McCune 2018-07-11 16:38:36 UTC
This is the Candlepin specific bug spawning from:

https://bugzilla.redhat.com/show_bug.cgi?id=1586210

"""
Description of problem:
Processing a virt-who report causes one specific RHSM request type to be blocked for some time. Since these requests are fired frequently, they can occupy the whole Passenger queue, at which point Passenger starts returning 503 errors.

Once the virt-who report processing is completed, the blocked RHSM requests are released. Even so, the 503 errors should not occur in the meantime.


Version-Release number of selected component (if applicable):
Sat 6.3.1


How reproducible:
100% on customer data; a generic reproducer should not be hard to develop

Steps to Reproduce:
(generic reproducer)
1. Have a few thousand systems registered, with the default certCheckInterval = 240 in rhsm.conf (the lower the value, the better for the reproducer)
2. Send a virt-who report with a mapping of several hundred systems
3. While the report is being processed, check the WebUI status or the httpd error logs

A particular reproducer that does not require a single registered host; the traffic is mimicked by specific curl requests:

A) To mimic the RHSM certs check: in fact, a single GET request to one particular URI is sufficient:

curl -s -u admin:changeme -X GET https://$(hostname -f)/rhsm/consumers/${uuid}/certificates/serials

(set uuid to various UUIDs of hosts / candlepin consumer IDs, and run these requests concurrently several times)
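A minimal sketch of one way to generate that concurrent traffic, assuming a hypothetical consumer-uuids.txt file with one consumer UUID per line (the file name and the admin:changeme credentials are illustrative):

# Fire the certificates/serials GET for every UUID in parallel, mimicking
# the periodic RHSM cert checks; consumer-uuids.txt is a hypothetical input file.
while read -r uuid; do
  curl -s -o /dev/null -w "%{http_code}\n" -u admin:changeme \
    -X GET "https://$(hostname -f)/rhsm/consumers/${uuid}/certificates/serials" &
done < consumer-uuids.txt
wait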

B) To mimic a virt-who report: have a virt-who-report.json with the hypervisor<->VMs mappings, and run:

time curl -s -u admin:changeme -X POST -H "Content-Type: application/json" -d @virt-who-report.json 'https://your.satellite/rhsm/hypervisors?owner=Owner&env=Library'
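For illustration only, a virt-who-report.json for the legacy (synchronous) hypervisor check-in could look roughly like the sketch below; the exact payload schema depends on the Candlepin and virt-who versions in use, so the field names here are assumptions:

# Hypothetical mapping file: each hypervisor UUID maps to a list of guest IDs.
cat > virt-who-report.json <<'EOF'
{
  "hypervisor-uuid-1": [
    {"guestId": "guest-uuid-a"},
    {"guestId": "guest-uuid-b"}
  ],
  "hypervisor-uuid-2": [
    {"guestId": "guest-uuid-c"}
  ]
}
EOF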



Actual results:
3. The WebUI shows 503 errors, and /var/log/httpd/error_log contains "Request queue is full. Returning an error" entries.


Expected results:
3. The WebUI remains accessible, and no such errors appear in the httpd logs.


Additional info:
Technical explanation of what goes wrong (to some extent):
- virt-who report processing updates the katello_subscription_facets postgres table inside a lengthy transaction (*)
- as a result, Katello::Api::Rhsm::CandlepinProxiesController#serials requests get stuck on the step:
 @host.subscription_facet.save!
for tens(!) of seconds, until the virt-who report is finished
- these requests come from the RHSM certs check queries, i.e. the URI /rhsm/consumers/${uuid}/certificates/serials
- the stuck requests accumulate during those tens of seconds, and under a higher load of them this can fill the whole Passenger request queue (see the monitoring sketch below)
- that consequently triggers the 503 errors
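A rough way to watch the queue fill up while the report is being processed (a sketch; the 5-second interval is arbitrary and passenger-status typically needs to be run as root):

# In one terminal: watch the Passenger request queue size; it grows while the
# report is processed and the 503s start once the queue is full.
watch -n 5 'passenger-status | grep -i "requests in queue"'

# In another terminal: watch for the queue-full errors in the httpd log.
tail -f /var/log/httpd/error_log | grep "Request queue is full"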

Particular reproducer on customer data to be provided in next comment.
"""

Comment 1 Mike McCune 2018-07-11 16:39:55 UTC
As noted here:


https://bugzilla.redhat.com/show_bug.cgi?id=1586210#c22

we are requesting that an index be added to the cp_consumer_facts table (with a proper name) to speed up virt-who processing. We saw 30-40 minute virt-who transactions reduced to ~5-6 minutes with the addition of this index.

"""
Ran the virt-who import and noticed that it is making a lot of relatively slow queries:

select cp_consumer.id from cp_consumer inner join cp_consumer_facts on cp_consumer.id = cp_consumer_facts.cp_consumer_id where cp_consumer_facts.mapkey = 'virt.uuid' and lower(cp_consumer_facts.element) in ( '?', '?') and cp_consumer.owner_id = '?' order by cp_consumer.updated desc

these were landing in the 200-300ms range.

I added an index:

echo "CREATE INDEX lower_case_test ON cp_consumer_facts ((lower(element)));" | sudo -u postgres psql -d candlepin


this dropped the above queries down into the 0.300ms range.

What this did was speed up the virt-who import from 30-45 minutes down to 6 minutes.

This helps overall performance, but we were still getting 503 errors during the 6-minute window of the virt-who import. I'd recommend creating this index in the short term to assist while we continue to investigate why the Satellite is unable to keep up with the load while the virt-who import is running.
"""
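For illustration, whether the planner actually uses the new index can be checked with EXPLAIN ANALYZE; the owner_id and guest virt.uuid literals below are placeholders:

# Placeholder values; substitute a real owner_id and guest virt.uuid from the deployment.
echo "EXPLAIN ANALYZE
      SELECT cp_consumer.id FROM cp_consumer
      INNER JOIN cp_consumer_facts ON cp_consumer.id = cp_consumer_facts.cp_consumer_id
      WHERE cp_consumer_facts.mapkey = 'virt.uuid'
        AND lower(cp_consumer_facts.element) IN ('<guest-virt-uuid>')
        AND cp_consumer.owner_id = '<owner-id>'
      ORDER BY cp_consumer.updated DESC;" | sudo -u postgres psql -d candlepin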

Comment 2 Kevin Howell 2018-07-12 14:45:49 UTC
Note, we'll need to evaluate this in the context of the recent changes to hypervisor check-in handling in https://github.com/candlepin/candlepin/pull/2035 . We may end up handling this differently in candlepin-2.1, candlepin-2.4, and master as a result...

Comment 3 Kevin Howell 2018-09-17 20:22:11 UTC
This may be resolved by use of async hypervisor check-ins. Note that these are used automatically by default if both virt-who and the version of Satellite in use support them.

Assuming I'm reading git histories correctly, looks like:
 virt-who-0.15-1
 katello-3.5.0

or greater will use async by default.
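For reference, a quick check of the installed versions (the katello package name below, tfm-rubygem-katello, is an assumption about how it is packaged on the Satellite):

# On the virt-who host: async hypervisor check-ins need virt-who >= 0.15-1.
rpm -q virt-who

# On the Satellite: async check-ins need katello >= 3.5.0 (package name assumed).
rpm -q tfm-rubygem-katello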

jsherril, can you confirm my statement about katello versions ^ ?

Comment 4 Justin Sherrill 2018-09-18 16:54:57 UTC
Kevin, 

yes, that is correct. However, the problem seemed to be that a long hypervisor check-in was holding locks on the database. This was a problem in Katello, but it was resolved as part of https://bugzilla.redhat.com/show_bug.cgi?id=1586210.

In addition, it seemed like the missing index that Mike pointed out was making the problem worse (by increasing the duration of the check-in). It's possible that these two fixes together will improve the situation enough, but possibly not. More investigation may be needed after the index is in place.
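For reference, lock waits caused by a long check-in transaction can be spotted on the Satellite's PostgreSQL with a query along these lines (a sketch; the foreman database name and the exact catalog columns may differ by version):

# List backends waiting on an ungranted lock, together with their queries.
echo "SELECT l.pid, l.relation::regclass AS relation, l.mode, a.query
      FROM pg_locks l
      JOIN pg_stat_activity a ON a.pid = l.pid
      WHERE NOT l.granted;" | sudo -u postgres psql -d foreman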

Comment 9 Chris Snyder 2019-10-17 15:15:22 UTC
*** Bug 1756955 has been marked as a duplicate of this bug. ***

Comment 11 Red Hat Bugzilla 2024-01-06 04:25:34 UTC
The needinfo request[s] on this closed bug have been removed as they have been unresolved for 120 days

