Bug 1600201 - [candlepin] processing virt-who report blocks RHSM certs checks, which can lead to 503 errors
Keywords:
Status: CLOSED WONTFIX
Alias: None
Product: Red Hat Satellite
Classification: Red Hat
Component: Candlepin
Version: 6.3.2
Hardware: Unspecified
OS: Unspecified
Priority: high
Severity: high
Target Milestone: Unspecified
Assignee: satellite6-bugs
QA Contact: jcallaha
URL:
Whiteboard:
Depends On: 1600594
Blocks:
 
Reported: 2018-07-11 16:38 UTC by Mike McCune
Modified: 2024-01-06 04:25 UTC
CC: 12 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Clones: 1600592, 1600593, 1600594
Environment:
Last Closed: 2019-10-29 13:27:22 UTC
Target Upstream Version:
Embargoed:



Description Mike McCune 2018-07-11 16:38:36 UTC
This is the Candlepin specific bug spawning from:

https://bugzilla.redhat.com/show_bug.cgi?id=1586210

"""
Description of problem:
Processing a virt-who report causes one specific RHSM request type to be blocked for some time. Since these requests are fired frequently, they can occupy the whole Passenger queue, at which point Passenger starts returning 503 errors.

Once the virt-who report processing is completed, the blocked RHSM requests are released. Even so, the 503 errors should not occur in the meantime.


Version-Release number of selected component (if applicable):
Sat 6.3.1


How reproducible:
100% on customer data; a generic reproducer should not be hard to develop

Steps to Reproduce:
(generic reproducer)
1. Have a few thousand systems registered, with the default certCheckInterval = 240 in rhsm.conf (the lower the value, the better for the reproducer)
2. Send a virt-who report with a mapping of several hundred systems
3. While the report is being processed, check the WebUI status or the httpd error logs

A particular reproducer that does not require a single registered host; the traffic is mimicked by specific curl requests:

A) To mimic the RHSM certs check: in fact, a single GET request to one particular URI is sufficient:

curl -s -u admin:changeme -X GET https://$(hostname -f)/rhsm/consumers/${uuid}/certificates/serials

(set uuid to various UUIDs of hosts / candlepin consumer IDs, and run these requests concurrently several times)
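A minimal sketch of one way to generate that concurrent traffic, assuming a hypothetical consumer-uuids.txt file with one consumer UUID per line (the file name and the admin:changeme credentials are illustrative):

# Fire the certificates/serials GET for every UUID in parallel, mimicking
# the periodic RHSM cert checks; consumer-uuids.txt is a hypothetical input file.
while read -r uuid; do
  curl -s -o /dev/null -w "%{http_code}\n" -u admin:changeme \
    -X GET "https://$(hostname -f)/rhsm/consumers/${uuid}/certificates/serials" &
done < consumer-uuids.txt
wait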

B) To mimic a virt-who report: have a virt-who-report.json with the hypervisor<->VMs mappings, and run:

time curl -s -u admin:changeme -X POST -H "Content-Type: application/json" -d @virt-who-report.json 'https://your.satellite/rhsm/hypervisors?owner=Owner&env=Library'
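For illustration only, a virt-who-report.json for the legacy (synchronous) hypervisor check-in could look roughly like the sketch below; the exact payload schema depends on the Candlepin and virt-who versions in use, so the field names here are assumptions:

# Hypothetical mapping file: each hypervisor UUID maps to a list of guest IDs.
cat > virt-who-report.json <<'EOF'
{
  "hypervisor-uuid-1": [
    {"guestId": "guest-uuid-a"},
    {"guestId": "guest-uuid-b"}
  ],
  "hypervisor-uuid-2": [
    {"guestId": "guest-uuid-c"}
  ]
}
EOF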



Actual results:
3. The WebUI shows 503 errors, and /var/log/httpd/error_log contains "Request queue is full. Returning an error" entries.


Expected results:
3. The WebUI remains accessible, and no such errors appear in the httpd logs.


Additional info:
Technical explanation of what goes wrong (to some extent):
- virt-who report processing updates the katello_subscription_facets postgres table inside a lengthy transaction (*)
- as a result, Katello::Api::Rhsm::CandlepinProxiesController#serials requests get stuck on the step:
 @host.subscription_facet.save!
for tens(!) of seconds, until the virt-who report is finished
- these requests come from the RHSM certs check queries, i.e. the URI /rhsm/consumers/${uuid}/certificates/serials
- the stuck requests accumulate during those tens of seconds, and under a higher load of them this can fill the whole Passenger request queue (see the monitoring sketch below)
- that consequently triggers the 503 errors
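A rough way to watch the queue fill up while the report is being processed (a sketch; the 5-second interval is arbitrary and passenger-status typically needs to be run as root):

# In one terminal: watch the Passenger request queue size; it grows while the
# report is processed and the 503s start once the queue is full.
watch -n 5 'passenger-status | grep -i "requests in queue"'

# In another terminal: watch for the queue-full errors in the httpd log.
tail -f /var/log/httpd/error_log | grep "Request queue is full"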

Particular reproducer on customer data to be provided in next comment.
"""

Comment 1 Mike McCune 2018-07-11 16:39:55 UTC
As noted here:


https://bugzilla.redhat.com/show_bug.cgi?id=1586210#c22

we are requesting that an index be added to the cp_consumer_facts table (with a proper name) to speed up virt-who processing. We saw 30-40 minute virt-who transactions reduced to ~5-6 minutes with the addition of this index.

"""
Ran the virt-who import and noticed that it is making a lot of relatively slow queries:

select cp_consumer.id from cp_consumer inner join cp_consumer_facts on cp_consumer.id = cp_consumer_facts.cp_consumer_id where cp_consumer_facts.mapkey = 'virt.uuid' and lower(cp_consumer_facts.element) in ( '?', '?') and cp_consumer.owner_id = '?' order by cp_consumer.updated desc

these were landing in the 200-300ms range.

I added an index:

echo "CREATE INDEX lower_case_test ON cp_consumer_facts ((lower(element)));" | sudo -u postgres psql -d candlepin


this dropped the above queries down into the 0.300ms range.

What this did was speed up the virt-who import from 30-45 minutes down to 6 minutes.

This helps overall performance, but we were still getting 503 errors during the 6-minute window of the virt-who import. I'd recommend creating this index in the short term to assist while we continue to investigate why the Satellite is unable to keep up with the load while the virt-who import is running.
"""
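For illustration, whether the planner actually uses the new index can be checked with EXPLAIN ANALYZE; the owner_id and guest virt.uuid literals below are placeholders:

# Placeholder values; substitute a real owner_id and guest virt.uuid from the deployment.
echo "EXPLAIN ANALYZE
      SELECT cp_consumer.id FROM cp_consumer
      INNER JOIN cp_consumer_facts ON cp_consumer.id = cp_consumer_facts.cp_consumer_id
      WHERE cp_consumer_facts.mapkey = 'virt.uuid'
        AND lower(cp_consumer_facts.element) IN ('<guest-virt-uuid>')
        AND cp_consumer.owner_id = '<owner-id>'
      ORDER BY cp_consumer.updated DESC;" | sudo -u postgres psql -d candlepin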

Comment 2 Kevin Howell 2018-07-12 14:45:49 UTC
Note, we'll need to evaluate this in the context of the recent changes to hypervisor check-in handling in https://github.com/candlepin/candlepin/pull/2035 . We may end up handling this differently in candlepin-2.1, candlepin-2.4, and master as a result...

Comment 3 Kevin Howell 2018-09-17 20:22:11 UTC
This may be resolved by use of async hypervisor check-ins. Note that these are used automatically by default if both virt-who and the version of Satellite in use support them.

Assuming I'm reading git histories correctly, looks like:
 virt-who-0.15-1
 katello-3.5.0

or greater will use async by default.
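For reference, a quick check of the installed versions (the katello package name below, tfm-rubygem-katello, is an assumption about how it is packaged on the Satellite):

# On the virt-who host: async hypervisor check-ins need virt-who >= 0.15-1.
rpm -q virt-who

# On the Satellite: async check-ins need katello >= 3.5.0 (package name assumed).
rpm -q tfm-rubygem-katello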

jsherril, can you confirm my statement about katello versions ^ ?

Comment 4 Justin Sherrill 2018-09-18 16:54:57 UTC
Kevin, 

yes, that is correct. However, the problem seemed to be that a long hypervisor check-in was holding locks on the database. This was a problem in Katello, but it was resolved as part of https://bugzilla.redhat.com/show_bug.cgi?id=1586210.

In addition, it seemed like the missing index that Mike pointed out was making the problem worse (by increasing the duration of the check-in). It's possible that these two fixes together will improve the situation enough, but possibly not. More investigation may be needed after the index is in place.
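For reference, lock waits caused by a long check-in transaction can be spotted on the Satellite's PostgreSQL with a query along these lines (a sketch; the foreman database name and the exact catalog columns may differ by version):

# List backends waiting on an ungranted lock, together with their queries.
echo "SELECT l.pid, l.relation::regclass AS relation, l.mode, a.query
      FROM pg_locks l
      JOIN pg_stat_activity a ON a.pid = l.pid
      WHERE NOT l.granted;" | sudo -u postgres psql -d foreman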

Comment 9 Chris Snyder 2019-10-17 15:15:22 UTC
*** Bug 1756955 has been marked as a duplicate of this bug. ***

Comment 11 Red Hat Bugzilla 2024-01-06 04:25:34 UTC
The needinfo request[s] on this closed bug have been removed as they have been unresolved for 120 days

