2027947 – HypervisorHeartbeatUpdateJob is taking long time to process and updates wrong consumer records




Keywords:
Status:	CLOSED ERRATA
Alias:	None
Product:	Red Hat Satellite
Classification:	Red Hat
Component:	Candlepin
Sub Component:
Version:	6.10.0
Hardware:	Unspecified
OS:	Unspecified
Priority:	unspecified
Severity:	medium
Target Milestone:	6.12.0
Assignee:	satellite6-bugs
QA Contact:	jcallaha
Docs Contact:
URL:
Whiteboard:
Depends On:	2028765 2028766
Blocks:

Reported:	2021-12-01 06:14 UTC by Hao Chang Yu
Modified:	2023-09-18 04:28 UTC
CC List:	5 users
Fixed In Version:	candlepin-4.0.14-1, candlepin-4.1.9-1
Doc Type:	If docs needed, set a value
Doc Text:
Clone Of:
Clones:	2028765 2028766
Environment:
Last Closed:	2022-11-16 13:33:03 UTC
Target Upstream Version:
Embargoed:


Links
System	ID	Private	Priority	Status	Summary	Last Updated
Red Hat Issue Tracker	SAT-13334	0	None	None	None	2022-10-08 04:44:22 UTC
Red Hat Product Errata	RHSA-2022:8506	0	None	None	None	2022-11-16 13:33:17 UTC

Description Hao Chang Yu 2021-12-01 06:14:33 UTC

Description of problem:
On a Satellite with a large number of hypervisors and registered consumers, the "HypervisorHeartbeatUpdateJob" takes a long time to run. While it runs, it also appears to block some Candlepin requests, such as /certificates/serials and /release.

These are the offending lines: https://github.com/candlepin/candlepin/blob/master/src/main/java/org/candlepin/model/ConsumerCurator.java#L637-L643


If I understand correctly, the query is supposed to update only the "lastcheckin" of the reported hypervisors. However, it appears to update every consumer in the organization. See the tests below:


# I have 5725 consumers in 'redhat' org
candlepin=# select count(*) from cp_consumer where owner_id in (select id from cp_owner where account = 'redhat');
 count 
-------
  5725
(1 row)

# Total hypervisors in 'redhat' org reported by 'my-report-id'
candlepin=# select count(*) from cp_consumer a join cp_consumer_hypervisor b on a.id = b.consumer_id join cp_owner c on c.id = a.owner_id where b.reporter_id = 'my-report-id' and c.account = 'redhat';
 count 
-------
  1221
(1 row)

# Run the query in https://github.com/candlepin/candlepin/blob/master/src/main/java/org/candlepin/model/ConsumerCurator.java#L637-L643
candlepin=# UPDATE cp_consumer SET lastcheckin = '2021-12-01 14:05:19.131+10' FROM cp_consumer a, cp_consumer_hypervisor b, cp_owner c WHERE a.id = b.consumer_id AND b.reporter_id =  'my-report-id' AND cp_consumer.owner_id = c.id AND c.account = 'redhat';
UPDATE 5725  <============================  Same as the total consumers in 'redhat' org
Time: 36482.580 ms (00:36.483) <================= Takes 36 seconds. It will take even longer for environments with 20k+


# I think the correct query should be:
candlepin=# UPDATE cp_consumer SET lastcheckin = '2021-12-01 14:05:19.131+10' FROM cp_consumer_hypervisor b, cp_owner c WHERE cp_consumer.id = b.consumer_id AND b.reporter_id = 'my-report-id' AND cp_consumer.owner_id = c.id AND c.account = 'redhat';
UPDATE 1221  <===== Equal to the total hypervisors in 'redhat' org
Time: 399.606 ms  <=========== Almost immediately
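
The difference between the two queries can be reproduced on toy data. The following is a minimal sketch, not the Candlepin code: it assumes an in-memory SQLite database in place of PostgreSQL, simplified table definitions with only the columns used here, and 10 consumers of which 3 are hypervisors. Because the set of rows an UPDATE ... FROM touches is the set of target rows that survive the join, the sketch counts distinct cp_consumer rows with an equivalent SELECT; the helper name targeted_rows is hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Simplified schema (assumption): only the columns the two queries reference.
conn.executescript("""
CREATE TABLE cp_owner (id TEXT PRIMARY KEY, account TEXT);
CREATE TABLE cp_consumer (id TEXT PRIMARY KEY, owner_id TEXT, lastcheckin TEXT);
CREATE TABLE cp_consumer_hypervisor (consumer_id TEXT, reporter_id TEXT);
""")
conn.execute("INSERT INTO cp_owner VALUES ('o1', 'redhat')")
# 10 consumers in the org; the first 3 are hypervisors reported by 'my-report-id'.
for i in range(10):
    conn.execute("INSERT INTO cp_consumer VALUES (?, 'o1', NULL)", (f"c{i}",))
for i in range(3):
    conn.execute(
        "INSERT INTO cp_consumer_hypervisor VALUES (?, 'my-report-id')", (f"c{i}",)
    )


def targeted_rows(from_and_where):
    """Count the distinct cp_consumer rows that the join would select for update."""
    sql = "SELECT COUNT(DISTINCT cp_consumer.id) FROM cp_consumer, " + from_and_where
    return conn.execute(sql).fetchone()[0]


# Original shape: alias "a" is joined to the hypervisor table, but the target
# table cp_consumer is only constrained by owner, so every consumer matches.
original = targeted_rows(
    "cp_consumer a, cp_consumer_hypervisor b, cp_owner c "
    "WHERE a.id = b.consumer_id AND b.reporter_id = 'my-report-id' "
    "AND cp_consumer.owner_id = c.id AND c.account = 'redhat'"
)

# Corrected shape: the target table itself is joined to the hypervisor table.
corrected = targeted_rows(
    "cp_consumer_hypervisor b, cp_owner c "
    "WHERE cp_consumer.id = b.consumer_id AND b.reporter_id = 'my-report-id' "
    "AND cp_consumer.owner_id = c.id AND c.account = 'redhat'"
)

print(original, corrected)
```

On this toy data the original shape selects all 10 consumers and the corrected shape selects only the 3 hypervisors, mirroring the 5725 versus 1221 counts above.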

Steps to Reproduce:
1. Register a few thousand hosts to the Satellite.
2. Configure virt-who to report at least 1000 hypervisors. You can use the fake report.
3. Run virt-who -do
4. Tail /var/log/candlepin/candlepin.log


Actual results:
HypervisorHeartbeatUpdateJob takes a long time to run.

Expected results:
HypervisorHeartbeatUpdateJob should finish quickly.

Additional info:


# explain UPDATE cp_consumer SET lastcheckin = '2021-12-01 14:05:19.131+10' FROM cp_consumer a, cp_consumer_hypervisor b, cp_owner c WHERE a.id = b.consumer_id AND b.reporter_id =  'my-report-id' AND cp_consumer.owner_id = c.id AND c.account = 'redhat';
                                                            QUERY PLAN                                                            
----------------------------------------------------------------------------------------------------------------------------------
 Update on cp_consumer  (cost=781.33..19166.56 rows=1167965 width=1702)
   ->  Hash Join  (cost=781.33..19166.56 rows=1167965 width=1702)
         Hash Cond: ((b.consumer_id)::text = (a.id)::text)
               ->  Nested Loop  (cost=0.41..15317.45 rows=1167965 width=1709) <======================= This is like looping over the whole tables
               ->  Seq Scan on cp_consumer_hypervisor b  (cost=0.00..90.85 rows=1223 width=39)
                     Filter: ((reporter_id)::text = 'my-report-id'::text)
               ->  Materialize  (cost=0.41..629.43 rows=955 width=1670)
                     ->  Nested Loop  (cost=0.41..624.65 rows=955 width=1670)
                           ->  Seq Scan on cp_owner c  (cost=0.00..1.07 rows=1 width=39)
                                 Filter: ((account)::text = 'redhat'::text)
                           ->  Index Scan using cp_consumer_owner_id_idx on cp_consumer  (cost=0.41..604.48 rows=1910 width=1664)
                                 Index Cond: ((owner_id)::text = (c.id)::text)
         ->  Hash  (cost=709.30..709.30 rows=5730 width=39)
               ->  Seq Scan on cp_consumer a  (cost=0.00..709.30 rows=5730 width=39)
(14 rows)
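
The row estimate on the outer Nested Loop is consistent with an unconstrained cross product of its two inputs, since the plan shows no join condition on that node. A short arithmetic check, using only the estimates printed in the plan above (the variable names are illustrative):

```python
# Planner estimates copied from the EXPLAIN output above.
hypervisor_rows = 1223    # Seq Scan on cp_consumer_hypervisor b (rows=1223)
materialized_rows = 955   # Materialize over cp_owner c joined to cp_consumer (rows=955)

# With no join condition between the two inputs, the estimate is their product.
nested_loop_estimate = hypervisor_rows * materialized_rows
print(nested_loop_estimate)  # 1167965, the rows= value on the Nested Loop node
```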

Comment 1 William Poteat 2021-12-02 15:20:26 UTC

Can we also get an explain plan on the updated query? Thanks.

Comment 2 William Poteat 2021-12-06 17:02:53 UTC

"If I understand correctly, the query is supposed to update only the "lastcheckin" of the reported hypervisors"

It is supposed to update any hypervisor with the corresponding reporter id and org. It is not determined by the contents of the hypervisor report.
In scenarios where none of the hypervisors have changed, the lastcheckin date is updated via the heartbeat, but no hypervisor report is sent at all for the HypervisorUpdateJob.

Comment 3 William Poteat 2021-12-06 17:54:48 UTC

I see where the current query gets it wrong and updates all rows for the org. Will fix.

Comment 6 jcallaha 2022-10-11 16:05:28 UTC

Verified in Satellite 6.12 Snap 14

Ran the hypervisor/guest flood script provided by https://github.com/JacobCallahan/content-host-d
python flood.py -s my.sat.host.com -m host --hypervisors 3000 --guests 1 -t ubi7 --exit-criteria reg --limit 25

With some additional test hypervisors included, this brought the total to 3,010.

candlepin=# select count(*) from cp_consumer a join cp_consumer_hypervisor b on a.id = b.consumer_id join cp_owner c on c.id = a.owner_id where c.account = 'Default_Organization';
 count 
-------
  3010
(1 row)

Later testing added another 1,010 hypervisors. These final 1,000 were the ones repeatedly submitted to the Satellite for updates.
The overall update job completed twice as fast as the initial report, in about 2m.

INFO  org.candlepin.async.JobManager - Job "Hypervisor Update" completed in 29693ms

Additionally, I decompiled the ConsumerCurator.class file and found the updated query associated with the heartbeat update.
That query matches the suggested changes.
query = "UPDATE cp_consumer consumer SET lastcheckin = :checkin FROM cp_consumer_hypervisor hypervisor, cp_owner owner WHERE consumer.id = hypervisor.consumer_id AND hypervisor.reporter_id = :reporter AND consumer.owner_id = owner.id AND owner.account = :ownerKey";

Comment 10 errata-xmlrpc 2022-11-16 13:33:03 UTC

Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Important: Satellite 6.12 Release), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2022:8506

Comment 11 Red Hat Bugzilla 2023-09-18 04:28:49 UTC

The needinfo request[s] on this closed bug have been removed as they have been unresolved for 120 days
