Bug 2027947

Summary:	HypervisorHeartbeatUpdateJob is taking long time to process and updates wrong consumer records
Product:	Red Hat Satellite	Reporter:	Hao Chang Yu <hyu>
Component:	Candlepin	Assignee:	satellite6-bugs <satellite6-bugs>
Status:	CLOSED ERRATA	QA Contact:	jcallaha
Severity:	medium	Docs Contact:
Priority:	unspecified
Version:	6.10.0	CC:	juwatts, nmoumoul, redakkan, wpoteat, zhunting
Target Milestone:	6.12.0	Keywords:	Triaged
Target Release:	Unused
Hardware:	Unspecified
OS:	Unspecified
Whiteboard:
Fixed In Version:	candlepin-4.0.14-1, candlepin-4.1.9-1	Doc Type:	If docs needed, set a value
Doc Text:		Story Points:	---
Clone Of:
Clones:	2028765 2028766 (view as bug list)		Environment:
Last Closed:	2022-11-16 13:33:03 UTC	Type:	Bug
Regression:	---	Mount Type:	---
Documentation:	---	CRM:
Verified Versions:		Category:	---
oVirt Team:	---	RHEL 7.3 requirements from Atomic Host:
Cloudforms Team:	---	Target Upstream Version:
Embargoed:
Bug Depends On:	2028765, 2028766
Bug Blocks:

Description Hao Chang Yu 2021-12-01 06:14:33 UTC

Description of problem:
Satellite with large number of hypervisors and registered consumers is taking long time to run the "HypervisorHeartbeatUpdateJob". When running this job, it seems to potentially block some Candlepin requests too, such as /certificates/serials, /release etc.

This is the offending lines: https://github.com/candlepin/candlepin/blob/master/src/main/java/org/candlepin/model/ConsumerCurator.java#L637-L643


If I understand correctly, the query is suppose to update only the "lastcheckin" of the reported hypervisors. However, this query seems to update all the consumers in the organization. See the below tests:


# I have 5725 consumers in 'redhat' org
candlepin=# select count(*) from cp_consumer where owner_id in (select id from cp_owner where account = 'redhat');
 count 
-------
  5725
(1 row)

# Total hypervisor in 'redhat' org
candlepin=# select count(*) from cp_consumer a join cp_consumer_hypervisor b on a.id = b.consumer_id join cp_owner c on c.id = a.owner_id where b.reporter_id = 'my-report-id' and c.account = 'redhat';
 count 
-------
  1221
(1 row)

# Run the query in https://github.com/candlepin/candlepin/blob/master/src/main/java/org/candlepin/model/ConsumerCurator.java#L637-L643
candlepin=# UPDATE cp_consumer SET lastcheckin = '2021-12-01 14:05:19.131+10' FROM cp_consumer a, cp_consumer_hypervisor b, cp_owner c WHERE a.id = b.consumer_id AND b.reporter_id =  'my-report-id' AND cp_consumer.owner_id = c.id AND c.account = 'redhat';
UPDATE 5725  <============================  Same as the total consumers in 'redhat' org
Time: 36482.580 ms (00:36.483) <================= Take 36 seconds. It will take even long for env with 20k+


# I think the correct query should be:
candlepin=# UPDATE cp_consumer SET lastcheckin = '2021-12-01 14:05:19.131+10' FROM cp_consumer_hypervisor b, cp_owner c WHERE cp_consumer.id = b.consumer_id AND b.reporter_id = 'my-report-id' AND cp_consumer.owner_id = c.id AND c.account = 'redhat';
UPDATE 1221  <===== Equal to the total hypervisors in 'redhat' org
Time: 399.606 ms  <=========== Almost immediately

Steps to Reproduce:
1. Register a few thousands of hosts to the Satellite.
2. Configure virt-who to report at least 1000 hypervisors. You can use the fake report.
3. Run virt-who -do
4. tail the /var/log/candlepin/candlepin.log


Actual results:
HypervisorHeartbeatUpdateJob takes long time to run

Expected results:
HypervisorHeartbeatUpdateJob should finish pretty quick.

Additional info:


# explain UPDATE cp_consumer SET lastcheckin = '2021-12-01 14:05:19.131+10' FROM cp_consumer a, cp_consumer_hypervisor b, cp_owner c WHERE a.id = b.consumer_id AND b.reporter_id =  'my-report-id' AND cp_consumer.owner_id = c.id AND c.account = 'redhat';
                                                            QUERY PLAN                                                            
----------------------------------------------------------------------------------------------------------------------------------
 Update on cp_consumer  (cost=781.33..19166.56 rows=1167965 width=1702)
   ->  Hash Join  (cost=781.33..19166.56 rows=1167965 width=1702)
         Hash Cond: ((b.consumer_id)::text = (a.id)::text)
         ->  Nested Loop  (cost=0.41..15317.45 rows=1167965 width=1709) <======================= Is like looping the whole tables
               ->  Seq Scan on cp_consumer_hypervisor b  (cost=0.00..90.85 rows=1223 width=39)
                     Filter: ((reporter_id)::text = 'my-report-id'::text)
               ->  Materialize  (cost=0.41..629.43 rows=955 width=1670)
                     ->  Nested Loop  (cost=0.41..624.65 rows=955 width=1670)
                           ->  Seq Scan on cp_owner c  (cost=0.00..1.07 rows=1 width=39)
                                 Filter: ((account)::text = 'redhat'::text)
                           ->  Index Scan using cp_consumer_owner_id_idx on cp_consumer  (cost=0.41..604.48 rows=1910 width=1664)
                                 Index Cond: ((owner_id)::text = (c.id)::text)
         ->  Hash  (cost=709.30..709.30 rows=5730 width=39)
               ->  Seq Scan on cp_consumer a  (cost=0.00..709.30 rows=5730 width=39)
(14 rows)

Comment 1 William Poteat 2021-12-02 15:20:26 UTC

Can we also get an explain plan on the updated query? Thanks.

Comment 2 William Poteat 2021-12-06 17:02:53 UTC

"If I understand correctly, the query is suppose to update only the "lastcheckin" of the reported hypervisors"

It is supposed to update any hypervisor with the corresponding reporter id and org. It is not determined by the contents of the hypervisor report.
In scenarios where none of the hypervisors have changed, the lastcheckin date is updated via the heartbeat, but there is not a hypervisor report sent at all for the HypervisorUpdateJob.

Comment 3 William Poteat 2021-12-06 17:54:48 UTC

I see where the current query gets it wrong and updates all rows for the org. Will fix.

Comment 6 jcallaha 2022-10-11 16:05:28 UTC

Verified in Satellite 6.12 Snap 14

Ran the hypervisor/guest flood script provided by https://github.com/JacobCallahan/content-host-d
python flood.py -s my.sat.host.com -m host --hypervisors 3000 --guests 1 -t ubi7 --exit-criteria reg --limit 25

with some additional test hypervisors included, this brought the total to 3,010

candlepin=# select count(*) from cp_consumer a join cp_consumer_hypervisor b on a.id = b.consumer_id join cp_owner c on c.id = a.owner_id where c.account = 'Default_Organization';
 count 
-------
  3010
(1 row)

Later, additional testing would add an additional 1,010 hypervisors. These final 1,000 are what was repeatedly submitted to the Satellite for updates.
The overall update job completed twice as fast as the initial report, in about 2m.

INFO  org.candlepin.async.JobManager - Job "Hypervisor Update" completed in 29693ms

Additionally, I decompiled the ConsumerCurator.class file and found the updated query associated with the heartbeat update.
That query matches the suggested changes.
query = "UPDATE cp_consumer consumer SET lastcheckin = :checkin FROM cp_consumer_hypervisor hypervisor, cp_owner owner WHERE consumer.id = hypervisor.consumer_id AND hypervisor.reporter_id = :reporter AND consumer.owner_id = owner.id AND owner.account = :ownerKey";

Comment 10 errata-xmlrpc 2022-11-16 13:33:03 UTC

Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Important: Satellite 6.12 Release), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2022:8506

Comment 11 Red Hat Bugzilla 2023-09-18 04:28:49 UTC

The needinfo request[s] on this closed bug have been removed as they have been unresolved for 120 days