2028765 – HypervisorHeartbeatUpdateJob is taking long time to process and updates wrong consumer records

Bug 2028765 - HypervisorHeartbeatUpdateJob is taking long time to process and updates wrong consumer records

Summary: HypervisorHeartbeatUpdateJob is taking long time to process and updates wrong...

Keywords:
Status:	CLOSED CURRENTRELEASE
Alias:	None
Product:	Candlepin
Classification:	Community
Component:	candlepin
Sub Component:
Version:	4.0
Hardware:	Unspecified
OS:	Unspecified
Priority:	unspecified
Severity:	medium
Target Milestone:	---
Target Release:	---
Assignee:	candlepin-bugs
QA Contact:
Docs Contact:
URL:
Whiteboard:
Depends On:
Blocks:	2027947
TreeView+	depends on / blocked

Reported:	2021-12-03 08:43 UTC by Nikos Moumoulidis
Modified:	2022-01-20 10:16 UTC (History)
CC List:	5 users (show)
Fixed In Version:	candlepin-4.1.9-1
Doc Type:	If docs needed, set a value
Doc Text:
Clone Of:	2027947
Environment:
Last Closed:	2022-01-20 10:16:49 UTC
Embargoed:

Attachments	(Terms of Use)

Links
System	ID	Private	Priority	Status	Summary	Last Updated
Github	candlepin candlepin pull 3177	0	None	Merged	2028765: Change to query for correct updating and performance [M]	2021-12-07 09:07:36 UTC

Description Nikos Moumoulidis 2021-12-03 08:43:10 UTC

+++ This bug was initially created as a clone of Bug #2027947 +++

Description of problem:
Satellite with large number of hypervisors and registered consumers is taking long time to run the "HypervisorHeartbeatUpdateJob". When running this job, it seems to potentially block some Candlepin requests too, such as /certificates/serials, /release etc.

This is the offending lines: https://github.com/candlepin/candlepin/blob/master/src/main/java/org/candlepin/model/ConsumerCurator.java#L637-L643


If I understand correctly, the query is suppose to update only the "lastcheckin" of the reported hypervisors. However, this query seems to update all the consumers in the organization. See the below tests:


# I have 5725 consumers in 'redhat' org
candlepin=# select count(*) from cp_consumer where owner_id in (select id from cp_owner where account = 'redhat');
 count 
-------
  5725
(1 row)

# Total hypervisor in 'redhat' org
candlepin=# select count(*) from cp_consumer a join cp_consumer_hypervisor b on a.id = b.consumer_id join cp_owner c on c.id = a.owner_id where b.reporter_id = 'my-report-id' and c.account = 'redhat';
 count 
-------
  1221
(1 row)

# Run the query in https://github.com/candlepin/candlepin/blob/master/src/main/java/org/candlepin/model/ConsumerCurator.java#L637-L643
candlepin=# UPDATE cp_consumer SET lastcheckin = '2021-12-01 14:05:19.131+10' FROM cp_consumer a, cp_consumer_hypervisor b, cp_owner c WHERE a.id = b.consumer_id AND b.reporter_id =  'my-report-id' AND cp_consumer.owner_id = c.id AND c.account = 'redhat';
UPDATE 5725  <============================  Same as the total consumers in 'redhat' org
Time: 36482.580 ms (00:36.483) <================= Take 36 seconds. It will take even long for env with 20k+


# I think the correct query should be:
candlepin=# UPDATE cp_consumer SET lastcheckin = '2021-12-01 14:05:19.131+10' FROM cp_consumer_hypervisor b, cp_owner c WHERE cp_consumer.id = b.consumer_id AND b.reporter_id = 'my-report-id' AND cp_consumer.owner_id = c.id AND c.account = 'redhat';
UPDATE 1221  <===== Equal to the total hypervisors in 'redhat' org
Time: 399.606 ms  <=========== Almost immediately

Steps to Reproduce:
1. Register a few thousands of hosts to the Satellite.
2. Configure virt-who to report at least 1000 hypervisors. You can use the fake report.
3. Run virt-who -do
4. tail the /var/log/candlepin/candlepin.log


Actual results:
HypervisorHeartbeatUpdateJob takes long time to run

Expected results:
HypervisorHeartbeatUpdateJob should finish pretty quick.

Additional info:


# explain UPDATE cp_consumer SET lastcheckin = '2021-12-01 14:05:19.131+10' FROM cp_consumer a, cp_consumer_hypervisor b, cp_owner c WHERE a.id = b.consumer_id AND b.reporter_id =  'my-report-id' AND cp_consumer.owner_id = c.id AND c.account = 'redhat';
                                                            QUERY PLAN                                                            
----------------------------------------------------------------------------------------------------------------------------------
 Update on cp_consumer  (cost=781.33..19166.56 rows=1167965 width=1702)
   ->  Hash Join  (cost=781.33..19166.56 rows=1167965 width=1702)
         Hash Cond: ((b.consumer_id)::text = (a.id)::text)
         ->  Nested Loop  (cost=0.41..15317.45 rows=1167965 width=1709) <======================= Is like looping the whole tables
               ->  Seq Scan on cp_consumer_hypervisor b  (cost=0.00..90.85 rows=1223 width=39)
                     Filter: ((reporter_id)::text = 'my-report-id'::text)
               ->  Materialize  (cost=0.41..629.43 rows=955 width=1670)
                     ->  Nested Loop  (cost=0.41..624.65 rows=955 width=1670)
                           ->  Seq Scan on cp_owner c  (cost=0.00..1.07 rows=1 width=39)
                                 Filter: ((account)::text = 'redhat'::text)
                           ->  Index Scan using cp_consumer_owner_id_idx on cp_consumer  (cost=0.41..604.48 rows=1910 width=1664)
                                 Index Cond: ((owner_id)::text = (c.id)::text)
         ->  Hash  (cost=709.30..709.30 rows=5730 width=39)
               ->  Seq Scan on cp_consumer a  (cost=0.00..709.30 rows=5730 width=39)
(14 rows)

--- Additional comment from William Poteat on 2021-12-02 15:20:26 UTC ---

Can we also get an explain plan on the updated query? Thanks.

Comment 1 Hao Chang Yu 2021-12-03 09:35:22 UTC

The explain plan for updated query as requested

# explain UPDATE cp_consumer SET lastcheckin = '2021-12-01 14:05:19.131+10' FROM cp_consumer_hypervisor b, cp_owner c WHERE cp_consumer.id = b.consumer_id AND b.reporter_id = 'my-report-id' AND cp_consumer.owner_id = c.id AND c.account = 'redhat';
                                                           QUERY PLAN                                                            
---------------------------------------------------------------------------------------------------------------------------------
 Update on cp_consumer  (cost=106.54..748.10 rows=204 width=1696)
   ->  Hash Join  (cost=106.54..748.10 rows=204 width=1696)
         Hash Cond: ((cp_consumer.id)::text = (b.consumer_id)::text)
         ->  Nested Loop  (cost=0.41..639.45 rows=955 width=1670)
               ->  Seq Scan on cp_owner c  (cost=0.00..1.07 rows=1 width=39)
                     Filter: ((account)::text = 'redhat'::text)
               ->  Index Scan using cp_consumer_owner_id_idx on cp_consumer  (cost=0.41..619.28 rows=1910 width=1664)
                     Index Cond: ((owner_id)::text = (c.id)::text)
         ->  Hash  (cost=90.85..90.85 rows=1223 width=39)
               ->  Seq Scan on cp_consumer_hypervisor b  (cost=0.00..90.85 rows=1223 width=39)
                     Filter: ((reporter_id)::text = 'my-report-id'::text)

Note You need to log in before you can comment on or make changes to this bug.