Bug 2028765

Summary: HypervisorHeartbeatUpdateJob is taking long time to process and updates wrong consumer records
Product: [Community] Candlepin Reporter: Nikos Moumoulidis <nmoumoul>
Component: candlepinAssignee: candlepin-bugs
Status: CLOSED CURRENTRELEASE QA Contact:
Severity: medium Docs Contact:
Priority: unspecified    
Version: 4.0CC: bcourt, dsynk, hyu, redakkan, wpoteat
Target Milestone: ---Keywords: Triaged
Target Release: ---   
Hardware: Unspecified   
OS: Unspecified   
Whiteboard:
Fixed In Version: candlepin-4.1.9-1 Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of: 2027947 Environment:
Last Closed: 2022-01-20 10:16:49 UTC Type: ---
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: --- RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Embargoed:
Bug Depends On:    
Bug Blocks: 2027947    

Description Nikos Moumoulidis 2021-12-03 08:43:10 UTC
+++ This bug was initially created as a clone of Bug #2027947 +++

Description of problem:
Satellite with large number of hypervisors and registered consumers is taking long time to run the "HypervisorHeartbeatUpdateJob". When running this job, it seems to potentially block some Candlepin requests too, such as /certificates/serials, /release etc.

This is the offending lines: https://github.com/candlepin/candlepin/blob/master/src/main/java/org/candlepin/model/ConsumerCurator.java#L637-L643


If I understand correctly, the query is suppose to update only the "lastcheckin" of the reported hypervisors. However, this query seems to update all the consumers in the organization. See the below tests:


# I have 5725 consumers in 'redhat' org
candlepin=# select count(*) from cp_consumer where owner_id in (select id from cp_owner where account = 'redhat');
 count 
-------
  5725
(1 row)

# Total hypervisor in 'redhat' org
candlepin=# select count(*) from cp_consumer a join cp_consumer_hypervisor b on a.id = b.consumer_id join cp_owner c on c.id = a.owner_id where b.reporter_id = 'my-report-id' and c.account = 'redhat';
 count 
-------
  1221
(1 row)

# Run the query in https://github.com/candlepin/candlepin/blob/master/src/main/java/org/candlepin/model/ConsumerCurator.java#L637-L643
candlepin=# UPDATE cp_consumer SET lastcheckin = '2021-12-01 14:05:19.131+10' FROM cp_consumer a, cp_consumer_hypervisor b, cp_owner c WHERE a.id = b.consumer_id AND b.reporter_id =  'my-report-id' AND cp_consumer.owner_id = c.id AND c.account = 'redhat';
UPDATE 5725  <============================  Same as the total consumers in 'redhat' org
Time: 36482.580 ms (00:36.483) <================= Take 36 seconds. It will take even long for env with 20k+


# I think the correct query should be:
candlepin=# UPDATE cp_consumer SET lastcheckin = '2021-12-01 14:05:19.131+10' FROM cp_consumer_hypervisor b, cp_owner c WHERE cp_consumer.id = b.consumer_id AND b.reporter_id = 'my-report-id' AND cp_consumer.owner_id = c.id AND c.account = 'redhat';
UPDATE 1221  <===== Equal to the total hypervisors in 'redhat' org
Time: 399.606 ms  <=========== Almost immediately

Steps to Reproduce:
1. Register a few thousands of hosts to the Satellite.
2. Configure virt-who to report at least 1000 hypervisors. You can use the fake report.
3. Run virt-who -do
4. tail the /var/log/candlepin/candlepin.log


Actual results:
HypervisorHeartbeatUpdateJob takes long time to run

Expected results:
HypervisorHeartbeatUpdateJob should finish pretty quick.

Additional info:


# explain UPDATE cp_consumer SET lastcheckin = '2021-12-01 14:05:19.131+10' FROM cp_consumer a, cp_consumer_hypervisor b, cp_owner c WHERE a.id = b.consumer_id AND b.reporter_id =  'my-report-id' AND cp_consumer.owner_id = c.id AND c.account = 'redhat';
                                                            QUERY PLAN                                                            
----------------------------------------------------------------------------------------------------------------------------------
 Update on cp_consumer  (cost=781.33..19166.56 rows=1167965 width=1702)
   ->  Hash Join  (cost=781.33..19166.56 rows=1167965 width=1702)
         Hash Cond: ((b.consumer_id)::text = (a.id)::text)
         ->  Nested Loop  (cost=0.41..15317.45 rows=1167965 width=1709) <======================= Is like looping the whole tables
               ->  Seq Scan on cp_consumer_hypervisor b  (cost=0.00..90.85 rows=1223 width=39)
                     Filter: ((reporter_id)::text = 'my-report-id'::text)
               ->  Materialize  (cost=0.41..629.43 rows=955 width=1670)
                     ->  Nested Loop  (cost=0.41..624.65 rows=955 width=1670)
                           ->  Seq Scan on cp_owner c  (cost=0.00..1.07 rows=1 width=39)
                                 Filter: ((account)::text = 'redhat'::text)
                           ->  Index Scan using cp_consumer_owner_id_idx on cp_consumer  (cost=0.41..604.48 rows=1910 width=1664)
                                 Index Cond: ((owner_id)::text = (c.id)::text)
         ->  Hash  (cost=709.30..709.30 rows=5730 width=39)
               ->  Seq Scan on cp_consumer a  (cost=0.00..709.30 rows=5730 width=39)
(14 rows)

--- Additional comment from William Poteat on 2021-12-02 15:20:26 UTC ---

Can we also get an explain plan on the updated query? Thanks.

Comment 1 Hao Chang Yu 2021-12-03 09:35:22 UTC
The explain plan for updated query as requested

# explain UPDATE cp_consumer SET lastcheckin = '2021-12-01 14:05:19.131+10' FROM cp_consumer_hypervisor b, cp_owner c WHERE cp_consumer.id = b.consumer_id AND b.reporter_id = 'my-report-id' AND cp_consumer.owner_id = c.id AND c.account = 'redhat';
                                                           QUERY PLAN                                                            
---------------------------------------------------------------------------------------------------------------------------------
 Update on cp_consumer  (cost=106.54..748.10 rows=204 width=1696)
   ->  Hash Join  (cost=106.54..748.10 rows=204 width=1696)
         Hash Cond: ((cp_consumer.id)::text = (b.consumer_id)::text)
         ->  Nested Loop  (cost=0.41..639.45 rows=955 width=1670)
               ->  Seq Scan on cp_owner c  (cost=0.00..1.07 rows=1 width=39)
                     Filter: ((account)::text = 'redhat'::text)
               ->  Index Scan using cp_consumer_owner_id_idx on cp_consumer  (cost=0.41..619.28 rows=1910 width=1664)
                     Index Cond: ((owner_id)::text = (c.id)::text)
         ->  Hash  (cost=90.85..90.85 rows=1223 width=39)
               ->  Seq Scan on cp_consumer_hypervisor b  (cost=0.00..90.85 rows=1223 width=39)
                     Filter: ((reporter_id)::text = 'my-report-id'::text)