Description of problem: Satellite with large number of hypervisors and registered consumers is taking long time to run the "HypervisorHeartbeatUpdateJob". When running this job, it seems to potentially block some Candlepin requests too, such as /certificates/serials, /release etc. This is the offending lines: https://github.com/candlepin/candlepin/blob/master/src/main/java/org/candlepin/model/ConsumerCurator.java#L637-L643 If I understand correctly, the query is suppose to update only the "lastcheckin" of the reported hypervisors. However, this query seems to update all the consumers in the organization. See the below tests: # I have 5725 consumers in 'redhat' org candlepin=# select count(*) from cp_consumer where owner_id in (select id from cp_owner where account = 'redhat'); count ------- 5725 (1 row) # Total hypervisor in 'redhat' org candlepin=# select count(*) from cp_consumer a join cp_consumer_hypervisor b on a.id = b.consumer_id join cp_owner c on c.id = a.owner_id where b.reporter_id = 'my-report-id' and c.account = 'redhat'; count ------- 1221 (1 row) # Run the query in https://github.com/candlepin/candlepin/blob/master/src/main/java/org/candlepin/model/ConsumerCurator.java#L637-L643 candlepin=# UPDATE cp_consumer SET lastcheckin = '2021-12-01 14:05:19.131+10' FROM cp_consumer a, cp_consumer_hypervisor b, cp_owner c WHERE a.id = b.consumer_id AND b.reporter_id = 'my-report-id' AND cp_consumer.owner_id = c.id AND c.account = 'redhat'; UPDATE 5725 <============================ Same as the total consumers in 'redhat' org Time: 36482.580 ms (00:36.483) <================= Take 36 seconds. It will take even long for env with 20k+ # I think the correct query should be: candlepin=# UPDATE cp_consumer SET lastcheckin = '2021-12-01 14:05:19.131+10' FROM cp_consumer_hypervisor b, cp_owner c WHERE cp_consumer.id = b.consumer_id AND b.reporter_id = 'my-report-id' AND cp_consumer.owner_id = c.id AND c.account = 'redhat'; UPDATE 1221 <===== Equal to the total hypervisors in 'redhat' org Time: 399.606 ms <=========== Almost immediately Steps to Reproduce: 1. Register a few thousands of hosts to the Satellite. 2. Configure virt-who to report at least 1000 hypervisors. You can use the fake report. 3. Run virt-who -do 4. tail the /var/log/candlepin/candlepin.log Actual results: HypervisorHeartbeatUpdateJob takes long time to run Expected results: HypervisorHeartbeatUpdateJob should finish pretty quick. Additional info: # explain UPDATE cp_consumer SET lastcheckin = '2021-12-01 14:05:19.131+10' FROM cp_consumer a, cp_consumer_hypervisor b, cp_owner c WHERE a.id = b.consumer_id AND b.reporter_id = 'my-report-id' AND cp_consumer.owner_id = c.id AND c.account = 'redhat'; QUERY PLAN ---------------------------------------------------------------------------------------------------------------------------------- Update on cp_consumer (cost=781.33..19166.56 rows=1167965 width=1702) -> Hash Join (cost=781.33..19166.56 rows=1167965 width=1702) Hash Cond: ((b.consumer_id)::text = (a.id)::text) -> Nested Loop (cost=0.41..15317.45 rows=1167965 width=1709) <======================= Is like looping the whole tables -> Seq Scan on cp_consumer_hypervisor b (cost=0.00..90.85 rows=1223 width=39) Filter: ((reporter_id)::text = 'my-report-id'::text) -> Materialize (cost=0.41..629.43 rows=955 width=1670) -> Nested Loop (cost=0.41..624.65 rows=955 width=1670) -> Seq Scan on cp_owner c (cost=0.00..1.07 rows=1 width=39) Filter: ((account)::text = 'redhat'::text) -> Index Scan using cp_consumer_owner_id_idx on cp_consumer (cost=0.41..604.48 rows=1910 width=1664) Index Cond: ((owner_id)::text = (c.id)::text) -> Hash (cost=709.30..709.30 rows=5730 width=39) -> Seq Scan on cp_consumer a (cost=0.00..709.30 rows=5730 width=39) (14 rows)
Can we also get an explain plan on the updated query? Thanks.
"If I understand correctly, the query is suppose to update only the "lastcheckin" of the reported hypervisors" It is supposed to update any hypervisor with the corresponding reporter id and org. It is not determined by the contents of the hypervisor report. In scenarios where none of the hypervisors have changed, the lastcheckin date is updated via the heartbeat, but there is not a hypervisor report sent at all for the HypervisorUpdateJob.
I see where the current query gets it wrong and updates all rows for the org. Will fix.
Verified in Satellite 6.12 Snap 14 Ran the hypervisor/guest flood script provided by https://github.com/JacobCallahan/content-host-d python flood.py -s my.sat.host.com -m host --hypervisors 3000 --guests 1 -t ubi7 --exit-criteria reg --limit 25 with some additional test hypervisors included, this brought the total to 3,010 candlepin=# select count(*) from cp_consumer a join cp_consumer_hypervisor b on a.id = b.consumer_id join cp_owner c on c.id = a.owner_id where c.account = 'Default_Organization'; count ------- 3010 (1 row) Later, additional testing would add an additional 1,010 hypervisors. These final 1,000 are what was repeatedly submitted to the Satellite for updates. The overall update job completed twice as fast as the initial report, in about 2m. INFO org.candlepin.async.JobManager - Job "Hypervisor Update" completed in 29693ms Additionally, I decompiled the ConsumerCurator.class file and found the updated query associated with the heartbeat update. That query matches the suggested changes. query = "UPDATE cp_consumer consumer SET lastcheckin = :checkin FROM cp_consumer_hypervisor hypervisor, cp_owner owner WHERE consumer.id = hypervisor.consumer_id AND hypervisor.reporter_id = :reporter AND consumer.owner_id = owner.id AND owner.account = :ownerKey";
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (Important: Satellite 6.12 Release), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHSA-2022:8506
The needinfo request[s] on this closed bug have been removed as they have been unresolved for 120 days