1435013 – [RFE] Randomize and/or Distribute the execution of rhsmcertd over a large Satellite 6 Deployment


Bug 1435013 - [RFE] Randomize and/or Distribute the execution of rhsmcertd over a large Satellite 6 Deployment


Keywords:
Status:	CLOSED ERRATA
Alias:	None
Product:	Red Hat Enterprise Linux 7
Classification:	Red Hat
Component:	subscription-manager
Sub Component:
Version:	7.3
Hardware:	All
OS:	Linux
Priority:	unspecified
Severity:	medium
Target Milestone:	rc
Target Release:	---
Assignee:	Chris Snyder
QA Contact:	John Sefler
Docs Contact:
URL:
Whiteboard:
Depends On:
Blocks:	sat6-fe-walmart

Reported:	2017-03-22 21:57 UTC by Jason Dickerson
Modified:	2023-06-04 18:27 UTC
CC List:	6 users
Fixed In Version:	subscription-manager-1.19.6-1.el7
Doc Type:	If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed:	2017-08-01 19:21:47 UTC
Target Upstream Version:
Embargoed:


Links
System	ID	Priority	Status	Summary	Last Updated
Github	candlepin subscription-manager pull 1577	None	closed	1435013: Add splay to all checks done by rhsmcertd	2020-02-20 16:45:51 UTC
Red Hat Bugzilla	1440251	high	CLOSED	Building of rhsmcertd is broken at RHEL	2021-02-22 00:41:40 UTC
Red Hat Product Errata	RHBA-2017:2083	normal	SHIPPED_LIVE	python-rhsm and subscription-manager bug fix and enhancement update	2017-08-01 18:14:19 UTC

Internal Links: 1440251

Description Jason Dickerson 2017-03-22 21:57:58 UTC

1. Proposed title of this feature request

[RFE] Randomize and/or Distribute the execution of rhsmcertd over a large Satellite 6 Deployment

3. What is the nature and description of the request?

I have a Satellite 6 deployment with roughly 56,000 hosts. We have changed the check-in interval for rhsmcertd from the default 4h to 8h. Every 8 hours, a large number of hosts check in at the same time. This drives Passenger usage on the Satellite to the limit and beyond. We need a way to spread the check-ins over time so that we do not exceed the Passenger queue, which results in the Satellite 6 UI becoming unresponsive.

On large-scale Satellite deployments, the Passenger queue is maxed out for a time while it attempts to process all the rhsmcertd requests, and the UI is unresponsive until the queue drains.

4. Why does the customer need this? (List the business requirements here)

This causes periods of time where the Satellite UI is under heavy load and unresponsive.

5. How would the customer like to achieve this? (List the functional requirements here)

rhsmcertd should randomize its start time to allow for such situations and avoid maxing out the Passenger queue on the Satellite. Otherwise, another mechanism should be used to ensure that rhsmcertd check-ins do not occur at the same time for a large group of hosts.
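
To illustrate the request, here is a minimal sketch of why a random start offset flattens the check-in peak. It is a toy model, not rhsmcertd code: the function name `peak_checkins_per_minute` is hypothetical, and it assumes every host starts its timer at the same moment and that the offset is drawn uniformly over the check-in interval.

```python
import random
from collections import Counter


def peak_checkins_per_minute(n_hosts, interval_seconds, splay, seed=0):
    """Return the largest number of first check-ins landing in any one minute.

    Toy model (assumption): all hosts start together; with splay on, each
    host adds a uniform random offset in [0, interval_seconds].
    """
    rng = random.Random(seed)
    per_minute = Counter()
    for _ in range(n_hosts):
        offset = rng.randint(0, interval_seconds) if splay else 0
        per_minute[offset // 60] += 1
    return max(per_minute.values())


if __name__ == "__main__":
    eight_hours = 8 * 3600
    print("no splay:", peak_checkins_per_minute(56000, eight_hours, splay=False))
    print("splay   :", peak_checkins_per_minute(56000, eight_hours, splay=True))
```

Without an offset, every host lands in the same minute, so the peak equals the host count; with an offset, the same hosts are spread over the whole interval.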

6. For each functional requirement listed, specify how Red Hat and the customer can test to confirm the requirement is successfully implemented.

Restart rhsmcertd on a large number of hosts and determine whether they all check in at the same time.

7. Is there already an existing RFE upstream or in Red Hat Bugzilla?

Not that I know of.

8. Does the customer have any specific timeline dependencies and which release would they like to target (i.e. RHEL5, RHEL6)?

ASAP. This is impacting their ability to use the Satellite UI.

9. Is the sales team involved in this request and do they have any additional input?

Yes, and I do not believe so at this time.

10. List any affected packages or components.

subscription-manager on RHEL 6 and 7
Satellite 6.2.8

11. Would the customer be able to assist in testing this functionality if implemented?

Absolutely

Comment 3 Chris Snyder 2017-04-06 18:54:55 UTC

I have added an external tracker to a PR against upstream rhsmcertd for an implementation of this feature.

Comment 6 Rehana 2017-05-10 12:40:09 UTC

Below are the test scenarios used to verify the bug on the latest subscription-manager build:
subscription-manager: 1.19.12-1.el7
python-rhsm: 1.19.6-1.el7

1) Demonstrates that the initial auto-attach (auto-heal) and cert checks are randomized between two guest machines when the new configuration parameter "splay" is set to 1 (ON).
2) Demonstrates the original behaviour (i.e., the initial check happens after the default 2-minute delay) when "splay" is set to 0 (OFF).

1) Scenario 1: Demonstrates that the initial auto-attach (auto-heal) and cert checks are randomized between two guest machines when the new configuration parameter "splay" is at its default value of 1 (ON).
On machine 1 :
--------------------

1: Register guest machine 1 to the server
2: Make sure the auto-attach interval, cert-check interval, and splay are at their default values
[rhsmcertd]
   autoattachinterval = [1440]
   certcheckinterval = [240]
   splay = 1

3: Restart rhsmcertd and check rhsmcertd.log
[root@dhcp151-211 ~]# service rhsmcertd restart
Redirecting to /bin/systemctl restart rhsmcertd.service
[root@dhcp151-211 ~]# tail -f /var/log/rhsm/rhsmcertd.log 
Tue May  9 20:12:27 2017 [INFO] (Cert Check) Certificates updated.
Wed May 10 00:12:31 2017 [INFO] (Cert Check) Certificates updated.
Wed May 10 04:12:29 2017 [INFO] (Cert Check) Certificates updated.
Wed May 10 05:35:57 2017 [WARN] (Auto-attach) Update failed (255), retry will occur on next run.
Wed May 10 06:44:02 2017 [INFO] rhsmcertd is shutting down...
Wed May 10 06:44:02 2017 [INFO] Starting rhsmcertd...
Wed May 10 06:44:02 2017 [INFO] Auto-attach interval: 1440.0 minutes [86400 seconds]
Wed May 10 06:44:02 2017 [INFO] Cert check interval: 240.0 minutes [14400 seconds]
Wed May 10 06:44:02 2017 [INFO] Waiting 2.0 minutes plus 28937 splay seconds [29057 seconds total] before performing first auto-attach.
Wed May 10 06:44:02 2017 [INFO] Waiting 2.0 minutes plus 10726 splay seconds [10846 seconds total] before performing first cert check.

^^ Notice the random splay seconds on guest machine 1: the first auto-attach on this machine will run 29057 seconds after startup, and the first cert check 10846 seconds after startup.

On machine 2 : 
---------------------

1: Register guest machine 2 to the server
2: Make sure the auto-attach interval, cert-check interval, and splay are at their default values
[rhsmcertd]
   autoattachinterval = [1440]
   certcheckinterval = [240]
   splay = 1

3: Restart rhsmcertd and check rhsmcertd.log

[root@dhcp35-238 ~]# service rhsmcertd restart
Redirecting to /bin/systemctl restart rhsmcertd.service
[root@dhcp35-238 ~]# tail -f /var/log/rhsm/rhsmcertd.log 
Wed May 10 15:47:39 2017 [INFO] Auto-attach interval: 1440.0 minutes [86400 seconds]
Wed May 10 15:47:39 2017 [INFO] Cert check interval: 240.0 minutes [14400 seconds]
Wed May 10 15:47:39 2017 [INFO] Waiting 2.0 minutes plus 46554 splay seconds [46674 seconds total] before performing first auto-attach.
Wed May 10 15:47:39 2017 [INFO] Waiting 2.0 minutes plus 9072 splay seconds [9192 seconds total] before performing first cert check.
Wed May 10 16:14:16 2017 [INFO] rhsmcertd is shutting down...
Wed May 10 16:14:16 2017 [INFO] Starting rhsmcertd...
Wed May 10 16:14:16 2017 [INFO] Auto-attach interval: 1440.0 minutes [86400 seconds]
Wed May 10 16:14:16 2017 [INFO] Cert check interval: 240.0 minutes [14400 seconds]
Wed May 10 16:14:16 2017 [INFO] Waiting 2.0 minutes plus 62750 splay seconds [62870 seconds total] before performing first auto-attach.
Wed May 10 16:14:16 2017 [INFO] Waiting 2.0 minutes plus 3288 splay seconds [3408 seconds total] before performing first cert check.

^^ Notice the random splay seconds on guest machine 2: after the most recent restart, the first auto-attach on this machine will run 62870 seconds after startup, and the first cert check 3408 seconds after startup.

Thus, with the new rhsm config parameter 'splay' set to "1", rhsmcertd runs its checks at different times on each machine, thereby reducing the load on the server when a large number of machines restart simultaneously.
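
The totals in the "Waiting ..." log lines above can be checked mechanically. The following is a minimal sketch for doing so; the regular expression is written against the log lines quoted in this comment only, and the names `WAIT_RE` and `parse_wait_line` are hypothetical helpers, not part of subscription-manager.

```python
import re

# Pattern derived from the rhsmcertd.log lines quoted above (assumption:
# other builds may word the message differently).
WAIT_RE = re.compile(
    r"Waiting (?P<minutes>[\d.]+) minutes plus (?P<splay>\d+) splay seconds "
    r"\[(?P<total>\d+) seconds total\] before performing first "
    r"(?P<check>auto-attach|cert check)\."
)


def parse_wait_line(line):
    """Parse one 'Waiting ...' line; return None if the line does not match."""
    match = WAIT_RE.search(line)
    if match is None:
        return None
    base = int(float(match.group("minutes")) * 60)
    splay = int(match.group("splay"))
    total = int(match.group("total"))
    return {
        "check": match.group("check"),
        "base_seconds": base,
        "splay_seconds": splay,
        "total_seconds": total,
        # The reported total should be the base delay plus the splay.
        "consistent": base + splay == total,
    }


if __name__ == "__main__":
    sample = (
        "Wed May 10 06:44:02 2017 [INFO] Waiting 2.0 minutes plus 28937 splay "
        "seconds [29057 seconds total] before performing first auto-attach."
    )
    print(parse_wait_line(sample))
```

Comparing `splay_seconds` across hosts after a mass restart is one way to confirm that the first checks are actually spread out.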

2) Scenario 2: Demonstrates the original behaviour (i.e., the initial check happens after the default 2-minute delay) when "splay" is set to 0 (OFF).

[root@dhcp35-238 ~]# subscription-manager config --rhsmcertd.splay 0

[root@dhcp35-238 ~]# service rhsmcertd restart
Redirecting to /bin/systemctl restart rhsmcertd.service

[root@dhcp35-238 ~]# tail -f /var/log/rhsm/rhsmcertd.log 
Wed May 10 16:14:16 2017 [INFO] Auto-attach interval: 1440.0 minutes [86400 seconds]
Wed May 10 16:14:16 2017 [INFO] Cert check interval: 240.0 minutes [14400 seconds]
Wed May 10 16:14:16 2017 [INFO] Waiting 2.0 minutes plus 62750 splay seconds [62870 seconds total] before performing first auto-attach.
Wed May 10 16:14:16 2017 [INFO] Waiting 2.0 minutes plus 3288 splay seconds [3408 seconds total] before performing first cert check.
Wed May 10 16:43:57 2017 [INFO] rhsmcertd is shutting down...
Wed May 10 16:43:57 2017 [INFO] Starting rhsmcertd...
Wed May 10 16:43:57 2017 [INFO] Auto-attach interval: 1440.0 minutes [86400 seconds]
Wed May 10 16:43:57 2017 [INFO] Cert check interval: 240.0 minutes [14400 seconds]
Wed May 10 16:43:57 2017 [INFO] Waiting 2.0 minutes plus 0 splay seconds [120 seconds total] before performing first auto-attach.
Wed May 10 16:43:57 2017 [INFO] Waiting 2.0 minutes plus 0 splay seconds [120 seconds total] before performing first cert check.

^^ Notice that the random splay seconds are no longer applied, so the initial checks default to running after 2 minutes.

Conclusion : 
===========
When splay is set to 1, the randomized splay value is always between 0 and the interval being randomized.
Example: for auto-attach, the splay amount is between 0 and 86400 seconds (with the default autoattachinterval of 1440 minutes, i.e. 86400 seconds).

When splay is set to 0, rhsmcertd falls back to performing the initial checks after the fixed 2-minute delay.
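
The delay arithmetic summarized in this conclusion can be sketched as follows. This is an illustration under stated assumptions, not the rhsmcertd source: the names `BASE_DELAY_SECONDS` and `initial_delay_seconds` are hypothetical, and the uniform integer draw is an assumption that is merely consistent with the values observed in the logs.

```python
import random

# The 2.0-minute initial wait reported in the rhsmcertd.log lines above.
BASE_DELAY_SECONDS = 120


def initial_delay_seconds(interval_minutes, splay, rng=random):
    """Return (splay_seconds, total_seconds) before the first check.

    Hypothetical helper: with splay on, a random offset between 0 and the
    interval (in seconds) is added; with splay off, nothing is added.
    """
    interval_seconds = int(interval_minutes * 60)
    splay_seconds = rng.randint(0, interval_seconds) if splay else 0
    return splay_seconds, BASE_DELAY_SECONDS + splay_seconds


if __name__ == "__main__":
    # Default intervals from the [rhsmcertd] section shown in this comment.
    print("auto-attach:", initial_delay_seconds(1440, splay=True))
    print("cert check :", initial_delay_seconds(240, splay=True))
    print("splay off  :", initial_delay_seconds(240, splay=False))
```

With splay off, the result is always 0 splay seconds and 120 seconds total, matching the Scenario 2 log.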

Based on the above verification, moving this bug to Verified.

Comment 7 errata-xmlrpc 2017-08-01 19:21:47 UTC

Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2017:2083
