Bug 1435013

Summary: [RFE] Randomize and/or Distribute the execution of rhsmcertd over a large Satellite 6 Deployment
Product: Red Hat Enterprise Linux 7 Reporter: Jason Dickerson <jdickers>
Component: subscription-manager Assignee: Chris Snyder <csnyder>
Status: CLOSED ERRATA QA Contact: John Sefler <jsefler>
Severity: medium Docs Contact:
Priority: unspecified    
Version: 7.3 CC: bkearney, csnyder, khowell, redakkan, skallesh, wpinheir
Target Milestone: rc Keywords: FutureFeature
Target Release: ---   
Hardware: All   
OS: Linux   
Whiteboard:
Fixed In Version: subscription-manager-1.19.6-1.el7 Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of: Environment:
Last Closed: 2017-08-01 19:21:47 UTC Type: Bug
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: --- RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Embargoed:
Bug Depends On:    
Bug Blocks: 1430554    

Description Jason Dickerson 2017-03-22 21:57:58 UTC
1. Proposed title of this feature request

[RFE] Randomize and/or Distribute the execution of rhsmcertd over a large Satellite 6 Deployment


3. What is the nature and description of the request?

I have a Satellite 6 deployment with roughly 56,000 hosts. We have changed the rhsmcertd check-in interval from the default 4h to 8h. Every 8 hours, a large number of hosts check in at the same time. This drives Passenger usage on the Satellite to its limit and beyond. We need a way to spread the check-ins over time so that we do not exceed the Passenger queue, which results in the Satellite 6 UI becoming unresponsive.

On large-scale Satellite deployments, the Passenger queue is maxed out for a time while it attempts to process all the rhsmcertd requests, and the UI stays unresponsive until the queue goes down.


4. Why does the customer need this? (List the business requirements here)

This causes periods of time where the Satellite UI is under heavy load and unresponsive.  


5. How would the customer like to achieve this? (List the functional requirements here)

rhsmcertd should randomize its start time to allow for such situations and avoid maxing out the Passenger queue on the Satellite. Otherwise, another mechanism should be used to ensure that rhsmcertd check-ins do not occur at the same time for a large group of hosts.


6. For each functional requirement listed, specify how Red Hat and the customer can test to confirm the requirement is successfully implemented.

Restart rhsmcertd on a large number of hosts and determine whether they all check in at the same time.


7. Is there already an existing RFE upstream or in Red Hat Bugzilla?

Not that I know of.


8. Does the customer have any specific timeline dependencies and which release would they like to target (i.e. RHEL5, RHEL6)?

ASAP. This is impacting their ability to use the Satellite UI.


9. Is the sales team involved in this request and do they have any additional input?

Yes, and I do not believe so at this time.


10. List any affected packages or components.

subscription-manager on RHEL 6 and 7
Satellite 6.2.8


11. Would the customer be able to assist in testing this functionality if implemented?

Absolutely

Comment 3 Chris Snyder 2017-04-06 18:54:55 UTC
I have added an external tracker to a PR against upstream rhsmcertd for an implementation of this feature.

Comment 6 Rehana 2017-05-10 12:40:09 UTC
Below are the test scenarios used to verify this bug on the latest subscription-manager build:
subscription-manager: 1.19.12-1.el7
python-rhsm: 1.19.6-1.el7

1) Demonstrates how the initial auto-heal and cert checks are randomized between two guest machines with the new configuration parameter "splay" set to 1 (ON)
2) Demonstrates the original behaviour (i.e., the initial check happens after the default 2-minute delay) when "splay" is set to 0 (OFF)

1) Scenario 1: Demonstrates how the initial auto-heal and cert checks are randomized between two guest machines with the new configuration parameter "splay" left at its default of 1 (ON)
On machine 1 :
--------------------

1: Register guest machine 1 to the server
2: Make sure auto-attach, cert-check and splay are set to their default values
[rhsmcertd]
   autoattachinterval = [1440]
   certcheckinterval = [240]
   splay = 1

3: Restart rhsmcertd and check rhsmcertd.log
[root@dhcp151-211 ~]# service rhsmcertd restart
Redirecting to /bin/systemctl restart rhsmcertd.service
[root@dhcp151-211 ~]# tail -f /var/log/rhsm/rhsmcertd.log 
Tue May  9 20:12:27 2017 [INFO] (Cert Check) Certificates updated.
Wed May 10 00:12:31 2017 [INFO] (Cert Check) Certificates updated.
Wed May 10 04:12:29 2017 [INFO] (Cert Check) Certificates updated.
Wed May 10 05:35:57 2017 [WARN] (Auto-attach) Update failed (255), retry will occur on next run.
Wed May 10 06:44:02 2017 [INFO] rhsmcertd is shutting down...
Wed May 10 06:44:02 2017 [INFO] Starting rhsmcertd...
Wed May 10 06:44:02 2017 [INFO] Auto-attach interval: 1440.0 minutes [86400 seconds]
Wed May 10 06:44:02 2017 [INFO] Cert check interval: 240.0 minutes [14400 seconds]
Wed May 10 06:44:02 2017 [INFO] Waiting 2.0 minutes plus 28937 splay seconds [29057 seconds total] before performing first auto-attach.
Wed May 10 06:44:02 2017 [INFO] Waiting 2.0 minutes plus 10726 splay seconds [10846 seconds total] before performing first cert check.

^^ Notice the random splay seconds on guest machine 1: the first auto-attach on this machine will run 29057 seconds after the restart, and the first cert check 10846 seconds after it.
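
As a quick check of the arithmetic in the machine 1 log, the following Python sketch adds the logged total delays to the restart timestamp. The helper name first_run_time is hypothetical and exists only for this illustration; the timestamp and second counts are copied from the log above.

```python
from datetime import datetime, timedelta

def first_run_time(start: str, total_seconds: int) -> datetime:
    """Return the wall-clock time of the first run after a restart."""
    # Timestamp format as it appears in the rhsmcertd.log excerpt.
    started = datetime.strptime(start, "%a %b %d %H:%M:%S %Y")
    return started + timedelta(seconds=total_seconds)

restart = "Wed May 10 06:44:02 2017"
print("first cert check: ", first_run_time(restart, 10846))
print("first auto-attach:", first_run_time(restart, 29057))
```

With these inputs, the first cert check on machine 1 falls at 09:44:48 and the first auto-attach at 14:48:19 on May 10.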

On machine 2 : 
---------------------

1: Register guest machine 2 to the server
2: Make sure auto-attach, cert-check and splay are set to their default values
[rhsmcertd]
   autoattachinterval = [1440]
   certcheckinterval = [240]
   splay = 1

3: Restart rhsmcertd and check rhsmcertd.log

[root@dhcp35-238 ~]# service rhsmcertd restart
Redirecting to /bin/systemctl restart rhsmcertd.service
[root@dhcp35-238 ~]# tail -f /var/log/rhsm/rhsmcertd.log 
Wed May 10 15:47:39 2017 [INFO] Auto-attach interval: 1440.0 minutes [86400 seconds]
Wed May 10 15:47:39 2017 [INFO] Cert check interval: 240.0 minutes [14400 seconds]
Wed May 10 15:47:39 2017 [INFO] Waiting 2.0 minutes plus 46554 splay seconds [46674 seconds total] before performing first auto-attach.
Wed May 10 15:47:39 2017 [INFO] Waiting 2.0 minutes plus 9072 splay seconds [9192 seconds total] before performing first cert check.
Wed May 10 16:14:16 2017 [INFO] rhsmcertd is shutting down...
Wed May 10 16:14:16 2017 [INFO] Starting rhsmcertd...
Wed May 10 16:14:16 2017 [INFO] Auto-attach interval: 1440.0 minutes [86400 seconds]
Wed May 10 16:14:16 2017 [INFO] Cert check interval: 240.0 minutes [14400 seconds]
Wed May 10 16:14:16 2017 [INFO] Waiting 2.0 minutes plus 62750 splay seconds [62870 seconds total] before performing first auto-attach.
Wed May 10 16:14:16 2017 [INFO] Waiting 2.0 minutes plus 3288 splay seconds [3408 seconds total] before performing first cert check.

^^ Notice the random splay seconds on guest machine 2: the first auto-attach on this machine will run 62870 seconds after the restart, and the first cert check 3408 seconds after it.

Thus, with the new rhsm config parameter 'splay' set to "1", each machine runs rhsmcertd at a different, randomized time, thereby reducing the load on the server when a large number of machines restart simultaneously.
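
To illustrate the load-spreading effect, here is a small simulation sketch in Python. It is not a measurement: it assumes every host restarts at the same instant and that the splay is drawn uniformly from 0 to the cert check interval (the uniform distribution is an assumption; the bug only states the range). The function name busiest_minute is hypothetical, and the host count of 56,000 comes from the description.

```python
import random
from collections import Counter

INITIAL_DELAY = 120          # fixed 2-minute delay seen in the logs, in seconds
CERT_CHECK_INTERVAL = 14400  # default cert check interval, in seconds

def busiest_minute(hosts: int, splay: bool, rng: random.Random) -> int:
    """Return the largest number of first check-ins landing in one minute."""
    per_minute = Counter()
    for _ in range(hosts):
        # Assumption: uniform splay over [0, interval]; 0 when splay is off.
        offset = rng.randint(0, CERT_CHECK_INTERVAL) if splay else 0
        per_minute[(INITIAL_DELAY + offset) // 60] += 1
    return max(per_minute.values())

rng = random.Random(0)  # fixed seed so the run is repeatable
print("splay off:", busiest_minute(56000, False, rng))
print("splay on: ", busiest_minute(56000, True, rng))
```

With splay off, all simulated hosts land in the same minute; with splay on, the first check-ins are spread across the whole interval.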

2) Scenario 2: Demonstrates the original behaviour (i.e., the initial check happens after the default 2-minute delay) when "splay" is set to 0 (OFF)

[root@dhcp35-238 ~]# subscription-manager config --rhsmcertd.splay 0

[root@dhcp35-238 ~]# service rhsmcertd restart
Redirecting to /bin/systemctl restart rhsmcertd.service

[root@dhcp35-238 ~]# tail -f /var/log/rhsm/rhsmcertd.log 
Wed May 10 16:14:16 2017 [INFO] Auto-attach interval: 1440.0 minutes [86400 seconds]
Wed May 10 16:14:16 2017 [INFO] Cert check interval: 240.0 minutes [14400 seconds]
Wed May 10 16:14:16 2017 [INFO] Waiting 2.0 minutes plus 62750 splay seconds [62870 seconds total] before performing first auto-attach.
Wed May 10 16:14:16 2017 [INFO] Waiting 2.0 minutes plus 3288 splay seconds [3408 seconds total] before performing first cert check.
Wed May 10 16:43:57 2017 [INFO] rhsmcertd is shutting down...
Wed May 10 16:43:57 2017 [INFO] Starting rhsmcertd...
Wed May 10 16:43:57 2017 [INFO] Auto-attach interval: 1440.0 minutes [86400 seconds]
Wed May 10 16:43:57 2017 [INFO] Cert check interval: 240.0 minutes [14400 seconds]
Wed May 10 16:43:57 2017 [INFO] Waiting 2.0 minutes plus 0 splay seconds [120 seconds total] before performing first auto-attach.
Wed May 10 16:43:57 2017 [INFO] Waiting 2.0 minutes plus 0 splay seconds [120 seconds total] before performing first cert check.

^^ Notice that the random splay is no longer applied (0 splay seconds), so the initial checks happen after the default 2 minutes.
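
When repeating this check on many hosts, the "Waiting ..." lines can be parsed rather than read by eye. The sketch below is one possible way to do that in Python; the regular expression is derived only from the log lines shown in this comment, and the names WAIT_RE and parse_wait are hypothetical.

```python
import re

# Pattern for the "Waiting ..." lines shown in the rhsmcertd.log excerpts above.
WAIT_RE = re.compile(
    r"Waiting (?P<minutes>[\d.]+) minutes plus (?P<splay>\d+) splay seconds "
    r"\[(?P<total>\d+) seconds total\] before performing first (?P<task>.+)\.$"
)

def parse_wait(line: str) -> dict:
    """Extract the task name, base delay, splay and total delay from a log line."""
    match = WAIT_RE.search(line)
    if match is None:
        raise ValueError("not a 'Waiting' line: " + line)
    return {
        "task": match["task"],
        "base": int(float(match["minutes"]) * 60),  # base delay in seconds
        "splay": int(match["splay"]),
        "total": int(match["total"]),
    }

line = ("Wed May 10 16:43:57 2017 [INFO] Waiting 2.0 minutes plus 0 splay seconds "
        "[120 seconds total] before performing first cert check.")
info = parse_wait(line)
print(info["task"], "- splay disabled" if info["splay"] == 0 else "- splay active")
```

A parsed line can also be sanity-checked by confirming that the total equals the base delay plus the splay.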

Conclusion : 
===========
When splay is set to 1, the randomized splay value is always between 0 and the interval being randomized.
Example: for auto-attach, the splay value is between 0 and 86400 seconds (with the default autoattachinterval of 1440 minutes, i.e. 86400 seconds).

When splay is set to 0, rhsmcertd falls back to the fixed 2-minute delay before the first check.
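
The rule in this conclusion can be written down as a short Python sketch. This is not the daemon's actual code, only a model of the behaviour observed in the logs: a fixed 120-second delay plus a random splay between 0 and the interval. The function name first_run_delay is hypothetical, and treating both ends of the range as inclusive is an assumption.

```python
import random
from typing import Optional

INITIAL_DELAY_SECONDS = 120  # the fixed 2-minute delay shown in the logs

def first_run_delay(interval_minutes: int, splay_enabled: bool,
                    rng: Optional[random.Random] = None) -> int:
    """Seconds to wait before the first run: base delay plus optional splay."""
    if rng is None:
        rng = random.Random()
    interval_seconds = interval_minutes * 60
    # Assumption: the splay is an integer in [0, interval_seconds], inclusive.
    splay = rng.randint(0, interval_seconds) if splay_enabled else 0
    return INITIAL_DELAY_SECONDS + splay

print(first_run_delay(1440, splay_enabled=False))  # always the fixed delay
print(first_run_delay(1440, splay_enabled=True))   # varies from run to run
```

Under this model, the auto-attach delay with splay enabled ranges from 120 to 86520 seconds, which matches the totals of 29057, 46674 and 62870 seconds seen in the logs.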

Based on the above verification, moving this bug to Verified.

Comment 7 errata-xmlrpc 2017-08-01 19:21:47 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2017:2083