Note: This bug is displayed in read-only format because
the product is no longer active in Red Hat Bugzilla.
Description of problem:
Monitor and stop operations for the "ocf:heartbeat:Delay" resource fail with the default settings (out-of-the-box config). This is because the default "mondelay" and "stopdelay" values (30 seconds) are exactly equal to the default timeout for monitor and stop operations in the cluster (30 seconds), so the agent cannot finish before the operation times out.
Version-Release number of selected component (if applicable):
~~~
# rpm -q resource-agents kernel
resource-agents-4.10.0-34.el9.x86_64
kernel-5.14.0-70.13.1.el9_0.x86_64
~~~
How reproducible:
Monitor operations appear to fail 100% of the time. I have had a couple of successful stop operations, but most fail with the default configuration:
Steps to Reproduce:
1. Created the resource with default settings (no additional options) and disabled it to run the "debug-<operation>" tests:
~~~
[root@clusterb-rhel9 ~]# pcs resource create test-delay Delay
Assumed agent name 'ocf:heartbeat:Delay' (deduced from 'Delay')
[root@clusterb-rhel9 ~]# pcs resource disable test-delay
~~~
2. The start operation succeeds with default settings:
~~~
[root@clustera-rhel9 ~]# pcs resource debug-start test-delay
Operation force-start for test-delay (ocf:heartbeat:Delay) returned 0 (ok)
~~~
3. Monitor operation times out with default settings:
~~~
[root@clustera-rhel9 ~]# pcs resource debug-monitor test-delay
Operation force-check for test-delay (ocf:heartbeat:Delay) could not be executed (Timed Out: Process did not exit within specified timeout)
crm_resource: Error performing operation: Error occurred
~~~
4. Stop operations time out with default settings:
~~~
# Can only be run after a "debug-start" has started the resource. Otherwise it reports the resource as already down:
[root@clustera-rhel9 ~]# pcs resource debug-stop test-delay
Operation force-stop for test-delay (ocf:heartbeat:Delay) could not be executed (Timed Out: Process did not exit within specified timeout)
crm_resource: Error performing operation: Error occurred
~~~
- The current default monitor and stop delay times in the RA match the default timeout periods for "monitor" and "stop" operations:
~~~
[root@clustera-rhel9 ~]# rpm -q resource-agents
resource-agents-4.10.0-34.el9.x86_64
~~~
~~~
$ vim /usr/lib/ocf/resource.d/heartbeat/Delay
----------------------->8--------------------------
OCF_RESKEY_startdelay_default="20"
OCF_RESKEY_stopdelay_default="30"
OCF_RESKEY_mondelay_default="30"

: ${OCF_RESKEY_startdelay=${OCF_RESKEY_startdelay_default}}
: ${OCF_RESKEY_stopdelay=${OCF_RESKEY_stopdelay_default}}
: ${OCF_RESKEY_mondelay=${OCF_RESKEY_mondelay_default}}
~~~
~~~
$ pcs config
----------------------->8--------------------------
 Resource: test-delay (class=ocf provider=heartbeat type=Delay)
  Meta Attrs: target-role=Stopped
  Operations: monitor interval=10s timeout=30s (test-delay-monitor-interval-10s)
              start interval=0s timeout=30s (test-delay-start-interval-0s)
              stop interval=0s timeout=30s (test-delay-stop-interval-0s)
~~~
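For context on how those defaults take effect: Pacemaker passes each configured resource option to the agent as an `OCF_RESKEY_<name>` environment variable, and the `: ${VAR=default}` lines fill in a default only when that variable is unset. The sketch below replays that logic in plain POSIX sh outside a cluster. `resolve_delays` is a hypothetical helper, not a function in the Delay agent; the default values are copied from the 4.10.0-34 excerpt above.

```shell
#!/bin/sh
# Defaults copied from /usr/lib/ocf/resource.d/heartbeat/Delay (resource-agents-4.10.0-34.el9).
OCF_RESKEY_startdelay_default="20"
OCF_RESKEY_stopdelay_default="30"
OCF_RESKEY_mondelay_default="30"

# Hypothetical helper: resolves the delays the same way the RA's ": ${VAR=default}" lines do.
resolve_delays() {
    # ${VAR=default} assigns the default only if VAR is unset;
    # a value exported by Pacemaker (e.g. OCF_RESKEY_mondelay=5) is kept as-is.
    : "${OCF_RESKEY_startdelay=${OCF_RESKEY_startdelay_default}}"
    : "${OCF_RESKEY_stopdelay=${OCF_RESKEY_stopdelay_default}}"
    : "${OCF_RESKEY_mondelay=${OCF_RESKEY_mondelay_default}}"
    echo "start=${OCF_RESKEY_startdelay} stop=${OCF_RESKEY_stopdelay} monitor=${OCF_RESKEY_mondelay}"
}

# Case 1: no options configured, as with "pcs resource create test-delay Delay".
unset OCF_RESKEY_startdelay OCF_RESKEY_stopdelay OCF_RESKEY_mondelay
resolve_delays    # start=20 stop=30 monitor=30

# Case 2: mondelay set explicitly; start and stop still fall back to their defaults.
unset OCF_RESKEY_startdelay OCF_RESKEY_stopdelay
OCF_RESKEY_mondelay=5
resolve_delays    # start=20 stop=30 monitor=5
```

With no options set, stop and monitor resolve to 30s each, the same value as the 30s op timeouts in the `pcs config` output.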
So we should probably reduce the RA's default delay for both of these operations; otherwise they will fail out of the box.
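The arithmetic behind the failure can be sketched as a simple check. With the 30s defaults the agent sleeps for the entire operation timeout, so any extra time spent starting and returning pushes it past the limit. `check_headroom` is a hypothetical helper for illustration only (not part of pcs or resource-agents), and it assumes an operation can complete only if its delay is strictly less than its timeout.

```shell
#!/bin/sh
# Hypothetical helper: flags an operation whose delay leaves no headroom before its timeout.
# Usage: check_headroom <operation> <delay_seconds> <timeout_seconds>
check_headroom() {
    op=$1
    delay=$2
    timeout=$3
    if [ "$delay" -ge "$timeout" ]; then
        echo "$op: FAIL (delay ${delay}s >= timeout ${timeout}s)"
        return 1
    fi
    echo "$op: ok (delay ${delay}s, $((timeout - delay))s headroom)"
}

# RA defaults from resource-agents-4.10.0-34.el9 vs. the op timeouts pcs generated above.
check_headroom start 20 30
check_headroom monitor 30 30 || echo "  -> reduce mondelay or raise the monitor op timeout"
check_headroom stop 30 30 || echo "  -> reduce stopdelay or raise the stop op timeout"
```

On an affected system, the same headroom can be obtained without changing the RA by setting the delays explicitly, e.g. `pcs resource update test-delay mondelay=10 stopdelay=10`, or by raising the op timeouts instead.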
Actual results:
Start operations are successful.
Monitor operations timed out.
Stop operations timed out.
Expected results:
All operations (start, stop, monitor) should succeed with the default configuration.
Additional info:
- Shane Bradley++ has pointed out that the resource description is also incorrect: it says the stop and monitor delays default to "startdelay", but they actually have their own defaults (30s, versus 20s for startdelay). Not sure whether this should be updated here or in a separate Bugzilla.
~~~
[root@rhel8-node1 ~]# pcs resource describe Delay
Assumed agent name 'ocf:heartbeat:Delay' (deduced from 'Delay')
ocf:heartbeat:Delay - Waits for a defined timespan
This script is a test resource for introducing delay.
Resource options:
startdelay: How long in seconds to delay on start operation.
stopdelay: How long in seconds to delay on stop operation. Defaults to "startdelay" if unspecified. <---
mondelay: How long in seconds to delay on monitor operation. Defaults to "startdelay" if unspecified. <---
~~~
- Both the description discrepancy and this issue were likely introduced in this change, which set the default stopdelay and mondelay values to 30s:
https://github.com/ClusterLabs/resource-agents/commit/baa4cdf6afb9df801d40895f2a9ffcf7d2c8fdae
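To make the description discrepancy concrete, the sketch below contrasts the behaviour the metadata text describes with what the 4.10.0-34 code does for an unconfigured resource. The functions `documented` and `actual` are hypothetical, variable names are shortened from `OCF_RESKEY_*`, and the "documented" variant is only what the description implies, not code that ships in any release. It also shows that if the agent behaved as documented, stop and monitor would get 20s delays and 10s of headroom under the 30s op timeouts.

```shell
#!/bin/sh
startdelay=20   # startdelay default, the same in both readings

# Reading 1: what "pcs resource describe Delay" says ("Defaults to startdelay if unspecified").
documented() {
    unset stopdelay mondelay
    : "${stopdelay=$startdelay}"
    : "${mondelay=$startdelay}"
    echo "stop=$stopdelay monitor=$mondelay"
}

# Reading 2: what resource-agents-4.10.0-34.el9 actually does (independent 30s defaults).
actual() {
    unset stopdelay mondelay
    : "${stopdelay=30}"
    : "${mondelay=30}"
    echo "stop=$stopdelay monitor=$mondelay"
}

documented   # stop=20 monitor=20
actual       # stop=30 monitor=30
```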
Comment 1 Oyvind Albrigtsen
2023-05-26 15:00:29 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.
For information on the advisory (resource-agents bug fix and enhancement update), and where to find the updated
files, follow the link below.
If the solution does not work for you, open a new bug report.
https://access.redhat.com/errata/RHBA-2023:6312