Bug 2209433 - ocf:heartbeat:Delay RA fails monitor and stop operations when using default settings
Keywords:
Status: VERIFIED
Alias: None
Product: Red Hat Enterprise Linux 9
Classification: Red Hat
Component: resource-agents
Version: 9.2
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: unspecified
Target Milestone: rc
Target Release: 9.3
Assignee: Oyvind Albrigtsen
QA Contact: cluster-qe
Docs Contact: Steven J. Levine
URL:
Whiteboard:
Depends On:
Blocks: 2209436
Reported: 2023-05-23 21:18 UTC by Joshua Baker
Modified: 2023-08-10 15:41 UTC (History)

Fixed In Version: resource-agents-4.10.0-42.el9
Doc Type: Bug Fix
Doc Text:
Clone Of:
Clones: 2209436
Environment:
Last Closed:
Type: Bug
Target Upstream Version:
Embargoed:




Links
System ID Private Priority Status Summary Last Updated
Red Hat Issue Tracker CLUSTERQE-6812 0 None None None 2023-07-12 11:16:22 UTC
Red Hat Issue Tracker RHELPLAN-157973 0 None None None 2023-05-23 21:19:51 UTC
Red Hat Knowledge Base (Solution) 7014950 0 None None None 2023-05-24 12:14:00 UTC

Description Joshua Baker 2023-05-23 21:18:40 UTC
Description of problem:
Monitor and stop operations for the "ocf:heartbeat:Delay" resource fail with default settings (the out-of-the-box configuration). This is because the default "mondelay" and "stopdelay" values (30s) are exactly the same as the default timeout (30s) for monitor and stop operations in the cluster.

Version-Release number of selected component (if applicable):

# rpm -q resource-agents kernel
resource-agents-4.10.0-34.el9.x86_64
kernel-5.14.0-70.13.1.el9_0.x86_64

How reproducible:
Monitor operations appear to fail 100% of the time. I have had a couple of successful stop operations, but most fail with the default configuration:

Steps to Reproduce:

1. Create the resource with default settings (no additional options), then disable it so the "debug-<operation>" tests can be run:
~~~
[root@clusterb-rhel9 ~]# pcs resource create test-delay Delay
Assumed agent name 'ocf:heartbeat:Delay' (deduced from 'Delay')

[root@clusterb-rhel9 ~]# pcs resource disable test-delay
~~~

2. Start operation is successful with default settings:
~~~
[root@clustera-rhel9 ~]# pcs resource debug-start test-delay
Operation force-start for test-delay (ocf:heartbeat:Delay) returned 0 (ok)
~~~

3. Monitor operation times out with default settings:
~~~
[root@clustera-rhel9 ~]# pcs resource debug-monitor test-delay
Operation force-check for test-delay (ocf:heartbeat:Delay) could not be executed (Timed Out: Process did not exit within specified timeout)
crm_resource: Error performing operation: Error occurred
~~~

4. Stop operations time out with default settings:
~~~
# Can only be run after a "debug-start" has started the resource. Otherwise it reports as already down:
[root@clustera-rhel9 ~]# pcs resource debug-stop test-delay
Operation force-stop for test-delay (ocf:heartbeat:Delay) could not be executed (Timed Out: Process did not exit within specified timeout)
crm_resource: Error performing operation: Error occurred
~~~
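
As a rough analogy only, the same "did not exit within specified timeout" behavior can be reproduced outside the cluster with GNU coreutils `timeout`. This sketch does not show how Pacemaker enforces operation timeouts; `run_op` is a hypothetical helper, and the short durations are chosen only to keep the example quick:

```shell
#!/bin/sh
# Hypothetical helper: model an operation as a plain sleep ("delay")
# that must finish before a limit ("timeout"), both in seconds.
# $1 = timeout, $2 = delay
run_op() {
    if timeout "$1" sleep "$2"; then
        echo "ok"
    else
        # GNU timeout exits non-zero (124) when it had to kill the command.
        echo "timed out"
    fi
}

run_op 2 1    # delay shorter than the timeout
run_op 1 2    # delay longer than the timeout
```

When the two values are equal, the outcome of this analogy depends on timing, so it is left out of the example.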

- The current default monitor and stop delays in the RA match the default timeout periods for the "monitor" and "stop" operations:
~~~
[root@clustera-rhel9 ~]# rpm -q resource-agents
resource-agents-4.10.0-34.el9.x86_64
~~~

~~~
$ vim /usr/lib/ocf/resource.d/heartbeat/Delay
----------------------->8--------------------------
OCF_RESKEY_startdelay_default="20"
OCF_RESKEY_stopdelay_default="30"
OCF_RESKEY_mondelay_default="30"

: ${OCF_RESKEY_startdelay=${OCF_RESKEY_startdelay_default}}
: ${OCF_RESKEY_stopdelay=${OCF_RESKEY_stopdelay_default}}
: ${OCF_RESKEY_mondelay=${OCF_RESKEY_mondelay_default}}
~~~
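
The `: ${VAR=${DEFAULT}}` lines in the excerpt use standard shell parameter expansion: the variable receives the default only when it is unset. A minimal sketch of that idiom follows, using the default quoted above; `effective_mondelay` is a hypothetical helper written for illustration and is not part of the agent:

```shell
#!/bin/sh
# Default value copied from the agent excerpt.
OCF_RESKEY_mondelay_default="30"

# Hypothetical helper (not in the agent): print the mondelay that results
# when the user supplies a value in $1, or when no value is supplied.
effective_mondelay() {
    (
        # Run in a subshell so each call starts from a clean state.
        unset OCF_RESKEY_mondelay
        if [ "$#" -gt 0 ]; then
            OCF_RESKEY_mondelay="$1"
        fi
        # Same idiom as the agent: assign the default only if unset.
        : ${OCF_RESKEY_mondelay=${OCF_RESKEY_mondelay_default}}
        echo "$OCF_RESKEY_mondelay"
    )
}

effective_mondelay      # no value given: prints 30
effective_mondelay 5    # explicit value wins: prints 5
```

With no `mondelay` configured on the resource, the 30-second default is what the monitor operation sleeps for.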

~~~
$ pcs config
----------------------->8--------------------------
 Resource: test-delay (class=ocf provider=heartbeat type=Delay)
  Meta Attrs: target-role=Stopped
  Operations: monitor interval=10s timeout=30s (test-delay-monitor-interval-10s)
              start interval=0s timeout=30s (test-delay-start-interval-0s)
              stop interval=0s timeout=30s (test-delay-stop-interval-0s)
~~~

So we should probably reduce the default delay in the resource agent for both of these operations; otherwise they will fail out of the box.
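
To make the collision concrete, the sketch below compares each default delay with its operation timeout, using the values from the agent and `pcs config` excerpts above. `check_delay` is a hypothetical helper, and treating a delay equal to the timeout as "at risk" is an assumption based on the failures observed in this report:

```shell
#!/bin/sh
# Hypothetical helper: report whether a delay leaves headroom before
# the operation timeout. Arguments: operation name, delay, timeout (seconds).
check_delay() {
    if [ "$2" -lt "$3" ]; then
        echo "$1: ok (delay ${2}s < timeout ${3}s)"
    else
        echo "$1: at risk (delay ${2}s >= timeout ${3}s)"
    fi
}

# Delays from the agent defaults, timeouts from the pcs config output.
check_delay start   20 30
check_delay monitor 30 30
check_delay stop    30 30
```

Until a fixed package is installed, one possible workaround is to set shorter delays explicitly, for example `pcs resource update test-delay mondelay=5 stopdelay=5`. The values here are illustrative and have not been tested as part of this report.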

Actual results:
Start operations are successful.
Monitor operations timed out.
Stop operations timed out.

Expected results:
All operations (start, stop, monitor) should be successful with a default configuration.

Additional info:
- Shane Bradley++ has pointed out that the resource description is also incorrect: the stop and monitor delays do not default to the start delay. Not sure if this should be updated here or in another Bugzilla.

~~~
[root@rhel8-node1 ~]# pcs resource describe Delay
Assumed agent name 'ocf:heartbeat:Delay' (deduced from 'Delay')
ocf:heartbeat:Delay - Waits for a defined timespan

This script is a test resource for introducing delay.

Resource options:
  startdelay: How long in seconds to delay on start operation.
  stopdelay: How long in seconds to delay on stop operation. Defaults to "startdelay" if unspecified. <---
  mondelay: How long in seconds to delay on monitor operation. Defaults to "startdelay" if unspecified. <---
~~~

- Both the description discrepancy and this issue were likely introduced in this change, which set the default values for stopdelay and mondelay to 30s:

https://github.com/ClusterLabs/resource-agents/commit/baa4cdf6afb9df801d40895f2a9ffcf7d2c8fdae

Comment 1 Oyvind Albrigtsen 2023-05-26 15:00:29 UTC
https://github.com/ClusterLabs/resource-agents/pull/1871
