Bug 2209436

Summary: ocf:heartbeat:Delay RA fails monitor and stop operations when using default settings
Product: Red Hat Enterprise Linux 8 Reporter: Joshua Baker <jobaker>
Component: resource-agents Assignee: Oyvind Albrigtsen <oalbrigt>
Status: ASSIGNED --- QA Contact: cluster-qe <cluster-qe>
Severity: unspecified Docs Contact:
Priority: unspecified    
Version: 8.7 CC: agk, cluster-maint, fdinitto, oalbrigt, sbradley
Target Milestone: rc   
Target Release: ---   
Hardware: Unspecified   
OS: Unspecified   
Whiteboard:
Fixed In Version: Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of: 2209433 Environment:
Last Closed: Type: Bug
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: --- RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Embargoed:
Bug Depends On: 2209433    
Bug Blocks:    

Description Joshua Baker 2023-05-23 21:36:07 UTC
+++ This bug was initially created as a clone of Bug #2209433 +++

Description of problem:
Monitor and stop operations for the "ocf:heartbeat:Delay" resource fail with default settings (out-of-the-box configuration). The default "mondelay" and "stopdelay" values are exactly equal to the cluster timeouts for the monitor and stop operations, so the agent cannot exit before the timeout expires.

Version-Release number of selected component (if applicable):

# rpm -q resource-agents kernel
resource-agents-4.9.0-29.el8_7.3.x86_64
kernel-4.18.0-425.3.1.el8.x86_64

How reproducible:
Monitor operations fail 100% of the time. A couple of stop operations have succeeded, but most fail with the default configuration:

Steps to Reproduce:

1. Created the resource with default settings (no additional options), then disabled it so that the "debug-<operation>" tests could be run:
~~~
[root@rhel8-node1 ~]# pcs resource create test-delay Delay
Assumed agent name 'ocf:heartbeat:Delay' (deduced from 'Delay')

[root@rhel8-node1 ~]# pcs resource disable test-delay
~~~

2. Start operation is successful with default settings:
~~~
[root@rhel8-node2 ~]# pcs resource debug-start test-delay
Operation force-start for test-delay (ocf:heartbeat:Delay) returned 0 (ok)
~~~

3. Monitor operation times out with default settings:
~~~
[root@rhel8-node1 ~]# pcs resource debug-monitor test-delay
Operation force-check for test-delay (ocf:heartbeat:Delay) could not be executed (Timed Out: Resource agent did not exit within specified timeout)
crm_resource: Error performing operation: Error occurred
~~~

4. Stop operations time out with default settings:
~~~
# Can only be run after a "debug-start" has started the resource. Otherwise it reports as already down:
[root@rhel8-node2 ~]# pcs resource debug-stop test-delay
Operation force-stop for test-delay (ocf:heartbeat:Delay) could not be executed (Timed Out: Resource agent did not exit within specified timeout)
crm_resource: Error performing operation: Error occurred
~~~

- The current default monitor and stop delays in the RA match the default timeout periods for the "monitor" and "stop" operations:
~~~
[root@rhel8-node1 ~]# rpm -q resource-agents kernel
resource-agents-4.9.0-29.el8_7.3.x86_64
~~~

~~~
$ vim /usr/lib/ocf/resource.d/heartbeat/Delay
----------------------->8--------------------------
OCF_RESKEY_startdelay_default="20"
OCF_RESKEY_stopdelay_default="30"
OCF_RESKEY_mondelay_default="30"

: ${OCF_RESKEY_startdelay=${OCF_RESKEY_startdelay_default}}
: ${OCF_RESKEY_stopdelay=${OCF_RESKEY_stopdelay_default}}
: ${OCF_RESKEY_mondelay=${OCF_RESKEY_mondelay_default}}
~~~

~~~
[root@rhel8-node2 ~]# pcs config show
----------------------->8--------------------------
  Resource: test-delay (class=ocf provider=heartbeat type=Delay)
    Attributes: test-delay-instance_attributes
      mondelay=10
    Operations:
      monitor: test-delay-monitor-interval-10s
        interval=10s
        timeout=30s <---
      start: test-delay-start-interval-0s
        interval=0s
        timeout=30s <---
      stop: test-delay-stop-interval-0s
        interval=0s
        timeout=30s <---
~~~

So we should probably reduce the default delay for the resource agent for both of these operations. Otherwise they will fail out of the box. 
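
The mismatch can be checked with a few lines of shell. This is a minimal sketch and not part of the resource agent: `fits_timeout` and `report` are hypothetical helpers, the delays are the agent defaults quoted above, and the 30s timeouts are the ones in the `pcs config show` output. It assumes an operation can only succeed if its delay is strictly shorter than the operation timeout.

```shell
#!/bin/sh
# Hypothetical helper: succeeds only if the delay (seconds, $1) is strictly
# shorter than the operation timeout (seconds, $2).
fits_timeout() {
    [ "$1" -lt "$2" ]
}

# Print one line per operation: $1 = operation name, $2 = delay, $3 = timeout.
report() {
    if fits_timeout "$2" "$3"; then
        echo "$1: delay ${2}s < timeout ${3}s, ok"
    else
        echo "$1: delay ${2}s >= timeout ${3}s, expected to time out"
    fi
}

# Default delays from the Delay RA, timeouts from "pcs config show".
report start   20 30
report stop    30 30
report monitor 30 30
```

Only the start operation leaves any headroom, which is consistent with the results observed in the reproduction steps.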

Actual results:
Start operations are successful.
Monitor operations timed out.
Stop operations timed out.

Expected results:
All operations (start, stop, monitor) should be successful with a default configuration.
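
Until the defaults change, either side of the mismatch can presumably be adjusted per resource. The sketch below is a possible workaround, not a fix: the `run` wrapper is a hypothetical dry-run helper that only prints the commands unless `DRY_RUN=0` is set, and the values 10 and 60s are arbitrary examples. The two options are alternatives; only one would be needed.

```shell
#!/bin/sh
# Hypothetical dry-run wrapper: prints the command unless DRY_RUN=0 is set,
# in which case the command is executed.
run() {
    if [ "${DRY_RUN:-1}" = "1" ]; then
        echo "$*"
    else
        "$@"
    fi
}

# Option 1: make the delays shorter than the 30s operation timeouts.
# The value 10 is an arbitrary example.
run pcs resource update test-delay stopdelay=10 mondelay=10

# Option 2: keep the default delays and raise the operation timeouts instead.
# The value 60s is an arbitrary example.
run pcs resource update test-delay op monitor timeout=60s
run pcs resource update test-delay op stop timeout=60s
```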

Additional info:
- Shane Bradley++ has pointed out that the resource description is also incorrect: the stop and monitor delays do not default to the start delay. Not sure if this should be updated here or in another Bugzilla.

~~~
[root@rhel8-node1 ~]# pcs resource describe Delay
Assumed agent name 'ocf:heartbeat:Delay' (deduced from 'Delay')
ocf:heartbeat:Delay - Waits for a defined timespan

This script is a test resource for introducing delay.

Resource options:
  startdelay: How long in seconds to delay on start operation.
  stopdelay: How long in seconds to delay on stop operation. Defaults to "startdelay" if unspecified. <---
  mondelay: How long in seconds to delay on monitor operation. Defaults to "startdelay" if unspecified. <---
~~~
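
If the documented behaviour ("Defaults to "startdelay" if unspecified") is the intended one, the defaults could be chained with ordinary shell parameter expansion. The following is only an illustrative sketch under that assumption, not the upstream code or the eventual fix. The value 5 for `startdelay` is made up for the example.

```shell
#!/bin/sh
# Illustrative only: make stopdelay and mondelay fall back to startdelay,
# as the metadata text describes. Not the upstream implementation.
OCF_RESKEY_startdelay_default="20"

# Simulate a user who sets only startdelay (example value, an assumption).
OCF_RESKEY_startdelay="5"
unset OCF_RESKEY_stopdelay OCF_RESKEY_mondelay

: "${OCF_RESKEY_startdelay=${OCF_RESKEY_startdelay_default}}"
# Unset stop and monitor delays inherit whatever startdelay resolved to.
: "${OCF_RESKEY_stopdelay=${OCF_RESKEY_startdelay}}"
: "${OCF_RESKEY_mondelay=${OCF_RESKEY_startdelay}}"

echo "startdelay=${OCF_RESKEY_startdelay} stopdelay=${OCF_RESKEY_stopdelay} mondelay=${OCF_RESKEY_mondelay}"
```

With this chaining and the defaults untouched, all three delays would resolve to 20s, which stays below the 30s operation timeouts shown earlier.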

- Both the description discrepancy and this issue were likely introduced by this change, which set the default values of stopdelay and mondelay to 30s:

https://github.com/ClusterLabs/resource-agents/commit/baa4cdf6afb9df801d40895f2a9ffcf7d2c8fdae