Bug 1919910

Summary: Fail counts are not reset after failure-timeout expires until the next cluster-recheck-interval
Product: Red Hat Enterprise Linux 7
Reporter: chengliu <chengliu>
Component: pacemaker
Assignee: Ken Gaillot <kgaillot>
Status: CLOSED WONTFIX
QA Contact: cluster-qe <cluster-qe>
Severity: high
Docs Contact:
Priority: unspecified
Version: 7.9
CC: cluster-maint, sbradley
Target Milestone: rc
Target Release: ---
Hardware: x86_64
OS: Linux
Whiteboard:
Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of:
Environment:
Last Closed: 2021-01-25 15:29:23 UTC
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Embargoed:

Description chengliu 2021-01-25 11:48:27 UTC
Description of problem:

After the failure-timeout is reached, the fail count is not cleared and the resources do not move back to the original node according to the location constraint.
Instead, the cluster waits for the cluster-recheck-interval to expire.

Version-Release number of selected component (if applicable):

pacemaker-1.1.23-1.el7

How reproducible:

Steps to Reproduce:
1. Set up a 2-node cluster
2. Configure the resource with the meta attributes "failure-timeout=120s migration-threshold=1"
3. Configure location constraints for the resources
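The steps above can be sketched with pcs commands. The resource and node names are taken from the transcript below; the exact syntax may differ slightly between pcs versions:

```shell
# Assumed: a 2-node cluster (host1.example.com, host2.example.com) with an
# existing group "apachegroup" containing the "Website" apache resource.

# Make a single failure move the resource away, with the failure
# expiring after 120 seconds:
pcs resource update Website meta failure-timeout=120s migration-threshold=1

# Prefer host2 for the group, so resources should move back there
# once the failure expires:
pcs constraint location apachegroup prefers host2.example.com=2000
```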

Actual results:

The resources move back to the original node only after the cluster-recheck-interval expires, rather than when the failure-timeout is reached.
~~~
  Resource: Website (class=ocf provider=heartbeat type=apache)
   Attributes: configfile=/etc/httpd/conf/httpd.conf statusurl=http://127.0.0.1/server-status
   Meta Attrs: failure-timeout=120s migration-threshold=1   
   Operations: monitor interval=10s timeout=20s (Website-monitor-interval-10s)
               start interval=0s timeout=40s (Website-start-interval-0s)
               stop interval=0s timeout=60s (Website-stop-interval-0s)

Location Constraints:
  Resource: apachegroup
    Enabled on: host2.example.com (score:2000) (id:location-apachegroup-host2.example.com-2000)

# pcs property --all | grep recheck
 cluster-recheck-interval: 15min

[root@host2 ~]# date; pcs resource
Mon Jan 25 02:24:43 EST 2021
 Resource Group: apachegroup
     my_lvm	(ocf::heartbeat:LVM):	Started host2.example.com
     my_fs	(ocf::heartbeat:Filesystem):	Started host2.example.com
     VirtualIP	(ocf::heartbeat:IPaddr2):	Started host2.example.com
     Website	(ocf::heartbeat:apache):	Started host2.example.com

[root@host2 ~]# date && killall -9 httpd && sleep 30 && crm_resource --wait && pcs resource
Mon Jan 25 02:25:52 EST 2021
 Resource Group: apachegroup
     my_lvm	(ocf::heartbeat:LVM):	Started host1.example.com
     my_fs	(ocf::heartbeat:Filesystem):	Started host1.example.com
     VirtualIP	(ocf::heartbeat:IPaddr2):	Started host1.example.com
     Website	(ocf::heartbeat:apache):	Started host1.example.com

[root@host2 ~]# date; pcs resource
Mon Jan 25 02:28:12 EST 2021
 Resource Group: apachegroup
     my_lvm	(ocf::heartbeat:LVM):	Started host1.example.com
     my_fs	(ocf::heartbeat:Filesystem):	Started host1.example.com
     VirtualIP	(ocf::heartbeat:IPaddr2):	Started host1.example.com
     Website	(ocf::heartbeat:apache):	Started host1.example.com

[root@host2 ~]# pcs resource failcount show Website
Failcounts for resource 'Website'
  host2.example.com: 1

[root@host2 ~]# date;pcs resource failcount show Website;pcs resource
Mon Jan 25 02:42:56 EST 2021
No failcounts for resource 'Website'
 Resource Group: apachegroup
     my_lvm	(ocf::heartbeat:LVM):	Started host2.example.com
     my_fs	(ocf::heartbeat:Filesystem):	Started host2.example.com
     VirtualIP	(ocf::heartbeat:IPaddr2):	Started host2.example.com
     Website	(ocf::heartbeat:apache):	Started host2.example.com

/var/log/messages
...
Jan 25 02:25:58 host1 pengine[1458]:    info: Website has failed 1 times on host2.example.com
Jan 25 02:25:58 host1 pengine[1458]: warning: Forcing Website away from host2.example.com after 1 failures (max=1)
...
Jan 25 02:25:58 host1 pengine[1458]:  notice:  * Move       my_lvm         ( host2.example.com -> host1.example.com )
Jan 25 02:25:58 host1 pengine[1458]:  notice:  * Move       my_fs          ( host2.example.com -> host1.example.com )
Jan 25 02:25:58 host1 pengine[1458]:  notice:  * Move       VirtualIP      ( host2.example.com -> host1.example.com )
Jan 25 02:25:58 host1 pengine[1458]:  notice:  * Recover    Website        ( host2.example.com -> host1.example.com )
...
Jan 25 02:41:00 host1 pengine[1458]:  notice: Clearing failure of Website on host2.example.com because it expired
Jan 25 02:41:00 host1 pengine[1458]:    info: Website has failed 1 times on host2.example.com
Jan 25 02:41:00 host1 pengine[1458]:  notice: Clearing failure of Website on host2.example.com because it expired
Jan 25 02:41:00 host1 pengine[1458]:  notice: Re-initiated expired calculated failure Website_monitor_10000 (rc=7, magic=0:7;16:179:0:7a8ad70e-ad05-403a-93bc-b174050386e3) on host2.example.com
...
Jan 25 02:41:00 host1 pengine[1458]:  notice:  * Move       my_lvm         ( host1.example.com -> host2.example.com )
Jan 25 02:41:00 host1 pengine[1458]:  notice:  * Move       my_fs          ( host1.example.com -> host2.example.com )
Jan 25 02:41:00 host1 pengine[1458]:  notice:  * Move       VirtualIP      ( host1.example.com -> host2.example.com )
Jan 25 02:41:00 host1 pengine[1458]:  notice:  * Move       Website        ( host1.example.com -> host2.example.com )
~~~

Expected results:

The resources move back to the original node once the failure-timeout expires.

Additional info:
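The observed delay can be modeled: the fail count expires at failure time plus failure-timeout, but the scheduler only notices this on its next run, which (absent other cluster events) is the next cluster-recheck-interval tick. A minimal sketch of this timing, as an illustrative model and not Pacemaker code:

```python
def effective_clear_time(failure_time, failure_timeout, recheck_interval):
    """Return when an expired failure actually takes effect, assuming the
    only scheduler runs are the periodic cluster rechecks."""
    expiry = failure_time + failure_timeout
    # Find the first periodic recheck at or after the expiry time.
    t = 0
    while t < expiry:
        t += recheck_interval
    return t

# Values from this report: failure-timeout=120s, cluster-recheck-interval=15min.
print(effective_clear_time(0, 120, 900))  # -> 900 (15 min), not 120
```

This matches the roughly 15-minute gap between the 02:25 failure and the 02:41 "Clearing failure ... because it expired" log messages above. As a mitigation on RHEL 7, lowering cluster-recheck-interval or running `pcs resource cleanup Website` manually may reduce the delay.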

Comment 2 Ken Gaillot 2021-01-25 15:29:23 UTC
Hi,

This is a known limitation in the RHEL 7 Pacemaker version, documented in the reference guide:

    https://access.redhat.com/documentation/en-us/red_hat_enterprise_linux/7/html/high_availability_add-on_reference/s1-resourceopts-haar

The limitation was removed in RHEL 8.2. Given RHEL 7's current life-cycle phase, the fix will not be backported.