Bug 1747553 - Dynamically calculate Pacemaker's cluster recheck interval
Summary: Dynamically calculate Pacemaker's cluster recheck interval
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat Enterprise Linux 8
Classification: Red Hat
Component: pacemaker
Version: 8.1
Hardware: All
OS: All
Priority: high
Severity: medium
Target Milestone: pre-dev-freeze
Target Release: 8.2
Assignee: Ken Gaillot
QA Contact: michal novacek
URL:
Whiteboard:
Depends On: 1752538
Blocks:
 
Reported: 2019-08-30 19:38 UTC by Ken Gaillot
Modified: 2024-10-01 16:19 UTC
CC List: 5 users

Fixed In Version: pacemaker-2.0.3-1.el8
Doc Type: Enhancement
Doc Text:
Feature: Pacemaker now calculates the cluster recheck interval dynamically.
Reason: Previously, Pacemaker's failure-timeout resource meta-attribute, ocf:pacemaker:remote reconnect_interval instance attribute, and time-based configuration rules were not guaranteed to be checked any more often than the value of the cluster-recheck-interval cluster property. When using those features, users would have to know to set a low cluster-recheck-interval, which generates unnecessary system load during normal operation.
Result: Users can configure these features without having to know anything about the cluster-recheck-interval property, and the cluster reacts in a timely manner without adding system load for cluster calculations when not needed.
Clone Of:
Environment:
Last Closed: 2020-04-28 15:38:28 UTC
Type: Enhancement
Target Upstream Version:
Embargoed:


Attachments
pcs cluster report (835.54 KB, application/x-bzip)
2020-02-13 12:52 UTC, michal novacek


Links
System ID Private Priority Status Summary Last Updated
Red Hat Issue Tracker RHELPLAN-27848 0 None None None 2024-06-13 22:18:58 UTC
Red Hat Knowledge Base (Solution) 4615881 0 None None None 2019-11-27 13:32:52 UTC
Red Hat Product Errata RHEA-2020:1609 0 None None None 2020-04-28 15:39:10 UTC

Internal Links: 1919910

Description Ken Gaillot 2019-08-30 19:38:52 UTC
Description of problem: Currently, Pacemaker's failure-timeout resource meta-attribute, ocf:pacemaker:remote reconnect_interval instance attribute, and time-based configuration rules are not guaranteed to be checked any more often than the value of the cluster-recheck-interval cluster property.

This leads to users setting cluster-recheck-interval very low, sometimes as low as 1min, which causes unnecessary load on the system during normal operation.


Steps to Reproduce:
1. Configure a cluster of at least 2 nodes and at least one resource.
2. Configure the resource's failure-timeout to 2min and the cluster-recheck-interval to 10min.
3. Cause a failure of the resource, then observe how long it takes to clear the failure.

Actual results: Unless a new cluster event (resource failure, configuration change, etc.) intervenes, the failure will not be cleared for 10 minutes.


Expected results: The failure is cleared shortly after 2 minutes have passed.
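
For reference, the reproducer maps onto pcs commands along these lines (the resource name "dummy" and the node name are illustrative, not taken from a specific cluster):

# pcs property set cluster-recheck-interval=10min

Set the cluster-wide recheck interval to 10 minutes.

# pcs resource create dummy ocf:pacemaker:Dummy meta failure-timeout=2min

Create a test resource with a 2-minute failure-timeout.

# crm_resource --fail --resource dummy --node node1 && date

Record a failure for the resource, then poll "pcs resource failcount show dummy" to see how long the failcount takes to expire.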

Comment 1 Ken Gaillot 2019-08-30 19:46:11 UTC
The planned implementation is to have Pacemaker's scheduler calculate the time when these features (if used) would change state, and add the minimum such time to the transition graph. Pacemaker's controller would read this value and set the recheck timer to the minimum of the cluster-recheck-interval and the time remaining until the specified time.
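
As a minimal, illustrative shell sketch of the timer choice (not Pacemaker source; variable names and example values are made up):

recheck_by=$(( $(date +%s) + 120 ))   # e.g. a failure-timeout that expires in 2 minutes
interval_s=900                        # cluster-recheck-interval in seconds (default 15 minutes)
remaining=$(( recheck_by - $(date +%s) ))
if [ "$remaining" -gt 0 ] && [ "$remaining" -lt "$interval_s" ]; then
    timer=$remaining                  # wake up early, when the next state change is due
else
    timer=$interval_s                 # otherwise fall back to the cluster-recheck-interval
fi
echo "next recheck in ${timer}s"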

Comment 2 Ken Gaillot 2019-09-24 16:05:25 UTC
Fixed upstream as of https://github.com/ClusterLabs/pacemaker/pull/1890

Note to QA: the dynamic recheck applies to resource failure-timeout, ocf:pacemaker:remote reconnect_interval, and rules with date_expression *except* date expressions with "operation" set to "date_spec". date_spec still relies on cluster-recheck-interval for evaluation. 

Example pcs commands:

# pcs resource update rsc1 meta failure-timeout=200s

Set a failure-timeout for rsc1.


# pcs node add-remote remote-host-1 reconnect_interval=60s op monitor interval="20s"

Create an ocf:pacemaker:remote resource connecting to host "remote-host-1", and attempt reconnections every 60s if a connection attempt fails.


# pcs constraint location rsc2 rule '#uname' eq 'node2' and date in_range "2019-09-24 11:00:00 -05:00" to "2019-09-24 11:05:00 -05:00"

Set a time-based location constraint so that rsc2 prefers a certain node for 5 minutes.
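
For contrast with the note above, an illustrative rule using date-spec (arbitrary business-hours values), which is *not* covered by the dynamic recheck:

# pcs constraint location rsc2 rule score=INFINITY date-spec hours=9-17 weekdays=1-5

Prefer rsc2 on any node during business hours, Monday through Friday. Because this rule uses date-spec, its activation and expiry are still only noticed when the cluster-recheck-interval timer fires.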

Comment 3 Patrik Hagara 2019-09-30 17:00:13 UTC
qa_ack+, reproducer in description and comment#2

Comment 9 michal novacek 2020-02-05 13:23:14 UTC
I have verified that the failcount is reset after failure-timeout with pacemaker-2.0.3-4.el8.x86_64, and that it is not with the earlier pacemaker-2.0.2-3.el8_1.2.x86_64.

---

Test method:

failure-timeout is set to 120s, meaning the failcount of the resource is reset to
0 after two minutes.

migration-threshold is set to 1, meaning the resource is moved to the other node
after the first failure. At the same time, a location constraint is set for the
resource to prefer node1. The resource is initially running on node1.

When the resource is killed on node1, it is moved to node2 due to
migration-threshold=1. The failure-timeout should then reset the failcount after
120s, and the resource should move back to node1. Before the fix, the
failure-timeout is not honored and the resource stays on node2 until some other
cluster event triggers a recalculation.
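
The configuration under test (full config in [1] below) corresponds to pcs commands roughly like the following; the exact commands used are not recorded here, so treat them as orientation only:

# pcs resource update webserver meta failure-timeout=120s migration-threshold=1

Set the meta attributes under test on the webserver resource.

# pcs constraint location webserver prefers virt-137=INFINITY

Make webserver prefer virt-137, so it should move back there once the failcount expires.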


Common:
------

Initially the resource is running on virt-137 with no failcount (full config: [1]).

> [root@virt-137 ~]# pcs resource
 ...
 webserver	(ocf::heartbeat:apache):	Started virt-137

> [root@virt-137 ~]# pcs resource failcount show webserver
No failcounts for resource 'webserver'

> [root@virt-137 ~]# export PS1='[\D{+%T} \u@\h \W]\$ '


Before the fix (pacemaker-2.0.2-3.el8_1.2.x86_64)
-------------------------------------------------


> [+13:32:06 root@virt-137 ~]# date && killall -9 httpd && sleep 30 && crm_resource --wait && pcs resource; pcs resource failcount show webserver
 Clone Set: locking-clone [locking]
     Started: [ virt-136 virt-137 ]
 webserver	(ocf::heartbeat:apache):	Started virt-136
Failcounts for resource 'webserver'
  virt-137: 1

> [+13:32:42 root@virt-137 ~]# sleep 120 && pcs resource
 Clone Set: locking-clone [locking]
     Started: [ virt-136 virt-137 ]
 webserver	(ocf::heartbeat:apache):	Started virt-136


After failure-timeout seconds the resource did not move back to the original node. It
moved only after the cluster-recheck-interval had passed.

After the fix (pacemaker-2.0.3-4.el8.x86_64)
--------------------------------------------

> [+14:12:20 root@virt-137 ~]# date && killall -9 httpd && sleep 30 && crm_resource --wait && pcs resource; pcs resource failcount show webserver
Wed Feb  5 14:13:07 CET 2020
  * Clone Set: locking-clone [locking]:
    * Started: [ virt-136 virt-137 ]
  * webserver	(ocf::heartbeat:apache):	Started virt-136
Failcounts for resource 'webserver'
  virt-137: 1

> [+14:13:39 root@virt-137 ~]# date && sleep 120 && pcs resource
Wed Feb  5 14:13:41 CET 2020
  * Clone Set: locking-clone [locking]:
    * Started: [ virt-136 virt-137 ]
  * webserver	(ocf::heartbeat:apache):	Started virt-137

> [+14:15:42 root@virt-137 ~]# date && pcs resource failcount show webserver
Wed Feb  5 14:17:24 CET 2020
No failcounts for resource 'webserver'

The failcount was reset correctly after 120s and the resource was moved back to
the preferred node.


--
[1]: pcs config

[root@virt-137 ~]# pcs config
Cluster Name: STSRHTS5570
Corosync Nodes:
 virt-136 virt-137
Pacemaker Nodes:
 virt-136 virt-137

Resources:
 Clone: locking-clone
  Group: locking
   Resource: dlm (class=ocf provider=pacemaker type=controld)
    Operations: monitor interval=30s (dlm-monitor-interval-30s)
                start interval=0s timeout=90s (dlm-start-interval-0s)
                stop interval=0s timeout=100s (dlm-stop-interval-0s)
   Resource: lvmlockd (class=ocf provider=heartbeat type=lvmlockd)
    Attributes: with_cmirrord=1
    Operations: monitor interval=30s (lvmlockd-monitor-interval-30s)
                start interval=0s timeout=90s (lvmlockd-start-interval-0s)
                stop interval=0s timeout=90s (lvmlockd-stop-interval-0s)
 Resource: webserver (class=ocf provider=heartbeat type=apache)
  Attributes: configfile=/etc/httpd/conf/httpd.conf httpd=/usr/sbin/httpd statusurl=http://127.0.0.1/server-status
  Meta Attrs: failure-timeout=120s migration-threshold=1
  Operations: monitor interval=30s (webserver-monitor-interval-30s)
              start interval=0s timeout=40s (webserver-start-interval-0s)
              stop interval=0s timeout=60s (webserver-stop-interval-0s)

Stonith Devices:
 Resource: fence-virt-136 (class=stonith type=fence_xvm)
  Attributes: delay=5 pcmk_host_check=static-list pcmk_host_list=virt-136 pcmk_host_map=virt-136:virt-136.cluster-qe.lab.eng.brq.redhat.com
  Operations: monitor interval=60s (fence-virt-136-monitor-interval-60s)
 Resource: fence-virt-137 (class=stonith type=fence_xvm)
  Attributes: pcmk_host_check=static-list pcmk_host_list=virt-137 pcmk_host_map=virt-137:virt-137.cluster-qe.lab.eng.brq.redhat.com
  Operations: monitor interval=60s (fence-virt-137-monitor-interval-60s)
Fencing Levels:

Location Constraints:
  Resource: webserver
    Enabled on:
      Node: virt-137 (score:INFINITY) (id:location-webserver-virt-137-INFINITY)
Ordering Constraints:
Colocation Constraints:
Ticket Constraints:

Alerts:
 No alerts defined

Resources Defaults:
 No defaults set
Operations Defaults:
 No defaults set

Cluster Properties:
 cluster-infrastructure: corosync
 cluster-name: STSRHTS5570
 dc-version: 2.0.2-3.el8_1.2-744a30d655
 have-watchdog: false
 no-quorum-policy: freeze

Quorum:
  Options:

Comment 11 michal novacek 2020-02-10 12:02:24 UTC
I have verified that a time-based constraint rule is re-evaluated at the latest after
failure-timeout seconds with pacemaker-2.0.3-4.el8.x86_64, as opposed to being
re-evaluated only after the cluster-recheck-interval with pacemaker-2.0.2-3.el8_1.2.x86_64.

---

Test method:

failure-timeout is set to 120s, which means the time-based rule is checked for
expiry (and disabled if expired) within 120s, instead of only after the
cluster-recheck-interval (15min).

To prove it works we do the following:
 - start with no time-based constraint
 - set the resource to prefer node1 with score=100 (see the command sketch after this list)
 - set a time-based constraint, lasting five minutes, that makes the resource prefer node2
 - the resource will be moved to node2 based on the time-based constraint
 - after five minutes at the latest, the resource will move back to node1 as the time-based
   constraint expires and the score-100 location constraint takes over
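
The baseline score-100 constraint from [1] below corresponds to a command like the following (shown for orientation; the exact command used is not recorded here):

# pcs constraint location webserver prefers virt-137=100

Give webserver a score-100 preference for virt-137 ("node1" above); this preference takes over again once the temporary time-based rule expires.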

Common:
------

Initially the resource is running on virt-137 (full config: [1]).

> [root@virt-136 ~]# pcs resource
 webserver      (ocf::heartbeat:apache):        Started virt-137

> [root@virt-136 ~]# pcs constraint location
Location Constraints:
  Resource: webserver
    Enabled on:
      Node: virt-137 (score:100)

Before the fix (pacemaker-2.0.2-3.el8_1.2.x86_64)
-------------------------------------------------

> [root@virt-136 ~]# pcs constraint location webserver \
        rule \
        constraint-id=time-based-prefer-virt-136 \
        '#uname' eq 'virt-136' \
        and date in_range "$(date +'%F %T')" to "$(date --date='5 minutes' +'%F %T')" \
        && crm_resource --wait \
        && date
Mon Feb 10 12:09:48 CET 2020

> [root@virt-136 ~]# date && pcs resource && sleep 300 && pcs resource
Mon Feb 10 12:09:53 CET 2020
 webserver      (ocf::heartbeat:apache):        Started virt-136

 webserver      (ocf::heartbeat:apache):        Started virt-136

# The resource is not moved back to its original location after failure-timeout;
# it only moves back once the cluster-recheck-interval fires.

> [root@virt-136 ~]# cat /var/log/pacemaker/pacemaker.log
...
Feb 10 12:24:47 virt-136 pacemaker-controld  [4710] (do_state_transition)     notice: State transition S_IDLE -> S_POLICY_ENGINE | input=I_PE_CALC cause=C_TIMER_POPPED origin=crm_timer_popped
Feb 10 12:24:47 virt-136 pacemaker-controld  [4710] (do_state_transition)     notice: State transition S_IDLE -> S_POLICY_ENGINE | input=I_PE_CALC cause=C_TIMER_POPPED origin=crm_timer_popped
Feb 10 12:24:47 virt-136 pacemaker-controld  [4710] (do_state_transition)     info: Progressed to state S_POLICY_ENGINE after C_TIMER_POPPED
Feb 10 12:24:47 virt-136 pacemaker-schedulerd[4709] (determine_online_status_fencing)         info: Node virt-136 is active
Feb 10 12:24:47 virt-136 pacemaker-schedulerd[4709] (determine_online_status)         info: Node virt-136 is online
Feb 10 12:24:47 virt-136 pacemaker-schedulerd[4709] (determine_online_status_fencing)         info: Node virt-137 is active
Feb 10 12:24:47 virt-136 pacemaker-schedulerd[4709] (determine_online_status)         info: Node virt-137 is online
Feb 10 12:24:47 virt-136 pacemaker-schedulerd[4709] (unpack_node_loop)        info: Node 1 is already processed
Feb 10 12:24:47 virt-136 pacemaker-schedulerd[4709] (unpack_node_loop)        info: Node 2 is already processed
Feb 10 12:24:47 virt-136 pacemaker-schedulerd[4709] (unpack_node_loop)        info: Node 1 is already processed
Feb 10 12:24:47 virt-136 pacemaker-schedulerd[4709] (unpack_node_loop)        info: Node 2 is already processed
Feb 10 12:24:47 virt-136 pacemaker-schedulerd[4709] (common_print)    info: fence-virt-136    (stonith:fence_xvm):    Started virt-136
Feb 10 12:24:47 virt-136 pacemaker-schedulerd[4709] (common_print)    info: fence-virt-137    (stonith:fence_xvm):    Started virt-137
Feb 10 12:24:47 virt-136 pacemaker-schedulerd[4709] (common_print)    info: webserver (ocf::heartbeat:apache):        Started virt-136
Feb 10 12:24:47 virt-136 pacemaker-schedulerd[4709] (RecurringOp)     info:  Start recurring monitor (30s) for webserver on virt-137
Feb 10 12:24:47 virt-136 pacemaker-schedulerd[4709] (LogActions)      info: Leave   fence-virt-136    (Started virt-136)
Feb 10 12:24:47 virt-136 pacemaker-schedulerd[4709] (LogActions)      info: Leave   fence-virt-137    (Started virt-137)
Feb 10 12:24:47 virt-136 pacemaker-schedulerd[4709] (LogAction)       notice:  * Move       webserver          ( virt-136 -> virt-137 )  
Feb 10 12:24:47 virt-136 pacemaker-schedulerd[4709] (pcmk__log_transition_summary)    notice: Calculated transition 11, saving inputs in /var/lib/pacemaker/pengine/pe-input-1147.bz2
Feb 10 12:24:47 virt-136 pacemaker-controld  [4710] (do_state_transition)     info: State transition S_POLICY_ENGINE -> S_TRANSITION_ENGINE | input=I_PE_SUCCESS cause=C_IPC_MESSAGE origin=handle_response
...
Feb 10 12:24:47  apache(webserver)[18928]:    INFO: Attempting graceful stop of apache PID 12493
Feb 10 12:24:49  apache(webserver)[18928]:    INFO: apache stopped.
Feb 10 12:24:49 virt-136 pacemaker-execd     [4707] (log_finished)    info: finished - rsc:webserver action:stop call_id:31 pid:18928 exit-code:0 exec-time:1176ms queue-time:0ms
Feb 10 12:24:49 virt-136 pacemaker-controld  [4710] (process_lrm_event)       notice: Result of stop operation for webserver on virt-136: 0 (ok) | call=31 key=webserver_stop_0 confirmed=true cib-update=80
Feb 10 12:24:49 virt-136 pacemaker-based     [4705] (cib_process_request)     info: Forwarding cib_modify operation for section status to all (origin=local/crmd/80)
Feb 10 12:24:49 virt-136 pacemaker-based     [4705] (cib_perform_op)  info: Diff: --- 0.15.7 2
Feb 10 12:24:49 virt-136 pacemaker-based     [4705] (cib_perform_op)  info: Diff: +++ 0.15.8 (null)
Feb 10 12:24:49 virt-136 pacemaker-based     [4705] (cib_perform_op)  info: +  /cib:  @num_updates=8
Feb 10 12:24:49 virt-136 pacemaker-based     [4705] (cib_perform_op)  info: +  /cib/status/node_state[@id='1']/lrm[@id='1']/lrm_resources/lrm_resource[@id='webserver']/lrm_rsc_op[@id='webserver_last_0']:  @transition-magic=0:0;8:11:0:e8fe8897-a242-49c3-a717-35f87cc5ac47, @call-id=31, @rc-code=0, @op-status=0, @exec-time=1176
Feb 10 12:24:49 virt-136 pacemaker-based     [4705] (cib_process_request)     info: Completed cib_modify operation for section status: OK (rc=0, origin=virt-136/crmd/80, version=0.15.8)
Feb 10 12:24:49 virt-136 pacemaker-controld  [4710] (match_graph_event)       info: Action webserver_stop_0 (8) confirmed on virt-136 (rc=0)
Feb 10 12:24:49 virt-136 pacemaker-controld  [4710] (te_rsc_command)  notice: Initiating start operation webserver_start_0 on virt-137 | action 9
Feb 10 12:24:49 virt-136 pacemaker-based     [4705] (cib_perform_op)  info: Diff: --- 0.15.8 2
Feb 10 12:24:49 virt-136 pacemaker-based     [4705] (cib_perform_op)  info: Diff: +++ 0.15.9 (null)
Feb 10 12:24:49 virt-136 pacemaker-based     [4705] (cib_perform_op)  info: +  /cib:  @num_updates=9
Feb 10 12:24:49 virt-136 pacemaker-based     [4705] (cib_perform_op)  info: +  /cib/status/node_state[@id='2']/lrm[@id='2']/lrm_resources/lrm_resource[@id='webserver']/lrm_rsc_op[@id='webserver_last_0']:  @operation_key=webserver_start_0, @operation=start, @transition-key=9:11:0:e8fe8897-a242-49c3-a717-35f87cc5ac47, @transition-magic=-1:193;9:11:0:e8fe8897-a242-49c3-a717-35f87cc5ac47, @call-id=-1, @rc-code=193, @op-status=-1, @last-run=1581333889, @last-rc-change=158133388
Feb 10 12:24:49 virt-136 pacemaker-based     [4705] (cib_process_request)     info: Completed cib_modify operation for section status: OK (rc=0, origin=virt-137/crmd/40, version=0.15.9)
Feb 10 12:24:50 virt-136 pacemaker-based     [4705] (cib_perform_op)  info: Diff: --- 0.15.9 2
Feb 10 12:24:50 virt-136 pacemaker-based     [4705] (cib_perform_op)  info: Diff: +++ 0.15.10 (null)
Feb 10 12:24:50 virt-136 pacemaker-based     [4705] (cib_perform_op)  info: +  /cib:  @num_updates=10
Feb 10 12:24:50 virt-136 pacemaker-based     [4705] (cib_perform_op)  info: +  /cib/status/node_state[@id='2']/lrm[@id='2']/lrm_resources/lrm_resource[@id='webserver']/lrm_rsc_op[@id='webserver_last_0']:  @transition-magic=0:0;9:11:0:e8fe8897-a242-49c3-a717-35f87cc5ac47, @call-id=28, @rc-code=0, @op-status=0, @exec-time=1320
Feb 10 12:24:50 virt-136 pacemaker-based     [4705] (cib_process_request)     info: Completed cib_modify operation for section status: OK (rc=0, origin=virt-137/crmd/41, version=0.15.10)
Feb 10 12:24:50 virt-136 pacemaker-controld  [4710] (match_graph_event)       info: Action webserver_start_0 (9) confirmed on virt-137 (rc=0)
Feb 10 12:24:50 virt-136 pacemaker-controld  [4710] (te_rsc_command)  notice: Initiating monitor operation webserver_monitor_30000 on virt-137 | action 10
Feb 10 12:24:50 virt-136 pacemaker-based     [4705] (cib_perform_op)  info: Diff: --- 0.15.10 2
Feb 10 12:24:50 virt-136 pacemaker-based     [4705] (cib_perform_op)  info: Diff: +++ 0.15.11 (null)
Feb 10 12:24:50 virt-136 pacemaker-based     [4705] (cib_perform_op)  info: +  /cib:  @num_updates=11
Feb 10 12:24:50 virt-136 pacemaker-based     [4705] (cib_perform_op)  info: +  /cib/status/node_state[@id='2']/lrm[@id='2']/lrm_resources/lrm_resource[@id='webserver']/lrm_rsc_op[@id='webserver_monitor_30000']:  @transition-key=10:11:0:e8fe8897-a242-49c3-a717-35f87cc5ac47, @transition-magic=-1:193;10:11:0:e8fe8897-a242-49c3-a717-35f87cc5ac47, @call-id=-1, @rc-code=193, @op-status=-1, @last-rc-change=1581333890, @exec-time=0
Feb 10 12:24:50 virt-136 pacemaker-based     [4705] (cib_process_request)     info: Completed cib_modify operation for section status: OK (rc=0, origin=virt-137/crmd/42, version=0.15.11)
Feb 10 12:24:50 virt-136 pacemaker-based     [4705] (cib_perform_op)  info: Diff: --- 0.15.11 2
Feb 10 12:24:50 virt-136 pacemaker-based     [4705] (cib_perform_op)  info: Diff: +++ 0.15.12 (null)
Feb 10 12:24:50 virt-136 pacemaker-based     [4705] (cib_perform_op)  info: +  /cib:  @num_updates=12
Feb 10 12:24:50 virt-136 pacemaker-based     [4705] (cib_perform_op)  info: +  /cib/status/node_state[@id='2']/lrm[@id='2']/lrm_resources/lrm_resource[@id='webserver']/lrm_rsc_op[@id='webserver_monitor_30000']:  @transition-magic=0:0;10:11:0:e8fe8897-a242-49c3-a717-35f87cc5ac47, @call-id=29, @rc-code=0, @op-status=0, @exec-time=100
Feb 10 12:24:50 virt-136 pacemaker-based     [4705] (cib_process_request)     info: Completed cib_modify operation for section status: OK (rc=0, origin=virt-137/crmd/43, version=0.15.12)
Feb 10 12:24:50 virt-136 pacemaker-controld  [4710] (match_graph_event)       info: Action webserver_monitor_30000 (10) confirmed on virt-137 (rc=0)
Feb 10 12:24:50 virt-136 pacemaker-controld  [4710] (run_graph)       notice: Transition 11 (Complete=3, Pending=0, Fired=0, Skipped=0, Incomplete=0, Source=/var/lib/pacemaker/pengine/pe-input-1147.bz2): Complete
Feb 10 12:24:50 virt-136 pacemaker-controld  [4710] (do_log)  info: Input I_TE_SUCCESS received in state S_TRANSITION_ENGINE from notify_crmd
Feb 10 12:24:50 virt-136 pacemaker-controld  [4710] (do_state_transition)     notice: State transition S_TRANSITION_ENGINE -> S_IDLE | input=I_TE_SUCCESS cause=C_FSA_INTERNAL origin=not


After the fix (pacemaker-2.0.3-4.el8.x86_64)
--------------------------------------------

> [root@virt-136 ~]# pcs constraint location webserver \
        rule \
        constraint-id=time-based-prefer-virt-136 \
        '#uname' eq 'virt-136' \
        and date in_range "$(date +'%F %T')" to "$(date --date='5 minutes' +'%F %T')" \
        && crm_resource --wait \
        && date
Mon Feb 10 11:30:23 CET 2020

> [root@virt-136 ~]# date && pcs resource && sleep 300 && pcs resource
Mon Feb 10 11:30:34 CET 2020

  webserver   (ocf::heartbeat:apache):        Started virt-136

  webserver   (ocf::heartbeat:apache):        Started virt-137

> [root@virt-136 ~]# cat /var/log/pacemaker/pacemaker.log
...
> # time-based rule comes into effect
Feb 10 11:30:23 virt-136 pacemaker-controld[7157]: notice: State transition S_IDLE -> S_POLICY_ENGINE
Feb 10 11:30:23 virt-136 pacemaker-schedulerd[7156]: notice:  * Move       webserver          ( virt-137 -> virt-136 )
Feb 10 11:30:23 virt-136 pacemaker-schedulerd[7156]: notice: Calculated transition 495, saving inputs in /var/lib/pacemaker/pengine/pe-input-1130.bz2
Feb 10 11:30:23 virt-136 pacemaker-controld[7157]: notice: Initiating stop operation webserver_stop_0 on virt-137
Feb 10 11:30:24 virt-136 pacemaker-controld[7157]: notice: Initiating start operation webserver_start_0 locally on virt-136
Feb 10 11:30:25 virt-136 httpd[11637]: Server configured, listening on: port 80
Feb 10 11:30:25 virt-136 pacemaker-controld[7157]: notice: Result of start operation for webserver on virt-136: 0 (ok)
Feb 10 11:30:25 virt-136 pacemaker-controld[7157]: notice: Initiating monitor operation webserver_monitor_30000 locally on virt-136
Feb 10 11:30:25 virt-136 pacemaker-controld[7157]: notice: Result of monitor operation for webserver on virt-136: 0 (ok)
Feb 10 11:30:25 virt-136 pacemaker-controld[7157]: notice: Transition 495 (Complete=3, Pending=0, Fired=0, Skipped=0, Incomplete=0, Source=/var/lib/pacemaker/pengine/pe-input-1130.bz2): Complete
Feb 10 11:30:25 virt-136 pacemaker-controld[7157]: notice: State transition S_TRANSITION_ENGINE -> S_IDLE
...
> # time-based rule expires.
Feb 10 11:35:23 virt-136 pacemaker-controld[7157]: notice: State transition S_IDLE -> S_POLICY_ENGINE
Feb 10 11:35:23 virt-136 pacemaker-schedulerd[7156]: notice:  * Move       webserver          ( virt-136 -> virt-137 )
Feb 10 11:35:23 virt-136 pacemaker-schedulerd[7156]: notice: Calculated transition 496, saving inputs in /var/lib/pacemaker/pengine/pe-input-1131.bz2
Feb 10 11:35:23 virt-136 pacemaker-controld[7157]: notice: Initiating stop operation webserver_stop_0 locally on virt-136
Feb 10 11:35:23 virt-136 apache(webserver)[14418]: INFO: Attempting graceful stop of apache PID 11637
Feb 10 11:35:25 virt-136 apache(webserver)[14418]: INFO: apache stopped.
Feb 10 11:35:25 virt-136 pacemaker-controld[7157]: notice: Result of stop operation for webserver on virt-136: 0 (ok)
Feb 10 11:35:25 virt-136 pacemaker-controld[7157]: notice: Initiating start operation webserver_start_0 on virt-137
Feb 10 11:35:26 virt-136 pacemaker-controld[7157]: notice: Initiating monitor operation webserver_monitor_30000 on virt-137
Feb 10 11:35:26 virt-136 pacemaker-controld[7157]: notice: Transition 496 (Complete=3, Pending=0, Fired=0, Skipped=0, Incomplete=0, Source=/var/lib/pacemaker/pengine/pe-input-1131.bz2): Complete
Feb 10 11:35:26 virt-136 pacemaker-controld[7157]: notice: State transition S_TRANSITION_ENGINE -> S_IDLE
...




----
[1]: pcs config

[root@virt-136 ~]# pcs config
Cluster Name: STSRHTS5570
Corosync Nodes:
 virt-136 virt-137
Pacemaker Nodes:
 virt-136 virt-137

Resources:
 Resource: webserver (class=ocf provider=heartbeat type=apache)
  Attributes: configfile=/etc/httpd/conf/httpd.conf httpd=/usr/sbin/httpd statusurl=http://127.0.0.1/server-status
  Meta Attrs: failure-timeout=120s
  Operations: monitor interval=30s (webserver-monitor-interval-30s)
              start interval=0s timeout=40s (webserver-start-interval-0s)
              stop interval=0s timeout=60s (webserver-stop-interval-0s)

Stonith Devices:
 Resource: fence-virt-136 (class=stonith type=fence_xvm)
  Attributes: delay=5 pcmk_host_check=static-list pcmk_host_list=virt-136 pcmk_host_map=virt-136:virt-136.cluster-qe.lab.eng.brq.redhat.com
  Operations: monitor interval=60s (fence-virt-136-monitor-interval-60s)
 Resource: fence-virt-137 (class=stonith type=fence_xvm)
  Attributes: pcmk_host_check=static-list pcmk_host_list=virt-137 pcmk_host_map=virt-137:virt-137.cluster-qe.lab.eng.brq.redhat.com
  Operations: monitor interval=60s (fence-virt-137-monitor-interval-60s)
Fencing Levels:

Location Constraints:
  Resource: webserver
    Enabled on:
      Node: virt-137 (score:100) (id:location-webserver-virt-137-100)

Ordering Constraints:
Colocation Constraints:
Ticket Constraints:

Alerts:
 No alerts defined

Resources Defaults:
 No defaults set
Operations Defaults:
 No defaults set

Cluster Properties:
 cluster-infrastructure: corosync
 cluster-name: STSRHTS5570
 dc-version: 2.0.3-4.el8-4b1f869f0f
 have-watchdog: false
 last-lrm-refresh: 1580905822
 no-quorum-policy: freeze

Quorum:
  Options:

Comment 13 michal novacek 2020-02-13 12:52:42 UTC
Created attachment 1662954 [details]
pcs cluster report


Cluster report taken after a remote resource had not been moved back to its original node.

Comment 20 errata-xmlrpc 2020-04-28 15:38:28 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHEA-2020:1609

