Description of problem:

Pini has observed that fence_compute often gets the following error:

Apr 12 06:37:23 [979787] controller-1 stonith-ng: warning: log_action: fence_compute[259311] stderr: [ REQ: curl -g -i -X GET http://10.0.0.110:5000 -H "Accept: application/json" -H "User-Agent: python-keystoneclient" ]
Apr 12 06:37:23 [979787] controller-1 stonith-ng: warning: log_action: fence_compute[259311] stderr: [ Starting new HTTP connection (1): 10.0.0.110 ]
Apr 12 06:37:23 [979787] controller-1 stonith-ng: warning: log_action: fence_compute[259311] stderr: [ keystoneauth1.exceptions.connection.ConnectFailure: Unable to establish connection to http://10.0.0.110:5000 ]
Apr 12 06:37:28 [979787] controller-1 stonith-ng: warning: log_action: fence_compute[261144] stderr: [ REQ: curl -g -i -X GET http://10.0.0.110:5000 -H "Accept: application/json" -H "User-Agent: python-keystoneclient" ]
Apr 12 06:37:28 [979787] controller-1 stonith-ng: warning: log_action: fence_compute[261144] stderr: [ Starting new HTTP connection (1): 10.0.0.110 ]
Apr 12 06:37:28 [979787] controller-1 stonith-ng: warning: log_action: fence_compute[261144] stderr: [ keystoneauth1.exceptions.connection.ConnectFailure: Unable to establish connection to http://10.0.0.110:5000 ]

This happens after one of the compute nodes is manually crashed. The reason for these failures seems to be that pengine decides to move the VIP from one node to another:

Apr 12 06:37:04 [979790] controller-1 pengine: notice: LogAction: * Move ip-10.0.0.110 ( controller-1 -> controller-0 )
Apr 12 06:37:04 [979790] controller-1 pengine: info: common_print: ip-10.0.0.110 (ocf::heartbeat:IPaddr2): Started controller-1
Apr 12 06:37:04 [979790] controller-1 pengine: info: RecurringOp: Start recurring monitor (10s) for ip-10.0.0.110 on controller-0
Apr 12 06:37:04 [979790] controller-1 pengine: notice: LogAction: * Move ip-10.0.0.110 ( controller-1 -> controller-0 )
Apr 12 06:37:04 [979791] controller-1 crmd: notice: te_rsc_command: Initiating stop operation ip-10.0.0.110_stop_0 locally on controller-1 | action 182
Apr 12 06:37:04 [979788] controller-1 lrmd: info: cancel_recurring_action: Cancelling ocf operation ip-10.0.0.110_monitor_10000
Apr 12 06:37:04 [979791] controller-1 crmd: info: do_lrm_rsc_op: Performing key=182:89:0:2d1c6e04-5bcc-4876-b4c2-40564b4a3912 op=ip-10.0.0.110_stop_0
Apr 12 06:37:04 [979788] controller-1 lrmd: info: log_execute: executing - rsc:ip-10.0.0.110 action:stop call_id:154
Apr 12 06:37:04 [979791] controller-1 crmd: info: process_lrm_event: Result of monitor operation for ip-10.0.0.110 on controller-1: Cancelled | call=138 key=ip-10.0.0.110_monitor_

This unnecessary VIP move appears to be what is tripping fence_compute up in this case.
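(For anyone reproducing this: one way to see why pengine decided to relocate the VIP is to replay the live CIB with allocation scores. The commands below are only a diagnostic sketch; the resource name ip-10.0.0.110 is taken from the logs above, and the flags are standard pacemaker/pcs tooling.)

# Where is the keystone VIP currently running?
pcs status resources | grep ip-10.0.0.110

# Replay the current cluster state and print the allocation scores
# that explain why pengine prefers one controller over another
crm_simulate --live-check --show-scores | grep ip-10.0.0.110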
Do we tell it not to though? Is there any kind of stickiness set?
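(If no stickiness is defined, the scheduler is free to rebalance VIPs whenever node scores change. As a sketch, stickiness could be set either per resource or as a cluster-wide default; the resource name below is taken from the logs above.)

# Pin a single VIP to whichever node currently runs it
pcs resource meta ip-10.0.0.110 resource-stickiness=INFINITY

# Or set a default stickiness for all resources
pcs resource defaults resource-stickiness=INFINITY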
Verified. Deployment tested on core_puddle_version: 2018-04-19.2:

[root@controller-0 ~]# pcs config | grep ' Resource: ip\|Meta Attrs:'
 Resource: ip-192.168.24.9 (class=ocf provider=heartbeat type=IPaddr2)
  Meta Attrs: resource-stickiness=INFINITY
 Resource: ip-10.0.0.104 (class=ocf provider=heartbeat type=IPaddr2)
  Meta Attrs: resource-stickiness=INFINITY
 Resource: ip-172.17.1.19 (class=ocf provider=heartbeat type=IPaddr2)
  Meta Attrs: resource-stickiness=INFINITY
 Resource: ip-172.17.1.10 (class=ocf provider=heartbeat type=IPaddr2)
  Meta Attrs: resource-stickiness=INFINITY
 Resource: ip-172.17.3.10 (class=ocf provider=heartbeat type=IPaddr2)
  Meta Attrs: resource-stickiness=INFINITY
 Resource: ip-172.17.4.17 (class=ocf provider=heartbeat type=IPaddr2)
  Meta Attrs: resource-stickiness=INFINITY

VIP stickiness verified, i.e. each VIP remains on the same node. Tested against compute node reboots, using the instance-ha rally plugin:
https://code.engineering.redhat.com/gerrit/gitweb?p=openstack-pidone-qe.git;a=tree;f=CI;h=77821b76aad84ef01f3bb32c752b09e4d153d545;hb=HEAD
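(Independent of the rally plugin, a quick manual spot-check looks like the following; the crash/reboot step is a placeholder for whichever compute node is taken down in the test.)

# Note which node hosts each VIP before the test
pcs status resources | grep 'ip-'

# ... crash or reboot one compute node and wait for fencing/evacuation ...

# Re-check: every VIP should still report the same node as before
pcs status resources | grep 'ip-'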
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHEA-2018:2086