Bug 1566374 - [OSP] on an IHA setup pacemaker-1.1.18-11 moves VIPs around unnecessarily
Status: CLOSED ERRATA
Product: Red Hat OpenStack
Classification: Red Hat
Component: puppet-tripleo
Version: 13.0 (Queens)
Hardware: All
OS: Linux
Priority: urgent
Severity: urgent
Target Milestone: beta
Target Release: 13.0 (Queens)
Assigned To: Michele Baldessari
QA Contact: pkomarov
Keywords: Triaged
Depends On:
Blocks:
Reported: 2018-04-12 03:56 EDT by Michele Baldessari
Modified: 2018-06-27 09:51 EDT
CC List: 10 users

See Also:
Fixed In Version: puppet-tripleo-8.3.2-0.20180411174306.el7ost, puppet-pacemaker-0.7.2-0.20180413040146.44ef58f.el7ost
Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of:
Environment:
Last Closed: 2018-06-27 09:50:58 EDT
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---


Attachments: (none)


External Trackers
Tracker ID Priority Status Summary Last Updated
Launchpad 1763586 None None None 2018-04-13 02:54 EDT
OpenStack gerrit 553206 None master: MERGED puppet-pacemaker: Allow meta_params and op_params on resource::ip (Iadf0cd3805f72141563707f43130945c9d362f5c) 2018-04-17 14:40 EDT
OpenStack gerrit 561300 None stable/queens: MERGED puppet-tripleo: Add resource-stickiness=INFINITY to VIPs (I6862452d2250ac4c2c3e04840983510a3cd13536) 2018-04-17 14:40 EDT
Red Hat Product Errata RHEA-2018:2086 None None None 2018-06-27 09:51 EDT

Description Michele Baldessari 2018-04-12 03:56:53 EDT
Description of problem:
Pini has observed that fence_compute often gets the following error:
Apr 12 06:37:23 [979787] controller-1 stonith-ng:  warning: log_action: fence_compute[259311] stderr: [ REQ: curl -g -i -X GET http://10.0.0.110:5000 -H "Accept: application/json" -H "User-Agent: python-keystoneclient" ]
Apr 12 06:37:23 [979787] controller-1 stonith-ng:  warning: log_action: fence_compute[259311] stderr: [ Starting new HTTP connection (1): 10.0.0.110 ]
Apr 12 06:37:23 [979787] controller-1 stonith-ng:  warning: log_action: fence_compute[259311] stderr: [ keystoneauth1.exceptions.connection.ConnectFailure: Unable to establish connection to http://10.0.0.110:5000 ]
Apr 12 06:37:28 [979787] controller-1 stonith-ng:  warning: log_action: fence_compute[261144] stderr: [ REQ: curl -g -i -X GET http://10.0.0.110:5000 -H "Accept: application/json" -H "User-Agent: python-keystoneclient" ]
Apr 12 06:37:28 [979787] controller-1 stonith-ng:  warning: log_action: fence_compute[261144] stderr: [ Starting new HTTP connection (1): 10.0.0.110 ]
Apr 12 06:37:28 [979787] controller-1 stonith-ng:  warning: log_action: fence_compute[261144] stderr: [ keystoneauth1.exceptions.connection.ConnectFailure: Unable to establish connection to http://10.0.0.110:5000 ]

This happens after one of the compute nodes is manually crashed. The reason for these errors during evacuation seems to be that pengine decides to move the VIP from one node to another:

Apr 12 06:37:04 [979790] controller-1    pengine:   notice: LogAction:   * Move       ip-10.0.0.110                        (    controller-1 -> controller-0 )
Apr 12 06:37:04 [979790] controller-1    pengine:     info: common_print:       ip-10.0.0.110   (ocf::heartbeat:IPaddr2):       Started controller-1
Apr 12 06:37:04 [979790] controller-1    pengine:     info: RecurringOp:         Start recurring monitor (10s) for ip-10.0.0.110 on controller-0
Apr 12 06:37:04 [979790] controller-1    pengine:   notice: LogAction:   * Move       ip-10.0.0.110                        (    controller-1 -> controller-0 )
Apr 12 06:37:04 [979791] controller-1       crmd:   notice: te_rsc_command:     Initiating stop operation ip-10.0.0.110_stop_0 locally on controller-1 | action 182
Apr 12 06:37:04 [979788] controller-1       lrmd:     info: cancel_recurring_action:    Cancelling ocf operation ip-10.0.0.110_monitor_10000
Apr 12 06:37:04 [979791] controller-1       crmd:     info: do_lrm_rsc_op:      Performing key=182:89:0:2d1c6e04-5bcc-4876-b4c2-40564b4a3912 op=ip-10.0.0.110_stop_0
Apr 12 06:37:04 [979788] controller-1       lrmd:     info: log_execute:        executing - rsc:ip-10.0.0.110 action:stop call_id:154
Apr 12 06:37:04 [979791] controller-1       crmd:     info: process_lrm_event:  Result of monitor operation for ip-10.0.0.110 on controller-1: Cancelled | call=138 key=ip-10.0.0.110_monitor_

This unnecessary VIP move seems to be what is tripping fence_compute up in this case.
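
For reference, a couple of read-only commands that show why pengine relocated the VIP and whether any stickiness is in effect (a sketch; resource and node names are taken from the logs above):

# Show the allocation scores pengine computed for each node (the VIP follows the highest score):
crm_simulate --show-scores --live-check | grep ip-10.0.0.110

# Check for a per-resource or cluster-wide stickiness setting:
pcs resource show ip-10.0.0.110
pcs resource defaults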
Comment 5 Andrew Beekhof 2018-04-13 01:39:28 EDT
Do we tell it not to though?
Is there any kind of stickiness set?
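
With no stickiness set (the default is 0), pengine is free to relocate VIPs whenever node scores shift. A minimal sketch of pinning them with pcs, which is what the eventual fix applies through puppet-tripleo at deploy time (resource name illustrative):

# Pin a single VIP to whichever node currently runs it:
pcs resource meta ip-10.0.0.110 resource-stickiness=INFINITY

# Alternatively, raise the cluster-wide default so all resources prefer to stay put:
pcs resource defaults resource-stickiness=INFINITY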
Comment 15 pkomarov 2018-04-23 01:47:59 EDT
Verified.

Deployment tested on core_puddle_version 2018-04-19.2:

[root@controller-0 ~]# pcs config|grep ' Resource: ip\|Meta Attrs:'

 Resource: ip-192.168.24.9 (class=ocf provider=heartbeat type=IPaddr2)
  Meta Attrs: resource-stickiness=INFINITY 
 Resource: ip-10.0.0.104 (class=ocf provider=heartbeat type=IPaddr2)
  Meta Attrs: resource-stickiness=INFINITY 
 Resource: ip-172.17.1.19 (class=ocf provider=heartbeat type=IPaddr2)
  Meta Attrs: resource-stickiness=INFINITY 
 Resource: ip-172.17.1.10 (class=ocf provider=heartbeat type=IPaddr2)
  Meta Attrs: resource-stickiness=INFINITY 
 Resource: ip-172.17.3.10 (class=ocf provider=heartbeat type=IPaddr2)
  Meta Attrs: resource-stickiness=INFINITY 
 Resource: ip-172.17.4.17 (class=ocf provider=heartbeat type=IPaddr2)
  Meta Attrs: resource-stickiness=INFINITY 

VIP stickiness verified, i.e. the VIP remains on the same node. Tested against compute node reboots, and verified using the instance-ha Rally plugin:
https://code.engineering.redhat.com/gerrit/gitweb?p=openstack-pidone-qe.git;a=tree;f=CI;h=77821b76aad84ef01f3bb32c752b09e4d153d545;hb=HEAD
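
A quick way to re-check every VIP at once (a sketch; assumes the pcs 0.9 output format shown above):

# Confirm each IPaddr2 resource carries the INFINITY stickiness meta attribute:
for r in $(pcs resource show | awk '/IPaddr2/ {print $1}'); do
  echo "== $r"; pcs resource show "$r" | grep -i stickiness
done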
Comment 18 errata-xmlrpc 2018-06-27 09:50:58 EDT
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHEA-2018:2086
