Bug 1566374 - [OSP] on an IHA setup pacemaker-1.1.18-11 moves VIPs around unnecessarily
Summary: [OSP] on an IHA setup pacemaker-1.1.18-11 moves VIPs around unnecessarily
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat OpenStack
Classification: Red Hat
Component: puppet-tripleo
Version: 13.0 (Queens)
Hardware: All
OS: Linux
Priority: urgent
Severity: urgent
Target Milestone: beta
Target Release: 13.0 (Queens)
Assignee: Michele Baldessari
QA Contact: pkomarov
URL:
Whiteboard:
Depends On:
Blocks:
Reported: 2018-04-12 07:56 UTC by Michele Baldessari
Modified: 2018-06-27 13:51 UTC (History)

Fixed In Version: puppet-tripleo-8.3.2-0.20180411174306.el7ost puppet-pacemaker-0.7.2-0.20180413040146.44ef58f.el7ost
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2018-06-27 13:50:58 UTC
Target Upstream Version:




Links
System | ID | Priority | Status | Summary | Last Updated
Launchpad | 1763586 | None | None | None | 2018-04-13 06:54:12 UTC
OpenStack gerrit | 553206 | None | master: MERGED | puppet-pacemaker: Allow meta_params and op_params on resource::ip (Iadf0cd3805f72141563707f43130945c9d362f5c) | 2018-04-17 18:40:52 UTC
OpenStack gerrit | 561300 | None | stable/queens: MERGED | puppet-tripleo: Add resource-stickiness=INFINITY to VIPs (I6862452d2250ac4c2c3e04840983510a3cd13536) | 2018-04-17 18:40:47 UTC
Red Hat Product Errata | RHEA-2018:2086 | None | None | None | 2018-06-27 13:51:40 UTC

Description Michele Baldessari 2018-04-12 07:56:53 UTC
Description of problem:
Pini has observed that fence_compute often gets the following error:
Apr 12 06:37:23 [979787] controller-1 stonith-ng:  warning: log_action: fence_compute[259311] stderr: [ REQ: curl -g -i -X GET http://10.0.0.110:5000 -H "Accept: application/json" -H "User-Agent: python-keystoneclient" ]                                                                                                                                                                                                               
Apr 12 06:37:23 [979787] controller-1 stonith-ng:  warning: log_action: fence_compute[259311] stderr: [ Starting new HTTP connection (1): 10.0.0.110 ]                                                                                       
Apr 12 06:37:23 [979787] controller-1 stonith-ng:  warning: log_action: fence_compute[259311] stderr: [ keystoneauth1.exceptions.connection.ConnectFailure: Unable to establish connection to http://10.0.0.110:5000 ]                                                                                                                                                                                                                     
Apr 12 06:37:28 [979787] controller-1 stonith-ng:  warning: log_action: fence_compute[261144] stderr: [ REQ: curl -g -i -X GET http://10.0.0.110:5000 -H "Accept: application/json" -H "User-Agent: python-keystoneclient" ]                                                                                                                                                                                                               
Apr 12 06:37:28 [979787] controller-1 stonith-ng:  warning: log_action: fence_compute[261144] stderr: [ Starting new HTTP connection (1): 10.0.0.110 ]                                                                                       
Apr 12 06:37:28 [979787] controller-1 stonith-ng:  warning: log_action: fence_compute[261144] stderr: [ keystoneauth1.exceptions.connection.ConnectFailure: Unable to establish connection to http://10.0.0.110:5000 ]

This happens after one of the compute nodes is manually crashed.
The reason for these connection failures seems to be that pengine decides to move the VIP from one node to another:

Apr 12 06:37:04 [979790] controller-1    pengine:   notice: LogAction:   * Move       ip-10.0.0.110                        (    controller-1 -> controller-0 )
Apr 12 06:37:04 [979790] controller-1    pengine:     info: common_print:       ip-10.0.0.110   (ocf::heartbeat:IPaddr2):       Started controller-1                                                                                         
Apr 12 06:37:04 [979790] controller-1    pengine:     info: RecurringOp:         Start recurring monitor (10s) for ip-10.0.0.110 on controller-0                                                                                             
Apr 12 06:37:04 [979790] controller-1    pengine:   notice: LogAction:   * Move       ip-10.0.0.110                        (    controller-1 -> controller-0 )                                                                               
Apr 12 06:37:04 [979791] controller-1       crmd:   notice: te_rsc_command:     Initiating stop operation ip-10.0.0.110_stop_0 locally on controller-1 | action 182                                                                          
Apr 12 06:37:04 [979788] controller-1       lrmd:     info: cancel_recurring_action:    Cancelling ocf operation ip-10.0.0.110_monitor_10000                                                                                                 
Apr 12 06:37:04 [979791] controller-1       crmd:     info: do_lrm_rsc_op:      Performing key=182:89:0:2d1c6e04-5bcc-4876-b4c2-40564b4a3912 op=ip-10.0.0.110_stop_0                                                                         
Apr 12 06:37:04 [979788] controller-1       lrmd:     info: log_execute:        executing - rsc:ip-10.0.0.110 action:stop call_id:154                                                                                                        
Apr 12 06:37:04 [979791] controller-1       crmd:     info: process_lrm_event:  Result of monitor operation for ip-10.0.0.110 on controller-1: Cancelled | call=138 key=ip-10.0.0.110_monitor_                                               

This unnecessary VIP move seems to be what is tripping fence_compute up in this case.
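
To spot such moves in a larger log, a small script can help. The following is a minimal sketch, assuming the pengine "LogAction ... Move" line format shown in the excerpt above. The names MOVE_RE and find_moves are hypothetical helpers and are not part of pacemaker.

```python
import re

# Assumed format, taken from the log excerpt in this report:
#   ... pengine:   notice: LogAction:   * Move   <resource>   ( <source> -> <destination> )
MOVE_RE = re.compile(
    r"pengine:\s+notice:\s+LogAction:\s+\*\s+Move\s+(?P<rsc>\S+)\s+"
    r"\(\s*(?P<src>\S+)\s+->\s+(?P<dst>\S+)\s*\)"
)


def find_moves(lines):
    """Return a (resource, source, destination) tuple for each Move action."""
    moves = []
    for line in lines:
        match = MOVE_RE.search(line)
        if match:
            moves.append((match.group("rsc"), match.group("src"), match.group("dst")))
    return moves


# Two lines copied from the excerpt above; only the first is a Move action.
sample = [
    "Apr 12 06:37:04 [979790] controller-1    pengine:   notice: LogAction:   * Move       ip-10.0.0.110                        (    controller-1 -> controller-0 )",
    "Apr 12 06:37:04 [979790] controller-1    pengine:     info: RecurringOp:         Start recurring monitor (10s) for ip-10.0.0.110 on controller-0",
]

print(find_moves(sample))
```

Feeding the lines of a pacemaker log file to find_moves would list every VIP relocation, which can then be compared against the timestamps of the fence_compute connection failures.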

Comment 5 Andrew Beekhof 2018-04-13 05:39:28 UTC
Do we tell it not to though?
Is there any kind of stickiness set?

Comment 15 pkomarov 2018-04-23 05:47:59 UTC
Verified.

Deployment tested on core_puddle_version: 2018-04-19.2:

[root@controller-0 ~]# pcs config|grep ' Resource: ip\|Meta Attrs:'

 Resource: ip-192.168.24.9 (class=ocf provider=heartbeat type=IPaddr2)
  Meta Attrs: resource-stickiness=INFINITY 
 Resource: ip-10.0.0.104 (class=ocf provider=heartbeat type=IPaddr2)
  Meta Attrs: resource-stickiness=INFINITY 
 Resource: ip-172.17.1.19 (class=ocf provider=heartbeat type=IPaddr2)
  Meta Attrs: resource-stickiness=INFINITY 
 Resource: ip-172.17.1.10 (class=ocf provider=heartbeat type=IPaddr2)
  Meta Attrs: resource-stickiness=INFINITY 
 Resource: ip-172.17.3.10 (class=ocf provider=heartbeat type=IPaddr2)
  Meta Attrs: resource-stickiness=INFINITY 
 Resource: ip-172.17.4.17 (class=ocf provider=heartbeat type=IPaddr2)
  Meta Attrs: resource-stickiness=INFINITY 
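
The same check can be scripted. The following is a minimal sketch, assuming the `pcs config` output layout shown above (a "Resource:" line followed by a "Meta Attrs:" line); other pcs versions may print this differently. The function name vips_missing_stickiness is hypothetical.

```python
def vips_missing_stickiness(pcs_config_text, expected="resource-stickiness=INFINITY"):
    """Return names of IPaddr2 resources whose Meta Attrs lack `expected`."""
    missing = []
    current = None  # IPaddr2 resource still waiting for a matching Meta Attrs line
    for raw in pcs_config_text.splitlines():
        line = raw.strip()
        if line.startswith("Resource: "):
            # A new resource begins; the previous VIP never showed the attribute.
            if current is not None:
                missing.append(current)
            current = line.split()[1] if "type=IPaddr2" in line else None
        elif line.startswith("Meta Attrs:") and current is not None:
            if expected in line.split(":", 1)[1].split():
                current = None  # stickiness confirmed for this VIP
    if current is not None:
        missing.append(current)
    return missing


# Two resources copied from the output above.
sample_config = """\
 Resource: ip-192.168.24.9 (class=ocf provider=heartbeat type=IPaddr2)
  Meta Attrs: resource-stickiness=INFINITY 
 Resource: ip-10.0.0.104 (class=ocf provider=heartbeat type=IPaddr2)
  Meta Attrs: resource-stickiness=INFINITY 
"""

print(vips_missing_stickiness(sample_config))
```

An empty list means every VIP in the parsed text carries the expected meta attribute.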

VIP stickiness verified, i.e. each VIP remains on the same node. Tested against compute node reboots, and verified using the instance-ha rally plugin:
https://code.engineering.redhat.com/gerrit/gitweb?p=openstack-pidone-qe.git;a=tree;f=CI;h=77821b76aad84ef01f3bb32c752b09e4d153d545;hb=HEAD

Comment 18 errata-xmlrpc 2018-06-27 13:50:58 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHEA-2018:2086

