Description of problem:

Pini has observed that fence_compute often gets the following error:

Apr 12 06:37:23 [979787] controller-1 stonith-ng: warning: log_action: fence_compute[259311] stderr: [ REQ: curl -g -i -X GET http://10.0.0.110:5000 -H "Accept: application/json" -H "User-Agent: python-keystoneclient" ]
Apr 12 06:37:23 [979787] controller-1 stonith-ng: warning: log_action: fence_compute[259311] stderr: [ Starting new HTTP connection (1): 10.0.0.110 ]
Apr 12 06:37:23 [979787] controller-1 stonith-ng: warning: log_action: fence_compute[259311] stderr: [ keystoneauth1.exceptions.connection.ConnectFailure: Unable to establish connection to http://10.0.0.110:5000 ]
Apr 12 06:37:28 [979787] controller-1 stonith-ng: warning: log_action: fence_compute[261144] stderr: [ REQ: curl -g -i -X GET http://10.0.0.110:5000 -H "Accept: application/json" -H "User-Agent: python-keystoneclient" ]
Apr 12 06:37:28 [979787] controller-1 stonith-ng: warning: log_action: fence_compute[261144] stderr: [ Starting new HTTP connection (1): 10.0.0.110 ]
Apr 12 06:37:28 [979787] controller-1 stonith-ng: warning: log_action: fence_compute[261144] stderr: [ keystoneauth1.exceptions.connection.ConnectFailure: Unable to establish connection to http://10.0.0.110:5000 ]

This happens after one of the compute nodes is manually crashed. The reason for these failures seems to be that pengine decides to move the VIP from one node to another:

Apr 12 06:37:04 [979790] controller-1 pengine: notice: LogAction: * Move ip-10.0.0.110 ( controller-1 -> controller-0 )
Apr 12 06:37:04 [979790] controller-1 pengine: info: common_print: ip-10.0.0.110 (ocf::heartbeat:IPaddr2): Started controller-1
Apr 12 06:37:04 [979790] controller-1 pengine: info: RecurringOp: Start recurring monitor (10s) for ip-10.0.0.110 on controller-0
Apr 12 06:37:04 [979790] controller-1 pengine: notice: LogAction: * Move ip-10.0.0.110 ( controller-1 -> controller-0 )
Apr 12 06:37:04 [979791] controller-1 crmd: notice: te_rsc_command: Initiating stop operation ip-10.0.0.110_stop_0 locally on controller-1 | action 182
Apr 12 06:37:04 [979788] controller-1 lrmd: info: cancel_recurring_action: Cancelling ocf operation ip-10.0.0.110_monitor_10000
Apr 12 06:37:04 [979791] controller-1 crmd: info: do_lrm_rsc_op: Performing key=182:89:0:2d1c6e04-5bcc-4876-b4c2-40564b4a3912 op=ip-10.0.0.110_stop_0
Apr 12 06:37:04 [979788] controller-1 lrmd: info: log_execute: executing - rsc:ip-10.0.0.110 action:stop call_id:154
Apr 12 06:37:04 [979791] controller-1 crmd: info: process_lrm_event: Result of monitor operation for ip-10.0.0.110 on controller-1: Cancelled | call=138 key=ip-10.0.0.110_monitor_

This unnecessary VIP move appears to be what is tripping fence_compute up in this case.
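(For anyone reproducing this: one way to see why pengine decided to relocate the VIP is to replay the live CIB with allocation scores. The commands below are only a diagnostic sketch; the resource name ip-10.0.0.110 is taken from the logs above, and the flags are standard pacemaker/pcs tooling.)

# Where is the keystone VIP currently running?
pcs status resources | grep ip-10.0.0.110

# Replay the current cluster state and print the allocation scores
# that explain why pengine prefers one controller over another
crm_simulate --live-check --show-scores | grep ip-10.0.0.110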
Do we tell it not to though? Is there any kind of stickiness set?
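(If no stickiness is defined, the scheduler is free to rebalance VIPs whenever node scores change. As a sketch, stickiness could be set either per resource or as a cluster-wide default; the resource name below is taken from the logs above.)

# Pin a single VIP to whichever node currently runs it
pcs resource meta ip-10.0.0.110 resource-stickiness=INFINITY

# Or set a default stickiness for all resources
pcs resource defaults resource-stickiness=INFINITY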
Verified. Deployment tested on core_puddle_version: 2018-04-19.2:

[root@controller-0 ~]# pcs config | grep ' Resource: ip\|Meta Attrs:'
 Resource: ip-192.168.24.9 (class=ocf provider=heartbeat type=IPaddr2)
  Meta Attrs: resource-stickiness=INFINITY
 Resource: ip-10.0.0.104 (class=ocf provider=heartbeat type=IPaddr2)
  Meta Attrs: resource-stickiness=INFINITY
 Resource: ip-172.17.1.19 (class=ocf provider=heartbeat type=IPaddr2)
  Meta Attrs: resource-stickiness=INFINITY
 Resource: ip-172.17.1.10 (class=ocf provider=heartbeat type=IPaddr2)
  Meta Attrs: resource-stickiness=INFINITY
 Resource: ip-172.17.3.10 (class=ocf provider=heartbeat type=IPaddr2)
  Meta Attrs: resource-stickiness=INFINITY
 Resource: ip-172.17.4.17 (class=ocf provider=heartbeat type=IPaddr2)
  Meta Attrs: resource-stickiness=INFINITY

VIP stickiness verified, i.e. each VIP remains on the same node. Tested against compute node reboots, using the instance-ha rally plugin:
https://code.engineering.redhat.com/gerrit/gitweb?p=openstack-pidone-qe.git;a=tree;f=CI;h=77821b76aad84ef01f3bb32c752b09e4d153d545;hb=HEAD
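(Independent of the rally plugin, a quick manual spot-check looks like the following; the crash/reboot step is a placeholder for whichever compute node is taken down in the test.)

# Note which node hosts each VIP before the test
pcs status resources | grep 'ip-'

# ... crash or reboot one compute node and wait for fencing/evacuation ...

# Re-check: every VIP should still report the same node as before
pcs status resources | grep 'ip-'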
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHEA-2018:2086