Bug 1566374

Summary: [OSP] on an IHA setup pacemaker-1.1.18-11 moves VIPs around unnecessarily
Product: Red Hat OpenStack
Component: puppet-tripleo
Version: 13.0 (Queens)
Hardware: All
OS: Linux
Status: CLOSED ERRATA
Severity: urgent
Priority: urgent
Keywords: Triaged
Reporter: Michele Baldessari <michele>
Assignee: Michele Baldessari <michele>
QA Contact: pkomarov
CC: abeekhof, aherr, cfeist, chjones, cluster-maint, jjoyce, jschluet, pkomarov, slinaber, tvignaud
Target Milestone: beta
Target Release: 13.0 (Queens)
Fixed In Version: puppet-tripleo-8.3.2-0.20180411174306.el7ost puppet-pacemaker-0.7.2-0.20180413040146.44ef58f.el7ost
Last Closed: 2018-06-27 13:50:58 UTC
Type: Bug

Description Michele Baldessari 2018-04-12 07:56:53 UTC
Description of problem:
Pini has observed that fence_compute often fails with the following error:
Apr 12 06:37:23 [979787] controller-1 stonith-ng:  warning: log_action: fence_compute[259311] stderr: [ REQ: curl -g -i -X GET http://10.0.0.110:5000 -H "Accept: application/json" -H "User-Agent: python-keystoneclient" ]
Apr 12 06:37:23 [979787] controller-1 stonith-ng:  warning: log_action: fence_compute[259311] stderr: [ Starting new HTTP connection (1): 10.0.0.110 ]
Apr 12 06:37:23 [979787] controller-1 stonith-ng:  warning: log_action: fence_compute[259311] stderr: [ keystoneauth1.exceptions.connection.ConnectFailure: Unable to establish connection to http://10.0.0.110:5000 ]
Apr 12 06:37:28 [979787] controller-1 stonith-ng:  warning: log_action: fence_compute[261144] stderr: [ REQ: curl -g -i -X GET http://10.0.0.110:5000 -H "Accept: application/json" -H "User-Agent: python-keystoneclient" ]
Apr 12 06:37:28 [979787] controller-1 stonith-ng:  warning: log_action: fence_compute[261144] stderr: [ Starting new HTTP connection (1): 10.0.0.110 ]
Apr 12 06:37:28 [979787] controller-1 stonith-ng:  warning: log_action: fence_compute[261144] stderr: [ keystoneauth1.exceptions.connection.ConnectFailure: Unable to establish connection to http://10.0.0.110:5000 ]

This happens after one of the compute nodes is manually crashed.
The trigger for these errors seems to be that pengine decides to move the VIP from one node to another:

Apr 12 06:37:04 [979790] controller-1    pengine:   notice: LogAction:   * Move       ip-10.0.0.110                        (    controller-1 -> controller-0 )
Apr 12 06:37:04 [979790] controller-1    pengine:     info: common_print:       ip-10.0.0.110   (ocf::heartbeat:IPaddr2):       Started controller-1
Apr 12 06:37:04 [979790] controller-1    pengine:     info: RecurringOp:         Start recurring monitor (10s) for ip-10.0.0.110 on controller-0
Apr 12 06:37:04 [979790] controller-1    pengine:   notice: LogAction:   * Move       ip-10.0.0.110                        (    controller-1 -> controller-0 )
Apr 12 06:37:04 [979791] controller-1       crmd:   notice: te_rsc_command:     Initiating stop operation ip-10.0.0.110_stop_0 locally on controller-1 | action 182
Apr 12 06:37:04 [979788] controller-1       lrmd:     info: cancel_recurring_action:    Cancelling ocf operation ip-10.0.0.110_monitor_10000
Apr 12 06:37:04 [979791] controller-1       crmd:     info: do_lrm_rsc_op:      Performing key=182:89:0:2d1c6e04-5bcc-4876-b4c2-40564b4a3912 op=ip-10.0.0.110_stop_0
Apr 12 06:37:04 [979788] controller-1       lrmd:     info: log_execute:        executing - rsc:ip-10.0.0.110 action:stop call_id:154
Apr 12 06:37:04 [979791] controller-1       crmd:     info: process_lrm_event:  Result of monitor operation for ip-10.0.0.110 on controller-1: Cancelled | call=138 key=ip-10.0.0.110_monitor_

This unnecessary VIP move seems to be what is tripping fence_compute up in this case.
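
An obvious mitigation would be to give the VIP resources stickiness, so that pengine prefers their current placement over rebalancing. A minimal sketch against a live cluster, using the VIP name from the logs above (the puppet-tripleo/puppet-pacemaker packages listed under Fixed In Version apply the same meta attribute at deploy time):

  # Pin the VIP to the node it currently runs on; with INFINITY stickiness
  # pengine will not plan a Move unless the current node becomes unavailable.
  pcs resource meta ip-10.0.0.110 resource-stickiness=INFINITY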

Comment 5 Andrew Beekhof 2018-04-13 05:39:28 UTC
Do we tell it not to though?
Is there any kind of stickiness set?
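
For reference, both questions can be answered on a live cluster with (a sketch, resource name taken from the logs above):

  # Stickiness set on the VIP resource itself, if any:
  crm_resource --resource ip-10.0.0.110 --meta --get-parameter resource-stickiness
  # Cluster-wide resource defaults; a default resource-stickiness would show up here:
  pcs resource defaults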

Comment 15 pkomarov 2018-04-23 05:47:59 UTC
Verified.

Deployment tested on core_puddle_version 2018-04-19.2:

[root@controller-0 ~]# pcs config|grep ' Resource: ip\|Meta Attrs:'

 Resource: ip-192.168.24.9 (class=ocf provider=heartbeat type=IPaddr2)
  Meta Attrs: resource-stickiness=INFINITY 
 Resource: ip-10.0.0.104 (class=ocf provider=heartbeat type=IPaddr2)
  Meta Attrs: resource-stickiness=INFINITY 
 Resource: ip-172.17.1.19 (class=ocf provider=heartbeat type=IPaddr2)
  Meta Attrs: resource-stickiness=INFINITY 
 Resource: ip-172.17.1.10 (class=ocf provider=heartbeat type=IPaddr2)
  Meta Attrs: resource-stickiness=INFINITY 
 Resource: ip-172.17.3.10 (class=ocf provider=heartbeat type=IPaddr2)
  Meta Attrs: resource-stickiness=INFINITY 
 Resource: ip-172.17.4.17 (class=ocf provider=heartbeat type=IPaddr2)
  Meta Attrs: resource-stickiness=INFINITY 

VIP stickiness verified, i.e. the VIP remains on the same node. Tested against compute node reboots, and verified using the instance-ha rally plugin:
https://code.engineering.redhat.com/gerrit/gitweb?p=openstack-pidone-qe.git;a=tree;f=CI;h=77821b76aad84ef01f3bb32c752b09e4d153d545;hb=HEAD
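
Beyond the reboot tests, pengine's view can also be checked directly against the live CIB (a sketch, separate from the rally plugin):

  # With resource-stickiness=INFINITY each ip-* resource scores INFINITY on
  # its current node, and the transition summary should list no Move actions.
  crm_simulate --live-check --show-scores | grep 'ip-'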

Comment 18 errata-xmlrpc 2018-06-27 13:50:58 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHEA-2018:2086