Bug 1313150

Summary: VIP network interfaces in the qlbaas namespaces are not recreated when the OpenStack cluster is put into standby and then taken out of standby.
Product: Red Hat OpenStack
Reporter: Pratik Pravin Bandarkar <pbandark>
Component: openstack-neutron
Assignee: Nir Magnezi <nmagnezi>
Status: CLOSED CURRENTRELEASE
QA Contact: Alexander Stafeyev <astafeye>
Severity: high
Docs Contact:
Priority: high
Version: 6.0 (Juno)
CC: amuller, apevec, chrisw, lhh, nyechiel, pbandark, srevivo, tfreger
Target Milestone: async
Keywords: TestOnly, Unconfirmed, ZStream
Target Release: 6.0 (Juno)
Hardware: All
OS: Linux
Whiteboard:
Fixed In Version: openstack-neutron-2014.2.3-35.el7ost
Doc Type: Bug Fix
Doc Text:
Cause: Upon failover, the pcs cluster invokes ovs-cleanup on the standby node, which removes the VIP network interfaces in the qlbaas namespaces.
Consequence: The VIP network interfaces in the qlbaas namespaces are not recreated when the pcs cluster fails over a second time, back to the original active node. When the lbaas agent starts again, it detects that the haproxy process is still running and attempts to _reload_pool(), which fails with the error 'Unable to deploy instance for pool'.
Fix: Modify the neutron-netns-cleanup script to kill haproxy, as is already done for keepalived and radvd in other cases.
Result: The agent reconfigures the interfaces.
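
The idea behind the fix can be illustrated with a minimal shell sketch. This is not the attached patch. It only assumes the pid-file layout visible in the ps output in this report (/var/lib/neutron/lbaas/<pool-id>/pid). The function names lbaas_haproxy_pids and kill_lbaas_haproxy and the LBAAS_STATE_DIR variable are hypothetical.

```shell
#!/bin/sh
# Hypothetical sketch, NOT the shipped patch: kill only the haproxy
# processes that were started from an lbaas pid file.
LBAAS_STATE_DIR="${LBAAS_STATE_DIR:-/var/lib/neutron/lbaas}"

# Print the PIDs recorded in <state dir>/<pool-id>/pid files, one per line.
lbaas_haproxy_pids() {
    for pidfile in "$LBAAS_STATE_DIR"/*/pid; do
        [ -r "$pidfile" ] || continue
        pid=$(head -n 1 "$pidfile" | tr -d '[:space:]')
        # Skip empty or non-numeric contents.
        case "$pid" in
            ''|*[!0-9]*) continue ;;
        esac
        printf '%s\n' "$pid"
    done
}

# Send SIGTERM to each recorded PID, but only if /proc confirms that the
# process is a haproxy started with a config under the lbaas state
# directory, so unrelated haproxy instances are left alone.
kill_lbaas_haproxy() {
    lbaas_haproxy_pids | while read -r pid; do
        if [ -r "/proc/$pid/cmdline" ] &&
           tr '\0' ' ' < "/proc/$pid/cmdline" |
               grep -qF "haproxy -f $LBAAS_STATE_DIR/"; then
            kill "$pid"
        fi
    done
}
```

Checking the command line before killing guards against stale pid files whose PID has since been reused by another process.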
Story Points: ---
Clone Of:
Environment:
Last Closed: 2016-07-06 18:53:07 UTC
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Attachments:
Description Flags
patch none

Description Pratik Pravin Bandarkar 2016-03-01 05:42:37 UTC
Description of problem:
VIP network interfaces in the qlbaas namespaces are not recreated when the OpenStack cluster is put into standby and then taken out of standby.

Version-Release number of selected component (if applicable):
RHOS6

How reproducible:

(1) Standby OpenStack cluster
[root@os3ctrl03 ~]# pcs cluster standby --all

(2) Note HAProxy loadbalancer process is still running:
[root@os3ctrl03 ~]# ps -ef | grep lbaas
nobody   22148     1  0 10:22 ?        00:00:00 haproxy -f /var/lib/neutron/lbaas/e9adbe8b-fbe3-4c1d-9a49-42b6113de190/conf -p /var/lib/neutron/lbaas/e9adbe8b-fbe3-4c1d-9a49-42b6113de190/pid -sf 18150

(3) Unstandby cluster
[root@os3ctrl03 ~]# pcs cluster unstandby --all

(4) Note same HAProxy loadbalancer process is respawned on a new PID
neutron  14529     1 11 10:29 ?        00:00:00 /usr/bin/python /usr/bin/neutron-lbaas-agent --config-file /usr/share/neutron/neutron-dist.conf --config-file /usr/share/neutron/neutron-lbaas-dist.conf --config-file /etc/neutron/neutron.conf --config-file /etc/neutron/lbaas_agent.ini --config-dir /etc/neutron/conf.d/neutron-lbaas-agent --log-file /var/log/neutron/lbaas-agent.log
nobody   14759     1  0 10:29 ?        00:00:00 haproxy -f /var/lib/neutron/lbaas/e9adbe8b-fbe3-4c1d-9a49-42b6113de190/conf -p /var/lib/neutron/lbaas/e9adbe8b-fbe3-4c1d-9a49-42b6113de190/pid -sf 22148

(5) Check network interfaces in the namespace of the loadbalancer.  Note the VIP network interface is missing!
[root@os3ctrl03 neutron(openstack_admin)]# ip netns exec qlbaas-e9adbe8b-fbe3-4c1d-9a49-42b6113de190 ip addr
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN
    link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
    inet 127.0.0.1/8 scope host lo
       valid_lft forever preferred_lft forever
    inet6 ::1/128 scope host
       valid_lft forever preferred_lft forever

(6) Note that restarting neutron-lbaas-agent does not resolve the issue.  The only known workaround is described below.
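
The check in step (5) can be scripted across all load balancer namespaces. The following is a rough sketch, assuming that a healthy qlbaas namespace always has at least one interface besides lo. The function names count_non_loopback and find_namespaces_missing_vip are hypothetical.

```shell
#!/bin/sh
# Count non-loopback interfaces in `ip addr` output read from stdin.
# Interface header lines look like "107: tap65d0c08e-e3: <...>".
count_non_loopback() {
    awk '/^[0-9]+: / {
             name = $2
             sub(/[:@].*/, "", name)   # strip trailing ":" or "@peer:"
             if (name != "lo") n++
         }
         END { print n + 0 }'
}

# Report every qlbaas-* namespace that contains only the loopback device.
find_namespaces_missing_vip() {
    ip netns list | awk '{ print $1 }' | grep '^qlbaas-' |
    while read -r ns; do
        n=$(ip netns exec "$ns" ip addr < /dev/null | count_non_loopback)
        if [ "$n" -eq 0 ]; then
            echo "$ns: VIP interface missing"
        fi
    done
}
```

Run find_namespaces_missing_vip as root on the controller after the unstandby. No output means every qlbaas namespace has a non-loopback interface.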
_______________________________________________


Steps to workaround:

(1) Stop neutron-lbaas-agent
pcs resource disable neutron-lbaas-agent-clone

(2) Kill running haproxy loadbalancer processes
[root@os3ctrl03 neutron(openstack_admin)]# ps -ef | grep lbaas
nobody    9794     1  0 10:37 ?        00:00:00 haproxy -f /var/lib/neutron/lbaas/e9adbe8b-fbe3-4c1d-9a49-42b6113de190/conf -p /var/lib/neutron/lbaas/e9adbe8b-fbe3-4c1d-9a49-42b6113de190/pid -sf 14759
[root@os3ctrl03 neutron(openstack_admin)]# kill 9794
[root@os3ctrl03 neutron(openstack_admin)]# ps -ef | grep lbaas
[root@os3ctrl03 neutron(openstack_admin)]#

(3) Start neutron-lbaas-agent - agent will recreate haproxy processes and VIP interface in the qlbaas namespaces
[root@os3ctrl03 neutron(openstack_admin)]# pcs resource enable neutron-lbaas-agent-clone

(4) Check the namespace for network interfaces in the haproxy loadbalancer.  Note the VIP network interface has returned!
[root@os3ctrl03 neutron(openstack_admin)]# ps -ef | grep lbaas
neutron  15123     1  8 10:49 ?        00:00:01 /usr/bin/python /usr/bin/neutron-lbaas-agent --config-file /usr/share/neutron/neutron-dist.conf --config-file /usr/share/neutron/neutron-lbaas-dist.conf --config-file /etc/neutron/neutron.conf --config-file /etc/neutron/lbaas_agent.ini --config-dir /etc/neutron/conf.d/neutron-lbaas-agent --log-file /var/log/neutron/lbaas-agent.log
nobody   15762     1  0 10:49 ?        00:00:00 haproxy -f /var/lib/neutron/lbaas/e9adbe8b-fbe3-4c1d-9a49-42b6113de190/conf -p /var/lib/neutron/lbaas/e9adbe8b-fbe3-4c1d-9a49-42b6113de190/pid
[root@os3ctrl03 neutron(openstack_admin)]# ip netns exec qlbaas-e9adbe8b-fbe3-4c1d-9a49-42b6113de190 ip addr
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN
    link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
    inet 127.0.0.1/8 scope host lo
       valid_lft forever preferred_lft forever
    inet6 ::1/128 scope host
       valid_lft forever preferred_lft forever
107: tap65d0c08e-e3: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue state UNKNOWN
    link/ether fa:16:3e:bf:1d:19 brd ff:ff:ff:ff:ff:ff
    inet 192.168.33.33/24 brd 192.168.33.255 scope global tap65d0c08e-e3
       valid_lft forever preferred_lft forever
    inet6 fe80::f816:3eff:febf:1d19/64 scope link
       valid_lft forever preferred_lft forever
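
The workaround steps above can be wrapped in a small script. This is only a sketch: the run helper, the DRY_RUN variable and the lbaas_workaround function are hypothetical, and the script does not wait for pacemaker to finish stopping the agent before killing haproxy, so a short pause between steps may be needed in practice.

```shell
#!/bin/sh
# Hypothetical helper: set DRY_RUN=1 to print commands instead of running them.
run() {
    if [ "${DRY_RUN:-0}" = "1" ]; then
        printf '%s\n' "$*"
    else
        "$@"
    fi
}

lbaas_workaround() {
    # (1) Stop the agent through pacemaker.
    run pcs resource disable neutron-lbaas-agent-clone || return 1
    # (2) Kill the haproxy processes started from the lbaas state directory.
    #     pkill exits non-zero when nothing matched, which is fine here.
    run pkill -f 'haproxy -f /var/lib/neutron/lbaas/' || true
    # (3) Start the agent again so that it recreates haproxy and the VIP port.
    run pcs resource enable neutron-lbaas-agent-clone
}
```

Calling lbaas_workaround with DRY_RUN=1 set prints the three commands without touching the cluster, which is a safe way to review them first.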


Actual results:
VIP network interfaces in the qlbaas namespaces are not recreated when the OpenStack cluster is put into standby and then taken out of standby.

Expected results:
VIP network interfaces in the qlbaas namespaces should be recreated when the OpenStack cluster is put into standby and then taken out of standby.

Additional info:
lbaas is configured in pacemaker.

Comment 23 Nir Magnezi 2016-06-06 11:45:08 UTC
Verification steps for QE:
==========================
0. Have a cluster of two lbaas agents.
1. Configure PCS to invoke neutron-netns-cleanup on the standby node, upon failover.
2. Repeat the first 6 steps from comment #0 and verify that the VIP interfaces are in place and load balancing works.
3. Verify neutron-netns-cleanup did not kill any haproxy process that is not related to the lbaas agent.
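
One possible way to check step 3 is to snapshot the haproxy processes that do not belong to the lbaas agent before and after the failover and compare the two lists. The sketch below assumes that lbaas-managed haproxy instances can be recognised by the /var/lib/neutron/lbaas/ path in their command line, as in the ps output in comment #0. The function name unrelated_haproxy_pids is hypothetical.

```shell
#!/bin/sh
# Read `ps -eo pid,args` output from stdin and print the PIDs of haproxy
# processes whose command line does not reference the lbaas state directory.
unrelated_haproxy_pids() {
    awk '$2 ~ /(^|\/)haproxy$/ &&
         index($0, "/var/lib/neutron/lbaas/") == 0 { print $1 }' | sort -n
}

# Example use around a failover (left commented out on purpose):
#   before=$(ps -eo pid,args | unrelated_haproxy_pids)
#   ... trigger the failover and let neutron-netns-cleanup run ...
#   after=$(ps -eo pid,args | unrelated_haproxy_pids)
#   [ "$before" = "$after" ] && echo "unrelated haproxy processes untouched"
```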

Comment 24 Lon Hohberger 2016-06-23 18:20:37 UTC
According to our records, this should be resolved by openstack-neutron-2014.2.3-37.el7ost.  This build is available now.

Comment 25 Pratik Pravin Bandarkar 2016-07-01 09:04:54 UTC
(In reply to Lon Hohberger from comment #24)
> According to our records, this should be resolved by
> openstack-neutron-2014.2.3-37.el7ost.  This build is available now.

Does that mean the fix is already included in "openstack-neutron-2014.2.3-37.el7ost"?

The customer confirmed that the test fix solved the issue and is now waiting for the errata/official fix.

Comment 26 Nir Magnezi 2016-07-03 19:02:13 UTC
(In reply to Pratik Pravin Bandarkar from comment #25)
> (In reply to Lon Hohberger from comment #24)
> > According to our records, this should be resolved by
> > openstack-neutron-2014.2.3-37.el7ost.  This build is available now.
> 
> Does that mean the fix is already included with
> "openstack-neutron-2014.2.3-37.el7ost" ?
> 
>
That would be a yes.
If you look at the 'Fixed In Version' field you'll notice that the fix is incorporated starting from openstack-neutron-2014.2.3-35.el7ost
 
> customer confirmed that the test fix solved the issue. Now he is waiting for
> errata/official fix.

The fix is officially included in OSP6.

Comment 27 Nir Magnezi 2016-07-04 12:06:37 UTC
Created attachment 1176009 [details]
patch

The GerritHub URL seems to be broken for an unknown reason, so I am attaching the patch to this rhbz.

Comment 28 Toni Freger 2016-07-06 08:38:28 UTC
Code tested on latest OSP6 puddle - openstack-neutron-2014.2.3-38.el7ost.noarch

Within neutron-netns-cleanup file under - /usr/lib/ocf/lib/neutron/neutron-netns-cleanup