Summary:           Packet loss during standby L3 agent restart
Product:           Red Hat OpenStack
Reporter:          Yurii Prokulevych <yprokule>
Component:         openstack-neutron
Assignee:          Slawek Kaplonski <skaplons>
Status:            CLOSED ERRATA
QA Contact:        Alexander Stafeyev <astafeye>
Version:           13.0 (Queens)
CC:                amuller, augol, bcafarel, bhaley, ccamacho, chrisw, mbultel, mburns, mcornea, nyechiel, skaplons, srevivo
Target Milestone:  rc
Keywords:          Triaged, ZStream
Target Release:    13.0 (Queens)
Fixed In Version:  openstack-neutron-12.0.2-0.20180421011360.0ec54fd.el7ost
Doc Type:          If docs needed, set a value
Last Closed:       2018-06-27 13:56:23 UTC
Type:              Bug
Bug Blocks:        1579502, 1579503, 1579505, 1584844, 1584845
Description Yurii Prokulevych 2018-05-16 11:34:23 UTC
Description of problem:
-----------------------
We got 3% packet loss during the upgrade of a RHOS-12 Networker composable deployment:

l3_agent_start_ping.sh
openstack overcloud upgrade run \
    --stack qe-Cloud-0 \
    --roles Networker --playbook all 2>&1
...
u'PLAY RECAP *********************************************************************',
u'192.168.24.17 : ok=9 changed=0 unreachable=0 failed=0 ',
u'192.168.24.21 : ok=9 changed=0 unreachable=0 failed=0 ',
u'']
Success
Completed Overcloud Upgrade Run for Networker with playbooks ['upgrade_steps_playbook.yaml', 'deploy_steps_playbook.yaml', 'post_upgrade_steps_playbook.yaml']
[Wed May 16 06:00:09 EDT 2018] Finished major upgrade for Networker role

1196 packets transmitted, 1158 received, 3% packet loss, time 1196026ms
rtt min/avg/max/mdev = 0.551/3.974/499.700/30.818 ms
Ping loss higher than 1% detected

Version-Release number of selected component (if applicable):
-------------------------------------------------------------
openstack-tripleo-heat-templates-8.0.2-19.el7ost.noarch
python-tripleoclient-9.2.1-9.el7ost.noarch
openstack-neutron-openvswitch-12.0.2-0.20180421011359.0ec54fd.el7ost.noarch
python2-neutronclient-6.7.0-1.el7ost.noarch
openstack-neutron-common-12.0.2-0.20180421011359.0ec54fd.el7ost.noarch
openstack-neutron-ml2-12.0.2-0.20180421011359.0ec54fd.el7ost.noarch
openstack-neutron-metering-agent-12.0.2-0.20180421011359.0ec54fd.el7ost.noarch
python-neutron-12.0.2-0.20180421011359.0ec54fd.el7ost.noarch
openstack-neutron-lbaas-12.0.1-0.20180424200349.cdbf25c.el7ost.noarch
openstack-neutron-sriov-nic-agent-12.0.2-0.20180421011359.0ec54fd.el7ost.noarch
puppet-neutron-12.4.1-0.20180412211913.el7ost.noarch
python-neutron-lbaas-12.0.1-0.20180424200349.cdbf25c.el7ost.noarch
openstack-neutron-linuxbridge-12.0.2-0.20180421011359.0ec54fd.el7ost.noarch
python2-neutron-lib-1.13.0-1.el7ost.noarch
openstack-neutron-lbaas-ui-4.0.1-0.20180326210834.a2c502e.el7ost.noarch
openstack-neutron-12.0.2-0.20180421011359.0ec54fd.el7ost.noarch

Steps to Reproduce:
-------------------
1. Upgrade the undercloud
2. Run overcloud upgrade prepare
3. Launch a VM on the overcloud and assign a FIP to it
4. Start pinging the FIP before each role upgrade and stop after it

Actual results:
---------------
Packet loss > 1%, which causes the automation to fail

Expected results:
-----------------
Packet loss < 2% during the upgrade

Additional info:
----------------
Virtual env: 3 controller + 3 ceph + 3 database + 3 messaging + 2 compute + 2 networker
IPv6, undercloud/overcloud with SSL, and a custom overcloud stack name = qe-Cloud-0
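The l3_agent_start_ping.sh helper referenced above is not included in this report; a minimal sketch of the same kind of check (a hypothetical script, assuming the summary format printed by iputils ping) could look like this:

```shell
#!/bin/sh
# Hypothetical sketch of a FIP ping-loss check; the real
# l3_agent_start_ping.sh is not shown in this bug report.

# Extract the loss percentage from ping's summary line, e.g.
# "1196 packets transmitted, 1158 received, 3% packet loss, time 1196026ms"
parse_loss() {
    echo "$1" | sed -n 's/.*, \([0-9]*\)% packet loss.*/\1/p'
}

# Ping a FIP for a given duration and fail if loss exceeds 1%.
run_check() {
    fip=$1; duration=$2
    summary=$(ping -q -w "$duration" "$fip" | grep 'packet loss')
    loss=$(parse_loss "$summary")
    if [ "$loss" -gt 1 ]; then
        echo "Ping loss higher than 1% detected ($loss%)"
        return 1
    fi
    echo "Ping loss OK ($loss%)"
}
```

Usage would be e.g. `run_check 10.0.0.50 1200` started before the role upgrade and inspected afterwards.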
Comment 3 Slawek Kaplonski 2018-05-16 21:22:20 UTC
I deployed a clean OSP-13 with 2 HA L3 agents and was easily able to reproduce this issue. It looks like every time the L3 agent on the STANDBY node is restarted there is packet loss when pinging the FIP.
We did a long debugging session with Brian and Brent today, and what we found is that the undercloud node (the node from which the FIP was being pinged) sends 3 unicast ARP requests asking who has the FIP and gets no answer; it then sends a broadcast ARP request and gets a response from the ACTIVE node.
During the packet loss we also found that ICMP requests reach the standby node instead of the active one for some time.
There is nothing wrong in the L3 agent logs on either node.
The HA port on the standby node goes DOWN and then UP again, and that is visible in the OVS agent's logs, so maybe that is a clue as to where to look for the culprit of this issue. I will continue debugging tomorrow morning.
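A capture along these lines makes the unicast ARP retries, the broadcast fallback, and the misdirected ICMP visible. The interface name and FIP below are assumptions for illustration, not values taken from this deployment:

```shell
#!/bin/sh
# Sketch of the capture used for this kind of debugging.
# br-ex and 10.0.0.50 are hypothetical placeholders.
FIP=10.0.0.50        # hypothetical floating IP being pinged
IFACE=br-ex          # hypothetical external bridge on the network node

# Build a tcpdump command that shows all ARP traffic plus ICMP
# to/from the FIP, with link-layer (MAC) addresses printed (-e).
capture_cmd() {
    echo "tcpdump -eni $1 arp or (icmp and host $2)"
}

# Run as root on both the active and standby nodes during the
# standby L3 agent restart:
#   $(capture_cmd "$IFACE" "$FIP")
```

Comparing which MAC/port the ICMP echo requests arrive on, before and after the restart, shows where the network has learned the router's MAC.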
Comment 4 Brian Haley 2018-05-17 12:59:57 UTC
Slawek found that this is caused by the standby router coming up with IPv6 enabled and sending an MLDv2 advertisement out. That somehow causes something in the network to learn that the MAC of the HA router is at a different location (both routers have the same MAC on the qg- port, but only one is usually active) and to send packets in that direction. There is a WIP patch upstream now.
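The general direction of such a fix is to keep IPv6 disabled on a standby router's gateway device so it cannot emit MLDv2 packets before becoming master. A minimal sketch of the sysctl involved (the namespace and device names are hypothetical; this is not the actual upstream patch):

```shell
#!/bin/sh
# Sketch: keep IPv6 disabled on a standby router's gateway port so it
# cannot send MLDv2 and make the network mislearn the shared MAC.
# The router namespace and qg- device names below are hypothetical.

# sysctl knob that disables IPv6 on a given interface.
sysctl_key() {
    echo "net.ipv6.conf.$1.disable_ipv6"
}

# Disable IPv6 on a device inside a router namespace (needs root).
disable_ipv6_on_gw() {
    ns=$1; dev=$2
    ip netns exec "$ns" sysctl -w "$(sysctl_key "$dev")=1"
}

# Example with hypothetical names:
#   disable_ipv6_on_gw qrouter-1234abcd qg-abcd1234-56
```

When the router transitions to master, the setting would be flipped back to 0 so the gateway port can use IPv6 normally.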
Comment 12 Alexander Stafeyev 2018-06-06 13:14:53 UTC
(In reply to Slawek Kaplonski from comment #3)
> I deployed clean OSP-13 with 2 ha L3 agents and I was easily able to
> reproduce this issue. It looks that every time when L3 agent on STANDBY node
> is restarted there is packet loss when pinging FIP.
> We did long debugging with Brian and Brent today and what we found is that
> undercloud (node from which FIP was pinging) node is sending 3 unicast ARP
> requests who has FIP and there is no answer for it, then it sends broadcast
> ARP request and it gets response from ACTIVE node.
> During packet loss we found also that ICMP requests are coming to standby
> node instead of active one for some time
> There is nothing wrong in L3 agent logs on both nodes.
> HA port on standby node is going DOWN and then UP again and that is visible
> in ovs agent's logs so maybe that will be some clue where to look culprit of
> this issue.
> I will continue debugging tomorrow morning.

Slawek, I have 3 controllers. Is it enough to restart the 2 standby L3 agent docker containers while pinging the FIP? Thanks
Comment 13 Slawek Kaplonski 2018-06-06 13:17:15 UTC
Alexander, yes, that is enough. Restarting one standby L3 agent should also be sufficient.
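The verification then amounts to finding the standby agents for the router and restarting them while the ping runs. A sketch of that flow (the container name and the `neutron l3-agent-list-hosting-router` table layout are assumptions):

```shell
#!/bin/sh
# Sketch of the reproduction/verification flow: restart the L3 agent on
# the standby host(s) while pinging the FIP from outside the cloud.
# Container name and CLI table layout below are assumptions.

# Read "neutron l3-agent-list-hosting-router <router>"-style table output
# on stdin and print the hosts whose ha_state is "standby".
standby_hosts() {
    awk -F'|' '/standby/ {gsub(/ /,"",$3); print $3}'
}

# On each printed host, while "ping <fip>" runs elsewhere:
#   docker restart neutron_l3_agent   # hypothetical container name
# then compare the ping summary's packet loss before and after the fix.
```

With the fixed package, restarting the standby agent should produce no lost ICMP replies.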
Comment 14 Alexander Stafeyev 2018-06-06 13:53:05 UTC
[root@controller-0 ~]# rpm -qa | grep ck-neutron-12
openstack-neutron-12.0.2-0.20180421011360.0ec54fd.el7ost.noarch
[root@controller-0 ~]#

No ping failures were seen.
Comment 17 errata-xmlrpc 2018-06-27 13:56:23 UTC
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHEA-2018:2086