Summary:           Packet loss during standby L3 agent restart
Product:           Red Hat OpenStack
Reporter:          Yurii Prokulevych <yprokule>
Component:         openstack-neutron
Assignee:          Slawek Kaplonski <skaplons>
Status:            CLOSED ERRATA
QA Contact:        Alexander Stafeyev <astafeye>
Version:           13.0 (Queens)
CC:                amuller, augol, bcafarel, bhaley, ccamacho, chrisw, mbultel, mburns, mcornea, nyechiel, skaplons, srevivo
Target Milestone:  rc
Keywords:          Triaged, ZStream
Target Release:    13.0 (Queens)
Fixed In Version:  openstack-neutron-12.0.2-0.20180421011360.0ec54fd.el7ost
Doc Type:          If docs needed, set a value
Last Closed:       2018-06-27 13:56:23 UTC
Type:              Bug
Bug Blocks:        1579502, 1579503, 1579505, 1584844, 1584845
Description Yurii Prokulevych 2018-05-16 11:34:23 UTC
Description of problem:
-----------------------
We got 3% packet loss during the upgrade of a RHOS-12 Networker composable deployment:

l3_agent_start_ping.sh
openstack overcloud upgrade run \
    --stack qe-Cloud-0 \
    --roles Networker --playbook all 2>&1
...
u'PLAY RECAP *********************************************************************',
u'192.168.24.17 : ok=9 changed=0 unreachable=0 failed=0 ',
u'192.168.24.21 : ok=9 changed=0 unreachable=0 failed=0 ',
u'']
Success
Completed Overcloud Upgrade Run for Networker with playbooks ['upgrade_steps_playbook.yaml', 'deploy_steps_playbook.yaml', 'post_upgrade_steps_playbook.yaml']
[Wed May 16 06:00:09 EDT 2018] Finished major upgrade for Networker role

1196 packets transmitted, 1158 received, 3% packet loss, time 1196026ms
rtt min/avg/max/mdev = 0.551/3.974/499.700/30.818 ms
Ping loss higher than 1% detected

Version-Release number of selected component (if applicable):
-------------------------------------------------------------
openstack-tripleo-heat-templates-8.0.2-19.el7ost.noarch
python-tripleoclient-9.2.1-9.el7ost.noarch
openstack-neutron-openvswitch-12.0.2-0.20180421011359.0ec54fd.el7ost.noarch
python2-neutronclient-6.7.0-1.el7ost.noarch
openstack-neutron-common-12.0.2-0.20180421011359.0ec54fd.el7ost.noarch
openstack-neutron-ml2-12.0.2-0.20180421011359.0ec54fd.el7ost.noarch
openstack-neutron-metering-agent-12.0.2-0.20180421011359.0ec54fd.el7ost.noarch
python-neutron-12.0.2-0.20180421011359.0ec54fd.el7ost.noarch
openstack-neutron-lbaas-12.0.1-0.20180424200349.cdbf25c.el7ost.noarch
openstack-neutron-sriov-nic-agent-12.0.2-0.20180421011359.0ec54fd.el7ost.noarch
puppet-neutron-12.4.1-0.20180412211913.el7ost.noarch
python-neutron-lbaas-12.0.1-0.20180424200349.cdbf25c.el7ost.noarch
openstack-neutron-linuxbridge-12.0.2-0.20180421011359.0ec54fd.el7ost.noarch
python2-neutron-lib-1.13.0-1.el7ost.noarch
openstack-neutron-lbaas-ui-4.0.1-0.20180326210834.a2c502e.el7ost.noarch
openstack-neutron-12.0.2-0.20180421011359.0ec54fd.el7ost.noarch

Steps to Reproduce:
-------------------
1. Upgrade the undercloud
2. Run overcloud upgrade prepare
3. Launch a VM on the overcloud and assign a FIP to it
4. Start pinging the FIP before each role upgrade and stop after it

Actual results:
---------------
Packet loss > 1%, which causes the automation to fail

Expected results:
-----------------
Packet loss < 2% during the upgrade

Additional info:
----------------
Virtual env: 3 controller + 3 ceph + 3 database + 3 messaging + 2 compute + 2 networker
IPv6, undercloud/overcloud with SSL, and a custom overcloud stack name = qe-Cloud-0
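The l3_agent_start_ping.sh helper referenced above is not included in this report; a minimal sketch of the same kind of check (a hypothetical script, assuming the summary format printed by iputils ping) could look like this:

```shell
#!/bin/sh
# Hypothetical sketch of a FIP ping-loss check; the real
# l3_agent_start_ping.sh is not shown in this bug report.

# Extract the loss percentage from ping's summary line, e.g.
# "1196 packets transmitted, 1158 received, 3% packet loss, time 1196026ms"
parse_loss() {
    echo "$1" | sed -n 's/.*, \([0-9]*\)% packet loss.*/\1/p'
}

# Ping a FIP for a given duration and fail if loss exceeds 1%.
run_check() {
    fip=$1; duration=$2
    summary=$(ping -q -w "$duration" "$fip" | grep 'packet loss')
    loss=$(parse_loss "$summary")
    if [ "$loss" -gt 1 ]; then
        echo "Ping loss higher than 1% detected ($loss%)"
        return 1
    fi
    echo "Ping loss OK ($loss%)"
}
```

Usage would be e.g. `run_check 10.0.0.50 1200` started before the role upgrade and inspected afterwards.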
Comment 3 Slawek Kaplonski 2018-05-16 21:22:20 UTC
I deployed a clean OSP-13 with 2 HA L3 agents and was easily able to reproduce this issue. It looks like every time the L3 agent on the STANDBY node is restarted there is packet loss when pinging the FIP.
We did a long debugging session with Brian and Brent today, and what we found is that the undercloud node (the node from which the FIP was being pinged) sends 3 unicast ARP requests asking who has the FIP and gets no answer; it then sends a broadcast ARP request and gets a response from the ACTIVE node.
During the packet loss we also found that ICMP requests reach the standby node instead of the active one for some time.
There is nothing wrong in the L3 agent logs on either node.
The HA port on the standby node goes DOWN and then UP again, and that is visible in the OVS agent's logs, so maybe that is a clue as to where to look for the culprit of this issue. I will continue debugging tomorrow morning.
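A capture along these lines makes the unicast ARP retries, the broadcast fallback, and the misdirected ICMP visible. The interface name and FIP below are assumptions for illustration, not values taken from this deployment:

```shell
#!/bin/sh
# Sketch of the capture used for this kind of debugging.
# br-ex and 10.0.0.50 are hypothetical placeholders.
FIP=10.0.0.50        # hypothetical floating IP being pinged
IFACE=br-ex          # hypothetical external bridge on the network node

# Build a tcpdump command that shows all ARP traffic plus ICMP
# to/from the FIP, with link-layer (MAC) addresses printed (-e).
capture_cmd() {
    echo "tcpdump -eni $1 arp or (icmp and host $2)"
}

# Run as root on both the active and standby nodes during the
# standby L3 agent restart:
#   $(capture_cmd "$IFACE" "$FIP")
```

Comparing which MAC/port the ICMP echo requests arrive on, before and after the restart, shows where the network has learned the router's MAC.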
Comment 4 Brian Haley 2018-05-17 12:59:57 UTC
Slawek found that this is caused by the standby router coming up with IPv6 enabled and sending an MLDv2 advertisement out. That somehow causes something in the network to learn that the MAC of the HA router is at a different location (both routers have the same MAC on the qg- port, but only one is usually active) and to send packets in that direction. There is a WIP patch upstream now.
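The general direction of such a fix is to keep IPv6 disabled on a standby router's gateway device so it cannot emit MLDv2 packets before becoming master. A minimal sketch of the sysctl involved (the namespace and device names are hypothetical; this is not the actual upstream patch):

```shell
#!/bin/sh
# Sketch: keep IPv6 disabled on a standby router's gateway port so it
# cannot send MLDv2 and make the network mislearn the shared MAC.
# The router namespace and qg- device names below are hypothetical.

# sysctl knob that disables IPv6 on a given interface.
sysctl_key() {
    echo "net.ipv6.conf.$1.disable_ipv6"
}

# Disable IPv6 on a device inside a router namespace (needs root).
disable_ipv6_on_gw() {
    ns=$1; dev=$2
    ip netns exec "$ns" sysctl -w "$(sysctl_key "$dev")=1"
}

# Example with hypothetical names:
#   disable_ipv6_on_gw qrouter-1234abcd qg-abcd1234-56
```

When the router transitions to master, the setting would be flipped back to 0 so the gateway port can use IPv6 normally.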
Comment 12 Alexander Stafeyev 2018-06-06 13:14:53 UTC
(In reply to Slawek Kaplonski from comment #3)
> I deployed clean OSP-13 with 2 ha L3 agents and I was easily able to
> reproduce this issue. It looks that every time when L3 agent on STANDBY node
> is restarted there is packet loss when pinging FIP.
> We did long debugging with Brian and Brent today and what we found is that
> undercloud (node from which FIP was pinging) node is sending 3 unicast ARP
> requests who has FIP and there is no answer for it, then it sends broadcast
> ARP request and it gets response from ACTIVE node.
> During packet loss we found also that ICMP requests are coming to standby
> node instead of active one for some time
> There is nothing wrong in L3 agent logs on both nodes.
> HA port on standby node is going DOWN and then UP again and that is visible
> in ovs agent's logs so maybe that will be some clue where to look culprit of
> this issue.
> I will continue debugging tomorrow morning.

Slawek, I have 3 controllers. Is it enough to restart the 2 standby L3 agent docker containers while pinging the FIP? Thanks
Comment 13 Slawek Kaplonski 2018-06-06 13:17:15 UTC
Alexander, yes, that is enough. Restarting one standby L3 agent should also be sufficient.
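The verification then amounts to finding the standby agents for the router and restarting them while the ping runs. A sketch of that flow (the container name and the `neutron l3-agent-list-hosting-router` table layout are assumptions):

```shell
#!/bin/sh
# Sketch of the reproduction/verification flow: restart the L3 agent on
# the standby host(s) while pinging the FIP from outside the cloud.
# Container name and CLI table layout below are assumptions.

# Read "neutron l3-agent-list-hosting-router <router>"-style table output
# on stdin and print the hosts whose ha_state is "standby".
standby_hosts() {
    awk -F'|' '/standby/ {gsub(/ /,"",$3); print $3}'
}

# On each printed host, while "ping <fip>" runs elsewhere:
#   docker restart neutron_l3_agent   # hypothetical container name
# then compare the ping summary's packet loss before and after the fix.
```

With the fixed package, restarting the standby agent should produce no lost ICMP replies.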
Comment 14 Alexander Stafeyev 2018-06-06 13:53:05 UTC
[root@controller-0 ~]# rpm -qa | grep ck-neutron-12
openstack-neutron-12.0.2-0.20180421011360.0ec54fd.el7ost.noarch
[root@controller-0 ~]#

No ping failures were seen.
Comment 17 errata-xmlrpc 2018-06-27 13:56:23 UTC
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHEA-2018:2086