Bug 1578765
Summary: | Packet loss during standby L3 agent restart | |||
---|---|---|---|---|
Product: | Red Hat OpenStack | Reporter: | Yurii Prokulevych <yprokule> | |
Component: | openstack-neutron | Assignee: | Slawek Kaplonski <skaplons> | |
Status: | CLOSED ERRATA | QA Contact: | Alexander Stafeyev <astafeye> | |
Severity: | urgent | Docs Contact: | ||
Priority: | urgent | |||
Version: | 13.0 (Queens) | CC: | amuller, augol, bcafarel, bhaley, ccamacho, chrisw, mbultel, mburns, mcornea, nyechiel, skaplons, srevivo | |
Target Milestone: | rc | Keywords: | Triaged, ZStream | |
Target Release: | 13.0 (Queens) | |||
Hardware: | Unspecified | |||
OS: | Unspecified | |||
Fixed In Version: | openstack-neutron-12.0.2-0.20180421011360.0ec54fd.el7ost | Doc Type: | If docs needed, set a value | |
: | 1579502 1584844 1584845 | Environment: | |
Last Closed: | 2018-06-27 13:56:23 UTC | Type: | Bug | |
Bug Blocks: | 1579502, 1579503, 1579505, 1584844, 1584845 |
Description
Yurii Prokulevych
2018-05-16 11:34:23 UTC
I deployed a clean OSP-13 with 2 HA L3 agents and was easily able to reproduce this issue. Every time the L3 agent on the STANDBY node is restarted, there is packet loss when pinging the FIP.

We did a long debugging session with Brian and Brent today. We found that the undercloud node (the node from which the FIP was pinged) sends 3 unicast ARP requests asking who has the FIP and gets no answer; it then sends a broadcast ARP request and gets a response from the ACTIVE node. During the packet loss we also found that ICMP requests arrive at the standby node instead of the active one for some time.

There is nothing wrong in the L3 agent logs on either node. The HA port on the standby node goes DOWN and then UP again, which is visible in the OVS agent's logs, so that may be a clue to where to look for the culprit of this issue. I will continue debugging tomorrow morning.

Slawek found that this is caused by the standby router coming up with IPv6 enabled and sending out an MLDv2 advertisement. Somehow this causes something in the network to learn that the MAC of the HA router is at a different location (both routers have the same MAC on the qg- port, but usually only one is active) and to send packets in that direction. There is a WIP patch upstream now.

(In reply to Slawek Kaplonski from comment #3)
Slawek, I have 3 controllers. Is it enough to restart the 2 standby L3 agent docker containers while pinging the FIP? Thanks

Alexander, yes, it is enough. Restarting one standby L3 agent should also be enough.

[root@controller-0 ~]# rpm -qa | grep ck-neutron-12
openstack-neutron-12.0.2-0.20180421011360.0ec54fd.el7ost.noarch
[root@controller-0 ~]#

No ping failures were seen.

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.
https://access.redhat.com/errata/RHEA-2018:2086
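The MAC-learning explanation given in the comments (a frame from the standby router makes the network believe the shared qg- MAC has moved) can be illustrated with a toy model. This is only a sketch under assumptions: it assumes the device in between behaves like a plain MAC-learning L2 switch, and the MAC address, port numbers, and the `LearningSwitch` class are hypothetical names made up for the illustration. It is not Neutron or Open vSwitch code.

```python
class LearningSwitch:
    """Toy L2 switch: remembers the last port each source MAC was seen on."""

    def __init__(self):
        self.mac_table = {}

    def receive(self, src_mac, in_port):
        # Any frame (ARP reply, ICMP, MLDv2, ...) refreshes the entry
        # for its source MAC with the port it arrived on.
        self.mac_table[src_mac] = in_port

    def out_port(self, dst_mac):
        # None would mean "unknown destination, flood the frame".
        return self.mac_table.get(dst_mac)


ROUTER_MAC = "fa:16:3e:00:00:01"   # hypothetical MAC shared by both qg- ports
ACTIVE_PORT, STANDBY_PORT = 1, 2   # hypothetical switch ports

switch = LearningSwitch()

# Normal operation: only the active router sends traffic.
switch.receive(ROUTER_MAC, ACTIVE_PORT)
before = switch.out_port(ROUTER_MAC)

# Standby L3 agent restarts; its router sends one MLDv2 frame.
switch.receive(ROUTER_MAC, STANDBY_PORT)
during = switch.out_port(ROUTER_MAC)

# The active router answers the broadcast ARP request, which repairs the entry.
switch.receive(ROUTER_MAC, ACTIVE_PORT)
after = switch.out_port(ROUTER_MAC)

print(before, during, after)  # 1 2 1
```

In this model, traffic for the FIP is forwarded to the standby port between the second and third step, which would match the observed window in which ICMP requests reached the standby node and unicast ARP requests went unanswered.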