Description of problem:
A design problem in keepalived causes unnecessary DNS requests during configuration reloads. If the DNS server configured on the network node is not accessible from the qrouter-* namespace (the external network), keepalived gets stuck for ~60 seconds on every floating IP change related to the router it serves. The MASTER role then flaps between the HA routers, causing connectivity issues with the external network for 1-2 minutes.

Version-Release number of selected component (if applicable):

How reproducible:
100%

Steps to Reproduce:
1. Use a DNS server in the management network on the network nodes.
2. Create a router (with L3 HA routers enabled).
3. Assign a FIP to an instance.
4. Ping the FIP, or the router IP.
5. Assign another FIP to a different instance, or make any modification to floating IPs on the same router.

Actual results:
The external connectivity breaks for 1-2 minutes.

Expected results:
The external connectivity works without issue, and the MASTER router doesn't flap.

Additional info:
The related keepalived bug is: https://bugzilla.redhat.com/show_bug.cgi?id=1181107

And we have a workaround:

echo 127.0.0.1 $( hostname ) >>/etc/hosts
The impact of this bug is that VRRP is not working and we do not have an HA solution for OSP-6.

The short-term fix should be done by the installer, as described in comment 1, by adding the following on the network/controller nodes:

echo 127.0.0.1 $( hostname ) >>/etc/hosts

The long-term fix is in keepalived (bz 1181107), in combination with Neutron, which has to provide the router_id parameter.
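As a rough illustration of the long-term direction, the sketch below prints a keepalived global_defs block in which router_id is set explicitly. It assumes that keepalived only falls back to a hostname lookup when such values are not given in the configuration; the helper name print_global_defs and both example values are hypothetical and are not necessarily what Neutron writes.

```shell
#!/bin/sh
# print_global_defs ROUTER_ID EMAIL_FROM
# Prints a keepalived global_defs block with both values set explicitly.
# Hypothetical helper for illustration; the values passed below are
# placeholders, not the ones any real deployment uses.
print_global_defs() {
    cat <<EOF
global_defs {
    notification_email_from $2
    router_id $1
}
EOF
}

print_global_defs "example-router" "keepalived@example.invalid"
```

The printed block would sit at the top of the keepalived configuration that the L3 agent generates for each HA router.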
I'd argue that this does not depend on BZ#1181107 since you have a valid workaround.
(In reply to Ryan O'Hara from comment #6)
> I'd argue that this does not depend on BZ#1181107 since you have a valid
> workaround.

We're using the bz tracker to handle the proper fix; the workaround is being done in deployment.

Important: after talking with Fabio di Nitto, we found that the workaround provided in the first comment is risky. Use this one instead, otherwise it could cause problems for pacemaker:

dig A $(hostname) | grep -A1 "ANSWER SEC" | tail -n 1 | awk '{print $NF " " $1}' | sed -e 's/.$//g' >>/etc/hosts ; grep $(hostname) /etc/hosts || echo "Failure setting up the hostname entry"
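The one-liner above appends an entry every time it runs. A minimal sketch of an idempotent variant follows, assuming it is acceptable to skip the write when the hostname is already listed. The function name add_host_entry is hypothetical, and the example invocation in the comment assumes that the first line of `dig +short A` output is the address (which may not hold if the name is a CNAME).

```shell
#!/bin/sh
# add_host_entry FILE IP NAME
# Appends "IP NAME" to FILE unless NAME is already listed on a
# non-comment line. Returns non-zero if IP is empty.
# Hypothetical helper, not part of any installer.
add_host_entry() {
    file=$1 ip=$2 name=$3
    if [ -z "$ip" ]; then
        echo "no address for $name; nothing written" >&2
        return 1
    fi
    if awk -v name="$name" '
            { sub(/#.*/, "") }    # ignore comments
            { for (i = 2; i <= NF; i++) if ($i == name) found = 1 }
            END { exit (found ? 0 : 1) }' "$file"; then
        echo "$name already present in $file"
    else
        printf '%s %s\n' "$ip" "$name" >> "$file"
        echo "added $ip $name to $file"
    fi
}

# Example invocation (as root on a network/controller node):
#   add_host_entry /etc/hosts "$(dig +short A "$(hostname)" | head -n 1)" "$(hostname)"
```

Keeping the file path as a parameter makes it possible to try the function against a scratch copy of /etc/hosts before touching the real one.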
The workaround is available via staypuft. We are currently waiting for the right fix in keepalived, so lowering the priority.
Pushing this one a bit; the fix is dependent on the keepalived fix, which is not available yet.
Hi Livnat/Meguel, We need to prioritize this bug in order to verify the older one https://bugzilla.redhat.com/show_bug.cgi?id=1181107 Thanks
Trying to apply this to tripleo; from what I understand, the workaround is in comment #1:

echo 127.0.0.1 $( hostname ) >>/etc/hosts

Does the mapping for `hostname` have to be against 127.0.0.1, or can it be against any valid local IP?
(In reply to Giulio Fidente from comment #11)
> trying to apply this to tripleo; from what I understand the workaround is in
> comment #1 :
>
> echo 127.0.0.1 $( hostname ) >>/etc/hosts
>
> does the mapping for `hostname` have to be against 127.0.0.1 or can it be
> against any, valid, local ip?

Better to use this workaround (or a modified version of it):

dig A $(hostname) | grep -A1 "ANSWER SEC" | tail -n 1 | awk '{print $NF " " $1}' | sed -e 's/.$//g' >>/etc/hosts ; grep $(hostname) /etc/hosts || echo "Failure setting up the hostname entry"

It will work better with a valid IP.

(In reply to Toni Freger from comment #10)
> Hi Livnat/Meguel,
>
> We need to prioritize this bug in order to verify the older one
> https://bugzilla.redhat.com/show_bug.cgi?id=1181107
>
> Thanks

Hi Toni, will do after the neutron mid-cycle sprint in Israel. Thanks.
keepalived commit https://github.com/acassen/keepalived/commit/9d028acd327e722e6692eaa9d47e3914e16edf3a significantly reduces the number of DNS requests made, and also adds an option that allows no DNS requests to be made.
This merged upstream: https://review.openstack.org/#/c/343312/

I've proposed it for OSP10, but it will be merged after the final upstream release.
Tested on OpenStack/10.0-RHEL-7/2016-11-19.4/RH7-RHOS-10.0/
openstack-neutron-9.1.0-5.el7ost.src.rpm
With 3 controllers and 1 compute

Steps to Reproduce:
1. Use a DNS server in the management network on the network nodes.
2. Create a router (with L3 HA routers enabled).
3. Assign a FIP to an instance.
4. Ping the FIP, or the router IP.
5. Assign another FIP to a different instance, or make any modification to floating IPs on the same router.

The issue didn't reproduce; connectivity to the VM remained stable.
The active router didn't flip.
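For anyone repeating step 4 while changing floating IPs, a small probe loop can make a 1-2 minute outage easier to spot than watching ping output by hand. This is only a sketch: the function name count_probe_failures and the PROBE and INTERVAL variables are hypothetical, and the default probe assumes the Linux iputils ping options -c and -W.

```shell
#!/bin/sh
# Probe command and delay between probes; both can be overridden.
PROBE=${PROBE:-"ping -c 1 -W 1"}
INTERVAL=${INTERVAL:-1}

# count_probe_failures TARGET COUNT
# Runs $PROBE against TARGET COUNT times, INTERVAL seconds apart.
# Logs each failure with a timestamp on stderr and prints the total
# number of failed probes on stdout.
count_probe_failures() {
    target=$1 count=$2 failures=0 i=0
    while [ "$i" -lt "$count" ]; do
        # $PROBE is left unquoted on purpose so it splits into
        # the command and its options.
        if ! $PROBE "$target" >/dev/null 2>&1; then
            failures=$((failures + 1))
            echo "$(date +%T) probe to $target failed" >&2
        fi
        i=$((i + 1))
        sleep "$INTERVAL"
    done
    echo "$failures"
}

# Example: probe a floating IP (placeholder address) for about 3 minutes
# while modifying FIPs on the same router:
#   count_probe_failures 192.0.2.50 180
```

A result of 0 over the whole window would match the "connectivity remained stable" observation above; a run of consecutive failures lasting about a minute would match the original symptom.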
(In reply to Toni Freger from comment #19)
> Tested on OpenStack/10.0-RHEL-7/2016-11-19.4/RH7-RHOS-10.0/
> openstack-neutron-9.1.0-5.el7ost.src.rpm
> With 3 controllers and 1 compute
>
> Steps to Reproduce:
> 1. use a DNS server in the management network in the network nodes
> 2. create a router (with l3 ha routers enabled)
> 3. assign a FIP to an instance
> 4. ping the FIP, or the router IP
> 5. assign another FIP to a different instance, or make any modification to
> floating IPs over the same router.
>
> The issue didn't reproduce, connectivity to the VM remain stable.
> The active router didn't flip.

Pinging you on IRC just in case. Make sure that it was also verified after removing the workaround (see comment 12) that OSPD sets on the system.
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://rhn.redhat.com/errata/RHEA-2016-2948.html