Description of problem:

In testing OSP 9 I have run into an issue where a tempest test, running on the Director node, creates an instance and logs into that instance to do some operations. The test will run successfully several times, but at around the 10th instance the test can no longer SSH into the instance: the SSH client times out and the test fails. If I run the test again it will fail right away with an SSH timeout.

The particular test I am using is:
tempest.scenario.test_volume_boot_pattern.TestVolumeBootPatternV2.test_volume_boot_pattern

But I can reproduce the problem with any test that SSHes into an instance from the Director node.

My environment is as follows:
- The public network for floating IPs is named "public", and that is also the default floating IP pool configured in nova.conf.
- The public network is 192.168.191.0/24.
- Tests are running with tenant isolation turned on, so each test gets its own network, router, subnet, and floating IP on the public network.

Version-Release number of selected component (if applicable):

How reproducible:
I have reproduced this problem on three different OSP 9 stamps.

Steps to Reproduce:
1. Run the tempest.scenario.test_volume_boot_pattern.TestVolumeBootPatternV2.test_volume_boot_pattern test 5-10 times in a row (see the loop sketch below).
2. The test will eventually fail because it cannot SSH into the instance.

Actual results:
SSH timeout

Expected results:
Test passes

Additional info:
I have also validated that this is not an issue on an OSP 8 stamp; there the test passes every time.
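For reference, a minimal sketch of driving the reproducer in a loop (the testr invocation and the tempest directory are assumptions; substitute whatever runner your tempest setup uses):

  cd /path/to/tempest   # hypothetical checkout location
  for i in $(seq 1 10); do
      testr run tempest.scenario.test_volume_boot_pattern.TestVolumeBootPatternV2.test_volume_boot_pattern || break
  done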
Are there any errors in overcloud log files?
There are no errors that I can find in the overcloud logs on controllers and compute nodes.
Upon further investigation we were able to validate that the instance that is not reachable from the outside (the Director node in this case) IS reachable from the network namespace on the controller. So it seems like the problem is access from the outside only; OVS/Neutron is resolving the floating IP from outside. Again, this only happens over time, after several instances CAN be SSH'd into from the Director node.
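A sketch of that namespace check, for the record (the router UUID and floating IP below are placeholders, not values from this environment):

  # on the controller hosting the router
  sudo ip netns list | grep qrouter
  sudo ip netns exec qrouter-<router-uuid> ping -c 3 <floating-ip>
  sudo ip netns exec qrouter-<router-uuid> ssh cirros@<floating-ip>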
Protocol  Address          Age(min)  Hardware Address   Interface  VLAN    CPU
---------------------------------------------------------------------------------------------
Internet  192.168.191.1    63        00:01:e8:8b:c1:57  Po 100     Vl 191  CP
Internet  192.168.191.2    62        fa:16:3e:01:54:fb  Po 43      Vl 191  CP
Internet  192.168.191.3    6         fa:16:3e:64:32:5a  Po 42      Vl 191  CP
Internet  192.168.191.4    5         fa:16:3e:a7:02:78  Po 42      Vl 191  CP
Internet  192.168.191.5    4         fa:16:3e:a7:02:78  Po 42      Vl 191  CP
Internet  192.168.191.6    4         fa:16:3e:a7:02:78  Po 42      Vl 191  CP
Internet  192.168.191.7    1         fa:16:3e:04:f6:0f  Po 42      Vl 191  CP
Internet  192.168.191.8    0         fa:16:3e:04:f6:0f  Po 42      Vl 191  CP
Internet  192.168.191.9    0         fa:16:3e:04:f6:0f  Po 42      Vl 191  CP
Internet  192.168.191.10   36        fa:16:3e:bc:7e:f4  Po 43      Vl 191  CP
Internet  192.168.191.12   62        fa:16:3e:d1:17:3a  Po 42      Vl 191  CP
Internet  192.168.191.13   62        fa:16:3e:d6:b4:aa  Po 43      Vl 191  CP
Internet  192.168.191.15   171       fa:16:3e:76:11:ec  Po 42      Vl 191  CP
Internet  192.168.191.16   62        fa:16:3e:d1:17:3a  Po 42      Vl 191  CP
Internet  192.168.191.17   62        fa:16:3e:76:11:ec  Po 42      Vl 191  CP
Internet  192.168.191.18   52        fa:16:3e:bc:7e:f4  Po 43      Vl 191  CP
Internet  192.168.191.19   14        fa:16:3e:fc:1c:22  Po 43      Vl 191  CP
Internet  192.168.191.20   14        fa:16:3e:fc:1c:22  Po 43      Vl 191  CP
Internet  192.168.191.21   13        fa:16:3e:fc:1c:22  Po 43      Vl 191  CP
Internet  192.168.191.22   12        fa:16:3e:fc:1c:22  Po 43      Vl 191  CP
Internet  192.168.191.23   11        fa:16:3e:7a:f5:9b  Po 42      Vl 191  CP
Internet  192.168.191.24   11        fa:16:3e:7a:f5:9b  Po 42      Vl 191  CP
Internet  192.168.191.25   10        fa:16:3e:7a:f5:9b  Po 42      Vl 191  CP
Internet  192.168.191.26   9         fa:16:3e:7a:f5:9b  Po 42      Vl 191  CP
Internet  192.168.191.27   8         fa:16:3e:64:32:5a  Po 42      Vl 191  CP
Internet  192.168.191.28   8         fa:16:3e:64:32:5a  Po 42      Vl 191  CP
Internet  192.168.191.29   7         fa:16:3e:64:32:5a  Po 42      Vl 191  CP
Internet  192.168.191.30   5         fa:16:3e:a7:02:78  Po 42      Vl 191  CP
Internet  192.168.191.252  -         00:01:e8:8b:c1:3f  -          Vl 191  CP

MHT1R1M_SW03#clear arp-cache vlan 191
MHT1R1M_SW03#sh arp
Protocol  Address          Age(min)  Hardware Address   Interface  VLAN    CPU
---------------------------------------------------------------------------------------------
Internet  192.168.190.106  7         52:54:00:05:9a:75  Po 20      Vl 190  CP
Internet  192.168.190.109  80        4e:69:7a:e6:bb:28  Po 42      Vl 190  CP
Internet  192.168.190.110  80        42:6b:cc:e3:7c:a2  Po 43      Vl 190  CP
Internet  192.168.190.242  77        00:50:56:aa:60:ed  Po 1       Vl 190  CP
Internet  192.168.190.250  80        42:6b:cc:e3:7c:a2  Po 43      Vl 190  CP
Internet  192.168.190.252  -         00:01:e8:8b:c1:3f  -          Vl 190  CP
Internet  192.168.191.1    0         00:01:e8:8b:c1:57  Po 100     Vl 191  CP
Internet  192.168.191.2    0         fa:16:3e:01:54:fb  Po 43      Vl 191  CP
Internet  192.168.191.7    0         fa:16:3e:04:f6:0f  Po 42      Vl 191  CP
Internet  192.168.191.9    0         fa:16:3e:04:f6:0f  Po 42      Vl 191  CP
Internet  192.168.191.10   0         fa:16:3e:bc:7e:f4  Po 43      Vl 191  CP
Internet  192.168.191.12   0         fa:16:3e:d1:17:3a  Po 42      Vl 191  CP
Internet  192.168.191.13   0         fa:16:3e:d6:b4:aa  Po 43      Vl 191  CP
Internet  192.168.191.15   0         fa:16:3e:76:11:ec  Po 42      Vl 191  CP
Internet  192.168.191.16   0         fa:16:3e:d1:17:3a  Po 42      Vl 191  CP
Internet  192.168.191.17   0         fa:16:3e:76:11:ec  Po 42      Vl 191  CP
Internet  192.168.191.18   0         fa:16:3e:bc:7e:f4  Po 43      Vl 191  CP
Internet  192.168.191.19   0         fa:16:3e:04:f6:0f  Po 42      Vl 191  CP
Internet  192.168.191.252  -         00:01:e8:8b:c1:3f  -          Vl 191  CP
(In reply to Randy Perryman from comment #4)

If you check .19 you will see it moved from Port-Channel 43 to Port-Channel 42; something is preventing the ARP entry from being updated. This is the same switch config that worked with OSP 8.
Okay, this seems related to https://bugs.launchpad.net/neutron/+bug/1268995

--------------------
Looking at the controllers I see that send_arp_for_ha is not set.

[heat-admin@red-controller-0 ~]$ sudo grep send_arp /etc/neutron/l3_agent.ini
#send_arp_for_ha = 3
[heat-admin@red-controller-0 ~]$ exit
logout
Connection to 192.168.120.129 closed.
[stack@director ~]$ ssh cntl1
Last login: Fri Oct 14 18:49:43 2016 from 192.168.120.106
[heat-admin@red-controller-1 ~]$ sudo grep send_arp /etc/neutron/l3_agent.ini
#send_arp_for_ha = 3
------------------

How do I set this in our yamls?
(In reply to Randy Perryman from comment #6)
> Okay, this seems related to https://bugs.launchpad.net/neutron/+bug/1268995
>
> --------------------
> Looking at the controllers I see that send_arp_for_ha is not set.
>
> [heat-admin@red-controller-0 ~]$ sudo grep send_arp /etc/neutron/l3_agent.ini
> #send_arp_for_ha = 3
>
> ------------------
> How do I set this in our yamls?

The default is 3, so it's enabled. Looking at the Launchpad bug you linked, check out comment 10:
https://bugs.launchpad.net/neutron/+bug/1268995/comments/10
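For reference, if you did want to pin the value explicitly through the TripleO yamls, a hedged sketch of an environment file (the hiera key is an assumption based on puppet-neutron's l3 agent class; verify it against your templates before using it):

  parameter_defaults:
    ExtraConfig:
      # assumed puppet-neutron key for l3_agent.ini's send_arp_for_ha
      neutron::agents::l3::send_arp_for_ha: 3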
How do we confirm it is set to 3 in the running system?
(In reply to Randy Perryman from comment #8)
> How do we confirm it is set to 3 in the running system?

You pasted l3_agent.ini; it shows the option is commented out, therefore it's using the default of 3. If you want to see for yourself, you can find the PID of the L3 agent on the system, then:

kill -s SIGUSR2 $L3_AGENT_PID

It will spit out the full list of conf options it's actively using.
This is all I get from that:

[root@overcloud-controller-0 neutron]# ps axf | grep l3
  2912 pts/0    S+     0:00      \_ grep --color=auto l3
150607 ?        Ss    86:01 /usr/bin/python2 /usr/bin/neutron-l3-agent --config-file /usr/share/neutron/neutron-dist.conf --config-dir /usr/share/neutron/l3_agent --config-file /etc/neutron/neutron.conf --config-dir /etc/neutron/conf.d/common --config-dir /etc/neutron/conf.d/neutron-l3-agent --log-file /var/log/neutron/l3-agent.log
[root@overcloud-controller-0 neutron]# kill -s SIGUSR2 150607
[root@overcloud-controller-0 neutron]#
---------------
(In reply to Randy Perryman from comment #10)
> This is all I get from that:
>
> [root@overcloud-controller-0 neutron]# ps axf | grep l3
> 150607 ? Ss 86:01 /usr/bin/python2 /usr/bin/neutron-l3-agent
> --log-file /var/log/neutron/l3-agent.log
> [root@overcloud-controller-0 neutron]# kill -s SIGUSR2 150607
> [root@overcloud-controller-0 neutron]#

It should be in the L3 agent logs.
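Assuming the log path shown on the agent's command line above, something along these lines should surface the dumped value:

  grep send_arp_for_ha /var/log/neutron/l3-agent.log | tail -n 5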
Thanks, found it. It says 3 for both servers.

I have also verified that the switch is configured to accept gratuitous ARP.
any ideas why we are not updating?
(In reply to Randy Perryman from comment #13)
> any ideas why we are not updating the ARP Cache?

Especially as they do learn.
Created attachment 1211554 [details]
Plain text file of tcpdump from the switch

This is tcpdump from the switches themselves.

192.168.191.21 is the IP assigned to the tenant router.
192.168.191.22 is the floating IP assigned to an instance.

You can see a gratuitous ARP for 192.168.191.21, then a few seconds later a who-has for 192.168.191.22, but you never see the gratuitous ARP for it.
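An equivalent capture can be taken from the controller side; a sketch, where the interface name is a placeholder for whichever NIC carries the external VLAN:

  sudo tcpdump -i <external-iface> -nn -e arp and host 192.168.191.22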
Steps to recreate:
1. Deploy OSP 9 with Jetstream code using 2-3 controllers, VRRP, and bonded NICs
2. Create a floating IP network with 5 IPs
3. Create a router and allocate IPs
4. Create 2 instances and assign floating IPs (see the CLI sketch below)
5. Ping from an outside source
6. Delete all instances/floating IPs/router
7. Repeat in a new tenant
8. Failure should happen
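A hedged sketch of steps 2-4 with the Mitaka-era CLI (names, CIDR, pool range, image, and flavor are illustrative placeholders, not the exact values used in this environment):

  neutron net-create public --router:external=True
  neutron subnet-create public 192.168.191.0/24 --name public-subnet \
      --allocation-pool start=192.168.191.20,end=192.168.191.24 --disable-dhcp
  neutron router-create demo-router
  neutron router-gateway-set demo-router public
  nova boot --image cirros --flavor m1.tiny --nic net-id=<tenant-net-id> demo-vm
  neutron floatingip-create public
  nova floating-ip-associate demo-vm <floating-ip>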
I should add: the router for the floating IPs is the network device that the controllers are directly connected to.
https://bugs.launchpad.net/neutron/+bug/1585165/comments/16

This looks suspiciously close to what we are seeing: the floating IP is not being cleaned up or reassigned correctly.
Are there any further diagnostics that would help us determine whether this is a potential environmental problem or exactly the upstream issue Randy mentions in https://bugzilla.redhat.com/show_bug.cgi?id=1384108#c18?
(In reply to Mike Orazi from comment #19)
> Are there any further diagnostics that would help us determine whether this
> is a potential environmental problem or exactly the upstream issue Randy
> mentions in https://bugzilla.redhat.com/show_bug.cgi?id=1384108#c18?

You could manually issue arping with -A and -U from the router namespace and see if either resolves the issue. One caveat is that we use HA routers that issue GARPs from keepalived, not from the L3 agent code. If we need to make modifications to the way we send GARPs, we'll have to do them in keepalived, which is possible but more difficult.
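A sketch of that manual test, for the record (the router UUID, qg- interface, and floating IP are placeholders; -A sends gratuitous ARP replies, -U sends unsolicited ARP requests):

  sudo ip netns exec qrouter-<router-uuid> \
      arping -A -I qg-<port-prefix> -c 3 <floating-ip>
  sudo ip netns exec qrouter-<router-uuid> \
      arping -U -I qg-<port-prefix> -c 3 <floating-ip>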
keepalived after upgrade from OSP 8 to OSP 9: keepalived-1.2.13-7.el7.x86_64

Fresh install of JS 6.0 (OSP 9): keepalived-1.2.13-7.el7.x86_64

Working on creating an OSP 8 install to see what version that is at.
doing rpm -qa on the image file - keepalived-1.2.13-7.el7.x86_64
(In reply to Randy Perryman from comment #23)
> doing rpm -qa on the image file - keepalived-1.2.13-7.el7.x86_64

Just validated on a JS 5.0 OSP 8 install that keepalived is also keepalived-1.2.13-7.el7.x86_64.
Do we have a fix?
The upstream issue was fixed, but then someone commented that they see the same issue. So either the issue was never actually fixed, or the commenter only saw a similar symptom but actually hit a different issue.

Randy, it isn't clear from the comments whether you were able to use Assaf's suggestion to verify the issue as described in the comment:

"You could manually issue arping with -A and -U from the router namespace and see if either resolves the issue."

It sounds like if those two arping commands work around the issue, then it could indeed be a match for what the most recent commenter sees/suggests.
I was able to boot an instance that was unreachable via SSH at 192.168.191.21. I found which controller had the address and from that network namespace ran:

arping -A -U -I qg-5b95cc22-6b 192.168.191.21

where qg-5b95cc22-6b is the device in the network namespace that has the .21 address. It returned:

ARPING 192.168.191.21 from 192.168.191.21 qg-5b95cc22-6b

In a second session I was then able to SSH into 192.168.191.21 without issue.

Conclusion: the arping command allowed me to reuse the floating IP 192.168.191.21 and make it accessible.
As Dave shows, the arping works, so was there an upstream patch we need to try?
Assaf, using arping -A -U ... resolves the issue. What are the next steps?
I'm pretty much convinced this is a duplicate of https://bugzilla.redhat.com/show_bug.cgi?id=1386718. @Jakub, would you be able to backport the fix to OSP 9 as well?
Jakub, will the fix for 1386718 be backported to OSP 9?

Arkady
I agree that bug is similar in that a GARP is not occurring when the VIP moves. Is the same logic used when a floating IP is moved?
(In reply to Randy Perryman from comment #32)
> I agree that bug is similar in that a GARP is not occurring when the VIP
> moves. Is the same logic used when a floating IP is moved?

Floating IPs are implemented as VIPs in keepalived when you're using HA routers. Here's some more info:
https://assafmuller.com/2014/11/08/openstack-paris-network-node-high-availability-video/
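To illustrate, a hedged sketch (the path, interface names, and values are representative of a Mitaka-era HA router, not taken from this environment): the L3 agent renders a keepalived config per HA router, typically under /var/lib/neutron/ha_confs/<router-id>/keepalived.conf, and floating IPs appear alongside the VRRP VIPs, roughly like:

  vrrp_instance VR_1 {
      state BACKUP
      interface ha-<port-prefix>
      virtual_router_id 1
      priority 50
      virtual_ipaddress {
          169.254.0.1/24 dev ha-<port-prefix>
      }
      virtual_ipaddress_excluded {
          192.168.191.21/32 dev qg-<port-prefix>    # floating IP managed as a keepalived VIP
      }
  }

So when a floating IP moves or the router fails over, it is keepalived, not the L3 agent, that is responsible for sending the GARPs, as noted earlier in this bug.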
I backported the fix to OSP 9 and to upstream Mitaka.
*** Bug 1397926 has been marked as a duplicate of this bug. ***
When can we expect to see this fix in a z-stream package?
(In reply to David Paterson from comment #36)
> When can we expect to see this fix in a z-stream package?

The clone I just created got dup'd to this BZ. We are currently waiting for the following upstream change to be merged in stable/mitaka; we already have a RH gerrit review for it to be backported into OSP 9. Until it merges upstream, we can't give an ETA. That said, it has been touched over the last two days, so I would think soon.

https://review.openstack.org/#/c/400348/
The Mitaka upstream patch [0] has landed. Working on the downstream patch for the OSP 9 backport.

Also, the fix for the same issue in OSP 10 has landed upstream [1] and downstream [2].

[0] https://review.openstack.org/#/c/400348/
[1] https://review.openstack.org/#/c/393886/17
[2] https://bugzilla.redhat.com/show_bug.cgi?id=1386718
Verified and fixed on an OSPD 9 virt env:

[root@controller-0 ~]# rpm -qa | grep openstack-neutron-8.
openstack-neutron-8.1.2-14.el7ost.noarch

Ran the Tempest scenario mentioned in the bug and also verified it manually.
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://rhn.redhat.com/errata/RHBA-2017-0232.html