Bug 1384108 - Cannot SSH into instance after consecutive test runs
Summary: Cannot SSH into instance after consecutive test runs
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat OpenStack
Classification: Red Hat
Component: openstack-neutron
Version: 9.0 (Mitaka)
Hardware: All
OS: Linux
Priority: medium
Severity: urgent
Target Milestone: ---
Target Release: 9.0 (Mitaka)
Assignee: Jakub Libosvar
QA Contact: Eran Kuris
URL:
Whiteboard:
Duplicates: 1397926
Depends On:
Blocks: 1305654
 
Reported: 2016-10-12 14:55 UTC by David Paterson
Modified: 2017-02-01 14:24 UTC
CC List: 20 users

Fixed In Version: openstack-neutron-8.1.2-14.el7ost
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2017-02-01 14:24:57 UTC
Target Upstream Version:
Embargoed:


Attachments
Plain Text file of tcpdump from the Switch (6.05 KB, text/plain)
2016-10-17 22:31 UTC, Randy Perryman


Links
System                  ID              Private  Priority  Status        Summary                             Last Updated
OpenStack gerrit        400348          0        None      None          None                                2016-11-21 17:23:37 UTC
Red Hat Product Errata  RHBA-2017:0232  0        normal    SHIPPED_LIVE  openstack-neutron bug fix advisory  2017-02-01 19:24:39 UTC

Description David Paterson 2016-10-12 14:55:06 UTC
Description of problem:
In testing OSP 9 I have run into an issue where a Tempest test, running on the Director node, creates an instance and logs into that instance to perform some operations. The test runs successfully several times, but at around the 10th instance the test can no longer SSH into the instance: the SSH client times out and the test fails. If I run the test again it fails right away with an SSH timeout.

The particular test I am using is:
tempest.scenario.test_volume_boot_pattern.TestVolumeBootPatternV2.test_volume_boot_pattern

But I can reproduce the problem with any test that SSHes into an instance from the Director node.

My environment is as follows.
The public network for floating IPs is named "public", and that is also the default floating IP pool configured in nova.conf.
The public network subnet is 192.168.191.0/24.

Tests are running with tenant isolation turned on, so each test gets its own network, router, and subnet, plus a floating IP on the public network.

Version-Release number of selected component (if applicable):


How reproducible: I have reproduced this problem on three different OSP 9 stamps.


Steps to Reproduce:
1. Run the tempest.scenario.test_volume_boot_pattern.TestVolumeBootPatternV2.test_volume_boot_pattern test 5-10 times in a row (see the example loop below).
2. The test will eventually fail because it cannot SSH into the instance.
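
A minimal sketch of such a repeated run, assuming the test is driven with ostestr from a Tempest working directory on the Director node (the exact runner and working directory depend on how Tempest was set up):

# run the scenario up to 10 times, stopping at the first failure
for i in $(seq 1 10); do
    ostestr --regex 'tempest.scenario.test_volume_boot_pattern.TestVolumeBootPatternV2.test_volume_boot_pattern' || break
done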

Actual results:

SSH timeout; the test fails.

Expected results:
The test passes; SSH into the instance succeeds.

Additional info:
I have also validated that this is not an issue on an OSP 8 stamp; there the test passes every time.

Comment 1 Mike Burns 2016-10-13 17:51:13 UTC
Are there any errors in overcloud log files?

Comment 2 David Paterson 2016-10-13 18:12:52 UTC
There are no errors that I can find in the overcloud logs on controllers and compute nodes.

Comment 3 David Paterson 2016-10-14 13:23:38 UTC
Upon further investigation we were able to validate that an instance which is not reachable from the outside (the Director node in this case) is still reachable from within the corresponding network namespace on the controller.

So it seems the problem is with access from the outside only, i.e. with how OVS/Neutron resolves the floating IP from outside.

Again, this only happens over time; at first, several instances CAN be SSH'd into from the Director node.
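
A hedged sketch of that namespace check, run on the controller hosting the router; the router UUID, the instance address, and the cirros user (the default Tempest image user) are placeholders/assumptions:

# list the router namespaces and test reachability from inside one of them
ip netns | grep qrouter
ip netns exec qrouter-<router-uuid> ping -c 3 <instance-ip>
ip netns exec qrouter-<router-uuid> ssh cirros@<instance-ip>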

Comment 4 Randy Perryman 2016-10-14 19:06:52 UTC
Protocol    Address         Age(min)  Hardware Address    Interface      VLAN             CPU
---------------------------------------------------------------------------------------------
Internet    192.168.191.1        63   00:01:e8:8b:c1:57   Po 100         Vl 191           CP
Internet    192.168.191.2        62   fa:16:3e:01:54:fb   Po 43          Vl 191           CP
Internet    192.168.191.3         6   fa:16:3e:64:32:5a   Po 42          Vl 191           CP
Internet    192.168.191.4         5   fa:16:3e:a7:02:78   Po 42          Vl 191           CP
Internet    192.168.191.5         4   fa:16:3e:a7:02:78   Po 42          Vl 191           CP
Internet    192.168.191.6         4   fa:16:3e:a7:02:78   Po 42          Vl 191           CP
Internet    192.168.191.7         1   fa:16:3e:04:f6:0f   Po 42          Vl 191           CP
Internet    192.168.191.8         0   fa:16:3e:04:f6:0f   Po 42          Vl 191           CP
Internet    192.168.191.9         0   fa:16:3e:04:f6:0f   Po 42          Vl 191           CP
Internet    192.168.191.10       36   fa:16:3e:bc:7e:f4   Po 43          Vl 191           CP
Internet    192.168.191.12       62   fa:16:3e:d1:17:3a   Po 42          Vl 191           CP
Internet    192.168.191.13       62   fa:16:3e:d6:b4:aa   Po 43          Vl 191           CP
Internet    192.168.191.15      171   fa:16:3e:76:11:ec   Po 42          Vl 191           CP
Internet    192.168.191.16       62   fa:16:3e:d1:17:3a   Po 42          Vl 191           CP
Internet    192.168.191.17       62   fa:16:3e:76:11:ec   Po 42          Vl 191           CP
Internet    192.168.191.18       52   fa:16:3e:bc:7e:f4   Po 43          Vl 191           CP
Internet    192.168.191.19       14   fa:16:3e:fc:1c:22   Po 43          Vl 191           CP
Internet    192.168.191.20       14   fa:16:3e:fc:1c:22   Po 43          Vl 191           CP
Internet    192.168.191.21       13   fa:16:3e:fc:1c:22   Po 43          Vl 191           CP
Internet    192.168.191.22       12   fa:16:3e:fc:1c:22   Po 43          Vl 191           CP
Internet    192.168.191.23       11   fa:16:3e:7a:f5:9b   Po 42          Vl 191           CP
Internet    192.168.191.24       11   fa:16:3e:7a:f5:9b   Po 42          Vl 191           CP
Internet    192.168.191.25       10   fa:16:3e:7a:f5:9b   Po 42          Vl 191           CP
Internet    192.168.191.26        9   fa:16:3e:7a:f5:9b   Po 42          Vl 191           CP
Internet    192.168.191.27        8   fa:16:3e:64:32:5a   Po 42          Vl 191           CP
Internet    192.168.191.28        8   fa:16:3e:64:32:5a   Po 42          Vl 191           CP
Internet    192.168.191.29        7   fa:16:3e:64:32:5a   Po 42          Vl 191           CP
Internet    192.168.191.30        5   fa:16:3e:a7:02:78   Po 42          Vl 191           CP
Internet    192.168.191.252       -   00:01:e8:8b:c1:3f        -         Vl 191           CP
MHT1R1M_SW03#clear arp-cache vlan 191
MHT1R1M_SW03#sh arp

Protocol    Address         Age(min)  Hardware Address    Interface      VLAN             CPU
---------------------------------------------------------------------------------------------
Internet    192.168.190.106       7   52:54:00:05:9a:75   Po 20          Vl 190           CP
Internet    192.168.190.109      80   4e:69:7a:e6:bb:28   Po 42          Vl 190           CP
Internet    192.168.190.110      80   42:6b:cc:e3:7c:a2   Po 43          Vl 190           CP
Internet    192.168.190.242      77   00:50:56:aa:60:ed   Po 1           Vl 190           CP
Internet    192.168.190.250      80   42:6b:cc:e3:7c:a2   Po 43          Vl 190           CP
Internet    192.168.190.252       -   00:01:e8:8b:c1:3f        -         Vl 190           CP
Internet    192.168.191.1         0   00:01:e8:8b:c1:57   Po 100         Vl 191           CP
Internet    192.168.191.2         0   fa:16:3e:01:54:fb   Po 43          Vl 191           CP
Internet    192.168.191.7         0   fa:16:3e:04:f6:0f   Po 42          Vl 191           CP
Internet    192.168.191.9         0   fa:16:3e:04:f6:0f   Po 42          Vl 191           CP
Internet    192.168.191.10        0   fa:16:3e:bc:7e:f4   Po 43          Vl 191           CP
Internet    192.168.191.12        0   fa:16:3e:d1:17:3a   Po 42          Vl 191           CP
Internet    192.168.191.13        0   fa:16:3e:d6:b4:aa   Po 43          Vl 191           CP
Internet    192.168.191.15        0   fa:16:3e:76:11:ec   Po 42          Vl 191           CP
Internet    192.168.191.16        0   fa:16:3e:d1:17:3a   Po 42          Vl 191           CP
Internet    192.168.191.17        0   fa:16:3e:76:11:ec   Po 42          Vl 191           CP
Internet    192.168.191.18        0   fa:16:3e:bc:7e:f4   Po 43          Vl 191           CP
Internet    192.168.191.19        0   fa:16:3e:04:f6:0f   Po 42          Vl 191           CP
Internet    192.168.191.252       -   00:01:e8:8b:c1:3f        -         Vl 191           CP

Comment 5 Randy Perryman 2016-10-14 19:08:27 UTC
(In reply to Randy Perryman from comment #4)
If you check .19 you will see it moved from Port-Channel 43 to Port-Channel 42; something is preventing the ARP entry from being updated. This is the same switch config that worked with OSP 8.

Comment 6 Randy Perryman 2016-10-17 14:07:35 UTC
Okay this seems related to 

https://bugs.launchpad.net/neutron/+bug/1268995

--------------------
Looking at the controllers I see that send_arp_for_ha is not set.


[heat-admin@red-controller-0 ~]$ sudo grep send_arp /etc/neutron/l3_agent.ini
#send_arp_for_ha = 3
[heat-admin@red-controller-0 ~]$ exit
logout
Connection to 192.168.120.129 closed.
[stack@director ~]$ ssh cntl1
Last login: Fri Oct 14 18:49:43 2016 from 192.168.120.106
[heat-admin@red-controller-1 ~]$ sudo grep send_arp /etc/neutron/l3_agent.ini
#send_arp_for_ha = 3


------------------
How do I set this in our YAML templates?
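
A hedged sketch of one way to set this through the deployment templates, assuming TripleO's ExtraConfig hieradata mechanism and the puppet-neutron parameter name (both are assumptions, not verified against this environment):

parameter_defaults:
  ExtraConfig:
    # assumed hiera key exposed by puppet-neutron's L3 agent class
    neutron::agents::l3::send_arp_for_ha: 3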

Comment 7 Assaf Muller 2016-10-17 14:37:25 UTC
(In reply to Randy Perryman from comment #6)
> Okay this seems related to 
> 
> https://bugs.launchpad.net/neutron/+bug/1268995
> 
> --------------------
> Looking at the controllers I see the send_arp_for_ha is not set.
> 
> 
> [heat-admin@red-controller-0 ~]$ sudo grep send_arp /etc/neutron/l3_agent.ini
> #send_arp_for_ha = 3
> [heat-admin@red-controller-0 ~]$ exit
> logout
> Connection to 192.168.120.129 closed.
> [stack@director ~]$ ssh cntl1
> Last login: Fri Oct 14 18:49:43 2016 from 192.168.120.106
> [heat-admin@red-controller-1 ~]$ sudo grep send_arp /etc/neutron/l3_agent.ini
> #send_arp_for_ha = 3
> 
> 
> ------------------
> How do I set this in our yamls?

The default is 3, so it's enabled. Looking at the Launchpad bug you linked, check out comment 10:
https://bugs.launchpad.net/neutron/+bug/1268995/comments/10

Comment 8 Randy Perryman 2016-10-17 15:14:02 UTC
How do we confirm it is set to 3 in the running system?

Comment 9 Assaf Muller 2016-10-17 15:23:57 UTC
(In reply to Randy Perryman from comment #8)
> How do we confirm it is set to 3 in the running system?

You pasted l3_agent.ini; it shows the option is commented out, therefore it's using the default of 3. If you want to see for yourself, you can find the PID of the L3 agent on the system, then:

kill -s SIGUSR2 $L3_AGENT_PID

It will spit out the full list of conf options it's actively using.

Comment 10 Randy Perryman 2016-10-17 15:40:57 UTC
This is all I get from that:
[root@overcloud-controller-0 neutron]# ps axf | grep l3
  2912 pts/0    S+     0:00                      \_ grep --color=auto l3
150607 ?        Ss    86:01 /usr/bin/python2 /usr/bin/neutron-l3-agent --config-file /usr/share/neutron/neutron-dist.conf --config-dir /usr/share/neutron/l3_agent --config-file /etc/neutron/neutron.conf --config-dir /etc/neutron/conf.d/common --config-dir /etc/neutron/conf.d/neutron-l3-agent --log-file /var/log/neutron/l3-agent.log
[root@overcloud-controller-0 neutron]# kill -s SIGUSR2 150607
[root@overcloud-controller-0 neutron]#
---------------

Comment 11 Assaf Muller 2016-10-17 15:47:05 UTC
(In reply to Randy Perryman from comment #10)
> This is all I get from that:
> [root@overcloud-controller-0 neutron]# ps axf | grep l3
>   2912 pts/0    S+     0:00                      \_ grep --color=auto l3
> 150607 ?        Ss    86:01 /usr/bin/python2 /usr/bin/neutron-l3-agent
> --config-file /usr/share/neutron/neutron-dist.conf --config-dir
> /usr/share/neutron/l3_agent --config-file /etc/neutron/neutron.conf
> --config-dir /etc/neutron/conf.d/common --config-dir
> /etc/neutron/conf.d/neutron-l3-agent --log-file /var/log/neutron/l3-agent.log
> [root@overcloud-controller-0 neutron]# kill -s SIGUSR2 150607
> [root@overcloud-controller-0 neutron]#
> ---------------

It should be in the L3 agent logs.
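
A hedged sketch of where to look, using the log path from the agent's --log-file argument above; after the SIGUSR2 the agent dumps its effective configuration into that log:

grep send_arp_for_ha /var/log/neutron/l3-agent.log | tail -n 1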

Comment 12 Randy Perryman 2016-10-17 15:55:07 UTC
Thanks, found it; it says 3 for both servers.
I have also verified that the switch is configured to accept gratuitous ARP.

Comment 13 Randy Perryman 2016-10-17 17:33:06 UTC
Any ideas why we are not updating?

Comment 14 Randy Perryman 2016-10-17 18:19:42 UTC
(In reply to Randy Perryman from comment #13)
> any ideas why we are not updating the ARP Cache?  Especially as they do learn.

Comment 15 Randy Perryman 2016-10-17 22:31:29 UTC
Created attachment 1211554 [details]
Plain Text file of tcpdump from the Switch

This is a tcpdump from the switches themselves.
192.168.191.21 is the IP assigned to the tenant router.
192.168.191.22 is the floating IP assigned to an instance.

You can see a gratuitous ARP for 192.168.191.21,
then a few seconds later a who-has for 192.168.191.22, but you never see a gratuitous ARP for it.
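
A hedged sketch of an equivalent capture from a Linux host attached to the external VLAN (the interface name is an assumption); the problem shows up as who-has requests for the floating IP with no gratuitous ARP announcing it:

tcpdump -e -n -i eth1 'arp and (host 192.168.191.21 or host 192.168.191.22)'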

Comment 16 Randy Perryman 2016-10-18 13:32:54 UTC
Steps to recreate (a rough CLI sketch of steps 2-7 follows the list):
1. Deploy OSP 9 with Jetstream code using 2-3 controllers, VRRP, and bonded NICs
2. Create a floating IP network with 5 IPs
3. Create a router and allocate IPs
4. Create 2 instances and assign floating IPs
5. Ping from an outside source
6. Delete all instances/floating IPs/router
7. Repeat in a new tenant
8. The failure should happen
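
A hedged sketch of steps 2-7 with the Mitaka-era CLIs; the network/subnet/image/flavor names and the UUID/address placeholders are assumptions, and the external "public" network is assumed to already exist:

neutron net-create test-net
neutron subnet-create --name test-subnet test-net 10.0.10.0/24
neutron router-create test-router
neutron router-interface-add test-router test-subnet
neutron router-gateway-set test-router public
nova boot --image cirros --flavor m1.tiny --nic net-id=<test-net-uuid> vm1
nova boot --image cirros --flavor m1.tiny --nic net-id=<test-net-uuid> vm2
neutron floatingip-create public              # once per instance
nova floating-ip-associate vm1 <floating-ip-address>
ping -c 3 <floating-ip-address>               # from the Director / outside host
nova delete vm1 vm2
neutron floatingip-delete <floating-ip-uuid>
neutron router-gateway-clear test-router
neutron router-interface-delete test-router test-subnet
neutron router-delete test-router
# then repeat the whole sequence as a new tenant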

Comment 17 Randy Perryman 2016-10-18 14:23:22 UTC
I should add: the router for the floating IPs is the network device that the controllers are directly connected to.

Comment 18 Randy Perryman 2016-10-18 16:56:38 UTC
https://bugs.launchpad.net/neutron/+bug/1585165/comments/16
This looks suspiciously close to what we are seeing.  The Floating IP is not being cleaned up correctly or reassigned correctly.

Comment 19 Mike Orazi 2016-10-20 20:24:51 UTC
Are there any further diagnostics that would help us determine whether this is a potential environmental problem or whether it is exactly the upstream issue Randy mentions in https://bugzilla.redhat.com/show_bug.cgi?id=1384108#c18?

Comment 20 Assaf Muller 2016-10-21 19:37:32 UTC
(In reply to Mike Orazi from comment #19)
> Any further diagnostics that would help us determine if this is a potential
> environmental problem or if this is exactly the upstream issue Randy
> mentions in https://bugzilla.redhat.com/show_bug.cgi?id=1384108#c18

You could manually issue arping with -A and -U from the router namespace and see if either resolves the issue.

One caveat is that we use HA routers, which issue GARPs from keepalived, not from the L3 agent code. If we need to make modifications to the way we send GARPs, we'll have to do it in keepalived, which is possible but more difficult.
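
A hedged sketch of that manual check, run on the controller currently hosting the router; the namespace and qg- device names are placeholders for this environment:

ip netns | grep qrouter
ip netns exec qrouter-<router-uuid> arping -A -I qg-<id> -c 3 192.168.191.22   # gratuitous ARP replies
ip netns exec qrouter-<router-uuid> arping -U -I qg-<id> -c 3 192.168.191.22   # unsolicited ARP requests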

Comment 21 Randy Perryman 2016-10-25 14:12:52 UTC
keepalived after upgrade from OSP 8 to OSP 9:

keepalived-1.2.13-7.el7.x86_64


Fresh install of JS 6.0 (OSP 9):

keepalived-1.2.13-7.el7.x86_64

Working on creating an OSP 8 install to see what version that has.

Comment 23 Randy Perryman 2016-10-25 14:27:58 UTC
Doing rpm -qa on the image file: keepalived-1.2.13-7.el7.x86_64

Comment 24 Randy Perryman 2016-10-26 08:45:51 UTC
(In reply to Randy Perryman from comment #23)
> doing rpm -qa on the image file - keepalived-1.2.13-7.el7.x86_64

Just validated on a JS 5.0 OSP 8 install that the keepalived version there is also keepalived-1.2.13-7.el7.x86_64.

Comment 25 Randy Perryman 2016-11-04 18:45:37 UTC
Do we have a fix?

Comment 26 Sean Merrow 2016-11-10 16:12:55 UTC
The upstream issue was fixed, but then someone commented that they see the same issue. So either the issue was never actually fixed, or the commenter only saw a similar symptom but actually hit a different issue.

Randy, it isn't clear from the comments whether you were able to use Assaf's suggestion to verify the issue as described in his comment:

"You could manually issue arping with -A and -U from the router namespace and see if either resolves the issue."

It sounds like if those two arping commands work around the issue, then it could indeed be a match for what the most recent commenter sees/suggests.

Comment 27 David Paterson 2016-11-10 23:14:24 UTC
I was able to boot an instance that was unreachable via SSH at its floating IP, 192.168.191.21.

I found which controller had the address and from that network namespace ran:
arping -A -U -I qg-5b95cc22-6b 192.168.191.21

Where qg-5b95cc22-6b is the device in the network namespace that has the .21 address.

It returned:
ARPING 192.168.191.21 from 192.168.191.21 qg-5b95cc22-6b

In a duplicate session I was then able to ssh into 192.168.191.21 without issue.

Conclusion: the arping command allowed me to reuse the floating ip 192.168.191.21 and make it accessible.
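
A hedged sketch of the follow-up check from the Director node (cirros is an assumed guest user, from the default Tempest image):

ssh -o ConnectTimeout=10 cirros@192.168.191.21 uptime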

Comment 28 Randy Perryman 2016-11-11 17:10:30 UTC
As Dave shows, the arping works, so is there an upstream patch we need to try?

Comment 29 Sean Merrow 2016-11-11 19:01:35 UTC
Assaf, using arping -A -U ... resolves the issue. What are the next steps?

Comment 30 Assaf Muller 2016-11-11 19:51:24 UTC
I'm pretty much convinced this is a duplicate of https://bugzilla.redhat.com/show_bug.cgi?id=1386718.

@Jakub, would you be able to backport the fix to OSP 9 as well?

Comment 31 arkady kanevsky 2016-11-11 19:58:07 UTC
Jakub,
will the fix for 1386718 be backported to OSP9?
Arkady

Comment 32 Randy Perryman 2016-11-11 20:35:51 UTC
I agree that bug is similar in that a GARP is not occurring when the VIP moves. Is the same logic used when a floating IP is moved?

Comment 33 Assaf Muller 2016-11-11 20:49:26 UTC
(In reply to Randy Perryman from comment #32)
> I agree that bug is similar in that a GARP is not occurring when the VIP
> moves, is the same logic used for when a Floating IP is moved?

Floating IPs are implemented as VIPs in keepalived when you're using HA routers. Here's some more info:
https://assafmuller.com/2014/11/08/openstack-paris-network-node-high-availability-video/
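
A hedged sketch of how to see this on a controller, assuming the default L3 HA state path under /var/lib/neutron; floating IPs show up in the router's keepalived.conf as excluded virtual IPs:

grep -A 10 virtual_ipaddress_excluded /var/lib/neutron/ha_confs/<router-uuid>/keepalived.conf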

Comment 34 Jakub Libosvar 2016-11-21 17:23:00 UTC
I backported the fix to OSP 9 and to upstream Mitaka.

Comment 35 Jakub Libosvar 2016-11-23 15:12:56 UTC
*** Bug 1397926 has been marked as a duplicate of this bug. ***

Comment 36 David Paterson 2016-11-23 15:31:36 UTC
When can we expect to see this fix in zstream package?

Comment 37 Sean Merrow 2016-11-23 15:39:37 UTC
(In reply to David Paterson from comment #36)
> When can we expect to see this fix in zstream package?

The clone I just created got duplicated to this BZ. We are currently waiting for the following upstream change to merge in stable/mitaka; we already have a Red Hat gerrit review for the backport into OSP 9. Until it merges upstream we can't give an ETA. That said, it has been touched over the last two days, so I would expect it soon.

https://review.openstack.org/#/c/400348/

Comment 38 Sean Merrow 2016-11-28 19:33:47 UTC
The Mitaka upstream patch [0] has landed. Working on the downstream patch for the OSP 9 backport.

Also, the same issue in OSP 10 has been addressed by patches that landed upstream [1] and downstream [2].

[0] https://review.openstack.org/#/c/400348/
[1] https://review.openstack.org/#/c/393886/17
[2] https://bugzilla.redhat.com/show_bug.cgi?id=1386718

Comment 40 Eran Kuris 2016-12-26 07:52:21 UTC
Verified and fixed.
On an OSPD 9 virt env:
[root@controller-0 ~]# rpm -qa | grep openstack-neutron-8.
openstack-neutron-8.1.2-14.el7ost.noarch

Ran the Tempest scenario mentioned in the bug and also verified it manually.

Comment 43 errata-xmlrpc 2017-02-01 14:24:57 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://rhn.redhat.com/errata/RHBA-2017-0232.html

