Bug 1001725 - Have to ping first to be able to ssh.
Status: CLOSED NOTABUG
Product: Red Hat OpenStack
Classification: Red Hat
Component: openstack-nova
Version: 3.0
Hardware: x86_64 Linux
Priority: urgent
Severity: urgent
Target Milestone: async
Target Release: 3.0
Assigned To: Brent Eagles
QA Contact: Ami Jeain
Keywords: TestBlocker, ZStream
Depends On:
Blocks:
Reported: 2013-08-27 11:14 EDT by Jaroslav Henner
Modified: 2016-04-26 15:53 EDT (History)
14 users

See Also:
Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Story Points: ---
Clone Of:
Clones: 1038213
Environment:
Last Closed: 2013-12-16 14:27:42 EST
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---


Attachments: None
Description Jaroslav Henner 2013-08-27 11:14:10 EDT
Description of problem:
After an instance is booted and assigned a floating IP, ssh to the floating IP doesn't connect until the IP address is pinged. This happens at least on nova FlatDHCP and on neutron/quantum with VLANs and Open vSwitch, and with many images: official RHEL, Cirros, Fedora, my own RHEL.

Version-Release number of selected component (if applicable):
openstack-nova-api-2013.1.3-3.el6ost.noarch

How reproducible:
Unknown. It seems the IP remains sshable for some time after the instance is deleted and created again.

Steps to Reproduce:
1. have multinode deployment
2. nova boot
3. nova add-floating-ip
4. ssh FLOATING_IP
nothing
...
5. ping FLOATING_IP
6. ssh FLOATING_IP
7. enjoy
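The ping-before-ssh workaround in the steps above can be scripted so automation doesn't depend on a manual ping. A minimal sketch (the helper names, the `cloud-user` login, and the timeouts are placeholders, not from this report):

```python
import subprocess

def reach_instance(floating_ip, user="cloud-user"):
    """Build the command pair for the workaround: ping the floating IP
    first so the upstream network learns its MAC, then ssh in."""
    # A few pings are enough to warm up the path to the floating IP.
    ping = ["ping", "-c", "3", "-W", "2", floating_ip]
    ssh = ["ssh", "-o", "ConnectTimeout=10", f"{user}@{floating_ip}"]
    return ping, ssh

def run(floating_ip):
    ping, ssh = reach_instance(floating_ip)
    subprocess.run(ping, check=False)  # ignore ping failures; it is only a warm-up
    subprocess.run(ssh, check=True)
```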

Actual results:
ssh works only after the floating IP is pinged first

Expected results:
ssh works out of the box

Additional info:
It is a mystery. It doesn't look like an ARP issue: neither the pre-ping nor the post-ping `ip neigh` output, on either the host or the controller, contains the FLOATING_IP.
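For anyone re-checking this, the `ip neigh` comparison can be automated; an illustrative helper (not from the report) that scans `ip neigh show` output for the floating IP:

```python
def neigh_has_ip(ip_neigh_output, ip):
    """Return True if `ip` has an entry in `ip neigh show` output."""
    for line in ip_neigh_output.splitlines():
        fields = line.split()
        # `ip neigh` prints the IP address as the first field of each entry.
        if fields and fields[0] == ip:
            return True
    return False

# Made-up sample output for demonstration.
sample = """\
192.168.1.1 dev eth0 lladdr 52:54:00:12:34:56 REACHABLE
192.168.1.7 dev eth0  FAILED"""
```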
Comment 2 Attila Darazs 2013-10-14 08:20:33 EDT
This blocks our testing without the ping workaround, setting TestBlocker flag.
Comment 3 Brent Eagles 2013-10-14 08:38:37 EDT
I realize this sounds like grasping at straws, but before I investigate further can you verify whether NetworkManager is enabled/running on any of the hosts?
Comment 4 Attila Darazs 2013-10-14 10:45:54 EDT
(In reply to Brent Eagles from comment #3)
> I realize this sounds like grasping at straws, but before I investigate
> further can you verify whether NetworkManager is enabled/running on any of
> the hosts?

Brent, NetworkManager is not even installed.

A.
Comment 5 Rami Vaknin 2013-11-06 05:42:05 EST
I don't have a concrete reproduction right now, but we (me and yfried) encountered this issue a few times in the past (neutron+ovs). yfried even had (and maybe still has) a solid reproducer for it; we couldn't find the cause of this bug even with careful debugging and a lot of sniffing.

Yesterday yfried encountered this issue again, so (i) maybe he can provide an environment for reproduction, and (ii) I'm not sure it's a misconfiguration issue.

BTW, I do think it has something to do with ARP in combination with namespaces. The last time I debugged this, I saw that ssh from outside, or via "ip netns exec ...", doesn't trigger an ARP request, while ping does trigger it, and then the ssh passes successfully.
Comment 6 Ofer Blaut 2013-11-06 08:36:15 EST
Hi

I have reproduced this issue on Grizzly.

1. create a tenant with a network + subnet
2. update the security group to allow incoming SSH + ICMP
3. attach the subnet to a router
4. launch a VM
5. assign a floating IP to that VM
6. try to SSH from outside to the floating IP; this will not work.
I didn't see any incoming SSH traffic, and nothing in
 ovs-dpctl dump-flows system@ovs-system
related to the floating IP. iptables -t nat seems OK.
7. only after a ping from outside to the floating IP does OVS see the flows and SSH start working. This was not the first VM on that network.
Comment 7 Graeme Gillies 2013-11-06 17:14:26 EST
This may be unrelated, but I thought I'd point out something else I noticed, which may offer some insight into what might potentially be going wrong here.

What I have discovered is that the quantum/neutron setting send_arp_for_ha is working in our environment, and the ARP requests are being sent out on the right interface. However, what I noticed is that the MAC address it announces is the MAC address of the instance inside the OpenStack internal network; it's not the MAC address of the OpenStack network node, nor any interface on it.

I'm no networking expert, but this doesn't seem right, does it? Shouldn't the ARP announcement be letting the upstream switch/router know that the floating IP address is on the external interface of the OpenStack network node (so the switch knows to send traffic for that IP to it)? Not the MAC address of the instance in the private OpenStack network, which isn't publicly accessible?

So if that initial send_arp_for_ha announcement is broken or giving false information, maybe that's why a ping is needed to get the upstream switch to figure out the correct MAC address for the floating IP?

Regards,

Graeme
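To make the hypothesis above concrete: a gratuitous ARP announcement for a floating IP should carry the MAC of the network node's external (br-ex) port as the sender hardware address; if it carries the instance's internal MAC instead, the upstream switch learns the wrong address. A byte-level sketch of a correct announcement (the MAC and IP below are made-up examples, not from this environment):

```python
import socket
import struct

def gratuitous_arp(sender_mac: bytes, floating_ip: str) -> bytes:
    """Build an Ethernet frame carrying a gratuitous ARP request:
    sender IP == target IP == the floating IP, so every listener
    updates its ARP cache and forwarding tables with sender_mac."""
    broadcast = b"\xff" * 6
    eth = broadcast + sender_mac + struct.pack("!H", 0x0806)  # EtherType = ARP
    ip = socket.inet_aton(floating_ip)
    # htype=Ethernet, ptype=IPv4, hlen=6, plen=4, opcode=1 (request)
    arp = struct.pack("!HHBBH", 1, 0x0800, 6, 4, 1)
    arp += sender_mac + ip           # sender hardware / protocol address
    arp += b"\x00" * 6 + ip          # target hw unknown; target IP = same floating IP
    return eth + arp

# The reported symptom: the announced sender MAC was the instance's
# internal MAC, so the upstream switch learned the wrong address.
external_mac = bytes.fromhex("aabbccddeeff")  # made-up MAC of the node's br-ex port
frame = gratuitous_arp(external_mac, "10.35.160.10")
```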
Comment 8 Jaroslav Henner 2013-11-07 05:37:08 EST
(In reply to Graeme Gillies from comment #7)
> This may be unrelated, but I thought I'd point out something else which I
> noticed, which may also offer some insight into what might be potentially
> going wrong here
> 
> What I have discovered is that the quantum/neutron setting send_arp_for_ha
> is working in our environment, and the arp requests are being sent out on
> the right interface. However what I noticed is that the mac address it
> announces is the mac address of the instance inside the openstack internal
> network, it's not the mac address of the openstack network node, nor any
> interface on it.
> 
> I'm not networking expert, but this doesn't seem right does it? Shouldn't
> the arp announcement be letting the upstream switch/router know that the
> floating ip address is on the external interface of the openstack network
> node (so the switch knows to send traffic for that IP to it)? Not the mac
> address of the instance in the private openstack network, which isn't
> publicly accessible?

I think you have just nailed it!

> 
> So if that initial send_arp_for_ha is broken or giving false information,
> maybe that's why a ping is needed to get the upstream switch to figure out
> the correct mac address for the floating ip?
> 
> Regards,
> 
> Graeme
Comment 10 Brent Eagles 2013-11-18 10:27:07 EST
Hmm, yeah. That's a bit odd. It looks like the code does the gratuitous ARP based on the gateway interface (e.g. br-ex), so something doesn't quite fit there. What mechanism did you use to witness the ARPs with the MAC addr for the internal interfaces, and where did the ARPs appear?

How long an interval passed between adding the floating IP and the initial SSH/ping attempt? Is the floating IP pool being used a subset of the lab network (i.e. the network that the OpenStack nodes are running on)?

How is the external gateway connected to the network?

Is the workstation being used to ping/ssh running on a wifi network? This might seem an arbitrary question, but I've noticed a difference in "willingness to find" in my own network when using wifi vs. copper.
Comment 11 Graeme Gillies 2013-11-18 18:24:42 EST
(In reply to Brent Eagles from comment #10)
> Hmm.. yeah. That's a bit odd. It looks like the code does the gratuitous arp
> thing based on the gateway interface (e.g. br-ex) so something doesn't quite
> fit there. What mechanism did you use to witness the ARPs with the MAC addr
> for the internal interfaces and where did the ARPs appear?

I was using tcpdump on the external physical interface of the OpenStack network node and saw the ARPs go out just as I assigned the floating IP. I see the correct number of ARP packets come out, just with the wrong MAC address. See http://pastebin.test.redhat.com/173188 for an example.

> 
> How long of an interval passed between adding the floating IP and the
> initial SSH/ping attempt? Is the floating IP pool being used is a subset of
> the lab network (ie. the network that the openstack nodes is running on)?

I know that after you assign a floating IP there is an interval you have to wait before quantum picks it up and does its thing (it polls periodically for new work to do). But even if I assign a floating IP, wait a good couple of minutes, and then try, it still doesn't work until I ping it.

The floating IP pool in my case is a completely different network range and VLAN from the OpenStack internal network.

> 
> How is the external gateway connected to the network?

I can only speak for my environment, but I have a bonded network interface connected to br-ex on the openstack network node, and another bonded network interface connected on br-int on the network node.

> 
> Is the workstation being used to ping/ssh running on a wifi network? This
> might seem an arbitrary question, but I've noticed a difference in
> "willingness to find" in my own network when using wifi vs. copper.

On copper.

I'll contact you with some detailed architecture information about my environment experiencing the problem.
Comment 12 Oded Ramraz 2013-12-04 03:13:16 EST
I encounter this issue quite often when using the Jenkins JClouds plugin; sometimes the plugin fails to SSH to the slave if I do not ping it manually. Raising the bug severity. This issue must be handled soon, since it will affect the automation test results in OpenStack environments.
Comment 13 Attila Darazs 2013-12-04 04:10:10 EST
I also want to state that the issue happened on an up-to-date RHOS 4.0 beta instance too. I would recommend fixing this at least before the 4.0 GA, because it's a huge problem for a lot of people.
Comment 14 Jaroslav Henner 2013-12-04 04:38:50 EST
(In reply to Oded Ramraz from comment #12)
> I encounter this issue quite often when using Jenkins JClouds plugin ,
> sometimes the plugin fails to SSH to the slave if I not ping it manually.
> Raising the bug severity . This issue must handled soon since it will affect
> the automation test results in Openstack environment .

I think it doesn't reproduce when the instance itself starts communicating with the internet, like a yum install from RHN. So when reproducing, we should make sure it doesn't.

Oded: Maybe we should use the behaviour above to work around the JClouds problem, for example by installing Java from the internet using cloud-config. We need Java there anyway.
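That workaround might look something like the following cloud-config fragment (illustrative only; the exact package name depends on the image's repositories):

```yaml
#cloud-config
# Installing a package over the network forces the instance to send
# outbound traffic, which (per the comment above) seems to make the
# floating IP reachable without a manual ping.
packages:
  - java-1.7.0-openjdk
```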
Comment 16 Brent Eagles 2013-12-16 14:27:42 EST
It was determined that this was an issue at the switch.
