Bug 1038213

Summary: Have to ping first to be able to ssh.
Product: Red Hat OpenStack Reporter: lpeer <lpeer>
Component: openstack-neutron Assignee: Brent Eagles <beagles>
Status: CLOSED NOTABUG QA Contact: Ofer Blaut <oblaut>
Severity: high Docs Contact:
Priority: high    
Version: 4.0 CC: adarazs, beagles, chrisw, dkranz, ggillies, hateya, jhenner, ndipanov, oblaut, oramraz, rvaknin, sclewis, tgraf, yeylon, yfried
Target Milestone: async Keywords: ZStream
Target Release: 4.0   
Hardware: x86_64   
OS: Linux   
Whiteboard:
Fixed In Version: Doc Type: Bug Fix
Doc Text:
Story Points: ---
Clone Of: 1001725 Environment:
Last Closed: 2013-12-13 19:38:52 UTC Type: Bug
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: --- RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:

Comment 1 Brent Eagles 2013-12-09 18:47:04 UTC
After examining the QE environment, I realized that there was a key difference between my test setup and the environment where this was occurring: the provider network being used for the external gateway was a VLAN leg. It seemed plausible that there might be some issue with that, so I started a fresh test environment with two network interfaces, one for lab access and one for trunking VLAN networks. I created two OpenStack nodes: an "everything host" and a standalone compute host. The openvswitch plugin was configured to use the second interface (eth1) for the gateway network and the private networks, and each neutron network was configured with a segment ID to incur VLAN mapping. Besides that, packstack was configured pretty much "normally".

I also created a third host with two interfaces, one of which shared the trunked network with the OpenStack nodes, and on it I configured a fake bridge with a VLAN tag matching that of the public network (10). I booted a server instance, verified its DHCP allocation and allocated a floating IP address (I used the same commands as indicated in this bz). From the third host (call it my "experiment" or "forensic" host) I ssh'd into the VM through its public IP address without pinging first... and it worked.
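
For anyone reproducing the experiment host, a fake bridge like the one described above could be created roughly as follows. This is only a sketch, not the exact commands used in this test: the bridge names br-eth1 and br-vlan10 are hypothetical, and the only value taken from the setup above is VLAN tag 10. It relies on the `ovs-vsctl add-br BRIDGE PARENT VLAN` form, which creates a fake bridge on top of an existing parent bridge. The helper only builds the command line and does not run it.

```python
import shlex


def fake_bridge_command(parent_bridge, vlan_id, name=None):
    """Build the ovs-vsctl argv that creates a fake bridge for one VLAN.

    parent_bridge is an existing Open vSwitch bridge (hypothetical name
    "br-eth1" in the example below); vlan_id is the 802.1Q VLAN ID.
    """
    if not 1 <= vlan_id <= 4094:
        raise ValueError("VLAN ID must be between 1 and 4094")
    # Default name is an illustrative convention, not something OVS requires.
    name = name or f"br-vlan{vlan_id}"
    return ["ovs-vsctl", "--may-exist", "add-br", name, parent_bridge, str(vlan_id)]


if __name__ == "__main__":
    # Print the command for VLAN 10 so it can be reviewed before running it.
    print(shlex.join(fake_bridge_command("br-eth1", 10)))
```

Building the argv as a list keeps the command easy to review and to pass to `subprocess.run` once the names have been adjusted for the actual host.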

Seeing as I've tried this a few different ways, I'm left wondering whether something about the switch configuration in the QE environment is interfering. When I was logged in, I tried pinging the gateway IPs from the router namespaces and it didn't work, while pinging from a different subnet seemed to. That seemed a little suspicious in that the reverse route may not be working properly... but then I might be making assumptions about how that is actually supposed to work. The fact that this behavior is also reported against nova-networking is suspect too. nova-networking and neutron are pretty different when it comes to external network access, so if it is failing with both, it is pretty peculiar.

In environments where traffic (including ARPs) is showing up somewhere other than where it is supposed to, I would take a look at the interface adapters, etc. I initially misconfigured one of my nodes to use the rtl8139 driver (why is that the default for the second interface? *shrug*) for the VLAN trunk interface, and of course that strips the VLAN tags off of everything, and chaos ensued.
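
One way to check which driver a trunk interface is using is to read the `driver:` line from `ethtool -i`. The sketch below does that. The set of driver names treated as suspect is an assumption (the names a Linux guest commonly binds to an emulated RTL8139 NIC) and should be adjusted for the environment being examined; eth1 is just the interface name from the setup above.

```python
import subprocess

# Assumption: kernel driver names typically seen for an emulated RTL8139 NIC.
SUSPECT_DRIVERS = {"8139cp", "8139too"}


def driver_from_ethtool(text):
    """Return the value of the 'driver:' line in `ethtool -i` output, or None."""
    for line in text.splitlines():
        key, sep, value = line.partition(":")
        if sep and key.strip() == "driver":
            return value.strip()
    return None


def is_suspect_driver(text):
    """True if the reported driver is in the (assumed) suspect set."""
    return driver_from_ethtool(text) in SUSPECT_DRIVERS


def interface_driver(iface="eth1"):
    """Run `ethtool -i IFACE` and return the driver name (needs ethtool installed)."""
    result = subprocess.run(
        ["ethtool", "-i", iface], capture_output=True, text=True, check=True
    )
    return driver_from_ethtool(result.stdout)
```

Keeping the parsing separate from the `ethtool` call means the check can also be run against output collected from another host.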

Can we do a detailed and directed analysis of how the QE environment does the IP to VLAN mapping and eliminate it as a potential cause?

Comment 2 Brent Eagles 2013-12-10 18:43:50 UTC
Thanks to oblaut and adarazs on this issue!

After examining an environment that oblaut and adarazs made available, I discovered that ARP requests weren't being made until a ping, an HTTP request, or some other traffic was sent.
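
When looking at captured frames for this kind of problem, it helps to be able to tell quickly whether a frame is an ARP request, which VLAN it was tagged with, and which address it is asking for. The sketch below decodes a raw Ethernet frame along those lines. It assumes IPv4-over-Ethernet ARP and at most one 802.1Q tag; how the raw frames are obtained (tcpdump, a packet socket, etc.) is left open, and the function name is just illustrative.

```python
import socket
import struct

ETH_P_8021Q = 0x8100  # 802.1Q VLAN tag
ETH_P_ARP = 0x0806    # ARP


def describe_arp(frame):
    """Return (vlan_id, operation, sender_ip, target_ip) for an ARP frame.

    vlan_id is None for untagged frames. Returns None if the frame is not
    an IPv4-over-Ethernet ARP packet or is too short to decode.
    """
    if len(frame) < 14:
        return None
    (ethertype,) = struct.unpack("!H", frame[12:14])
    offset = 14
    vlan_id = None
    if ethertype == ETH_P_8021Q:
        if len(frame) < 18:
            return None
        # The low 12 bits of the tag control field are the VLAN ID.
        tci, ethertype = struct.unpack("!HH", frame[14:18])
        vlan_id = tci & 0x0FFF
        offset = 18
    if ethertype != ETH_P_ARP or len(frame) < offset + 28:
        return None
    htype, ptype, hlen, plen, oper = struct.unpack("!HHBBH", frame[offset:offset + 8])
    if (htype, ptype, hlen, plen) != (1, 0x0800, 6, 4):
        return None
    # ARP body: sender MAC (6), sender IP (4), target MAC (6), target IP (4).
    sender_ip = socket.inet_ntoa(frame[offset + 14:offset + 18])
    target_ip = socket.inet_ntoa(frame[offset + 24:offset + 28])
    operation = {1: "request", 2: "reply"}.get(oper, str(oper))
    return vlan_id, operation, sender_ip, target_ip
```

Running this over frames captured on the trunk interface would show whether ARP requests for the floating IP appear at all before the first ping, and whether they carry the expected VLAN tag.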

Apparently there is some kind of known issue on some types of Juniper switches. It is referenced on a few pages I found:

http://forums.whirlpool.net.au/archive/2049819

http://showroute.net/juniper-ex-switch-arp-issues-with-re-filters/

http://www.juniper.net/techpubs/en_US/junos11.1/information-products/topic-collections/release-notes/11.1/index.html?topic-53333.html
 - look for 486443

Comment 3 Brent Eagles 2013-12-12 18:01:41 UTC
I should have cleared the NEEDINFO when adding that last comment.

Comment 4 Brent Eagles 2013-12-13 19:38:52 UTC
Issue caused by network switch behavior.