Bug 1372384

Summary: qbr saves mac address on wrong port
Product: Red Hat OpenStack
Component: openstack-neutron
Version: 8.0 (Liberty)
Target Release: 8.0 (Liberty)
Status: CLOSED NOTABUG
Severity: high
Priority: high
Keywords: ZStream
Hardware: Unspecified
OS: Unspecified
Reporter: Ondrej <ochalups>
Assignee: Jakub Libosvar <jlibosva>
QA Contact: Toni Freger <tfreger>
CC: amuller, chrisw, jlibosva, mfuruta, nyechiel, ochalups, rcernin, rmanes, srevivo
Type: Bug
Last Closed: 2016-11-21 09:50:22 UTC

Description Ondrej 2016-09-01 14:35:59 UTC
Description of problem:
An instance launched with an interface on a flat provider (external) network
never receives its DHCP offer: the offer arrives at the qbr bridge on the qvb
port but never reaches the tap device, because in qbr the interface's MAC
address is learned on the wrong port. Communication on the tenant (internal)
network works fine. The behavior is seen with both flat and isolated networks,
with the controller and compute switch ports tagged under one VLAN.

# brctl showmacs qbr132fa062-6d
port no	mac addr		is local?	ageing timer
  1	 3e:a7:c0:86:39:70	yes		   0.00
  1	 3e:a7:c0:86:39:70	yes		   0.00
  1	 fa:16:3e:e1:85:7e	no		   0.70
  2	 fe:16:3e:e1:85:7e	yes		   0.00
  2	 fe:16:3e:e1:85:7e	yes		   0.00

As a result, the packets are not forwarded to the tap device and are looped back to qvb.
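
A quick way to confirm where the offer stops (a sketch only; the qvb/tap names below are derived from the example bridge qbr132fa062-6d and will differ for other ports) is to capture DHCP traffic on both sides of the bridge while the instance sends its discover:

# tcpdump -i qvb132fa062-6d -nn -e 'udp port 67 or udp port 68'
# tcpdump -i tap132fa062-6d -nn -e 'udp port 67 or udp port 68'

In the failing case the offer shows up in the first capture but never in the second.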

If the IP is set manually, communication works but is very unstable, as the MAC address switches randomly between the ports:

# brctl showmacs qbr185db8c7-46 | grep "16:3e:93:3b:61"
  1	 fa:16:3e:93:3b:61	no		   3.29
  2 	fe:16:3e:93:3b:61	yes		   0.00
  2	 fe:16:3e:93:3b:61	yes		   0.00
# brctl showmacs qbr185db8c7-46 | grep "16:3e:93:3b:61"
  2 	fa:16:3e:93:3b:61	no		   0.32
  2 	fe:16:3e:93:3b:61	yes		   0.00
  2 	fe:16:3e:93:3b:61	yes		   0.00
# brctl showmacs qbr185db8c7-46 | grep "16:3e:93:3b:61"
  2	fa:16:3e:93:3b:61	no		   0.61
  2 	fe:16:3e:93:3b:61	yes		   0.00
  2 	fe:16:3e:93:3b:61	yes		   0.00

64 bytes from 172.19.243.117: icmp_seq=53 ttl=64 time=0.085 ms
64 bytes from 172.19.243.117: icmp_seq=54 ttl=64 time=0.086 ms
64 bytes from 172.19.243.117: icmp_seq=55 ttl=64 time=0.082 ms <---
64 bytes from 172.19.243.117: icmp_seq=68 ttl=64 time=0.190 ms <---
64 bytes from 172.19.243.117: icmp_seq=69 ttl=64 time=0.066 ms
64 bytes from 172.19.243.117: icmp_seq=70 ttl=64 time=0.095 ms
64 bytes from 172.19.243.117: icmp_seq=71 ttl=64 time=0.064 ms

Workaround:
Set the bridge to behave like a "hub" instead of a "switch" by setting the
ageing time to 0, which disables the MAC learning table so that all packets
are flooded to every port of the bridge:

# brctl setageing qbr132fa062-6d 0

Check:
# brctl showstp qbr132fa062-6d | grep "ageing time"
 ageing time		   0.00
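
If more instances on the node are affected, the same workaround can be applied to every qbr bridge at once. A minimal sketch, assuming the standard qbrXXXXXXXX-XX names created by the OVS hybrid plug:

# for br in $(brctl show | awk 'NR>1 && $1 ~ /^qbr/ {print $1}'); do brctl setageing "$br" 0; done

Note that the setting is not persistent: if the bridge is deleted and recreated (e.g. when the instance is rebuilt), the default ageing time returns, so this is only a temporary mitigation.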

Version-Release number of selected component (if applicable):
openstack-neutron-7.0.4-3.el7ost.noarch
openstack-neutron-bigswitch-agent-2015.3.8-1.el7ost.noarch
openstack-neutron-bigswitch-lldp-2015.3.8-1.el7ost.noarch
openstack-neutron-common-7.0.4-3.el7ost.noarch
openstack-neutron-lbaas-7.0.0-2.el7ost.noarch
openstack-neutron-metering-agent-7.0.4-3.el7ost.noarch
openstack-neutron-ml2-7.0.4-3.el7ost.noarch
openstack-neutron-openvswitch-7.0.4-3.el7ost.noarch
python-neutron-7.0.4-3.el7ost.noarch
python-neutron-lbaas-7.0.0-2.el7ost.noarch
python-neutronclient-3.1.0-1.el7ost.noarch

How reproducible:
always

Steps to Reproduce:
1. Launch an instance with an interface on the provider network.
2. The instance fails to obtain a DHCP offer or to communicate.

Actual results:
No or only partial, unstable communication.

Expected results:
The interface obtains an IP address and communicates normally.

Additional info:

Comment 5 Assaf Muller 2016-09-08 02:08:54 UTC
Please provide an SOS report from a node experiencing this issue. If you could verify it is the most up-to-date version of sosreport, that would be helpful. Can you describe how this environment was installed, and please confirm that we're talking about the standard ML2+OVS?

Comment 6 Ondrej 2016-09-09 10:07:22 UTC
We used the git://github.com/sosreport/sos.git sosreport. They are available
in collab-shell under /cases/01684953. The environment was installed using 
director, 1 controller and 1 compute. Yes, ML2+OVS used.

Comment 7 Assaf Muller 2016-09-09 14:18:17 UTC
(In reply to Ondrej from comment #6)
> We used the git://github.com/sosreport/sos.git sosreport. They are available
> in collab-shell under /cases/01684953. The environment was installed using 
> director, 1 controller and 1 compute. Yes, ML2+OVS used.

Can you show the output of 'neutron net-show <network-id>' for the problematic provider network?

Comment 8 Assaf Muller 2016-09-09 14:20:05 UTC
(In reply to Ondrej from comment #6)
> We used the git://github.com/sosreport/sos.git sosreport. They are available
> in collab-shell under /cases/01684953. The environment was installed using 
> director, 1 controller and 1 compute. Yes, ML2+OVS used.

Also looking at the SOS report on the compute node it looks like there are no VMs running? We'd need an SOS report taken *while the problem manifests*.

Comment 9 Assaf Muller 2016-09-09 14:29:19 UTC
(In reply to Assaf Muller from comment #8)
> (In reply to Ondrej from comment #6)
> > We used the git://github.com/sosreport/sos.git sosreport. They are available
> > in collab-shell under /cases/01684953. The environment was installed using 
> > director, 1 controller and 1 compute. Yes, ML2+OVS used.
> 
> Also looking at the SOS report on the compute node it looks like there are
> no VMs running? We'd need an SOS report taken *while the problem manifests*.

I found sosreport-20160818-134310 which has 2 tap devices on the compute node. We still need to know which Neutron network is problematic and the output of its 'neutron net-show'.

Comment 10 Ondrej 2016-09-09 14:56:40 UTC
Hi, providing the info:
neutron net-show ext-net
+---------------------------+--------------------------------------+
| Field                     | Value                                |
+---------------------------+--------------------------------------+
| admin_state_up            | True                                 |
| id                        | 0113ec13-832d-48ea-9cc6-f28b43252d69 |
| mtu                       | 0                                    |
| name                      | ext-net                              |
| port_security_enabled     | True                                 |
| provider:network_type     | flat                                 |
| provider:physical_network | datacentre                           |
| provider:segmentation_id  |                                      |
| qos_policy_id             |                                      |
| router:external           | True                                 |
| shared                    | False                                |
| status                    | ACTIVE                               |
| subnets                   | 974f0fcd-71a1-4959-90db-58b5f2b03abe |
| tenant_id                 | fedcdd99268f4cd8b218cec7949e8b0e     |
+---------------------------+--------------------------------------+

Comment 11 Assaf Muller 2016-09-09 14:59:00 UTC
(In reply to Ondrej from comment #6)
> We used the git://github.com/sosreport/sos.git sosreport. They are available
> in collab-shell under /cases/01684953. The environment was installed using 
> director, 1 controller and 1 compute. Yes, ML2+OVS used.

Can you also show 'neutron port-list' and 'neutron port-show' for every relevant VM and DHCP port?
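
For example, something like the following (the port ID is a placeholder; the relevant ports can be picked out of the list by their fa:16:3e MAC prefix or by device_owner):

# neutron port-list
# neutron port-show <port-id>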

Comment 12 Assaf Muller 2016-09-09 15:49:03 UTC
(In reply to Ondrej from comment #6)
> We used the git://github.com/sosreport/sos.git sosreport. They are available
> in collab-shell under /cases/01684953. The environment was installed using 
> director, 1 controller and 1 compute. Yes, ML2+OVS used.

Can you also attach an unfiltered tcpdump on the problematic tap device?
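
For example, something like the following (the tap name is illustrative and should be the device belonging to the affected port; -w writes a pcap file that can be attached to the case):

# tcpdump -i tap132fa062-6d -w /tmp/tap132fa062-6d.pcap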