Bug 1252222 - New instances cannot get dhcp requests
Status: CLOSED NOTABUG
Product: Red Hat OpenStack
Classification: Red Hat
Component: openstack-neutron
Version: 6.0 (Juno)
Hardware: Unspecified
OS: Unspecified
Priority: high
Severity: high
Target Milestone: ---
Target Release: 6.0 (Juno)
Assigned To: lpeer
QA Contact: Ofer Blaut
Keywords: Unconfirmed, ZStream
Depends On:
Blocks:
Reported: 2015-08-10 22:32 EDT by Bryan Yount
Modified: 2016-04-26 18:53 EDT
CC: 8 users

See Also:
Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Story Points: ---
Clone Of:
Environment:
Last Closed: 2015-10-10 10:57:47 EDT
Type: Bug
Regression: ---


Attachments: None
Description Bryan Yount 2015-08-10 22:32:04 EDT
Description of problem:
This is a new OSP6 environment deployed using the OSP Foreman installer. Due to network security policies for this environment, the customer requires the use of VLANs. After the Foreman deployment, we used this kbase for additional configuration steps for VLAN provider networks: https://access.redhat.com/solutions/648863


Version-Release number of selected component (if applicable):
  openstack-neutron-2014.2.3-2
  openstack-nova-compute-2014.2.3-9


Hardware:
  HP ProLiant BL460c Gen8
  Emulex OneConnect 10Gb NIC
  DHCP server is a Cisco switch at 172.20.68.254


Troubleshooting steps:
1. Each of the VLAN interfaces is connected to OVS, but OVS is doing its own tagging (per "neutron net-list"), which is a problem because you can't connect two OVS bridges to a single interface.
2. So we deleted the VLAN networks and recreated them as flat networks.
3. tcpdump the tap device, restart the VM, and check whether we see the DHCP offer and DHCP ack:
  we saw a reply, but it never reached the VM
4. tcpdump eno2 on the compute and the controller, restart the VM:
  the controller saw a reply, and the compute saw a reply
5. tcpdump the tap device of the VM from the hypervisor's point of view:
  the DHCP reply is being lost here!
6. Next we'll look at the flow dumps and the output of "ovs-ofctl show br-int" on the compute.
7. Then tcpdump on eno2.3402 and simultaneously on the qvo device, and restart the VM (example capture commands below):
  the qvo device did not get a reply, but eno2.3402 did
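For reference, the captures above amount to commands along these lines (a sketch; the qvo/tap names are placeholders for the instance's actual devices, and the "port 67 or port 68" filter limits the capture to DHCP traffic):

# tcpdump -ne -i eno2.3402 port 67 or port 68
# tcpdump -ne -i qvoXXXXXX port 67 or port 68
# tcpdump -ne -i tapXXXXXX port 67 or port 68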


DHCP Request Flow:
controller eno2 -> compute eno2 -> eno2.3402 -> br-eno2.3402 -> br-int -> qvoXXXXX
                                   ^ DHCP reply getting lost after this point


Possible related issues?
  A. https://bugzilla.redhat.com/show_bug.cgi?id=1214891 - VM instances do not get DHCP addr on boot 
  B. https://access.redhat.com/solutions/774743 - Virtual Machines stop communicating over the Linux bridge when using Emulex Network cards - But we checked and SR-IOV is disabled on these NICs.
Comment 5 Bryan Yount 2015-08-11 02:06:19 EDT
Additional follow-up:

We looked at kbase 774743 about SR-IOV and Emulex NICs, but the customer checked and SR-IOV is disabled. Just to be 100% sure, I asked the customer to confirm that SR-IOV is disabled by setting PXE=None in the HP VirtualConnect for the NIC device used by the instances. Note: Foreman was used to provision the environment, so PXE was used on the Foreman/Management network.

Additionally, the workaround described in the above kbase says to use promiscuous mode, so we also tried putting compute5's NICs (eno2 and eno2.3402) into promiscuous mode, and that did not appear to change the behavior.
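For reference, enabling promiscuous mode on those NICs comes down to something like the following (a sketch of what was tried, not an exact transcript):

# ip link set eno2 promisc on
# ip link set eno2.3402 promisc on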

Per the comments in Bug 1214891, I asked the customer to try rebooting one of the compute nodes first thing in the morning to see if that resolves the issue.

If the issue still persists, I will ask the customer to collect a tcpdump from eno2.3402 and br-eno2.3402 after creating a new instance.
Comment 6 Sadique Puthen 2015-08-11 02:30:09 EDT
> If the issue still persists, I will ask the customer to collect a tcpdump from eno2.3402 and br-eno2.3402 after creating a new instance.

Please get a tcpdump from eno2 as well. I also recommend that you statically assign the IP address shown in Horizon to the instance and keep a ping to the gateway running inside the instance while you collect the tcpdump.
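A rough sketch of that procedure, using eth0 as the assumed instance interface and placeholders for the Horizon-assigned address and the gateway (none of these values are from the actual environment):

Inside the instance:
# ip addr add <horizon_assigned_ip>/<prefix> dev eth0
# ping <gateway_ip>

On the compute node, in parallel:
# tcpdump -ne -i eno2 -w eno2.pcap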

You can easily see the loop in the tcpdump.
Comment 11 Bryan Yount 2015-08-11 15:31:34 EDT
Customer reports that SR-IOV was confirmed "Disabled" in the HP OneView interface (OneView manages all of the iLOs and VirtualConnects). Previously, this setting was "Configuration default, disabled". Even after confirming SR-IOV was disabled, we're still seeing double ARPs in the tcpdumps uploaded above.

I have asked the customer to engage HP to assist. Is there anything else in OSP or RHEL that could be causing this?
Comment 12 Flavio Leitner 2015-08-11 18:05:34 EDT
Please attach a sosreport from the compute node.
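For reference, that usually just means running the sosreport tool as root on the compute node and attaching the resulting tarball:

# sosreport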

Also, please provide the output of the following commands after reproducing the issue, where <bridge> is either "br-eno2.3402" or "br-int". We need the outputs from both bridges for the same event.

# ovs-ofctl show <bridge>
# ovs-ofctl dump-flows <bridge>
# ovs-appctl fdb/show <bridge>

Run this one before and after reproducing the issue:
# ovs-ofctl dump-ports <bridge>

This one works for all bridges at once, but the flows expire quickly (a few seconds).
You can run it in a loop every second, capturing the output while reproducing the issue (sketch below):
# ovs-dpctl dump-flows
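For example, a simple loop along these lines (a sketch; the log file name is arbitrary):

# while true; do date; ovs-dpctl dump-flows; sleep 1; done >> dpctl-flows.log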

If you changed MAC addresses, then please also provide new tcpdumps on eno2.3402 and the VM tap device.

Thanks
Comment 13 Flavio Leitner 2015-08-11 19:28:57 EDT
BTW, regarding the ofproto/trace: the dl_dst is actually the MAC address fa:16:3e:6a:aa:97, not a broadcast address. That might change the end result depending on the state of the flows and the fdb.
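For context, a trace in that direction would look something along these lines (a sketch only; in_port=2 is int-br-en2ba3f3 on br-int, and the source MAC used here is the DHCP server's, not values taken from the actual trace):

# ovs-appctl ofproto/trace br-int in_port=2,dl_src=fa:16:3e:5e:e5:4c,dl_dst=fa:16:3e:6a:aa:97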
Comment 17 Bryan Yount 2015-08-12 02:54:52 EDT
Customer provided the output of the requested commands. However, the "ovs-dpctl dump-flows" output was not captured in a loop. I have asked the customer to grab it again in the morning in case we need it to figure out the issue.

HP has also been engaged and is standing by. I have a case number to engage them if needed tomorrow.
Comment 18 Flavio Leitner 2015-08-12 09:42:26 EDT
Hi,

Looking at eno2.3402.1.pcap, we can identify the DHCP server and client MAC addresses:

  DHCP Client: FA:16:3E:6A:AA:97
  DHCP Server: FA:16:3E:5E:E5:4C

Now, looking at the ofctl outputs, we can find the OpenFlow port numbers (which are per bridge) for each bridge's ports:

              /--- eno2.3402 (port #1) --- DHCP Server
br-eno2.3402 +
              \--- phy-br-en2ba3f3 (port #3) ---\

        /--------- int-br-en2ba3f3 (port #2) ---/
br-int +
        \--------- qvoXXXXXX (port #3) --- VM


OK, now looking for the DHCP client MAC in the forwarding db (fdb) for br-int:
 port  VLAN  MAC                Age
>>  2     1  fa:16:3e:6a:aa:97    2
    2     1  64:51:06:1d:9b:19    2

It's on port #2, which is int-br-en2ba3f3 (wrong port).
Doing the same for the other bridge:
 port  VLAN  MAC                Age
>>  1     0  fa:16:3e:6a:aa:97    2
    1     0  fa:16:3e:5e:e5:4c    2
    1     0  64:51:06:1d:9b:19    2

Port #1 is eno2.3402 (wrong port).

Both are in the wrong direction.  The VM's MAC address should be behind qvoXXXX (port #3) on br-int and behind phy-br-en2ba3f3 (port #3) on br-eno2.3402.

The current flows rely on the NORMAL action, which uses MAC learning to decide which port to send traffic out of (all fine).  The current fdb says that the VM's MAC address is behind eno2.3402 (wrong direction), and that's why you don't see the replies going to the VM.  OVS has a protection against sending a packet back out its ingress port, so the reply is dropped.

This means a packet with the VM's MAC address as source is arriving from the DHCP_Server-to-VM direction, which forces OVS to learn the wrong direction.  Usually that happens when the switch/NIC is reflecting the packets back.  We can see the dups in eno2.3402.1.pcap:


  1  15.543507           :: -> ff02::16     ICMPv6 90 Multicast Listener Report Message v2
  2  15.543534           :: -> ff02::16     ICMPv6 90 Multicast Listener Report Message v2
  3  15.979465           :: -> ff02::1:ff6a:aa97 ICMPv6 78 Neighbor Solicitation for fe80::f816:3eff:fe6a:aa97
  4  15.979493           :: -> ff02::1:ff6a:aa97 ICMPv6 78 Neighbor Solicitation for fe80::f816:3eff:fe6a:aa97
  5  16.060584      0.0.0.0 -> 255.255.255.255 DHCP 342 DHCP Discover - Transaction ID 0xe1bb4f41
  6  16.060624      0.0.0.0 -> 255.255.255.255 DHCP 342 DHCP Discover - Transaction ID 0xe1bb4f41
  7  16.323202           :: -> ff02::16     ICMPv6 90 Multicast Listener Report Message v2
  8  16.323277           :: -> ff02::16     ICMPv6 90 Multicast Listener Report Message v2
  9  20.357123      0.0.0.0 -> 255.255.255.255 DHCP 342 DHCP Discover - Transaction ID 0xe1bb4f41
 10  20.357153      0.0.0.0 -> 255.255.255.255 DHCP 342 DHCP Discover - Transaction ID 0xe1bb4f41
 11  31.610422      0.0.0.0 -> 255.255.255.255 DHCP 342 DHCP Discover - Transaction ID 0xe1bb4f41
 12  31.610527      0.0.0.0 -> 255.255.255.255 DHCP 342 DHCP Discover - Transaction ID 0xe1bb4f41
 13  44.836263      0.0.0.0 -> 255.255.255.255 DHCP 342 DHCP Discover - Transaction ID 0xe1bb4f41
 14  44.836308      0.0.0.0 -> 255.255.255.255 DHCP 342 DHCP Discover - Transaction ID 0xe1bb4f41
 15  60.540521      0.0.0.0 -> 255.255.255.255 DHCP 342 DHCP Discover - Transaction ID 0xe1bb4f41
 16  60.540560      0.0.0.0 -> 255.255.255.255 DHCP 342 DHCP Discover - Transaction ID 0xe1bb4f41
 17  60.810935           :: -> ff02::16     ICMPv6 90 Multicast Listener Report Message v2
 18  60.810965           :: -> ff02::16     ICMPv6 90 Multicast Listener Report Message v2
 19  60.815856      0.0.0.0 -> 255.255.255.255 DHCP 342 DHCP Discover - Transaction ID 0x22599811
 20  60.815998      0.0.0.0 -> 255.255.255.255 DHCP 342 DHCP Discover - Transaction ID 0x22599811
 21  61.036887           :: -> ff02::1:ff6a:aa97 ICMPv6 78 Neighbor Solicitation for fe80::f816:3eff:fe6a:aa97
 22  61.036915           :: -> ff02::1:ff6a:aa97 ICMPv6 78 Neighbor Solicitation for fe80::f816:3eff:fe6a:aa97
 23  61.626769           :: -> ff02::16     ICMPv6 90 Multicast Listener Report Message v2
 24  61.626823           :: -> ff02::16     ICMPv6 90 Multicast Listener Report Message v2
 25  64.281895      0.0.0.0 -> 255.255.255.255 DHCP 342 DHCP Discover - Transaction ID 0x22599811
 26  64.281934      0.0.0.0 -> 255.255.255.255 DHCP 342 DHCP Discover - Transaction ID 0x22599811
 27  66.523917           :: -> ff02::16     ICMPv6 90 Multicast Listener Report Message v2
 28  66.523946           :: -> ff02::16     ICMPv6 90 Multicast Listener Report Message v2
 29  66.536998      0.0.0.0 -> 255.255.255.255 DHCP 342 DHCP Discover - Transaction ID 0x43cfe4f
 30  66.537036      0.0.0.0 -> 255.255.255.255 DHCP 342 DHCP Discover - Transaction ID 0x43cfe4f

You can see above that the time delta between the two dups is really small, indicating an echo rather than a retransmission.  You also have retransmissions, but they are all doubled/echoed too.  Every echo forces OVS to learn the wrong direction.

Maybe there are other ways to configure this echo behavior besides SR-IOV.  It could also be the switch port being configured with a similar feature (VEPA/hairpin mode).  We need to check with HP.

One thing that could be done is to drop packets coming from outside with the VM's MAC address as source, to keep the fdb clean at the earliest stage possible.  If that works, we can be sure that packets are being reflected somehow.  Something like this:

# ovs-ofctl add-flow br-eno2.3402 \
     priority=100,in_port=1,dl_src=fa:16:3e:6a:aa:97,actions=drop

The above is a troubleshooting hack just to double-check, nothing more.  I don't know if OSP will let you change the flows manually, though.
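If you do add that flow, something like the following can confirm whether it is matching (n_packets increasing) and remove it afterwards (again, just a sketch):

# ovs-ofctl dump-flows br-eno2.3402 | grep fa:16:3e:6a:aa:97
# ovs-ofctl del-flows br-eno2.3402 "in_port=1,dl_src=fa:16:3e:6a:aa:97"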
Comment 19 Flavio Leitner 2015-08-12 09:48:18 EDT
If the NIC or switch isn't reflecting the packets back to the host, it could instead be a network loop.  Maybe one could trace the switch's port to see what is going on there.  If you see dups there, then the host is fine and the problem is in the network.
Comment 25 Assaf Muller 2015-10-10 10:57:47 EDT
Hardware issue.
