Bug 1818786 - DHCP IPAM module times out too quickly with no config option
Summary: DHCP IPAM module times out too quickly with no config option
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Networking
Version: 4.6
Hardware: Unspecified
OS: Unspecified
unspecified
medium
Target Milestone: ---
: 4.6.0
Assignee: Aneesh Puttur
QA Contact: Weibin Liang
URL:
Whiteboard: SDN-CI-IMPACT,SDN-STALE
Depends On:
Blocks:
TreeView+ depends on / blocked
 
Reported: 2020-03-30 11:33 UTC by Matthew Booth
Modified: 2020-10-27 15:56 UTC (History)
6 users (show)

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2020-10-27 15:56:40 UTC
Target Upstream Version:


Attachments (Terms of Use)


Links
System ID Private Priority Status Summary Last Updated
Github containernetworking plugins issues 470 0 None closed DHCP Timeout option 2021-01-24 13:31:30 UTC
Github openshift containernetworking-plugins pull 28 0 None closed Bug 1818786: Fix handling of delay in acquiring lease with stp turned on 2021-01-24 13:32:13 UTC
Red Hat Product Errata RHBA-2020:4196 0 None None None 2020-10-27 15:56:56 UTC

Description Matthew Booth 2020-03-30 11:33:52 UTC
Description of problem:
If STP is enabled on a linux bridge it can take a long time after adding a port before traffic forwarding is enabled. Specifically, it can take longer than the timeout of the DHCP ipam module. In my environment it *always* takes longer than the DHCP ipam module's timeout, which means that DHCP never succeeds.

The dhcp pod logs:

2020/03/27 16:03:37 d0dc081a44eb32e1c37ef4a98123907af38e1d4a515b3f8eb3f7caa1b7b285cf/osp-internalapi-dhcp/net1: acquiring lease
2020/03/27 16:03:42 no DHCP packet received within 5s

It logs no DHCP packet twice more (from memory: I have it working now and all the pods restarted so I don't have the logs any more), then fails. At this point it tears down the interface and creates a new one before trying again, so it fails again.

It seems to take about 30 seconds before the linux bridge starts forwarding traffic for a new port in my environment.

I disabled STP on the linux bridge, and now DHCP works.

My CNI definition is:

  additionalNetworks:
  - name: osp-internalapi-dhcp
    namespace: default
    rawCNIConfig: '{ "cniVersion": "0.3.1", "type": "bridge", "bridge": "br-ospinfra",
      "vlan": 100, "ipam": { "type": "dhcp" } }'
    type: Raw

The bridge is defined on all nodes as:

STP=yes
BRIDGING_OPTS=priority=32768
TYPE=Bridge
PROXY_METHOD=none
BROWSER_ONLY=no
DEFROUTE=yes
IPV4_FAILURE_FATAL=no
IPV6INIT=no
IPV6_DEFROUTE=yes
IPV6_FAILURE_FATAL=no
IPV6_ADDR_GEN_MODE=stable-privacy
NAME=bridge-br-ospinfra
UUID=b7050e5a-bdaa-4388-ab03-0c77455873c9
DEVICE=br-ospinfra
ONBOOT=yes
IPV6_DISABLED=yes

I am running a DHCP server in a pod attached to the same bridge.

Changing STP=yes to STP=no on the bridge causes DHCP to succeed.

I suggest that it would be safer to allow DHCP to take considerably longer by default. If this is undesirable, it should at least be a configuration option in the dhcp ipam module.

Version-Release number of selected component (if applicable):
4.5.0-0.nightly-2020-03-17-091701

Comment 1 Douglas Smith 2020-03-30 14:11:03 UTC
Just an FYI that the timeout for the DHCP client appears to be here upstream @ https://github.com/containernetworking/plugins/blob/master/plugins/ipam/dhcp/lease.go#L365

Comment 2 Douglas Smith 2020-04-01 20:30:05 UTC
Filed upstream issue to ask about extending or parameterizing the DHCP CNI plugin @ https://github.com/containernetworking/plugins/issues/470

Comment 4 Ben Bennett 2020-05-08 23:56:22 UTC
@Doug can you update the status here please.

Comment 5 Douglas Smith 2020-05-19 18:37:15 UTC
Any details on the needinfo? We're planning to get this in the queue, however, slim resources due to constraints with customer zero.

Comment 8 Douglas Smith 2020-05-29 14:56:08 UTC
This is still on our radar. Gist is that we'll need someone to mock up a lab and spend some time going through this setup and then work with the community to get the changes upstream.

Comment 15 Weibin Liang 2020-07-10 14:32:28 UTC
@mbooth 

This bug is in ON_QA now, could help to verify the fixing in your environment which STP is enable in the bridge and DHCP server is configured in the pod? Thanks!

Comment 17 Weibin Liang 2020-07-15 21:25:22 UTC
Tested and verified in 4.6.0-0.nightly-2020-07-15-091743.

After enable STP in bridge which all pods attached to, see retrying log in dhcp-daemon pod:

2020/07/15 16:11:40 71cb671401f8c54ddfce56dffd6c16cce4f32d44894c9c4d1fbf3a91ba327184/bridge-dhcp/net1: acquiring lease
2020/07/15 16:11:41 2b3f78a716c5a59ce3ed7547bf2929bf9fd53d047d4129bb0f4154a903855309/bridge-dhcp/net1: acquiring lease
2020/07/15 16:11:43 6d3428e0915a49fe587338a93db5037f15cad50ae4a0913c918e964f72a91a35/bridge-dhcp/net1: acquiring lease
2020/07/15 16:11:44 8a294e11f3342e147e72014ad6bd248c52691620fc679ea4608c26ed4e87fc4b/bridge-dhcp/net1: acquiring lease
2020/07/15 16:11:45 b8624b6dc116dafaccb020e855b62287c591c32a44b01f7b4bc3b087c575b666/bridge-dhcp/net1: acquiring lease
2020/07/15 16:11:46 71cb671401f8c54ddfce56dffd6c16cce4f32d44894c9c4d1fbf3a91ba327184/bridge-dhcp/net1: lease acquired, expiration is 2020-07-16 16:11:46.236626622 +0000 UTC m=+90067.196291375
2020/07/15 16:11:48 no DHCP packet received within 5s
2020/07/15 16:11:48 retrying in 3.184235 seconds
2020/07/15 16:11:49 no DHCP packet received within 5s
2020/07/15 16:11:49 retrying in 3.106989 seconds
2020/07/15 16:11:50 no DHCP packet received within 5s
2020/07/15 16:11:50 retrying in 4.429392 seconds
2020/07/15 16:11:51 no DHCP packet received within 5s
2020/07/15 16:11:51 retrying in 3.501525 seconds
2020/07/15 16:11:55 6d3428e0915a49fe587338a93db5037f15cad50ae4a0913c918e964f72a91a35/bridge-dhcp/net1: lease acquired, expiration is 2020-07-16 16:11:55.250397257 +0000 UTC m=+90076.210061999
2020/07/15 16:11:55 2b3f78a716c5a59ce3ed7547bf2929bf9fd53d047d4129bb0f4154a903855309/bridge-dhcp/net1: lease acquired, expiration is 2020-07-16 16:11:55.249510849 +0000 UTC m=+90076.209175598
2020/07/15 16:11:55 8a294e11f3342e147e72014ad6bd248c52691620fc679ea4608c26ed4e87fc4b/bridge-dhcp/net1: lease acquired, expiration is 2020-07-16 16:11:55.251912208 +0000 UTC m=+90076.211576975
2020/07/15 16:11:55 b8624b6dc116dafaccb020e855b62287c591c32a44b01f7b4bc3b087c575b666/bridge-dhcp/net1: lease acquired, expiration is 2020-07-16 16:11:55.252952702 +0000 UTC m=+90076.212617455

Comment 19 errata-xmlrpc 2020-10-27 15:56:40 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (OpenShift Container Platform 4.6 GA Images), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2020:4196


Note You need to log in before you can comment on or make changes to this bug.