Description of problem: If STP is enabled on a linux bridge it can take a long time after adding a port before traffic forwarding is enabled. Specifically, it can take longer than the timeout of the DHCP ipam module. In my environment it *always* takes longer than the DHCP ipam module's timeout, which means that DHCP never succeeds. The dhcp pod logs: 2020/03/27 16:03:37 d0dc081a44eb32e1c37ef4a98123907af38e1d4a515b3f8eb3f7caa1b7b285cf/osp-internalapi-dhcp/net1: acquiring lease 2020/03/27 16:03:42 no DHCP packet received within 5s It logs no DHCP packet twice more (from memory: I have it working now and all the pods restarted so I don't have the logs any more), then fails. At this point it tears down the interface and creates a new one before trying again, so it fails again. It seems to take about 30 seconds before the linux bridge starts forwarding traffic for a new port in my environment. I disabled STP on the linux bridge, and now DHCP works. My CNI definition is: additionalNetworks: - name: osp-internalapi-dhcp namespace: default rawCNIConfig: '{ "cniVersion": "0.3.1", "type": "bridge", "bridge": "br-ospinfra", "vlan": 100, "ipam": { "type": "dhcp" } }' type: Raw The bridge is defined on all nodes as: STP=yes BRIDGING_OPTS=priority=32768 TYPE=Bridge PROXY_METHOD=none BROWSER_ONLY=no DEFROUTE=yes IPV4_FAILURE_FATAL=no IPV6INIT=no IPV6_DEFROUTE=yes IPV6_FAILURE_FATAL=no IPV6_ADDR_GEN_MODE=stable-privacy NAME=bridge-br-ospinfra UUID=b7050e5a-bdaa-4388-ab03-0c77455873c9 DEVICE=br-ospinfra ONBOOT=yes IPV6_DISABLED=yes I am running a DHCP server in a pod attached to the same bridge. Changing STP=yes to STP=no on the bridge causes DHCP to succeed. I suggest that it would be safer to allow DHCP to take considerably longer by default. If this is undesirable, it should at least be a configuration option in the dhcp ipam module. Version-Release number of selected component (if applicable): 4.5.0-0.nightly-2020-03-17-091701
Just an FYI that the timeout for the DHCP client appears to be here upstream @ https://github.com/containernetworking/plugins/blob/master/plugins/ipam/dhcp/lease.go#L365
Filed upstream issue to ask about extending or parameterizing the DHCP CNI plugin @ https://github.com/containernetworking/plugins/issues/470
@Doug can you update the status here please.
Any details on the needinfo? We're planning to get this in the queue, however, slim resources due to constraints with customer zero.
This is still on our radar. Gist is that we'll need someone to mock up a lab and spend some time going through this setup and then work with the community to get the changes upstream.
@mbooth This bug is in ON_QA now, could help to verify the fixing in your environment which STP is enable in the bridge and DHCP server is configured in the pod? Thanks!
Tested and verified in 4.6.0-0.nightly-2020-07-15-091743. After enable STP in bridge which all pods attached to, see retrying log in dhcp-daemon pod: 2020/07/15 16:11:40 71cb671401f8c54ddfce56dffd6c16cce4f32d44894c9c4d1fbf3a91ba327184/bridge-dhcp/net1: acquiring lease 2020/07/15 16:11:41 2b3f78a716c5a59ce3ed7547bf2929bf9fd53d047d4129bb0f4154a903855309/bridge-dhcp/net1: acquiring lease 2020/07/15 16:11:43 6d3428e0915a49fe587338a93db5037f15cad50ae4a0913c918e964f72a91a35/bridge-dhcp/net1: acquiring lease 2020/07/15 16:11:44 8a294e11f3342e147e72014ad6bd248c52691620fc679ea4608c26ed4e87fc4b/bridge-dhcp/net1: acquiring lease 2020/07/15 16:11:45 b8624b6dc116dafaccb020e855b62287c591c32a44b01f7b4bc3b087c575b666/bridge-dhcp/net1: acquiring lease 2020/07/15 16:11:46 71cb671401f8c54ddfce56dffd6c16cce4f32d44894c9c4d1fbf3a91ba327184/bridge-dhcp/net1: lease acquired, expiration is 2020-07-16 16:11:46.236626622 +0000 UTC m=+90067.196291375 2020/07/15 16:11:48 no DHCP packet received within 5s 2020/07/15 16:11:48 retrying in 3.184235 seconds 2020/07/15 16:11:49 no DHCP packet received within 5s 2020/07/15 16:11:49 retrying in 3.106989 seconds 2020/07/15 16:11:50 no DHCP packet received within 5s 2020/07/15 16:11:50 retrying in 4.429392 seconds 2020/07/15 16:11:51 no DHCP packet received within 5s 2020/07/15 16:11:51 retrying in 3.501525 seconds 2020/07/15 16:11:55 6d3428e0915a49fe587338a93db5037f15cad50ae4a0913c918e964f72a91a35/bridge-dhcp/net1: lease acquired, expiration is 2020-07-16 16:11:55.250397257 +0000 UTC m=+90076.210061999 2020/07/15 16:11:55 2b3f78a716c5a59ce3ed7547bf2929bf9fd53d047d4129bb0f4154a903855309/bridge-dhcp/net1: lease acquired, expiration is 2020-07-16 16:11:55.249510849 +0000 UTC m=+90076.209175598 2020/07/15 16:11:55 8a294e11f3342e147e72014ad6bd248c52691620fc679ea4608c26ed4e87fc4b/bridge-dhcp/net1: lease acquired, expiration is 2020-07-16 16:11:55.251912208 +0000 UTC m=+90076.211576975 2020/07/15 16:11:55 b8624b6dc116dafaccb020e855b62287c591c32a44b01f7b4bc3b087c575b666/bridge-dhcp/net1: lease acquired, expiration is 2020-07-16 16:11:55.252952702 +0000 UTC m=+90076.212617455
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (OpenShift Container Platform 4.6 GA Images), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHBA-2020:4196