Description of problem:

Version-Release number of selected component (if applicable):
OCP4.9.11 nmstate operator

How reproducible:
Most every time after a reboot

Steps to Reproduce:
1. Reboot Node
2. Observe primary DHCP interface (nncp managed) online, secondary/third interfaces (nncp managed -- static) flapping/failure to configure
3. nncp status --> available, nnce --> READY, interfaces down in nmcli.
4. toggle status in nncp for node to 'absent' then back to 'up': --> interfaces configure immediately, move to UP, nmcli reports connections online.

Actual results:
nncp is failing to consistently bring all connections up.

Expected results:
nncp should bring all connections up consistently.

Additional info:
- theories:
  - could be race condition between NNCP and NetworkManager both trying to provision the interfaces?

observed in logs:
~~~
Dec 27 20:38:04 NetworkManager[1436]: <info> [1640637484.8484] device (ens224): state change: ip-config -> failed (reason 'ip-config-unavailable', sys-iface-state: 'managed')
Dec 27 20:38:04 NetworkManager[1436]: <warn> [1640637484.8492] device (ens224): Activation: failed for connection 'Wired connection 2'
Dec 27 20:38:04 NetworkManager[1436]: <info> [1640637484.8492] device (ens256): state change: ip-config -> failed (reason 'ip-config-unavailable', sys-iface-state: 'managed')
Dec 27 20:38:04 NetworkManager[1436]: <warn> [1640637484.8498] device (ens256): Activation: failed for connection 'Wired connection 3'
Dec 27 20:38:04 NetworkManager[1436]: <info> [1640637484.8508] device (ens224): state change: failed -> disconnected (reason 'none', sys-iface-state: 'managed')
Dec 27 20:38:04 NetworkManager[1436]: <info> [1640637484.8611] dhcp4 (ens224): canceled DHCP transaction
Dec 27 20:38:04 NetworkManager[1436]: <info> [1640637484.8611] dhcp4 (ens224): state changed timeout -> done
Dec 27 20:38:04 NetworkManager[1436]: <info> [1640637484.8618] device (ens256): state change: failed -> disconnected (reason 'none', sys-iface-state: 'managed')
Dec 27 20:38:04 NetworkManager[1436]: <info> [1640637484.8731] dhcp4 (ens256): canceled DHCP transaction
Dec 27 20:38:04 NetworkManager[1436]: <info> [1640637484.8731] dhcp4 (ens256): state changed timeout -> done
~~~

DHCP status for these interfaces on nodes that successfully deployed after reboot list as 'dhcp: false'. On the node that is failing to deploy, we see 'dhcp: true' and this flapping message listed above repeated in logs.

Expanded description in next comment + additional logs in linked case
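To compare nodes quickly, the flapping pattern in the NetworkManager log could be counted per device with a small script. This is only a rough sketch: the function name `summarize_flapping` and the result layout are made up for illustration, and the regular expressions are modeled solely on the log lines quoted in this report, so other NetworkManager versions may word these messages differently.

```python
import re
from collections import defaultdict

# Patterns modeled on the NetworkManager lines quoted in this report.
STATE_FAILED = re.compile(
    r"device \((?P<dev>[^)]+)\): state change: ip-config -> failed "
    r"\(reason 'ip-config-unavailable'"
)
ACTIVATION_FAILED = re.compile(
    r"device \((?P<dev>[^)]+)\): Activation: failed for connection '(?P<conn>[^']+)'"
)
DHCP_TIMEOUT = re.compile(
    r"dhcp4 \((?P<dev>[^)]+)\): state changed timeout -> done"
)


def summarize_flapping(lines):
    """Count ip-config failures and DHCP timeouts per device (hypothetical helper)."""
    summary = defaultdict(
        lambda: {"ip_config_failed": 0, "dhcp_timeouts": 0, "connections": set()}
    )
    for line in lines:
        if m := STATE_FAILED.search(line):
            summary[m["dev"]]["ip_config_failed"] += 1
        elif m := ACTIVATION_FAILED.search(line):
            # Record which connection profile NetworkManager tried to activate.
            summary[m["dev"]]["connections"].add(m["conn"])
        elif m := DHCP_TIMEOUT.search(line):
            summary[m["dev"]]["dhcp_timeouts"] += 1
    return dict(summary)


# A few lines copied from the log excerpt in this report.
SAMPLE = """\
Dec 27 20:38:04 NetworkManager[1436]: <info> [1640637484.8484] device (ens224): state change: ip-config -> failed (reason 'ip-config-unavailable', sys-iface-state: 'managed')
Dec 27 20:38:04 NetworkManager[1436]: <warn> [1640637484.8492] device (ens224): Activation: failed for connection 'Wired connection 2'
Dec 27 20:38:04 NetworkManager[1436]: <info> [1640637484.8611] dhcp4 (ens224): state changed timeout -> done
Dec 27 20:38:04 NetworkManager[1436]: <info> [1640637484.8731] dhcp4 (ens256): state changed timeout -> done
"""

if __name__ == "__main__":
    for dev, info in sorted(summarize_flapping(SAMPLE.splitlines()).items()):
        print(dev, info)
```

Run against the saved NetworkManager journal from each node, a non-zero count on the static interfaces would mark the nodes where a DHCP-based profile is being activated instead of the static configuration from the nncp.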
Based on the fact that this is happening after reboot, I would say it is almost certainly https://bugzilla.redhat.com/show_bug.cgi?id=1970021 . There is a workaround in https://bugzilla.redhat.com/show_bug.cgi?id=1970021#c7 that provides a simple way to verify it is the same problem. Can you give that a try?

Based on the number of problems this behavior has caused, we're discussing a backport of the 4.10 fix, which should also fix this. The machine-config workaround is safe to use too, though, and provides an immediate solution.
Thanks Ben,

I'll have our customer give this a go. I've passed on the instructions and will report back with our results, but I have a good feeling about it!

Cheers,
~Will