Bug 2106903

Summary: DHCP fails to renew lease after changes to DNSMasq: dhcp4 (bond0.284): selecting lease failed: -131
Product: OpenShift Container Platform Reporter: Gabriel Diotte <gdiotte>
Component: NetworkingAssignee: Ben Nemec <bnemec>
Networking sub component: ovn-kubernetes QA Contact: Anurag saxena <anusaxen>
Status: CLOSED DUPLICATE Docs Contact:
Severity: high    
Priority: high CC: akaris, arajapa, augol, bnemec, dacarpen, ealcaniz, openshift-bugs-escalate, pibanezr, vpickard
Version: 4.7   
Target Milestone: ---   
Target Release: ---   
Hardware: Unspecified   
OS: Unspecified   
Whiteboard:
Fixed In Version: Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of: Environment:
Last Closed: 2022-07-20 13:36:57 UTC Type: Bug
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: --- RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Embargoed:
Attachments:
Description Flags
networkmanager trace from the affected host from Jul 13 none

Comment 3 Darren Carpenter 2022-07-18 16:26:05 UTC
My apologies y'all, I've been putting notes in the wrong case I believe.

Here is details/attachment with the networkmanager trace from the 13th;

----

This is an openshift env. with storage IPs being delivered via dhcp. (Ideally these would be static)
When the lease expired for storage, it was unable to negotiate a new IP.

--

From what I can tell on worker-18, it receives 'something' from the server, but ultimately cannot connect;

dmesg/worker-18

11674:[9115102.767156] NFS: Server wrote zero bytes, expected 4096.
11678:[9115257.189423] nfs: server 69.100.109.84 not responding, still trying
11679:[9115257.192668] nfs: server 69.100.109.84 not responding, still trying
11680:[9115264.357663] nfs: server 69.100.109.84 not responding, timed out
11681:[9115264.365017] nfs: server 69.100.109.85 not responding, timed out
11683:[9115282.789355] nfs: server 69.100.109.85 not responding, still trying
11684:[9115282.789364] nfs: server 69.100.109.85 not responding, still trying

Journalctl seems to echo the above, but the piece we locked in on was the 'not an infinite DHCP4 lease, ignoring';

journalctl/worker-18

3822730:Jul 13 16:18:53 worker-18 nm-dispatcher[1882468]: NM resolv-prepender: Checking for nameservers in /var/run/NetworkManager/resolv.conf
3822733:Jul 13 16:18:53 worker-18 NetworkManager[4760]: <warn>  [1657729133.4581] dhcp4 (ens2f0): request timed out
3822734:Jul 13 16:18:53 worker-18 NetworkManager[4760]: <info>  [1657729133.4582] dhcp4 (ens2f0): state changed unknown -> timeout
3822735:Jul 13 16:18:53 worker-18 NetworkManager[4760]: <info>  [1657729133.4582] device (ens2f0): state change: ip-config -> failed (reason 'ip-config-unavailable', sys-iface-state: 'managed')
3822736:Jul 13 16:18:53 worker-18 NetworkManager[4760]: <warn>  [1657729133.4597] device (ens2f0): Activation: failed for connection 'Wired connection 8'
3822737:Jul 13 16:18:53 worker-18 NetworkManager[4760]: <info>  [1657729133.4601] device (ens2f0): state change: failed -> disconnected (reason 'none', sys-iface-state: 'managed')
3822738:Jul 13 16:18:53 worker-18 NetworkManager[4760]: <info>  [1657729133.5015] dhcp4 (ens2f0): canceled DHCP transaction
3822739:Jul 13 16:18:53 worker-18 NetworkManager[4760]: <info>  [1657729133.5015] dhcp4 (ens2f0): state changed timeout -> done
3822740:Jul 13 16:18:53 worker-18 NetworkManager[4760]: <info>  [1657729133.5076] policy: auto-activating connection 'Wired connection 8' (bb09809e-1756-3375-9845-1c2cd5600b56)
3822741:Jul 13 16:18:53 worker-18 NetworkManager[4760]: <info>  [1657729133.5088] device (ens2f0): Activation: starting connection 'Wired connection 8' (bb09809e-1756-3375-9845-1c2cd5600b56)
3822742:Jul 13 16:18:53 worker-18 NetworkManager[4760]: <info>  [1657729133.5089] device (ens2f0): state change: disconnected -> prepare (reason 'none', sys-iface-state: 'managed')
3822743:Jul 13 16:18:53 worker-18 NetworkManager[4760]: <info>  [1657729133.5094] device (ens2f0): state change: prepare -> config (reason 'none', sys-iface-state: 'managed')
3822744:Jul 13 16:18:53 worker-18 NetworkManager[4760]: <info>  [1657729133.5100] device (ens2f0): state change: config -> ip-config (reason 'none', sys-iface-state: 'managed')
3822745:Jul 13 16:18:53 worker-18 NetworkManager[4760]: <info>  [1657729133.5104] dhcp4 (ens2f0): activation: beginning transaction (timeout in 45 seconds)
3822819:Jul 13 16:18:54 worker-18 nm-dispatcher[1882468]: NM resolv-prepender: Prepending 'nameserver 69.100.108.159' to /etc/resolv.conf (other nameservers from /var/run/NetworkManager/resolv.conf)
3822824:Jul 13 16:18:54 worker-18 nm-dispatcher[1882468]: + echo 'Not an infinite DHCP4 lease. Ignoring.'
3822825:Jul 13 16:18:54 worker-18 nm-dispatcher[1882468]: Not an infinite DHCP4 lease. Ignoring.
3822828:Jul 13 16:18:54 worker-18 nm-dispatcher[1882468]: + echo 'Not a DHCP6 address. Ignoring.'
3822829:Jul 13 16:18:54 worker-18 nm-dispatcher[1882468]: Not a DHCP6 address. Ignoring.
3822831:Jul 13 16:18:54 worker-18 root[1882843]: NM mdns-hostname triggered by dhcp4-change.
3822832:Jul 13 16:18:54 worker-18 nm-dispatcher[1882468]: <13>Jul 13 16:18:54 root: NM mdns-hostname triggered by dhcp4-change.
3822969:Jul 13 16:19:02 worker-18 kernel: nfs: server 69.100.109.84 not responding, timed out
3822970:Jul 13 16:19:02 worker-18 kernel: nfs: server 69.100.109.85 not responding, timed out
3822972:Jul 13 16:19:04 worker-18 systemd[1]: NetworkManager-dispatcher.service: Succeeded.
3822973:Jul 13 16:19:04 worker-18 systemd[1]: NetworkManager-dispatcher.service: Consumed 470ms CPU time
3823029:Jul 13 16:19:10 worker-18 kernel: nfs: server 69.100.109.84 not responding, still trying
3823035:Jul 13 16:19:13 worker-18 kernel: nfs: server 69.100.109.85 not responding, timed out
3823161:Jul 13 16:19:18 worker-18 kernel: nfs: server 69.100.109.84 not responding, timed out
3823190:Jul 13 16:19:23 worker-18 kernel: nfs: server 69.100.109.85 not responding, timed out
3823214:Jul 13 16:19:28 worker-18 kernel: nfs: server 69.100.109.84 not responding, still trying
3823215:Jul 13 16:19:28 worker-18 kernel: nfs: server 69.100.109.84 not responding, timed out
3823242:Jul 13 16:19:33 worker-18 kernel: nfs: server 69.100.109.85 not responding, timed out


Lastly, we see;

<warn>  [1657729182.9473] dhcp4 (bond0.284): selecting lease failed: -131

Which led me to [1], which appears to show that NetworkManagers' code at this step leads to;

return -ENOTRECOVERABLE;

----------

If you need any further details/have questions, please let me know.

[1] gitlab.freedesktop.org/NetworkManager/NetworkManager/-/issues/993

Comment 4 Darren Carpenter 2022-07-18 16:28:21 UTC
Created attachment 1897959 [details]
networkmanager trace from the affected host from Jul 13