Bug 2105088

Summary: NetworkManager didn't automatically renew the lease on the VLAN interface configured using nmstate
Product: Red Hat Enterprise Linux 8 Reporter: Mridul Markandey <mmarkand>
Component: NetworkManagerAssignee: NetworkManager Development Team <nm-team>
Status: CLOSED ERRATA QA Contact: Vladimir Benes <vbenes>
Severity: high Docs Contact:
Priority: high    
Version: 8.4CC: ajuarez, bgalvani, bnemec, bzvonar, dacarpen, ealcaniz, gdiotte, lars, lrintel, nm-team, pemensik, rkhan, sfaye, sukulkar, till, vbenes, xzhou, yjoseph
Target Milestone: rcKeywords: FastFix, Triaged, ZStream
Target Release: ---Flags: pm-rhel: mirror+
Hardware: Unspecified   
OS: Unspecified   
Whiteboard:
Fixed In Version: NetworkManager-1.39.10-1.el8 Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of:
: 2109411 2109412 2110000 (view as bug list) Environment:
Last Closed: 2022-11-08 10:10:38 UTC Type: Bug
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: --- RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Embargoed:
Bug Depends On:    
Bug Blocks: 2109411, 2109412, 2110000, 2113992, 2113993, 2113994, 2113997, 2113998, 2113999    
Attachments:
Description Flags
traces from journalctl from Jul 13 (original ticket open date) none

Comment 9 Darren Carpenter 2022-07-15 22:59:24 UTC
Adding <trace>'s from journalctl/NetworkManager from the original date of the ticket, Jul 13.
Customer preferred to not reboot the node to enable the machine-config, if more details are needed ~ please let us know.

Comment 10 Darren Carpenter 2022-07-15 23:01:27 UTC
Created attachment 1897581 [details]
traces from journalctl from Jul 13 (original ticket open date)

Comment 12 Darren Carpenter 2022-07-16 11:39:24 UTC
Apologies, never gave a problem statement.

This is an openshift env. with storage IPs being delivered via dhcp. (Ideally these would be static)
When the lease expired for storage, it was unable to negotiate a new IP.

--

From what I can tell on worker-18, it receives 'something' from the server, but ultimately cannot connect;

dmesg/worker-18

11674:[9115102.767156] NFS: Server wrote zero bytes, expected 4096.
11678:[9115257.189423] nfs: server 69.100.109.84 not responding, still trying
11679:[9115257.192668] nfs: server 69.100.109.84 not responding, still trying
11680:[9115264.357663] nfs: server 69.100.109.84 not responding, timed out
11681:[9115264.365017] nfs: server 69.100.109.85 not responding, timed out
11683:[9115282.789355] nfs: server 69.100.109.85 not responding, still trying
11684:[9115282.789364] nfs: server 69.100.109.85 not responding, still trying

Journalctl seems to echo the above, but the piece we locked in on was the 'not an infinite DHCP4 lease, ignoring';

journalctl/worker-18

3822730:Jul 13 16:18:53 worker-18 nm-dispatcher[1882468]: NM resolv-prepender: Checking for nameservers in /var/run/NetworkManager/resolv.conf
3822733:Jul 13 16:18:53 worker-18 NetworkManager[4760]: <warn>  [1657729133.4581] dhcp4 (ens2f0): request timed out
3822734:Jul 13 16:18:53 worker-18 NetworkManager[4760]: <info>  [1657729133.4582] dhcp4 (ens2f0): state changed unknown -> timeout
3822735:Jul 13 16:18:53 worker-18 NetworkManager[4760]: <info>  [1657729133.4582] device (ens2f0): state change: ip-config -> failed (reason 'ip-config-unavailable', sys-iface-state: 'managed')
3822736:Jul 13 16:18:53 worker-18 NetworkManager[4760]: <warn>  [1657729133.4597] device (ens2f0): Activation: failed for connection 'Wired connection 8'
3822737:Jul 13 16:18:53 worker-18 NetworkManager[4760]: <info>  [1657729133.4601] device (ens2f0): state change: failed -> disconnected (reason 'none', sys-iface-state: 'managed')
3822738:Jul 13 16:18:53 worker-18 NetworkManager[4760]: <info>  [1657729133.5015] dhcp4 (ens2f0): canceled DHCP transaction
3822739:Jul 13 16:18:53 worker-18 NetworkManager[4760]: <info>  [1657729133.5015] dhcp4 (ens2f0): state changed timeout -> done
3822740:Jul 13 16:18:53 worker-18 NetworkManager[4760]: <info>  [1657729133.5076] policy: auto-activating connection 'Wired connection 8' (bb09809e-1756-3375-9845-1c2cd5600b56)
3822741:Jul 13 16:18:53 worker-18 NetworkManager[4760]: <info>  [1657729133.5088] device (ens2f0): Activation: starting connection 'Wired connection 8' (bb09809e-1756-3375-9845-1c2cd5600b56)
3822742:Jul 13 16:18:53 worker-18 NetworkManager[4760]: <info>  [1657729133.5089] device (ens2f0): state change: disconnected -> prepare (reason 'none', sys-iface-state: 'managed')
3822743:Jul 13 16:18:53 worker-18 NetworkManager[4760]: <info>  [1657729133.5094] device (ens2f0): state change: prepare -> config (reason 'none', sys-iface-state: 'managed')
3822744:Jul 13 16:18:53 worker-18 NetworkManager[4760]: <info>  [1657729133.5100] device (ens2f0): state change: config -> ip-config (reason 'none', sys-iface-state: 'managed')
3822745:Jul 13 16:18:53 worker-18 NetworkManager[4760]: <info>  [1657729133.5104] dhcp4 (ens2f0): activation: beginning transaction (timeout in 45 seconds)
3822819:Jul 13 16:18:54 worker-18 nm-dispatcher[1882468]: NM resolv-prepender: Prepending 'nameserver 69.100.108.159' to /etc/resolv.conf (other nameservers from /var/run/NetworkManager/resolv.conf)
3822824:Jul 13 16:18:54 worker-18 nm-dispatcher[1882468]: + echo 'Not an infinite DHCP4 lease. Ignoring.'
3822825:Jul 13 16:18:54 worker-18 nm-dispatcher[1882468]: Not an infinite DHCP4 lease. Ignoring.
3822828:Jul 13 16:18:54 worker-18 nm-dispatcher[1882468]: + echo 'Not a DHCP6 address. Ignoring.'
3822829:Jul 13 16:18:54 worker-18 nm-dispatcher[1882468]: Not a DHCP6 address. Ignoring.
3822831:Jul 13 16:18:54 worker-18 root[1882843]: NM mdns-hostname triggered by dhcp4-change.
3822832:Jul 13 16:18:54 worker-18 nm-dispatcher[1882468]: <13>Jul 13 16:18:54 root: NM mdns-hostname triggered by dhcp4-change.
3822969:Jul 13 16:19:02 worker-18 kernel: nfs: server 69.100.109.84 not responding, timed out
3822970:Jul 13 16:19:02 worker-18 kernel: nfs: server 69.100.109.85 not responding, timed out
3822972:Jul 13 16:19:04 worker-18 systemd[1]: NetworkManager-dispatcher.service: Succeeded.
3822973:Jul 13 16:19:04 worker-18 systemd[1]: NetworkManager-dispatcher.service: Consumed 470ms CPU time
3823029:Jul 13 16:19:10 worker-18 kernel: nfs: server 69.100.109.84 not responding, still trying
3823035:Jul 13 16:19:13 worker-18 kernel: nfs: server 69.100.109.85 not responding, timed out
3823161:Jul 13 16:19:18 worker-18 kernel: nfs: server 69.100.109.84 not responding, timed out
3823190:Jul 13 16:19:23 worker-18 kernel: nfs: server 69.100.109.85 not responding, timed out
3823214:Jul 13 16:19:28 worker-18 kernel: nfs: server 69.100.109.84 not responding, still trying
3823215:Jul 13 16:19:28 worker-18 kernel: nfs: server 69.100.109.84 not responding, timed out
3823242:Jul 13 16:19:33 worker-18 kernel: nfs: server 69.100.109.85 not responding, timed out


Lastly, we see;

<warn>  [1657729182.9473] dhcp4 (bond0.284): selecting lease failed: -131

Which led me to [1], which appears to show that NetworkManagers' code at this step leads to;

return -ENOTRECOVERABLE;

----------

If you need any further details/have questions, please let me know.

[1] gitlab.freedesktop.org/NetworkManager/NetworkManager/-/issues/993

Comment 13 Darren Carpenter 2022-07-18 16:18:40 UTC
Hi All,

I know the trace+details we're only recently added, but just checking in to see if any progress has been made. Customer is putting a little bit of pressure on a resolution for this one. 

Thanks in advance ~ and sorry to bug.

Darren Carpenter

Comment 14 Beniamino Galvani 2022-07-19 14:49:05 UTC
The attached log doesn't show any "selecting lease failed" error as in the original report, and it looks like the lease is never lost. Is it the right log?

Comment 18 Victor Pickard 2022-07-20 13:36:57 UTC
*** Bug 2106903 has been marked as a duplicate of this bug. ***

Comment 28 Lars Kellogg-Stedman 2022-07-22 14:23:39 UTC
Ben (et al): While we wait for the fix to land in appropriate channels, are there any suggested workarounds for this problem?

Comment 30 Ben Nemec 2022-07-22 17:28:43 UTC
(In reply to Lars Kellogg-Stedman from comment #28)
> Ben (et al): While we wait for the fix to land in appropriate channels, are
> there any suggested workarounds for this problem?

I don't really know what the circumstances are that cause this so I can't suggest a way to avoid it. I suppose you could write a service to periodically check the interfaces and if they lose their address then force them to retry. That could still result in connectivity breaks in the meantime though. :-/

Comment 32 Beniamino Galvani 2022-07-25 13:11:48 UTC
Added links to upstream issue and NM-CI merge request to add test.

Comment 44 errata-xmlrpc 2022-11-08 10:10:38 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (NetworkManager bug fix and enhancement update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2022:7680