Bug 1828458

Summary:	[RFE] nm-online systemd unit file has hardcoded timeout
Product:	Red Hat Enterprise Linux 8	Reporter:	Dan Williams <dcbw>
Component:	NetworkManager	Assignee:	Thomas Haller <thaller>
Status:	CLOSED ERRATA	QA Contact:	Desktop QE <desktop-qa-list>
Severity:	high	Docs Contact:
Priority:	high
Version:	8.2	CC:	acardace, atragler, bgalvani, lrintel, rkhan, rob.fisher, sukulkar, thaller, till, vbenes
Target Milestone:	rc	Keywords:	FutureFeature
Target Release:	8.2	Flags:	pm-rhel: mirror+
Hardware:	All
OS:	All
Whiteboard:	Telco
Fixed In Version:	NetworkManager-1.25.1-1.el8	Doc Type:	If docs needed, set a value
Doc Text:		Story Points:	---
Clone Of:		Environment:
Last Closed:	2020-11-04 01:49:42 UTC	Type:	Bug
Regression:	---	Mount Type:	---
Documentation:	---	CRM:
Verified Versions:		Category:	---
oVirt Team:	---	RHEL 7.3 requirements from Atomic Host:
Cloudforms Team:	---	Target Upstream Version:
Embargoed:

Description Dan Williams 2020-04-27 18:15:50 UTC

While debugging an OpenShift cluster, we found that it takes 32 seconds from NM startup until a DHCP lease is acquired. This causes nm-online/NetworkManager-wait-online to hit its timeout and fail, which allows other units to proceed before networking is up.

This in turn causes kubelet to see the wrong hostname (which NM then sets correctly 2 seconds later) and fail to add the node to the cluster.

We could edit the systemd unit file for NetworkManager-wait-online to adjust the default 30s timeout, but it would be nicer if the unit file sourced an env file somewhere on the machine and, if that file is present, used its value instead of the 30s default.

e.g.:

EnvironmentFile=-/etc/sysconfig/nm-online
ExecStart=/usr/bin/nm-online -s -q --timeout=$NM_ONLINE_TIMEOUT
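
A minimal sketch of how the full [Service] fragment could look, assuming the variable name NM_ONLINE_TIMEOUT and the env file path from the suggestion above (both are illustrative, not what NetworkManager necessarily ships). Because the leading "-" makes the env file optional, the variable would be unset when the file is missing, so the sketch sets a default with Environment= first; per systemd.exec(5), values read through EnvironmentFile= override those set with Environment=. It also uses the ${VAR} form, which systemd.service(5) documents for substitution inside a word.

```ini
[Service]
# Default used when the optional env file is absent.
# 30 mirrors the timeout that is currently hardcoded.
Environment=NM_ONLINE_TIMEOUT=30
# Hypothetical path taken from the suggestion above; "-" means a missing file is not an error.
EnvironmentFile=-/etc/sysconfig/nm-online
# ${...} expands to exactly one argument, even in the middle of a word.
ExecStart=/usr/bin/nm-online -s -q --timeout=${NM_ONLINE_TIMEOUT}
```

With a fragment like this, a local admin would only need a single line such as NM_ONLINE_TIMEOUT=60 in the env file to raise the timeout.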

Comment 2 Thomas Haller 2020-04-28 20:22:25 UTC

the rhbz links to case https://access.redhat.com/support/cases/#/case/02639509, but I fail to understand how that is related.

It would be interesting to know why it takes 32 seconds. The timeout is something that should not be reached in regular operation. Maybe we should just increase the timeout to 45 seconds instead? Working systems should not suffer from this increased timeout.


How about https://gitlab.freedesktop.org/NetworkManager/NetworkManager/-/merge_requests/484 ?
Would "EnvironmentFile=-/etc/sysconfig/nm-online" be preferable? Why?

Comment 3 Dan Williams 2020-04-28 21:21:38 UTC

(In reply to Thomas Haller from comment #2)
> the rhbz links to case
> https://access.redhat.com/support/cases/#/case/02639509, but I fail to
> understand how that is related.

It's something we found while debugging the cluster for that specific support case, and fixing it will resolve one of the issues we found there WRT the node hostname.

> It would be interesting to know why it takes 32 seconds. The timeout is
> something that should not be reached in regular operation. Maybe we should
> just increase the timeout to 45 seconds instead? Working systems
> should not suffer from this increased timeout.

I think it's just "enterprise stuff takes longer". We'll grab some NM logs to diagnose further.

> How about
> https://gitlab.freedesktop.org/NetworkManager/NetworkManager/-/merge_requests/484 ?
> Would "EnvironmentFile=-/etc/sysconfig/nm-online" be preferable? Why?

Filename was just a suggestion based off what other services do in RHEL/Fedora. Doesn't matter what the file is as long as it's %config and can be edited locally. Will comment on the MR.

Comment 5 Thomas Haller 2020-04-30 19:49:08 UTC

fixed on master: https://gitlab.freedesktop.org/NetworkManager/NetworkManager/-/commit/e468b48ab7b8e2ddc8802db4b93e3f13787835e4

Comment 8 Vladimir Benes 2020-07-09 13:26:59 UTC

CI test added:
https://gitlab.freedesktop.org/NetworkManager/NetworkManager-ci/-/merge_requests/602

Comment 11 errata-xmlrpc 2020-11-04 01:49:42 UTC

Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (NetworkManager bug fix and enhancement update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2020:4499