1777040 – Multus does not provide a clear error message or cause when the default network is broken

Keywords:
Status:	CLOSED ERRATA
Alias:	None
Product:	OpenShift Container Platform
Classification:	Red Hat
Component:	Networking
Sub Component:
Version:	4.3.0
Hardware:	Unspecified
OS:	Unspecified
Priority:	unspecified
Severity:	urgent
Target Milestone:	---
Target Release:	4.5.0
Assignee:	Tomofumi Hayashi
QA Contact:	Weibin Liang
Docs Contact:
URL:
Whiteboard:
Duplicates (1):	1809327
Depends On:	1777038
Blocks:	1780611

Reported:	2019-11-26 19:49 UTC by Clayton Coleman
Modified:	2020-07-13 17:12 UTC
CC List:	4 users
Fixed In Version:
Doc Type:	No Doc Update
Doc Text:
Clone Of:	1777038
Clones:	1780611
Environment:
Last Closed:	2020-07-13 17:12:18 UTC
Target Upstream Version:
Embargoed:

Links
System	ID	Private	Priority	Status	Summary	Last Updated
Red Hat Product Errata	RHBA-2020:2409	0	None	None	None	2020-07-13 17:12:47 UTC

Description Clayton Coleman 2019-11-26 19:49:46 UTC

In the linked bug we pushed a cluster past what openshift-sdn could support, but multus only displayed a generic error in the Kubelet status ("CNI default network not found"). This is a pretty critical message to get right for an admin to know where to look, so there are a few improvements that need to be made:

1. The message should more clearly indicate what the heck a default CNI network means - I suggest something like "Default networking provider for this node has not been registered, unable to assign IPs to pods"

2. It would be good to consider whether multus has enough info to tell us where we should be looking next (the pod or provider) - if not, that's ok, but this is a usability chokepoint. Whether that's changes to CNI (so that partial failure could be reported) or whether it's something the network plugin should be telling multus about at a kube level, or whether it's some other mechanism, it doesn't matter so long as you don't have to infer *where* the default network provider is.

3. There should be a metric and alert for this sort of scenario. In the linked bug the networking provider should be reporting a capacity error (can't allocate any more subnets). But multus should probably be reporting a metric on whether the default network is available, and if more detail is available, why.

Setting to urgent because the message is not explicable to a normal admin, and a normal admin could hit it. It should be crystal clear that the admin needs to go debug the cluster networking provider and where that is.
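Point 3 above could look roughly like the following. This is only a sketch of the idea: the metric name `cni_default_network_ready`, the `reason` label, and the `networkStatus` type are hypothetical and are not taken from multus, ocicni, or cri-o. It renders a single gauge in the Prometheus text exposition format, with 1 meaning the default network is available and 0 meaning it is not.

```go
package main

import (
	"fmt"
	"strings"
)

// networkStatus is a hypothetical summary of the default network's state.
type networkStatus struct {
	Ready  bool
	Reason string // short machine-readable cause, e.g. "no_config_file"
}

// renderGauge formats the status as one gauge sample in the Prometheus
// text exposition format. The metric and label names are assumptions
// made for this sketch.
func renderGauge(s networkStatus) string {
	value := 0
	if s.Ready {
		value = 1
	}
	// Label values must escape backslash, double quote, and newline.
	esc := strings.NewReplacer(`\`, `\\`, `"`, `\"`, "\n", `\n`)
	return fmt.Sprintf("cni_default_network_ready{reason=\"%s\"} %d\n",
		esc.Replace(s.Reason), value)
}

func main() {
	fmt.Print(renderGauge(networkStatus{Ready: false, Reason: "no_config_file"}))
}
```

An alert could then fire when the gauge stays at 0 on a node for longer than normal startup would explain, with the `reason` label pointing the admin at the network provider.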

Comment 1 Tomofumi Hayashi 2019-12-02 04:29:22 UTC

Is the error message actually "Missing CNI default network", instead of "CNI default network not found"?

Comment 2 Tomofumi Hayashi 2019-12-04 14:09:46 UTC

https://github.com/cri-o/ocicni/pull/66

Comment 5 zhaozhanqi 2020-01-06 09:49:52 UTC

@Weibin, please help check whether this bug can be verified.

Comment 6 Tomofumi Hayashi 2020-01-09 13:43:26 UTC

PR 66 of ocicni is merged, but cri-o has not updated its vendored package yet, so this fix is not released yet.

Comment 7 Weibin Liang 2020-01-09 13:50:03 UTC

Tested on 4.4.0-0.nightly-2020-01-06-072200

Still see old message "NetworkPluginNotReady message:Network plugin returns error: Missing CNI default network"

Should see the new message like https://github.com/cri-o/ocicni/pull/66/files#diff-8619c46869a984d0a789a8642c1b7654R64

Waiting for cri-o to update its vendored package.

Comment 8 Weibin Liang 2020-01-27 13:22:11 UTC

Still waiting for https://github.com/cri-o/cri-o/pull/3160

Comment 9 zhaozhanqi 2020-02-03 02:26:43 UTC

@weibin

The above PR has been merged. Could you help check whether this bug can be verified? Thanks

Comment 10 Weibin Liang 2020-02-04 16:35:03 UTC

https://github.com/cri-o/cri-o/pull/3160 is merged, but verification failed on 4.4.0-0.nightly-2020-02-04-101225, still see the old log error:

Feb 04 16:30:55 ip-10-0-141-246 hyperkube[1671]: I0204 16:30:55.294828    1671 event.go:281] Event(v1.ObjectReference{Kind:"Pod", Namespace:"openshift-dns", Name:"dns-default-5fgzj", UID:"55da9c88-b51d-4545-8483-d76bfe450be7", APIVersion:"v1", ResourceVersion:"12947", FieldPath:""}): type: 'Warning' reason: 'NetworkNotReady' network is not ready: runtime network not ready: NetworkReady=false reason:NetworkPluginNotReady message:Network plugin returns error: Missing CNI default network

Comment 12 Tomofumi Hayashi 2020-03-09 02:36:14 UTC

@Aniket,

I looked at OCP's cri-o repo[1]. It uses cri-o 1.17, and cri-o 1.17 does not contain the fix[2].
The PR[3] introduces the fix upstream, so it should be fixed in OCP 4.5.0, which introduces cri-o 1.18[4].

[1]: http://pkgs.devel.redhat.com/cgit/rpms/cri-o/tree/cri-o.spec?h=rhaos-4.4-rhel-8#n30
[2]: https://github.com/cri-o/cri-o/blob/release-1.17/vendor/github.com/cri-o/ocicni/pkg/ocicni/ocicni.go#L57
[3]: https://github.com/cri-o/cri-o/commit/437fb7356bd3667794238d3f26a073da75a0a391
[4]: https://github.com/cri-o/cri-o/blob/release-1.18/vendor/github.com/cri-o/ocicni/pkg/ocicni/ocicni.go#L64

Comment 14 Ben Bennett 2020-05-08 19:07:07 UTC

@tomo -- Is this in 4.5?  Can we move it to ON_QE yet?

Comment 15 Ben Bennett 2020-05-08 23:19:16 UTC

*** Bug 1809327 has been marked as a duplicate of this bug. ***

Comment 16 Tomofumi Hayashi 2020-05-08 23:28:22 UTC

@ben, rhaos-4.5-rhel-8 and rhaos-4.5-rhel-7 have the fix now, hence we can move it to ON_QE.

Comment 17 Weibin Liang 2020-05-11 14:43:21 UTC

Tested and verified in 4.5.0-0.nightly-2020-05-11-114800; the logs now show the fixed warning message:

May 11 14:40:48 welian-ch6sv-w-b-77g8l.c.openshift-qe.internal hyperkube[1503]: I0511 14:40:48.865749    1503 event.go:278] Event(v1.ObjectReference{Kind:"Pod", Namespace:"openshift-monitoring", Name:"prometheus-k8s-1", UID:"7846227b-30cd-4e41-ae90-927c2a4a48ae", APIVersion:"v1", ResourceVersion:"20128", FieldPath:""}): type: 'Warning' reason: 'NetworkNotReady' network is not ready: runtime network not ready: NetworkReady=false reason:NetworkPluginNotReady message:Network plugin returns error: No CNI configuration file in /etc/kubernetes/cni/net.d/. Has your network provider started?
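The new message names the directory that was searched and suggests a likely cause. A minimal sketch of a check that produces an error of this shape is shown below. It is not the ocicni code: the function names `missingConfigError` and `checkDefaultNetwork` are hypothetical, and the list of extensions treated as CNI configuration is an assumption of this sketch.

```go
package main

import (
	"fmt"
	"os"
	"path/filepath"
)

// cniExtensions lists the extensions treated as CNI configuration files
// in this sketch (an assumption, not taken from the bug report).
var cniExtensions = map[string]bool{".conf": true, ".conflist": true, ".json": true}

// missingConfigError returns nil if any of names looks like a CNI
// configuration file. Otherwise it returns an error that names the
// directory searched and hints at the likely cause. The wording mirrors
// the log line above, so the capitalized error string is kept.
func missingConfigError(dir string, names []string) error {
	for _, n := range names {
		if cniExtensions[filepath.Ext(n)] {
			return nil
		}
	}
	return fmt.Errorf("No CNI configuration file in %s. Has your network provider started?", dir)
}

// checkDefaultNetwork lists the regular entries of dir and applies
// missingConfigError to their names.
func checkDefaultNetwork(dir string) error {
	entries, err := os.ReadDir(dir)
	if err != nil {
		return fmt.Errorf("cannot read CNI configuration directory %s: %w", dir, err)
	}
	names := make([]string, 0, len(entries))
	for _, e := range entries {
		if !e.IsDir() {
			names = append(names, e.Name())
		}
	}
	return missingConfigError(dir, names)
}

func main() {
	if err := checkDefaultNetwork("/etc/kubernetes/cni/net.d/"); err != nil {
		fmt.Println(err)
	}
}
```

Keeping the message construction in a function that takes plain file names makes the wording easy to test without touching the node's filesystem.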

Comment 20 errata-xmlrpc 2020-07-13 17:12:18 UTC

Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2020:2409
