Bug 1780611 - Multus does not provide a clear error message or cause when the default network is broken
Keywords:
Status: CLOSED WONTFIX
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Networking
Version: 4.3.0
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: urgent
Target Milestone: ---
Target Release: 4.3.z
Assignee: Tomofumi Hayashi
QA Contact: Weibin Liang
URL:
Whiteboard:
Depends On: 1777038 1777040
Blocks:
Reported: 2019-12-06 14:07 UTC by Ben Bennett
Modified: 2020-07-01 13:05 UTC (History)

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of: 1777040
Environment:
Last Closed: 2020-07-01 13:05:00 UTC
Target Upstream Version:



Description Ben Bennett 2019-12-06 14:07:07 UTC
+++ This bug was initially created as a clone of Bug #1777040 +++

In the linked bug we pushed a cluster past what openshift-sdn could support, but multus only displayed a generic error in the Kubelet status ("CNI default network not found").  This is a pretty critical message to get right for an admin to know where to look, so there are a few improvements that need to be made:

1. The message should more clearly indicate what the heck a default CNI network means - I suggest something like "Default networking provider for this node has not been registered, unable to assign IPs to pods"

2. It would be good to consider whether multus has enough info to tell us where we should be looking next (the pod or provider) - if not, that's ok, but this is a usability chokepoint.  Whether that's changes to CNI (so that partial failure could be reported) or whether it's something the network plugin should be telling multus about at a kube level, or whether it's some other mechanism, it doesn't matter so long as you don't have to infer *where* the default network provider is.

3. There should be a metric and alert for this sort of scenario. In the linked bug the networking provider should be reporting a capacity error (can't allocate any more subnets).  But multus should probably be reporting a metric on whether the default network is available, and if more detail is available, why.

Setting to urgent because the message is not explicable to a normal admin, and a normal admin could hit it.  It should be crystal clear that the admin needs to go debug the cluster networking provider and where that is.

--- Additional comment from Tomofumi Hayashi on 2019-12-01 23:29:22 EST ---

Is the error message actually "Missing CNI default network", instead of "CNI default network not found"?

--- Additional comment from Tomofumi Hayashi on 2019-12-04 09:09:46 EST ---

https://github.com/cri-o/ocicni/pull/66

