Bug 1821833

Summary: Missing CNI default network Error
Product: OpenShift Container Platform
Component: Networking
Networking sub component: openshift-sdn
Version: 4.5
Target Release: 4.5.0
Status: CLOSED ERRATA
Severity: high
Priority: unspecified
Hardware: Unspecified
OS: Unspecified
Reporter: Daneyon Hansen <dhansen>
Assignee: Douglas Smith <dosmith>
QA Contact: zhaozhanqi <zzhao>
CC: anbhat
Doc Type: No Doc Update
Type: Bug
Last Closed: 2020-07-13 17:26:04 UTC
Bug Blocks: 1824936, 1824938

Description Daneyon Hansen 2020-04-07 17:10:55 UTC
Description of problem:
The release-openshift-origin-installer-e2e-gcp-upgrade-4.4 CI job consistently fails [1]. While investigating the cause of the failures, I see hundreds of instances of the following error:

"network is not ready: runtime network not ready: NetworkReady=false reason:NetworkPluginNotReady message:Network plugin returns error: Missing CNI default network"

I think this may be causing a cascading effect that results in a failure to maintain a functioning cluster during an upgrade. I found a similar bug [2], but maybe the fix needs a broader scope? Does [3] fix this issue? If so, it should be backported.

See [4] for additional background.

Version-Release number of selected component (if applicable):
4.4

How reproducible:
The errors are observed in all e2e-gcp-upgrade-4.4 CI job failures.

Steps to Reproduce:
1. Open the test grid at [1].
2. Pick a failed CI job.
3. Search the job logs for the "Missing CNI default network" error message.
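
The search in step 3 can be scripted to count occurrences once a job log has been downloaded. This is a minimal sketch, not part of the original report: the helper name count_cni_errors and the file name build-log.txt are hypothetical placeholders.

```shell
# Hypothetical helper: print how many lines contain the error.
# Reads the files given as arguments, or stdin when none are given.
count_cni_errors() {
  # grep -c exits non-zero when the count is 0, so mask the status.
  cat -- "$@" | grep -c -F 'Missing CNI default network' || true
}

# Usage (build-log.txt is a placeholder for a downloaded CI job log):
#   count_cni_errors build-log.txt
```

A fixed-string match (-F) is used so that no part of the message is interpreted as a regular expression.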

Actual results:
Failed e2e-gcp-upgrade-4.4 job

Expected results:
Passed e2e-gcp-upgrade-4.4 job

Additional info:
[1] https://testgrid.k8s.io/redhat-openshift-ocp-release-4.4-informing#release-openshift-origin-installer-e2e-gcp-upgrade-4.4&sort-by-flakiness=&exclude-non-failed-tests=50

[2] https://bugzilla.redhat.com/show_bug.cgi?id=1754154

[3] https://github.com/openshift/multus-cni/pull/54

[4] https://coreos.slack.com/archives/CDCP2LA9L/p1585683028151700

Comment 2 Daneyon Hansen 2020-04-13 18:49:13 UTC
I believe the underlying issue may be related to [1], where operators are scheduling operands to nodes tainted as not ready.


[1] https://bugzilla.redhat.com/show_bug.cgi?id=1753059

Comment 7 zhaozhanqi 2020-04-15 10:04:55 UTC
Verified this bug on 4.5.0-0.nightly-2020-04-14-221451.

1. Reboot the worker node.
2. After the node restarts, check the kubelet logs. No new "Missing CNI default network" error messages are generated.
 journalctl -u kubelet | grep "Missing CNI default network"
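
The check in step 2 could also be expressed as a pass/fail test. This is a sketch under assumptions rather than the command used for the verification above: the helper name assert_no_cni_errors is hypothetical, and the -b flag (current boot only) is added here so that entries logged before the reboot are not counted.

```shell
# Hypothetical helper: succeeds only when stdin contains no
# "Missing CNI default network" lines.
assert_no_cni_errors() {
  if grep -q -F 'Missing CNI default network'; then
    echo "FAIL: CNI default network error still present" >&2
    return 1
  fi
  echo "PASS: no CNI default network errors found"
}

# Usage on a node (assumption: -b limits output to the current boot):
#   journalctl -u kubelet -b --no-pager | assert_no_cni_errors
```

Reading from stdin keeps the helper independent of journalctl, so the same check can be run against a saved log file.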

Comment 8 errata-xmlrpc 2020-07-13 17:26:04 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2020:2409