Bug 1986370

Summary: network-metrics-daemon pods are scheduled to nodes where network is not ready
Product: OpenShift Container Platform Reporter: Oleg Bulatov <obulatov>
Component: NetworkingAssignee: Tomofumi Hayashi <tohayash>
Networking sub component: multus QA Contact: Weibin Liang <weliang>
Status: CLOSED NOTABUG Docs Contact:
Severity: medium    
Priority: unspecified CC: anbhat
Version: 4.9   
Target Milestone: ---   
Target Release: ---   
Hardware: Unspecified   
OS: Unspecified   
Whiteboard:
Fixed In Version: Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of: Environment:
Last Closed: 2021-10-23 15:06:02 UTC Type: Bug
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: --- RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Embargoed:

Description Oleg Bulatov 2021-07-27 11:56:53 UTC
Description of problem:

In the e2e-aws-serial job, the pods of the daemonset network-metrics-daemon generate many events like

ns/openshift-multus pod/network-metrics-daemon-jq8zq node/ip-10-0-242-255.us-east-2.compute.internal - reason/NetworkNotReady network is not ready: container runtime network not ready: NetworkReady=false reason:NetworkPluginNotReady message:Network plugin returns error: No CNI configuration file in /etc/kubernetes/cni/net.d/. Has your network provider started?

It happens during the test

[sig-cluster-lifecycle][Feature:Machines][Serial] Managed cluster should grow and decrease when scaling different machineSets simultaneously [Suite:openshift/conformance/serial]

Repeated events are usually indicators of a problem.

Version-Release number of selected component (if applicable):

master

How reproducible:

Always

Steps to Reproduce:

1. Run e2e-aws-serial.

Actual results:

The test `[sig-arch] events should not repeat pathologically` detects these events.
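The pathological-event check works by counting repeated events and flagging any that recur beyond a limit. A minimal sketch of that idea (the threshold and event keys here are illustrative, not the actual values used by openshift/origin):

```go
package main

import "fmt"

// flagRepeated tallies how many times each event key occurs and returns
// the ones that repeat more than threshold times. This mirrors the idea
// behind the "[sig-arch] events should not repeat pathologically" test;
// the real test in openshift/origin uses its own keys and limits.
func flagRepeated(events []string, threshold int) []string {
	counts := map[string]int{}
	for _, e := range events {
		counts[e]++
	}
	var flagged []string
	for e, n := range counts {
		if n > threshold {
			flagged = append(flagged, fmt.Sprintf("%s (x%d)", e, n))
		}
	}
	return flagged
}

func main() {
	events := []string{
		"openshift-multus/network-metrics-daemon-jq8zq NetworkNotReady",
		"openshift-multus/network-metrics-daemon-jq8zq NetworkNotReady",
		"openshift-multus/network-metrics-daemon-jq8zq NetworkNotReady",
		"openshift-network-diagnostics/network-check-target-shj2x NetworkNotReady",
	}
	fmt.Println(flagRepeated(events, 2))
}
```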

Expected results:

The network-metrics-daemon pods don't have problems during machineSets manipulations.

Additional info:

Comment 1 Oleg Bulatov 2021-07-27 14:50:48 UTC
ns/openshift-network-diagnostics pod/network-check-target-shj2x node/ip-10-0-180-110.us-west-2.compute.internal - reason/NetworkNotReady network is not ready: container runtime network not ready: NetworkReady=false reason:NetworkPluginNotReady message:Network plugin returns error: No CNI configuration file in /etc/kubernetes/cni/net.d/. Has your network provider started?

These events seem to be of a similar nature.


Comment 3 Oleg Bulatov 2021-07-28 10:24:40 UTC
The mentioned test hasn't been merged yet, the PR with the test: https://github.com/openshift/origin/pull/26323

Comment 4 Tomofumi Hayashi 2021-10-23 15:06:02 UTC
This issue is not a bug.

Looking at the CI, just before the error message appears, the OCP cluster adds a new node. At that point kubelet starts early in the node deployment, and daemonset pods, including the network-related pods, are scheduled
(the network is not yet ready at that time because the CNI plugin still has to be installed by the multus and ovn/openshift-sdn pods).

In addition, the network-metrics-daemon daemonset uses a fairly lightweight container image, so its pod can be scheduled before the CNI plugin is ready, since the CNI plugin is not installed yet.
A few minutes later, once the multus/ovn/openshift-sdn pods have installed the CNI plugins on the node, kubelet stops emitting the error message and starts network-metrics-daemon and the other pods.

Hence kubelet emits this error message (container runtime network not ready) because the network genuinely is not ready at that point.
https://github.com/kubernetes/kubernetes/blob/master/pkg/kubelet/kubelet.go#L2347
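The gating described above can be sketched as follows. While the runtime reports NetworkReady=false, pods that need pod networking are held back with the "network is not ready" error, but host-network pods (multus, ovn/openshift-sdn) are still admitted so they can install the CNI plugin. The types and function names below are illustrative, not kubelet's actual API:

```go
package main

import (
	"errors"
	"fmt"
)

// pod is a minimal stand-in for the fields relevant here.
type pod struct {
	name        string
	hostNetwork bool
}

// canRunPod sketches kubelet's check: a pod that needs pod networking is
// rejected while the runtime network is not ready; host-network pods are
// admitted regardless, which is how the CNI-installing pods get to run.
func canRunPod(p pod, networkReady bool) error {
	if !networkReady && !p.hostNetwork {
		return errors.New("network is not ready: container runtime network not ready: NetworkReady=false")
	}
	return nil
}

func main() {
	pods := []pod{
		{name: "multus-abc12", hostNetwork: true},
		{name: "network-metrics-daemon-jq8zq", hostNetwork: false},
	}
	for _, p := range pods {
		if err := canRunPod(p, false); err != nil {
			fmt.Printf("%s: %v\n", p.name, err)
		} else {
			fmt.Printf("%s: admitted\n", p.name)
		}
	}
}
```

This is why the error clears on its own: once multus and the SDN pods (admitted despite NetworkReady=false) install the CNI configuration, the runtime reports the network ready and the remaining daemonset pods start.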

To prevent the error message, cri-o or kubelet would need to understand pod dependencies (i.e. hold back all pods except openshift-sdn/ovn/multus until the network is ready); however, this has not been discussed, designed, or implemented upstream because upstream does not consider it an error.