Description of problem:

In the e2e-aws-serial job, the pods of the network-metrics-daemon daemonset generate many events like:

ns/openshift-multus pod/network-metrics-daemon-jq8zq node/ip-10-0-242-255.us-east-2.compute.internal - reason/NetworkNotReady network is not ready: container runtime network not ready: NetworkReady=false reason:NetworkPluginNotReady message:Network plugin returns error: No CNI configuration file in /etc/kubernetes/cni/net.d/. Has your network provider started?

This happens during the test [sig-cluster-lifecycle][Feature:Machines][Serial] Managed cluster should grow and decrease when scaling different machineSets simultaneously [Suite:openshift/conformance/serial]. Repeated events are usually indicators of a problem.

Version-Release number of selected component (if applicable):
master

How reproducible:
Always

Steps to Reproduce:
1. Run e2e-aws-serial.

Actual results:
The test `[sig-arch] events should not repeat pathologically` detects these events.

Expected results:
The network-metrics-daemon pods don't have problems during machineSet manipulations.

Additional info:
ns/openshift-network-diagnostics pod/network-check-target-shj2x node/ip-10-0-180-110.us-west-2.compute.internal - reason/NetworkNotReady network is not ready: container runtime network not ready: NetworkReady=false reason:NetworkPluginNotReady message:Network plugin returns error: No CNI configuration file in /etc/kubernetes/cni/net.d/. Has your network provider started?

These events seem to be of a similar nature.
The test mentioned above hasn't been merged yet; the PR that adds it: https://github.com/openshift/origin/pull/26323
This issue is not a bug. Looking at the CI logs, the OCP cluster adds a new node just before the error message appears. At that point, kubelet has just started in an early phase of node deployment, and daemonset pods, including network-related pods, are scheduled to it. The network is not ready at that time because the CNI plugin is installed later by the multus and ovn/openshift-sdn pods. In addition, the network-metrics-daemon daemonset uses a fairly lightweight container image, so its pod can be scheduled before the CNI plugin is ready.

After several minutes, once the multus/ovn/openshift-sdn pods have installed the CNI plugins on the node, kubelet stops emitting the error message and starts network-metrics-daemon and the other pods. In other words, kubelet shows this error message ("container runtime network not ready") because the network genuinely is not ready: https://github.com/kubernetes/kubernetes/blob/master/pkg/kubelet/kubelet.go#L2347

To prevent the error message, cri-o or kubelet would need to understand pod dependencies (i.e. hold back all pods except openshift-sdn/ovn/multus until the network is ready). However, that has not been discussed, designed, or implemented upstream, because upstream does not consider this condition an error.
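To illustrate the behavior described above, here is a minimal, hypothetical Go sketch (not the actual kubelet source) of the gate kubelet applies: host-network pods such as multus and ovn/openshift-sdn are admitted even while the runtime reports NetworkReady=false, while ordinary pods like network-metrics-daemon are held back and trigger the repeated NetworkNotReady events until the CNI plugin is installed. All type and function names here are invented for illustration.

```go
package main

import (
	"errors"
	"fmt"
)

// runtimeState is a simplified stand-in for kubelet's view of the
// container runtime's network readiness.
type runtimeState struct {
	networkReady bool
}

// networkErrors mirrors the idea behind kubelet's readiness check:
// it returns an error while the runtime reports NetworkReady=false.
func (r *runtimeState) networkErrors() error {
	if !r.networkReady {
		return errors.New("network is not ready: container runtime network not ready: NetworkReady=false reason:NetworkPluginNotReady")
	}
	return nil
}

type pod struct {
	name        string
	hostNetwork bool
}

// canRunPod models the gate: host-network pods (which install the CNI
// plugin) may start before the network is ready; all other pods are
// held back, and the NetworkNotReady event is emitted for them.
func canRunPod(r *runtimeState, p pod) (bool, error) {
	if p.hostNetwork {
		return true, nil
	}
	if err := r.networkErrors(); err != nil {
		return false, err
	}
	return true, nil
}

func main() {
	state := &runtimeState{networkReady: false}
	pods := []pod{
		{name: "multus-abc12", hostNetwork: true},                 // CNI installer, host network
		{name: "network-metrics-daemon-jq8zq", hostNetwork: false}, // ordinary daemonset pod
	}
	for _, p := range pods {
		ok, err := canRunPod(state, p)
		fmt.Printf("%s admitted=%v err=%v\n", p.name, ok, err)
	}

	// Once the CNI plugin is installed, the runtime flips to ready,
	// the held-back pod starts, and the repeated events stop.
	state.networkReady = true
	ok, _ := canRunPod(state, pods[1])
	fmt.Printf("%s admitted=%v after CNI install\n", pods[1].name, ok)
}
```

This is only a model of the ordering problem: nothing in the sketch is wrong with the held-back pod itself, which is why upstream treats the repeated event as expected behavior during node bring-up rather than an error.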