Bug 1986370 - network-metrics-daemon pods are scheduled to nodes where network is not ready
Summary: network-metrics-daemon pods are scheduled to nodes where network is not ready
Keywords:
Status: CLOSED NOTABUG
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Networking
Version: 4.9
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: medium
Target Milestone: ---
Assignee: Tomofumi Hayashi
QA Contact: Weibin Liang
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2021-07-27 11:56 UTC by Oleg Bulatov
Modified: 2021-10-23 15:06 UTC (History)

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2021-10-23 15:06:02 UTC
Target Upstream Version:
Embargoed:



Description Oleg Bulatov 2021-07-27 11:56:53 UTC
Description of problem:

In the e2e-aws-serial job, the pods of the daemonset network-metrics-daemon generate many events like

ns/openshift-multus pod/network-metrics-daemon-jq8zq node/ip-10-0-242-255.us-east-2.compute.internal - reason/NetworkNotReady network is not ready: container runtime network not ready: NetworkReady=false reason:NetworkPluginNotReady message:Network plugin returns error: No CNI configuration file in /etc/kubernetes/cni/net.d/. Has your network provider started?

It happens during the test

[sig-cluster-lifecycle][Feature:Machines][Serial] Managed cluster should grow and decrease when scaling different machineSets simultaneously [Suite:openshift/conformance/serial]

Repeated events are usually indicators of a problem.
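
To quantify how often a given pod repeats an event, lines in the format quoted above can be tallied per namespace, pod, and reason. The following is a minimal sketch, not the implementation of the origin test: the regular expression is inferred only from the two event lines quoted in this bug, and the function names and the `threshold` parameter are hypothetical.

```python
import re
from collections import Counter

# Pattern inferred from the event lines quoted in this bug (an assumption):
#   ns/<namespace> pod/<pod> node/<node> - reason/<reason> <message>
EVENT_RE = re.compile(
    r"^ns/(?P<ns>\S+) pod/(?P<pod>\S+) node/(?P<node>\S+) - "
    r"reason/(?P<reason>\S+) (?P<message>.*)$"
)


def parse_event(line):
    """Return the event fields as a dict, or None if the line does not match."""
    match = EVENT_RE.match(line.strip())
    return match.groupdict() if match else None


def count_repeats(lines):
    """Count events per (namespace, pod, reason), skipping unparseable lines."""
    counts = Counter()
    for line in lines:
        event = parse_event(line)
        if event is not None:
            counts[(event["ns"], event["pod"], event["reason"])] += 1
    return counts


def repeated(counts, threshold):
    """Return the keys seen more than `threshold` times.

    `threshold` is chosen by the caller; the value used by the real test
    is not stated in this bug.
    """
    return sorted(key for key, n in counts.items() if n > threshold)
```

Passing the job's event lines to `count_repeats` and then calling `repeated` with a chosen threshold lists the pods whose events repeat most often.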

Version-Release number of selected component (if applicable):

master

How reproducible:

Always

Steps to Reproduce:

1. Run e2e-aws-serial.

Actual results:

The test `[sig-arch] events should not repeat pathologically` detects these events.

Expected results:

The network-metrics-daemon pods don't have problems during machineSets manipulations.

Additional info:

Comment 1 Oleg Bulatov 2021-07-27 14:50:48 UTC
ns/openshift-network-diagnostics pod/network-check-target-shj2x node/ip-10-0-180-110.us-west-2.compute.internal - reason/NetworkNotReady network is not ready: container runtime network not ready: NetworkReady=false reason:NetworkPluginNotReady message:Network plugin returns error: No CNI configuration file in /etc/kubernetes/cni/net.d/. Has your network provider started?

These events seem to have a similar nature.

Comment 3 Oleg Bulatov 2021-07-28 10:24:40 UTC
The mentioned test hasn't been merged yet. The PR with the test is https://github.com/openshift/origin/pull/26323

Comment 4 Tomofumi Hayashi 2021-10-23 15:06:02 UTC
This issue is not a bug.

Looking at the CI run, the OCP cluster adds a new node just before the error message appears. Kubelet starts early in the node deployment, and daemonset pods, including the network-related pods, are scheduled to the node right away (at that point the network is not ready, because the CNI plugin has yet to be installed by the multus and ovn/openshift-sdn pods).

In addition, the network-metrics-daemon daemonset uses a fairly lightweight container image, so its pod can be ready to start before the CNI plugin is installed.
After several minutes, once the multus/ovn/openshift-sdn pods have installed the CNI plugins on the node, kubelet stops emitting the error message and starts network-metrics-daemon and the other pods.
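
The readiness condition behind the "No CNI configuration file in /etc/kubernetes/cni/net.d/" message can be pictured as a directory check. This is a simplified sketch, not the cri-o code: the set of file extensions is an assumption based on common CNI loader behavior, and the function names are hypothetical.

```python
from pathlib import Path

# Extensions assumed to count as CNI configuration files (an assumption,
# not taken from this bug).
CNI_CONF_EXTENSIONS = {".conf", ".conflist", ".json"}


def cni_config_files(conf_dir):
    """Return the sorted names of candidate CNI config files in conf_dir.

    A missing directory is treated the same as an empty one.
    """
    directory = Path(conf_dir)
    if not directory.is_dir():
        return []
    return sorted(
        entry.name
        for entry in directory.iterdir()
        if entry.is_file() and entry.suffix in CNI_CONF_EXTENSIONS
    )


def network_ready(conf_dir):
    """Report the network as ready once at least one config file exists."""
    return len(cni_config_files(conf_dir)) > 0
```

On a freshly added node, `network_ready("/etc/kubernetes/cni/net.d/")` would return False until the network pods write their configuration, which matches the window in which the events are emitted.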

Hence kubelet shows this error message (container runtime network not ready) because the network really is not ready yet.
https://github.com/kubernetes/kubernetes/blob/master/pkg/kubelet/kubelet.go#L2347
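
The behavior described above can be modeled roughly as follows. This is a simplified Python sketch, not the kubelet code at the link: the `Pod` class and `sync_pod` function are hypothetical, and the exemption for host-network pods is an assumption based on my understanding of upstream kubelet (it is how the multus/ovn/openshift-sdn pods can start on a node that has no CNI configuration yet).

```python
from dataclasses import dataclass


@dataclass
class Pod:
    name: str
    host_network: bool = False


def sync_pod(pod, network_ready, events):
    """Return True if the pod may start now, False if it must be retried.

    While the runtime network is not ready, a pod that needs pod
    networking is held back and a NetworkNotReady event is recorded.
    Each retry records another event, which is why the events repeat.
    """
    if not network_ready and not pod.host_network:
        events.append((pod.name, "NetworkNotReady"))
        return False
    return True
```

In this model, calling `sync_pod` repeatedly for network-metrics-daemon while `network_ready` is False adds one event per attempt, and the first call after the network becomes ready returns True without adding an event.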

To prevent the error message, cri-o or kubelet would need to recognize pod dependencies (i.e. hold back all pods except openshift-sdn/ovn/multus until the network is ready). However, this has not been discussed, designed, or implemented upstream, because upstream does not consider this an error.

