Description of problem: SRIOV connectivity issue - SRIOV error when trying to request a VF, and nodes lost their SRIOV interfaces.

As per the log below:

"Found pre-allocated devices for resource openshift.io/leftnuma0x710 container "eric-pc-up-data-plane" in Pod "946025e3-be22-4d15-b0d1-f3143973beab": [0000:37:02.5]"

it seems the kernel failed to allocate memory for the PCI device with PCI ID 0000:37:02.5. Because the allocation failed, the corresponding entry could not be created in the sys filesystem. During the init stage every application tries to read the VF details from the sys filesystem and bind to it; since the device does not exist, there is no such entry in the sys filesystem, as expected.

---------------------------------------------------------------------------------------------------------------------------
Jul 23 15:04:00 mxq949065g hyperkube[3980]: I0723 15:04:00.314165 3980 topology_manager.go:233] [topologymanager] Topology Admit Handler
Jul 23 15:04:00 mxq949065g hyperkube[3980]: I0723 15:04:00.314182 3980 manager.go:843] needs 12 cpu
Jul 23 15:04:00 mxq949065g hyperkube[3980]: I0723 15:04:00.314189 3980 manager.go:843] needs 1073741824 hugepages-1Gi
Jul 23 15:04:00 mxq949065g hyperkube[3980]: I0723 15:04:00.314196 3980 manager.go:843] needs 25769803776 memory
Jul 23 15:04:00 mxq949065g hyperkube[3980]: I0723 15:04:00.314201 3980 manager.go:843] needs 1 openshift.io/leftnuma0x710
Jul 23 15:04:00 mxq949065g hyperkube[3980]: I0723 15:04:00.314229 3980 manager.go:647] Found pre-allocated devices for resource openshift.io/leftnuma0x710 container "eric-pc-up-data-plane" in Pod "946025e3-be22-4d15-b0d1-f3143973beab": [0000:37:02.5]
Jul 23 15:04:18 mxq949065g crio[3920]: time="2021-07-23 15:04:18.868024395Z" level=error msg="Error adding network: [w6017-c1-sl01-upflab1/eric-pc-up-data-plane-c7949486b-4jkx8:leftnuma0]: error adding container to network \"leftnuma0\": SRIOV-CNI failed to load netconf: LoadConf(): failed to get VF information: \"lstat /sys/bus/pci/devices/0000:37:02.5/physfn/net: no such file or directory\""
Jul 23 15:04:18 mxq949065g crio[3920]: time="2021-07-23 15:04:18.868061477Z" level=error msg="Error while adding pod to CNI network \"multus-cni-network\": [w6017-c1-sl01-upflab1/eric-pc-up-data-plane-c7949486b-4jkx8:leftnuma0]: error adding container to network \"leftnuma0\": SRIOV-CNI failed to load netconf: LoadConf(): failed to get VF information: \"lstat /sys/bus/pci/devices/0000:37:02.5/physfn/net: no such file or directory\""
Jul 23 15:04:18 mxq949065g crio[3920]: time="2021-07-23 15:04:18.868086302Z" level=info msg="NetworkStart: stopping network for sandbox 8d79b48da30fab1f9f46f6c15a0fe679ff7fb401e10c5fa62a78c04b30dfab97" id=1d7ab073-21f6-4521-9326-9c56bee0fe3b name=/runtime.v1alpha2.RuntimeService/RunPodSandbox
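For reference, a quick way to check on the worker node whether the VF and its parent PF are actually present in sysfs; a minimal sketch, using the VF address 0000:37:02.5 and PF address 0000:37:00.0 that appear in this report (adjust for other devices):

~~~~
# Does the VF PCI device exist at all?
lspci -s 0000:37:02.5

# The physfn symlink is the path SRIOV-CNI reads; it should list the PF netdev name
ls -l /sys/bus/pci/devices/0000:37:02.5/physfn/net/

# Number of VFs currently created on the PF
cat /sys/bus/pci/devices/0000:37:00.0/sriov_numvfs
~~~~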
As per the attachment in the case, below are some of the pointers towards the issue:

NAME                                        DISPLAY                           VERSION             REPLACES                                                  PHASE
namespace-configuration-operator.v1.0.5     Namespace Configuration Operator  1.0.5               namespace-configuration-operator.v1.0.4                   Replacing
namespace-configuration-operator.v1.0.6     Namespace Configuration Operator  1.0.6               namespace-configuration-operator.v1.0.5                   Pending
sriov-network-operator.4.6.0-202106032244   SR-IOV Network Operator           4.6.0-202106032244  sriov-network-operator.4.6.0-202106010807.p0.git.78e7139  Succeeded
namespace-configuration-operator.v1.0.6     Namespace Configuration Operator  1.0.6               namespace-configuration-operator.v1.0.5                   Pending

phase: Pending
installing: waiting for deployment elasticsearch-operator to become ready: Waiting for rollout to finish: 1 old replicas are pending termination...
ResourceQuotas, NetworkPolicies, EgressNetworkPolicies, etc.... . Depending phase: Pending
ResourceQuotas, NetworkPolicies, EgressNetworkPolicies, etc.... . Depending phase: Pending
installing: waiting for deployment sriov-network-operator to become ready: Waiting for rollout to finish: 1 old replicas are pending termination...
phase: Pending
message: 'install strategy failed: Deployment.apps "namespace-configuration-operator-controller-manager"
phase: Failed
reason: InstallComponentFailed
message: 'install strategy failed: Deployment.apps "namespace-configuration-operator-controller-manager"
phase: Failed
Message: install strategy failed: Deployment.apps "namespace-configuration-operator-controller-manager" is invalid: spec.selector: Invalid value: v1.LabelSelector{MatchLabels:map[string]string{"operator":"namespace-configuration-operator"}, MatchExpressions:[]v1.LabelSelectorRequirement(nil)}: field is immutable
Phase: Failed
Reason: InstallComponentFailed
Phase: Failed
Phase: Failed

--> Found a huge amount of "TLS handshake errors":
2021/07/04 05:04:08 http: TLS handshake error from 192.168.145.55:35882: EOF
2021/07/05 03:39:05 http: TLS handshake error from 192.168.145.55:56894: EOF
2021/07/06 01:09:05 http: TLS handshake error from 192.168.145.55:48700: EOF

=======
Jul 23 15:04:00 mxq949065g hyperkube[3980]: I0723 15:04:00.314229 3980 manager.go:647] Found pre-allocated devices for resource openshift.io/leftnuma0x710 container "eric-pc-up-data-plane" in Pod "946025e3-be22-4d15-b0d1-f3143973beab": [0000:37:02.5]
[11220.208454] iavf 0000:37:02.5 ens2f0v5: renamed from net1
[11221.629462] iavf 0000:37:02.5 temp_105: renamed from ens2f0v5
[11221.664059] iavf 0000:37:02.5 net1: renamed from temp_105
[11222.462375] iavf 0000:37:02.5 ens2f0v5: renamed from net1
[11223.481514] iavf 0000:37:02.5 temp_105: renamed from ens2f0v5
[11223.528717] iavf 0000:37:02.5 net1: renamed from temp_105
[11224.323564] iavf 0000:37:02.5 ens2f0v5: renamed from net1
[11225.534274] iavf 0000:37:02.5 temp_105: renamed from ens2f0v5
[11225.579891] iavf 0000:37:02.5 net1: renamed from temp_105
[11226.207440] iavf 0000:37:02.5 ens2f0v5: renamed from net1
[11227.491125] iavf 0000:37:02.5 temp_105: renamed from ens2f0v5
[11227.529681] iavf 0000:37:02.5 net1: renamed from temp_105
Jul 23 15:04:18 mxq949065g crio[3920]: time="2021-07-23 15:04:18.868061477Z" level=error msg="Error while adding pod to CNI network \"multus-cni-network\": [w6017-c1-sl01-upflab1/eric-pc-up-data-plane-c7949486b-4jkx8:leftnuma0]: error adding container to network \"leftnuma0\": SRIOV-CNI failed to load netconf: LoadConf(): failed to get VF information: \"lstat /sys/bus/pci/devices/0000:37:02.5/physfn/net: no such file or directory\""
Jul 23 15:04:20 mxq949065g hyperkube[3980]: E0723 15:04:20.230885 3980 remote_runtime.go:113] RunPodSandbox from runtime service failed: rpc error: code = Unknown desc = failed to create pod network sandbox k8s_eric-pc-up-data-plane-c7949486b-4jkx8_w6017-c1-sl01-upflab1_946025e3-be22-4d15-b0d1-f3143973beab_0(ac3c52ef7c96c1165e637fcde9b52d7e5253b70580426497dfe577a98358dfe1): [w6017-c1-sl01-upflab1/eric-pc-up-data-plane-c7949486b-4jkx8:leftnuma0]: error adding container to network "leftnuma0": SRIOV-CNI failed to load netconf: LoadConf(): failed to get VF information: "lstat /sys/bus/pci/devices/0000:37:02.5/physfn/net: no such file or directory"
Some SRIOV errors leading up to the node reboot Jul 23 ~ 15:04 (see sosreport-mxq949065g-02995395-2021-07-23-eyukwzj/sos_commands/logs/journalctl_--no-pager_--catalog_--boot_-1 for details): Jul 22 04:24:13 mxq949065g hyperkube[4018]: W0722 04:24:13.857631 4018 clientconn.go:1223] grpc: addrConn.createTransport failed to connect to {/var/lib/kubelet/plugins_registry/openshift.io_sriovrightdpdkintel.sock <nil> 0 <nil>}. Err :connection error: desc = "transport: Error while dialing dial unix /var/lib/kubelet/plugins_registry/openshift.io_sriovrightdpdkintel.sock: connect: connection refused". Reconnecting... Jul 22 04:24:13 mxq949065g hyperkube[4018]: I0722 04:24:13.857728 4018 balancer_conn_wrappers.go:78] pickfirstBalancer: HandleSubConnStateChange: 0xc003f916c0, {TRANSIENT_FAILURE connection error: desc = "transport: Error while dialing dial unix /var/lib/kubelet/plugins_registry/openshift.io_sriovrightdpdkintel.sock: connect: connection refused"} Jul 22 04:24:14 mxq949065g hyperkube[4018]: W0722 04:24:14.145719 4018 clientconn.go:1223] grpc: addrConn.createTransport failed to connect to {/var/lib/kubelet/plugins_registry/openshift.io_sriovleftdpdkintel.sock <nil> 0 <nil>}. Err :connection error: desc = "transport: Error while dialing dial unix /var/lib/kubelet/plugins_registry/openshift.io_sriovleftdpdkintel.sock: connect: connection refused". Reconnecting... Jul 22 04:24:14 mxq949065g hyperkube[4018]: I0722 04:24:14.145867 4018 balancer_conn_wrappers.go:78] pickfirstBalancer: HandleSubConnStateChange: 0xc003468c80, {TRANSIENT_FAILURE connection error: desc = "transport: Error while dialing dial unix /var/lib/kubelet/plugins_registry/openshift.io_sriovleftdpdkintel.sock: connect: connection refused"} Jul 22 04:24:14 mxq949065g hyperkube[4018]: W0722 04:24:14.482223 4018 clientconn.go:1223] grpc: addrConn.createTransport failed to connect to {/var/lib/kubelet/plugins_registry/openshift.io_sriovleftdpdkintelx710.sock <nil> 0 <nil>}. Err :connection error: desc = "transport: Error while dialing dial unix /var/lib/kubelet/plugins_registry/openshift.io_sriovleftdpdkintelx710.sock: connect: connection refused". Reconnecting... Jul 22 04:24:14 mxq949065g hyperkube[4018]: I0722 04:24:14.482275 4018 balancer_conn_wrappers.go:78] pickfirstBalancer: HandleSubConnStateChange: 0xc004f65490, {TRANSIENT_FAILURE connection error: desc = "transport: Error while dialing dial unix /var/lib/kubelet/plugins_registry/openshift.io_sriovleftdpdkintelx710.sock: connect: connection refused"} Jul 22 04:24:14 mxq949065g hyperkube[4018]: W0722 04:24:14.998329 4018 clientconn.go:1223] grpc: addrConn.createTransport failed to connect to {/var/lib/kubelet/plugins_registry/openshift.io_sriovrightdpdkintelx710.sock <nil> 0 <nil>}. Err :connection error: desc = "transport: Error while dialing dial unix /var/lib/kubelet/plugins_registry/openshift.io_sriovrightdpdkintelx710.sock: connect: connection refused". Reconnecting... Jul 22 04:24:14 mxq949065g hyperkube[4018]: I0722 04:24:14.998397 4018 balancer_conn_wrappers.go:78] pickfirstBalancer: HandleSubConnStateChange: 0xc0021ae090, {TRANSIENT_FAILURE connection error: desc = "transport: Error while dialing dial unix /var/lib/kubelet/plugins_registry/openshift.io_sriovrightdpdkintelx710.sock: connect: connection refused"} Jul 22 04:24:17 mxq949065g hyperkube[4018]: W0722 04:24:17.503222 4018 clientconn.go:1223] grpc: addrConn.createTransport failed to connect to {/var/lib/kubelet/plugins_registry/openshift.io_sriovrightdpdkintel.sock <nil> 0 <nil>}. 
Err :connection error: desc = "transport: Error while dialing dial unix /var/lib/kubelet/plugins_registry/openshift.io_sriovrightdpdkintel.sock: connect: connection refused". Reconnecting... Jul 22 04:24:17 mxq949065g hyperkube[4018]: I0722 04:24:17.503316 4018 balancer_conn_wrappers.go:78] pickfirstBalancer: HandleSubConnStateChange: 0xc003f916c0, {TRANSIENT_FAILURE connection error: desc = "transport: Error while dialing dial unix /var/lib/kubelet/plugins_registry/openshift.io_sriovrightdpdkintel.sock: connect: connection refused"} Jul 22 04:24:17 mxq949065g hyperkube[4018]: W0722 04:24:17.787865 4018 clientconn.go:1223] grpc: addrConn.createTransport failed to connect to {/var/lib/kubelet/plugins_registry/openshift.io_sriovleftdpdkintel.sock <nil> 0 <nil>}. Err :connection error: desc = "transport: Error while dialing dial unix /var/lib/kubelet/plugins_registry/openshift.io_sriovleftdpdkintel.sock: connect: connection refused". Reconnecting... Jul 22 04:24:17 mxq949065g hyperkube[4018]: I0722 04:24:17.787928 4018 balancer_conn_wrappers.go:78] pickfirstBalancer: HandleSubConnStateChange: 0xc003468c80, {TRANSIENT_FAILURE connection error: desc = "transport: Error while dialing dial unix /var/lib/kubelet/plugins_registry/openshift.io_sriovleftdpdkintel.sock: connect: connection refused"} Jul 22 04:24:18 mxq949065g hyperkube[4018]: W0722 04:24:18.394469 4018 clientconn.go:1223] grpc: addrConn.createTransport failed to connect to {/var/lib/kubelet/plugins_registry/openshift.io_sriovrightdpdkintelx710.sock <nil> 0 <nil>}. Err :connection error: desc = "transport: Error while dialing dial unix /var/lib/kubelet/plugins_registry/openshift.io_sriovrightdpdkintelx710.sock: connect: connection refused". Reconnecting... Jul 22 04:24:18 mxq949065g hyperkube[4018]: I0722 04:24:18.394533 4018 balancer_conn_wrappers.go:78] pickfirstBalancer: HandleSubConnStateChange: 0xc0021ae090, {TRANSIENT_FAILURE connection error: desc = "transport: Error while dialing dial unix /var/lib/kubelet/plugins_registry/openshift.io_sriovrightdpdkintelx710.sock: connect: connection refused"} Jul 22 04:24:19 mxq949065g hyperkube[4018]: E0722 04:24:19.175800 4018 goroutinemap.go:150] Operation for "/var/lib/kubelet/plugins_registry/openshift.io_sriovleftdpdkintelx710.sock" failed. No retries permitted until 2021-07-22 04:26:21.175779025 +0000 UTC m=+461169.579680555 (durationBeforeRetry 2m2s). Error: "RegisterPlugin error -- dial failed at socket /var/lib/kubelet/plugins_registry/openshift.io_sriovleftdpdkintelx710.sock, err: failed to dial socket /var/lib/kubelet/plugins_registry/openshift.io_sriovleftdpdkintelx710.sock, err: context deadline exceeded" Jul 22 04:24:19 mxq949065g hyperkube[4018]: E0722 04:24:19.175874 4018 goroutinemap.go:150] Operation for "/var/lib/kubelet/plugins_registry/openshift.io_sriovrightdpdkintel.sock" failed. No retries permitted until 2021-07-22 04:26:21.175856469 +0000 UTC m=+461169.579758035 (durationBeforeRetry 2m2s). Error: "RegisterPlugin error -- dial failed at socket /var/lib/kubelet/plugins_registry/openshift.io_sriovrightdpdkintel.sock, err: failed to dial socket /var/lib/kubelet/plugins_registry/openshift.io_sriovrightdpdkintel.sock, err: context deadline exceeded" Jul 22 04:24:19 mxq949065g hyperkube[4018]: E0722 04:24:19.175906 4018 goroutinemap.go:150] Operation for "/var/lib/kubelet/plugins_registry/openshift.io_sriovleftdpdkintel.sock" failed. No retries permitted until 2021-07-22 04:26:21.1758855 +0000 UTC m=+461169.579787013 (durationBeforeRetry 2m2s). 
Error: "RegisterPlugin error -- dial failed at socket /var/lib/kubelet/plugins_registry/openshift.io_sriovleftdpdkintel.sock, err: failed to dial socket /var/lib/kubelet/plugins_registry/openshift.io_sriovleftdpdkintel.sock, err: context deadline exceeded" Jul 22 04:24:19 mxq949065g hyperkube[4018]: E0722 04:24:19.175918 4018 goroutinemap.go:150] Operation for "/var/lib/kubelet/plugins_registry/openshift.io_sriovrightdpdkintelx710.sock" failed. No retries permitted until 2021-07-22 04:26:21.175910779 +0000 UTC m=+461169.579812292 (durationBeforeRetry 2m2s). Error: "RegisterPlugin error -- dial failed at socket /var/lib/kubelet/plugins_registry/openshift.io_sriovrightdpdkintelx710.sock, err: failed to dial socket /var/lib/kubelet/plugins_registry/openshift.io_sriovrightdpdkintelx710.sock, err: context deadline exceeded" Jul 22 04:24:23 mxq949065g hyperkube[4018]: I0722 04:24:23.381969 4018 factory.go:209] Factory "systemd" can handle container "/system.slice/var-lib-kubelet-pods-8f5e0a96\\x2dece8\\x2d4277\\x2db76e\\x2d2c2791938e2a-volumes-kubernetes.io\\x7esecret-sriov\\x2dcni\\x2dtoken\\x2dj6whh.mount", but ignoring. Jul 22 04:24:23 mxq949065g hyperkube[4018]: I0722 04:24:23.381993 4018 factory.go:209] Factory "systemd" can handle container "/system.slice/var-lib-kubelet-pods-df977e9c\\x2d4911\\x2d405c\\x2d89fd\\x2d15bac7d6e118-volumes-kubernetes.io\\x7esecret-sriov\\x2ddevice\\x2dplugin\\x2dtoken\\x2dfdrx9.mount", but ignoring. Jul 22 04:24:23 mxq949065g hyperkube[4018]: I0722 04:24:23.382745 4018 factory.go:209] Factory "systemd" can handle container "/system.slice/var-lib-kubelet-pods-4fe50017\\x2d47b5\\x2d46e9\\x2d822c\\x2d2f7989352eb1-volumes-kubernetes.io\\x7esecret-sriov\\x2dnetwork\\x2dconfig\\x2ddaemon\\x2dtoken\\x2djf9s2.mount", but ignoring. 
Jul 22 04:24:28 mxq949065g hyperkube[4018]: I0722 04:24:28.354934 4018 kubelet_pods.go:1486] Generating status for "sriov-device-plugin-wtflv_openshift-sriov-network-operator(df977e9c-4911-405c-89fd-15bac7d6e118)" Jul 22 04:24:28 mxq949065g hyperkube[4018]: I0722 04:24:28.355053 4018 status_manager.go:429] Ignoring same status for pod "sriov-device-plugin-wtflv_openshift-sriov-network-operator(df977e9c-4911-405c-89fd-15bac7d6e118)", status: {Phase:Running Conditions:[{Type:Initialized Status:True LastProbeTime:0001-01-01 00:00:00 +0000 UTC LastTransitionTime:2021-07-16 20:21:31 +0000 UTC Reason: Message:} {Type:Ready Status:True LastProbeTime:0001-01-01 00:00:00 +0000 UTC LastTransitionTime:2021-07-16 20:21:32 +0000 UTC Reason: Message:} {Type:ContainersReady Status:True LastProbeTime:0001-01-01 00:00:00 +0000 UTC LastTransitionTime:2021-07-16 20:21:32 +0000 UTC Reason: Message:} {Type:PodScheduled Status:True LastProbeTime:0001-01-01 00:00:00 +0000 UTC LastTransitionTime:2021-07-16 20:21:31 +0000 UTC Reason: Message:}] Message: Reason: NominatedNodeName: HostIP:172.18.243.53 PodIP:172.18.243.53 PodIPs:[{IP:172.18.243.53}] StartTime:2021-07-16 20:21:31 +0000 UTC InitContainerStatuses:[] ContainerStatuses:[{Name:sriov-device-plugin State:{Waiting:nil Running:&ContainerStateRunning{StartedAt:2021-07-16 20:21:32 +0000 UTC,} Terminated:nil} LastTerminationState:{Waiting:nil Running:nil Terminated:nil} Ready:true RestartCount:0 Image:registry.redhat.io/openshift4/ose-sriov-network-device-plugin@sha256:04d49ffe894ff992626d17f32688c6ab3e17b41ae5b7eebabf33a11cfffc01f1 ImageID:registry.redhat.io/openshift4/ose-sriov-network-device-plugin@sha256:04d49ffe894ff992626d17f32688c6ab3e17b41ae5b7eebabf33a11cfffc01f1 ContainerID:cri-o://0ac4492d2cbd2e3f2910a8714bef76752da37062b4b2f96190fac49815924eda Started:0xc004386039}] QOSClass:BestEffort EphemeralContainerStatuses:[]} Jul 22 04:24:28 mxq949065g hyperkube[4018]: I0722 04:24:28.355233 4018 volume_manager.go:372] Waiting for volumes to attach and mount for pod "sriov-device-plugin-wtflv_openshift-sriov-network-operator(df977e9c-4911-405c-89fd-15bac7d6e118)" Jul 22 04:24:28 mxq949065g hyperkube[4018]: I0722 04:24:28.355275 4018 volume_manager.go:403] All volumes are attached and mounted for pod "sriov-device-plugin-wtflv_openshift-sriov-network-operator(df977e9c-4911-405c-89fd-15bac7d6e118)" Jul 22 04:24:28 mxq949065g hyperkube[4018]: I0722 04:24:28.355424 4018 kuberuntime_manager.go:664] computePodActions got {KillPod:false CreateSandbox:false SandboxID:49284991864db3710fd5a75e235afe1efe0152900d511945e562d4eea2d0c970 Attempt:0 NextInitContainerToStart:nil ContainersToStart:[] ContainersToKill:map[] EphemeralContainersToStart:[]} for pod "sriov-device-plugin-wtflv_openshift-sriov-network-operator(df977e9c-4911-405c-89fd-15bac7d6e118)" Jul 22 04:24:28 mxq949065g hyperkube[4018]: I0722 04:24:28.385900 4018 secret.go:183] Setting up volume sriov-device-plugin-token-fdrx9 for pod df977e9c-4911-405c-89fd-15bac7d6e118 at /var/lib/kubelet/pods/df977e9c-4911-405c-89fd-15bac7d6e118/volumes/kubernetes.io~secret/sriov-device-plugin-token-fdrx9 Jul 22 04:24:28 mxq949065g hyperkube[4018]: I0722 04:24:28.385953 4018 configmap.go:212] Received configMap openshift-sriov-network-operator/device-plugin-config containing (17) pieces of data, 4224 total bytes Jul 22 04:24:28 mxq949065g hyperkube[4018]: I0722 04:24:28.385985 4018 secret.go:207] Received secret openshift-sriov-network-operator/sriov-device-plugin-token-fdrx9 containing (4) pieces of data, 16817 total bytes Jul 
22 04:24:30 mxq949065g hyperkube[4018]: I0722 04:24:30.355024 4018 kubelet_pods.go:1486] Generating status for "sriov-network-config-daemon-pkknz_openshift-sriov-network-operator(4fe50017-47b5-46e9-822c-2f7989352eb1)" Jul 22 04:24:30 mxq949065g hyperkube[4018]: I0722 04:24:30.355107 4018 status_manager.go:429] Ignoring same status for pod "sriov-network-config-daemon-pkknz_openshift-sriov-network-operator(4fe50017-47b5-46e9-822c-2f7989352eb1)", status: {Phase:Running Conditions:[{Type:Initialized Status:True LastProbeTime:0001-01-01 00:00:00 +0000 UTC LastTransitionTime:2021-06-15 21:31:35 +0000 UTC Reason: Message:} {Type:Ready Status:True LastProbeTime:0001-01-01 00:00:00 +0000 UTC LastTransitionTime:2021-07-16 20:20:26 +0000 UTC Reason: Message:} {Type:ContainersReady Status:True LastProbeTime:0001-01-01 00:00:00 +0000 UTC LastTransitionTime:2021-07-16 20:20:26 +0000 UTC Reason: Message:} {Type:PodScheduled Status:True LastProbeTime:0001-01-01 00:00:00 +0000 UTC LastTransitionTime:2021-06-15 21:31:35 +0000 UTC Reason: Message:}] Message: Reason: NominatedNodeName: HostIP:172.18.243.53 PodIP:172.18.243.53 PodIPs:[{IP:172.18.243.53}] StartTime:2021-06-15 21:31:35 +0000 UTC InitContainerStatuses:[] ContainerStatuses:[{Name:sriov-network-config-daemon State:{Waiting:nil Running:&ContainerStateRunning{StartedAt:2021-07-16 20:20:25 +0000 UTC,} Terminated:nil} LastTerminationState:{Waiting:nil Running:nil Terminated:nil} Ready:true RestartCount:0 Image:registry.redhat.io/openshift4/ose-sriov-network-config-daemon@sha256:3142b5cadfc8c99e8c5f2b67fc06bc85a5ba69c162fe75521b197d7eeea3d23b ImageID:registry.redhat.io/openshift4/ose-sriov-network-config-daemon@sha256:3142b5cadfc8c99e8c5f2b67fc06bc85a5ba69c162fe75521b197d7eeea3d23b ContainerID:cri-o://ef9b47bd5b1b798ea6526ece625b51a2dd8daaba37e2eb0793037f00b4f6b5a4 Started:0xc004103cbe}] QOSClass:BestEffort EphemeralContainerStatuses:[]} Jul 22 04:24:30 mxq949065g hyperkube[4018]: I0722 04:24:30.355242 4018 volume_manager.go:372] Waiting for volumes to attach and mount for pod "sriov-network-config-daemon-pkknz_openshift-sriov-network-operator(4fe50017-47b5-46e9-822c-2f7989352eb1)" Jul 22 04:24:30 mxq949065g hyperkube[4018]: I0722 04:24:30.355282 4018 volume_manager.go:403] All volumes are attached and mounted for pod "sriov-network-config-daemon-pkknz_openshift-sriov-network-operator(4fe50017-47b5-46e9-822c-2f7989352eb1)" Jul 22 04:24:30 mxq949065g hyperkube[4018]: I0722 04:24:30.355392 4018 kuberuntime_manager.go:664] computePodActions got {KillPod:false CreateSandbox:false SandboxID:4759c4dd03b4753faaa1d1d84b0a9a48bb56b6df3f6e9bb91d107dd39d3019f1 Attempt:0 NextInitContainerToStart:nil ContainersToStart:[] ContainersToKill:map[] EphemeralContainersToStart:[]} for pod "sriov-network-config-daemon-pkknz_openshift-sriov-network-operator(4fe50017-47b5-46e9-822c-2f7989352eb1)" Jul 22 04:24:30 mxq949065g hyperkube[4018]: I0722 04:24:30.391622 4018 secret.go:183] Setting up volume sriov-network-config-daemon-token-jf9s2 for pod 4fe50017-47b5-46e9-822c-2f7989352eb1 at /var/lib/kubelet/pods/4fe50017-47b5-46e9-822c-2f7989352eb1/volumes/kubernetes.io~secret/sriov-network-config-daemon-token-jf9s2 Jul 22 04:24:30 mxq949065g hyperkube[4018]: I0722 04:24:30.391656 4018 secret.go:207] Received secret openshift-sriov-network-operator/sriov-network-config-daemon-token-jf9s2 containing (4) pieces of data, 16849 total bytes Jul 22 04:25:23 mxq949065g hyperkube[4018]: I0722 04:25:23.389469 4018 factory.go:209] Factory "systemd" can handle container 
"/system.slice/var-lib-kubelet-pods-4fe50017\\x2d47b5\\x2d46e9\\x2d822c\\x2d2f7989352eb1-volumes-kubernetes.io\\x7esecret-sriov\\x2dnetwork\\x2dconfig\\x2ddaemon\\x2dtoken\\x2djf9s2.mount", but ignoring. Jul 22 04:25:23 mxq949065g hyperkube[4018]: I0722 04:25:23.389581 4018 factory.go:209] Factory "systemd" can handle container "/system.slice/var-lib-kubelet-pods-df977e9c\\x2d4911\\x2d405c\\x2d89fd\\x2d15bac7d6e118-volumes-kubernetes.io\\x7esecret-sriov\\x2ddevice\\x2dplugin\\x2dtoken\\x2dfdrx9.mount", but ignoring. Jul 22 04:25:23 mxq949065g hyperkube[4018]: I0722 04:25:23.390372 4018 factory.go:209] Factory "systemd" can handle container "/system.slice/var-lib-kubelet-pods-8f5e0a96\\x2dece8\\x2d4277\\x2db76e\\x2d2c2791938e2a-volumes-kubernetes.io\\x7esecret-sriov\\x2dcni\\x2dtoken\\x2dj6whh.mount", but ignoring. Jul 22 04:25:33 mxq949065g hyperkube[4018]: I0722 04:25:33.355168 4018 kubelet_pods.go:1486] Generating status for "sriov-cni-9z9v7_openshift-sriov-network-operator(8f5e0a96-ece8-4277-b76e-2c2791938e2a)" Jul 22 04:25:33 mxq949065g hyperkube[4018]: I0722 04:25:33.355558 4018 status_manager.go:429] Ignoring same status for pod "sriov-cni-9z9v7_openshift-sriov-network-operator(8f5e0a96-ece8-4277-b76e-2c2791938e2a)", status: {Phase:Running Conditions:[{Type:Initialized Status:True LastProbeTime:0001-01-01 00:00:00 +0000 UTC LastTransitionTime:2021-06-15 20:58:35 +0000 UTC Reason: Message:} {Type:Ready Status:True LastProbeTime:0001-01-01 00:00:00 +0000 UTC LastTransitionTime:2021-07-16 20:20:38 +0000 UTC Reason: Message:} {Type:ContainersReady Status:True LastProbeTime:0001-01-01 00:00:00 +0000 UTC LastTransitionTime:2021-07-16 20:20:38 +0000 UTC Reason: Message:} {Type:PodScheduled Status:True LastProbeTime:0001-01-01 00:00:00 +0000 UTC LastTransitionTime:2021-06-15 20:58:35 +0000 UTC Reason: Message:}] Message: Reason: NominatedNodeName: HostIP:172.18.243.53 PodIP:1.1.98.144 PodIPs:[{IP:1.1.98.144}] StartTime:2021-06-15 20:58:35 +0000 UTC InitContainerStatuses:[] ContainerStatuses:[{Name:sriov-cni State:{Waiting:nil Running:&ContainerStateRunning{StartedAt:2021-07-16 20:20:37 +0000 UTC,} Terminated:nil} LastTerminationState:{Waiting:nil Running:nil Terminated:nil} Ready:true RestartCount:0 Image:registry.redhat.io/openshift4/ose-sriov-cni@sha256:3233715e4309c176955fcf5fbc458d2599bb89691d74e7e69aea4312890144a1 ImageID:registry.redhat.io/openshift4/ose-sriov-cni@sha256:3233715e4309c176955fcf5fbc458d2599bb89691d74e7e69aea4312890144a1 ContainerID:cri-o://c0d441730f8a89104d0935aedfcbe0c1a2444cfc88088a021775dfa5a35f7b53 Started:0xc00190a736} {Name:sriov-infiniband-cni State:{Waiting:nil Running:&ContainerStateRunning{StartedAt:2021-07-16 20:20:38 +0000 UTC,} Terminated:nil} LastTerminationState:{Waiting:nil Running:nil Terminated:nil} Ready:true RestartCount:0 Image:registry.redhat.io/openshift4/ose-sriov-infiniband-cni@sha256:89043b617d0d88f578dc134f716134507e54fe0e706f73c7cba5f9a8cd40a431 ImageID:registry.redhat.io/openshift4/ose-sriov-infiniband-cni@sha256:89043b617d0d88f578dc134f716134507e54fe0e706f73c7cba5f9a8cd40a431 ContainerID:cri-o://afe470fa487b638cec02a314756cbca79d7f392e8372d7d93bcad9c43b4c6d79 Started:0xc00190a737}] QOSClass:BestEffort EphemeralContainerStatuses:[]} Jul 22 04:25:33 mxq949065g hyperkube[4018]: I0722 04:25:33.355736 4018 volume_manager.go:372] Waiting for volumes to attach and mount for pod "sriov-cni-9z9v7_openshift-sriov-network-operator(8f5e0a96-ece8-4277-b76e-2c2791938e2a)" Jul 22 04:25:33 mxq949065g hyperkube[4018]: I0722 04:25:33.355908 4018 
kuberuntime_manager.go:664] computePodActions got {KillPod:false CreateSandbox:false SandboxID:264241a135254296635ab62ba95952e3b43a652cd2c25ad0eb3f60858b9afc4d Attempt:0 NextInitContainerToStart:nil ContainersToStart:[] ContainersToKill:map[] EphemeralContainersToStart:[]} for pod "sriov-cni-9z9v7_openshift-sriov-network-operator(8f5e0a96-ece8-4277-b76e-2c2791938e2a)" Jul 22 04:25:33 mxq949065g hyperkube[4018]: I0722 04:25:33.527238 4018 secret.go:183] Setting up volume sriov-cni-token-j6whh for pod 8f5e0a96-ece8-4277-b76e-2c2791938e2a at /var/lib/kubelet/pods/8f5e0a96-ece8-4277-b76e-2c2791938e2a/volumes/kubernetes.io~secret/sriov-cni-token-j6whh Jul 22 04:25:33 mxq949065g hyperkube[4018]: I0722 04:25:33.527283 4018 secret.go:207] Received secret openshift-sriov-network-operator/sriov-cni-token-j6whh containing (4) pieces of data, 16777 total bytes Jul 22 04:25:43 mxq949065g hyperkube[4018]: I0722 04:25:43.355019 4018 kubelet_pods.go:1486] Generating status for "sriov-device-plugin-wtflv_openshift-sriov-network-operator(df977e9c-4911-405c-89fd-15bac7d6e118)" Jul 22 04:25:43 mxq949065g hyperkube[4018]: I0722 04:25:43.355692 4018 status_manager.go:429] Ignoring same status for pod "sriov-device-plugin-wtflv_openshift-sriov-network-operator(df977e9c-4911-405c-89fd-15bac7d6e118)", status: {Phase:Running Conditions:[{Type:Initialized Status:True LastProbeTime:0001-01-01 00:00:00 +0000 UTC LastTransitionTime:2021-07-16 20:21:31 +0000 UTC Reason: Message:} {Type:Ready Status:True LastProbeTime:0001-01-01 00:00:00 +0000 UTC LastTransitionTime:2021-07-16 20:21:32 +0000 UTC Reason: Message:} {Type:ContainersReady Status:True LastProbeTime:0001-01-01 00:00:00 +0000 UTC LastTransitionTime:2021-07-16 20:21:32 +0000 UTC Reason: Message:} {Type:PodScheduled Status:True LastProbeTime:0001-01-01 00:00:00 +0000 UTC LastTransitionTime:2021-07-16 20:21:31 +0000 UTC Reason: Message:}] Message: Reason: NominatedNodeName: HostIP:172.18.243.53 PodIP:172.18.243.53 PodIPs:[{IP:172.18.243.53}] StartTime:2021-07-16 20:21:31 +0000 UTC InitContainerStatuses:[] ContainerStatuses:[{Name:sriov-device-plugin State:{Waiting:nil Running:&ContainerStateRunning{StartedAt:2021-07-16 20:21:32 +0000 UTC,} Terminated:nil} LastTerminationState:{Waiting:nil Running:nil Terminated:nil} Ready:true RestartCount:0 Image:registry.redhat.io/openshift4/ose-sriov-network-device-plugin@sha256:04d49ffe894ff992626d17f32688c6ab3e17b41ae5b7eebabf33a11cfffc01f1 ImageID:registry.redhat.io/openshift4/ose-sriov-network-device-plugin@sha256:04d49ffe894ff992626d17f32688c6ab3e17b41ae5b7eebabf33a11cfffc01f1 ContainerID:cri-o://0ac4492d2cbd2e3f2910a8714bef76752da37062b4b2f96190fac49815924eda Started:0xc00410f00a}] QOSClass:BestEffort EphemeralContainerStatuses:[]} Jul 22 04:25:43 mxq949065g hyperkube[4018]: I0722 04:25:43.355856 4018 volume_manager.go:372] Waiting for volumes to attach and mount for pod "sriov-device-plugin-wtflv_openshift-sriov-network-operator(df977e9c-4911-405c-89fd-15bac7d6e118)" Jul 22 04:25:43 mxq949065g hyperkube[4018]: I0722 04:25:43.355923 4018 volume_manager.go:403] All volumes are attached and mounted for pod "sriov-device-plugin-wtflv_openshift-sriov-network-operator(df977e9c-4911-405c-89fd-15bac7d6e118)" Jul 22 04:25:43 mxq949065g hyperkube[4018]: I0722 04:25:43.356046 4018 kuberuntime_manager.go:664] computePodActions got {KillPod:false CreateSandbox:false SandboxID:49284991864db3710fd5a75e235afe1efe0152900d511945e562d4eea2d0c970 Attempt:0 NextInitContainerToStart:nil ContainersToStart:[] ContainersToKill:map[] 
EphemeralContainersToStart:[]} for pod "sriov-device-plugin-wtflv_openshift-sriov-network-operator(df977e9c-4911-405c-89fd-15bac7d6e118)" Jul 22 04:25:43 mxq949065g hyperkube[4018]: I0722 04:25:43.464253 4018 secret.go:183] Setting up volume sriov-device-plugin-token-fdrx9 for pod df977e9c-4911-405c-89fd-15bac7d6e118 at /var/lib/kubelet/pods/df977e9c-4911-405c-89fd-15bac7d6e118/volumes/kubernetes.io~secret/sriov-device-plugin-token-fdrx9 Jul 22 04:25:43 mxq949065g hyperkube[4018]: I0722 04:25:43.464343 4018 secret.go:207] Received secret openshift-sriov-network-operator/sriov-device-plugin-token-fdrx9 containing (4) pieces of data, 16817 total bytes Jul 22 04:25:43 mxq949065g hyperkube[4018]: I0722 04:25:43.464373 4018 configmap.go:212] Received configMap openshift-sriov-network-operator/device-plugin-config containing (17) pieces of data, 4224 total bytes Jul 22 04:25:51 mxq949065g hyperkube[4018]: I0722 04:25:51.355222 4018 kubelet_pods.go:1486] Generating status for "sriov-network-config-daemon-pkknz_openshift-sriov-network-operator(4fe50017-47b5-46e9-822c-2f7989352eb1)" Jul 22 04:25:51 mxq949065g hyperkube[4018]: I0722 04:25:51.355311 4018 status_manager.go:429] Ignoring same status for pod "sriov-network-config-daemon-pkknz_openshift-sriov-network-operator(4fe50017-47b5-46e9-822c-2f7989352eb1)", status: {Phase:Running Conditions:[{Type:Initialized Status:True LastProbeTime:0001-01-01 00:00:00 +0000 UTC LastTransitionTime:2021-06-15 21:31:35 +0000 UTC Reason: Message:} {Type:Ready Status:True LastProbeTime:0001-01-01 00:00:00 +0000 UTC LastTransitionTime:2021-07-16 20:20:26 +0000 UTC Reason: Message:} {Type:ContainersReady Status:True LastProbeTime:0001-01-01 00:00:00 +0000 UTC LastTransitionTime:2021-07-16 20:20:26 +0000 UTC Reason: Message:} {Type:PodScheduled Status:True LastProbeTime:0001-01-01 00:00:00 +0000 UTC LastTransitionTime:2021-06-15 21:31:35 +0000 UTC Reason: Message:}] Message: Reason: NominatedNodeName: HostIP:172.18.243.53 PodIP:172.18.243.53 PodIPs:[{IP:172.18.243.53}] StartTime:2021-06-15 21:31:35 +0000 UTC InitContainerStatuses:[] ContainerStatuses:[{Name:sriov-network-config-daemon State:{Waiting:nil Running:&ContainerStateRunning{StartedAt:2021-07-16 20:20:25 +0000 UTC,} Terminated:nil} LastTerminationState:{Waiting:nil Running:nil Terminated:nil} Ready:true RestartCount:0 Image:registry.redhat.io/openshift4/ose-sriov-network-config-daemon@sha256:3142b5cadfc8c99e8c5f2b67fc06bc85a5ba69c162fe75521b197d7eeea3d23b ImageID:registry.redhat.io/openshift4/ose-sriov-network-config-daemon@sha256:3142b5cadfc8c99e8c5f2b67fc06bc85a5ba69c162fe75521b197d7eeea3d23b ContainerID:cri-o://ef9b47bd5b1b798ea6526ece625b51a2dd8daaba37e2eb0793037f00b4f6b5a4 Started:0xc001f4cd8e}] QOSClass:BestEffort EphemeralContainerStatuses:[]} Jul 22 04:25:51 mxq949065g hyperkube[4018]: I0722 04:25:51.355464 4018 volume_manager.go:372] Waiting for volumes to attach and mount for pod "sriov-network-config-daemon-pkknz_openshift-sriov-network-operator(4fe50017-47b5-46e9-822c-2f7989352eb1)" Jul 22 04:25:51 mxq949065g hyperkube[4018]: I0722 04:25:51.355502 4018 volume_manager.go:403] All volumes are attached and mounted for pod "sriov-network-config-daemon-pkknz_openshift-sriov-network-operator(4fe50017-47b5-46e9-822c-2f7989352eb1)" Jul 22 04:25:51 mxq949065g hyperkube[4018]: I0722 04:25:51.355623 4018 kuberuntime_manager.go:664] computePodActions got {KillPod:false CreateSandbox:false SandboxID:4759c4dd03b4753faaa1d1d84b0a9a48bb56b6df3f6e9bb91d107dd39d3019f1 Attempt:0 NextInitContainerToStart:nil 
ContainersToStart:[] ContainersToKill:map[] EphemeralContainersToStart:[]} for pod "sriov-network-config-daemon-pkknz_openshift-sriov-network-operator(4fe50017-47b5-46e9-822c-2f7989352eb1)" Jul 22 04:25:51 mxq949065g hyperkube[4018]: I0722 04:25:51.492707 4018 secret.go:183] Setting up volume sriov-network-config-daemon-token-jf9s2 for pod 4fe50017-47b5-46e9-822c-2f7989352eb1 at /var/lib/kubelet/pods/4fe50017-47b5-46e9-822c-2f7989352eb1/volumes/kubernetes.io~secret/sriov-network-config-daemon-token-jf9s2 Jul 22 04:25:51 mxq949065g hyperkube[4018]: I0722 04:25:51.492744 4018 secret.go:207] Received secret openshift-sriov-network-operator/sriov-network-config-daemon-token-jf9s2 containing (4) pieces of data, 16849 total bytes Jul 22 04:26:21 mxq949065g hyperkube[4018]: I0722 04:26:21.208528 4018 reconciler.go:156] operationExecutor.RegisterPlugin started for plugin at "/var/lib/kubelet/plugins_registry/openshift.io_sriovleftdpdkintel.sock" (plugin details: &{/var/lib/kubelet/plugins_registry/openshift.io_sriovleftdpdkintel.sock 2021-07-16 20:20:23.377127926 +0000 UTC m=+11.781029437 <nil> }) Jul 22 04:26:21 mxq949065g hyperkube[4018]: I0722 04:26:21.208553 4018 reconciler.go:156] operationExecutor.RegisterPlugin started for plugin at "/var/lib/kubelet/plugins_registry/openshift.io_sriovrightdpdkintelx710.sock" (plugin details: &{/var/lib/kubelet/plugins_registry/openshift.io_sriovrightdpdkintelx710.sock 2021-07-16 20:20:23.37781788 +0000 UTC m=+11.781719392 <nil> }) Jul 22 04:26:21 mxq949065g hyperkube[4018]: I0722 04:26:21.208566 4018 reconciler.go:156] operationExecutor.RegisterPlugin started for plugin at "/var/lib/kubelet/plugins_registry/openshift.io_sriovleftdpdkintelx710.sock" (plugin details: &{/var/lib/kubelet/plugins_registry/openshift.io_sriovleftdpdkintelx710.sock 2021-07-16 20:20:23.377568698 +0000 UTC m=+11.781470210 <nil> }) Jul 22 04:26:21 mxq949065g hyperkube[4018]: I0722 04:26:21.208579 4018 reconciler.go:156] operationExecutor.RegisterPlugin started for plugin at "/var/lib/kubelet/plugins_registry/openshift.io_sriovrightdpdkintel.sock" (plugin details: &{/var/lib/kubelet/plugins_registry/openshift.io_sriovrightdpdkintel.sock 2021-07-16 20:20:23.377807904 +0000 UTC m=+11.781709415 <nil> }) Jul 22 04:26:21 mxq949065g hyperkube[4018]: I0722 04:26:21.208669 4018 passthrough.go:48] ccResolverWrapper: sending update to cc: {[{/var/lib/kubelet/plugins_registry/openshift.io_sriovleftdpdkintel.sock <nil> 0 <nil>}] <nil> <nil>} Jul 22 04:26:21 mxq949065g hyperkube[4018]: I0722 04:26:21.208707 4018 passthrough.go:48] ccResolverWrapper: sending update to cc: {[{/var/lib/kubelet/plugins_registry/openshift.io_sriovrightdpdkintelx710.sock <nil> 0 <nil>}] <nil> <nil>} Jul 22 04:26:21 mxq949065g hyperkube[4018]: W0722 04:26:21.208752 4018 clientconn.go:1223] grpc: addrConn.createTransport failed to connect to {/var/lib/kubelet/plugins_registry/openshift.io_sriovleftdpdkintel.sock <nil> 0 <nil>}. Err :connection error: desc = "transport: Error while dialing dial unix /var/lib/kubelet/plugins_registry/openshift.io_sriovleftdpdkintel.sock: connect: connection refused". Reconnecting... 
Jul 22 04:26:21 mxq949065g hyperkube[4018]: I0722 04:26:21.208669 4018 passthrough.go:48] ccResolverWrapper: sending update to cc: {[{/var/lib/kubelet/plugins_registry/openshift.io_sriovleftdpdkintelx710.sock <nil> 0 <nil>}] <nil> <nil>} Jul 22 04:26:21 mxq949065g hyperkube[4018]: I0722 04:26:21.208629 4018 passthrough.go:48] ccResolverWrapper: sending update to cc: {[{/var/lib/kubelet/plugins_registry/openshift.io_sriovrightdpdkintel.sock <nil> 0 <nil>}] <nil> <nil>} Jul 22 04:26:21 mxq949065g hyperkube[4018]: W0722 04:26:21.208833 4018 clientconn.go:1223] grpc: addrConn.createTransport failed to connect to {/var/lib/kubelet/plugins_registry/openshift.io_sriovrightdpdkintel.sock <nil> 0 <nil>}. Err :connection error: desc = "transport: Error while dialing dial unix /var/lib/kubelet/plugins_registry/openshift.io_sriovrightdpdkintel.sock: connect: connection refused". Reconnecting... Jul 22 04:26:21 mxq949065g hyperkube[4018]: W0722 04:26:21.208772 4018 clientconn.go:1223] grpc: addrConn.createTransport failed to connect to {/var/lib/kubelet/plugins_registry/openshift.io_sriovrightdpdkintelx710.sock <nil> 0 <nil>}. Err :connection error: desc = "transport: Error while dialing dial unix /var/lib/kubelet/plugins_registry/openshift.io_sriovrightdpdkintelx710.sock: connect: connection refused". Reconnecting... Jul 22 04:26:21 mxq949065g hyperkube[4018]: I0722 04:26:21.208858 4018 balancer_conn_wrappers.go:78] pickfirstBalancer: HandleSubConnStateChange: 0xc00130c640, {TRANSIENT_FAILURE connection error: desc = "transport: Error while dialing dial unix /var/lib/kubelet/plugins_registry/openshift.io_sriovrightdpdkintelx710.sock: connect: connection refused"}
> It seems the kernel failed to allocate memory for the PCI device with
> PCI ID 0000:37:02.5. Because the allocation failed, the corresponding entry
> could not be created in the sys filesystem. During the init stage every
> application tries to read the VF details from the sys filesystem and bind
> to it.

How did you know it failed to allocate the memory? Do we have the log?

> Jul 23 15:04:18 mxq949065g crio[3920]: time="2021-07-23 15:04:18.868024395Z"
> level=error msg="Error adding network:
> [w6017-c1-sl01-upflab1/eric-pc-up-data-plane-c7949486b-4jkx8:leftnuma0]:
> error adding container to network \"leftnuma0\": SRIOV-CNI failed to load
> netconf: LoadConf(): failed to get VF information: \"lstat
> /sys/bus/pci/devices/0000:37:02.5/physfn/net: no such file or directory\""

The above failure means that sriov-cni tried to get the PF name of the VF (0000:37:02.5), but the sysfs path (/sys/bus/pci/devices/0000:37:02.5/physfn/net) doesn't exist.

Did this happen the first time an SRIOV pod was created, or has the SRIOV pod been deleted and recreated more than once? If it ever worked, it may be that the SRIOV VF was not released gracefully, which results in a failure on the next pod creation.

Also, could you check whether the PF of the VF (0000:37:02.5) exists in the host namespace on the worker node? (Log in to the node and run "ip link show ens2f0".)

What is the deviceType in the SriovNetworkNodePolicy?
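A quick way to pull the deviceType from all applied policies; a sketch, with the namespace taken from this report:

~~~~
# List every SriovNetworkNodePolicy and its deviceType
oc get sriovnetworknodepolicy -n openshift-sriov-network-operator \
  -o custom-columns=NAME:.metadata.name,DEVICETYPE:.spec.deviceType
~~~~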
==> How did you know it failed to allocate the memory? Do we have the log?

Jul 23 15:04:00 mxq949065g hyperkube[3980]: I0723 15:04:00.314229 3980 manager.go:647] Found pre-allocated devices for resource openshift.io/leftnuma0x710 container "eric-pc-up-data-plane" in Pod "946025e3-be22-4d15-b0d1-f3143973beab": [0000:37:02.5]

==> Did this happen the first time an SRIOV pod was created, or has the SRIOV pod been deleted and recreated more than once? If it ever worked, it may be that the SRIOV VF was not released gracefully, which results in a failure on the next pod creation.

The issue happened multiple times.

**
apiVersion: sriovnetwork.openshift.io/v1
kind: SriovNetworkNodePolicy
metadata:
  labels:
    argocd.argoproj.io/instance: sriov-policy
  name: leftnuma0x710
  namespace: openshift-sriov-network-operator
spec:
  deviceType: vfio-pci    <--------
  isRdma: false
  mtu: 9000
  nicSelector:
    deviceID: "1572"
    pfNames:
    - ens2f0              <----------
    vendor: "8086"
  nodeSelector:
    feature.node.kubernetes.io/network-sriov.capable: "true"
  numVfs: 8
  resourceName: leftnuma0x710
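Since deviceType is vfio-pci, the VFs in this pool should end up bound to the vfio-pci driver rather than iavf. One way to verify the binding on the worker node; a sketch using the VF address from this report:

~~~~
# The driver symlink shows which kernel driver the VF is currently bound to
readlink /sys/bus/pci/devices/0000:37:02.5/driver

# All PCI devices currently claimed by vfio-pci
ls /sys/bus/pci/drivers/vfio-pci/ | grep 0000:37
~~~~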
Any Update?
(In reply to Eswar Vadla from comment #4)
> ==> How did you know it failed to allocate the memory? Do we have the log?
> Jul 23 15:04:00 mxq949065g hyperkube[3980]: I0723 15:04:00.314229 3980
> manager.go:647] Found pre-allocated devices for resource
> openshift.io/leftnuma0x710 container "eric-pc-up-data-plane" in Pod
> "946025e3-be22-4d15-b0d1-f3143973beab": [0000:37:02.5]

This message is about device allocation; I don't see an obvious issue with it. I think kubelet tried to re-use the pre-allocated device to recover the SRIOV pod after a pod restart or node reboot.

> ==> Did this happen the first time an SRIOV pod was created, or has the
> SRIOV pod been deleted and recreated more than once? If it ever worked, it
> may be that the SRIOV VF was not released gracefully, which results in a
> failure on the next pod creation.
> The issue happened multiple times.
>
> **
> apiVersion: sriovnetwork.openshift.io/v1
> kind: SriovNetworkNodePolicy
> metadata:
>   labels:
>     argocd.argoproj.io/instance: sriov-policy
>   name: leftnuma0x710
>   namespace: openshift-sriov-network-operator
> spec:
>   deviceType: vfio-pci    <--------
>   isRdma: false
>   mtu: 9000
>   nicSelector:
>     deviceID: "1572"
>     pfNames:
>     - ens2f0              <----------

Could you get the following info on the target worker node?

Check if ens2f0 exists:
# ip link show ens2f0
# ethtool -i ens2f0

Check the network interfaces:
# ls /sys/class/net

Stop the pod creation and check the VF info:
# ls /sys/bus/pci/devices/0000:37:02.5/physfn/net/

I guess ens2f0 is the PF of the VF (0000:37:02.5); the failure in SRIOV pod creation indicates that this PF doesn't exist on the target worker node whereas it should. For example:

# ls /sys/bus/pci/devices/0000\:3b\:0a.2/physfn/net/
ens1f1

I used the output from a local environment where ens1f1 is the PF and 0000\:3b\:0a.2 is the VF PCI address.

Could you also get the sriov device plugin log on the target worker?

Find the sriov device plugin pod name for the target worker:
# oc get pods -n openshift-sriov-network-operator -o wide

Get the log:
# oc logs -f <sriov-network-device-plugin-name> -n openshift-sriov-network-operator
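If helpful, the "Found pre-allocated devices" message comes from kubelet's device manager, which records earlier allocations in its checkpoint file; comparing that record with what actually exists in sysfs can show whether kubelet is trying to reuse a VF that is no longer present. A sketch, assuming the default kubelet checkpoint location and the VF address from this report:

~~~~
# Pre-allocated devices recorded by kubelet's device manager (JSON)
cat /var/lib/kubelet/device-plugins/kubelet_internal_checkpoint

# Does the recorded VF still exist in sysfs?
ls -d /sys/bus/pci/devices/0000:37:02.5
~~~~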
Hi Zenghui,

Below is the requested information. The issue is intermittent, so it has nothing to do with non-existing devices. I can provide the information you're asking for, but it might not reflect the problem situation, because we had temporarily fixed the issue so the customer doesn't have to sit in a broken state before we have a solution. Please see my comments below.

Check if ens2f0 exists:

ip link show ens2f0
8: ens2f0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 9000 qdisc mq state UP mode DEFAULT group default qlen 1000
    link/ether 48:df:37:b2:78:d0 brd ff:ff:ff:ff:ff:ff
    vf 0 link/ether 26:9f:ec:1f:3f:72 brd ff:ff:ff:ff:ff:ff, spoof checking on, link-state auto, trust off
    vf 1 link/ether 72:37:1f:c6:b7:cf brd ff:ff:ff:ff:ff:ff, spoof checking on, link-state auto, trust off
    vf 2 link/ether 22:8b:85:ce:5a:a5 brd ff:ff:ff:ff:ff:ff, spoof checking on, link-state auto, trust off
    vf 3 link/ether 52:85:1f:62:e0:fd brd ff:ff:ff:ff:ff:ff, spoof checking on, link-state auto, trust off
    vf 4 link/ether 62:a0:aa:5e:a2:7f brd ff:ff:ff:ff:ff:ff, spoof checking on, link-state auto, trust off
    vf 5 link/ether 6e:fc:11:28:b7:f1 brd ff:ff:ff:ff:ff:ff, spoof checking on, link-state auto, trust off
    vf 6 link/ether 26:64:ea:5d:ba:b3 brd ff:ff:ff:ff:ff:ff, spoof checking on, link-state auto, trust off
    vf 7 link/ether 72:9d:3b:05:93:d2 brd ff:ff:ff:ff:ff:ff, spoof checking on, link-state auto, trust off

ethtool -i ens2f0
driver: i40e
version: 2.8.20-k
firmware-version: 10.5.5
expansion-rom-version:
bus-info: 0000:37:00.0
supports-statistics: yes
supports-test: yes
supports-eeprom-access: yes
supports-register-dump: yes
supports-priv-flags: yes

Check the network interfaces:

ls /sys/class/net
bond0 bond1.728 bond1.733 bond1.738 bond1.751 bond1.784 bond1.789 bond1.850 bond1.855 cali6c8e1dd90c6 egress.calico eno5 ens2f1 ens3f0v3 ens3f1 ens3f1v4 gre0
bond0.220 bond1.729 bond1.734 bond1.739 bond1.753 bond1.785 bond1.790 bond1.851 bonding_masters cali77c44e2f94a eno1 eno6 ens3f0 ens3f0v4 ens3f1v0 ens3f1v5 gretap0
bond1 bond1.730 bond1.735 bond1.740 bond1.754 bond1.786 bond1.791 bond1.852 cali2fd27eaea1f cali81281ea782c eno2 ens1f0 ens3f0v0 ens3f0v5 ens3f1v1 ens3f1v6 lo
bond1.726 bond1.731 bond1.736 bond1.741 bond1.782 bond1.787 bond1.792 bond1.853 cali43cdd2d7f48 cali993aa2b9bd3 eno3 ens1f1 ens3f0v1 ens3f0v6 ens3f1v2 ens3f1v7
bond1.727 bond1.732 bond1.737 bond1.750 bond1.783 bond1.788 bond1.849 bond1.854 cali61f51780f95 calid9e1f13482f eno4 ens2f0 ens3f0v2 ens3f0v7 ens3f1v3 erspan0

Stop the pod creation and check the VF info:

ls /sys/bus/pci/devices/0000:37:02.5/physfn/net/
ens2f0

> I guess ens2f0 is the PF of the VF (0000:37:02.5); the failure in SRIOV pod
> creation indicates that this PF doesn't exist on the target worker node
> whereas it should. For example:
> # ls /sys/bus/pci/devices/0000:3b:0a.2/physfn/net/
> ens1f1
> I used the output from a local environment where ens1f1 is the PF and
> 0000:3b:0a.2 is the VF PCI address.
> Could you also get the sriov device plugin log on the target worker?
> Find the sriov device plugin pod name for the target worker:
> oc get pods -n openshift-sriov-network-operator -o wide
> Get the log:
> oc logs -f <sriov-network-device-plugin-name> -n openshift-sriov-network-operator

Please see the attached log file: sriov-device-plugin-mxq949065g.log
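As a cross-check on the PF side (bus-info 0000:37:00.0 from the ethtool output above), the virtfn symlinks under the PF list every VF the kernel has created for it; a sketch:

~~~~
# Each virtfnN symlink under the PF points at one VF's PCI address
ls -l /sys/bus/pci/devices/0000:37:00.0/ | grep virtfn
~~~~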
Created attachment 1827133 [details] SRIOV logs
According to the logs from the sriov network device plugin pod, you have the same devices added to two separate pools; for example, VF "0000:37:02.0" is in the leftnuma0 pool and also in the leftnuma0x710 pool. Why do we need such a configuration?

I0726 20:44:46.858980 25 manager.go:116] Creating new ResourcePool: leftnuma0
I0726 20:44:46.858982 25 manager.go:117] DeviceType: netDevice
I0726 20:44:46.866533 25 factory.go:106] device added: [pciAddr: 0000:37:02.0, vendor: 8086, device: 154c, driver: vfio-pci]
I0726 20:44:46.866542 25 factory.go:106] device added: [pciAddr: 0000:37:02.1, vendor: 8086, device: 154c, driver: vfio-pci]
I0726 20:44:46.866545 25 factory.go:106] device added: [pciAddr: 0000:37:02.2, vendor: 8086, device: 154c, driver: vfio-pci]
I0726 20:44:46.866547 25 factory.go:106] device added: [pciAddr: 0000:37:02.3, vendor: 8086, device: 154c, driver: vfio-pci]
I0726 20:44:46.866549 25 factory.go:106] device added: [pciAddr: 0000:37:02.4, vendor: 8086, device: 154c, driver: vfio-pci]
I0726 20:44:46.866552 25 factory.go:106] device added: [pciAddr: 0000:37:02.5, vendor: 8086, device: 154c, driver: vfio-pci]
I0726 20:44:46.866555 25 factory.go:106] device added: [pciAddr: 0000:37:02.6, vendor: 8086, device: 154c, driver: vfio-pci]
I0726 20:44:46.866557 25 factory.go:106] device added: [pciAddr: 0000:37:02.7, vendor: 8086, device: 154c, driver: vfio-pci]
I0726 20:44:46.866569 25 manager.go:145] New resource server is created for leftnuma0 ResourcePool
I0726 20:44:46.866573 25 manager.go:115]
I0726 20:44:46.866575 25 manager.go:116] Creating new ResourcePool: leftnuma0x710
I0726 20:44:46.866578 25 manager.go:117] DeviceType: netDevice
I0726 20:44:46.873739 25 factory.go:106] device added: [pciAddr: 0000:37:02.0, vendor: 8086, device: 154c, driver: vfio-pci]
I0726 20:44:46.873749 25 factory.go:106] device added: [pciAddr: 0000:37:02.1, vendor: 8086, device: 154c, driver: vfio-pci]
I0726 20:44:46.873751 25 factory.go:106] device added: [pciAddr: 0000:37:02.2, vendor: 8086, device: 154c, driver: vfio-pci]
I0726 20:44:46.873754 25 factory.go:106] device added: [pciAddr: 0000:37:02.3, vendor: 8086, device: 154c, driver: vfio-pci]
I0726 20:44:46.873756 25 factory.go:106] device added: [pciAddr: 0000:37:02.4, vendor: 8086, device: 154c, driver: vfio-pci]
I0726 20:44:46.873759 25 factory.go:106] device added: [pciAddr: 0000:37:02.5, vendor: 8086, device: 154c, driver: vfio-pci]
I0726 20:44:46.873761 25 factory.go:106] device added: [pciAddr: 0000:37:02.6, vendor: 8086, device: 154c, driver: vfio-pci]
I0726 20:44:46.873763 25 factory.go:106] device added: [pciAddr: 0000:37:02.7, vendor: 8086, device: 154c, driver: vfio-pci]
I0726 20:44:46.873770 25 manager.go:145] New resource server is created for leftnuma0x710 ResourcePool

The above may not be directly related to the error, but it should be fixed.

Regarding the "0000:37:02.5/physfn/net: no such file or directory" error, did it happen after a node reboot? If so, what may have happened is that kubelet tried to recover the customer SRIOV pod with pre-allocated devices (recorded in the kubelet checkpoint file before rebooting), but at the time of recovery the SRIOV devices hadn't been created yet (SRIOV devices are created by the sriov-config-daemon pod, and it is not guaranteed that the sriov-config-daemon pod is up and running before the customer SRIOV pod).

Could you capture the logs for both the sriov-config-daemon pod and the CNI error message when the issue happens again? I think we can compare the timestamps to see if that's the case.
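A sketch of how to double-check the overlapping pools and collect the daemon-side logs with timestamps for comparison; the ConfigMap name appears in the kubelet logs earlier in this report, and the pod name is a placeholder:

~~~~
# The generated device plugin config shows which PCI addresses each resource pool selects
oc get configmap device-plugin-config -n openshift-sriov-network-operator -o yaml

# Collect sriov-config-daemon logs with timestamps so they can be lined up with the CNI error
oc logs <sriov-network-config-daemon-pod> -n openshift-sriov-network-operator --timestamps
~~~~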
@zenghui.shi

To be clear, the SRIOV issue we hit recently was slightly different from the one originally reported.

1) The originally reported one. It happened on all nodes with SRIOV interfaces (at different times). Symptoms:
- when listed with lspci, some of the VFs were not activated: 0000:37:02.5 - 0000:37:02.7;
- when describing the node, we saw the SRIOV capacity as 0:
~~~~
Capacity:
  openshift.io/leftnuma0:               0
  openshift.io/leftnuma0x710:           0
  openshift.io/leftnuma1:               0
  openshift.io/leftnuma1x710:           0
  openshift.io/rightnuma0:              0
  openshift.io/rightnuma0x710:          0
  openshift.io/rightnuma1:              0
  openshift.io/rightnuma1x710:          0
  openshift.io/sriovleftdpdkintel:      0
  openshift.io/sriovleftdpdkintelx710:  0
  openshift.io/sriovrightdpdkintel:     0
  openshift.io/sriovrightdpdkintelx710: 0
~~~~

2) The one recently hit on Sep. 27th. It happened on one node with a 25GbE SRIOV interface (Intel XXV710). Symptoms:
- (no issue) when listed with lspci, all VFs were activated (0 - 7);
- (no issue) when describing the node, we saw the SRIOV capacity as 8;
- (no issue) the sysfs VF count was 8:
  cat /sys/bus/pci/devices/0000:37:00.0/sriov_numvfs
  cat /sys/bus/pci/devices/0000:37:00.1/sriov_numvfs
- (no issue) both drivers were loaded properly: iavf and vfio_pci
- (issue) /dev/vfio was empty (see the /dev/vfio check below), so the pod failed to start with an error like the following:
~~~~
Sep 27 09:30:53 mxq1010l8g hyperkube[4529]: E0927 09:30:53.653468 4529 pod_workers.go:191] Error syncing pod 7eca0638-5b0d-4b6a-88c3-e72d6e5e403e ("eric-pc-up-data-plane-cc5c88746-x9k7r_w6017-c1-sl01-upflab3(7eca0638-5b0d-4b6a-88c3-e72d6e5e403e)"), skipping: failed to "StartContainer" for "eric-pc-up-data-plane" with CreateContainerError: "lstat /dev/vfio/114: no such file or directory"
~~~~
(please see attached logs: mxq1010l8g_sriov_err_20210927_0800_to_1800.logaa.tar.gz, mxq1010l8g_sriov_err_20210927_0800_to_1800.logab.tar.gz, mxq1010l8g_sriov_err_20210927_0800_to_1800.logac.tar.gz)
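For the Sep. 27th symptom, the /dev/vfio/<N> node that the container needs corresponds to the VF's IOMMU group. A minimal sketch for checking that mapping on the node, assuming the VF address 0000:37:02.5 from this report (the group number 114 comes from the error message and may differ per VF):

~~~~
# Which IOMMU group the VF belongs to; the trailing number should match /dev/vfio/<N>
readlink /sys/bus/pci/devices/0000:37:02.5/iommu_group

# Character devices created by vfio: one per bound IOMMU group, plus /dev/vfio/vfio
ls -l /dev/vfio/
~~~~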
Created attachment 1827824 [details] sriov error kubelet log for sep. 27th part 1 sriov error kubelet log for sep. 27th part 1
Created attachment 1827825 [details] sriov error kubelet log for sep. 27th part 2 sriov error kubelet log for sep. 27th part 2
Created attachment 1827826 [details] sriov error kubelet log for sep. 27th part 3 sriov error kubelet log for sep. 27th part 3
@zenghui.shi please find the following version information:
OCP: 4.6.13
CoreOS: 46.82.202101191342-0
SRIOV operator: sriov-network-operator.4.6.0-202106032244
@zenghui.shi the node that recently hit the SRIOV issue was not rebooted recently; its last reboot was on July 26th, 2021:
~~~~
sh-4.4# date
Thu Sep 30 23:31:49 UTC 2021
sh-4.4# last
core     pts/0        192.168.137.252  Mon Sep 27 17:53 - 18:11  (00:17)
core     pts/0        192.168.137.252  Mon Sep 27 17:51 - 17:53  (00:01)
core     pts/0        192.168.137.252  Tue Sep  7 18:47 - 18:47  (00:00)
reboot   system boot  4.18.0-193.40.1. Mon Jul 26 20:59   still running
reboot   system boot  4.18.0-193.40.1. Mon Jul 26 20:51 - 20:56  (00:05)
reboot   system boot  4.18.0-193.40.1. Mon Jul 26 20:41 - 20:48  (00:07)
reboot   system boot  4.18.0-193.40.1. Mon Jul 26 20:32 - 20:37  (00:05)
~~~~
(In reply to ptang from comment #10)
> @zenghui.shi
>
> To be clear, the SRIOV issue we hit recently was slightly different from
> the one originally reported.
> 1) The originally reported one. It happened on all nodes with SRIOV
> interfaces (at different times). Symptoms:
> - when listed with lspci, some of the VFs were not activated: 0000:37:02.5 -
> 0000:37:02.7;
> - when describing the node, we saw the SRIOV capacity as 0:
> ~~~~
> Capacity:
>   openshift.io/leftnuma0:               0
>   openshift.io/leftnuma0x710:           0
>   openshift.io/leftnuma1:               0
>   openshift.io/leftnuma1x710:           0
>   openshift.io/rightnuma0:              0
>   openshift.io/rightnuma0x710:          0
>   openshift.io/rightnuma1:              0
>   openshift.io/rightnuma1x710:          0
>   openshift.io/sriovleftdpdkintel:      0
>   openshift.io/sriovleftdpdkintelx710:  0
>   openshift.io/sriovrightdpdkintel:     0
>   openshift.io/sriovrightdpdkintelx710: 0
> ~~~~

This symptom looks different from what was originally described. When the VFs are present on the target node but not reported in the node status, it could be that the sriov device plugin didn't discover those devices or hasn't reported them back to kubelet yet. We can check the sriov device plugin log while the issue is happening (getting the log before or after the issue may not tell us exactly what we want to see).

> 2) The one recently hit on Sep. 27th. It happened on one node with a 25GbE
> SRIOV interface (Intel XXV710). Symptoms:
> - (no issue) when listed with lspci, all VFs were activated (0 - 7);
> - (no issue) when describing the node, we saw the SRIOV capacity as 8;
> - (no issue) the sysfs VF count was 8:
>   cat /sys/bus/pci/devices/0000:37:00.0/sriov_numvfs
>   cat /sys/bus/pci/devices/0000:37:00.1/sriov_numvfs
> - (no issue) both drivers were loaded properly: iavf and vfio_pci
> - (issue) /dev/vfio was empty, so the pod failed to start with an error
> like the following
> ~~~~
> Sep 27 09:30:53 mxq1010l8g hyperkube[4529]: E0927 09:30:53.653468
> 4529 pod_workers.go:191] Error syncing pod
> 7eca0638-5b0d-4b6a-88c3-e72d6e5e403e
> ("eric-pc-up-data-plane-cc5c88746-x9k7r_w6017-c1-sl01-upflab3(7eca0638-5b0d-
> 4b6a-88c3-e72d6e5e403e)"), skipping: failed to "StartContainer" for
> "eric-pc-up-data-plane" with CreateContainerError: "lstat /dev/vfio/114: no
> such file or directory"
> ~~~~

This error indicates that the vfio-pci device spec is not mounted into the SRIOV container. May I know how the application (inside the container) knows which device (e.g. /dev/vfio/114) to read from?

Same here: when the issue happens, check the sriov device plugin log to see whether it has successfully allocated the device (/dev/vfio/114). It should contain a message indicating that an `Allocate` gRPC call was issued by kubelet and that the exact device (/dev/vfio/114) was chosen for the particular container. If you don't see such a message, we probably have other issues in the net-attach-def and in how the pod requests the devices (we would need to check the net-attach-def and pod manifests).
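A sketch of how to look for the Allocate call in the device plugin log once the issue reproduces; the pod name is a placeholder and the grep patterns are illustrative:

~~~~
# Find the device plugin pod running on the affected node
oc get pods -n openshift-sriov-network-operator -o wide | grep device-plugin

# Look for the Allocate request and the device chosen for the container
oc logs <sriov-device-plugin-pod> -n openshift-sriov-network-operator --timestamps | grep -i -E 'allocate|/dev/vfio'
~~~~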
Hi @zenghui.shi, The customer just hit SRIOV issue, so the node reported 0 sriov again. There was no cluster upgrade, no node reboot, no operator restart. And I have resolved the sriov pool conflict issue based on the comment #9. Since those files are too big for BugZilla, I have uploaded in support case: https://access.redhat.com/support/cases/#/case/02995395. Please look for these three files in the case: must-gather-sriov-mxq1010l8g-20211025.tar.gz, sosreport-mxq1010l8g-02995395-2021-10-25-tsmzqry.tar.xz, sriov-operator-data-20211025.txt.tar.gz Here is the node description and interface list to give you the idea: ptang@iaasjump1:~$ oc describe node mxq1010l8g Name: mxq1010l8g Roles: worker,worker-hp Labels: beta.kubernetes.io/arch=amd64 beta.kubernetes.io/os=linux feature.node.kubernetes.io/network-sriov.capable=true kubernetes.io/arch=amd64 kubernetes.io/hostname=mxq1010l8g kubernetes.io/os=linux multus-cnf=true network-sriov.nic-type=XXV710 node-role.kubernetes.io/worker= node-role.kubernetes.io/worker-hp= node.openshift.io/os_id=rhcos run_prisma=true tenants-workload=true type=high-throughput Annotations: csi.volume.kubernetes.io/nodeid: {"csi.trident.netapp.io":"mxq1010l8g"} machineconfiguration.openshift.io/currentConfig: rendered-worker-hp-b3ec8e1fd600306f2c10530e4c3ff696 machineconfiguration.openshift.io/desiredConfig: rendered-worker-hp-b3ec8e1fd600306f2c10530e4c3ff696 machineconfiguration.openshift.io/reason: machineconfiguration.openshift.io/ssh: accessed machineconfiguration.openshift.io/state: Done projectcalico.org/IPv4Address: 192.168.145.69/23 sriovnetwork.openshift.io/state: Idle volumes.kubernetes.io/controller-managed-attach-detach: true CreationTimestamp: Sat, 13 Mar 2021 20:19:10 +0000 Taints: sriov_upf5=true:NoSchedule Unschedulable: false Lease: HolderIdentity: mxq1010l8g AcquireTime: <unset> RenewTime: Mon, 25 Oct 2021 21:47:50 +0000 Conditions: Type Status LastHeartbeatTime LastTransitionTime Reason Message ---- ------ ----------------- ------------------ ------ ------- NetworkUnavailable False Mon, 26 Jul 2021 20:59:58 +0000 Mon, 26 Jul 2021 20:59:58 +0000 CalicoIsUp Calico is running on this node MemoryPressure False Mon, 25 Oct 2021 21:44:57 +0000 Thu, 16 Sep 2021 20:09:46 +0000 KubeletHasSufficientMemory kubelet has sufficient memory available DiskPressure False Mon, 25 Oct 2021 21:44:57 +0000 Thu, 16 Sep 2021 20:09:46 +0000 KubeletHasNoDiskPressure kubelet has no disk pressure PIDPressure False Mon, 25 Oct 2021 21:44:57 +0000 Thu, 16 Sep 2021 20:09:46 +0000 KubeletHasSufficientPID kubelet has sufficient PID available Ready True Mon, 25 Oct 2021 21:44:57 +0000 Thu, 16 Sep 2021 20:09:46 +0000 KubeletReady kubelet is posting ready status Addresses: InternalIP: 172.18.243.69 Hostname: mxq1010l8g Capacity: cpu: 96 ephemeral-storage: 937140548Ki hugepages-1Gi: 40Gi memory: 395962020Ki openshift.io/leftnuma0: 0 openshift.io/leftnuma0x710: 0 openshift.io/leftnuma1: 0 openshift.io/leftnuma1x710: 0 openshift.io/rightnuma0: 0 openshift.io/rightnuma0x710: 0 openshift.io/rightnuma1: 0 openshift.io/rightnuma1x710: 0 openshift.io/sriovleftdpdkintel: 0 openshift.io/sriovleftdpdkintelx710: 0 openshift.io/sriovrightdpdkintel: 0 openshift.io/sriovrightdpdkintelx710: 0 pods: 250 Allocatable: cpu: 95500m ephemeral-storage: 862594985783 hugepages-1Gi: 40Gi memory: 352868004Ki openshift.io/leftnuma0: 0 openshift.io/leftnuma0x710: 0 openshift.io/leftnuma1: 0 openshift.io/leftnuma1x710: 0 openshift.io/rightnuma0: 0 openshift.io/rightnuma0x710: 0 openshift.io/rightnuma1: 0 
openshift.io/rightnuma1x710: 0 openshift.io/sriovleftdpdkintel: 0 openshift.io/sriovleftdpdkintelx710: 0 openshift.io/sriovrightdpdkintel: 0 openshift.io/sriovrightdpdkintelx710: 0 pods: 250 System Info: Machine ID: 8a7fc9582fe744ad835950587fc7a791 System UUID: 37393150-3636-584d-5131-3031304c3847 Boot ID: 571a72fc-391d-4113-81ad-45eeb453823b Kernel Version: 4.18.0-193.40.1.el8_2.x86_64 OS Image: Red Hat Enterprise Linux CoreOS 46.82.202101191342-0 (Ootpa) Operating System: linux Architecture: amd64 Container Runtime Version: cri-o://1.19.1-4.rhaos4.6.git3846aab.el8 Kubelet Version: v1.19.0+3b01205 Kube-Proxy Version: v1.19.0+3b01205 Non-terminated Pods: (17 in total) Namespace Name CPU Requests CPU Limits Memory Requests Memory Limits AGE --------- ---- ------------ ---------- --------------- ------------- --- calico-system calico-node-mcfj6 0 (0%) 0 (0%) 0 (0%) 0 (0%) 103d cluster-network-addons nmstate-handler-rzcsk 100m (0%) 0 (0%) 100Mi (0%) 0 (0%) 32d openshift-cluster-node-tuning-operator tuned-5f6mw 10m (0%) 0 (0%) 50Mi (0%) 0 (0%) 226d openshift-dns dns-default-sm2pf 65m (0%) 0 (0%) 110Mi (0%) 512Mi (0%) 226d openshift-image-registry node-ca-2v25v 10m (0%) 0 (0%) 10Mi (0%) 0 (0%) 226d openshift-kube-proxy openshift-kube-proxy-bqkx2 100m (0%) 0 (0%) 200Mi (0%) 0 (0%) 226d openshift-logging fluentd-69tfn 100m (0%) 0 (0%) 1536Mi (0%) 1536Mi (0%) 32d openshift-machine-config-operator machine-config-daemon-45klv 40m (0%) 0 (0%) 100Mi (0%) 0 (0%) 226d openshift-monitoring node-exporter-w5g6p 9m (0%) 0 (0%) 210Mi (0%) 0 (0%) 226d openshift-multus multus-rxhb6 10m (0%) 0 (0%) 150Mi (0%) 0 (0%) 226d openshift-multus network-metrics-daemon-cb7kr 20m (0%) 0 (0%) 120Mi (0%) 0 (0%) 226d openshift-sriov-network-operator sriov-cni-495lk 0 (0%) 0 (0%) 0 (0%) 0 (0%) 6d3h openshift-sriov-network-operator sriov-device-plugin-7bjqh 0 (0%) 0 (0%) 0 (0%) 0 (0%) 172m openshift-sriov-network-operator sriov-network-config-daemon-shckq 0 (0%) 0 (0%) 0 (0%) 0 (0%) 28d trident trident-csi-lcrml 0 (0%) 0 (0%) 0 (0%) 0 (0%) 12d twistlock twistlock-defender-ds-qhnvs 256m (0%) 0 (0%) 512Mi (0%) 512Mi (0%) 118d w6017-c1-sl01-upflab3 mxq1010l8g-debug 250m (0%) 1 (1%) 512Mi (0%) 2Gi (0%) 12s Allocated resources: (Total limits may be over 100 percent, i.e., overcommitted.) 
Resource Requests Limits -------- -------- ------ cpu 970m (1%) 1 (1%) memory 3610Mi (1%) 4608Mi (1%) ephemeral-storage 0 (0%) 0 (0%) hugepages-1Gi 0 (0%) 0 (0%) openshift.io/leftnuma0 0 0 openshift.io/leftnuma0x710 0 0 openshift.io/leftnuma1 0 0 openshift.io/leftnuma1x710 0 0 openshift.io/rightnuma0 0 0 openshift.io/rightnuma0x710 0 0 openshift.io/rightnuma1 0 0 openshift.io/rightnuma1x710 0 0 openshift.io/sriovleftdpdkintel 0 0 openshift.io/sriovleftdpdkintelx710 0 0 openshift.io/sriovrightdpdkintel 0 0 openshift.io/sriovrightdpdkintelx710 0 0 Events: Interface list: sh-4.4# lspci | grep Ether 12:00.0 Ethernet controller: Intel Corporation Ethernet Controller X710 for 10GbE SFP+ (rev 02) 12:00.1 Ethernet controller: Intel Corporation Ethernet Controller X710 for 10GbE SFP+ (rev 02) 12:00.2 Ethernet controller: Intel Corporation Ethernet Controller X710 for 10GbE SFP+ (rev 02) 12:00.3 Ethernet controller: Intel Corporation Ethernet Controller X710 for 10GbE SFP+ (rev 02) 37:00.0 Ethernet controller: Intel Corporation Ethernet Controller XXV710 for 25GbE SFP28 (rev 02) 37:00.1 Ethernet controller: Intel Corporation Ethernet Controller XXV710 for 25GbE SFP28 (rev 02) 37:02.0 Ethernet controller: Intel Corporation Ethernet Virtual Function 700 Series (rev 02) 37:02.1 Ethernet controller: Intel Corporation Ethernet Virtual Function 700 Series (rev 02) 37:02.2 Ethernet controller: Intel Corporation Ethernet Virtual Function 700 Series (rev 02) 37:02.3 Ethernet controller: Intel Corporation Ethernet Virtual Function 700 Series (rev 02) 37:02.4 Ethernet controller: Intel Corporation Ethernet Virtual Function 700 Series (rev 02) 37:02.5 Ethernet controller: Intel Corporation Ethernet Virtual Function 700 Series (rev 02) 37:02.6 Ethernet controller: Intel Corporation Ethernet Virtual Function 700 Series (rev 02) 37:02.7 Ethernet controller: Intel Corporation Ethernet Virtual Function 700 Series (rev 02) 37:0a.0 Ethernet controller: Intel Corporation Ethernet Virtual Function 700 Series (rev 02) 37:0a.1 Ethernet controller: Intel Corporation Ethernet Virtual Function 700 Series (rev 02) 37:0a.2 Ethernet controller: Intel Corporation Ethernet Virtual Function 700 Series (rev 02) 37:0a.3 Ethernet controller: Intel Corporation Ethernet Virtual Function 700 Series (rev 02) 37:0a.4 Ethernet controller: Intel Corporation Ethernet Virtual Function 700 Series (rev 02) 37:0a.5 Ethernet controller: Intel Corporation Ethernet Virtual Function 700 Series (rev 02) 37:0a.6 Ethernet controller: Intel Corporation Ethernet Virtual Function 700 Series (rev 02) 37:0a.7 Ethernet controller: Intel Corporation Ethernet Virtual Function 700 Series (rev 02) 5d:00.0 Ethernet controller: Broadcom Inc. and subsidiaries NetXtreme BCM5719 Gigabit Ethernet PCIe (rev 01) 5d:00.1 Ethernet controller: Broadcom Inc. and subsidiaries NetXtreme BCM5719 Gigabit Ethernet PCIe (rev 01) 5d:00.2 Ethernet controller: Broadcom Inc. and subsidiaries NetXtreme BCM5719 Gigabit Ethernet PCIe (rev 01) 5d:00.3 Ethernet controller: Broadcom Inc. 
and subsidiaries NetXtreme BCM5719 Gigabit Ethernet PCIe (rev 01) d8:00.0 Ethernet controller: Intel Corporation Ethernet Controller XXV710 for 25GbE SFP28 (rev 02) d8:00.1 Ethernet controller: Intel Corporation Ethernet Controller XXV710 for 25GbE SFP28 (rev 02) d8:02.0 Ethernet controller: Intel Corporation Ethernet Virtual Function 700 Series (rev 02) d8:02.1 Ethernet controller: Intel Corporation Ethernet Virtual Function 700 Series (rev 02) d8:02.2 Ethernet controller: Intel Corporation Ethernet Virtual Function 700 Series (rev 02) d8:02.3 Ethernet controller: Intel Corporation Ethernet Virtual Function 700 Series (rev 02) d8:02.4 Ethernet controller: Intel Corporation Ethernet Virtual Function 700 Series (rev 02) d8:02.5 Ethernet controller: Intel Corporation Ethernet Virtual Function 700 Series (rev 02) d8:02.6 Ethernet controller: Intel Corporation Ethernet Virtual Function 700 Series (rev 02) d8:02.7 Ethernet controller: Intel Corporation Ethernet Virtual Function 700 Series (rev 02) d8:0a.0 Ethernet controller: Intel Corporation Ethernet Virtual Function 700 Series (rev 02) d8:0a.1 Ethernet controller: Intel Corporation Ethernet Virtual Function 700 Series (rev 02) d8:0a.2 Ethernet controller: Intel Corporation Ethernet Virtual Function 700 Series (rev 02) d8:0a.3 Ethernet controller: Intel Corporation Ethernet Virtual Function 700 Series (rev 02) d8:0a.4 Ethernet controller: Intel Corporation Ethernet Virtual Function 700 Series (rev 02) d8:0a.5 Ethernet controller: Intel Corporation Ethernet Virtual Function 700 Series (rev 02) d8:0a.6 Ethernet controller: Intel Corporation Ethernet Virtual Function 700 Series (rev 02) d8:0a.7 Ethernet controller: Intel Corporation Ethernet Virtual Function 700 Series (rev 02)
It seems the issue is that the VFs didn't bind to the vfio-pci driver even though the sriov policy is configured to use "deviceType: vfio-pci". The sriov-device-plugin in turn filtered out the iavf VFs based on the configmap selector "driver: vfio-pci". For example, the sriovnetworknodestate mxq1010l8g shows that the VFs are configured for the vfio-pci driver, but are actually bound to the iavf driver: ## [...] Interfaces: Mtu: 9000 Name: ens2f0 Num Vfs: 8 Pci Address: 0000:37:00.0 Vf Groups: Device Type: vfio-pci <----- configured to use vfio-pci driver Policy Name: leftnuma0 Resource Name: leftnuma0 Vf Range: 0-7 [...] - Vfs: - deviceID: 154c driver: iavf <----- actually bound to the iavf driver mac: 66:f9:4d:f3:41:ef mtu: 1500 name: ens2f0v0 pciAddress: "0000:37:02.0" vendor: "8086" vfID: 0 - deviceID: 154c driver: iavf mac: aa:d9:07:9f:af:36 mtu: 1500 name: ens2f0v1 pciAddress: "0000:37:02.1" vendor: "8086" vfID: 1 [...] Sync Status: Succeeded <----- status shows that the config-daemon is not processing Events: <none> ## pod/sriov-device-plugin-7bjqh on node mxq1010l8g uses vfio-pci in the driver selector: ## [...] {"resourceList":[{"resourceName":"leftnuma0","selectors":{"vendors":["8086"],"devices":["154c"],"drivers":["vfio-pci"],"pfNames":["ens2f0"],"IsRdma":false},"SelectorObj":null},{"resourceName":"leftnuma1","selectors":{"vendors":["8086"],"devices":["154c"],"drivers":["vfio-pci"],"pfNames":["ens3f0"],"IsRdma":false},"SelectorObj":null},{"resourceName":"rightnuma0","selectors":{"vendors":["8086"],"devices":["154c"],"drivers":["vfio-pci"],"pfNames":["ens2f1"],"IsRdma":false},"SelectorObj":null},{"resourceName":"rightnuma1","selectors":{"vendors":["8086"],"devices":["154c"],"drivers":["vfio-pci"],"pfNames":["ens3f1"],"IsRdma":false},"SelectorObj":null}]} [...] ## Checking the pod/sriov-network-config-daemon-shckq log on node mxq1010l8g: ## [...] I1025 18:53:57.933510 890379 writer.go:125] setNodeStateStatus(): syncStatus: InProgress, lastSyncError: timed out waiting for the condition I1025 18:54:01.149359 890379 daemon.go:123] evicting pod w6017-c1-sl01-upflab3/eric-pc-up-data-plane-cc7f58fb8-g8695 E1025 18:54:01.154516 890379 daemon.go:123] error when evicting pod "eric-pc-up-data-plane-cc7f58fb8-g8695" (will retry after 5s): Cannot evict pod as it would violate the pod's disruption budget. I1025 18:54:06.154681 890379 daemon.go:123] evicting pod w6017-c1-sl01-upflab3/eric-pc-up-data-plane-cc7f58fb8-g8695 I1025 18:54:06.154729 890379 daemon.go:786] Draining failed with: error when evicting pod "eric-pc-up-data-plane-cc7f58fb8-g8695": global timeout reached: 1m30s, retrying E1025 18:54:06.154744 890379 daemon.go:790] drainNode(): failed to drain node (5 tries): timed out waiting for the condition :error when evicting pod "eric-pc-up-data-plane-cc7f58fb8-g8695": global timeout reached: 1m30s E1025 18:54:06.154751 890379 daemon.go:792] drainNode(): failed to drain node: timed out waiting for the condition [...] ## It shows that drainNode timed out because pod w6017-c1-sl01-upflab3/eric-pc-up-data-plane-cc7f58fb8-g8695 cannot be evicted, which blocks the sriov-config-daemon from proceeding. Could you try to fix the pod disruption budget and re-apply the sriov policy?
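For anyone hitting this again, a hedged sketch of the checks behind the analysis above (the paths, names and namespaces are taken from the outputs in this comment; the grep patterns are illustrative):
~~~~
# On the node: which driver is each VF actually bound to?
readlink /sys/bus/pci/devices/0000:37:02.0/driver
readlink /sys/bus/pci/devices/0000:37:02.1/driver

# Compare with what the operator believes it configured.
oc -n openshift-sriov-network-operator get sriovnetworknodestate mxq1010l8g -o yaml | grep -iE 'deviceType|driver'

# Check whether a PodDisruptionBudget is what prevents the eviction during drain.
oc -n w6017-c1-sl01-upflab3 get pdb
~~~~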
Hi @zenghui.shi, Why would the node drain impact the sriov? There might be customer activities that try to drain the node. I have restarted the sriov daemon pod for the node, which temporarily fixed the issue.
(In reply to ptang from comment #20) > Hi @zenghui.shi, > > Why would the node drain impact the sriov? There might be customer activities > that try to drain the node. > I have restarted the sriov daemon pod for the node, which temporarily fixed the > issue. This node drain is triggered by the sriov-config-daemon, not by customer activity. The sriov-config-daemon drains the node in order to evict the pods to other nodes, so that the workload is not interrupted while the sriov configuration is applied.
Hi @zenghui.shi, Before I discuss the change with the customer, I'd like to understand and confirm: 1) What mechanism triggers the sriov configuration to be re-applied, and when? Does the operator do that periodically? I know there was no sriov configuration change on the customer side. 2) Does applying the sriov configuration always drain the node? The eviction will interrupt the workload service, which is the opposite of "prevent the workload from being interrupted". Please confirm that this is the behavior we should expect while using the sriov operator.
(In reply to ptang from comment #22) > Hi @zenghui.shi, > Before I discuss the change with the customer, I'd like to understand > and confirm: > 1) What mechanism triggers the sriov configuration to be re-applied, and > when? Changes to an existing sriov policy or the creation of a new policy will trigger a re-apply immediately. > Does the operator do that periodically? It reconciles every 5 minutes, but won't re-apply if there are no changes to the policy. > I know there was no sriov configuration change on the customer side. > > 2) Does applying the sriov configuration always drain the node? Not always; it depends on whether the changes have the potential to cause workload interruption. For example, the first time a vfio-pci device is configured on a certain node, it will trigger a node drain, because the vfio-pci configuration in the kernel arguments only takes effect after rebooting, and a reboot implies a node drain. Another example is a policy change that requires updating the NIC firmware configuration (for some vendor NICs), which would also trigger a node drain and reboot. > The eviction will interrupt the workload service, which is the opposite of > "prevent the workload from being interrupted". It is expected that the workload implements application-level HA and that a node is not a single point of failure in the cluster; for example, a Deployment with replicas > 1 distributes several workload pods across different nodes (with proper taint and affinity settings), so one node draining or rebooting won't affect the workload on the other nodes or in the cluster. We have a similar node drain mechanism in the machine config operator: if applying a configuration is disruptive, it will drain and reboot the node. In such a case, application HA is required to keep the workload running.
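A quick way to see this state machine in action, for what it's worth (a hedged sketch; the annotation name is the one that appears in the node descriptions in this bug):
~~~~
# Shows Idle while the config daemon has nothing to do, and Draining while it is draining the node.
oc describe node mxq1010l8g | grep sriovnetwork.openshift.io/state
~~~~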
> > We have a similar node drain mechanism in the machine config operator: if > applying a configuration is disruptive, it will drain and reboot > the node. In such a case, application HA is required to keep the workload > running. A small wording clarification to my comment above: "applying a configuration" here means the act of applying the (machine or sriov) configuration to the node, not the application workload itself.
Hi @zenghui.shi, I need to understand the auto re-apply mechanism and the node drain further. 1) Regarding your comment about re-apply, "Changes to an existing sriov policy or the creation of a new policy will trigger a re-apply immediately": in our customer's case there was no sriov policy change (no update, no creation), so why was the re-apply triggered? 2) Regarding your comment about the node drain, "Not always; it depends on whether the changes have the potential to cause workload interruption. For example, the first time a vfio-pci device is configured on a certain node, it will trigger a node drain, because the vfio-pci configuration in the kernel arguments only takes effect after rebooting, and a reboot implies a node drain. Another example is a policy change that requires updating the NIC firmware configuration (for some vendor NICs), which would also trigger a node drain and reboot.": there was no workload change and no new deployment. There was also no OCP, device, or firmware update. Why would the sriov operator decide to drain the node?
(In reply to ptang from comment #25) > Hi @zenghui.shi, > > I need to understand the auto re-apply mechanism and the node drain further. > 1) Regarding your comment about re-apply, "Changes to an existing sriov policy or > the creation of a new policy will trigger a re-apply immediately": > in our customer's case there was no sriov policy change (no update, no > creation), so why was the re-apply triggered? A node reboot may also trigger this: the sriov configuration is re-applied every time after a reboot. It could be a manual reboot or a reboot triggered by other (non-sriov) components. > > 2) Regarding your comment about the node drain, > "Not always; it depends on whether the changes have the potential to > cause workload interruption. > For example, the first time a vfio-pci device is configured on a certain node, > it will trigger a node drain, because the vfio-pci configuration in the kernel > arguments only takes effect after rebooting, and a reboot implies a node drain. > Another example is a policy change that requires updating the NIC firmware > configuration (for some vendor NICs), which would also trigger a node drain and > reboot.": > there was no workload change and no new deployment. There was also no OCP, device, > or firmware update. Why would the sriov operator decide to drain the > node? Same answer as for the first question.
Hi @zenghui.shi, Regarding comment #26, those assumptions are not true. In our customer's case there was: - no node reboot (the last reboot was three months ago) - no policy update/create/delete - no hardware change - no hardware firmware update - no workload update Why would the sriov configuration be re-applied, and why did sriov try to drain the node? Here is the boot log: ptang@iaasjump1:~$ oc debug node/mxq1010l8g Starting pod/mxq1010l8g-debug ... To use host binaries, run `chroot /host` Pod IP: 172.18.243.69 If you don't see a command prompt, try pressing enter. sh-4.4# chroot /host sh-4.4# last core pts/0 192.168.137.252 Mon Sep 27 17:53 - 18:11 (00:17) core pts/0 192.168.137.252 Mon Sep 27 17:51 - 17:53 (00:01) core pts/0 192.168.137.252 Tue Sep 7 18:47 - 18:47 (00:00) reboot system boot 4.18.0-193.40.1. Mon Jul 26 20:59 still running reboot system boot 4.18.0-193.40.1. Mon Jul 26 20:51 - 20:56 (00:05)
Hi @zenghui.shi and @ddelcian, We're seeing a sudden node drain by the sriov operator today (please see the attached node description for details). We don't enable sriov on that node (mxq0440kn4) at all (please see the labels). I have uploaded the following files for investigation: sriovnetworknodepolicy-20211028.yaml, collect-sriov-operator-data-mxq0440kn4.txt.tar.gz, sosreport-mxq0440kn4-02995395-2021-10-28-qaqmoms.tar.gz, must-gather.local.870088789971800506-mxq0440kn4.tar.gz, sriov-cm-device-plugin-config.yaml Name: mxq0440kn4 Roles: worker Labels: beta.kubernetes.io/arch=amd64 beta.kubernetes.io/os=linux kubernetes.io/arch=amd64 kubernetes.io/hostname=mxq0440kn4 kubernetes.io/os=linux multus-cnf=true node-role.kubernetes.io/worker= node.openshift.io/os_id=rhcos pcc-mm-pod=non-controller pcc-sm-pod=non-controller snmp-sender=upflab3 tenants-workload=true Annotations: csi.volume.kubernetes.io/nodeid: {"csi.trident.netapp.io":"mxq0440kn4"} machineconfiguration.openshift.io/currentConfig: rendered-worker-a421fbe985727fc560149575f30f97d2 machineconfiguration.openshift.io/desiredConfig: rendered-worker-a421fbe985727fc560149575f30f97d2 machineconfiguration.openshift.io/reason: machineconfiguration.openshift.io/state: Done projectcalico.org/IPv4Address: 192.168.145.58/23 sriovnetwork.openshift.io/state: Draining volumes.kubernetes.io/controller-managed-attach-detach: true CreationTimestamp: Sat, 27 Feb 2021 06:26:59 +0000 Taints: node.kubernetes.io/unschedulable:NoSchedule Unschedulable: true Lease: HolderIdentity: mxq0440kn4 AcquireTime: <unset> RenewTime: Thu, 28 Oct 2021 17:41:20 +0000
Sorry, in comment #28 I meant that I have uploaded the files to the support case: https://access.redhat.com/support/cases/#/case/02995395
Hi @zenghui.shi, One thing I'd like to mention is that the customer is using sriov operator version 4.6.0-202106032244, while the latest version is 4.6.0-202110121348. Do you know if there were any important fixes between these two versions?
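In case it helps, a hedged sketch of how the installed vs. available operator versions can be compared from the CLI (the namespace is an assumption; the operator may be subscribed in a different one):
~~~~
# Installed CSV (operator) version.
oc -n openshift-sriov-network-operator get csv

# Subscription channel, approval mode (Manual vs Automatic) and current CSV.
oc -n openshift-sriov-network-operator get subscription -o yaml | grep -E 'channel|installPlanApproval|currentCSV'
~~~~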
Patric, could you also let me know what actions you took to resolve the overlapping devices in the different pools? I'm wondering how you addressed it without updating the policy or other sriov configurations.
Hi @zenghui.shi, On Oct 19th I made the policy update to resolve the sriov pool conflict based on your suggestion. Then we hit another sriov issue on node mxq1010l8g on Oct 25th; that's why I said there was no sriov policy update before that issue. On Oct 28th we saw the SRIOV operator trying to drain a totally unrelated node, mxq0440kn4, which was never set up for sriov.
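For context, the shape of such a de-conflicted policy is sketched below (a hedged illustration only -- the names, node label, PF name and VF count are taken from or modeled on the outputs earlier in this bug, not the exact change applied on Oct 19th):
~~~~
# Each pool gets its own resourceName and PF selector so no two policies claim the same VFs,
# and nodes are selected by an explicit label instead of a broad selector.
cat <<'EOF' | oc apply -f -
apiVersion: sriovnetwork.openshift.io/v1
kind: SriovNetworkNodePolicy
metadata:
  name: leftnuma0
  namespace: openshift-sriov-network-operator
spec:
  resourceName: leftnuma0
  nodeSelector:
    network-sriov.nic-type: "XXV710"
  numVfs: 8
  deviceType: vfio-pci
  nicSelector:
    pfNames:
      - ens2f0
EOF
~~~~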
Three things: 1) I see that VFs exist in the SriovNetworkNodeState for node mxq0440kn4 in the log sriov-operator-data-20211025.txt, but according to Patrik, the node has never been configured with an sriov policy. I suspect we may have wrongly selected this node with an sriov policy, which results in VFs being created. 2) To confirm that 1) is caused by a wrongly configured policy, we need to start with a clean state (cluster is ready, mcp is ready, sriov configuration is all set), then collect and compare the sriovnetworknodestate, sriovnetworknodepolicy, sriov-device-plugin configmap, sriov-device-plugin log, and node status right before and after any change to the sriov policy (including adding a label to the nodeSelector); a sketch of these commands is below. 3) To fix the drain issue on node mxq0440kn4, I suggest making the pod w6017-c1-sl01-upflab3/eric-pc-up-data-plane-cc7f58fb8-g8695 evictable, then waiting until the sriov configuration is done on the node (VFs are removed).
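The collect-and-compare step from item 2) and the selector check from item 1) could look roughly like this (a hedged sketch; the configmap name, pod label and output file names are assumptions):
~~~~
# Item 2: snapshot the sriov state before (and again after) a policy change.
oc -n openshift-sriov-network-operator get sriovnetworknodepolicies -o yaml > policies-before.yaml
oc -n openshift-sriov-network-operator get sriovnetworknodestates -o yaml > nodestates-before.yaml
oc -n openshift-sriov-network-operator get cm device-plugin-config -o yaml > dp-config-before.yaml
oc -n openshift-sriov-network-operator logs -l app=sriov-device-plugin --tail=-1 > dp-logs-before.txt
oc describe node mxq0440kn4 > node-before.txt

# Item 1: check which nodes each policy's nodeSelector actually matches.
oc -n openshift-sriov-network-operator get sriovnetworknodepolicies -o yaml | grep -A3 nodeSelector
oc get nodes -l feature.node.kubernetes.io/network-sriov.capable=true
~~~~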
@zshi - Thanks for taking the time to help troubleshoot the issue earlier today. A few follow-up questions/observations: 1) The SRIOV operator subscription was set to Manual some time ago due to a past issue and was never reset to Automatic. We did notice the version of the operator dating back to Jun 3rd; however, the MCO team did mention a race condition issue between MCO and SRIOV which is supposedly corrected via https://github.com/openshift/sriov-network-operator/pull/506. However, this PR was only merged around Jun 23rd, so this likely means they could have hit the race condition since the pool was not paused prior to the SRIOV daemon marking the node unschedulable. 2) The SRIOV operator version in production corresponds to the Oct 12th version and we haven't hit the issue there. So this is encouraging us to want to upgrade. Do you agree? 3) Once we've addressed the SRIOV operator upgrade, we will gauge the stability with the new version plus the updated node policies in place. 4) If, for some reason, we still see instability on the nodes with the X710 NICs, we will consider updating the firmware, as it is out of date and, I believe, not 100% compatible with the bundled RHEL8/RHCOS driver running on the nodes. Thanks! Daniel.
(In reply to Daniel Del Ciancio from comment #35) > @zshi - Thanks for taking the time to help troubleshoot the issue > earlier today. > > A few follow-up questions/observations: > > 1) The SRIOV operator subscription was set to Manual some time ago due to a > past issue and was never reset to Automatic. We did notice the version of > the operator dating back to Jun 3rd; however, the MCO team did mention a > race condition issue between MCO and SRIOV which is supposedly corrected via > https://github.com/openshift/sriov-network-operator/pull/506. However, this > PR was only merged around Jun 23rd, so this likely means they could have hit > the race condition since the pool was not paused prior to the SRIOV daemon > marking the node unschedulable. PR #506 in the sriov operator fixed a race condition when both an MC and an sriov policy are applied simultaneously, which leads to the node never becoming Ready after rebooting. We can check the health of the MCP (not degraded) when the sriov issue happens to see whether this is the same issue; if the MCP is not ready, it might be. > > 2) The SRIOV operator version in production corresponds to the Oct 12th > version and we haven't hit the issue there. So this is encouraging us to > want to upgrade. Do you agree? Did we also have the overlapping device configuration issue in the sriov pools, or the "pod cannot satisfy the eviction condition" issue, in the prod environment? I recommend fixing the pod eviction issue first in the current env, as I mentioned in comment #19, and then letting the cluster auto-fix itself if possible.
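A quick sketch of that MCP health check, for reference (standard oc output, nothing cluster-specific assumed):
~~~~
# A degraded or stuck pool at the time of the sriov issue would point to the MC/sriov race fixed by PR #506.
oc get mcp
oc get nodes | grep -i -e SchedulingDisabled -e NotReady
~~~~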
(In reply to zenghui.shi from comment #36) > (In reply to Daniel Del Ciancio from comment #35) > > @zshi - Thanks for taking the time to help troubleshoot the issue > > earlier today. > > > > A few follow-up questions/observations: > > > > 1) The SRIOV operator subscription was set to Manual some time ago due to a > > past issue and was never reset to Automatic. We did notice the version of > > the operator dating back to Jun 3rd; however, the MCO team did mention a > > race condition issue between MCO and SRIOV which is supposedly corrected via > > https://github.com/openshift/sriov-network-operator/pull/506. However, this > > PR was only merged around Jun 23rd, so this likely means they could have hit > > the race condition since the pool was not paused prior to the SRIOV daemon > > marking the node unschedulable. > > PR #506 in the sriov operator fixed a race condition when both an MC and an sriov policy > are applied simultaneously, which leads to the node never becoming > Ready after rebooting. We can check the health of the MCP (not degraded) when > the sriov issue happens to see whether this is the same issue; if the MCP is not > ready, it might be. After applying the SRIOV operator version upgrade (on Monday Nov 1st) to the latest version (since the sub was set to manual), along with the corrected SRIOV node policies (applying the additional node selector and node labeling), things seem to be more stable. This leads to another question: is it safe to keep the SRIOV operator on auto-upgrade? I think that, provided no SRIOV node policies change (and therefore no underlying VF changes happen on the nodes), it should be safe, right? > > > > 2) The SRIOV operator version in production corresponds to the Oct 12th > > version and we haven't hit the issue there. So this is encouraging us to > > want to upgrade. Do you agree? > > Did we also have the overlapping device configuration issue in the sriov pools, or > the "pod cannot satisfy the eviction condition" issue, in the prod environment? > > I recommend fixing the pod eviction issue first in the current env, as I > mentioned in comment #19, and then letting the cluster auto-fix itself > if possible. Concerning comment #19, the PDB blocking the node drain is a known concern and the customer is aware. This is another conversation being handled with the customer and the CNF vendor. For now, the node drain is being freed by deleting the blocking pod. Does SRIOV implement a drain timeout? I believe some work is being done on the MCO end to handle this, especially due to PDB pressure.
(In reply to Daniel Del Ciancio from comment #37) > (In reply to zenghui.shi from comment #36) > > (In reply to Daniel Del Ciancio from comment #35) > > > @zshi - Thanks for taking the time to help troubleshoot the issue > > > earlier today. > > > > > > A few follow-up questions/observations: > > > > > > 1) The SRIOV operator subscription was set to Manual some time ago due to a > > > past issue and was never reset to Automatic. We did notice the version of > > > the operator dating back to Jun 3rd; however, the MCO team did mention a > > > race condition issue between MCO and SRIOV which is supposedly corrected via > > > https://github.com/openshift/sriov-network-operator/pull/506. However, this > > > PR was only merged around Jun 23rd, so this likely means they could have hit > > > the race condition since the pool was not paused prior to the SRIOV daemon > > > marking the node unschedulable. > > > > PR #506 in the sriov operator fixed a race condition when both an MC and an sriov policy > > are applied simultaneously, which leads to the node never becoming > > Ready after rebooting. We can check the health of the MCP (not degraded) when > > the sriov issue happens to see whether this is the same issue; if the MCP is not > > ready, it might be. > > After applying the SRIOV operator version upgrade (on Monday Nov > 1st) to the latest version (since the sub was set to manual), along with the > corrected SRIOV node policies (applying the additional node selector and > node labeling), things seem to be more stable. > > This leads to another question: is it safe to keep the SRIOV operator > on auto-upgrade? I think that, provided no SRIOV node policies change (and therefore no > underlying VF changes happen on the nodes), it should be safe, right? Yes to the auto-upgrade; an upgrade of the sriov operator version is not supposed to trigger a reconfiguration of the VF devices unless explicitly documented. > > > > > > > 2) The SRIOV operator version in production corresponds to the Oct 12th > > > version and we haven't hit the issue there. So this is encouraging us to > > > want to upgrade. Do you agree? > > > > Did we also have the overlapping device configuration issue in the sriov pools, or > > the "pod cannot satisfy the eviction condition" issue, in the prod environment? > > > > I recommend fixing the pod eviction issue first in the current env, as I > > mentioned in comment #19, and then letting the cluster auto-fix itself > > if possible. > > > Concerning comment #19, the PDB blocking the node drain is a known concern > and the customer is aware. This is another conversation being handled with > the customer and the CNF vendor. For now, the node drain is being freed > by deleting the blocking pod. > > Does SRIOV implement a drain timeout? Nope, not in released versions. This is to avoid potential workload interruption for pods that cannot satisfy the eviction policy. User interaction is required to unblock a failed node drain at this moment. > I believe some work is being done on > the MCO end to handle this, especially due to PDB pressure. Do you have a pointer to how MCO handles the PDB pressure?
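For the record, the manual unblocking used here can be sketched roughly as follows (a hedged sketch only -- the pod and node names are the ones mentioned earlier in this bug, and deleting a workload pod is disruptive, so it should only be done in agreement with the application owner):
~~~~
# Remove the pod that the PDB protects so the eviction can proceed (or scale/relocate it instead).
oc -n w6017-c1-sl01-upflab3 delete pod eric-pc-up-data-plane-cc7f58fb8-g8695

# Watch the node annotation go from Draining back to Idle once the config daemon finishes.
oc describe node mxq0440kn4 | grep sriovnetwork.openshift.io/state
~~~~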
The needinfo request[s] on this closed bug have been removed as they have been unresolved for 120 days