Description of problem: SRIOV connectivity issue - SRIOV error when trying to request a VF, and nodes lost their SRIOV interfaces.

As per the log below:

"Found pre-allocated devices for resource openshift.io/leftnuma0x710 container "eric-pc-up-data-plane" in Pod "946025e3-be22-4d15-b0d1-f3143973beab": [0000:37:02.5]"

it seems the kernel failed to allocate memory for the PCI device with PCI ID 0000:37:02.5. Because the allocation failed, the corresponding entry could not be created in the sys filesystem. During the init stage every application tries to read the VF details from the sys filesystem and bind to it; since the device does not exist, there is no such entry in the sys filesystem, as expected.

---------------------------------------------------------------------------------------------------------------------------
Jul 23 15:04:00 mxq949065g hyperkube[3980]: I0723 15:04:00.314165 3980 topology_manager.go:233] [topologymanager] Topology Admit Handler
Jul 23 15:04:00 mxq949065g hyperkube[3980]: I0723 15:04:00.314182 3980 manager.go:843] needs 12 cpu
Jul 23 15:04:00 mxq949065g hyperkube[3980]: I0723 15:04:00.314189 3980 manager.go:843] needs 1073741824 hugepages-1Gi
Jul 23 15:04:00 mxq949065g hyperkube[3980]: I0723 15:04:00.314196 3980 manager.go:843] needs 25769803776 memory
Jul 23 15:04:00 mxq949065g hyperkube[3980]: I0723 15:04:00.314201 3980 manager.go:843] needs 1 openshift.io/leftnuma0x710
Jul 23 15:04:00 mxq949065g hyperkube[3980]: I0723 15:04:00.314229 3980 manager.go:647] Found pre-allocated devices for resource openshift.io/leftnuma0x710 container "eric-pc-up-data-plane" in Pod "946025e3-be22-4d15-b0d1-f3143973beab": [0000:37:02.5]
Jul 23 15:04:18 mxq949065g crio[3920]: time="2021-07-23 15:04:18.868024395Z" level=error msg="Error adding network: [w6017-c1-sl01-upflab1/eric-pc-up-data-plane-c7949486b-4jkx8:leftnuma0]: error adding container to network \"leftnuma0\": SRIOV-CNI failed to load netconf: LoadConf(): failed to get VF information: \"lstat /sys/bus/pci/devices/0000:37:02.5/physfn/net: no such file or directory\""
Jul 23 15:04:18 mxq949065g crio[3920]: time="2021-07-23 15:04:18.868061477Z" level=error msg="Error while adding pod to CNI network \"multus-cni-network\": [w6017-c1-sl01-upflab1/eric-pc-up-data-plane-c7949486b-4jkx8:leftnuma0]: error adding container to network \"leftnuma0\": SRIOV-CNI failed to load netconf: LoadConf(): failed to get VF information: \"lstat /sys/bus/pci/devices/0000:37:02.5/physfn/net: no such file or directory\""
Jul 23 15:04:18 mxq949065g crio[3920]: time="2021-07-23 15:04:18.868086302Z" level=info msg="NetworkStart: stopping network for sandbox 8d79b48da30fab1f9f46f6c15a0fe679ff7fb401e10c5fa62a78c04b30dfab97" id=1d7ab073-21f6-4521-9326-9c56bee0fe3b name=/runtime.v1alpha2.RuntimeService/RunPodSandbox
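For reference, a quick way to check on the worker node whether the VF and its parent PF are actually present in sysfs; a minimal sketch, using the VF address 0000:37:02.5 and PF address 0000:37:00.0 that appear in this report (adjust for other devices):

~~~~
# Does the VF PCI device exist at all?
lspci -s 0000:37:02.5

# The physfn symlink is the path SRIOV-CNI reads; it should list the PF netdev name
ls -l /sys/bus/pci/devices/0000:37:02.5/physfn/net/

# Number of VFs currently created on the PF
cat /sys/bus/pci/devices/0000:37:00.0/sriov_numvfs
~~~~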
As per the attachment in the case, below are some of the pointers towards the issue:

NAME                                        DISPLAY                           VERSION             REPLACES                                                  PHASE
namespace-configuration-operator.v1.0.5     Namespace Configuration Operator  1.0.5               namespace-configuration-operator.v1.0.4                   Replacing
namespace-configuration-operator.v1.0.6     Namespace Configuration Operator  1.0.6               namespace-configuration-operator.v1.0.5                   Pending
sriov-network-operator.4.6.0-202106032244   SR-IOV Network Operator           4.6.0-202106032244  sriov-network-operator.4.6.0-202106010807.p0.git.78e7139  Succeeded
namespace-configuration-operator.v1.0.6     Namespace Configuration Operator  1.0.6               namespace-configuration-operator.v1.0.5                   Pending

phase: Pending
installing: waiting for deployment elasticsearch-operator to become ready: Waiting for rollout to finish: 1 old replicas are pending termination...
ResourceQuotas, NetworkPolicies, EgressNetworkPolicies, etc.... . Depending phase: Pending
ResourceQuotas, NetworkPolicies, EgressNetworkPolicies, etc.... . Depending phase: Pending
installing: waiting for deployment sriov-network-operator to become ready: Waiting for rollout to finish: 1 old replicas are pending termination...
phase: Pending
message: 'install strategy failed: Deployment.apps "namespace-configuration-operator-controller-manager"
phase: Failed
reason: InstallComponentFailed
message: 'install strategy failed: Deployment.apps "namespace-configuration-operator-controller-manager"
phase: Failed
Message: install strategy failed: Deployment.apps "namespace-configuration-operator-controller-manager" is invalid: spec.selector: Invalid value: v1.LabelSelector{MatchLabels:map[string]string{"operator":"namespace-configuration-operator"}, MatchExpressions:[]v1.LabelSelectorRequirement(nil)}: field is immutable
Phase: Failed
Reason: InstallComponentFailed
Phase: Failed
Phase: Failed

--> Found a huge amount of "TLS handshake errors":
2021/07/04 05:04:08 http: TLS handshake error from 192.168.145.55:35882: EOF
2021/07/05 03:39:05 http: TLS handshake error from 192.168.145.55:56894: EOF
2021/07/06 01:09:05 http: TLS handshake error from 192.168.145.55:48700: EOF

=======
Jul 23 15:04:00 mxq949065g hyperkube[3980]: I0723 15:04:00.314229 3980 manager.go:647] Found pre-allocated devices for resource openshift.io/leftnuma0x710 container "eric-pc-up-data-plane" in Pod "946025e3-be22-4d15-b0d1-f3143973beab": [0000:37:02.5]
[11220.208454] iavf 0000:37:02.5 ens2f0v5: renamed from net1
[11221.629462] iavf 0000:37:02.5 temp_105: renamed from ens2f0v5
[11221.664059] iavf 0000:37:02.5 net1: renamed from temp_105
[11222.462375] iavf 0000:37:02.5 ens2f0v5: renamed from net1
[11223.481514] iavf 0000:37:02.5 temp_105: renamed from ens2f0v5
[11223.528717] iavf 0000:37:02.5 net1: renamed from temp_105
[11224.323564] iavf 0000:37:02.5 ens2f0v5: renamed from net1
[11225.534274] iavf 0000:37:02.5 temp_105: renamed from ens2f0v5
[11225.579891] iavf 0000:37:02.5 net1: renamed from temp_105
[11226.207440] iavf 0000:37:02.5 ens2f0v5: renamed from net1
[11227.491125] iavf 0000:37:02.5 temp_105: renamed from ens2f0v5
[11227.529681] iavf 0000:37:02.5 net1: renamed from temp_105
Jul 23 15:04:18 mxq949065g crio[3920]: time="2021-07-23 15:04:18.868061477Z" level=error msg="Error while adding pod to CNI network \"multus-cni-network\": [w6017-c1-sl01-upflab1/eric-pc-up-data-plane-c7949486b-4jkx8:leftnuma0]: error adding container to network \"leftnuma0\": SRIOV-CNI failed to load netconf: LoadConf(): failed to get VF information: \"lstat /sys/bus/pci/devices/0000:37:02.5/physfn/net: no such file or directory\""
Jul 23 15:04:20 mxq949065g hyperkube[3980]: E0723 15:04:20.230885 3980 remote_runtime.go:113] RunPodSandbox from runtime service failed: rpc error: code = Unknown desc = failed to create pod network sandbox k8s_eric-pc-up-data-plane-c7949486b-4jkx8_w6017-c1-sl01-upflab1_946025e3-be22-4d15-b0d1-f3143973beab_0(ac3c52ef7c96c1165e637fcde9b52d7e5253b70580426497dfe577a98358dfe1): [w6017-c1-sl01-upflab1/eric-pc-up-data-plane-c7949486b-4jkx8:leftnuma0]: error adding container to network "leftnuma0": SRIOV-CNI failed to load netconf: LoadConf(): failed to get VF information: "lstat /sys/bus/pci/devices/0000:37:02.5/physfn/net: no such file or directory"
Some SRIOV errors leading up to the node reboot Jul 23 ~ 15:04 (see sosreport-mxq949065g-02995395-2021-07-23-eyukwzj/sos_commands/logs/journalctl_--no-pager_--catalog_--boot_-1 for details): Jul 22 04:24:13 mxq949065g hyperkube[4018]: W0722 04:24:13.857631 4018 clientconn.go:1223] grpc: addrConn.createTransport failed to connect to {/var/lib/kubelet/plugins_registry/openshift.io_sriovrightdpdkintel.sock <nil> 0 <nil>}. Err :connection error: desc = "transport: Error while dialing dial unix /var/lib/kubelet/plugins_registry/openshift.io_sriovrightdpdkintel.sock: connect: connection refused". Reconnecting... Jul 22 04:24:13 mxq949065g hyperkube[4018]: I0722 04:24:13.857728 4018 balancer_conn_wrappers.go:78] pickfirstBalancer: HandleSubConnStateChange: 0xc003f916c0, {TRANSIENT_FAILURE connection error: desc = "transport: Error while dialing dial unix /var/lib/kubelet/plugins_registry/openshift.io_sriovrightdpdkintel.sock: connect: connection refused"} Jul 22 04:24:14 mxq949065g hyperkube[4018]: W0722 04:24:14.145719 4018 clientconn.go:1223] grpc: addrConn.createTransport failed to connect to {/var/lib/kubelet/plugins_registry/openshift.io_sriovleftdpdkintel.sock <nil> 0 <nil>}. Err :connection error: desc = "transport: Error while dialing dial unix /var/lib/kubelet/plugins_registry/openshift.io_sriovleftdpdkintel.sock: connect: connection refused". Reconnecting... Jul 22 04:24:14 mxq949065g hyperkube[4018]: I0722 04:24:14.145867 4018 balancer_conn_wrappers.go:78] pickfirstBalancer: HandleSubConnStateChange: 0xc003468c80, {TRANSIENT_FAILURE connection error: desc = "transport: Error while dialing dial unix /var/lib/kubelet/plugins_registry/openshift.io_sriovleftdpdkintel.sock: connect: connection refused"} Jul 22 04:24:14 mxq949065g hyperkube[4018]: W0722 04:24:14.482223 4018 clientconn.go:1223] grpc: addrConn.createTransport failed to connect to {/var/lib/kubelet/plugins_registry/openshift.io_sriovleftdpdkintelx710.sock <nil> 0 <nil>}. Err :connection error: desc = "transport: Error while dialing dial unix /var/lib/kubelet/plugins_registry/openshift.io_sriovleftdpdkintelx710.sock: connect: connection refused". Reconnecting... Jul 22 04:24:14 mxq949065g hyperkube[4018]: I0722 04:24:14.482275 4018 balancer_conn_wrappers.go:78] pickfirstBalancer: HandleSubConnStateChange: 0xc004f65490, {TRANSIENT_FAILURE connection error: desc = "transport: Error while dialing dial unix /var/lib/kubelet/plugins_registry/openshift.io_sriovleftdpdkintelx710.sock: connect: connection refused"} Jul 22 04:24:14 mxq949065g hyperkube[4018]: W0722 04:24:14.998329 4018 clientconn.go:1223] grpc: addrConn.createTransport failed to connect to {/var/lib/kubelet/plugins_registry/openshift.io_sriovrightdpdkintelx710.sock <nil> 0 <nil>}. Err :connection error: desc = "transport: Error while dialing dial unix /var/lib/kubelet/plugins_registry/openshift.io_sriovrightdpdkintelx710.sock: connect: connection refused". Reconnecting... Jul 22 04:24:14 mxq949065g hyperkube[4018]: I0722 04:24:14.998397 4018 balancer_conn_wrappers.go:78] pickfirstBalancer: HandleSubConnStateChange: 0xc0021ae090, {TRANSIENT_FAILURE connection error: desc = "transport: Error while dialing dial unix /var/lib/kubelet/plugins_registry/openshift.io_sriovrightdpdkintelx710.sock: connect: connection refused"} Jul 22 04:24:17 mxq949065g hyperkube[4018]: W0722 04:24:17.503222 4018 clientconn.go:1223] grpc: addrConn.createTransport failed to connect to {/var/lib/kubelet/plugins_registry/openshift.io_sriovrightdpdkintel.sock <nil> 0 <nil>}. 
Err :connection error: desc = "transport: Error while dialing dial unix /var/lib/kubelet/plugins_registry/openshift.io_sriovrightdpdkintel.sock: connect: connection refused". Reconnecting... Jul 22 04:24:17 mxq949065g hyperkube[4018]: I0722 04:24:17.503316 4018 balancer_conn_wrappers.go:78] pickfirstBalancer: HandleSubConnStateChange: 0xc003f916c0, {TRANSIENT_FAILURE connection error: desc = "transport: Error while dialing dial unix /var/lib/kubelet/plugins_registry/openshift.io_sriovrightdpdkintel.sock: connect: connection refused"} Jul 22 04:24:17 mxq949065g hyperkube[4018]: W0722 04:24:17.787865 4018 clientconn.go:1223] grpc: addrConn.createTransport failed to connect to {/var/lib/kubelet/plugins_registry/openshift.io_sriovleftdpdkintel.sock <nil> 0 <nil>}. Err :connection error: desc = "transport: Error while dialing dial unix /var/lib/kubelet/plugins_registry/openshift.io_sriovleftdpdkintel.sock: connect: connection refused". Reconnecting... Jul 22 04:24:17 mxq949065g hyperkube[4018]: I0722 04:24:17.787928 4018 balancer_conn_wrappers.go:78] pickfirstBalancer: HandleSubConnStateChange: 0xc003468c80, {TRANSIENT_FAILURE connection error: desc = "transport: Error while dialing dial unix /var/lib/kubelet/plugins_registry/openshift.io_sriovleftdpdkintel.sock: connect: connection refused"} Jul 22 04:24:18 mxq949065g hyperkube[4018]: W0722 04:24:18.394469 4018 clientconn.go:1223] grpc: addrConn.createTransport failed to connect to {/var/lib/kubelet/plugins_registry/openshift.io_sriovrightdpdkintelx710.sock <nil> 0 <nil>}. Err :connection error: desc = "transport: Error while dialing dial unix /var/lib/kubelet/plugins_registry/openshift.io_sriovrightdpdkintelx710.sock: connect: connection refused". Reconnecting... Jul 22 04:24:18 mxq949065g hyperkube[4018]: I0722 04:24:18.394533 4018 balancer_conn_wrappers.go:78] pickfirstBalancer: HandleSubConnStateChange: 0xc0021ae090, {TRANSIENT_FAILURE connection error: desc = "transport: Error while dialing dial unix /var/lib/kubelet/plugins_registry/openshift.io_sriovrightdpdkintelx710.sock: connect: connection refused"} Jul 22 04:24:19 mxq949065g hyperkube[4018]: E0722 04:24:19.175800 4018 goroutinemap.go:150] Operation for "/var/lib/kubelet/plugins_registry/openshift.io_sriovleftdpdkintelx710.sock" failed. No retries permitted until 2021-07-22 04:26:21.175779025 +0000 UTC m=+461169.579680555 (durationBeforeRetry 2m2s). Error: "RegisterPlugin error -- dial failed at socket /var/lib/kubelet/plugins_registry/openshift.io_sriovleftdpdkintelx710.sock, err: failed to dial socket /var/lib/kubelet/plugins_registry/openshift.io_sriovleftdpdkintelx710.sock, err: context deadline exceeded" Jul 22 04:24:19 mxq949065g hyperkube[4018]: E0722 04:24:19.175874 4018 goroutinemap.go:150] Operation for "/var/lib/kubelet/plugins_registry/openshift.io_sriovrightdpdkintel.sock" failed. No retries permitted until 2021-07-22 04:26:21.175856469 +0000 UTC m=+461169.579758035 (durationBeforeRetry 2m2s). Error: "RegisterPlugin error -- dial failed at socket /var/lib/kubelet/plugins_registry/openshift.io_sriovrightdpdkintel.sock, err: failed to dial socket /var/lib/kubelet/plugins_registry/openshift.io_sriovrightdpdkintel.sock, err: context deadline exceeded" Jul 22 04:24:19 mxq949065g hyperkube[4018]: E0722 04:24:19.175906 4018 goroutinemap.go:150] Operation for "/var/lib/kubelet/plugins_registry/openshift.io_sriovleftdpdkintel.sock" failed. No retries permitted until 2021-07-22 04:26:21.1758855 +0000 UTC m=+461169.579787013 (durationBeforeRetry 2m2s). 
Error: "RegisterPlugin error -- dial failed at socket /var/lib/kubelet/plugins_registry/openshift.io_sriovleftdpdkintel.sock, err: failed to dial socket /var/lib/kubelet/plugins_registry/openshift.io_sriovleftdpdkintel.sock, err: context deadline exceeded" Jul 22 04:24:19 mxq949065g hyperkube[4018]: E0722 04:24:19.175918 4018 goroutinemap.go:150] Operation for "/var/lib/kubelet/plugins_registry/openshift.io_sriovrightdpdkintelx710.sock" failed. No retries permitted until 2021-07-22 04:26:21.175910779 +0000 UTC m=+461169.579812292 (durationBeforeRetry 2m2s). Error: "RegisterPlugin error -- dial failed at socket /var/lib/kubelet/plugins_registry/openshift.io_sriovrightdpdkintelx710.sock, err: failed to dial socket /var/lib/kubelet/plugins_registry/openshift.io_sriovrightdpdkintelx710.sock, err: context deadline exceeded" Jul 22 04:24:23 mxq949065g hyperkube[4018]: I0722 04:24:23.381969 4018 factory.go:209] Factory "systemd" can handle container "/system.slice/var-lib-kubelet-pods-8f5e0a96\\x2dece8\\x2d4277\\x2db76e\\x2d2c2791938e2a-volumes-kubernetes.io\\x7esecret-sriov\\x2dcni\\x2dtoken\\x2dj6whh.mount", but ignoring. Jul 22 04:24:23 mxq949065g hyperkube[4018]: I0722 04:24:23.381993 4018 factory.go:209] Factory "systemd" can handle container "/system.slice/var-lib-kubelet-pods-df977e9c\\x2d4911\\x2d405c\\x2d89fd\\x2d15bac7d6e118-volumes-kubernetes.io\\x7esecret-sriov\\x2ddevice\\x2dplugin\\x2dtoken\\x2dfdrx9.mount", but ignoring. Jul 22 04:24:23 mxq949065g hyperkube[4018]: I0722 04:24:23.382745 4018 factory.go:209] Factory "systemd" can handle container "/system.slice/var-lib-kubelet-pods-4fe50017\\x2d47b5\\x2d46e9\\x2d822c\\x2d2f7989352eb1-volumes-kubernetes.io\\x7esecret-sriov\\x2dnetwork\\x2dconfig\\x2ddaemon\\x2dtoken\\x2djf9s2.mount", but ignoring. 
Jul 22 04:24:28 mxq949065g hyperkube[4018]: I0722 04:24:28.354934 4018 kubelet_pods.go:1486] Generating status for "sriov-device-plugin-wtflv_openshift-sriov-network-operator(df977e9c-4911-405c-89fd-15bac7d6e118)" Jul 22 04:24:28 mxq949065g hyperkube[4018]: I0722 04:24:28.355053 4018 status_manager.go:429] Ignoring same status for pod "sriov-device-plugin-wtflv_openshift-sriov-network-operator(df977e9c-4911-405c-89fd-15bac7d6e118)", status: {Phase:Running Conditions:[{Type:Initialized Status:True LastProbeTime:0001-01-01 00:00:00 +0000 UTC LastTransitionTime:2021-07-16 20:21:31 +0000 UTC Reason: Message:} {Type:Ready Status:True LastProbeTime:0001-01-01 00:00:00 +0000 UTC LastTransitionTime:2021-07-16 20:21:32 +0000 UTC Reason: Message:} {Type:ContainersReady Status:True LastProbeTime:0001-01-01 00:00:00 +0000 UTC LastTransitionTime:2021-07-16 20:21:32 +0000 UTC Reason: Message:} {Type:PodScheduled Status:True LastProbeTime:0001-01-01 00:00:00 +0000 UTC LastTransitionTime:2021-07-16 20:21:31 +0000 UTC Reason: Message:}] Message: Reason: NominatedNodeName: HostIP:172.18.243.53 PodIP:172.18.243.53 PodIPs:[{IP:172.18.243.53}] StartTime:2021-07-16 20:21:31 +0000 UTC InitContainerStatuses:[] ContainerStatuses:[{Name:sriov-device-plugin State:{Waiting:nil Running:&ContainerStateRunning{StartedAt:2021-07-16 20:21:32 +0000 UTC,} Terminated:nil} LastTerminationState:{Waiting:nil Running:nil Terminated:nil} Ready:true RestartCount:0 Image:registry.redhat.io/openshift4/ose-sriov-network-device-plugin@sha256:04d49ffe894ff992626d17f32688c6ab3e17b41ae5b7eebabf33a11cfffc01f1 ImageID:registry.redhat.io/openshift4/ose-sriov-network-device-plugin@sha256:04d49ffe894ff992626d17f32688c6ab3e17b41ae5b7eebabf33a11cfffc01f1 ContainerID:cri-o://0ac4492d2cbd2e3f2910a8714bef76752da37062b4b2f96190fac49815924eda Started:0xc004386039}] QOSClass:BestEffort EphemeralContainerStatuses:[]} Jul 22 04:24:28 mxq949065g hyperkube[4018]: I0722 04:24:28.355233 4018 volume_manager.go:372] Waiting for volumes to attach and mount for pod "sriov-device-plugin-wtflv_openshift-sriov-network-operator(df977e9c-4911-405c-89fd-15bac7d6e118)" Jul 22 04:24:28 mxq949065g hyperkube[4018]: I0722 04:24:28.355275 4018 volume_manager.go:403] All volumes are attached and mounted for pod "sriov-device-plugin-wtflv_openshift-sriov-network-operator(df977e9c-4911-405c-89fd-15bac7d6e118)" Jul 22 04:24:28 mxq949065g hyperkube[4018]: I0722 04:24:28.355424 4018 kuberuntime_manager.go:664] computePodActions got {KillPod:false CreateSandbox:false SandboxID:49284991864db3710fd5a75e235afe1efe0152900d511945e562d4eea2d0c970 Attempt:0 NextInitContainerToStart:nil ContainersToStart:[] ContainersToKill:map[] EphemeralContainersToStart:[]} for pod "sriov-device-plugin-wtflv_openshift-sriov-network-operator(df977e9c-4911-405c-89fd-15bac7d6e118)" Jul 22 04:24:28 mxq949065g hyperkube[4018]: I0722 04:24:28.385900 4018 secret.go:183] Setting up volume sriov-device-plugin-token-fdrx9 for pod df977e9c-4911-405c-89fd-15bac7d6e118 at /var/lib/kubelet/pods/df977e9c-4911-405c-89fd-15bac7d6e118/volumes/kubernetes.io~secret/sriov-device-plugin-token-fdrx9 Jul 22 04:24:28 mxq949065g hyperkube[4018]: I0722 04:24:28.385953 4018 configmap.go:212] Received configMap openshift-sriov-network-operator/device-plugin-config containing (17) pieces of data, 4224 total bytes Jul 22 04:24:28 mxq949065g hyperkube[4018]: I0722 04:24:28.385985 4018 secret.go:207] Received secret openshift-sriov-network-operator/sriov-device-plugin-token-fdrx9 containing (4) pieces of data, 16817 total bytes Jul 
22 04:24:30 mxq949065g hyperkube[4018]: I0722 04:24:30.355024 4018 kubelet_pods.go:1486] Generating status for "sriov-network-config-daemon-pkknz_openshift-sriov-network-operator(4fe50017-47b5-46e9-822c-2f7989352eb1)" Jul 22 04:24:30 mxq949065g hyperkube[4018]: I0722 04:24:30.355107 4018 status_manager.go:429] Ignoring same status for pod "sriov-network-config-daemon-pkknz_openshift-sriov-network-operator(4fe50017-47b5-46e9-822c-2f7989352eb1)", status: {Phase:Running Conditions:[{Type:Initialized Status:True LastProbeTime:0001-01-01 00:00:00 +0000 UTC LastTransitionTime:2021-06-15 21:31:35 +0000 UTC Reason: Message:} {Type:Ready Status:True LastProbeTime:0001-01-01 00:00:00 +0000 UTC LastTransitionTime:2021-07-16 20:20:26 +0000 UTC Reason: Message:} {Type:ContainersReady Status:True LastProbeTime:0001-01-01 00:00:00 +0000 UTC LastTransitionTime:2021-07-16 20:20:26 +0000 UTC Reason: Message:} {Type:PodScheduled Status:True LastProbeTime:0001-01-01 00:00:00 +0000 UTC LastTransitionTime:2021-06-15 21:31:35 +0000 UTC Reason: Message:}] Message: Reason: NominatedNodeName: HostIP:172.18.243.53 PodIP:172.18.243.53 PodIPs:[{IP:172.18.243.53}] StartTime:2021-06-15 21:31:35 +0000 UTC InitContainerStatuses:[] ContainerStatuses:[{Name:sriov-network-config-daemon State:{Waiting:nil Running:&ContainerStateRunning{StartedAt:2021-07-16 20:20:25 +0000 UTC,} Terminated:nil} LastTerminationState:{Waiting:nil Running:nil Terminated:nil} Ready:true RestartCount:0 Image:registry.redhat.io/openshift4/ose-sriov-network-config-daemon@sha256:3142b5cadfc8c99e8c5f2b67fc06bc85a5ba69c162fe75521b197d7eeea3d23b ImageID:registry.redhat.io/openshift4/ose-sriov-network-config-daemon@sha256:3142b5cadfc8c99e8c5f2b67fc06bc85a5ba69c162fe75521b197d7eeea3d23b ContainerID:cri-o://ef9b47bd5b1b798ea6526ece625b51a2dd8daaba37e2eb0793037f00b4f6b5a4 Started:0xc004103cbe}] QOSClass:BestEffort EphemeralContainerStatuses:[]} Jul 22 04:24:30 mxq949065g hyperkube[4018]: I0722 04:24:30.355242 4018 volume_manager.go:372] Waiting for volumes to attach and mount for pod "sriov-network-config-daemon-pkknz_openshift-sriov-network-operator(4fe50017-47b5-46e9-822c-2f7989352eb1)" Jul 22 04:24:30 mxq949065g hyperkube[4018]: I0722 04:24:30.355282 4018 volume_manager.go:403] All volumes are attached and mounted for pod "sriov-network-config-daemon-pkknz_openshift-sriov-network-operator(4fe50017-47b5-46e9-822c-2f7989352eb1)" Jul 22 04:24:30 mxq949065g hyperkube[4018]: I0722 04:24:30.355392 4018 kuberuntime_manager.go:664] computePodActions got {KillPod:false CreateSandbox:false SandboxID:4759c4dd03b4753faaa1d1d84b0a9a48bb56b6df3f6e9bb91d107dd39d3019f1 Attempt:0 NextInitContainerToStart:nil ContainersToStart:[] ContainersToKill:map[] EphemeralContainersToStart:[]} for pod "sriov-network-config-daemon-pkknz_openshift-sriov-network-operator(4fe50017-47b5-46e9-822c-2f7989352eb1)" Jul 22 04:24:30 mxq949065g hyperkube[4018]: I0722 04:24:30.391622 4018 secret.go:183] Setting up volume sriov-network-config-daemon-token-jf9s2 for pod 4fe50017-47b5-46e9-822c-2f7989352eb1 at /var/lib/kubelet/pods/4fe50017-47b5-46e9-822c-2f7989352eb1/volumes/kubernetes.io~secret/sriov-network-config-daemon-token-jf9s2 Jul 22 04:24:30 mxq949065g hyperkube[4018]: I0722 04:24:30.391656 4018 secret.go:207] Received secret openshift-sriov-network-operator/sriov-network-config-daemon-token-jf9s2 containing (4) pieces of data, 16849 total bytes Jul 22 04:25:23 mxq949065g hyperkube[4018]: I0722 04:25:23.389469 4018 factory.go:209] Factory "systemd" can handle container 
"/system.slice/var-lib-kubelet-pods-4fe50017\\x2d47b5\\x2d46e9\\x2d822c\\x2d2f7989352eb1-volumes-kubernetes.io\\x7esecret-sriov\\x2dnetwork\\x2dconfig\\x2ddaemon\\x2dtoken\\x2djf9s2.mount", but ignoring. Jul 22 04:25:23 mxq949065g hyperkube[4018]: I0722 04:25:23.389581 4018 factory.go:209] Factory "systemd" can handle container "/system.slice/var-lib-kubelet-pods-df977e9c\\x2d4911\\x2d405c\\x2d89fd\\x2d15bac7d6e118-volumes-kubernetes.io\\x7esecret-sriov\\x2ddevice\\x2dplugin\\x2dtoken\\x2dfdrx9.mount", but ignoring. Jul 22 04:25:23 mxq949065g hyperkube[4018]: I0722 04:25:23.390372 4018 factory.go:209] Factory "systemd" can handle container "/system.slice/var-lib-kubelet-pods-8f5e0a96\\x2dece8\\x2d4277\\x2db76e\\x2d2c2791938e2a-volumes-kubernetes.io\\x7esecret-sriov\\x2dcni\\x2dtoken\\x2dj6whh.mount", but ignoring. Jul 22 04:25:33 mxq949065g hyperkube[4018]: I0722 04:25:33.355168 4018 kubelet_pods.go:1486] Generating status for "sriov-cni-9z9v7_openshift-sriov-network-operator(8f5e0a96-ece8-4277-b76e-2c2791938e2a)" Jul 22 04:25:33 mxq949065g hyperkube[4018]: I0722 04:25:33.355558 4018 status_manager.go:429] Ignoring same status for pod "sriov-cni-9z9v7_openshift-sriov-network-operator(8f5e0a96-ece8-4277-b76e-2c2791938e2a)", status: {Phase:Running Conditions:[{Type:Initialized Status:True LastProbeTime:0001-01-01 00:00:00 +0000 UTC LastTransitionTime:2021-06-15 20:58:35 +0000 UTC Reason: Message:} {Type:Ready Status:True LastProbeTime:0001-01-01 00:00:00 +0000 UTC LastTransitionTime:2021-07-16 20:20:38 +0000 UTC Reason: Message:} {Type:ContainersReady Status:True LastProbeTime:0001-01-01 00:00:00 +0000 UTC LastTransitionTime:2021-07-16 20:20:38 +0000 UTC Reason: Message:} {Type:PodScheduled Status:True LastProbeTime:0001-01-01 00:00:00 +0000 UTC LastTransitionTime:2021-06-15 20:58:35 +0000 UTC Reason: Message:}] Message: Reason: NominatedNodeName: HostIP:172.18.243.53 PodIP:1.1.98.144 PodIPs:[{IP:1.1.98.144}] StartTime:2021-06-15 20:58:35 +0000 UTC InitContainerStatuses:[] ContainerStatuses:[{Name:sriov-cni State:{Waiting:nil Running:&ContainerStateRunning{StartedAt:2021-07-16 20:20:37 +0000 UTC,} Terminated:nil} LastTerminationState:{Waiting:nil Running:nil Terminated:nil} Ready:true RestartCount:0 Image:registry.redhat.io/openshift4/ose-sriov-cni@sha256:3233715e4309c176955fcf5fbc458d2599bb89691d74e7e69aea4312890144a1 ImageID:registry.redhat.io/openshift4/ose-sriov-cni@sha256:3233715e4309c176955fcf5fbc458d2599bb89691d74e7e69aea4312890144a1 ContainerID:cri-o://c0d441730f8a89104d0935aedfcbe0c1a2444cfc88088a021775dfa5a35f7b53 Started:0xc00190a736} {Name:sriov-infiniband-cni State:{Waiting:nil Running:&ContainerStateRunning{StartedAt:2021-07-16 20:20:38 +0000 UTC,} Terminated:nil} LastTerminationState:{Waiting:nil Running:nil Terminated:nil} Ready:true RestartCount:0 Image:registry.redhat.io/openshift4/ose-sriov-infiniband-cni@sha256:89043b617d0d88f578dc134f716134507e54fe0e706f73c7cba5f9a8cd40a431 ImageID:registry.redhat.io/openshift4/ose-sriov-infiniband-cni@sha256:89043b617d0d88f578dc134f716134507e54fe0e706f73c7cba5f9a8cd40a431 ContainerID:cri-o://afe470fa487b638cec02a314756cbca79d7f392e8372d7d93bcad9c43b4c6d79 Started:0xc00190a737}] QOSClass:BestEffort EphemeralContainerStatuses:[]} Jul 22 04:25:33 mxq949065g hyperkube[4018]: I0722 04:25:33.355736 4018 volume_manager.go:372] Waiting for volumes to attach and mount for pod "sriov-cni-9z9v7_openshift-sriov-network-operator(8f5e0a96-ece8-4277-b76e-2c2791938e2a)" Jul 22 04:25:33 mxq949065g hyperkube[4018]: I0722 04:25:33.355908 4018 
kuberuntime_manager.go:664] computePodActions got {KillPod:false CreateSandbox:false SandboxID:264241a135254296635ab62ba95952e3b43a652cd2c25ad0eb3f60858b9afc4d Attempt:0 NextInitContainerToStart:nil ContainersToStart:[] ContainersToKill:map[] EphemeralContainersToStart:[]} for pod "sriov-cni-9z9v7_openshift-sriov-network-operator(8f5e0a96-ece8-4277-b76e-2c2791938e2a)" Jul 22 04:25:33 mxq949065g hyperkube[4018]: I0722 04:25:33.527238 4018 secret.go:183] Setting up volume sriov-cni-token-j6whh for pod 8f5e0a96-ece8-4277-b76e-2c2791938e2a at /var/lib/kubelet/pods/8f5e0a96-ece8-4277-b76e-2c2791938e2a/volumes/kubernetes.io~secret/sriov-cni-token-j6whh Jul 22 04:25:33 mxq949065g hyperkube[4018]: I0722 04:25:33.527283 4018 secret.go:207] Received secret openshift-sriov-network-operator/sriov-cni-token-j6whh containing (4) pieces of data, 16777 total bytes Jul 22 04:25:43 mxq949065g hyperkube[4018]: I0722 04:25:43.355019 4018 kubelet_pods.go:1486] Generating status for "sriov-device-plugin-wtflv_openshift-sriov-network-operator(df977e9c-4911-405c-89fd-15bac7d6e118)" Jul 22 04:25:43 mxq949065g hyperkube[4018]: I0722 04:25:43.355692 4018 status_manager.go:429] Ignoring same status for pod "sriov-device-plugin-wtflv_openshift-sriov-network-operator(df977e9c-4911-405c-89fd-15bac7d6e118)", status: {Phase:Running Conditions:[{Type:Initialized Status:True LastProbeTime:0001-01-01 00:00:00 +0000 UTC LastTransitionTime:2021-07-16 20:21:31 +0000 UTC Reason: Message:} {Type:Ready Status:True LastProbeTime:0001-01-01 00:00:00 +0000 UTC LastTransitionTime:2021-07-16 20:21:32 +0000 UTC Reason: Message:} {Type:ContainersReady Status:True LastProbeTime:0001-01-01 00:00:00 +0000 UTC LastTransitionTime:2021-07-16 20:21:32 +0000 UTC Reason: Message:} {Type:PodScheduled Status:True LastProbeTime:0001-01-01 00:00:00 +0000 UTC LastTransitionTime:2021-07-16 20:21:31 +0000 UTC Reason: Message:}] Message: Reason: NominatedNodeName: HostIP:172.18.243.53 PodIP:172.18.243.53 PodIPs:[{IP:172.18.243.53}] StartTime:2021-07-16 20:21:31 +0000 UTC InitContainerStatuses:[] ContainerStatuses:[{Name:sriov-device-plugin State:{Waiting:nil Running:&ContainerStateRunning{StartedAt:2021-07-16 20:21:32 +0000 UTC,} Terminated:nil} LastTerminationState:{Waiting:nil Running:nil Terminated:nil} Ready:true RestartCount:0 Image:registry.redhat.io/openshift4/ose-sriov-network-device-plugin@sha256:04d49ffe894ff992626d17f32688c6ab3e17b41ae5b7eebabf33a11cfffc01f1 ImageID:registry.redhat.io/openshift4/ose-sriov-network-device-plugin@sha256:04d49ffe894ff992626d17f32688c6ab3e17b41ae5b7eebabf33a11cfffc01f1 ContainerID:cri-o://0ac4492d2cbd2e3f2910a8714bef76752da37062b4b2f96190fac49815924eda Started:0xc00410f00a}] QOSClass:BestEffort EphemeralContainerStatuses:[]} Jul 22 04:25:43 mxq949065g hyperkube[4018]: I0722 04:25:43.355856 4018 volume_manager.go:372] Waiting for volumes to attach and mount for pod "sriov-device-plugin-wtflv_openshift-sriov-network-operator(df977e9c-4911-405c-89fd-15bac7d6e118)" Jul 22 04:25:43 mxq949065g hyperkube[4018]: I0722 04:25:43.355923 4018 volume_manager.go:403] All volumes are attached and mounted for pod "sriov-device-plugin-wtflv_openshift-sriov-network-operator(df977e9c-4911-405c-89fd-15bac7d6e118)" Jul 22 04:25:43 mxq949065g hyperkube[4018]: I0722 04:25:43.356046 4018 kuberuntime_manager.go:664] computePodActions got {KillPod:false CreateSandbox:false SandboxID:49284991864db3710fd5a75e235afe1efe0152900d511945e562d4eea2d0c970 Attempt:0 NextInitContainerToStart:nil ContainersToStart:[] ContainersToKill:map[] 
EphemeralContainersToStart:[]} for pod "sriov-device-plugin-wtflv_openshift-sriov-network-operator(df977e9c-4911-405c-89fd-15bac7d6e118)" Jul 22 04:25:43 mxq949065g hyperkube[4018]: I0722 04:25:43.464253 4018 secret.go:183] Setting up volume sriov-device-plugin-token-fdrx9 for pod df977e9c-4911-405c-89fd-15bac7d6e118 at /var/lib/kubelet/pods/df977e9c-4911-405c-89fd-15bac7d6e118/volumes/kubernetes.io~secret/sriov-device-plugin-token-fdrx9 Jul 22 04:25:43 mxq949065g hyperkube[4018]: I0722 04:25:43.464343 4018 secret.go:207] Received secret openshift-sriov-network-operator/sriov-device-plugin-token-fdrx9 containing (4) pieces of data, 16817 total bytes Jul 22 04:25:43 mxq949065g hyperkube[4018]: I0722 04:25:43.464373 4018 configmap.go:212] Received configMap openshift-sriov-network-operator/device-plugin-config containing (17) pieces of data, 4224 total bytes Jul 22 04:25:51 mxq949065g hyperkube[4018]: I0722 04:25:51.355222 4018 kubelet_pods.go:1486] Generating status for "sriov-network-config-daemon-pkknz_openshift-sriov-network-operator(4fe50017-47b5-46e9-822c-2f7989352eb1)" Jul 22 04:25:51 mxq949065g hyperkube[4018]: I0722 04:25:51.355311 4018 status_manager.go:429] Ignoring same status for pod "sriov-network-config-daemon-pkknz_openshift-sriov-network-operator(4fe50017-47b5-46e9-822c-2f7989352eb1)", status: {Phase:Running Conditions:[{Type:Initialized Status:True LastProbeTime:0001-01-01 00:00:00 +0000 UTC LastTransitionTime:2021-06-15 21:31:35 +0000 UTC Reason: Message:} {Type:Ready Status:True LastProbeTime:0001-01-01 00:00:00 +0000 UTC LastTransitionTime:2021-07-16 20:20:26 +0000 UTC Reason: Message:} {Type:ContainersReady Status:True LastProbeTime:0001-01-01 00:00:00 +0000 UTC LastTransitionTime:2021-07-16 20:20:26 +0000 UTC Reason: Message:} {Type:PodScheduled Status:True LastProbeTime:0001-01-01 00:00:00 +0000 UTC LastTransitionTime:2021-06-15 21:31:35 +0000 UTC Reason: Message:}] Message: Reason: NominatedNodeName: HostIP:172.18.243.53 PodIP:172.18.243.53 PodIPs:[{IP:172.18.243.53}] StartTime:2021-06-15 21:31:35 +0000 UTC InitContainerStatuses:[] ContainerStatuses:[{Name:sriov-network-config-daemon State:{Waiting:nil Running:&ContainerStateRunning{StartedAt:2021-07-16 20:20:25 +0000 UTC,} Terminated:nil} LastTerminationState:{Waiting:nil Running:nil Terminated:nil} Ready:true RestartCount:0 Image:registry.redhat.io/openshift4/ose-sriov-network-config-daemon@sha256:3142b5cadfc8c99e8c5f2b67fc06bc85a5ba69c162fe75521b197d7eeea3d23b ImageID:registry.redhat.io/openshift4/ose-sriov-network-config-daemon@sha256:3142b5cadfc8c99e8c5f2b67fc06bc85a5ba69c162fe75521b197d7eeea3d23b ContainerID:cri-o://ef9b47bd5b1b798ea6526ece625b51a2dd8daaba37e2eb0793037f00b4f6b5a4 Started:0xc001f4cd8e}] QOSClass:BestEffort EphemeralContainerStatuses:[]} Jul 22 04:25:51 mxq949065g hyperkube[4018]: I0722 04:25:51.355464 4018 volume_manager.go:372] Waiting for volumes to attach and mount for pod "sriov-network-config-daemon-pkknz_openshift-sriov-network-operator(4fe50017-47b5-46e9-822c-2f7989352eb1)" Jul 22 04:25:51 mxq949065g hyperkube[4018]: I0722 04:25:51.355502 4018 volume_manager.go:403] All volumes are attached and mounted for pod "sriov-network-config-daemon-pkknz_openshift-sriov-network-operator(4fe50017-47b5-46e9-822c-2f7989352eb1)" Jul 22 04:25:51 mxq949065g hyperkube[4018]: I0722 04:25:51.355623 4018 kuberuntime_manager.go:664] computePodActions got {KillPod:false CreateSandbox:false SandboxID:4759c4dd03b4753faaa1d1d84b0a9a48bb56b6df3f6e9bb91d107dd39d3019f1 Attempt:0 NextInitContainerToStart:nil 
ContainersToStart:[] ContainersToKill:map[] EphemeralContainersToStart:[]} for pod "sriov-network-config-daemon-pkknz_openshift-sriov-network-operator(4fe50017-47b5-46e9-822c-2f7989352eb1)" Jul 22 04:25:51 mxq949065g hyperkube[4018]: I0722 04:25:51.492707 4018 secret.go:183] Setting up volume sriov-network-config-daemon-token-jf9s2 for pod 4fe50017-47b5-46e9-822c-2f7989352eb1 at /var/lib/kubelet/pods/4fe50017-47b5-46e9-822c-2f7989352eb1/volumes/kubernetes.io~secret/sriov-network-config-daemon-token-jf9s2 Jul 22 04:25:51 mxq949065g hyperkube[4018]: I0722 04:25:51.492744 4018 secret.go:207] Received secret openshift-sriov-network-operator/sriov-network-config-daemon-token-jf9s2 containing (4) pieces of data, 16849 total bytes Jul 22 04:26:21 mxq949065g hyperkube[4018]: I0722 04:26:21.208528 4018 reconciler.go:156] operationExecutor.RegisterPlugin started for plugin at "/var/lib/kubelet/plugins_registry/openshift.io_sriovleftdpdkintel.sock" (plugin details: &{/var/lib/kubelet/plugins_registry/openshift.io_sriovleftdpdkintel.sock 2021-07-16 20:20:23.377127926 +0000 UTC m=+11.781029437 <nil> }) Jul 22 04:26:21 mxq949065g hyperkube[4018]: I0722 04:26:21.208553 4018 reconciler.go:156] operationExecutor.RegisterPlugin started for plugin at "/var/lib/kubelet/plugins_registry/openshift.io_sriovrightdpdkintelx710.sock" (plugin details: &{/var/lib/kubelet/plugins_registry/openshift.io_sriovrightdpdkintelx710.sock 2021-07-16 20:20:23.37781788 +0000 UTC m=+11.781719392 <nil> }) Jul 22 04:26:21 mxq949065g hyperkube[4018]: I0722 04:26:21.208566 4018 reconciler.go:156] operationExecutor.RegisterPlugin started for plugin at "/var/lib/kubelet/plugins_registry/openshift.io_sriovleftdpdkintelx710.sock" (plugin details: &{/var/lib/kubelet/plugins_registry/openshift.io_sriovleftdpdkintelx710.sock 2021-07-16 20:20:23.377568698 +0000 UTC m=+11.781470210 <nil> }) Jul 22 04:26:21 mxq949065g hyperkube[4018]: I0722 04:26:21.208579 4018 reconciler.go:156] operationExecutor.RegisterPlugin started for plugin at "/var/lib/kubelet/plugins_registry/openshift.io_sriovrightdpdkintel.sock" (plugin details: &{/var/lib/kubelet/plugins_registry/openshift.io_sriovrightdpdkintel.sock 2021-07-16 20:20:23.377807904 +0000 UTC m=+11.781709415 <nil> }) Jul 22 04:26:21 mxq949065g hyperkube[4018]: I0722 04:26:21.208669 4018 passthrough.go:48] ccResolverWrapper: sending update to cc: {[{/var/lib/kubelet/plugins_registry/openshift.io_sriovleftdpdkintel.sock <nil> 0 <nil>}] <nil> <nil>} Jul 22 04:26:21 mxq949065g hyperkube[4018]: I0722 04:26:21.208707 4018 passthrough.go:48] ccResolverWrapper: sending update to cc: {[{/var/lib/kubelet/plugins_registry/openshift.io_sriovrightdpdkintelx710.sock <nil> 0 <nil>}] <nil> <nil>} Jul 22 04:26:21 mxq949065g hyperkube[4018]: W0722 04:26:21.208752 4018 clientconn.go:1223] grpc: addrConn.createTransport failed to connect to {/var/lib/kubelet/plugins_registry/openshift.io_sriovleftdpdkintel.sock <nil> 0 <nil>}. Err :connection error: desc = "transport: Error while dialing dial unix /var/lib/kubelet/plugins_registry/openshift.io_sriovleftdpdkintel.sock: connect: connection refused". Reconnecting... 
Jul 22 04:26:21 mxq949065g hyperkube[4018]: I0722 04:26:21.208669 4018 passthrough.go:48] ccResolverWrapper: sending update to cc: {[{/var/lib/kubelet/plugins_registry/openshift.io_sriovleftdpdkintelx710.sock <nil> 0 <nil>}] <nil> <nil>} Jul 22 04:26:21 mxq949065g hyperkube[4018]: I0722 04:26:21.208629 4018 passthrough.go:48] ccResolverWrapper: sending update to cc: {[{/var/lib/kubelet/plugins_registry/openshift.io_sriovrightdpdkintel.sock <nil> 0 <nil>}] <nil> <nil>} Jul 22 04:26:21 mxq949065g hyperkube[4018]: W0722 04:26:21.208833 4018 clientconn.go:1223] grpc: addrConn.createTransport failed to connect to {/var/lib/kubelet/plugins_registry/openshift.io_sriovrightdpdkintel.sock <nil> 0 <nil>}. Err :connection error: desc = "transport: Error while dialing dial unix /var/lib/kubelet/plugins_registry/openshift.io_sriovrightdpdkintel.sock: connect: connection refused". Reconnecting... Jul 22 04:26:21 mxq949065g hyperkube[4018]: W0722 04:26:21.208772 4018 clientconn.go:1223] grpc: addrConn.createTransport failed to connect to {/var/lib/kubelet/plugins_registry/openshift.io_sriovrightdpdkintelx710.sock <nil> 0 <nil>}. Err :connection error: desc = "transport: Error while dialing dial unix /var/lib/kubelet/plugins_registry/openshift.io_sriovrightdpdkintelx710.sock: connect: connection refused". Reconnecting... Jul 22 04:26:21 mxq949065g hyperkube[4018]: I0722 04:26:21.208858 4018 balancer_conn_wrappers.go:78] pickfirstBalancer: HandleSubConnStateChange: 0xc00130c640, {TRANSIENT_FAILURE connection error: desc = "transport: Error while dialing dial unix /var/lib/kubelet/plugins_registry/openshift.io_sriovrightdpdkintelx710.sock: connect: connection refused"}
> It seems the kernel failed to allocate memory for the PCI device with
> PCI ID 0000:37:02.5. Because the allocation failed, the corresponding entry
> could not be created in the sys filesystem. During the init stage every
> application tries to read the VF details from the sys filesystem and bind
> to it.

How did you know it failed to allocate the memory? Do we have the log?

> Jul 23 15:04:18 mxq949065g crio[3920]: time="2021-07-23 15:04:18.868024395Z"
> level=error msg="Error adding network:
> [w6017-c1-sl01-upflab1/eric-pc-up-data-plane-c7949486b-4jkx8:leftnuma0]:
> error adding container to network \"leftnuma0\": SRIOV-CNI failed to load
> netconf: LoadConf(): failed to get VF information: \"lstat
> /sys/bus/pci/devices/0000:37:02.5/physfn/net: no such file or directory\""

The above failure means that sriov-cni tried to get the PF name of the VF (0000:37:02.5), but the sysfs path (/sys/bus/pci/devices/0000:37:02.5/physfn/net) doesn't exist.

Did this happen the first time an SRIOV pod was created, or has the SRIOV pod been deleted and recreated more than once? If it ever worked, it may be that the SRIOV VF was not released gracefully, which results in a failure on the next pod creation.

Also, could you check whether the PF of the VF (0000:37:02.5) exists in the host namespace on the worker node? (Log in to the node and run "ip link show ens2f0".)

What is the deviceType in the SriovNetworkNodePolicy?
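A quick way to pull the deviceType from all applied policies; a sketch, with the namespace taken from this report:

~~~~
# List every SriovNetworkNodePolicy and its deviceType
oc get sriovnetworknodepolicy -n openshift-sriov-network-operator \
  -o custom-columns=NAME:.metadata.name,DEVICETYPE:.spec.deviceType
~~~~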
==> How did you know it failed to allocate the memory? Do we have the log?

Jul 23 15:04:00 mxq949065g hyperkube[3980]: I0723 15:04:00.314229 3980 manager.go:647] Found pre-allocated devices for resource openshift.io/leftnuma0x710 container "eric-pc-up-data-plane" in Pod "946025e3-be22-4d15-b0d1-f3143973beab": [0000:37:02.5]

==> Did this happen the first time an SRIOV pod was created, or has the SRIOV pod been deleted and recreated more than once? If it ever worked, it may be that the SRIOV VF was not released gracefully, which results in a failure on the next pod creation.

The issue happened multiple times.

**
apiVersion: sriovnetwork.openshift.io/v1
kind: SriovNetworkNodePolicy
metadata:
  labels:
    argocd.argoproj.io/instance: sriov-policy
  name: leftnuma0x710
  namespace: openshift-sriov-network-operator
spec:
  deviceType: vfio-pci    <--------
  isRdma: false
  mtu: 9000
  nicSelector:
    deviceID: "1572"
    pfNames:
    - ens2f0              <----------
    vendor: "8086"
  nodeSelector:
    feature.node.kubernetes.io/network-sriov.capable: "true"
  numVfs: 8
  resourceName: leftnuma0x710
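Since deviceType is vfio-pci, the VFs in this pool should end up bound to the vfio-pci driver rather than iavf. One way to verify the binding on the worker node; a sketch using the VF address from this report:

~~~~
# The driver symlink shows which kernel driver the VF is currently bound to
readlink /sys/bus/pci/devices/0000:37:02.5/driver

# All PCI devices currently claimed by vfio-pci
ls /sys/bus/pci/drivers/vfio-pci/ | grep 0000:37
~~~~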
Any Update?
(In reply to Eswar Vadla from comment #4)
> ==> How did you know it failed to allocate the memory? Do we have the log?
> Jul 23 15:04:00 mxq949065g hyperkube[3980]: I0723 15:04:00.314229 3980
> manager.go:647] Found pre-allocated devices for resource
> openshift.io/leftnuma0x710 container "eric-pc-up-data-plane" in Pod
> "946025e3-be22-4d15-b0d1-f3143973beab": [0000:37:02.5]

This message is about device allocation; I don't see an obvious issue with it. I think kubelet tried to re-use the pre-allocated device to recover the SRIOV pod after a pod restart or node reboot.

> ==> Did this happen the first time an SRIOV pod was created, or has the
> SRIOV pod been deleted and recreated more than once? If it ever worked, it
> may be that the SRIOV VF was not released gracefully, which results in a
> failure on the next pod creation.
> The issue happened multiple times.
>
> **
> apiVersion: sriovnetwork.openshift.io/v1
> kind: SriovNetworkNodePolicy
> metadata:
>   labels:
>     argocd.argoproj.io/instance: sriov-policy
>   name: leftnuma0x710
>   namespace: openshift-sriov-network-operator
> spec:
>   deviceType: vfio-pci    <--------
>   isRdma: false
>   mtu: 9000
>   nicSelector:
>     deviceID: "1572"
>     pfNames:
>     - ens2f0              <----------

Could you get the following info on the target worker node?

Check if ens2f0 exists:
# ip link show ens2f0
# ethtool -i ens2f0

Check the network interfaces:
# ls /sys/class/net

Stop the pod creation and check the VF info:
# ls /sys/bus/pci/devices/0000:37:02.5/physfn/net/

I guess ens2f0 is the PF of the VF (0000:37:02.5); the failure in SRIOV pod creation indicates that this PF doesn't exist on the target worker node whereas it should. For example:

# ls /sys/bus/pci/devices/0000\:3b\:0a.2/physfn/net/
ens1f1

I used the output from a local environment where ens1f1 is the PF and 0000\:3b\:0a.2 is the VF PCI address.

Could you also get the sriov device plugin log on the target worker?

Find the sriov device plugin pod name for the target worker:
# oc get pods -n openshift-sriov-network-operator -o wide

Get the log:
# oc logs -f <sriov-network-device-plugin-name> -n openshift-sriov-network-operator
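If helpful, the "Found pre-allocated devices" message comes from kubelet's device manager, which records earlier allocations in its checkpoint file; comparing that record with what actually exists in sysfs can show whether kubelet is trying to reuse a VF that is no longer present. A sketch, assuming the default kubelet checkpoint location and the VF address from this report:

~~~~
# Pre-allocated devices recorded by kubelet's device manager (JSON)
cat /var/lib/kubelet/device-plugins/kubelet_internal_checkpoint

# Does the recorded VF still exist in sysfs?
ls -d /sys/bus/pci/devices/0000:37:02.5
~~~~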
Hi Zenghui,

Below is the requested information. The issue is intermittent, so it has nothing to do with non-existing devices. I can provide the information you're asking for, but it might not reflect the problem situation, because we had temporarily fixed the issue so the customer doesn't have to sit in a broken state before we have a solution. Please see my comments below.

Check if ens2f0 exists:

ip link show ens2f0
8: ens2f0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 9000 qdisc mq state UP mode DEFAULT group default qlen 1000
    link/ether 48:df:37:b2:78:d0 brd ff:ff:ff:ff:ff:ff
    vf 0 link/ether 26:9f:ec:1f:3f:72 brd ff:ff:ff:ff:ff:ff, spoof checking on, link-state auto, trust off
    vf 1 link/ether 72:37:1f:c6:b7:cf brd ff:ff:ff:ff:ff:ff, spoof checking on, link-state auto, trust off
    vf 2 link/ether 22:8b:85:ce:5a:a5 brd ff:ff:ff:ff:ff:ff, spoof checking on, link-state auto, trust off
    vf 3 link/ether 52:85:1f:62:e0:fd brd ff:ff:ff:ff:ff:ff, spoof checking on, link-state auto, trust off
    vf 4 link/ether 62:a0:aa:5e:a2:7f brd ff:ff:ff:ff:ff:ff, spoof checking on, link-state auto, trust off
    vf 5 link/ether 6e:fc:11:28:b7:f1 brd ff:ff:ff:ff:ff:ff, spoof checking on, link-state auto, trust off
    vf 6 link/ether 26:64:ea:5d:ba:b3 brd ff:ff:ff:ff:ff:ff, spoof checking on, link-state auto, trust off
    vf 7 link/ether 72:9d:3b:05:93:d2 brd ff:ff:ff:ff:ff:ff, spoof checking on, link-state auto, trust off

ethtool -i ens2f0
driver: i40e
version: 2.8.20-k
firmware-version: 10.5.5
expansion-rom-version:
bus-info: 0000:37:00.0
supports-statistics: yes
supports-test: yes
supports-eeprom-access: yes
supports-register-dump: yes
supports-priv-flags: yes

Check the network interfaces:

ls /sys/class/net
bond0 bond1.728 bond1.733 bond1.738 bond1.751 bond1.784 bond1.789 bond1.850 bond1.855 cali6c8e1dd90c6 egress.calico eno5 ens2f1 ens3f0v3 ens3f1 ens3f1v4 gre0
bond0.220 bond1.729 bond1.734 bond1.739 bond1.753 bond1.785 bond1.790 bond1.851 bonding_masters cali77c44e2f94a eno1 eno6 ens3f0 ens3f0v4 ens3f1v0 ens3f1v5 gretap0
bond1 bond1.730 bond1.735 bond1.740 bond1.754 bond1.786 bond1.791 bond1.852 cali2fd27eaea1f cali81281ea782c eno2 ens1f0 ens3f0v0 ens3f0v5 ens3f1v1 ens3f1v6 lo
bond1.726 bond1.731 bond1.736 bond1.741 bond1.782 bond1.787 bond1.792 bond1.853 cali43cdd2d7f48 cali993aa2b9bd3 eno3 ens1f1 ens3f0v1 ens3f0v6 ens3f1v2 ens3f1v7
bond1.727 bond1.732 bond1.737 bond1.750 bond1.783 bond1.788 bond1.849 bond1.854 cali61f51780f95 calid9e1f13482f eno4 ens2f0 ens3f0v2 ens3f0v7 ens3f1v3 erspan0

Stop the pod creation and check the VF info:

ls /sys/bus/pci/devices/0000:37:02.5/physfn/net/
ens2f0

> I guess ens2f0 is the PF of the VF (0000:37:02.5); the failure in SRIOV pod
> creation indicates that this PF doesn't exist on the target worker node
> whereas it should. For example:
> # ls /sys/bus/pci/devices/0000:3b:0a.2/physfn/net/
> ens1f1
> I used the output from a local environment where ens1f1 is the PF and
> 0000:3b:0a.2 is the VF PCI address.
> Could you also get the sriov device plugin log on the target worker?
> Find the sriov device plugin pod name for the target worker:
> oc get pods -n openshift-sriov-network-operator -o wide
> Get the log:
> oc logs -f <sriov-network-device-plugin-name> -n openshift-sriov-network-operator

Please see the attached log file: sriov-device-plugin-mxq949065g.log
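As a cross-check on the PF side (bus-info 0000:37:00.0 from the ethtool output above), the virtfn symlinks under the PF list every VF the kernel has created for it; a sketch:

~~~~
# Each virtfnN symlink under the PF points at one VF's PCI address
ls -l /sys/bus/pci/devices/0000:37:00.0/ | grep virtfn
~~~~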
Created attachment 1827133 [details] SRIOV logs
According to the logs from the sriov network device plugin pod, you have the same devices added to two separate pools; for example, VF "0000:37:02.0" is in the leftnuma0 pool and also in the leftnuma0x710 pool. Why do we need such a configuration?

I0726 20:44:46.858980 25 manager.go:116] Creating new ResourcePool: leftnuma0
I0726 20:44:46.858982 25 manager.go:117] DeviceType: netDevice
I0726 20:44:46.866533 25 factory.go:106] device added: [pciAddr: 0000:37:02.0, vendor: 8086, device: 154c, driver: vfio-pci]
I0726 20:44:46.866542 25 factory.go:106] device added: [pciAddr: 0000:37:02.1, vendor: 8086, device: 154c, driver: vfio-pci]
I0726 20:44:46.866545 25 factory.go:106] device added: [pciAddr: 0000:37:02.2, vendor: 8086, device: 154c, driver: vfio-pci]
I0726 20:44:46.866547 25 factory.go:106] device added: [pciAddr: 0000:37:02.3, vendor: 8086, device: 154c, driver: vfio-pci]
I0726 20:44:46.866549 25 factory.go:106] device added: [pciAddr: 0000:37:02.4, vendor: 8086, device: 154c, driver: vfio-pci]
I0726 20:44:46.866552 25 factory.go:106] device added: [pciAddr: 0000:37:02.5, vendor: 8086, device: 154c, driver: vfio-pci]
I0726 20:44:46.866555 25 factory.go:106] device added: [pciAddr: 0000:37:02.6, vendor: 8086, device: 154c, driver: vfio-pci]
I0726 20:44:46.866557 25 factory.go:106] device added: [pciAddr: 0000:37:02.7, vendor: 8086, device: 154c, driver: vfio-pci]
I0726 20:44:46.866569 25 manager.go:145] New resource server is created for leftnuma0 ResourcePool
I0726 20:44:46.866573 25 manager.go:115]
I0726 20:44:46.866575 25 manager.go:116] Creating new ResourcePool: leftnuma0x710
I0726 20:44:46.866578 25 manager.go:117] DeviceType: netDevice
I0726 20:44:46.873739 25 factory.go:106] device added: [pciAddr: 0000:37:02.0, vendor: 8086, device: 154c, driver: vfio-pci]
I0726 20:44:46.873749 25 factory.go:106] device added: [pciAddr: 0000:37:02.1, vendor: 8086, device: 154c, driver: vfio-pci]
I0726 20:44:46.873751 25 factory.go:106] device added: [pciAddr: 0000:37:02.2, vendor: 8086, device: 154c, driver: vfio-pci]
I0726 20:44:46.873754 25 factory.go:106] device added: [pciAddr: 0000:37:02.3, vendor: 8086, device: 154c, driver: vfio-pci]
I0726 20:44:46.873756 25 factory.go:106] device added: [pciAddr: 0000:37:02.4, vendor: 8086, device: 154c, driver: vfio-pci]
I0726 20:44:46.873759 25 factory.go:106] device added: [pciAddr: 0000:37:02.5, vendor: 8086, device: 154c, driver: vfio-pci]
I0726 20:44:46.873761 25 factory.go:106] device added: [pciAddr: 0000:37:02.6, vendor: 8086, device: 154c, driver: vfio-pci]
I0726 20:44:46.873763 25 factory.go:106] device added: [pciAddr: 0000:37:02.7, vendor: 8086, device: 154c, driver: vfio-pci]
I0726 20:44:46.873770 25 manager.go:145] New resource server is created for leftnuma0x710 ResourcePool

The above may not be directly related to the error, but it should be fixed.

Regarding the "0000:37:02.5/physfn/net: no such file or directory" error, did it happen after a node reboot? If so, what may have happened is that kubelet tried to recover the customer SRIOV pod with pre-allocated devices (recorded in the kubelet checkpoint file before rebooting), but at the time of recovery the SRIOV devices hadn't been created yet (SRIOV devices are created by the sriov-config-daemon pod, and it is not guaranteed that the sriov-config-daemon pod is up and running before the customer SRIOV pod).

Could you capture the logs for both the sriov-config-daemon pod and the CNI error message when the issue happens again? I think we can compare the timestamps to see if that's the case.
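A sketch of how to double-check the overlapping pools and collect the daemon-side logs with timestamps for comparison; the ConfigMap name appears in the kubelet logs earlier in this report, and the pod name is a placeholder:

~~~~
# The generated device plugin config shows which PCI addresses each resource pool selects
oc get configmap device-plugin-config -n openshift-sriov-network-operator -o yaml

# Collect sriov-config-daemon logs with timestamps so they can be lined up with the CNI error
oc logs <sriov-network-config-daemon-pod> -n openshift-sriov-network-operator --timestamps
~~~~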
@zenghui.shi

To be clear, the SRIOV issue we hit recently was slightly different from the one originally reported.

1) The originally reported one. It happened on all nodes with SRIOV interfaces (at different times). Symptoms:
- when listed with lspci, some of the VFs were not activated: 0000:37:02.5 - 0000:37:02.7;
- when describing the node, we saw the SRIOV capacity as 0:
~~~~
Capacity:
  openshift.io/leftnuma0:               0
  openshift.io/leftnuma0x710:           0
  openshift.io/leftnuma1:               0
  openshift.io/leftnuma1x710:           0
  openshift.io/rightnuma0:              0
  openshift.io/rightnuma0x710:          0
  openshift.io/rightnuma1:              0
  openshift.io/rightnuma1x710:          0
  openshift.io/sriovleftdpdkintel:      0
  openshift.io/sriovleftdpdkintelx710:  0
  openshift.io/sriovrightdpdkintel:     0
  openshift.io/sriovrightdpdkintelx710: 0
~~~~

2) The one recently hit on Sep. 27th. It happened on one node with a 25GbE SRIOV interface (Intel XXV710). Symptoms:
- (no issue) when listed with lspci, all VFs were activated (0 - 7);
- (no issue) when describing the node, we saw the SRIOV capacity as 8;
- (no issue) the sysfs VF count was 8:
  cat /sys/bus/pci/devices/0000:37:00.0/sriov_numvfs
  cat /sys/bus/pci/devices/0000:37:00.1/sriov_numvfs
- (no issue) both drivers were loaded properly: iavf and vfio_pci
- (issue) /dev/vfio was empty (see the /dev/vfio check below), so the pod failed to start with an error like the following:
~~~~
Sep 27 09:30:53 mxq1010l8g hyperkube[4529]: E0927 09:30:53.653468 4529 pod_workers.go:191] Error syncing pod 7eca0638-5b0d-4b6a-88c3-e72d6e5e403e ("eric-pc-up-data-plane-cc5c88746-x9k7r_w6017-c1-sl01-upflab3(7eca0638-5b0d-4b6a-88c3-e72d6e5e403e)"), skipping: failed to "StartContainer" for "eric-pc-up-data-plane" with CreateContainerError: "lstat /dev/vfio/114: no such file or directory"
~~~~
(please see attached logs: mxq1010l8g_sriov_err_20210927_0800_to_1800.logaa.tar.gz, mxq1010l8g_sriov_err_20210927_0800_to_1800.logab.tar.gz, mxq1010l8g_sriov_err_20210927_0800_to_1800.logac.tar.gz)
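For the Sep. 27th symptom, the /dev/vfio/<N> node that the container needs corresponds to the VF's IOMMU group. A minimal sketch for checking that mapping on the node, assuming the VF address 0000:37:02.5 from this report (the group number 114 comes from the error message and may differ per VF):

~~~~
# Which IOMMU group the VF belongs to; the trailing number should match /dev/vfio/<N>
readlink /sys/bus/pci/devices/0000:37:02.5/iommu_group

# Character devices created by vfio: one per bound IOMMU group, plus /dev/vfio/vfio
ls -l /dev/vfio/
~~~~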
Created attachment 1827824 [details] sriov error kubelet log for sep. 27th part 1 sriov error kubelet log for sep. 27th part 1
Created attachment 1827825 [details] sriov error kubelet log for sep. 27th part 2 sriov error kubelet log for sep. 27th part 2
Created attachment 1827826 [details] sriov error kubelet log for sep. 27th part 3 sriov error kubelet log for sep. 27th part 3
@zenghui.shi please find the following version information:
OCP: 4.6.13
CoreOS: 46.82.202101191342-0
SRIOV operator: sriov-network-operator.4.6.0-202106032244
@zenghui.shi the node that recently hit the SRIOV issue was not rebooted recently; its last reboot was on July 26th, 2021:
~~~~
sh-4.4# date
Thu Sep 30 23:31:49 UTC 2021
sh-4.4# last
core     pts/0        192.168.137.252  Mon Sep 27 17:53 - 18:11  (00:17)
core     pts/0        192.168.137.252  Mon Sep 27 17:51 - 17:53  (00:01)
core     pts/0        192.168.137.252  Tue Sep  7 18:47 - 18:47  (00:00)
reboot   system boot  4.18.0-193.40.1. Mon Jul 26 20:59   still running
reboot   system boot  4.18.0-193.40.1. Mon Jul 26 20:51 - 20:56  (00:05)
reboot   system boot  4.18.0-193.40.1. Mon Jul 26 20:41 - 20:48  (00:07)
reboot   system boot  4.18.0-193.40.1. Mon Jul 26 20:32 - 20:37  (00:05)
~~~~
(In reply to ptang from comment #10)
> @zenghui.shi
>
> To be clear, the SRIOV issue we hit recently was slightly different from
> the one originally reported.
> 1) The originally reported one. It happened on all nodes with SRIOV
> interfaces (at different times). Symptoms:
> - when listed with lspci, some of the VFs were not activated: 0000:37:02.5 -
> 0000:37:02.7;
> - when describing the node, we saw the SRIOV capacity as 0:
> ~~~~
> Capacity:
>   openshift.io/leftnuma0:               0
>   openshift.io/leftnuma0x710:           0
>   openshift.io/leftnuma1:               0
>   openshift.io/leftnuma1x710:           0
>   openshift.io/rightnuma0:              0
>   openshift.io/rightnuma0x710:          0
>   openshift.io/rightnuma1:              0
>   openshift.io/rightnuma1x710:          0
>   openshift.io/sriovleftdpdkintel:      0
>   openshift.io/sriovleftdpdkintelx710:  0
>   openshift.io/sriovrightdpdkintel:     0
>   openshift.io/sriovrightdpdkintelx710: 0
> ~~~~

This symptom looks different from what was originally described. When the VFs are present on the target node but not reported in the node status, it could be that the sriov device plugin didn't discover those devices or hasn't reported them back to kubelet yet. We can check the sriov device plugin log while the issue is happening (getting the log before or after the issue may not tell us exactly what we want to see).

> 2) The one recently hit on Sep. 27th. It happened on one node with a 25GbE
> SRIOV interface (Intel XXV710). Symptoms:
> - (no issue) when listed with lspci, all VFs were activated (0 - 7);
> - (no issue) when describing the node, we saw the SRIOV capacity as 8;
> - (no issue) the sysfs VF count was 8:
>   cat /sys/bus/pci/devices/0000:37:00.0/sriov_numvfs
>   cat /sys/bus/pci/devices/0000:37:00.1/sriov_numvfs
> - (no issue) both drivers were loaded properly: iavf and vfio_pci
> - (issue) /dev/vfio was empty, so the pod failed to start with an error
> like the following
> ~~~~
> Sep 27 09:30:53 mxq1010l8g hyperkube[4529]: E0927 09:30:53.653468
> 4529 pod_workers.go:191] Error syncing pod
> 7eca0638-5b0d-4b6a-88c3-e72d6e5e403e
> ("eric-pc-up-data-plane-cc5c88746-x9k7r_w6017-c1-sl01-upflab3(7eca0638-5b0d-
> 4b6a-88c3-e72d6e5e403e)"), skipping: failed to "StartContainer" for
> "eric-pc-up-data-plane" with CreateContainerError: "lstat /dev/vfio/114: no
> such file or directory"
> ~~~~

This error indicates that the vfio-pci device spec is not mounted into the SRIOV container. May I know how the application (inside the container) knows which device (e.g. /dev/vfio/114) to read from?

Same here: when the issue happens, check the sriov device plugin log to see whether it has successfully allocated the device (/dev/vfio/114). It should contain a message indicating that an `Allocate` gRPC call was issued by kubelet and that the exact device (/dev/vfio/114) was chosen for the particular container. If you don't see such a message, we probably have other issues in the net-attach-def and in how the pod requests the devices (we would need to check the net-attach-def and pod manifests).
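A sketch of how to look for the Allocate call in the device plugin log once the issue reproduces; the pod name is a placeholder and the grep patterns are illustrative:

~~~~
# Find the device plugin pod running on the affected node
oc get pods -n openshift-sriov-network-operator -o wide | grep device-plugin

# Look for the Allocate request and the device chosen for the container
oc logs <sriov-device-plugin-pod> -n openshift-sriov-network-operator --timestamps | grep -i -E 'allocate|/dev/vfio'
~~~~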
Hi @zenghui.shi, The customer just hit SRIOV issue, so the node reported 0 sriov again. There was no cluster upgrade, no node reboot, no operator restart. And I have resolved the sriov pool conflict issue based on the comment #9. Since those files are too big for BugZilla, I have uploaded in support case: https://access.redhat.com/support/cases/#/case/02995395. Please look for these three files in the case: must-gather-sriov-mxq1010l8g-20211025.tar.gz, sosreport-mxq1010l8g-02995395-2021-10-25-tsmzqry.tar.xz, sriov-operator-data-20211025.txt.tar.gz Here is the node description and interface list to give you the idea: ptang@iaasjump1:~$ oc describe node mxq1010l8g Name: mxq1010l8g Roles: worker,worker-hp Labels: beta.kubernetes.io/arch=amd64 beta.kubernetes.io/os=linux feature.node.kubernetes.io/network-sriov.capable=true kubernetes.io/arch=amd64 kubernetes.io/hostname=mxq1010l8g kubernetes.io/os=linux multus-cnf=true network-sriov.nic-type=XXV710 node-role.kubernetes.io/worker= node-role.kubernetes.io/worker-hp= node.openshift.io/os_id=rhcos run_prisma=true tenants-workload=true type=high-throughput Annotations: csi.volume.kubernetes.io/nodeid: {"csi.trident.netapp.io":"mxq1010l8g"} machineconfiguration.openshift.io/currentConfig: rendered-worker-hp-b3ec8e1fd600306f2c10530e4c3ff696 machineconfiguration.openshift.io/desiredConfig: rendered-worker-hp-b3ec8e1fd600306f2c10530e4c3ff696 machineconfiguration.openshift.io/reason: machineconfiguration.openshift.io/ssh: accessed machineconfiguration.openshift.io/state: Done projectcalico.org/IPv4Address: 192.168.145.69/23 sriovnetwork.openshift.io/state: Idle volumes.kubernetes.io/controller-managed-attach-detach: true CreationTimestamp: Sat, 13 Mar 2021 20:19:10 +0000 Taints: sriov_upf5=true:NoSchedule Unschedulable: false Lease: HolderIdentity: mxq1010l8g AcquireTime: <unset> RenewTime: Mon, 25 Oct 2021 21:47:50 +0000 Conditions: Type Status LastHeartbeatTime LastTransitionTime Reason Message ---- ------ ----------------- ------------------ ------ ------- NetworkUnavailable False Mon, 26 Jul 2021 20:59:58 +0000 Mon, 26 Jul 2021 20:59:58 +0000 CalicoIsUp Calico is running on this node MemoryPressure False Mon, 25 Oct 2021 21:44:57 +0000 Thu, 16 Sep 2021 20:09:46 +0000 KubeletHasSufficientMemory kubelet has sufficient memory available DiskPressure False Mon, 25 Oct 2021 21:44:57 +0000 Thu, 16 Sep 2021 20:09:46 +0000 KubeletHasNoDiskPressure kubelet has no disk pressure PIDPressure False Mon, 25 Oct 2021 21:44:57 +0000 Thu, 16 Sep 2021 20:09:46 +0000 KubeletHasSufficientPID kubelet has sufficient PID available Ready True Mon, 25 Oct 2021 21:44:57 +0000 Thu, 16 Sep 2021 20:09:46 +0000 KubeletReady kubelet is posting ready status Addresses: InternalIP: 172.18.243.69 Hostname: mxq1010l8g Capacity: cpu: 96 ephemeral-storage: 937140548Ki hugepages-1Gi: 40Gi memory: 395962020Ki openshift.io/leftnuma0: 0 openshift.io/leftnuma0x710: 0 openshift.io/leftnuma1: 0 openshift.io/leftnuma1x710: 0 openshift.io/rightnuma0: 0 openshift.io/rightnuma0x710: 0 openshift.io/rightnuma1: 0 openshift.io/rightnuma1x710: 0 openshift.io/sriovleftdpdkintel: 0 openshift.io/sriovleftdpdkintelx710: 0 openshift.io/sriovrightdpdkintel: 0 openshift.io/sriovrightdpdkintelx710: 0 pods: 250 Allocatable: cpu: 95500m ephemeral-storage: 862594985783 hugepages-1Gi: 40Gi memory: 352868004Ki openshift.io/leftnuma0: 0 openshift.io/leftnuma0x710: 0 openshift.io/leftnuma1: 0 openshift.io/leftnuma1x710: 0 openshift.io/rightnuma0: 0 openshift.io/rightnuma0x710: 0 openshift.io/rightnuma1: 0 
openshift.io/rightnuma1x710: 0 openshift.io/sriovleftdpdkintel: 0 openshift.io/sriovleftdpdkintelx710: 0 openshift.io/sriovrightdpdkintel: 0 openshift.io/sriovrightdpdkintelx710: 0 pods: 250 System Info: Machine ID: 8a7fc9582fe744ad835950587fc7a791 System UUID: 37393150-3636-584d-5131-3031304c3847 Boot ID: 571a72fc-391d-4113-81ad-45eeb453823b Kernel Version: 4.18.0-193.40.1.el8_2.x86_64 OS Image: Red Hat Enterprise Linux CoreOS 46.82.202101191342-0 (Ootpa) Operating System: linux Architecture: amd64 Container Runtime Version: cri-o://1.19.1-4.rhaos4.6.git3846aab.el8 Kubelet Version: v1.19.0+3b01205 Kube-Proxy Version: v1.19.0+3b01205 Non-terminated Pods: (17 in total) Namespace Name CPU Requests CPU Limits Memory Requests Memory Limits AGE --------- ---- ------------ ---------- --------------- ------------- --- calico-system calico-node-mcfj6 0 (0%) 0 (0%) 0 (0%) 0 (0%) 103d cluster-network-addons nmstate-handler-rzcsk 100m (0%) 0 (0%) 100Mi (0%) 0 (0%) 32d openshift-cluster-node-tuning-operator tuned-5f6mw 10m (0%) 0 (0%) 50Mi (0%) 0 (0%) 226d openshift-dns dns-default-sm2pf 65m (0%) 0 (0%) 110Mi (0%) 512Mi (0%) 226d openshift-image-registry node-ca-2v25v 10m (0%) 0 (0%) 10Mi (0%) 0 (0%) 226d openshift-kube-proxy openshift-kube-proxy-bqkx2 100m (0%) 0 (0%) 200Mi (0%) 0 (0%) 226d openshift-logging fluentd-69tfn 100m (0%) 0 (0%) 1536Mi (0%) 1536Mi (0%) 32d openshift-machine-config-operator machine-config-daemon-45klv 40m (0%) 0 (0%) 100Mi (0%) 0 (0%) 226d openshift-monitoring node-exporter-w5g6p 9m (0%) 0 (0%) 210Mi (0%) 0 (0%) 226d openshift-multus multus-rxhb6 10m (0%) 0 (0%) 150Mi (0%) 0 (0%) 226d openshift-multus network-metrics-daemon-cb7kr 20m (0%) 0 (0%) 120Mi (0%) 0 (0%) 226d openshift-sriov-network-operator sriov-cni-495lk 0 (0%) 0 (0%) 0 (0%) 0 (0%) 6d3h openshift-sriov-network-operator sriov-device-plugin-7bjqh 0 (0%) 0 (0%) 0 (0%) 0 (0%) 172m openshift-sriov-network-operator sriov-network-config-daemon-shckq 0 (0%) 0 (0%) 0 (0%) 0 (0%) 28d trident trident-csi-lcrml 0 (0%) 0 (0%) 0 (0%) 0 (0%) 12d twistlock twistlock-defender-ds-qhnvs 256m (0%) 0 (0%) 512Mi (0%) 512Mi (0%) 118d w6017-c1-sl01-upflab3 mxq1010l8g-debug 250m (0%) 1 (1%) 512Mi (0%) 2Gi (0%) 12s Allocated resources: (Total limits may be over 100 percent, i.e., overcommitted.) 
Resource Requests Limits -------- -------- ------ cpu 970m (1%) 1 (1%) memory 3610Mi (1%) 4608Mi (1%) ephemeral-storage 0 (0%) 0 (0%) hugepages-1Gi 0 (0%) 0 (0%) openshift.io/leftnuma0 0 0 openshift.io/leftnuma0x710 0 0 openshift.io/leftnuma1 0 0 openshift.io/leftnuma1x710 0 0 openshift.io/rightnuma0 0 0 openshift.io/rightnuma0x710 0 0 openshift.io/rightnuma1 0 0 openshift.io/rightnuma1x710 0 0 openshift.io/sriovleftdpdkintel 0 0 openshift.io/sriovleftdpdkintelx710 0 0 openshift.io/sriovrightdpdkintel 0 0 openshift.io/sriovrightdpdkintelx710 0 0 Events: Interface list: sh-4.4# lspci | grep Ether 12:00.0 Ethernet controller: Intel Corporation Ethernet Controller X710 for 10GbE SFP+ (rev 02) 12:00.1 Ethernet controller: Intel Corporation Ethernet Controller X710 for 10GbE SFP+ (rev 02) 12:00.2 Ethernet controller: Intel Corporation Ethernet Controller X710 for 10GbE SFP+ (rev 02) 12:00.3 Ethernet controller: Intel Corporation Ethernet Controller X710 for 10GbE SFP+ (rev 02) 37:00.0 Ethernet controller: Intel Corporation Ethernet Controller XXV710 for 25GbE SFP28 (rev 02) 37:00.1 Ethernet controller: Intel Corporation Ethernet Controller XXV710 for 25GbE SFP28 (rev 02) 37:02.0 Ethernet controller: Intel Corporation Ethernet Virtual Function 700 Series (rev 02) 37:02.1 Ethernet controller: Intel Corporation Ethernet Virtual Function 700 Series (rev 02) 37:02.2 Ethernet controller: Intel Corporation Ethernet Virtual Function 700 Series (rev 02) 37:02.3 Ethernet controller: Intel Corporation Ethernet Virtual Function 700 Series (rev 02) 37:02.4 Ethernet controller: Intel Corporation Ethernet Virtual Function 700 Series (rev 02) 37:02.5 Ethernet controller: Intel Corporation Ethernet Virtual Function 700 Series (rev 02) 37:02.6 Ethernet controller: Intel Corporation Ethernet Virtual Function 700 Series (rev 02) 37:02.7 Ethernet controller: Intel Corporation Ethernet Virtual Function 700 Series (rev 02) 37:0a.0 Ethernet controller: Intel Corporation Ethernet Virtual Function 700 Series (rev 02) 37:0a.1 Ethernet controller: Intel Corporation Ethernet Virtual Function 700 Series (rev 02) 37:0a.2 Ethernet controller: Intel Corporation Ethernet Virtual Function 700 Series (rev 02) 37:0a.3 Ethernet controller: Intel Corporation Ethernet Virtual Function 700 Series (rev 02) 37:0a.4 Ethernet controller: Intel Corporation Ethernet Virtual Function 700 Series (rev 02) 37:0a.5 Ethernet controller: Intel Corporation Ethernet Virtual Function 700 Series (rev 02) 37:0a.6 Ethernet controller: Intel Corporation Ethernet Virtual Function 700 Series (rev 02) 37:0a.7 Ethernet controller: Intel Corporation Ethernet Virtual Function 700 Series (rev 02) 5d:00.0 Ethernet controller: Broadcom Inc. and subsidiaries NetXtreme BCM5719 Gigabit Ethernet PCIe (rev 01) 5d:00.1 Ethernet controller: Broadcom Inc. and subsidiaries NetXtreme BCM5719 Gigabit Ethernet PCIe (rev 01) 5d:00.2 Ethernet controller: Broadcom Inc. and subsidiaries NetXtreme BCM5719 Gigabit Ethernet PCIe (rev 01) 5d:00.3 Ethernet controller: Broadcom Inc. 
and subsidiaries NetXtreme BCM5719 Gigabit Ethernet PCIe (rev 01) d8:00.0 Ethernet controller: Intel Corporation Ethernet Controller XXV710 for 25GbE SFP28 (rev 02) d8:00.1 Ethernet controller: Intel Corporation Ethernet Controller XXV710 for 25GbE SFP28 (rev 02) d8:02.0 Ethernet controller: Intel Corporation Ethernet Virtual Function 700 Series (rev 02) d8:02.1 Ethernet controller: Intel Corporation Ethernet Virtual Function 700 Series (rev 02) d8:02.2 Ethernet controller: Intel Corporation Ethernet Virtual Function 700 Series (rev 02) d8:02.3 Ethernet controller: Intel Corporation Ethernet Virtual Function 700 Series (rev 02) d8:02.4 Ethernet controller: Intel Corporation Ethernet Virtual Function 700 Series (rev 02) d8:02.5 Ethernet controller: Intel Corporation Ethernet Virtual Function 700 Series (rev 02) d8:02.6 Ethernet controller: Intel Corporation Ethernet Virtual Function 700 Series (rev 02) d8:02.7 Ethernet controller: Intel Corporation Ethernet Virtual Function 700 Series (rev 02) d8:0a.0 Ethernet controller: Intel Corporation Ethernet Virtual Function 700 Series (rev 02) d8:0a.1 Ethernet controller: Intel Corporation Ethernet Virtual Function 700 Series (rev 02) d8:0a.2 Ethernet controller: Intel Corporation Ethernet Virtual Function 700 Series (rev 02) d8:0a.3 Ethernet controller: Intel Corporation Ethernet Virtual Function 700 Series (rev 02) d8:0a.4 Ethernet controller: Intel Corporation Ethernet Virtual Function 700 Series (rev 02) d8:0a.5 Ethernet controller: Intel Corporation Ethernet Virtual Function 700 Series (rev 02) d8:0a.6 Ethernet controller: Intel Corporation Ethernet Virtual Function 700 Series (rev 02) d8:0a.7 Ethernet controller: Intel Corporation Ethernet Virtual Function 700 Series (rev 02)
It seems the issue is that the VFs didn't bind to the vfio-pci driver even though the sriov policy is configured to use "deviceType: vfio-pci". The sriov-device-plugin in turn filtered out the iavf VFs based on the configmap selector "driver: vfio-pci". For example, the sriovnetworknodestate mxq1010l8g shows that the VFs are configured for the vfio-pci driver, but are actually bound to the iavf driver: ## [...] Interfaces: Mtu: 9000 Name: ens2f0 Num Vfs: 8 Pci Address: 0000:37:00.0 Vf Groups: Device Type: vfio-pci <----- configured to use vfio-pci driver Policy Name: leftnuma0 Resource Name: leftnuma0 Vf Range: 0-7 [...] - Vfs: - deviceID: 154c driver: iavf <----- actually bound to the iavf driver mac: 66:f9:4d:f3:41:ef mtu: 1500 name: ens2f0v0 pciAddress: "0000:37:02.0" vendor: "8086" vfID: 0 - deviceID: 154c driver: iavf mac: aa:d9:07:9f:af:36 mtu: 1500 name: ens2f0v1 pciAddress: "0000:37:02.1" vendor: "8086" vfID: 1 [...] Sync Status: Succeeded <----- status shows that the config-daemon is not processing Events: <none> ## pod/sriov-device-plugin-7bjqh on node mxq1010l8g uses vfio-pci in the driver selector: ## [...] {"resourceList":[{"resourceName":"leftnuma0","selectors":{"vendors":["8086"],"devices":["154c"],"drivers":["vfio-pci"],"pfNames":["ens2f0"],"IsRdma":false},"SelectorObj":null},{"resourceName":"leftnuma1","selectors":{"vendors":["8086"],"devices":["154c"],"drivers":["vfio-pci"],"pfNames":["ens3f0"],"IsRdma":false},"SelectorObj":null},{"resourceName":"rightnuma0","selectors":{"vendors":["8086"],"devices":["154c"],"drivers":["vfio-pci"],"pfNames":["ens2f1"],"IsRdma":false},"SelectorObj":null},{"resourceName":"rightnuma1","selectors":{"vendors":["8086"],"devices":["154c"],"drivers":["vfio-pci"],"pfNames":["ens3f1"],"IsRdma":false},"SelectorObj":null}]} [...] ## Checking the pod/sriov-network-config-daemon-shckq log on node mxq1010l8g: ## [...] I1025 18:53:57.933510 890379 writer.go:125] setNodeStateStatus(): syncStatus: InProgress, lastSyncError: timed out waiting for the condition I1025 18:54:01.149359 890379 daemon.go:123] evicting pod w6017-c1-sl01-upflab3/eric-pc-up-data-plane-cc7f58fb8-g8695 E1025 18:54:01.154516 890379 daemon.go:123] error when evicting pod "eric-pc-up-data-plane-cc7f58fb8-g8695" (will retry after 5s): Cannot evict pod as it would violate the pod's disruption budget. I1025 18:54:06.154681 890379 daemon.go:123] evicting pod w6017-c1-sl01-upflab3/eric-pc-up-data-plane-cc7f58fb8-g8695 I1025 18:54:06.154729 890379 daemon.go:786] Draining failed with: error when evicting pod "eric-pc-up-data-plane-cc7f58fb8-g8695": global timeout reached: 1m30s, retrying E1025 18:54:06.154744 890379 daemon.go:790] drainNode(): failed to drain node (5 tries): timed out waiting for the condition :error when evicting pod "eric-pc-up-data-plane-cc7f58fb8-g8695": global timeout reached: 1m30s E1025 18:54:06.154751 890379 daemon.go:792] drainNode(): failed to drain node: timed out waiting for the condition [...] ## It shows that drainNode timed out because pod w6017-c1-sl01-upflab3/eric-pc-up-data-plane-cc7f58fb8-g8695 cannot be evicted, which blocks the sriov-config-daemon from proceeding. Could you try to fix the pod disruption budget and re-apply the sriov policy?
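For anyone hitting this again, a hedged sketch of the checks behind the analysis above (the paths, names and namespaces are taken from the outputs in this comment; the grep patterns are illustrative):
~~~~
# On the node: which driver is each VF actually bound to?
readlink /sys/bus/pci/devices/0000:37:02.0/driver
readlink /sys/bus/pci/devices/0000:37:02.1/driver

# Compare with what the operator believes it configured.
oc -n openshift-sriov-network-operator get sriovnetworknodestate mxq1010l8g -o yaml | grep -iE 'deviceType|driver'

# Check whether a PodDisruptionBudget is what prevents the eviction during drain.
oc -n w6017-c1-sl01-upflab3 get pdb
~~~~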
Hi @zenghui.shi, Why would the node drain impact the sriov? There might be customer activities that try to drain the node. I have restarted the sriov daemon pod for the node, which temporarily fixed the issue.
(In reply to ptang from comment #20) > Hi @zenghui.shi, > > Why would the node drain impact the sriov? There might be customer activities > that try to drain the node. > I have restarted the sriov daemon pod for the node, which temporarily fixed the > issue. This node drain is triggered by the sriov-config-daemon, not by customer activity. The sriov-config-daemon drains the node in order to evict the pods to other nodes, so that the workload is not interrupted while the sriov configuration is applied.
Hi @zenghui.shi, Before I discuss the change with the customer, I'd like to understand and confirm: 1) What mechanism triggers the sriov configuration to be re-applied, and when? Does the operator do that periodically? I know there was no sriov configuration change on the customer side. 2) Does applying the sriov configuration always drain the node? The eviction will interrupt the workload service, which is the opposite of "prevent the workload from being interrupted". Please confirm that this is the behavior we should expect while using the sriov operator.
(In reply to ptang from comment #22) > Hi @zenghui.shi, > Before I discuss the change with the customer, I'd like to understand > and confirm: > 1) What mechanism triggers the sriov configuration to be re-applied, and > when? Changes to an existing sriov policy or the creation of a new policy will trigger a re-apply immediately. > Does the operator do that periodically? It reconciles every 5 minutes, but won't re-apply if there are no changes to the policy. > I know there was no sriov configuration change on the customer side. > > 2) Does applying the sriov configuration always drain the node? Not always; it depends on whether the changes have the potential to cause workload interruption. For example, the first time a vfio-pci device is configured on a certain node, it will trigger a node drain, because the vfio-pci configuration in the kernel arguments only takes effect after rebooting, and a reboot implies a node drain. Another example is a policy change that requires updating the NIC firmware configuration (for some vendor NICs), which would also trigger a node drain and reboot. > The eviction will interrupt the workload service, which is the opposite of > "prevent the workload from being interrupted". It is expected that the workload implements application-level HA and that a node is not a single point of failure in the cluster; for example, a Deployment with replicas > 1 distributes several workload pods across different nodes (with proper taint and affinity settings), so one node draining or rebooting won't affect the workload on the other nodes or in the cluster. We have a similar node drain mechanism in the machine config operator: if applying a configuration is disruptive, it will drain and reboot the node. In such a case, application HA is required to keep the workload running.
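A quick way to see this state machine in action, for what it's worth (a hedged sketch; the annotation name is the one that appears in the node descriptions in this bug):
~~~~
# Shows Idle while the config daemon has nothing to do, and Draining while it is draining the node.
oc describe node mxq1010l8g | grep sriovnetwork.openshift.io/state
~~~~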
> > We have a similar node drain mechanism in the machine config operator: if > applying a configuration is disruptive, it will drain and reboot > the node. In such a case, application HA is required to keep the workload > running. A small wording clarification to my comment above: "applying a configuration" here means the act of applying the (machine or sriov) configuration to the node, not the application workload itself.
Hi @zenghui.shi, I need to understand the auto re-apply mechanism and the node drain further. 1) Regarding your comment about re-apply, "Changes to an existing sriov policy or the creation of a new policy will trigger a re-apply immediately": in our customer's case there was no sriov policy change (no update, no creation), so why was the re-apply triggered? 2) Regarding your comment about the node drain, "Not always; it depends on whether the changes have the potential to cause workload interruption. For example, the first time a vfio-pci device is configured on a certain node, it will trigger a node drain, because the vfio-pci configuration in the kernel arguments only takes effect after rebooting, and a reboot implies a node drain. Another example is a policy change that requires updating the NIC firmware configuration (for some vendor NICs), which would also trigger a node drain and reboot.": there was no workload change and no new deployment. There was also no OCP, device, or firmware update. Why would the sriov operator decide to drain the node?
(In reply to ptang from comment #25) > Hi @zenghui.shi, > > I need to understand the auto re-apply mechanism and the node drain further. > 1) Regarding your comment about re-apply, "Changes to an existing sriov policy or > the creation of a new policy will trigger a re-apply immediately": > in our customer's case there was no sriov policy change (no update, no > creation), so why was the re-apply triggered? A node reboot may also trigger this: the sriov configuration is re-applied every time after a reboot. It could be a manual reboot or a reboot triggered by other (non-sriov) components. > > 2) Regarding your comment about the node drain, > "Not always; it depends on whether the changes have the potential to > cause workload interruption. > For example, the first time a vfio-pci device is configured on a certain node, > it will trigger a node drain, because the vfio-pci configuration in the kernel > arguments only takes effect after rebooting, and a reboot implies a node drain. > Another example is a policy change that requires updating the NIC firmware > configuration (for some vendor NICs), which would also trigger a node drain and > reboot.": > there was no workload change and no new deployment. There was also no OCP, device, > or firmware update. Why would the sriov operator decide to drain the > node? Same answer as for the first question.
Hi @zenghui.shi, Regarding comment #26, those assumptions are not true. In our customer's case there was: - no node reboot (the last reboot was three months ago) - no policy update/create/delete - no hardware change - no hardware firmware update - no workload update Why would the sriov configuration be re-applied, and why did sriov try to drain the node? Here is the boot log: ptang@iaasjump1:~$ oc debug node/mxq1010l8g Starting pod/mxq1010l8g-debug ... To use host binaries, run `chroot /host` Pod IP: 172.18.243.69 If you don't see a command prompt, try pressing enter. sh-4.4# chroot /host sh-4.4# last core pts/0 192.168.137.252 Mon Sep 27 17:53 - 18:11 (00:17) core pts/0 192.168.137.252 Mon Sep 27 17:51 - 17:53 (00:01) core pts/0 192.168.137.252 Tue Sep 7 18:47 - 18:47 (00:00) reboot system boot 4.18.0-193.40.1. Mon Jul 26 20:59 still running reboot system boot 4.18.0-193.40.1. Mon Jul 26 20:51 - 20:56 (00:05)
Hi @zenghui.shi and @ddelcian, We're seeing a sudden node drain by the sriov operator today (please see the attached node description for details). We don't enable sriov on that node (mxq0440kn4) at all (please see the labels). I have uploaded the following files for investigation: sriovnetworknodepolicy-20211028.yaml, collect-sriov-operator-data-mxq0440kn4.txt.tar.gz, sosreport-mxq0440kn4-02995395-2021-10-28-qaqmoms.tar.gz, must-gather.local.870088789971800506-mxq0440kn4.tar.gz, sriov-cm-device-plugin-config.yaml Name: mxq0440kn4 Roles: worker Labels: beta.kubernetes.io/arch=amd64 beta.kubernetes.io/os=linux kubernetes.io/arch=amd64 kubernetes.io/hostname=mxq0440kn4 kubernetes.io/os=linux multus-cnf=true node-role.kubernetes.io/worker= node.openshift.io/os_id=rhcos pcc-mm-pod=non-controller pcc-sm-pod=non-controller snmp-sender=upflab3 tenants-workload=true Annotations: csi.volume.kubernetes.io/nodeid: {"csi.trident.netapp.io":"mxq0440kn4"} machineconfiguration.openshift.io/currentConfig: rendered-worker-a421fbe985727fc560149575f30f97d2 machineconfiguration.openshift.io/desiredConfig: rendered-worker-a421fbe985727fc560149575f30f97d2 machineconfiguration.openshift.io/reason: machineconfiguration.openshift.io/state: Done projectcalico.org/IPv4Address: 192.168.145.58/23 sriovnetwork.openshift.io/state: Draining volumes.kubernetes.io/controller-managed-attach-detach: true CreationTimestamp: Sat, 27 Feb 2021 06:26:59 +0000 Taints: node.kubernetes.io/unschedulable:NoSchedule Unschedulable: true Lease: HolderIdentity: mxq0440kn4 AcquireTime: <unset> RenewTime: Thu, 28 Oct 2021 17:41:20 +0000
Sorry, in comment #28 I meant that I have uploaded the files to the support case: https://access.redhat.com/support/cases/#/case/02995395
Hi @zenghui.shi, One thing I'd like to mention is that the customer is using sriov operator version 4.6.0-202106032244, while the latest version is 4.6.0-202110121348. Do you know if there were any important fixes between these two versions?
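In case it helps, a hedged sketch of how the installed vs. available operator versions can be compared from the CLI (the namespace is an assumption; the operator may be subscribed in a different one):
~~~~
# Installed CSV (operator) version.
oc -n openshift-sriov-network-operator get csv

# Subscription channel, approval mode (Manual vs Automatic) and current CSV.
oc -n openshift-sriov-network-operator get subscription -o yaml | grep -E 'channel|installPlanApproval|currentCSV'
~~~~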
Patric, could you also let me know what actions you took to resolve the overlapping devices in the different pools? I'm wondering how you addressed it without updating the policy or other sriov configurations.
Hi @zenghui.shi, On Oct 19th I made the policy update to resolve the sriov pool conflict based on your suggestion. Then we hit another sriov issue on node mxq1010l8g on Oct 25th; that's why I said there was no sriov policy update before that issue. On Oct 28th we saw the SRIOV operator trying to drain a totally unrelated node, mxq0440kn4, which was never set up for sriov.
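For context, the shape of such a de-conflicted policy is sketched below (a hedged illustration only -- the names, node label, PF name and VF count are taken from or modeled on the outputs earlier in this bug, not the exact change applied on Oct 19th):
~~~~
# Each pool gets its own resourceName and PF selector so no two policies claim the same VFs,
# and nodes are selected by an explicit label instead of a broad selector.
cat <<'EOF' | oc apply -f -
apiVersion: sriovnetwork.openshift.io/v1
kind: SriovNetworkNodePolicy
metadata:
  name: leftnuma0
  namespace: openshift-sriov-network-operator
spec:
  resourceName: leftnuma0
  nodeSelector:
    network-sriov.nic-type: "XXV710"
  numVfs: 8
  deviceType: vfio-pci
  nicSelector:
    pfNames:
      - ens2f0
EOF
~~~~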
Three things: 1) I see that VFs exist in the SriovNetworkNodeState for node mxq0440kn4 in the log sriov-operator-data-20211025.txt, but according to Patrik, the node has never been configured with an sriov policy. I suspect we may have wrongly selected this node with an sriov policy, which results in VFs being created. 2) To confirm that 1) is caused by a wrongly configured policy, we need to start with a clean state (cluster is ready, mcp is ready, sriov configuration is all set), then collect and compare the sriovnetworknodestate, sriovnetworknodepolicy, sriov-device-plugin configmap, sriov-device-plugin log, and node status right before and after any change to the sriov policy (including adding a label to the nodeSelector); a sketch of these commands is below. 3) To fix the drain issue on node mxq0440kn4, I suggest making the pod w6017-c1-sl01-upflab3/eric-pc-up-data-plane-cc7f58fb8-g8695 evictable, then waiting until the sriov configuration is done on the node (VFs are removed).
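The collect-and-compare step from item 2) and the selector check from item 1) could look roughly like this (a hedged sketch; the configmap name, pod label and output file names are assumptions):
~~~~
# Item 2: snapshot the sriov state before (and again after) a policy change.
oc -n openshift-sriov-network-operator get sriovnetworknodepolicies -o yaml > policies-before.yaml
oc -n openshift-sriov-network-operator get sriovnetworknodestates -o yaml > nodestates-before.yaml
oc -n openshift-sriov-network-operator get cm device-plugin-config -o yaml > dp-config-before.yaml
oc -n openshift-sriov-network-operator logs -l app=sriov-device-plugin --tail=-1 > dp-logs-before.txt
oc describe node mxq0440kn4 > node-before.txt

# Item 1: check which nodes each policy's nodeSelector actually matches.
oc -n openshift-sriov-network-operator get sriovnetworknodepolicies -o yaml | grep -A3 nodeSelector
oc get nodes -l feature.node.kubernetes.io/network-sriov.capable=true
~~~~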
@zshi - Thanks for taking the time to help troubleshoot the issue earlier today. A few follow-up questions/observations: 1) The SRIOV operator subscription was set to Manual some time ago due to a past issue and was never reset to Automatic. We did notice the version of the operator dating back to Jun 3rd; however, the MCO team did mention a race condition issue between MCO and SRIOV which is supposedly corrected via https://github.com/openshift/sriov-network-operator/pull/506. However, this PR was only merged around Jun 23rd, so this likely means they could have hit the race condition since the pool was not paused prior to the SRIOV daemon marking the node unschedulable. 2) The SRIOV operator version in production corresponds to the Oct 12th version and we haven't hit the issue there. So this is encouraging us to want to upgrade. Do you agree? 3) Once we've addressed the SRIOV operator upgrade, we will gauge the stability with the new version plus the updated node policies in place. 4) If, for some reason, we still see instability on the nodes with the X710 NICs, we will consider updating the firmware, as it is out of date and, I believe, not 100% compatible with the bundled RHEL8/RHCOS driver running on the nodes. Thanks! Daniel.
(In reply to Daniel Del Ciancio from comment #35) > @zshi - Thanks for taking the time to help troubleshoot the issue > earlier today. > > A few follow-up questions/observations: > > 1) The SRIOV operator subscription was set to Manual some time ago due to a > past issue and was never reset to Automatic. We did notice the version of > the operator dating back to Jun 3rd; however, the MCO team did mention a > race condition issue between MCO and SRIOV which is supposedly corrected via > https://github.com/openshift/sriov-network-operator/pull/506. However, this > PR was only merged around Jun 23rd, so this likely means they could have hit > the race condition since the pool was not paused prior to the SRIOV daemon > marking the node unschedulable. PR #506 in the sriov operator fixed a race condition when both an MC and an sriov policy are applied simultaneously, which leads to the node never becoming Ready after rebooting. We can check the health of the MCP (not degraded) when the sriov issue happens to see whether this is the same issue; if the MCP is not ready, it might be. > > 2) The SRIOV operator version in production corresponds to the Oct 12th > version and we haven't hit the issue there. So this is encouraging us to > want to upgrade. Do you agree? Did we also have the overlapping device configuration issue in the sriov pools, or the "pod cannot satisfy the eviction condition" issue, in the prod environment? I recommend fixing the pod eviction issue first in the current env, as I mentioned in comment #19, and then letting the cluster auto-fix itself if possible.
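A quick sketch of that MCP health check, for reference (standard oc output, nothing cluster-specific assumed):
~~~~
# A degraded or stuck pool at the time of the sriov issue would point to the MC/sriov race fixed by PR #506.
oc get mcp
oc get nodes | grep -i -e SchedulingDisabled -e NotReady
~~~~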
(In reply to zenghui.shi from comment #36) > (In reply to Daniel Del Ciancio from comment #35) > > @zshi - Thanks for taking the time to help troubleshoot the issue > > earlier today. > > > > A few follow-up questions/observations: > > > > 1) The SRIOV operator subscription was set to Manual some time ago due to a > > past issue and was never reset to Automatic. We did notice the version of > > the operator dating back to Jun 3rd; however, the MCO team did mention a > > race condition issue between MCO and SRIOV which is supposedly corrected via > > https://github.com/openshift/sriov-network-operator/pull/506. However, this > > PR was only merged around Jun 23rd, so this likely means they could have hit > > the race condition since the pool was not paused prior to the SRIOV daemon > > marking the node unschedulable. > > PR #506 in the sriov operator fixed a race condition when both an MC and an sriov policy > are applied simultaneously, which leads to the node never becoming > Ready after rebooting. We can check the health of the MCP (not degraded) when > the sriov issue happens to see whether this is the same issue; if the MCP is not > ready, it might be. After applying the SRIOV operator version upgrade (on Monday Nov 1st) to the latest version (since the sub was set to manual), along with the corrected SRIOV node policies (applying the additional node selector and node labeling), things seem to be more stable. This leads to another question: is it safe to keep the SRIOV operator on auto-upgrade? I think that, provided no SRIOV node policies change (and therefore no underlying VF changes happen on the nodes), it should be safe, right? > > > > 2) The SRIOV operator version in production corresponds to the Oct 12th > > version and we haven't hit the issue there. So this is encouraging us to > > want to upgrade. Do you agree? > > Did we also have the overlapping device configuration issue in the sriov pools, or > the "pod cannot satisfy the eviction condition" issue, in the prod environment? > > I recommend fixing the pod eviction issue first in the current env, as I > mentioned in comment #19, and then letting the cluster auto-fix itself > if possible. Concerning comment #19, the PDB blocking the node drain is a known concern and the customer is aware. This is another conversation being handled with the customer and the CNF vendor. For now, the node drain is being freed by deleting the blocking pod. Does SRIOV implement a drain timeout? I believe some work is being done on the MCO end to handle this, especially due to PDB pressure.
(In reply to Daniel Del Ciancio from comment #37) > (In reply to zenghui.shi from comment #36) > > (In reply to Daniel Del Ciancio from comment #35) > > > @zshi - Thanks for taking the time to help troubleshoot the issue > > > earlier today. > > > > > > A few follow-up questions/observations: > > > > > > 1) The SRIOV operator subscription was set to Manual some time ago due to a > > > past issue and was never reset to Automatic. We did notice the version of > > > the operator dating back to Jun 3rd; however, the MCO team did mention a > > > race condition issue between MCO and SRIOV which is supposedly corrected via > > > https://github.com/openshift/sriov-network-operator/pull/506. However, this > > > PR was only merged around Jun 23rd, so this likely means they could have hit > > > the race condition since the pool was not paused prior to the SRIOV daemon > > > marking the node unschedulable. > > > > PR #506 in the sriov operator fixed a race condition when both an MC and an sriov policy > > are applied simultaneously, which leads to the node never becoming > > Ready after rebooting. We can check the health of the MCP (not degraded) when > > the sriov issue happens to see whether this is the same issue; if the MCP is not > > ready, it might be. > > After applying the SRIOV operator version upgrade (on Monday Nov > 1st) to the latest version (since the sub was set to manual), along with the > corrected SRIOV node policies (applying the additional node selector and > node labeling), things seem to be more stable. > > This leads to another question: is it safe to keep the SRIOV operator > on auto-upgrade? I think that, provided no SRIOV node policies change (and therefore no > underlying VF changes happen on the nodes), it should be safe, right? Yes to the auto-upgrade; an upgrade of the sriov operator version is not supposed to trigger a reconfiguration of the VF devices unless explicitly documented. > > > > > > > 2) The SRIOV operator version in production corresponds to the Oct 12th > > > version and we haven't hit the issue there. So this is encouraging us to > > > want to upgrade. Do you agree? > > > > Did we also have the overlapping device configuration issue in the sriov pools, or > > the "pod cannot satisfy the eviction condition" issue, in the prod environment? > > > > I recommend fixing the pod eviction issue first in the current env, as I > > mentioned in comment #19, and then letting the cluster auto-fix itself > > if possible. > > > Concerning comment #19, the PDB blocking the node drain is a known concern > and the customer is aware. This is another conversation being handled with > the customer and the CNF vendor. For now, the node drain is being freed > by deleting the blocking pod. > > Does SRIOV implement a drain timeout? Nope, not in released versions. This is to avoid potential workload interruption for pods that cannot satisfy the eviction policy. User interaction is required to unblock a failed node drain at this moment. > I believe some work is being done on > the MCO end to handle this, especially due to PDB pressure. Do you have a pointer to how MCO handles the PDB pressure?
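For the record, the manual unblocking used here can be sketched roughly as follows (a hedged sketch only -- the pod and node names are the ones mentioned earlier in this bug, and deleting a workload pod is disruptive, so it should only be done in agreement with the application owner):
~~~~
# Remove the pod that the PDB protects so the eviction can proceed (or scale/relocate it instead).
oc -n w6017-c1-sl01-upflab3 delete pod eric-pc-up-data-plane-cc7f58fb8-g8695

# Watch the node annotation go from Draining back to Idle once the config daemon finishes.
oc describe node mxq0440kn4 | grep sriovnetwork.openshift.io/state
~~~~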
The needinfo request[s] on this closed bug have been removed as they have been unresolved for 120 days