Bug 1857440
Summary:        [sriov] openshift.io/intelsriov resources are no longer available after worker node is restarted
Product:        OpenShift Container Platform
Component:      Networking (sub component: SR-IOV)
Reporter:       Walid A. <wabouham>
Assignee:       Peng Liu <pliu>
QA Contact:     zhaozhanqi <zzhao>
Status:         CLOSED ERRATA
Severity:       high
Priority:       unspecified
CC:             ddharwar, dosmith, mifiedle, zshi
Version:        4.6
Target Release: 4.6.0
Hardware:       x86_64
OS:             Linux
Type:           Bug
Last Closed:    2020-10-27 16:14:43 UTC
Description (Walid A., 2020-07-15 20:46:49 UTC)
zhaozhanqi: This issue does not reproduce in my cluster. After the node reboots, the sriov-device-plugin pod is recreated automatically, and once it is running the VFs come back correctly:

    sriov-device-plugin-tm454   1/1   Running   0   37s

    oc get node -o yaml | grep "openshift.io/intelnetdevice"
      openshift.io/intelnetdevice: "5"
      openshift.io/intelnetdevice: "5"

Walid A.: @zhaozhanqi which version of OCP do you not see this issue on? You have "openshift.io/intelnetdevice" and we have "openshift.io/intelsriov" resources; I am not sure whether that is the reason or whether our configurations differ. We are also seeing this issue on another baremetal IPI cluster (same hardware) running OCP 4.5.0.rc1 (@ddharwar's cluster).

Peng Liu: Hi Walid, could you share the output of 'oc get -n openshift-sriov-network-operator SriovNetworkNodeState -o yaml' and the log of the 'sriov-network-config-daemon-xxxx' pod after the node reboot? Also, please keep checking the node resources until the 'syncStatus' in the SriovNetworkNodeState CR becomes 'Succeeded'. I cannot reproduce this issue in my environment either. As zhanqi mentioned, the sriov-network-device-plugin pod should be recreated after the node reboot.

Peng Liu: Hi Walid, according to the log you posted, you are using the 4.6 origin images of the sriov network operator instead of the downstream 4.5 images.

Walid A.: Hi Peng, I was able to verify that openshift.io/intelsriov resources do not disappear after a worker node reboot, on an OCP 4.5.3 baremetal cluster with the v4.5 sriov-network-operator images (deployed from OperatorHub).
However, when I tried to deploy the sriov-network-operator from the github master branch or from release-4.6 (both have the merged fix PR openshift/sriov-network-operator/pull/308), the sriov-cni pod goes into Init:CrashLoopBackOff after the sriov policy is deployed:

    # oc get pods -n openshift-sriov-network-operator
    NAME                                      READY   STATUS                  RESTARTS   AGE
    network-resources-injector-5fghq          1/1     Running                 0          138m
    network-resources-injector-5n4kt          1/1     Running                 0          138m
    network-resources-injector-7rdlc          1/1     Running                 0          138m
    operator-webhook-97wbd                    1/1     Running                 0          138m
    operator-webhook-s5hv2                    1/1     Running                 0          138m
    operator-webhook-sjrcd                    1/1     Running                 0          138m
    sriov-cni-ft6r8                           0/1     Init:CrashLoopBackOff   28         124m
    sriov-device-plugin-lmknz                 1/1     Running                 0          122m
    sriov-network-config-daemon-bzclg         1/1     Running                 0          138m
    sriov-network-config-daemon-ks2hw         1/1     Running                 0          138m
    sriov-network-operator-785676dfcf-wsgjt   1/1     Running                 0          138m

The sriov-network-operator shows version 4.3.0:

    # oc describe pod -n openshift-sriov-network-operator sriov-network-operator-785676dfcf-wsgjt
    Name:         sriov-network-operator-785676dfcf-wsgjt
    Namespace:    openshift-sriov-network-operator
    Priority:     0
    Node:         master-2/192.168.222.12
    Start Time:   Wed, 29 Jul 2020 01:54:14 +0000
    Labels:       name=sriov-network-operator
                  pod-template-hash=785676dfcf
    Annotations:  k8s.ovn.org/pod-networks: {"default":{"ip_addresses":["10.128.0.10/23"],"mac_address":"fa:28:ea:80:00:0b","gateway_ips":["10.128.0.1"],"ip_address":"10.128.0.10/23"...
                  k8s.v1.cni.cncf.io/network-status: [{ "name": "ovn-kubernetes", "interface": "eth0", "ips": [ "10.128.0.10" ], "mac": "fa:28:ea:80:00:0b", "default": true, "dns": {} }]
                  k8s.v1.cni.cncf.io/networks-status: [{ "name": "ovn-kubernetes", "interface": "eth0", "ips": [ "10.128.0.10" ], "mac": "fa:28:ea:80:00:0b", "default": true, "dns": {} }]
                  openshift.io/scc: restricted
    Status:       Running
    IP:           10.128.0.10
    IPs:
      IP:           10.128.0.10
    Controlled By:  ReplicaSet/sriov-network-operator-785676dfcf
    Containers:
      sriov-network-operator:
        Container ID:  cri-o://0adae7c5713ddd4dc116f16742ceb31272aa97eb7d592687892b3781b8265288
        Image:         quay.io/openshift/origin-sriov-network-operator@sha256:3383f608660e0b153ddd8b70f33f295b028662ecfa0e732834cfaf97e5a3a34a
        Image ID:      quay.io/openshift/origin-sriov-network-operator@sha256:3383f608660e0b153ddd8b70f33f295b028662ecfa0e732834cfaf97e5a3a34a
        Port:          <none>
        Host Port:     <none>
        Command:
          sriov-network-operator
        State:          Running
          Started:      Wed, 29 Jul 2020 01:54:20 +0000
        Ready:          True
        Restart Count:  0
        Environment:
          WATCH_NAMESPACE:                    openshift-sriov-network-operator (v1:metadata.namespace)
          SRIOV_CNI_IMAGE:                    quay.io/openshift/origin-sriov-cni@sha256:38ce1d1ab4d1e6508ea860fdce37e2746a26796a49eff2ec5c569e689459198b
          SRIOV_INFINIBAND_CNI_IMAGE:         quay.io/openshift/origin-sriov-infiniband-cni@sha256:26e1e88443e2f258dd06b196f549c346d1061c961c4216ac30a6ae0d8e413a57
          SRIOV_DEVICE_PLUGIN_IMAGE:          quay.io/openshift/origin-sriov-network-device-plugin@sha256:757c8f8659ed4702918e30efff604cb40aecc8424ec1cc43ad655bc1594d0739
          NETWORK_RESOURCES_INJECTOR_IMAGE:   quay.io/openshift/origin-sriov-dp-admission-controller@sha256:9128b6ec8c2b4895b9e927b1b54f01d2e51440d3d6232f9b3c08b17b861d8209
          OPERATOR_NAME:                      sriov-network-operator
          SRIOV_NETWORK_CONFIG_DAEMON_IMAGE:  quay.io/openshift/origin-sriov-network-config-daemon@sha256:f9135e2381d0986e421e4a4799615de2cbf0690e22c90470b4863d956164b67e
          SRIOV_NETWORK_WEBHOOK_IMAGE:        quay.io/openshift/origin-sriov-network-webhook@sha256:0f54d40344967eb052a0a90bdf6b9447dc68f93f77ff9d4e40955e6b217b7a9a
          RESOURCE_PREFIX:                    openshift.io
          ENABLE_ADMISSION_CONTROLLER:        true
          NAMESPACE:                          openshift-sriov-network-operator (v1:metadata.namespace)
          POD_NAME:                           sriov-network-operator-785676dfcf-wsgjt (v1:metadata.name)
          RELEASE_VERSION:                    4.3.0
          SRIOV_CNI_BIN_PATH:
        Mounts:
          /var/run/secrets/kubernetes.io/serviceaccount from sriov-network-operator-token-dnfkj (ro)
    Conditions:
      Type              Status
      Initialized       True
      Ready             True
      ContainersReady   True
      PodScheduled      True
    Volumes:
      sriov-network-operator-token-dnfkj:
        Type:        Secret (a volume populated by a Secret)
        SecretName:  sriov-network-operator-token-dnfkj
        Optional:    false
    QoS Class:       BestEffort
    Node-Selectors:  node-role.kubernetes.io/master=
    Tolerations:     node-role.kubernetes.io/master:NoSchedule
                     node.kubernetes.io/not-ready:NoExecute for 300s
                     node.kubernetes.io/unreachable:NoExecute for 300s
    Events:
      Type    Reason          Age        From               Message
      ----    ------          ----       ----               -------
      Normal  Scheduled       <unknown>  default-scheduler  Successfully assigned openshift-sriov-network-operator/sriov-network-operator-785676dfcf-wsgjt to master-2
      Normal  AddedInterface  137m       multus             Add eth0 [10.128.0.10/23]
      Normal  Pulling         137m       kubelet, master-2  Pulling image "quay.io/openshift/origin-sriov-network-operator@sha256:3383f608660e0b153ddd8b70f33f295b028662ecfa0e732834cfaf97e5a3a34a"
      Normal  Pulled          137m       kubelet, master-2  Successfully pulled image "quay.io/openshift/origin-sriov-network-operator@sha256:3383f608660e0b153ddd8b70f33f295b028662ecfa0e732834cfaf97e5a3a34a"
      Normal  Created         137m       kubelet, master-2  Created container sriov-network-operator
      Normal  Started         137m       kubelet, master-2  Started container sriov-network-operator

    -----------

    # oc describe pods/sriov-cni-ft6r8 -n openshift-sriov-network-operator
    Name:         sriov-cni-ft6r8
    Namespace:    openshift-sriov-network-operator
    Priority:     0
    Node:         worker000/192.168.222.13
    Start Time:   Wed, 29 Jul 2020 02:09:48 +0000
    Labels:       app=sriov-cni
                  component=network
                  controller-revision-hash=75c97d4f88
                  openshift.io/component=network
                  pod-template-generation=3
                  type=infra
    Annotations:  k8s.ovn.org/pod-networks: {"default":{"ip_addresses":["10.128.2.4/23"],"mac_address":"fa:28:ea:80:02:05","gateway_ips":["10.128.2.1"],"ip_address":"10.128.2.4/23","...
                  k8s.v1.cni.cncf.io/network-status: [{ "name": "ovn-kubernetes", "interface": "eth0", "ips": [ "10.128.2.4" ], "mac": "fa:28:ea:80:02:05", "default": true, "dns": {} }]
                  k8s.v1.cni.cncf.io/networks-status: [{ "name": "ovn-kubernetes", "interface": "eth0", "ips": [ "10.128.2.4" ], "mac": "fa:28:ea:80:02:05", "default": true, "dns": {} }]
                  openshift.io/scc: privileged
    Status:       Pending
    IP:           10.128.2.4
    IPs:
      IP:           10.128.2.4
    Controlled By:  DaemonSet/sriov-cni
    Init Containers:
      sriov-infiniband-cni:
        Container ID:   cri-o://cc4a57eeab17bbb8c58e53c9e8a190082e0ad55925dbbbf0e701229c8cd48cbd
        Image:          quay.io/openshift/origin-sriov-infiniband-cni@sha256:26e1e88443e2f258dd06b196f549c346d1061c961c4216ac30a6ae0d8e413a57
        Image ID:       quay.io/openshift/origin-sriov-infiniband-cni@sha256:26e1e88443e2f258dd06b196f549c346d1061c961c4216ac30a6ae0d8e413a57
        Port:           <none>
        Host Port:      <none>
        State:          Waiting
          Reason:       CrashLoopBackOff
        Last State:     Terminated
          Reason:       Error
          Exit Code:    1
          Started:      Wed, 29 Jul 2020 02:15:48 +0000
          Finished:     Wed, 29 Jul 2020 02:15:48 +0000
        Ready:          False
        Restart Count:  6
        Environment:    <none>
        Mounts:
          /host/opt/cni/bin from cnibin (rw)
          /var/run/secrets/kubernetes.io/serviceaccount from sriov-cni-token-6gxhb (ro)
    Containers:
      sriov-cni:
        Container ID:
        Image:          quay.io/openshift/origin-sriov-cni@sha256:38ce1d1ab4d1e6508ea860fdce37e2746a26796a49eff2ec5c569e689459198b
        Image ID:
        Port:           <none>
        Host Port:      <none>
        State:          Waiting
          Reason:       PodInitializing
        Ready:          False
        Restart Count:  0
        Environment:    <none>
        Mounts:
          /host/opt/cni/bin from cnibin (rw)
          /var/run/secrets/kubernetes.io/serviceaccount from sriov-cni-token-6gxhb (ro)
    Conditions:
      Type              Status
      Initialized       False
      Ready             False
      ContainersReady   False
      PodScheduled      True
    Volumes:
      cnibin:
        Type:          HostPath (bare host directory volume)
        Path:          /var/lib/cni/bin
        HostPathType:
      sriov-cni-token-6gxhb:
        Type:        Secret (a volume populated by a Secret)
        SecretName:  sriov-cni-token-6gxhb
        Optional:    false
    QoS Class:       BestEffort
    Node-Selectors:  beta.kubernetes.io/os=linux
    Tolerations:     node.kubernetes.io/disk-pressure:NoSchedule
                     node.kubernetes.io/memory-pressure:NoSchedule
                     node.kubernetes.io/not-ready:NoExecute
                     node.kubernetes.io/pid-pressure:NoSchedule
                     node.kubernetes.io/unreachable:NoExecute
                     node.kubernetes.io/unschedulable:NoSchedule
    Events:
      Type     Reason          Age                         From                Message
      ----     ------          ----                        ----                -------
      Normal   Scheduled       <unknown>                   default-scheduler   Successfully assigned openshift-sriov-network-operator/sriov-cni-ft6r8 to worker000
      Normal   AddedInterface  8m1s                        multus              Add eth0 [10.128.2.4/23]
      Normal   Pulled          6m20s (x5 over 8m)          kubelet, worker000  Container image "quay.io/openshift/origin-sriov-infiniband-cni@sha256:26e1e88443e2f258dd06b196f549c346d1061c961c4216ac30a6ae0d8e413a57" already present on machine
      Normal   Created         6m20s (x5 over 8m)          kubelet, worker000  Created container sriov-infiniband-cni
      Normal   Started         6m20s (x5 over 8m)          kubelet, worker000  Started container sriov-infiniband-cni
      Warning  BackOff         <invalid> (x48 over 7m58s)  kubelet, worker000  Back-off restarting failed container

Peng Liu: Walid, that is a different issue. The sriov-cni pod problem should be fixed by https://github.com/openshift/sriov-network-operator/pull/312. However, the sriov-network-operator origin images have recently not been built as expected, so you may need to wait until the origin image builds are back to normal.

Verified this bug on 4.6.0-202008121454.p0. After the node restarted, the sriov resources were restored to normal.

Verified this bz also on a baremetal OCP 4.6.0-0.nightly-2020-10-03-051134 cluster with sriov deployed from the github master branch; sriov-network-operator image build-date=2020-09-05T01:13:15.933978,
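The two checks that recur in this thread, waiting until the SriovNetworkNodeState 'syncStatus' becomes 'Succeeded' before inspecting node resources, and spotting pods stuck in states like Init:CrashLoopBackOff, can be scripted. A minimal sketch in shell: `all_synced` and `failing_pods` are hypothetical helper names introduced here, written as pure stdin filters so they can be exercised without a cluster; the commented `oc` invocations (namespace and pod names taken from this report) assume a logged-in cluster.

```shell
#!/bin/sh
# all_synced: reads one syncStatus value per line on stdin and succeeds
# only when every line is "Succeeded".
all_synced() {
    ! grep -qv '^Succeeded$'
}

# failing_pods: reads `oc get pods` output on stdin and prints the names
# of pods whose STATUS column is neither Running nor Completed.
failing_pods() {
    awk 'NR > 1 && $3 != "Running" && $3 != "Completed" { print $1 }'
}

# Possible usage against a live cluster (assumption: oc is logged in):
#   until oc get -n openshift-sriov-network-operator sriovnetworknodestates \
#           -o jsonpath='{range .items[*]}{.status.syncStatus}{"\n"}{end}' | all_synced; do
#       sleep 10
#   done
#   oc get pods -n openshift-sriov-network-operator | failing_pods
#   oc logs -n openshift-sriov-network-operator sriov-cni-ft6r8 \
#       -c sriov-infiniband-cni --previous
```

Once a failing pod is identified, `oc logs --previous -c <init-container>` on it is usually the quickest way to see why the init container keeps exiting.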
release=202009050041.5133.

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (OpenShift Container Platform 4.6 GA Images), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2020:4196