Bug 1857440
| Summary: | [sriov] openshift.io/intelsriov resources are no longer available after worker node is restarted | | |
|---|---|---|---|
| Product: | OpenShift Container Platform | Reporter: | Walid A. <wabouham> |
| Component: | Networking | Assignee: | Peng Liu <pliu> |
| Networking sub component: | SR-IOV | QA Contact: | zhaozhanqi <zzhao> |
| Status: | CLOSED ERRATA | Docs Contact: | |
| Severity: | high | | |
| Priority: | unspecified | CC: | ddharwar, dosmith, mifiedle, zshi |
| Version: | 4.6 | | |
| Target Milestone: | --- | | |
| Target Release: | 4.6.0 | | |
| Hardware: | x86_64 | | |
| OS: | Linux | | |
| Whiteboard: | | | |
| Fixed In Version: | | Doc Type: | If docs needed, set a value |
| Doc Text: | | Story Points: | --- |
| Clone Of: | | Environment: | |
| Last Closed: | 2020-10-27 16:14:43 UTC | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | | | |
Description
Walid A.
2020-07-15 20:46:49 UTC
zhaozhanqi: This issue is not reproduced in my cluster. After the node reboots, the sriov-device-plugin pod is re-created automatically, and once it is running, the VFs are restored correctly:
sriov-device-plugin-tm454 1/1 Running 0 37s
oc get node -o yaml | grep "openshift.io/intelnetdevice"
openshift.io/intelnetdevice: "5"
openshift.io/intelnetdevice: "5"
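(Equivalently, the allocatable count can be read straight from the node status; a sketch, with the node name as a placeholder and the resource name taken from the output above:)

oc get node <node-name> -o jsonpath="{.status.allocatable['openshift\.io/intelnetdevice']}"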
Walid A.: @zhaozhanqi, which version of OCP do you not see this issue on? You have "openshift.io/intelnetdevice" resources while we have "openshift.io/intelsriov", so I am not sure whether that explains it or whether our configurations differ. We are also seeing this issue on another bare-metal IPI cluster (same hardware) running OCP 4.5.0-rc.1 (@ddharwar's cluster).

Peng Liu: Hi Walid, could you share the output of 'oc get -n openshift-sriov-network-operator SriovNetworkNodeState -o yaml' and the log of the 'sriov-network-config-daemon-xxxx' pod after the node reboot? Also, please keep checking the node resources until the 'syncStatus' in the SriovNetworkNodeState CR becomes 'Succeeded'. I cannot reproduce this issue in my environment either; as zhanqi mentioned, the sriov-network-device-plugin pod should be recreated after the node reboot.
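(A minimal sketch of the wait-then-check that Peng describes, assuming the standard SriovNetworkNodeState status fields; the node name is a placeholder and the daemon pod suffix is elided as in the request above:)

# Wait until the node's SriovNetworkNodeState reports syncStatus=Succeeded,
# at which point the node's allocatable resources are meaningful to check.
until [ "$(oc -n openshift-sriov-network-operator get sriovnetworknodestate <node-name> \
           -o jsonpath='{.status.syncStatus}')" = "Succeeded" ]; do
  sleep 10
done
# Then collect the config daemon log for the node.
oc -n openshift-sriov-network-operator logs sriov-network-config-daemon-xxxx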
Peng Liu: Hi Walid, according to the log you posted, you are using the 4.6 origin images of the sriov-network-operator instead of the downstream 4.5 images.

Walid A.: Hi Peng, I was able to verify that openshift.io/intelsriov resources do not disappear after a worker node reboot on an OCP 4.5.3 bare-metal cluster with the v4.5 sriov-network-operator images (deployed from OperatorHub). However, when I deploy the sriov-network-operator from the GitHub master branch or from release-4.6 (both of which include the merged fix openshift/sriov-network-operator/pull/308), the sriov-cni pod goes into Init:CrashLoopBackOff after the SR-IOV policy is deployed (an illustrative policy sketch follows the pod listing below):
# oc get pods -n openshift-sriov-network-operator
NAME READY STATUS RESTARTS AGE
network-resources-injector-5fghq 1/1 Running 0 138m
network-resources-injector-5n4kt 1/1 Running 0 138m
network-resources-injector-7rdlc 1/1 Running 0 138m
operator-webhook-97wbd 1/1 Running 0 138m
operator-webhook-s5hv2 1/1 Running 0 138m
operator-webhook-sjrcd 1/1 Running 0 138m
sriov-cni-ft6r8 0/1 Init:CrashLoopBackOff 28 124m
sriov-device-plugin-lmknz 1/1 Running 0 122m
sriov-network-config-daemon-bzclg 1/1 Running 0 138m
sriov-network-config-daemon-ks2hw 1/1 Running 0 138m
sriov-network-operator-785676dfcf-wsgjt 1/1 Running 0 138m
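(For context, the "SR-IOV policy" mentioned above is a SriovNetworkNodePolicy CR; a minimal illustrative sketch, where the name, node selector, PF name, and VF count are assumptions rather than the values actually used:)

apiVersion: sriovnetwork.openshift.io/v1
kind: SriovNetworkNodePolicy
metadata:
  name: intelsriov-policy            # hypothetical name
  namespace: openshift-sriov-network-operator
spec:
  resourceName: intelsriov           # surfaces as openshift.io/intelsriov on the node
  nodeSelector:
    feature.node.kubernetes.io/network-sriov.capable: "true"   # assumed node label
  numVfs: 5                          # assumed VF count
  nicSelector:
    pfNames: ["ens2f0"]              # placeholder PF name
  deviceType: netdevice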
The sriov-network-operator pod reports RELEASE_VERSION 4.3.0:
# oc describe pod -n openshift-sriov-network-operator sriov-network-operator-785676dfcf-wsgjt
Name: sriov-network-operator-785676dfcf-wsgjt
Namespace: openshift-sriov-network-operator
Priority: 0
Node: master-2/192.168.222.12
Start Time: Wed, 29 Jul 2020 01:54:14 +0000
Labels: name=sriov-network-operator
pod-template-hash=785676dfcf
Annotations: k8s.ovn.org/pod-networks:
{"default":{"ip_addresses":["10.128.0.10/23"],"mac_address":"fa:28:ea:80:00:0b","gateway_ips":["10.128.0.1"],"ip_address":"10.128.0.10/23"...
k8s.v1.cni.cncf.io/network-status:
[{
"name": "ovn-kubernetes",
"interface": "eth0",
"ips": [
"10.128.0.10"
],
"mac": "fa:28:ea:80:00:0b",
"default": true,
"dns": {}
}]
k8s.v1.cni.cncf.io/networks-status:
[{
"name": "ovn-kubernetes",
"interface": "eth0",
"ips": [
"10.128.0.10"
],
"mac": "fa:28:ea:80:00:0b",
"default": true,
"dns": {}
}]
openshift.io/scc: restricted
Status: Running
IP: 10.128.0.10
IPs:
IP: 10.128.0.10
Controlled By: ReplicaSet/sriov-network-operator-785676dfcf
Containers:
sriov-network-operator:
Container ID: cri-o://0adae7c5713ddd4dc116f16742ceb31272aa97eb7d592687892b3781b8265288
Image: quay.io/openshift/origin-sriov-network-operator@sha256:3383f608660e0b153ddd8b70f33f295b028662ecfa0e732834cfaf97e5a3a34a
Image ID: quay.io/openshift/origin-sriov-network-operator@sha256:3383f608660e0b153ddd8b70f33f295b028662ecfa0e732834cfaf97e5a3a34a
Port: <none>
Host Port: <none>
Command:
sriov-network-operator
State: Running
Started: Wed, 29 Jul 2020 01:54:20 +0000
Ready: True
Restart Count: 0
Environment:
WATCH_NAMESPACE: openshift-sriov-network-operator (v1:metadata.namespace)
SRIOV_CNI_IMAGE: quay.io/openshift/origin-sriov-cni@sha256:38ce1d1ab4d1e6508ea860fdce37e2746a26796a49eff2ec5c569e689459198b
SRIOV_INFINIBAND_CNI_IMAGE: quay.io/openshift/origin-sriov-infiniband-cni@sha256:26e1e88443e2f258dd06b196f549c346d1061c961c4216ac30a6ae0d8e413a57
SRIOV_DEVICE_PLUGIN_IMAGE: quay.io/openshift/origin-sriov-network-device-plugin@sha256:757c8f8659ed4702918e30efff604cb40aecc8424ec1cc43ad655bc1594d0739
NETWORK_RESOURCES_INJECTOR_IMAGE: quay.io/openshift/origin-sriov-dp-admission-controller@sha256:9128b6ec8c2b4895b9e927b1b54f01d2e51440d3d6232f9b3c08b17b861d8209
OPERATOR_NAME: sriov-network-operator
SRIOV_NETWORK_CONFIG_DAEMON_IMAGE: quay.io/openshift/origin-sriov-network-config-daemon@sha256:f9135e2381d0986e421e4a4799615de2cbf0690e22c90470b4863d956164b67e
SRIOV_NETWORK_WEBHOOK_IMAGE: quay.io/openshift/origin-sriov-network-webhook@sha256:0f54d40344967eb052a0a90bdf6b9447dc68f93f77ff9d4e40955e6b217b7a9a
RESOURCE_PREFIX: openshift.io
ENABLE_ADMISSION_CONTROLLER: true
NAMESPACE: openshift-sriov-network-operator (v1:metadata.namespace)
POD_NAME: sriov-network-operator-785676dfcf-wsgjt (v1:metadata.name)
RELEASE_VERSION: 4.3.0
SRIOV_CNI_BIN_PATH:
Mounts:
/var/run/secrets/kubernetes.io/serviceaccount from sriov-network-operator-token-dnfkj (ro)
Conditions:
Type Status
Initialized True
Ready True
ContainersReady True
PodScheduled True
Volumes:
sriov-network-operator-token-dnfkj:
Type: Secret (a volume populated by a Secret)
SecretName: sriov-network-operator-token-dnfkj
Optional: false
QoS Class: BestEffort
Node-Selectors: node-role.kubernetes.io/master=
Tolerations: node-role.kubernetes.io/master:NoSchedule
node.kubernetes.io/not-ready:NoExecute for 300s
node.kubernetes.io/unreachable:NoExecute for 300s
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Normal Scheduled <unknown> default-scheduler Successfully assigned openshift-sriov-network-operator/sriov-network-operator-785676dfcf-wsgjt to master-2
Normal AddedInterface 137m multus Add eth0 [10.128.0.10/23]
Normal Pulling 137m kubelet, master-2 Pulling image "quay.io/openshift/origin-sriov-network-operator@sha256:3383f608660e0b153ddd8b70f33f295b028662ecfa0e732834cfaf97e5a3a34a"
Normal Pulled 137m kubelet, master-2 Successfully pulled image "quay.io/openshift/origin-sriov-network-operator@sha256:3383f608660e0b153ddd8b70f33f295b028662ecfa0e732834cfaf97e5a3a34a"
Normal Created 137m kubelet, master-2 Created container sriov-network-operator
Normal Started 137m kubelet, master-2 Started container sriov-network-operator
-----------
# oc describe pods/sriov-cni-ft6r8 -n openshift-sriov-network-operator
Name: sriov-cni-ft6r8
Namespace: openshift-sriov-network-operator
Priority: 0
Node: worker000/192.168.222.13
Start Time: Wed, 29 Jul 2020 02:09:48 +0000
Labels: app=sriov-cni
component=network
controller-revision-hash=75c97d4f88
openshift.io/component=network
pod-template-generation=3
type=infra
Annotations: k8s.ovn.org/pod-networks:
{"default":{"ip_addresses":["10.128.2.4/23"],"mac_address":"fa:28:ea:80:02:05","gateway_ips":["10.128.2.1"],"ip_address":"10.128.2.4/23","...
k8s.v1.cni.cncf.io/network-status:
[{
"name": "ovn-kubernetes",
"interface": "eth0",
"ips": [
"10.128.2.4"
],
"mac": "fa:28:ea:80:02:05",
"default": true,
"dns": {}
}]
k8s.v1.cni.cncf.io/networks-status:
[{
"name": "ovn-kubernetes",
"interface": "eth0",
"ips": [
"10.128.2.4"
],
"mac": "fa:28:ea:80:02:05",
"default": true,
"dns": {}
}]
openshift.io/scc: privileged
Status: Pending
IP: 10.128.2.4
IPs:
IP: 10.128.2.4
Controlled By: DaemonSet/sriov-cni
Init Containers:
sriov-infiniband-cni:
Container ID: cri-o://cc4a57eeab17bbb8c58e53c9e8a190082e0ad55925dbbbf0e701229c8cd48cbd
Image: quay.io/openshift/origin-sriov-infiniband-cni@sha256:26e1e88443e2f258dd06b196f549c346d1061c961c4216ac30a6ae0d8e413a57
Image ID: quay.io/openshift/origin-sriov-infiniband-cni@sha256:26e1e88443e2f258dd06b196f549c346d1061c961c4216ac30a6ae0d8e413a57
Port: <none>
Host Port: <none>
State: Waiting
Reason: CrashLoopBackOff
Last State: Terminated
Reason: Error
Exit Code: 1
Started: Wed, 29 Jul 2020 02:15:48 +0000
Finished: Wed, 29 Jul 2020 02:15:48 +0000
Ready: False
Restart Count: 6
Environment: <none>
Mounts:
/host/opt/cni/bin from cnibin (rw)
/var/run/secrets/kubernetes.io/serviceaccount from sriov-cni-token-6gxhb (ro)
Containers:
sriov-cni:
Container ID:
Image: quay.io/openshift/origin-sriov-cni@sha256:38ce1d1ab4d1e6508ea860fdce37e2746a26796a49eff2ec5c569e689459198b
Image ID:
Port: <none>
Host Port: <none>
State: Waiting
Reason: PodInitializing
Ready: False
Restart Count: 0
Environment: <none>
Mounts:
/host/opt/cni/bin from cnibin (rw)
/var/run/secrets/kubernetes.io/serviceaccount from sriov-cni-token-6gxhb (ro)
Conditions:
Type Status
Initialized False
Ready False
ContainersReady False
PodScheduled True
Volumes:
cnibin:
Type: HostPath (bare host directory volume)
Path: /var/lib/cni/bin
HostPathType:
sriov-cni-token-6gxhb:
Type: Secret (a volume populated by a Secret)
SecretName: sriov-cni-token-6gxhb
Optional: false
QoS Class: BestEffort
Node-Selectors: beta.kubernetes.io/os=linux
Tolerations:
node.kubernetes.io/disk-pressure:NoSchedule
node.kubernetes.io/memory-pressure:NoSchedule
node.kubernetes.io/not-ready:NoExecute
node.kubernetes.io/pid-pressure:NoSchedule
node.kubernetes.io/unreachable:NoExecute
node.kubernetes.io/unschedulable:NoSchedule
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Normal Scheduled <unknown> default-scheduler Successfully assigned openshift-sriov-network-operator/sriov-cni-ft6r8 to worker000
Normal AddedInterface 8m1s multus Add eth0 [10.128.2.4/23]
Normal Pulled 6m20s (x5 over 8m) kubelet, worker000 Container image "quay.io/openshift/origin-sriov-infiniband-cni@sha256:26e1e88443e2f258dd06b196f549c346d1061c961c4216ac30a6ae0d8e413a57" already present on machine
Normal Created 6m20s (x5 over 8m) kubelet, worker000 Created container sriov-infiniband-cni
Normal Started 6m20s (x5 over 8m) kubelet, worker000 Started container sriov-infiniband-cni
Warning BackOff <invalid> (x48 over 7m58s) kubelet, worker000 Back-off restarting failed container
Peng Liu: Walid, that is a different issue. The sriov-cni pod problem should be fixed by https://github.com/openshift/sriov-network-operator/pull/312. However, the sriov-network-operator origin images have recently not been built as expected, so you may need to wait until the origin image builds return to normal.

Verified this bug on 4.6.0-202008121454.p0. After the node restarted, the SR-IOV resources were restored to normal.

Also verified this bug on a bare-metal OCP 4.6.0-0.nightly-2020-10-03-051134 cluster with SR-IOV deployed from the GitHub master branch (sriov-network-operator image build-date=2020-09-05T01:13:15.933978, release=202009050041.5133).

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (OpenShift Container Platform 4.6 GA Images), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2020:4196