Description of problem:
After a kernel upgrade, DaemonSets are not reconciled until the node on which the SRO manager is running is restarted. As a result, pods are lost when they are unscheduled from non-kernel-affine nodes (nodes whose kernel no longer matches the pods' kernel affinity).

Version-Release number of selected component (if applicable):

How reproducible:

Steps to Reproduce:
1. Install SRO and a DaemonSet SR with kernel-affine pods.
2. Upgrade the kernel on a worker.
3. Observe that a pod is lost.

Actual results:

Expected results:

Additional info:
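To illustrate the failure mode described above, here is a minimal sketch in Python. It assumes a simplified model in which one DaemonSet exists per kernel version and each DaemonSet only targets nodes running exactly that kernel. The function name, the node names, and the model itself are hypothetical and are not taken from the SRO code; the two kernel versions are the ones that appear in the verification log in this report.

```python
# Hypothetical model, not SRO code: one DaemonSet per kernel version,
# each pinned to nodes that run exactly that kernel.

def uncovered_nodes(node_kernels, daemonset_kernels):
    """Return the sorted names of nodes whose kernel has no matching DaemonSet."""
    return sorted(
        node
        for node, kernel in node_kernels.items()
        if kernel not in daemonset_kernels
    )


# Kernel versions from the verification log in this report.
OLD = "4.18.0-305.el8.x86_64"
NEW = "4.18.0-348.7.1.el8_5.x86_64"

# Illustrative node names (assumption).
nodes = {"worker-a": OLD, "worker-b": OLD}
daemonsets = {OLD}  # created when the SR was first reconciled

# Kernel upgrade on one worker: its old pod no longer matches the node.
nodes["worker-b"] = NEW

# Without a new reconcile, no DaemonSet targets NEW, so worker-b has no pod.
lost = uncovered_nodes(nodes, daemonsets)
print(lost)  # ['worker-b']

# What a reconcile would be expected to do in this model: add a DaemonSet for NEW.
daemonsets.add(NEW)
print(uncovered_nodes(nodes, daemonsets))  # []
```

In this model the pod is lost in the window between the kernel change and the next reconcile, which matches the report that nothing happens until the node running the SRO manager is restarted.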
Verified Results:

######################################################
When workers have different kernel versions, the pod on the new worker node fails to start with ImagePullBackOff. The image tag includes the kernel version, and the image is tagged with the kernel version of the node on which the build-config job ran.
######################################################
[ocpadmin@ec2-18-217-45-133 k]$ oc describe pod simple-kmod-driver-container-396f682197e94c38-rjn95 -n simple-kmod
Name:         simple-kmod-driver-container-396f682197e94c38-rjn95
Namespace:    simple-kmod
Priority:     0
Node:         ip-10-0-54-185.us-east-2.compute.internal/10.0.54.185
Start Time:   Wed, 29 Dec 2021 06:23:13 +0000
Labels:       app=simple-kmod-driver-container-396f682197e94c38
              controller-revision-hash=df8d695dc
              pod-template-generation=1
              specialresource.openshift.io/owned=true
Annotations:  k8s.ovn.org/pod-networks:
.....................................................
Events:
  Type     Reason          Age                   From               Message
  ----     ------          ----                  ----               -------
  Normal   Scheduled       37m                   default-scheduler  Successfully assigned simple-kmod/simple-kmod-driver-container-396f682197e94c38-rjn95 to ip-10-0-54-185.us-east-2.compute.internal
  Normal   AddedInterface  37m                   multus             Add eth0 [10.130.2.10/23] from ovn-kubernetes
  Warning  Failed          35m (x6 over 37m)     kubelet            Error: ImagePullBackOff
  Normal   Pulling         35m (x4 over 37m)     kubelet            Pulling image "image-registry.openshift-image-registry.svc:5000/simple-kmod/simple-kmod-driver-container:v4.18.0-305.el8.x86_64"
  Warning  Failed          35m (x4 over 37m)     kubelet            Failed to pull image "image-registry.openshift-image-registry.svc:5000/simple-kmod/simple-kmod-driver-container:v4.18.0-305.el8.x86_64": rpc error: code = Unknown desc = reading manifest v4.18.0-305.el8.x86_64 in image-registry.openshift-image-registry.svc:5000/simple-kmod/simple-kmod-driver-container: manifest unknown: manifest unknown
  Warning  Failed          35m (x4 over 37m)     kubelet            Error: ErrImagePull
  Normal   BackOff         2m9s (x153 over 37m)  kubelet            Back-off pulling image "image-registry.openshift-image-registry.svc:5000/simple-kmod/simple-kmod-driver-container:v4.18.0-305.el8.x86_64"

[ocpadmin@ec2-18-217-45-133 k]$ oc get pods -n simple-kmod
NAME                                                  READY   STATUS             RESTARTS   AGE
simple-kmod-driver-build-396f682197e94c38-1-build     0/1     Error              0          36m
simple-kmod-driver-build-7a2fc1535ea1b11f-1-build     0/1     Completed          0          36m
simple-kmod-driver-container-396f682197e94c38-rjn95   0/1     ImagePullBackOff   0          38m
simple-kmod-driver-container-7a2fc1535ea1b11f-ffd87   1/1     Running            0          37m
simple-kmod-driver-container-7a2fc1535ea1b11f-gxsc7   1/1     Running            0          37m
simple-kmod-driver-container-7a2fc1535ea1b11f-qd82z   1/1     Running            0          37m

######################################################
No simple-kmod pod is scheduled on the node that has the higher kernel version:
######################################################
[ocpadmin@ec2-18-217-45-133 k]$ oc get pods -n simple-kmod
NAME                                                  READY   STATUS      RESTARTS   AGE
simple-kmod-driver-build-396f682197e94c38-1-build     0/1     Error       0          116m
simple-kmod-driver-build-7a2fc1535ea1b11f-1-build     0/1     Completed   0          115m
simple-kmod-driver-container-396f682197e94c38-ngwnp   1/1     Running     0          67m
simple-kmod-driver-container-7a2fc1535ea1b11f-ffd87   1/1     Running     0          117m
simple-kmod-driver-container-7a2fc1535ea1b11f-gxsc7   1/1     Running     0          117m
simple-kmod-driver-container-7a2fc1535ea1b11f-qd82z   1/1     Running     0          117m

[ocpadmin@ec2-18-217-45-133 k]$ oc get pods -n simple-kmod -o wide
NAME                                                  READY   STATUS      RESTARTS   AGE    IP             NODE                                        NOMINATED NODE   READINESS GATES
simple-kmod-driver-build-396f682197e94c38-1-build     0/1     Error       0          116m   10.130.2.11    ip-10-0-54-185.us-east-2.compute.internal   <none>           <none>
simple-kmod-driver-build-7a2fc1535ea1b11f-1-build     0/1     Completed   0          116m   10.129.2.148   ip-10-0-59-7.us-east-2.compute.internal     <none>           <none>
simple-kmod-driver-container-396f682197e94c38-ngwnp   1/1     Running     0          67m    10.130.2.23    ip-10-0-54-185.us-east-2.compute.internal   <none>           <none>
simple-kmod-driver-container-7a2fc1535ea1b11f-ffd87   1/1     Running     0          117m   10.128.2.32    ip-10-0-61-240.us-east-2.compute.internal   <none>           <none>
simple-kmod-driver-container-7a2fc1535ea1b11f-gxsc7   1/1     Running     0          117m   10.131.0.25    ip-10-0-68-29.us-east-2.compute.internal    <none>           <none>
simple-kmod-driver-container-7a2fc1535ea1b11f-qd82z   1/1     Running     0          117m   10.129.2.147   ip-10-0-59-7.us-east-2.compute.internal     <none>           <none>

[ocpadmin@ec2-18-217-45-133 k]$ oc get nodes
NAME                                        STATUS   ROLES    AGE     VERSION
ip-10-0-48-229.us-east-2.compute.internal   Ready    master   7h3m    v1.22.3+e790d7f
ip-10-0-49-124.us-east-2.compute.internal   Ready    master   7h3m    v1.22.3+e790d7f
ip-10-0-54-185.us-east-2.compute.internal   Ready    worker   4h19m   v1.22.3+ffbb954
ip-10-0-59-7.us-east-2.compute.internal     Ready    worker   6h45m   v1.22.3+e790d7f
ip-10-0-60-73.us-east-2.compute.internal    Ready    worker   2m26s   v1.22.3+ffbb954
ip-10-0-61-240.us-east-2.compute.internal   Ready    worker   6h46m   v1.22.3+e790d7f
ip-10-0-68-29.us-east-2.compute.internal    Ready    worker   6h46m   v1.22.3+e790d7f
ip-10-0-69-143.us-east-2.compute.internal   Ready    master   7h3m    v1.22.3+e790d7f

[ocpadmin@ec2-18-217-45-133 k]$ oc debug node/ip-10-0-60-73.us-east-2.compute.internal
Starting pod/ip-10-0-60-73us-east-2computeinternal-debug ...
To use host binaries, run `chroot /host`
Pod IP: 10.0.60.73
If you don't see a command prompt, try pressing enter.
sh-4.4#
sh-4.4# chroot /host
sh-4.4# uname -a
Linux ip-10-0-60-73.us-east-2.compute.internal 4.18.0-348.7.1.el8_5.x86_64 #1 SMP Wed Dec 8 21:51:17 EST 2021 x86_64 x86_64 x86_64 GNU/Linux
sh-4.4#

######################################################
After the kernel is upgraded on one worker node, the pod on the upgraded node terminates automatically and no new pod is created.
######################################################
[ec2-user@ip-10-0-60-73 ~]$ uname -a
Linux ip-10-0-60-73.us-east-2.compute.internal 4.18.0-305.el8.x86_64 #1 SMP Thu Apr 29 08:54:30 EDT 2021 x86_64 x86_64 x86_64 GNU/Linux
[root@ip-10-0-60-73 ec2-user]# yum -y update kernel
Updating Subscription Management repositories.
Red Hat Update Infrastructure 3 Client Configuration Server 8        9.6 kB/s | 2.1 kB     00:00
Red Hat Enterprise Linux 8 for x86_64 - AppStream from RHUI (RPMs)    14 kB/s | 2.8 kB     00:00
Red Hat Enterprise Linux 8 for x86_64 - BaseOS from RHUI (RPMs)       13 kB/s | 2.4 kB     00:00
Dependencies resolved.
====================================================================================================
 Package           Architecture   Version                 Repository                  Size
====================================================================================================
Installing:
 kernel            x86_64         4.18.0-348.7.1.el8_5    rhel-8-baseos-rhui-rpms     7.0 M
Installing dependencies:
 kernel-core       x86_64         4.18.0-348.7.1.el8_5    rhel-8-baseos-rhui-rpms      38 M
 kernel-modules    x86_64         4.18.0-348.7.1.el8_5    rhel-8-baseos-rhui-rpms      30 M

Transaction Summary
====================================================================================================
Install  3 Packages

Total size: 74 M
Installed size: 90 M
Downloading Packages:
[SKIPPED] kernel-core-4.18.0-348.7.1.el8_5.x86_64.rpm: Already downloaded
[SKIPPED] kernel-4.18.0-348.7.1.el8_5.x86_64.rpm: Already downloaded
[SKIPPED] kernel-modules-4.18.0-348.7.1.el8_5.x86_64.rpm: Already downloaded
Running transaction check
Transaction check succeeded.
Running transaction test
Transaction test succeeded.
Running transaction
  Preparing        :                                                1/1
  Installing       : kernel-core-4.18.0-348.7.1.el8_5.x86_64        1/3
  Running scriptlet: kernel-core-4.18.0-348.7.1.el8_5.x86_64        1/3
  Installing       : kernel-modules-4.18.0-348.7.1.el8_5.x86_64     2/3
  Running scriptlet: kernel-modules-4.18.0-348.7.1.el8_5.x86_64     2/3
  Installing       : kernel-4.18.0-348.7.1.el8_5.x86_64             3/3
  Running scriptlet: kernel-core-4.18.0-348.7.1.el8_5.x86_64        3/3
  Running scriptlet: kernel-4.18.0-348.7.1.el8_5.x86_64             3/3
  Verifying        : kernel-core-4.18.0-348.7.1.el8_5.x86_64        1/3
  Verifying        : kernel-4.18.0-348.7.1.el8_5.x86_64             2/3
  Verifying        : kernel-modules-4.18.0-348.7.1.el8_5.x86_64     3/3
Installed products updated.

Installed:
  kernel-4.18.0-348.7.1.el8_5.x86_64  kernel-core-4.18.0-348.7.1.el8_5.x86_64  kernel-modules-4.18.0-348.7.1.el8_5.x86_64

Complete!
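
The relationship between node kernel and image tag seen in the results above can be sketched as follows. This is a minimal Python illustration, assuming the driver-container tag is simply the letter "v" followed by the node's kernel release, as in the tag v4.18.0-305.el8.x86_64 from the pull error. The function names and the set of available tags are hypothetical; the real contents of the registry are not shown in this report.

```python
def expected_tag(kernel_release):
    """Build the tag a node with this kernel would pull (assumed format: 'v' + release)."""
    return "v" + kernel_release


def nodes_missing_image(node_kernels, available_tags):
    """Map each node to its expected tag when that tag is not in the registry."""
    missing = {}
    for node, kernel in sorted(node_kernels.items()):
        tag = expected_tag(kernel)
        if tag not in available_tags:
            missing[node] = tag
    return missing


# Kernel reported by uname on the upgraded worker in the log above.
node_kernels = {
    "ip-10-0-60-73.us-east-2.compute.internal": "4.18.0-348.7.1.el8_5.x86_64",
}

# Assumption for illustration: only an image built for the old kernel exists.
available_tags = {"v4.18.0-305.el8.x86_64"}

for node, tag in nodes_missing_image(node_kernels, available_tags).items():
    print(f"{node}: no image tagged {tag}")
```

Under these assumptions, a node whose kernel differs from the one the image was built against has nothing to pull until a build for its own kernel completes, which is consistent with the ImagePullBackOff shown earlier.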
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (Moderate: OpenShift Container Platform 4.10.3 security update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHSA-2022:0056