2018542 – Kernel upgrade does not reconcile DaemonSet

Summary: Kernel upgrade does not reconcile DaemonSet

Keywords:
Status:	CLOSED ERRATA
Alias:	None
Product:	OpenShift Container Platform
Classification:	Red Hat
Component:	Special Resource Operator
Sub Component:
Version:	4.10
Hardware:	Unspecified
OS:	Unspecified
Priority:	urgent
Severity:	high
Target Milestone:	---
Target Release:	4.10.0
Assignee:	Pablo Acevedo
QA Contact:	liqcui
Docs Contact:
URL:
Whiteboard:
Depends On:
Blocks:

Reported:	2021-10-29 16:06 UTC by Pablo Acevedo
Modified:	2022-03-10 16:24 UTC
CC List:	2 users
Fixed In Version:
Doc Type:	If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed:	2022-03-10 16:23:41 UTC
Target Upstream Version:
Embargoed:

Links
System	ID	Private	Priority	Status	Summary	Last Updated
Github	openshift special-resource-operator pull 65	0	None	open	Bug 2018542: Kernel upgrade does not reconcile DaemonSet	2021-11-04 15:44:05 UTC
Red Hat Product Errata	RHSA-2022:0056	0	None	None	None	2022-03-10 16:24:03 UTC

Description Pablo Acevedo 2021-10-29 16:06:12 UTC

Description of problem:
After a kernel upgrade, DaemonSets are not reconciled until the node on which the SRO manager is running is restarted. As a result, pods are lost when they are unscheduled from nodes that no longer match their kernel affinity.

Version-Release number of selected component (if applicable):


How reproducible:


Steps to Reproduce:
1. Install SRO and a DaemonSet SR with kernel-affine pods.
2. Upgrade the kernel on a worker node.
3. Observe that a pod is lost.

Actual results:


Expected results:


Additional info:

Comment 4 liqcui 2021-12-29 09:07:36 UTC

Verified Results:

######################################################
When the workers have different kernel versions, the new worker node fails to create the pod due to ImagePullBackOff. The image is tagged with a kernel version, and that tag takes the kernel version of the node on which the build-configure job ran.
######################################################

[ocpadmin@ec2-18-217-45-133 k]$ oc describe pod simple-kmod-driver-container-396f682197e94c38-rjn95 -n simple-kmod
Name:         simple-kmod-driver-container-396f682197e94c38-rjn95
Namespace:    simple-kmod
Priority:     0
Node:         ip-10-0-54-185.us-east-2.compute.internal/10.0.54.185
Start Time:   Wed, 29 Dec 2021 06:23:13 +0000
Labels:       app=simple-kmod-driver-container-396f682197e94c38
              controller-revision-hash=df8d695dc
              pod-template-generation=1
              specialresource.openshift.io/owned=true
Annotations:  k8s.ovn.org/pod-networks:
 .....................................................
Events:
  Type     Reason          Age                   From               Message
  ----     ------          ----                  ----               -------
  Normal   Scheduled       37m                   default-scheduler  Successfully assigned simple-kmod/simple-kmod-driver-container-396f682197e94c38-rjn95 to ip-10-0-54-185.us-east-2.compute.internal
  Normal   AddedInterface  37m                   multus             Add eth0 [10.130.2.10/23] from ovn-kubernetes
  Warning  Failed          35m (x6 over 37m)     kubelet            Error: ImagePullBackOff
  Normal   Pulling         35m (x4 over 37m)     kubelet            Pulling image "image-registry.openshift-image-registry.svc:5000/simple-kmod/simple-kmod-driver-container:v4.18.0-305.el8.x86_64"
  Warning  Failed          35m (x4 over 37m)     kubelet            Failed to pull image "image-registry.openshift-image-registry.svc:5000/simple-kmod/simple-kmod-driver-container:v4.18.0-305.el8.x86_64": rpc error: code = Unknown desc = reading manifest v4.18.0-305.el8.x86_64 in image-registry.openshift-image-registry.svc:5000/simple-kmod/simple-kmod-driver-container: manifest unknown: manifest unknown
  Warning  Failed          35m (x4 over 37m)     kubelet            Error: ErrImagePull
  Normal   BackOff         2m9s (x153 over 37m)  kubelet            Back-off pulling image "image-registry.openshift-image-registry.svc:5000/simple-kmod/simple-kmod-driver-container:v4.18.0-305.el8.x86_64"
[ocpadmin@ec2-18-217-45-133 k]$ oc get pods -n simple-kmod
NAME                                                  READY   STATUS             RESTARTS   AGE
simple-kmod-driver-build-396f682197e94c38-1-build     0/1     Error              0          36m
simple-kmod-driver-build-7a2fc1535ea1b11f-1-build     0/1     Completed          0          36m
simple-kmod-driver-container-396f682197e94c38-rjn95   0/1     ImagePullBackOff   0          38m
simple-kmod-driver-container-7a2fc1535ea1b11f-ffd87   1/1     Running            0          37m
simple-kmod-driver-container-7a2fc1535ea1b11f-gxsc7   1/1     Running            0          37m
simple-kmod-driver-container-7a2fc1535ea1b11f-qd82z   1/1     Running            0          37m
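
The failing pull above requests a tag of the form `v<kernel version>`. As a minimal sketch of that naming pattern, the helpers below build the tag and image reference for a given kernel version. The function names are hypothetical, and `tag_exists` assumes the image is backed by an ImageStream named `simple-kmod-driver-container` in the `simple-kmod` namespace, which the events suggest but do not confirm.

```shell
# Hypothetical helper: derive the image tag from a kernel version,
# following the "v<kernel version>" pattern seen in the pod events.
kernel_image_tag() {
  printf 'v%s\n' "$1"
}

# Hypothetical helper: build the full image reference for a kernel version,
# using the registry path shown in the pod events.
image_ref_for_kernel() {
  printf 'image-registry.openshift-image-registry.svc:5000/simple-kmod/simple-kmod-driver-container:%s\n' \
    "$(kernel_image_tag "$1")"
}

# Assumption: the image lives in an ImageStream of the same name.
# Returns 0 if the tag exists, non-zero otherwise. Needs a logged-in oc session.
tag_exists() {
  oc get istag "simple-kmod-driver-container:$(kernel_image_tag "$1")" \
    -n simple-kmod >/dev/null 2>&1
}
```

For example, `tag_exists 4.18.0-305.el8.x86_64 || echo "no image for this kernel"` would report a missing tag before a pod runs into ImagePullBackOff.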

######################################################
No simple-kmod pod is scheduled to the node that has a higher kernel version:
######################################################

[ocpadmin@ec2-18-217-45-133 k]$ oc get pods -n simple-kmod
NAME                                                  READY   STATUS      RESTARTS   AGE
simple-kmod-driver-build-396f682197e94c38-1-build     0/1     Error       0          116m
simple-kmod-driver-build-7a2fc1535ea1b11f-1-build     0/1     Completed   0          115m
simple-kmod-driver-container-396f682197e94c38-ngwnp   1/1     Running     0          67m
simple-kmod-driver-container-7a2fc1535ea1b11f-ffd87   1/1     Running     0          117m
simple-kmod-driver-container-7a2fc1535ea1b11f-gxsc7   1/1     Running     0          117m
simple-kmod-driver-container-7a2fc1535ea1b11f-qd82z   1/1     Running     0          117m
[ocpadmin@ec2-18-217-45-133 k]$ oc get pods -n simple-kmod -o wide
NAME                                                  READY   STATUS      RESTARTS   AGE    IP             NODE                                        NOMINATED NODE   READINESS GATES
simple-kmod-driver-build-396f682197e94c38-1-build     0/1     Error       0          116m   10.130.2.11    ip-10-0-54-185.us-east-2.compute.internal   <none>           <none>
simple-kmod-driver-build-7a2fc1535ea1b11f-1-build     0/1     Completed   0          116m   10.129.2.148   ip-10-0-59-7.us-east-2.compute.internal     <none>           <none>
simple-kmod-driver-container-396f682197e94c38-ngwnp   1/1     Running     0          67m    10.130.2.23    ip-10-0-54-185.us-east-2.compute.internal   <none>           <none>
simple-kmod-driver-container-7a2fc1535ea1b11f-ffd87   1/1     Running     0          117m   10.128.2.32    ip-10-0-61-240.us-east-2.compute.internal   <none>           <none>
simple-kmod-driver-container-7a2fc1535ea1b11f-gxsc7   1/1     Running     0          117m   10.131.0.25    ip-10-0-68-29.us-east-2.compute.internal    <none>           <none>
simple-kmod-driver-container-7a2fc1535ea1b11f-qd82z   1/1     Running     0          117m   10.129.2.147   ip-10-0-59-7.us-east-2.compute.internal     <none>           <none>
[ocpadmin@ec2-18-217-45-133 k]$ oc get nodes
NAME                                        STATUS   ROLES    AGE     VERSION
ip-10-0-48-229.us-east-2.compute.internal   Ready    master   7h3m    v1.22.3+e790d7f
ip-10-0-49-124.us-east-2.compute.internal   Ready    master   7h3m    v1.22.3+e790d7f
ip-10-0-54-185.us-east-2.compute.internal   Ready    worker   4h19m   v1.22.3+ffbb954
ip-10-0-59-7.us-east-2.compute.internal     Ready    worker   6h45m   v1.22.3+e790d7f
ip-10-0-60-73.us-east-2.compute.internal    Ready    worker   2m26s   v1.22.3+ffbb954
ip-10-0-61-240.us-east-2.compute.internal   Ready    worker   6h46m   v1.22.3+e790d7f
ip-10-0-68-29.us-east-2.compute.internal    Ready    worker   6h46m   v1.22.3+e790d7f
ip-10-0-69-143.us-east-2.compute.internal   Ready    master   7h3m    v1.22.3+e790d7f

[ocpadmin@ec2-18-217-45-133 k]$ oc debug node/ip-10-0-60-73.us-east-2.compute.internal
Starting pod/ip-10-0-60-73us-east-2computeinternal-debug ...
To use host binaries, run `chroot /host`

Pod IP: 10.0.60.73
If you don't see a command prompt, try pressing enter.
sh-4.4# 
sh-4.4# chroot /host
sh-4.4# uname -a
Linux ip-10-0-60-73.us-east-2.compute.internal 4.18.0-348.7.1.el8_5.x86_64 #1 SMP Wed Dec 8 21:51:17 EST 2021 x86_64 x86_64 x86_64 GNU/Linux
sh-4.4# 
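
The kernel version read above through `oc debug` is also recorded on each node object under `.status.nodeInfo.kernelVersion`, so a mixed-kernel cluster can be spotted without opening a debug pod. The following is a minimal sketch under that assumption; the function names are hypothetical.

```shell
# Print "node kernel" pairs for every node. Needs a logged-in oc session.
node_kernels() {
  oc get nodes --no-headers \
    -o custom-columns=NAME:.metadata.name,KERNEL:.status.nodeInfo.kernelVersion
}

# Hypothetical helper: read "node kernel" pairs on stdin and print each
# distinct kernel version followed by the number of nodes running it.
summarize_kernels() {
  awk '{ count[$2]++ } END { for (k in count) print k, count[k] }' | sort
}
```

Running `node_kernels | summarize_kernels` prints one line per kernel version; more than one line means the nodes do not all run the same kernel.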

######################################################
After upgrading one worker node, the pod on the upgraded node terminates automatically and no new pod is created.
######################################################

[ec2-user@ip-10-0-60-73 ~]$ uname -a
Linux ip-10-0-60-73.us-east-2.compute.internal 4.18.0-305.el8.x86_64 #1 SMP Thu Apr 29 08:54:30 EDT 2021 x86_64 x86_64 x86_64 GNU/Linux

[root@ip-10-0-60-73 ec2-user]# yum -y update kernel
Updating Subscription Management repositories.
Red Hat Update Infrastructure 3 Client Configuration Server 8                                                                       9.6 kB/s | 2.1 kB     00:00    
Red Hat Enterprise Linux 8 for x86_64 - AppStream from RHUI (RPMs)                                                                   14 kB/s | 2.8 kB     00:00    
Red Hat Enterprise Linux 8 for x86_64 - BaseOS from RHUI (RPMs)                                                                      13 kB/s | 2.4 kB     00:00    
Dependencies resolved.
====================================================================================================================================================================
 Package                              Architecture                 Version                                      Repository                                     Size
====================================================================================================================================================================
Installing:
 kernel                               x86_64                       4.18.0-348.7.1.el8_5                         rhel-8-baseos-rhui-rpms                       7.0 M
Installing dependencies:
 kernel-core                          x86_64                       4.18.0-348.7.1.el8_5                         rhel-8-baseos-rhui-rpms                        38 M
 kernel-modules                       x86_64                       4.18.0-348.7.1.el8_5                         rhel-8-baseos-rhui-rpms                        30 M

Transaction Summary
====================================================================================================================================================================
Install  3 Packages

Total size: 74 M
Installed size: 90 M
Downloading Packages:
[SKIPPED] kernel-core-4.18.0-348.7.1.el8_5.x86_64.rpm: Already downloaded                                                                                          
[SKIPPED] kernel-4.18.0-348.7.1.el8_5.x86_64.rpm: Already downloaded                                                                                               
[SKIPPED] kernel-modules-4.18.0-348.7.1.el8_5.x86_64.rpm: Already downloaded                                                                                       
Running transaction check
Transaction check succeeded.
Running transaction test
Transaction test succeeded.
Running transaction
  Preparing        :                                                                                                                                            1/1 
  Installing       : kernel-core-4.18.0-348.7.1.el8_5.x86_64                                                                                                    1/3 
  Running scriptlet: kernel-core-4.18.0-348.7.1.el8_5.x86_64                                                                                                    1/3 
  Installing       : kernel-modules-4.18.0-348.7.1.el8_5.x86_64                                                                                                 2/3 
  Running scriptlet: kernel-modules-4.18.0-348.7.1.el8_5.x86_64                                                                                                 2/3 
  Installing       : kernel-4.18.0-348.7.1.el8_5.x86_64                                                                                                         3/3 
  Running scriptlet: kernel-core-4.18.0-348.7.1.el8_5.x86_64                                                                                                    3/3 
  Running scriptlet: kernel-4.18.0-348.7.1.el8_5.x86_64                                                                                                         3/3 
  Verifying        : kernel-core-4.18.0-348.7.1.el8_5.x86_64                                                                                                    1/3 
  Verifying        : kernel-4.18.0-348.7.1.el8_5.x86_64                                                                                                         2/3 
  Verifying        : kernel-modules-4.18.0-348.7.1.el8_5.x86_64                                                                                                 3/3 
Installed products updated.

Installed:
  kernel-4.18.0-348.7.1.el8_5.x86_64                kernel-core-4.18.0-348.7.1.el8_5.x86_64                kernel-modules-4.18.0-348.7.1.el8_5.x86_64               

Complete!

Comment 7 errata-xmlrpc 2022-03-10 16:23:41 UTC

Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Moderate: OpenShift Container Platform 4.10.3 security update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2022:0056
