Bug 2093357

Summary: Upgrading an SNO spoke with acm-ice causes the SNO to become unreachable
Product: OpenShift Container Platform
Reporter: Constantin Vultur <cvultur>
Component: Special Resource Operator
Assignee: Pablo Acevedo <pacevedo>
Status: CLOSED ERRATA
QA Contact: Constantin Vultur <cvultur>
Severity: urgent
Docs Contact:
Priority: urgent
Version: 4.11
CC: bblock, bthurber, mlammon
Target Milestone: ---   
Target Release: 4.11.0   
Hardware: Unspecified   
OS: Unspecified   
Whiteboard:
Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of:
Environment:
Last Closed: 2022-08-10 11:16:16 UTC
Type: Bug

Description Constantin Vultur 2022-06-03 14:14:53 UTC
Description of problem:
The upgrade of an SNO cluster never finishes because the acm-ice image is missing from the registry.

The SNO cluster remains unreachable indefinitely.


Version-Release number of selected component (if applicable):
bundle / release-4.11

How reproducible:


Steps to Reproduce:
1. Deploy acm-ice on the SNO spokes.
2. Upgrade the SNOs, making sure that the upgrade deploys a new kernel.

Actual results:
- The SNO cluster never comes back up.
- The acm-ice service stays stuck in the activating state because the image for the new kernel is missing.
- The kubelet service never starts.


[core@sno2-0-0 ~]$ systemctl status acm-ice
● acm-ice.service - out-of-tree driver loader
   Loaded: loaded (/etc/systemd/system/acm-ice.service; enabled; vendor preset: disabled)
   Active: activating (start) since Fri 2022-06-03 13:29:48 UTC; 43min ago
 Main PID: 2272 (bash)
    Tasks: 2 (limit: 153437)
   Memory: 39.0M
      CPU: 1min 38.345s
   CGroup: /system.slice/acm-ice.service
           ├─ 2272 /usr/bin/bash -c while ! /usr/local/bin/acm-ice load registry.ocp-edge-cluster-assisted-0.qe.lab.redhat.com:5000/sro-106/acm-ice-driver-container; do sleep 10; done
           └─18684 sleep 10

Jun 03 14:12:53 sno2-0-0 bash[2272]: Trying to pull registry.ocp-edge-cluster-assisted-0.qe.lab.redhat.com:5000/sro-106/acm-ice-driver-container:4.18.0-305.28.1.el8_4.x86_64...
Jun 03 14:12:53 sno2-0-0 bash[2272]: Error: Error initializing source docker://registry.ocp-edge-cluster-assisted-0.qe.lab.redhat.com:5000/sro-106/acm-ice-driver-container:4.18.0-305.28.1.el8_4.x86_64: Error re>
Jun 03 14:13:04 sno2-0-0 bash[2272]: Trying to pull registry.ocp-edge-cluster-assisted-0.qe.lab.redhat.com:5000/sro-106/acm-ice-driver-container:4.18.0-305.28.1.el8_4.x86_64...
Jun 03 14:13:04 sno2-0-0 bash[2272]: Error: Error initializing source docker://registry.ocp-edge-cluster-assisted-0.qe.lab.redhat.com:5000/sro-106/acm-ice-driver-container:4.18.0-305.28.1.el8_4.x86_64: Error re>
Jun 03 14:13:14 sno2-0-0 bash[2272]: Trying to pull registry.ocp-edge-cluster-assisted-0.qe.lab.redhat.com:5000/sro-106/acm-ice-driver-container:4.18.0-305.28.1.el8_4.x86_64...
Jun 03 14:13:14 sno2-0-0 bash[2272]: Error: Error initializing source docker://registry.ocp-edge-cluster-assisted-0.qe.lab.redhat.com:5000/sro-106/acm-ice-driver-container:4.18.0-305.28.1.el8_4.x86_64: Error re>
Jun 03 14:13:24 sno2-0-0 bash[2272]: Trying to pull registry.ocp-edge-cluster-assisted-0.qe.lab.redhat.com:5000/sro-106/acm-ice-driver-container:4.18.0-305.28.1.el8_4.x86_64...
Jun 03 14:13:24 sno2-0-0 bash[2272]: Error: Error initializing source docker://registry.ocp-edge-cluster-assisted-0.qe.lab.redhat.com:5000/sro-106/acm-ice-driver-container:4.18.0-305.28.1.el8_4.x86_64: Error re>
Jun 03 14:13:35 sno2-0-0 bash[2272]: Trying to pull registry.ocp-edge-cluster-assisted-0.qe.lab.redhat.com:5000/sro-106/acm-ice-driver-container:4.18.0-305.28.1.el8_4.x86_64...
Jun 03 14:13:35 sno2-0-0 bash[2272]: Error: Error initializing source docker://registry.ocp-edge-cluster-assisted-0.qe.lab.redhat.com:5000/sro-106/acm-ice-driver-container:4.18.0-305.28.1.el8_4.x86_64: Error re>
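
The tag in the failing pull matches the kernel version of the upgraded node, so a pre-upgrade check could confirm that an image with that tag exists before the upgrade starts. The following is a minimal sketch, not part of acm-ice: the function names are hypothetical, and it assumes skopeo is available on the host running the check, that the registry is reachable from there, and that the target kernel version is known in advance.

```shell
#!/usr/bin/env bash
# Hypothetical helpers; not part of acm-ice or the Special Resource Operator.

# Build the image reference the loader is seen pulling in the journal:
# <repository>:<kernel version>.
driver_image_ref() {
    local repo="$1" kernel="$2"
    printf '%s:%s\n' "$repo" "$kernel"
}

# Return 0 if the image for the given kernel can be inspected in the registry.
# Assumes skopeo is installed and can reach (and authenticate to) the registry.
driver_image_exists() {
    local ref
    ref="$(driver_image_ref "$1" "$2")"
    skopeo inspect "docker://${ref}" >/dev/null 2>&1
}

# Example (values taken from the log above):
#   driver_image_exists \
#     registry.ocp-edge-cluster-assisted-0.qe.lab.redhat.com:5000/sro-106/acm-ice-driver-container \
#     4.18.0-305.28.1.el8_4.x86_64 || echo "driver image missing for target kernel"
```

Running such a check on the hub before starting the spoke upgrade would surface the missing image while the node is still reachable.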



Expected results:
The upgrade should not get stuck.
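
For illustration, the unbounded `while ! ...; do sleep 10; done` loop shown in the unit status could be given an upper limit on attempts so that the service eventually fails instead of staying in the activating state. This is only a sketch under assumptions: `retry_bounded` is a hypothetical helper, it is not the fix that shipped for this bug, and whether kubelet would then start depends on how the unit is ordered, which this report does not show.

```shell
#!/usr/bin/env bash
# Hypothetical helper; not the fix that shipped for this bug.

# Run a command up to max_attempts times, sleeping delay seconds between
# failed attempts. Returns 0 on the first success, 1 if every attempt fails.
retry_bounded() {
    local max_attempts="$1" delay="$2"
    shift 2
    local attempt=1
    while [ "$attempt" -le "$max_attempts" ]; do
        if "$@"; then
            return 0
        fi
        # Do not sleep after the final failed attempt.
        if [ "$attempt" -lt "$max_attempts" ]; then
            sleep "$delay"
        fi
        attempt=$((attempt + 1))
    done
    return 1
}

# Example: give up after 30 attempts, 10 seconds apart, instead of looping forever.
#   retry_bounded 30 10 /usr/local/bin/acm-ice load "$IMAGE_REPO"
```

`IMAGE_REPO` in the example is a placeholder for the repository argument seen in the unit status.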

Additional info:

Comment 2 Constantin Vultur 2022-06-14 15:31:45 UTC
Tested with the new version of acm-ice; the build completed successfully:

# oc get all
NAME                         READY   STATUS      RESTARTS   AGE
pod/acm-ice-4-8-20-1-build   0/1     Completed   0          148m

NAME                                            TYPE     FROM         LATEST
buildconfig.build.openshift.io/acm-ice-4-8-20   Docker   Dockerfile   1

NAME                                        TYPE     FROM         STATUS     STARTED       DURATION
build.build.openshift.io/acm-ice-4-8-20-1   Docker   Dockerfile   Complete   2 hours ago   2m43s


I then started the spoke cluster upgrade from 4.8.20 to 4.8.24 (kernel 4.18.0-305.25 to 4.18.0-305.28).

The upgrade progressed to 77%, then the cluster became unreachable.

I checked the spoke system, and journalctl -xef showed:

Jun 14 15:23:52 sno1-0-0 bash[2265]: Trying to pull registry.ocp-edge-cluster-assisted-0.qe.lab.redhat.com:5000/sro-1306/acm-ice-driver-container:4.18.0-305.25.1.el8_4.x86_64...
Jun 14 15:23:52 sno1-0-0 bash[2265]: Getting image source signatures
Jun 14 15:23:52 sno1-0-0 bash[2265]: Copying blob sha256:34c2415aebfcf7c0bc4e8fe2063061c614f7819f68b786c19332d009732fafe1
Jun 14 15:23:52 sno1-0-0 bash[2265]: Copying blob sha256:dddc255e8c1694957778335dc22356798286868501a76e53e5ac328ed9d0e0c8
Jun 14 15:23:52 sno1-0-0 bash[2265]: Copying blob sha256:87b7bd227a863470eb564222dff5ab56d5d86dd8446103505f646afb5fc2c827
Jun 14 15:23:52 sno1-0-0 bash[2265]: Copying blob sha256:aba7e1b5cddd91442924d86159bbc012f115d8f2bedc8e8c1eed835c09a8da14
Jun 14 15:23:52 sno1-0-0 bash[2265]: Copying blob sha256:4752687a61a97d6f352ae62c381c87564bcb2f5b6523a05510ca1fb60d640216
Jun 14 15:23:52 sno1-0-0 bash[2265]: Copying blob sha256:0344366a246a0f7590c2bae4536c01f15f20c6d802b4654ce96ac81047bc23f3
Jun 14 15:23:52 sno1-0-0 bash[2265]: Copying config sha256:48dcd048d16dd7e389afe01483265d38cb27dcd99a0af233f50f0ca5143a416a
Jun 14 15:23:52 sno1-0-0 bash[2265]: Writing manifest to image destination
Jun 14 15:23:52 sno1-0-0 bash[2265]: Storing signatures
Jun 14 15:23:52 sno1-0-0 bash[2265]: 48dcd048d16dd7e389afe01483265d38cb27dcd99a0af233f50f0ca5143a416a
Jun 14 15:23:52 sno1-0-0 systemd[6319]: var-lib-containers-storage-overlay.mount: Succeeded.
-- Subject: Unit succeeded
-- Defined-By: systemd
-- Support: https://access.redhat.com/support
-- 
-- The unit UNIT has successfully entered the 'dead' state.
Jun 14 15:23:52 sno1-0-0 systemd[1]: var-lib-containers-storage-overlay.mount: Succeeded.
-- Subject: Unit succeeded
-- Defined-By: systemd
-- Support: https://access.redhat.com/support
-- 
-- The unit var-lib-containers-storage-overlay.mount has successfully entered the 'dead' state.
Jun 14 15:23:52 sno1-0-0 bash[2265]: Error: statfs /lib/modules/4.18.0-305.25.1.el8_4.x86_64/kernel/drivers: no such file or directory


On the spoke node, this is the content of /lib/modules:

$ ll /lib/modules/
total 4
drwxr-xr-x. 7 root root 4096 Jan  1  1970 4.18.0-305.28.1.el8_4.x86_64
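
The journal shows the loader pulling the image tagged for 4.18.0-305.25 and then failing on a modules path for that kernel, while the node only has a directory for 4.18.0-305.28. A check like the following sketch makes that mismatch explicit. The function name is hypothetical and the snippet is for illustration only; it tests the same `kernel/drivers` path that appears in the statfs error.

```shell
#!/usr/bin/env bash
# Hypothetical helper for illustration only.

# Return 0 if <modules_root>/<kernel>/kernel/drivers exists, the path the
# failing statfs call in the journal refers to.
has_driver_dir() {
    local modules_root="$1" kernel="$2"
    [ -d "${modules_root}/${kernel}/kernel/drivers" ]
}

# Example: compare the kernel an image tag was built for with the node.
#   has_driver_dir /lib/modules 4.18.0-305.25.1.el8_4.x86_64 \
#     || echo "no modules directory for that kernel; running kernel is $(uname -r)"
```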

Comment 4 Constantin Vultur 2022-06-20 08:06:39 UTC
Verified with the new example; the upgrade now works as expected.

Noting here the requirement that the clusterclaim must be created before the test is started. Also, existing or statically created clusterclaims could affect the outcome of the ice driver installation.

Comment 6 errata-xmlrpc 2022-08-10 11:16:16 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Important: OpenShift Container Platform 4.11.0 bug fix and security update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2022:5069