Description of problem:
I have a customer who recently hit an issue during an upgrade to OCP 4.6.17, where one of the worker nodes failed validation while checking the expected machine-os-content for the desired rendered config of the worker node. The rpm-ostree status (as captured in the machine-config-daemon logs) shows:

~~~
2021-03-30T23:29:17.278448306Z I0330 23:29:17.278374 3453 daemon.go:862] State: idle
2021-03-30T23:29:17.278448306Z Warning: failed to finalize previous deployment
2021-03-30T23:29:17.278448306Z ostree-finalize-staged.service: Failed with result 'core-dump'.
2021-03-30T23:29:17.278448306Z check `journalctl -b -1 -u ostree-finalize-staged.service`
2021-03-30T23:29:17.278448306Z Deployments:
2021-03-30T23:29:17.278448306Z * pivot://quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:1943d14b0a05126aa758a72fd996f08389aaa2140005d610715a0f3b11eeea57
2021-03-30T23:29:17.278448306Z CustomOrigin: Managed by machine-config-operator
2021-03-30T23:29:17.278448306Z Version: 45.82.202011131531-0 (2020-11-13T15:35:05Z)
2021-03-30T23:29:17.278448306Z
2021-03-30T23:29:17.278448306Z ostree://c95ed1eb5c045492d7293a2a1b7178a050f857944fc46ab3377fca3afd5b7b31
2021-03-30T23:29:17.278448306Z Version: 44.81.202004250133-0 (2020-04-25T01:38:52Z)
2021-03-30T23:29:17.278533903Z I0330 23:29:17.278519 3453 rpm-ostree.go:261] Running captured: journalctl --list-boots
~~~

Checking the service logs, I see the following:

~~~
-- Logs begin at Tue 2021-03-30 07:35:34 UTC, end at Wed 2021-03-31 11:52:15 UTC. --
Mar 31 00:26:28 gocpp2-9gbr8-worker-1-j6hr2 systemd[1]: Started OSTree Finalize Staged Deployment.
Mar 31 00:26:45 gocpp2-9gbr8-worker-1-j6hr2 systemd[1]: Stopping OSTree Finalize Staged Deployment...
Mar 31 00:26:45 gocpp2-9gbr8-worker-1-j6hr2 ostree[17337]: Finalizing staged deployment
Mar 31 00:26:48 gocpp2-9gbr8-worker-1-j6hr2 ostree[17337]: Copying /etc changes: 29 modified, 1 removed, 214 added
Mar 31 00:26:48 gocpp2-9gbr8-worker-1-j6hr2 ostree[17337]: Copying /etc changes: 29 modified, 1 removed, 214 added
Mar 31 00:26:49 gocpp2-9gbr8-worker-1-j6hr2 ostree[17337]: **
Mar 31 00:26:49 gocpp2-9gbr8-worker-1-j6hr2 ostree[17337]: OSTree:ERROR:src/libostree/ostree-sysroot-deploy.c:1682:install_deployment_kernel: assertion failed (kernel_layout->bootcsum == b>
Mar 31 00:26:49 gocpp2-9gbr8-worker-1-j6hr2 systemd[1]: ostree-finalize-staged.service: Control process exited, code=dumped status=6
Mar 31 00:26:49 gocpp2-9gbr8-worker-1-j6hr2 systemd[1]: ostree-finalize-staged.service: Failed with result 'core-dump'.
Mar 31 00:26:49 gocpp2-9gbr8-worker-1-j6hr2 systemd[1]: Stopped OSTree Finalize Staged Deployment.
Mar 31 00:26:49 gocpp2-9gbr8-worker-1-j6hr2 systemd[1]: ostree-finalize-staged.service: Consumed 1.350s CPU time
lines 1-12/12 (END)
~~~

Attached in the Google Drive:
1. sos report from the node
2. latest must-gather from the cluster
3. tar of the /var/log/journal/ directory from the problematic node

Version-Release number of selected component (if applicable):

How reproducible:

Steps to Reproduce:
1.
2.
3.

Actual results:

Expected results:

Additional info:
Unfortunately, the journal files that are in the giant tarball don't have any detailed stack trace from `ostree-finalize-staged.service` dumping core. In fact, it looks like the `systemd-coredump` service wasn't able to start successfully at all, which would explain why there is no core file to work with. (The error from `systemd-coredump.socket` appears to be a generic catch-all: https://github.com/systemd/systemd/blob/87f9300d71e30ab04dee60cf78b8cdf09506c65b/src/core/service.h#L70, which seems similar to the socket version of the error at https://github.com/systemd/systemd/blob/main/src/core/socket.h#L34.)

FWIW, the first evidence of `ostree-finalize-staged.service` dumping core shows up in the journal file `system`:

```
$ journalctl --file system | grep -B 50 -A 10 -m 1 ostree-sysroot-deploy
Mar 30 19:28:30 gocpp2-9gbr8-worker-1-j6hr2 systemd[1]: Closed D-Bus System Message Bus Socket.
Mar 30 19:28:30 gocpp2-9gbr8-worker-1-j6hr2 systemd[1]: dbus.socket: Consumed 0 CPU time
Mar 30 19:28:30 gocpp2-9gbr8-worker-1-j6hr2 ostree[349103]: Finalizing staged deployment
Mar 30 19:28:30 gocpp2-9gbr8-worker-1-j6hr2 systemd[1]: Stopped crio-conmon-2f2828c75715e1eb6b7b71aecaead22ebc1ec499c25974ba920c9e406d0c8312.scope.
Mar 30 19:28:30 gocpp2-9gbr8-worker-1-j6hr2 systemd[1]: crio-conmon-2f2828c75715e1eb6b7b71aecaead22ebc1ec499c25974ba920c9e406d0c8312.scope: Consumed 87ms CPU time
Mar 30 19:28:30 gocpp2-9gbr8-worker-1-j6hr2 systemd[1]: Stopped crio-conmon-31e0bfe4f7fba89c258ad00f90bbf738f828d968dd520f7eb28643efb1cc839e.scope.
Mar 30 19:28:30 gocpp2-9gbr8-worker-1-j6hr2 systemd[1]: crio-conmon-31e0bfe4f7fba89c258ad00f90bbf738f828d968dd520f7eb28643efb1cc839e.scope: Consumed 87ms CPU time
Mar 30 19:28:30 gocpp2-9gbr8-worker-1-j6hr2 systemd[1]: Stopped libcontainer container 2f2828c75715e1eb6b7b71aecaead22ebc1ec499c25974ba920c9e406d0c8312.
Mar 30 19:28:30 gocpp2-9gbr8-worker-1-j6hr2 systemd[1]: crio-2f2828c75715e1eb6b7b71aecaead22ebc1ec499c25974ba920c9e406d0c8312.scope: Consumed 1min 46.381s CPU time
Mar 30 19:28:30 gocpp2-9gbr8-worker-1-j6hr2 systemd[1]: Stopped libcontainer container 31e0bfe4f7fba89c258ad00f90bbf738f828d968dd520f7eb28643efb1cc839e.
Mar 30 19:28:30 gocpp2-9gbr8-worker-1-j6hr2 systemd[1]: crio-31e0bfe4f7fba89c258ad00f90bbf738f828d968dd520f7eb28643efb1cc839e.scope: Consumed 1.075s CPU time
Mar 30 19:28:30 gocpp2-9gbr8-worker-1-j6hr2 systemd[1]: Stopped libcontainer container dec1c97af1328289e7d8b8221decf147dbea9b5a5680d3f008ee85466de2072d.
Mar 30 19:28:30 gocpp2-9gbr8-worker-1-j6hr2 systemd[1]: crio-dec1c97af1328289e7d8b8221decf147dbea9b5a5680d3f008ee85466de2072d.scope: Consumed 781ms CPU time
Mar 30 19:28:30 gocpp2-9gbr8-worker-1-j6hr2 systemd[1]: Stopped crio-conmon-dec1c97af1328289e7d8b8221decf147dbea9b5a5680d3f008ee85466de2072d.scope.
Mar 30 19:28:30 gocpp2-9gbr8-worker-1-j6hr2 systemd[1]: crio-conmon-dec1c97af1328289e7d8b8221decf147dbea9b5a5680d3f008ee85466de2072d.scope: Consumed 88ms CPU time
Mar 30 19:28:30 gocpp2-9gbr8-worker-1-j6hr2 systemd[1]: Removed slice libcontainer container kubepods-burstable-pod7a9564f6_dc05_4bf2_8906_ae2a4d790ef2.slice.
Mar 30 19:28:30 gocpp2-9gbr8-worker-1-j6hr2 systemd[1]: kubepods-burstable-pod7a9564f6_dc05_4bf2_8906_ae2a4d790ef2.slice: Consumed 1min 49.504s CPU time
Mar 30 19:28:30 gocpp2-9gbr8-worker-1-j6hr2 systemd[1]: Stopped crio-conmon-c30073d22f67af35e664160b772c11dfb4bc605d4b3b5733433784e6e2c2f81c.scope.
Mar 30 19:28:30 gocpp2-9gbr8-worker-1-j6hr2 systemd[1]: crio-conmon-c30073d22f67af35e664160b772c11dfb4bc605d4b3b5733433784e6e2c2f81c.scope: Consumed 394ms CPU time
Mar 30 19:28:30 gocpp2-9gbr8-worker-1-j6hr2 systemd[1]: Stopped crio-conmon-6ab91aa076b1d331f8fb3e0b85bce1f06055ea4f0d6ab0de2633b9d12499b907.scope.
Mar 30 19:28:30 gocpp2-9gbr8-worker-1-j6hr2 systemd[1]: crio-conmon-6ab91aa076b1d331f8fb3e0b85bce1f06055ea4f0d6ab0de2633b9d12499b907.scope: Consumed 86ms CPU time
Mar 30 19:28:30 gocpp2-9gbr8-worker-1-j6hr2 systemd[1]: Stopped libcontainer container c30073d22f67af35e664160b772c11dfb4bc605d4b3b5733433784e6e2c2f81c.
Mar 30 19:28:30 gocpp2-9gbr8-worker-1-j6hr2 systemd[1]: crio-c30073d22f67af35e664160b772c11dfb4bc605d4b3b5733433784e6e2c2f81c.scope: Consumed 9min 16.723s CPU time
Mar 30 19:28:30 gocpp2-9gbr8-worker-1-j6hr2 systemd[1]: Removed slice libcontainer container kubepods-burstable-podb24dfefc_a3b1_4809_aeb5_19689114dda1.slice.
Mar 30 19:28:30 gocpp2-9gbr8-worker-1-j6hr2 systemd[1]: kubepods-burstable-podb24dfefc_a3b1_4809_aeb5_19689114dda1.slice: Consumed 9min 18.498s CPU time
Mar 30 19:28:30 gocpp2-9gbr8-worker-1-j6hr2 systemd[1]: Stopped libcontainer container 798da44176ac0e642c2253401eb037d3c399a80700b3f1de11f2287dddcad4c9.
Mar 30 19:28:30 gocpp2-9gbr8-worker-1-j6hr2 systemd[1]: crio-798da44176ac0e642c2253401eb037d3c399a80700b3f1de11f2287dddcad4c9.scope: Consumed 1.758s CPU time
Mar 30 19:28:30 gocpp2-9gbr8-worker-1-j6hr2 systemd[1]: Stopped libcontainer container 6ab91aa076b1d331f8fb3e0b85bce1f06055ea4f0d6ab0de2633b9d12499b907.
Mar 30 19:28:30 gocpp2-9gbr8-worker-1-j6hr2 systemd[1]: crio-6ab91aa076b1d331f8fb3e0b85bce1f06055ea4f0d6ab0de2633b9d12499b907.scope: Consumed 1.617s CPU time
Mar 30 19:28:30 gocpp2-9gbr8-worker-1-j6hr2 systemd[1]: Removed slice libcontainer container kubepods-burstable-pod64072f19_880b_446d_8983_57391862c2a5.slice.
Mar 30 19:28:30 gocpp2-9gbr8-worker-1-j6hr2 systemd[1]: kubepods-burstable-pod64072f19_880b_446d_8983_57391862c2a5.slice: Consumed 18.589s CPU time
Mar 30 19:28:30 gocpp2-9gbr8-worker-1-j6hr2 systemd[1]: Stopped crio-conmon-798da44176ac0e642c2253401eb037d3c399a80700b3f1de11f2287dddcad4c9.scope.
Mar 30 19:28:30 gocpp2-9gbr8-worker-1-j6hr2 systemd[1]: crio-conmon-798da44176ac0e642c2253401eb037d3c399a80700b3f1de11f2287dddcad4c9.scope: Consumed 87ms CPU time
Mar 30 19:28:30 gocpp2-9gbr8-worker-1-j6hr2 systemd[1]: Removed slice libcontainer container kubepods-burstable-pod427eaf5b_51db_4731_9b82_6ab23e2380a7.slice.
Mar 30 19:28:30 gocpp2-9gbr8-worker-1-j6hr2 systemd[1]: kubepods-burstable-pod427eaf5b_51db_4731_9b82_6ab23e2380a7.slice: Consumed 1min 40.405s CPU time
Mar 30 19:28:31 gocpp2-9gbr8-worker-1-j6hr2 systemd[1]: Stopped crio-conmon-a3d5eb26f97c6fda5241539471f2eecdd04a6254cca4e0340e88514a17966523.scope.
Mar 30 19:28:31 gocpp2-9gbr8-worker-1-j6hr2 systemd[1]: crio-conmon-a3d5eb26f97c6fda5241539471f2eecdd04a6254cca4e0340e88514a17966523.scope: Consumed 82ms CPU time
Mar 30 19:28:31 gocpp2-9gbr8-worker-1-j6hr2 systemd[1]: Stopped libcontainer container a3d5eb26f97c6fda5241539471f2eecdd04a6254cca4e0340e88514a17966523.
Mar 30 19:28:31 gocpp2-9gbr8-worker-1-j6hr2 systemd[1]: crio-a3d5eb26f97c6fda5241539471f2eecdd04a6254cca4e0340e88514a17966523.scope: Consumed 263ms CPU time
Mar 30 19:28:31 gocpp2-9gbr8-worker-1-j6hr2 systemd[1]: Removed slice libcontainer container kubepods-burstable-podc9532e71d69c40d6430fbf1af181f14b.slice.
Mar 30 19:28:31 gocpp2-9gbr8-worker-1-j6hr2 systemd[1]: kubepods-burstable-podc9532e71d69c40d6430fbf1af181f14b.slice: Consumed 908ms CPU time
Mar 30 19:28:32 gocpp2-9gbr8-worker-1-j6hr2 ostree[349103]: Copying /etc changes: 27 modified, 1 removed, 211 added
Mar 30 19:28:32 gocpp2-9gbr8-worker-1-j6hr2 ostree[349103]: Copying /etc changes: 27 modified, 1 removed, 211 added
Mar 30 19:28:32 gocpp2-9gbr8-worker-1-j6hr2 systemd[1]: Stopped libcontainer container 860195f44a8617b2ecfccfc033656b9c3f4ae1150934dfcad6d81d4876c685d6.
Mar 30 19:28:32 gocpp2-9gbr8-worker-1-j6hr2 systemd[1]: crio-860195f44a8617b2ecfccfc033656b9c3f4ae1150934dfcad6d81d4876c685d6.scope: Consumed 13h 55min 25.411s CPU time
Mar 30 19:28:32 gocpp2-9gbr8-worker-1-j6hr2 systemd[1]: Stopped crio-conmon-860195f44a8617b2ecfccfc033656b9c3f4ae1150934dfcad6d81d4876c685d6.scope.
Mar 30 19:28:32 gocpp2-9gbr8-worker-1-j6hr2 systemd[1]: crio-conmon-860195f44a8617b2ecfccfc033656b9c3f4ae1150934dfcad6d81d4876c685d6.scope: Consumed 597ms CPU time
Mar 30 19:28:32 gocpp2-9gbr8-worker-1-j6hr2 systemd[1]: Removed slice libcontainer container kubepods-burstable-pod1111092f_8f6c_4d90_b10e_dcc631ebf2c6.slice.
Mar 30 19:28:32 gocpp2-9gbr8-worker-1-j6hr2 systemd[1]: kubepods-burstable-pod1111092f_8f6c_4d90_b10e_dcc631ebf2c6.slice: Consumed 13h 55min 26.647s CPU time
Mar 30 19:28:33 gocpp2-9gbr8-worker-1-j6hr2 ostree[349103]: **
Mar 30 19:28:33 gocpp2-9gbr8-worker-1-j6hr2 ostree[349103]: OSTree:ERROR:src/libostree/ostree-sysroot-deploy.c:1682:install_deployment_kernel: assertion failed (kernel_layout->bootcsum == bootcsum): ("f6972c6abcd2c2ccd8d9b850cceb48022fe9413ed6578bc8cb2c2a61469cb2a4" == "b9b2eb4a84cad86d10b334f55197d623f099053a44d7ce797a13f998bc7465db")
Mar 30 19:28:33 gocpp2-9gbr8-worker-1-j6hr2 systemd[1]: Requested transaction contradicts existing jobs: Transaction is destructive.
Mar 30 19:28:33 gocpp2-9gbr8-worker-1-j6hr2 systemd[1]: systemd-coredump.socket: Failed to queue service startup job (Maybe the service file is missing or not a non-template unit?): Transaction is destructive.
Mar 30 19:28:33 gocpp2-9gbr8-worker-1-j6hr2 systemd[1]: systemd-coredump.socket: Failed with result 'resources'.
Mar 30 19:28:33 gocpp2-9gbr8-worker-1-j6hr2 systemd[1]: systemd-coredump.socket: Consumed 0 CPU time
Mar 30 19:28:33 gocpp2-9gbr8-worker-1-j6hr2 systemd[1]: ostree-finalize-staged.service: Control process exited, code=dumped status=6
Mar 30 19:28:33 gocpp2-9gbr8-worker-1-j6hr2 systemd[1]: ostree-finalize-staged.service: Failed with result 'core-dump'.
Mar 30 19:28:33 gocpp2-9gbr8-worker-1-j6hr2 systemd[1]: Stopped OSTree Finalize Staged Deployment.
Mar 30 19:28:33 gocpp2-9gbr8-worker-1-j6hr2 systemd[1]: ostree-finalize-staged.service: Consumed 1.325s CPU time
Mar 30 19:28:33 gocpp2-9gbr8-worker-1-j6hr2 systemd[1]: Stopped OSTree Monitor Staged Deployment.
Mar 30 19:28:33 gocpp2-9gbr8-worker-1-j6hr2 systemd[1]: Stopped target System Initialization.
```

@Colin could you have a look at this and see if there is something I missed?
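As an aside (hypothetical here, since `systemd-coredump` never managed to start on this node): on a node where the coredump machinery is healthy, the backtrace for a crash like this would normally be retrievable along these lines:

```
# List captured core dumps, then show metadata and (if available) a backtrace
# for the crashed ostree process.
$ coredumpctl list
$ coredumpctl info ostree
```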
I think we'd also need the output of `ls -al /boot/loader /ostree/boot.*/*`, and `rpm-ostree status` would be helpful too.

Note this assertion was removed in https://github.com/ostreedev/ostree/commit/aa2a2783ea71d956db41f0f874d596752e47449f, which is in 2020.7, which is shipped in OpenShift 4.7.

One clear problem from this trace is that kubelet is still running when we go to shut down, which it shouldn't be. I believe that's also been causing other problems; as of 4.8 I think we're now correctly blocking shutdown until kubelet is done.
A quick look at the journal here indeed shows that kubelet is stopping simultaneously with ostree-finalize-staged starting. That's covered by https://github.com/openshift/machine-config-operator/pull/2414, which (following the cherry-pick chain) landed in https://openshift-release.apps.ci.l2s4.p1.openshiftapps.com/releasestream/4-stable/release/4.6.29. But note it only helps for clusters upgrading *from* 4.6.29.

I think the root cause here may be related to:

> Mar 31 09:51:27 gocpp2-9gbr8-worker-1-j6hr2 root[23267]: machine-config-daemon[3436]: Disk currentConfig rendered-worker-9af53086b18e5c57f05dc57e913ae4b8 overrides node's currentConfig annotation rendered-worker-5f7e84fd8cda6fd543758dd44c5063db

which might explain this, because we end up racing to synchronize `/etc` with the finalization process. This still needs more investigation though.

# Workarounds

Depending on the deployment model, one simple option is to reprovision the affected nodes. Barring that, one thing we can try is effectively "cherry picking" the change from https://github.com/openshift/machine-config-operator/pull/2414 by adding an override via `oc debug node` or ssh.
To elaborate on the above proposed workaround, via `oc debug node` or ssh (as root on the node):

```
$ mkdir -p /etc/systemd/system/kubelet.service.d
$ echo -e '[Unit]\nAfter=ostree-finalize-staged.service' > /etc/systemd/system/kubelet.service.d/backport-mco-2414.conf
$ systemctl daemon-reload
```
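As a sanity check (my own suggestion, not part of the original workaround), the override can be confirmed after the `daemon-reload`:

```
# The drop-in should be listed under "Drop-In:" in the unit status, and the
# merged unit text should contain the new [Unit] ordering. Because systemd
# reverses ordering at shutdown, After=ostree-finalize-staged.service means
# kubelet is stopped before the staged deployment is finalized.
$ systemctl status kubelet.service --no-pager | head -n 6
$ systemctl cat kubelet.service
```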
Hello Team,

In one of our cases, the customer is hitting a similar issue after a cluster upgrade from 4.6.32 to 4.6.42: all the operators upgraded and are available, but multiple worker nodes are marked SchedulingDisabled and are not getting upgraded.

```
$ omg logs machine-config-daemon-x67jp -c machine-config-daemon
2021-09-17T22:59:48.983294844Z E0917 22:59:48.983282 3904 writer.go:135] Marking Degraded due to: unexpected on-disk state validating against rendered-unset-c96de46d04e9074a7ad22d26faa069b1: expected target osImageURL "quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:9f9926665f165d3fbc5f0e089b0bbff0a77440a7aac41da32a69db6a4b21b2cc", have "quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:5a747cb0239e38057336ca8dc4fb528d9cbc93a333409d02151a7ade40aaa4a1"
```

```
# rpm-ostree status
State: idle
Warning: failed to finalize previous deployment
         ostree-finalize-staged.service: Failed with result 'timeout'.
         check `journalctl -b -1 -u ostree-finalize-staged.service`
Deployments:
● pivot://quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:5a747cb0239e38057336ca8dc4fb528d9cbc93a333409d02151a7ade40aaa4a1
              CustomOrigin: Managed by machine-config-operator
                   Version: 46.82.202105291300-0 (2021-05-29T13:03:27Z)

  pivot://quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:5a747cb0239e38057336ca8dc4fb528d9cbc93a333409d02151a7ade40aaa4a1
              CustomOrigin: Managed by machine-config-operator
                   Version: 46.82.202105291300-0 (2021-05-29T13:03:27Z)
```

```
# journalctl -b -1 -u ostree-finalize-staged.service
-- Logs begin at Thu 2021-09-16 04:49:30 UTC, end at Sat 2021-09-18 01:22:38 UTC. --
Sep 17 21:52:57 e1n4.fbond systemd[1]: Started OSTree Finalize Staged Deployment.
Sep 17 21:53:21 e1n4.fbond systemd[1]: Stopping OSTree Finalize Staged Deployment...
Sep 17 21:53:21 e1n4.fbond ostree[399903]: Finalizing staged deployment
Sep 17 21:53:22 e1n4.fbond ostree[399903]: Copying /etc changes: 21 modified, 0 removed, 210 added
Sep 17 21:53:22 e1n4.fbond ostree[399903]: Copying /etc changes: 21 modified, 0 removed, 210 added
Sep 17 21:58:21 e1n4.fbond systemd[1]: ostree-finalize-staged.service: Stopping timed out. Terminating.
Sep 17 22:03:21 e1n4.fbond systemd[1]: ostree-finalize-staged.service: State 'stop-sigterm' timed out. Killing.
Sep 17 22:03:21 e1n4.fbond systemd[1]: ostree-finalize-staged.service: Killing process 399903 (ostree) with signal SIGKILL.
Sep 17 22:08:21 e1n4.fbond systemd[1]: ostree-finalize-staged.service: Processes still around after SIGKILL. Ignoring.
Sep 17 22:13:21 e1n4.fbond systemd[1]: ostree-finalize-staged.service: State 'stop-final-sigterm' timed out. Killing.
Sep 17 22:13:21 e1n4.fbond systemd[1]: ostree-finalize-staged.service: Killing process 399903 (ostree) with signal SIGKILL.
Sep 17 22:18:22 e1n4.fbond systemd[1]: ostree-finalize-staged.service: Processes still around after final SIGKILL. Entering failed mode.
Sep 17 22:18:22 e1n4.fbond systemd[1]: ostree-finalize-staged.service: Failed with result 'timeout'.
Sep 17 22:18:22 e1n4.fbond systemd[1]: Stopped OSTree Finalize Staged Deployment.
Sep 17 22:18:22 e1n4.fbond systemd[1]: ostree-finalize-staged.service: Consumed 1.545s CPU time
```

```
# ls -al /boot/loader /ostree/boot.*/*
lrwxrwxrwx. 1 root root 8 Sep 16 02:42 /boot/loader -> loader.1

/ostree/boot.0.1/rhcos:
total 0
drwxr-xr-x. 4 root root 150 Sep 17 21:53 .
drwxr-xr-x. 3 root root 19 Sep 17 21:53 ..
drwxr-xr-x. 2 root root 15 Sep 17 21:53 5d176d144f2f638bac6c488224cba497681646261b0eabeda8965fe7f3a87bcf
drwxr-xr-x. 2 root root 15 Sep 17 21:53 85b9ff4cf967e9434b1be633ed7355ec79427b853b0d22260c941d521855792f

/ostree/boot.0/rhcos:
total 0
drwxr-xr-x. 4 root root 150 Sep 17 21:53 .
drwxr-xr-x. 3 root root 19 Sep 17 21:53 ..
drwxr-xr-x. 2 root root 15 Sep 17 21:53 5d176d144f2f638bac6c488224cba497681646261b0eabeda8965fe7f3a87bcf
drwxr-xr-x. 2 root root 15 Sep 17 21:53 85b9ff4cf967e9434b1be633ed7355ec79427b853b0d22260c941d521855792f

/ostree/boot.1.0/rhcos:
total 0
drwxr-xr-x. 3 root root 78 Sep 16 12:59 .
drwxr-xr-x. 3 root root 19 Sep 16 12:59 ..
drwxr-xr-x. 2 root root 24 Sep 16 12:59 5d176d144f2f638bac6c488224cba497681646261b0eabeda8965fe7f3a87bcf

/ostree/boot.1/rhcos:
total 0
drwxr-xr-x. 3 root root 78 Sep 16 12:59 .
drwxr-xr-x. 3 root root 19 Sep 16 12:59 ..
drwxr-xr-x. 2 root root 24 Sep 16 12:59 5d176d144f2f638bac6c488224cba497681646261b0eabeda8965fe7f3a87bcf
```

I will be attaching the sosreport. As of now the workaround for this is to follow the KCS below, but the customer wants to know why this issue is coming up, as it is not feasible to fix each and every node with the workaround.

--> https://access.redhat.com/solutions/5598401

Regards,
Ayush Garg
Specifically, the place to put sockets is `/run`.
I'm not seeing such a socket in a fresh 4.8.0-0.nightly-2021-09-18-202713 cluster. I'm guessing this is from some 3rd party daemonset or automation?
Hi, I went deep into the cluster and was able to find the cause. In the ARO case the issue seems to be caused by an installed component: specifically, the CSI driver https://github.com/Azure/secrets-store-csi-driver-provider-azure, which causes azure.sock to block the ostree finalization. More investigation has to be done into why this happens. Most importantly, that operator is the most likely cause of the ARO issue, not of the other reports here.
Great debugging Petr! Looking at the repo I see https://github.com/Azure/secrets-store-csi-driver-provider-azure/blob/9a7b4826f3e41847b8c9a4603891af1e8a893d42/manifest_staging/charts/csi-secrets-store-provider-azure/templates/provider-azure-installer.yaml#L77-L96, which looks to be setting up the socket in /etc/kubernetes/secrets-store-csi-providers by default. Given this is an issue with how secrets-store-csi-driver-provider-azure is set up, would it be fair to close this RHCOS bug and have your team work with the Azure/secrets-store-csi-driver-provider-azure team on a workaround or fix?
As for ARO, I believe the CSI driver is to blame. However, I do not believe this bug can be closed, as the original report did not come from ARO.
Are you able to confirm the fix and/or workaround worked for you? If not, could you provide updated log output for folks to review?
I think we could make a change to ostree to ignore sockets: https://github.com/ostreedev/ostree/issues/2446

The issue here is somewhat ostree-specific in that, to my knowledge, few other operating system update systems version `/etc` by default. The versioning of `/etc`, though, is quite important to how we do transactional updates.

Did anyone file an issue against https://github.com/Azure/secrets-store-csi-driver-provider-azure yet?
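As a side note (my own diagnostic suggestion, not from the original thread), a quick way to check whether a node has any non-regular files under `/etc` that the `/etc` merge could stumble over is something like:

```
# List sockets (-type s) and FIFOs (-type p) under /etc; any hits are
# candidates for moving to /run, or for the skip behavior discussed in the
# ostree issue linked above.
$ find /etc -type s -o -type p
```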
OK: https://github.com/ostreedev/ostree/pull/2453

However, I still strongly recommend moving sockets into `/run`. The basic rationale is that a socket is inherently something bound to a *process*, and processes go away on reboot, so having the socket persist just causes problems. Even more generally, the best approach now is to use systemd to listen on the socket and use socket activation (http://0pointer.de/blog/projects/socket-activation.html), which binds the lifecycle of the socket and process in a rich, sophisticated way.
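For illustration only (the unit and binary names below are hypothetical, not from this bug), a minimal socket-activated pair that keeps its socket under `/run` might look like:

```
# /etc/systemd/system/example-provider.socket
[Unit]
Description=Example provider socket (hypothetical)

[Socket]
# The socket lives in /run, so it vanishes on reboot and is never seen by
# the OSTree /etc merge during finalization.
ListenStream=/run/example-provider.sock

[Install]
WantedBy=sockets.target

# /etc/systemd/system/example-provider.service
[Unit]
Description=Example provider (hypothetical, socket-activated)

[Service]
# The daemon inherits the listening socket from systemd (sd_listen_fds())
# instead of creating its own socket under /etc. No [Install] section is
# needed; the socket unit activates it on the first connection.
ExecStart=/usr/local/bin/example-provider
```

With that split, `systemctl enable --now example-provider.socket` is enough: systemd only starts the service when something connects, and nothing socket-related persists across reboots.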
`error: During /etc merge: Reading modified config file: fstatat(kubernetes/aide.reinit): No such file or directory`

At this point, this bugzilla is a mix of multiple distinct issues that can cause the finalization process to fail. *That* failure is a race where kubelet.service is still running when we go to reboot, which is often caused by drain failures. We added a workaround for this in https://github.com/openshift/machine-config-operator/commit/0369c47835fe9e2a36c3c72b27f1126d8391579a, but it won't help all situations. Graceful node shutdown (https://kubernetes.io/blog/2021/04/21/graceful-node-shutdown-beta/) is the comprehensive fix to this, as it will prevent the reboot process from starting at all until kubelet is ready for it.
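For reference only: the upstream feature is driven by two kubelet configuration fields. The fragment below is a sketch of the upstream KubeletConfiguration (values are purely illustrative; how this gets wired into a given OCP release may differ and is not prescribed by this bug):

```
# Upstream KubeletConfiguration fragment (illustrative values)
apiVersion: kubelet.config.k8s.io/v1beta1
kind: KubeletConfiguration
featureGates:
  GracefulNodeShutdown: true           # beta as of Kubernetes 1.21
shutdownGracePeriod: "5m"              # total time the node delays shutdown
shutdownGracePeriodCriticalPods: "1m"  # portion reserved for critical pods
```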
Thanks - hopefully that lands in 4.10!

Any thoughts on the failure to pivot manually afterwards for the master?

```
error: with stdout output: : error running rpm-ostree rebase --experimental /run/mco-machine-os-content/os-content-884750218/srv/repo:17cd2523079d57070db3494b8a3a47649bd33a82806da0287a16743ed7e2f9de --custom-origin-url pivot://quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:5483701cda95d5ac2fa1ef22c72e15f4bd4104204e16c499bc1d5ab7debba054 --custom-origin-description Managed by machine-config-operator: exit status 1
error: No enabled repositories
```
This is a straightforward and low-risk backport; sorry, I should have initiated that process at the same time as fixing it upstream. Started the paperwork in https://bugzilla.redhat.com/show_bug.cgi?id=2027788

(But it's also still a good idea to fix upstream projects to not put sockets in /etc.)
The build from https://bugzilla.redhat.com/show_bug.cgi?id=2029577 has been cherry picked to current 4.10 nightlies:

```
$ oc image info (oc adm release info --image-for=machine-os-content registry.ci.openshift.org/ocp/release:4.10.0-0.nightly-2021-12-12-232810) | grep ostree
...
com.coreos.rpm.ostree=2020.7-6.el8_4.x86_64
$
```

I'll be working on fast-tracking it to 4.9 and 4.8 shortly.
The updated ostree is in 4.9 and 4.8 nightlies:

```
$ oc image info (oc adm release info --image-for=machine-os-content registry.ci.openshift.org/ocp/release:4.9.0-0.nightly-2021-12-15-103522) | grep rpm.ostree
com.coreos.rpm.ostree=2020.7-6.el8_4.x86_64
...
$ oc image info (oc adm release info --image-for=machine-os-content registry.ci.openshift.org/ocp/release:4.8.0-0.nightly-2021-12-15-153530) | grep rpm.ostree
com.coreos.rpm.ostree=2020.7-6.el8_4.x86_64
...
$
```

Promotion jobs for 4.9 are looking fine, so I'd expect this to show up in 4.9.12. The 4.8 nightlies are red, but only on metal, and for what look like unrelated reasons (marketplace image). I'll look at backporting to 4.7 and 4.6 next.
Backport BZs have been created to track the inclusion of `ostree-2020.7-6` in OCP 4.9 -> OCP 4.6:

4.9.z - https://bugzilla.redhat.com/show_bug.cgi?id=2037906
4.8.z - https://bugzilla.redhat.com/show_bug.cgi?id=2037907
4.7.z - https://bugzilla.redhat.com/show_bug.cgi?id=2037908
4.6.z - https://bugzilla.redhat.com/show_bug.cgi?id=2037909

Please use those individual BZs to track status for the particular OCP release you are interested in.
To verify this BZ, you can use the test from the upstream PR:
https://github.com/ostreedev/ostree/pull/2453/files#diff-b825c339fe39f7ad35412348ab757db3ad07f37d4b3658fbf8b2daa40ec2a152R15-R31

1. Boot RHCOS 4.10 and confirm the expected version of `ostree` is present.
2. As the `root` user, create a service that creates a socket, and create a FIFO:

```
# cat >/etc/systemd/system/sock-to-ignore.socket << 'EOF'
[Socket]
ListenStream=/etc/sock-to-ignore
EOF

# cat >/etc/systemd/system/sock-to-ignore.service << 'EOF'
[Service]
ExecStart=/bin/cat
EOF

# setenforce 0
# systemctl daemon-reload
# systemctl start --now sock-to-ignore.socket
# setenforce 1
# test -S /etc/sock-to-ignore
# mkfifo /etc/fifo-to-ignore
```

3. Copy a newer version of `strace` to the host and do `rpm-ostree override replace strace.rpm`.
4. Reboot.
5. Verify that there is a log message from `ostree-finalize-staged.service` about skipping the socket/FIFO during the /etc merge: `journalctl -b "-1" -u ostree-finalize-staged.service`
Verification passed with the latest rhcos-410.84.202201112219-0-qemu.x86_64.qcow2.

1. Start the VM and check the ostree version:

```
# rpm -q ostree strace
ostree-2020.7-6.el8_4.x86_64
strace-5.7-2.1.el8_4.x86_64
```

2. Execute Step 2 in Comment 54.

3. Download a newer strace and install it:

```
# download strace-5.13-2.el8.x86_64.rpm
[root@cosa-devsh ~]# rpm-ostree override replace strace-5.13-2.el8.x86_64.rpm
```

4. Reboot and check the result:

```
# reboot

$ rpm -q strace
strace-5.13-2.el8.x86_64

[core@cosa-devsh ~]$ journalctl -b "-1" -u ostree-finalize-staged.service
-- Logs begin at Wed 2022-01-12 01:17:21 UTC, end at Wed 2022-01-12 01:26:24 UTC. --
Jan 12 01:24:37 cosa-devsh systemd[1]: Started OSTree Finalize Staged Deployment.
Jan 12 01:25:23 cosa-devsh systemd[1]: Stopping OSTree Finalize Staged Deployment...
Jan 12 01:25:23 cosa-devsh ostree[1970]: Finalizing staged deployment
Jan 12 01:25:24 cosa-devsh ostree[1970]: Copying /etc changes: 8 modified, 0 removed, 33 added
Jan 12 01:25:24 cosa-devsh ostree[1970]: Copying /etc changes: 8 modified, 0 removed, 33 added
Jan 12 01:25:24 cosa-devsh ostree[1970]: Ignoring non-regular/non-symlink file found during /etc merge: sock-to-ignore
Jan 12 01:25:24 cosa-devsh ostree[1970]: Ignoring non-regular/non-symlink file found during /etc merge: fifo-to-ignore
Jan 12 01:25:25 cosa-devsh ostree[1970]: Bootloader updated; bootconfig swap: yes; deployment count change: 1
Jan 12 01:25:25 cosa-devsh ostree[1970]: Bootloader updated; bootconfig swap: yes; deployment count change: 1
Jan 12 01:25:25 cosa-devsh systemd[1]: ostree-finalize-staged.service: Succeeded.
Jan 12 01:25:25 cosa-devsh systemd[1]: Stopped OSTree Finalize Staged Deployment.
```
Thanks Micah for the confirmation; changing status to VERIFIED per comment 55.
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (Moderate: OpenShift Container Platform 4.10.3 security update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHSA-2022:0056