Description of problem:
I have a customer who recently hit an issue during an upgrade to OCP 4.6.17, where one of the worker nodes failed validation while checking the expected machine-os-content for the desired rendered config of the worker node. The rpm-ostree status (as captured in the machine-config-daemon logs) shows:

~~~
2021-03-30T23:29:17.278448306Z I0330 23:29:17.278374 3453 daemon.go:862] State: idle
2021-03-30T23:29:17.278448306Z Warning: failed to finalize previous deployment
2021-03-30T23:29:17.278448306Z ostree-finalize-staged.service: Failed with result 'core-dump'.
2021-03-30T23:29:17.278448306Z check `journalctl -b -1 -u ostree-finalize-staged.service`
2021-03-30T23:29:17.278448306Z Deployments:
2021-03-30T23:29:17.278448306Z * pivot://quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:1943d14b0a05126aa758a72fd996f08389aaa2140005d610715a0f3b11eeea57
2021-03-30T23:29:17.278448306Z CustomOrigin: Managed by machine-config-operator
2021-03-30T23:29:17.278448306Z Version: 45.82.202011131531-0 (2020-11-13T15:35:05Z)
2021-03-30T23:29:17.278448306Z
2021-03-30T23:29:17.278448306Z ostree://c95ed1eb5c045492d7293a2a1b7178a050f857944fc46ab3377fca3afd5b7b31
2021-03-30T23:29:17.278448306Z Version: 44.81.202004250133-0 (2020-04-25T01:38:52Z)
2021-03-30T23:29:17.278533903Z I0330 23:29:17.278519 3453 rpm-ostree.go:261] Running captured: journalctl --list-boots
~~~

Checking the service logs, I see the following:

~~~
-- Logs begin at Tue 2021-03-30 07:35:34 UTC, end at Wed 2021-03-31 11:52:15 UTC. --
Mar 31 00:26:28 gocpp2-9gbr8-worker-1-j6hr2 systemd[1]: Started OSTree Finalize Staged Deployment.
Mar 31 00:26:45 gocpp2-9gbr8-worker-1-j6hr2 systemd[1]: Stopping OSTree Finalize Staged Deployment...
Mar 31 00:26:45 gocpp2-9gbr8-worker-1-j6hr2 ostree[17337]: Finalizing staged deployment
Mar 31 00:26:48 gocpp2-9gbr8-worker-1-j6hr2 ostree[17337]: Copying /etc changes: 29 modified, 1 removed, 214 added
Mar 31 00:26:48 gocpp2-9gbr8-worker-1-j6hr2 ostree[17337]: Copying /etc changes: 29 modified, 1 removed, 214 added
Mar 31 00:26:49 gocpp2-9gbr8-worker-1-j6hr2 ostree[17337]: **
Mar 31 00:26:49 gocpp2-9gbr8-worker-1-j6hr2 ostree[17337]: OSTree:ERROR:src/libostree/ostree-sysroot-deploy.c:1682:install_deployment_kernel: assertion failed (kernel_layout->bootcsum == b>
Mar 31 00:26:49 gocpp2-9gbr8-worker-1-j6hr2 systemd[1]: ostree-finalize-staged.service: Control process exited, code=dumped status=6
Mar 31 00:26:49 gocpp2-9gbr8-worker-1-j6hr2 systemd[1]: ostree-finalize-staged.service: Failed with result 'core-dump'.
Mar 31 00:26:49 gocpp2-9gbr8-worker-1-j6hr2 systemd[1]: Stopped OSTree Finalize Staged Deployment.
Mar 31 00:26:49 gocpp2-9gbr8-worker-1-j6hr2 systemd[1]: ostree-finalize-staged.service: Consumed 1.350s CPU time
lines 1-12/12 (END)
~~~

Attached in the Google Drive:
1. sos report from the node
2. latest must-gather from the cluster
3. tar of the /var/log/journal/ directory from the problematic node

Version-Release number of selected component (if applicable):

How reproducible:

Steps to Reproduce:
1.
2.
3.

Actual results:

Expected results:

Additional info:
Unfortunately, the journal files that are in the giant tarball don't have any detailed stack trace from `ostree-finalize-staged.service` dumping core. In fact, it looks like the `systemd-coredump` service wasn't able to start successfully at all, which would explain why there is no core file to work with. (The error from `systemd-coredump.socket` appears to be a generic catch-all: https://github.com/systemd/systemd/blob/87f9300d71e30ab04dee60cf78b8cdf09506c65b/src/core/service.h#L70, which seems similar to the socket version of the error at https://github.com/systemd/systemd/blob/main/src/core/socket.h#L34.)

FWIW, the first evidence of `ostree-finalize-staged.service` dumping core shows up in the journal file `system`:

```
$ journalctl --file system | grep -B 50 -A 10 -m 1 ostree-sysroot-deploy
Mar 30 19:28:30 gocpp2-9gbr8-worker-1-j6hr2 systemd[1]: Closed D-Bus System Message Bus Socket.
Mar 30 19:28:30 gocpp2-9gbr8-worker-1-j6hr2 systemd[1]: dbus.socket: Consumed 0 CPU time
Mar 30 19:28:30 gocpp2-9gbr8-worker-1-j6hr2 ostree[349103]: Finalizing staged deployment
Mar 30 19:28:30 gocpp2-9gbr8-worker-1-j6hr2 systemd[1]: Stopped crio-conmon-2f2828c75715e1eb6b7b71aecaead22ebc1ec499c25974ba920c9e406d0c8312.scope.
Mar 30 19:28:30 gocpp2-9gbr8-worker-1-j6hr2 systemd[1]: crio-conmon-2f2828c75715e1eb6b7b71aecaead22ebc1ec499c25974ba920c9e406d0c8312.scope: Consumed 87ms CPU time
Mar 30 19:28:30 gocpp2-9gbr8-worker-1-j6hr2 systemd[1]: Stopped crio-conmon-31e0bfe4f7fba89c258ad00f90bbf738f828d968dd520f7eb28643efb1cc839e.scope.
Mar 30 19:28:30 gocpp2-9gbr8-worker-1-j6hr2 systemd[1]: crio-conmon-31e0bfe4f7fba89c258ad00f90bbf738f828d968dd520f7eb28643efb1cc839e.scope: Consumed 87ms CPU time
Mar 30 19:28:30 gocpp2-9gbr8-worker-1-j6hr2 systemd[1]: Stopped libcontainer container 2f2828c75715e1eb6b7b71aecaead22ebc1ec499c25974ba920c9e406d0c8312.
Mar 30 19:28:30 gocpp2-9gbr8-worker-1-j6hr2 systemd[1]: crio-2f2828c75715e1eb6b7b71aecaead22ebc1ec499c25974ba920c9e406d0c8312.scope: Consumed 1min 46.381s CPU time
Mar 30 19:28:30 gocpp2-9gbr8-worker-1-j6hr2 systemd[1]: Stopped libcontainer container 31e0bfe4f7fba89c258ad00f90bbf738f828d968dd520f7eb28643efb1cc839e.
Mar 30 19:28:30 gocpp2-9gbr8-worker-1-j6hr2 systemd[1]: crio-31e0bfe4f7fba89c258ad00f90bbf738f828d968dd520f7eb28643efb1cc839e.scope: Consumed 1.075s CPU time
Mar 30 19:28:30 gocpp2-9gbr8-worker-1-j6hr2 systemd[1]: Stopped libcontainer container dec1c97af1328289e7d8b8221decf147dbea9b5a5680d3f008ee85466de2072d.
Mar 30 19:28:30 gocpp2-9gbr8-worker-1-j6hr2 systemd[1]: crio-dec1c97af1328289e7d8b8221decf147dbea9b5a5680d3f008ee85466de2072d.scope: Consumed 781ms CPU time
Mar 30 19:28:30 gocpp2-9gbr8-worker-1-j6hr2 systemd[1]: Stopped crio-conmon-dec1c97af1328289e7d8b8221decf147dbea9b5a5680d3f008ee85466de2072d.scope.
Mar 30 19:28:30 gocpp2-9gbr8-worker-1-j6hr2 systemd[1]: crio-conmon-dec1c97af1328289e7d8b8221decf147dbea9b5a5680d3f008ee85466de2072d.scope: Consumed 88ms CPU time
Mar 30 19:28:30 gocpp2-9gbr8-worker-1-j6hr2 systemd[1]: Removed slice libcontainer container kubepods-burstable-pod7a9564f6_dc05_4bf2_8906_ae2a4d790ef2.slice.
Mar 30 19:28:30 gocpp2-9gbr8-worker-1-j6hr2 systemd[1]: kubepods-burstable-pod7a9564f6_dc05_4bf2_8906_ae2a4d790ef2.slice: Consumed 1min 49.504s CPU time
Mar 30 19:28:30 gocpp2-9gbr8-worker-1-j6hr2 systemd[1]: Stopped crio-conmon-c30073d22f67af35e664160b772c11dfb4bc605d4b3b5733433784e6e2c2f81c.scope.
Mar 30 19:28:30 gocpp2-9gbr8-worker-1-j6hr2 systemd[1]: crio-conmon-c30073d22f67af35e664160b772c11dfb4bc605d4b3b5733433784e6e2c2f81c.scope: Consumed 394ms CPU time
Mar 30 19:28:30 gocpp2-9gbr8-worker-1-j6hr2 systemd[1]: Stopped crio-conmon-6ab91aa076b1d331f8fb3e0b85bce1f06055ea4f0d6ab0de2633b9d12499b907.scope.
Mar 30 19:28:30 gocpp2-9gbr8-worker-1-j6hr2 systemd[1]: crio-conmon-6ab91aa076b1d331f8fb3e0b85bce1f06055ea4f0d6ab0de2633b9d12499b907.scope: Consumed 86ms CPU time
Mar 30 19:28:30 gocpp2-9gbr8-worker-1-j6hr2 systemd[1]: Stopped libcontainer container c30073d22f67af35e664160b772c11dfb4bc605d4b3b5733433784e6e2c2f81c.
Mar 30 19:28:30 gocpp2-9gbr8-worker-1-j6hr2 systemd[1]: crio-c30073d22f67af35e664160b772c11dfb4bc605d4b3b5733433784e6e2c2f81c.scope: Consumed 9min 16.723s CPU time
Mar 30 19:28:30 gocpp2-9gbr8-worker-1-j6hr2 systemd[1]: Removed slice libcontainer container kubepods-burstable-podb24dfefc_a3b1_4809_aeb5_19689114dda1.slice.
Mar 30 19:28:30 gocpp2-9gbr8-worker-1-j6hr2 systemd[1]: kubepods-burstable-podb24dfefc_a3b1_4809_aeb5_19689114dda1.slice: Consumed 9min 18.498s CPU time
Mar 30 19:28:30 gocpp2-9gbr8-worker-1-j6hr2 systemd[1]: Stopped libcontainer container 798da44176ac0e642c2253401eb037d3c399a80700b3f1de11f2287dddcad4c9.
Mar 30 19:28:30 gocpp2-9gbr8-worker-1-j6hr2 systemd[1]: crio-798da44176ac0e642c2253401eb037d3c399a80700b3f1de11f2287dddcad4c9.scope: Consumed 1.758s CPU time
Mar 30 19:28:30 gocpp2-9gbr8-worker-1-j6hr2 systemd[1]: Stopped libcontainer container 6ab91aa076b1d331f8fb3e0b85bce1f06055ea4f0d6ab0de2633b9d12499b907.
Mar 30 19:28:30 gocpp2-9gbr8-worker-1-j6hr2 systemd[1]: crio-6ab91aa076b1d331f8fb3e0b85bce1f06055ea4f0d6ab0de2633b9d12499b907.scope: Consumed 1.617s CPU time
Mar 30 19:28:30 gocpp2-9gbr8-worker-1-j6hr2 systemd[1]: Removed slice libcontainer container kubepods-burstable-pod64072f19_880b_446d_8983_57391862c2a5.slice.
Mar 30 19:28:30 gocpp2-9gbr8-worker-1-j6hr2 systemd[1]: kubepods-burstable-pod64072f19_880b_446d_8983_57391862c2a5.slice: Consumed 18.589s CPU time
Mar 30 19:28:30 gocpp2-9gbr8-worker-1-j6hr2 systemd[1]: Stopped crio-conmon-798da44176ac0e642c2253401eb037d3c399a80700b3f1de11f2287dddcad4c9.scope.
Mar 30 19:28:30 gocpp2-9gbr8-worker-1-j6hr2 systemd[1]: crio-conmon-798da44176ac0e642c2253401eb037d3c399a80700b3f1de11f2287dddcad4c9.scope: Consumed 87ms CPU time
Mar 30 19:28:30 gocpp2-9gbr8-worker-1-j6hr2 systemd[1]: Removed slice libcontainer container kubepods-burstable-pod427eaf5b_51db_4731_9b82_6ab23e2380a7.slice.
Mar 30 19:28:30 gocpp2-9gbr8-worker-1-j6hr2 systemd[1]: kubepods-burstable-pod427eaf5b_51db_4731_9b82_6ab23e2380a7.slice: Consumed 1min 40.405s CPU time
Mar 30 19:28:31 gocpp2-9gbr8-worker-1-j6hr2 systemd[1]: Stopped crio-conmon-a3d5eb26f97c6fda5241539471f2eecdd04a6254cca4e0340e88514a17966523.scope.
Mar 30 19:28:31 gocpp2-9gbr8-worker-1-j6hr2 systemd[1]: crio-conmon-a3d5eb26f97c6fda5241539471f2eecdd04a6254cca4e0340e88514a17966523.scope: Consumed 82ms CPU time
Mar 30 19:28:31 gocpp2-9gbr8-worker-1-j6hr2 systemd[1]: Stopped libcontainer container a3d5eb26f97c6fda5241539471f2eecdd04a6254cca4e0340e88514a17966523.
Mar 30 19:28:31 gocpp2-9gbr8-worker-1-j6hr2 systemd[1]: crio-a3d5eb26f97c6fda5241539471f2eecdd04a6254cca4e0340e88514a17966523.scope: Consumed 263ms CPU time
Mar 30 19:28:31 gocpp2-9gbr8-worker-1-j6hr2 systemd[1]: Removed slice libcontainer container kubepods-burstable-podc9532e71d69c40d6430fbf1af181f14b.slice.
Mar 30 19:28:31 gocpp2-9gbr8-worker-1-j6hr2 systemd[1]: kubepods-burstable-podc9532e71d69c40d6430fbf1af181f14b.slice: Consumed 908ms CPU time
Mar 30 19:28:32 gocpp2-9gbr8-worker-1-j6hr2 ostree[349103]: Copying /etc changes: 27 modified, 1 removed, 211 added
Mar 30 19:28:32 gocpp2-9gbr8-worker-1-j6hr2 ostree[349103]: Copying /etc changes: 27 modified, 1 removed, 211 added
Mar 30 19:28:32 gocpp2-9gbr8-worker-1-j6hr2 systemd[1]: Stopped libcontainer container 860195f44a8617b2ecfccfc033656b9c3f4ae1150934dfcad6d81d4876c685d6.
Mar 30 19:28:32 gocpp2-9gbr8-worker-1-j6hr2 systemd[1]: crio-860195f44a8617b2ecfccfc033656b9c3f4ae1150934dfcad6d81d4876c685d6.scope: Consumed 13h 55min 25.411s CPU time
Mar 30 19:28:32 gocpp2-9gbr8-worker-1-j6hr2 systemd[1]: Stopped crio-conmon-860195f44a8617b2ecfccfc033656b9c3f4ae1150934dfcad6d81d4876c685d6.scope.
Mar 30 19:28:32 gocpp2-9gbr8-worker-1-j6hr2 systemd[1]: crio-conmon-860195f44a8617b2ecfccfc033656b9c3f4ae1150934dfcad6d81d4876c685d6.scope: Consumed 597ms CPU time
Mar 30 19:28:32 gocpp2-9gbr8-worker-1-j6hr2 systemd[1]: Removed slice libcontainer container kubepods-burstable-pod1111092f_8f6c_4d90_b10e_dcc631ebf2c6.slice.
Mar 30 19:28:32 gocpp2-9gbr8-worker-1-j6hr2 systemd[1]: kubepods-burstable-pod1111092f_8f6c_4d90_b10e_dcc631ebf2c6.slice: Consumed 13h 55min 26.647s CPU time
Mar 30 19:28:33 gocpp2-9gbr8-worker-1-j6hr2 ostree[349103]: **
Mar 30 19:28:33 gocpp2-9gbr8-worker-1-j6hr2 ostree[349103]: OSTree:ERROR:src/libostree/ostree-sysroot-deploy.c:1682:install_deployment_kernel: assertion failed (kernel_layout->bootcsum == bootcsum): ("f6972c6abcd2c2ccd8d9b850cceb48022fe9413ed6578bc8cb2c2a61469cb2a4" == "b9b2eb4a84cad86d10b334f55197d623f099053a44d7ce797a13f998bc7465db")
Mar 30 19:28:33 gocpp2-9gbr8-worker-1-j6hr2 systemd[1]: Requested transaction contradicts existing jobs: Transaction is destructive.
Mar 30 19:28:33 gocpp2-9gbr8-worker-1-j6hr2 systemd[1]: systemd-coredump.socket: Failed to queue service startup job (Maybe the service file is missing or not a non-template unit?): Transaction is destructive.
Mar 30 19:28:33 gocpp2-9gbr8-worker-1-j6hr2 systemd[1]: systemd-coredump.socket: Failed with result 'resources'.
Mar 30 19:28:33 gocpp2-9gbr8-worker-1-j6hr2 systemd[1]: systemd-coredump.socket: Consumed 0 CPU time
Mar 30 19:28:33 gocpp2-9gbr8-worker-1-j6hr2 systemd[1]: ostree-finalize-staged.service: Control process exited, code=dumped status=6
Mar 30 19:28:33 gocpp2-9gbr8-worker-1-j6hr2 systemd[1]: ostree-finalize-staged.service: Failed with result 'core-dump'.
Mar 30 19:28:33 gocpp2-9gbr8-worker-1-j6hr2 systemd[1]: Stopped OSTree Finalize Staged Deployment.
Mar 30 19:28:33 gocpp2-9gbr8-worker-1-j6hr2 systemd[1]: ostree-finalize-staged.service: Consumed 1.325s CPU time
Mar 30 19:28:33 gocpp2-9gbr8-worker-1-j6hr2 systemd[1]: Stopped OSTree Monitor Staged Deployment.
Mar 30 19:28:33 gocpp2-9gbr8-worker-1-j6hr2 systemd[1]: Stopped target System Initialization.
```

@Colin could you have a look at this and see if there is something I missed?
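As an aside (hypothetical here, since `systemd-coredump` never managed to start on this node): on a node where the coredump machinery is healthy, the backtrace for a crash like this would normally be retrievable along these lines:

```
# List captured core dumps, then show metadata and (if available) a backtrace
# for the crashed ostree process.
$ coredumpctl list
$ coredumpctl info ostree
```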
I think we'd also need the output of `ls -al /boot/loader /ostree/boot.*/*`, and `rpm-ostree status` would be helpful too.

Note this assertion was removed in https://github.com/ostreedev/ostree/commit/aa2a2783ea71d956db41f0f874d596752e47449f, which is in 2020.7, which is shipped in OpenShift 4.7.

One clear problem from this trace is that kubelet is still running when we go to shut down, which it shouldn't be. I believe that's also been causing other problems; as of 4.8 I think we're now correctly blocking shutdown until kubelet is done.
A quick look at the journal here indeed shows that kubelet is stopping simultaneously with ostree-finalize-staged starting. That's covered by https://github.com/openshift/machine-config-operator/pull/2414, which (following the cherry-pick chain) landed in https://openshift-release.apps.ci.l2s4.p1.openshiftapps.com/releasestream/4-stable/release/4.6.29. But note it only helps for clusters upgrading *from* 4.6.29.

I think the root cause here may be related to:

> Mar 31 09:51:27 gocpp2-9gbr8-worker-1-j6hr2 root[23267]: machine-config-daemon[3436]: Disk currentConfig rendered-worker-9af53086b18e5c57f05dc57e913ae4b8 overrides node's currentConfig annotation rendered-worker-5f7e84fd8cda6fd543758dd44c5063db

which might explain this, because we end up racing to synchronize `/etc` with the finalization process. This still needs more investigation though.

# Workarounds

Depending on the deployment model, one simple option is to reprovision the affected nodes. Barring that, one thing we can try is effectively "cherry picking" the change from https://github.com/openshift/machine-config-operator/pull/2414 by adding an override via `oc debug node` or ssh.
To elaborate on the above proposed workaround, via `oc debug node` or ssh (as root on the node):

```
$ mkdir -p /etc/systemd/system/kubelet.service.d
$ echo -e '[Unit]\nAfter=ostree-finalize-staged.service' > /etc/systemd/system/kubelet.service.d/backport-mco-2414.conf
$ systemctl daemon-reload
```
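As a sanity check (my own suggestion, not part of the original workaround), the override can be confirmed after the `daemon-reload`:

```
# The drop-in should be listed under "Drop-In:" in the unit status, and the
# merged unit text should contain the new [Unit] ordering. Because systemd
# reverses ordering at shutdown, After=ostree-finalize-staged.service means
# kubelet is stopped before the staged deployment is finalized.
$ systemctl status kubelet.service --no-pager | head -n 6
$ systemctl cat kubelet.service
```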
Hello Team,

In one of our cases, the customer is hitting a similar issue after a cluster upgrade from 4.6.32 to 4.6.42: all the operators upgraded and are available, but multiple worker nodes are marked SchedulingDisabled and are not getting upgraded.

```
$ omg logs machine-config-daemon-x67jp -c machine-config-daemon
2021-09-17T22:59:48.983294844Z E0917 22:59:48.983282 3904 writer.go:135] Marking Degraded due to: unexpected on-disk state validating against rendered-unset-c96de46d04e9074a7ad22d26faa069b1: expected target osImageURL "quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:9f9926665f165d3fbc5f0e089b0bbff0a77440a7aac41da32a69db6a4b21b2cc", have "quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:5a747cb0239e38057336ca8dc4fb528d9cbc93a333409d02151a7ade40aaa4a1"
```

```
# rpm-ostree status
State: idle
Warning: failed to finalize previous deployment
         ostree-finalize-staged.service: Failed with result 'timeout'.
         check `journalctl -b -1 -u ostree-finalize-staged.service`
Deployments:
● pivot://quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:5a747cb0239e38057336ca8dc4fb528d9cbc93a333409d02151a7ade40aaa4a1
              CustomOrigin: Managed by machine-config-operator
                   Version: 46.82.202105291300-0 (2021-05-29T13:03:27Z)

  pivot://quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:5a747cb0239e38057336ca8dc4fb528d9cbc93a333409d02151a7ade40aaa4a1
              CustomOrigin: Managed by machine-config-operator
                   Version: 46.82.202105291300-0 (2021-05-29T13:03:27Z)
```

```
# journalctl -b -1 -u ostree-finalize-staged.service
-- Logs begin at Thu 2021-09-16 04:49:30 UTC, end at Sat 2021-09-18 01:22:38 UTC. --
Sep 17 21:52:57 e1n4.fbond systemd[1]: Started OSTree Finalize Staged Deployment.
Sep 17 21:53:21 e1n4.fbond systemd[1]: Stopping OSTree Finalize Staged Deployment...
Sep 17 21:53:21 e1n4.fbond ostree[399903]: Finalizing staged deployment
Sep 17 21:53:22 e1n4.fbond ostree[399903]: Copying /etc changes: 21 modified, 0 removed, 210 added
Sep 17 21:53:22 e1n4.fbond ostree[399903]: Copying /etc changes: 21 modified, 0 removed, 210 added
Sep 17 21:58:21 e1n4.fbond systemd[1]: ostree-finalize-staged.service: Stopping timed out. Terminating.
Sep 17 22:03:21 e1n4.fbond systemd[1]: ostree-finalize-staged.service: State 'stop-sigterm' timed out. Killing.
Sep 17 22:03:21 e1n4.fbond systemd[1]: ostree-finalize-staged.service: Killing process 399903 (ostree) with signal SIGKILL.
Sep 17 22:08:21 e1n4.fbond systemd[1]: ostree-finalize-staged.service: Processes still around after SIGKILL. Ignoring.
Sep 17 22:13:21 e1n4.fbond systemd[1]: ostree-finalize-staged.service: State 'stop-final-sigterm' timed out. Killing.
Sep 17 22:13:21 e1n4.fbond systemd[1]: ostree-finalize-staged.service: Killing process 399903 (ostree) with signal SIGKILL.
Sep 17 22:18:22 e1n4.fbond systemd[1]: ostree-finalize-staged.service: Processes still around after final SIGKILL. Entering failed mode.
Sep 17 22:18:22 e1n4.fbond systemd[1]: ostree-finalize-staged.service: Failed with result 'timeout'.
Sep 17 22:18:22 e1n4.fbond systemd[1]: Stopped OSTree Finalize Staged Deployment.
Sep 17 22:18:22 e1n4.fbond systemd[1]: ostree-finalize-staged.service: Consumed 1.545s CPU time
```

```
# ls -al /boot/loader /ostree/boot.*/*
lrwxrwxrwx. 1 root root 8 Sep 16 02:42 /boot/loader -> loader.1

/ostree/boot.0.1/rhcos:
total 0
drwxr-xr-x. 4 root root 150 Sep 17 21:53 .
drwxr-xr-x. 3 root root 19 Sep 17 21:53 ..
drwxr-xr-x. 2 root root 15 Sep 17 21:53 5d176d144f2f638bac6c488224cba497681646261b0eabeda8965fe7f3a87bcf
drwxr-xr-x. 2 root root 15 Sep 17 21:53 85b9ff4cf967e9434b1be633ed7355ec79427b853b0d22260c941d521855792f

/ostree/boot.0/rhcos:
total 0
drwxr-xr-x. 4 root root 150 Sep 17 21:53 .
drwxr-xr-x. 3 root root 19 Sep 17 21:53 ..
drwxr-xr-x. 2 root root 15 Sep 17 21:53 5d176d144f2f638bac6c488224cba497681646261b0eabeda8965fe7f3a87bcf
drwxr-xr-x. 2 root root 15 Sep 17 21:53 85b9ff4cf967e9434b1be633ed7355ec79427b853b0d22260c941d521855792f

/ostree/boot.1.0/rhcos:
total 0
drwxr-xr-x. 3 root root 78 Sep 16 12:59 .
drwxr-xr-x. 3 root root 19 Sep 16 12:59 ..
drwxr-xr-x. 2 root root 24 Sep 16 12:59 5d176d144f2f638bac6c488224cba497681646261b0eabeda8965fe7f3a87bcf

/ostree/boot.1/rhcos:
total 0
drwxr-xr-x. 3 root root 78 Sep 16 12:59 .
drwxr-xr-x. 3 root root 19 Sep 16 12:59 ..
drwxr-xr-x. 2 root root 24 Sep 16 12:59 5d176d144f2f638bac6c488224cba497681646261b0eabeda8965fe7f3a87bcf
```

I will be attaching the sosreport. As of now the workaround for this is to follow the KCS below, but the customer wants to know why this issue is coming up, as it is not feasible to fix each and every node with the workaround.

--> https://access.redhat.com/solutions/5598401

Regards,
Ayush Garg
Specifically, the place to put sockets is `/run`.
I'm not seeing such a socket in a fresh 4.8.0-0.nightly-2021-09-18-202713 cluster. I'm guessing this is from some 3rd party daemonset or automation?
Hi, I went deep into the cluster and was able to find the cause. In the ARO case the issue seems to be caused by an installed component: specifically, the CSI driver https://github.com/Azure/secrets-store-csi-driver-provider-azure, which causes azure.sock to block the ostree finalization. More investigation has to be done into why this happens. Most importantly, that operator is the most likely cause of the ARO issue, not of the other reports here.
Great debugging Petr! Looking at the repo I see https://github.com/Azure/secrets-store-csi-driver-provider-azure/blob/9a7b4826f3e41847b8c9a4603891af1e8a893d42/manifest_staging/charts/csi-secrets-store-provider-azure/templates/provider-azure-installer.yaml#L77-L96, which looks to be setting up the socket in /etc/kubernetes/secrets-store-csi-providers by default. Given this is an issue with how secrets-store-csi-driver-provider-azure is set up, would it be fair to close this RHCOS bug and have your team work with the Azure/secrets-store-csi-driver-provider-azure team on a workaround or fix?
As for ARO, I believe the CSI driver is to blame. However, I do not believe this bug can be closed, as the original report did not come from ARO.
Are you able to confirm the fix and/or workaround worked for you? If not, could you provide updated log output for folks to review?
I think we could make a change to ostree to ignore sockets: https://github.com/ostreedev/ostree/issues/2446

The issue here is somewhat ostree-specific in that, to my knowledge, few other operating system update systems version `/etc` by default. The versioning of `/etc`, though, is quite important to how we do transactional updates.

Did anyone file an issue against https://github.com/Azure/secrets-store-csi-driver-provider-azure yet?
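As a side note (my own diagnostic suggestion, not from the original thread), a quick way to check whether a node has any non-regular files under `/etc` that the `/etc` merge could stumble over is something like:

```
# List sockets (-type s) and FIFOs (-type p) under /etc; any hits are
# candidates for moving to /run, or for the skip behavior discussed in the
# ostree issue linked above.
$ find /etc -type s -o -type p
```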
OK: https://github.com/ostreedev/ostree/pull/2453

However, I still strongly recommend moving sockets into `/run`. The basic rationale is that a socket is inherently something bound to a *process*, and processes go away on reboot, so having the socket persist just causes problems. Even more generally, the best approach now is to use systemd to listen on the socket and use socket activation (http://0pointer.de/blog/projects/socket-activation.html), which binds the lifecycle of the socket and process in a rich, sophisticated way.
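For illustration only (the unit and binary names below are hypothetical, not from this bug), a minimal socket-activated pair that keeps its socket under `/run` might look like:

```
# /etc/systemd/system/example-provider.socket
[Unit]
Description=Example provider socket (hypothetical)

[Socket]
# The socket lives in /run, so it vanishes on reboot and is never seen by
# the OSTree /etc merge during finalization.
ListenStream=/run/example-provider.sock

[Install]
WantedBy=sockets.target

# /etc/systemd/system/example-provider.service
[Unit]
Description=Example provider (hypothetical, socket-activated)

[Service]
# The daemon inherits the listening socket from systemd (sd_listen_fds())
# instead of creating its own socket under /etc. No [Install] section is
# needed; the socket unit activates it on the first connection.
ExecStart=/usr/local/bin/example-provider
```

With that split, `systemctl enable --now example-provider.socket` is enough: systemd only starts the service when something connects, and nothing socket-related persists across reboots.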
`error: During /etc merge: Reading modified config file: fstatat(kubernetes/aide.reinit): No such file or directory`

At this point, this bugzilla is a mix of multiple distinct issues that can cause the finalization process to fail. *That* failure is a race where kubelet.service is still running when we go to reboot, which is often caused by drain failures. We added a workaround for this in https://github.com/openshift/machine-config-operator/commit/0369c47835fe9e2a36c3c72b27f1126d8391579a, but it won't help all situations. Graceful node shutdown (https://kubernetes.io/blog/2021/04/21/graceful-node-shutdown-beta/) is the comprehensive fix to this, as it will prevent the reboot process from starting at all until kubelet is ready for it.
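For reference only: the upstream feature is driven by two kubelet configuration fields. The fragment below is a sketch of the upstream KubeletConfiguration (values are purely illustrative; how this gets wired into a given OCP release may differ and is not prescribed by this bug):

```
# Upstream KubeletConfiguration fragment (illustrative values)
apiVersion: kubelet.config.k8s.io/v1beta1
kind: KubeletConfiguration
featureGates:
  GracefulNodeShutdown: true           # beta as of Kubernetes 1.21
shutdownGracePeriod: "5m"              # total time the node delays shutdown
shutdownGracePeriodCriticalPods: "1m"  # portion reserved for critical pods
```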
Thanks - hopefully that lands in 4.10!

Any thoughts on the failure to pivot manually afterwards for the master?

```
error: with stdout output: : error running rpm-ostree rebase --experimental /run/mco-machine-os-content/os-content-884750218/srv/repo:17cd2523079d57070db3494b8a3a47649bd33a82806da0287a16743ed7e2f9de --custom-origin-url pivot://quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:5483701cda95d5ac2fa1ef22c72e15f4bd4104204e16c499bc1d5ab7debba054 --custom-origin-description Managed by machine-config-operator: exit status 1
error: No enabled repositories
```
This is a straightforward and low-risk backport; sorry, I should have initiated that process at the same time as fixing it upstream. Started the paperwork in https://bugzilla.redhat.com/show_bug.cgi?id=2027788

(But it's also still a good idea to fix upstream projects to not put sockets in /etc.)
The build from https://bugzilla.redhat.com/show_bug.cgi?id=2029577 has been cherry picked to current 4.10 nightlies:

```
$ oc image info (oc adm release info --image-for=machine-os-content registry.ci.openshift.org/ocp/release:4.10.0-0.nightly-2021-12-12-232810) | grep ostree
...
com.coreos.rpm.ostree=2020.7-6.el8_4.x86_64
$
```

I'll be working on fast-tracking it to 4.9 and 4.8 shortly.
The updated ostree is in 4.9 and 4.8 nightlies:

```
$ oc image info (oc adm release info --image-for=machine-os-content registry.ci.openshift.org/ocp/release:4.9.0-0.nightly-2021-12-15-103522) | grep rpm.ostree
com.coreos.rpm.ostree=2020.7-6.el8_4.x86_64
...
$ oc image info (oc adm release info --image-for=machine-os-content registry.ci.openshift.org/ocp/release:4.8.0-0.nightly-2021-12-15-153530) | grep rpm.ostree
com.coreos.rpm.ostree=2020.7-6.el8_4.x86_64
...
$
```

Promotion jobs for 4.9 are looking fine, so I'd expect this to show up in 4.9.12. The 4.8 nightlies are red, but only on metal, and for what look like unrelated reasons (marketplace image). I'll look at backporting to 4.7 and 4.6 next.
Backport BZs have been created to track the inclusion of `ostree-2020.7-6` in OCP 4.9 -> OCP 4.6:

4.9.z - https://bugzilla.redhat.com/show_bug.cgi?id=2037906
4.8.z - https://bugzilla.redhat.com/show_bug.cgi?id=2037907
4.7.z - https://bugzilla.redhat.com/show_bug.cgi?id=2037908
4.6.z - https://bugzilla.redhat.com/show_bug.cgi?id=2037909

Please use those individual BZs to track status for the particular OCP release you are interested in.
To verify this BZ, you can use the test from the upstream PR:
https://github.com/ostreedev/ostree/pull/2453/files#diff-b825c339fe39f7ad35412348ab757db3ad07f37d4b3658fbf8b2daa40ec2a152R15-R31

1. Boot RHCOS 4.10 and confirm the expected version of `ostree` is present.
2. As the `root` user, create a service that creates a socket, and create a FIFO:

```
# cat >/etc/systemd/system/sock-to-ignore.socket << 'EOF'
[Socket]
ListenStream=/etc/sock-to-ignore
EOF

# cat >/etc/systemd/system/sock-to-ignore.service << 'EOF'
[Service]
ExecStart=/bin/cat
EOF

# setenforce 0
# systemctl daemon-reload
# systemctl start --now sock-to-ignore.socket
# setenforce 1
# test -S /etc/sock-to-ignore
# mkfifo /etc/fifo-to-ignore
```

3. Copy a newer version of `strace` to the host and do `rpm-ostree override replace strace.rpm`.
4. Reboot.
5. Verify that there is a log message from `ostree-finalize-staged.service` about skipping the socket/FIFO during the /etc merge: `journalctl -b "-1" -u ostree-finalize-staged.service`
Verification passed with the latest rhcos-410.84.202201112219-0-qemu.x86_64.qcow2.

1. Start the VM and check the ostree version:

```
# rpm -q ostree strace
ostree-2020.7-6.el8_4.x86_64
strace-5.7-2.1.el8_4.x86_64
```

2. Execute Step 2 in Comment 54.

3. Download a newer strace and install it:

```
# download strace-5.13-2.el8.x86_64.rpm
[root@cosa-devsh ~]# rpm-ostree override replace strace-5.13-2.el8.x86_64.rpm
```

4. Reboot and check the result:

```
# reboot

$ rpm -q strace
strace-5.13-2.el8.x86_64

[core@cosa-devsh ~]$ journalctl -b "-1" -u ostree-finalize-staged.service
-- Logs begin at Wed 2022-01-12 01:17:21 UTC, end at Wed 2022-01-12 01:26:24 UTC. --
Jan 12 01:24:37 cosa-devsh systemd[1]: Started OSTree Finalize Staged Deployment.
Jan 12 01:25:23 cosa-devsh systemd[1]: Stopping OSTree Finalize Staged Deployment...
Jan 12 01:25:23 cosa-devsh ostree[1970]: Finalizing staged deployment
Jan 12 01:25:24 cosa-devsh ostree[1970]: Copying /etc changes: 8 modified, 0 removed, 33 added
Jan 12 01:25:24 cosa-devsh ostree[1970]: Copying /etc changes: 8 modified, 0 removed, 33 added
Jan 12 01:25:24 cosa-devsh ostree[1970]: Ignoring non-regular/non-symlink file found during /etc merge: sock-to-ignore
Jan 12 01:25:24 cosa-devsh ostree[1970]: Ignoring non-regular/non-symlink file found during /etc merge: fifo-to-ignore
Jan 12 01:25:25 cosa-devsh ostree[1970]: Bootloader updated; bootconfig swap: yes; deployment count change: 1
Jan 12 01:25:25 cosa-devsh ostree[1970]: Bootloader updated; bootconfig swap: yes; deployment count change: 1
Jan 12 01:25:25 cosa-devsh systemd[1]: ostree-finalize-staged.service: Succeeded.
Jan 12 01:25:25 cosa-devsh systemd[1]: Stopped OSTree Finalize Staged Deployment.
```
Thanks Micah for the confirmation; changing status to VERIFIED per comment 55.
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (Moderate: OpenShift Container Platform 4.10.3 security update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHSA-2022:0056