Created attachment 1870877 [details]
must-gather

OCP Version at Install Time: 4.11.0-0.nightly-2022-04-05-054839
RHCOS Version at Install Time: 411.85.202203181601-0
OCP Version after Upgrade (if applicable):
RHCOS Version after Upgrade (if applicable): 411.86.202204031335-0
Platform: OpenStack
Architecture: x86_64

What are you trying to do? What is your use case?

Deploy OCP. Nothing fancy.
https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/periodic-ci-shiftstack-shiftstack-ci-main-periodic-4.11-e2e-openstack-parallel/1511030372081602560

What happened? What went wrong or what did you expect?

After the RHCOS upgrade, neither kubelet nor crio starts correctly. Looking at journalctl, we can see errors about a read-only file system, such as:

Apr 05 14:21:39 mandre-9mgl2-master-2 systemd[1]: var-lib-containers-storage-overlay-1da4fa1d5716ba68633041af0c61e05324a7a37defb76c5309fac17392cec785-merged.mount: Succeeded.
Apr 05 14:21:39 mandre-9mgl2-master-2 bash[21327]: Error: setxattr /etc/systemd/system/basic.target.wants/coreos-ignition-firstboot-complete.service: read-only file system

What are the steps to reproduce your issue? Please try to reduce these steps to something that can be reproduced with a single RHCOS node.

Deploy OCP using release 4.11.0-0.nightly-2022-04-05-054839.
I failed to mention that sssd is crashing in a loop, which could be a dup of https://bugzilla.redhat.com/show_bug.cgi?id=2072050.
The error that gets logged seems correct: /etc/systemd/system/basic.target.wants/coreos-ignition-firstboot-complete.service is a symlink to /usr/lib/systemd/system/coreos-ignition-firstboot-complete.service, which lives on /usr, which is a read-only filesystem (part of the OS deployment/commit). The logs in fact contain other similar setxattr errors (e.g. on /etc/systemd/system/ctrl-alt-del.target, which links to /usr/lib/systemd/system/reboot.target), so I don't think this is a generic RHCOS issue, as I don't expect services to be tweaking that service unit. Whatever rogue component is trying to add xattrs to files in /etc should reconsider doing that (or at least be ready to cope with symlinks and read-only files). From the logs I couldn't easily tell which service is involved, possibly some bash-in-podman script? This may need further evidence gathering on a running node to track down the specific service, followed by a check of the logic and the observed behavior with the relevant team.
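As a starting point for that evidence gathering, a small script along these lines could list which entries under /etc/systemd/system are symlinks that resolve into /usr, i.e. the paths where a setxattr that follows the link would land on the read-only filesystem. This is only a sketch: `symlinks_into` is a hypothetical helper name, not part of any RHCOS or podman tooling, and it only inspects paths without touching any xattrs.

```python
import os


def symlinks_into(root, prefix):
    """Return symlinks under `root` whose resolved target lies under `prefix`.

    Hypothetical helper for illustration only.
    """
    prefix = os.path.realpath(prefix)
    hits = []
    # followlinks=False: report symlinked directories instead of descending into them.
    for dirpath, dirnames, filenames in os.walk(root, followlinks=False):
        for name in dirnames + filenames:
            path = os.path.join(dirpath, name)
            if not os.path.islink(path):
                continue
            target = os.path.realpath(path)
            if target == prefix or target.startswith(prefix + os.sep):
                hits.append(path)
    return sorted(hits)
```

On a node, calling `symlinks_into("/etc/systemd/system", "/usr")` and printing each result next to its `os.path.realpath` would show the candidates, which could then be compared against the paths named in the setxattr errors.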
Decreasing the severity, as the blocking issue seems to be https://bugzilla.redhat.com/show_bug.cgi?id=2072050, where an SSSD issue causes the boot to hang.
This issue is blocking OpenShift installation, so I'd say the severity should be high enough. I tried a new release of RHCOS that does fix the SSSD problem, and even so I'm still unable to start a new OCP cluster.

I narrowed the issue down to a script that runs a container with podman in a loop until it succeeds; because of the setxattr error, it never will:

root 2158 0.0 0.0 23056 3028 ? Ss 18:51 0:00 /bin/bash -c until /usr/bin/podman run --rm --authfile /var/lib/kubelet/config.json --net=host --volume /etc/systemd/system:/etc/systemd/system:z quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:f3090aca5bc36bb0baeb6ae84bffbd07a9b7549ef5a70cd466bd4d06bd72b2b3 node-ip set --retry-on-failure 192.168.122.119; do sleep 5; done

The problem seems to arise whenever we mount the /etc/systemd/system volume with :z, with any container:

[root@master0 ~]# podman run --volume /etc/systemd/system:/etc/systemd/system:z ubi8/ubi-minimal
Error: setxattr /etc/systemd/system/basic.target.wants/coreos-ignition-firstboot-complete.service: read-only file system

Digging further, I arrived at this podman issue, which seems a perfect match: https://github.com/containers/podman/issues/13727
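To illustrate why the `until ...; do sleep 5; done` loop hides the failure, here is a minimal sketch of a bounded retry that gives up and surfaces the last error instead of spinning forever on a deterministic failure such as the setxattr one. It is not the actual script from the node; `run_with_retries` and its parameters are hypothetical names chosen for this example.

```python
import subprocess
import time


def run_with_retries(cmd, max_attempts=5, delay=5.0):
    """Run `cmd` until it exits 0 or `max_attempts` (>= 1) is reached.

    Returns (succeeded, attempts, last_stderr). Hypothetical helper,
    not the logic used by the script on the node.
    """
    last_stderr = ""
    for attempt in range(1, max_attempts + 1):
        result = subprocess.run(cmd, capture_output=True, text=True)
        if result.returncode == 0:
            return True, attempt, result.stderr
        last_stderr = result.stderr
        # Sleep only between attempts, not after the final failure.
        if attempt < max_attempts:
            time.sleep(delay)
    return False, max_attempts, last_stderr
```

A caller could pass the podman command as a list of arguments and log `last_stderr` when `succeeded` is false, so that an error which can never clear on its own shows up as a failed unit rather than a silent loop.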
Should there be a bug opened with node/container runtimes or does one exist already?
Just opened a bug on RHEL 8.6/podman: https://bugzilla.redhat.com/show_bug.cgi?id=2074090
Martin/Javi - is this BZ still an issue? I think we root caused it to a bad template in the MCO, fixed here - https://github.com/openshift/machine-config-operator/pull/3079
We'll know if https://github.com/openshift/machine-config-operator/pull/3079 fixed the issue once the RHEL 8.6 rebase of RHCOS (and podman 4) lands in OCP.
Not an issue AFAIK. I tried the latest RHEL 8.6-based CoreOS and it works perfectly. The podman version is also bumped to the one with the setxattr fix, so even if the MCO had the bad template, it would work.
Let's close as a duplicate of https://bugzilla.redhat.com/show_bug.cgi?id=2074613 then.

*** This bug has been marked as a duplicate of bug 2074613 ***