Description of problem:

The cluster upgrade was triggered from 4.6.32 to 4.6.42 (BareMetal UPI). All of the operators upgraded and became available. However, multiple worker nodes are stuck during the upgrade with the same errors described in the following KCS, and after multiple tries the issue gets resolved using the workaround from that KCS. --> https://access.redhat.com/solutions/5598401

~~~
$ omg logs machine-config-daemon-j5cbm -c machine-config-daemon
2021-09-18T04:40:27.692760894Z E0918 04:40:27.692748 4174 writer.go:135] Marking Degraded due to: unexpected on-disk state validating against rendered-worker-c96de46d04e9074a7ad22d26faa069b1: expected target osImageURL "quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:9f9926665f165d3fbc5f0e089b0bbff0a77440a7aac41da32a69db6a4b21b2cc", have "quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:5a747cb0239e38057336ca8dc4fb528d9cbc93a333409d02151a7ade40aaa4a1"
~~~

~~~
# rpm-ostree status
State: idle
Warning: failed to finalize previous deployment
         ostree-finalize-staged.service: Failed with result 'timeout'.
         check `journalctl -b -1 -u ostree-finalize-staged.service`
Deployments:
● pivot://quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:5a747cb0239e38057336ca8dc4fb528d9cbc93a333409d02151a7ade40aaa4a1
              CustomOrigin: Managed by machine-config-operator
                   Version: 46.82.202105291300-0 (2021-05-29T13:03:27Z)

  pivot://quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:5a747cb0239e38057336ca8dc4fb528d9cbc93a333409d02151a7ade40aaa4a1
              CustomOrigin: Managed by machine-config-operator
                   Version: 46.82.202105291300-0 (2021-05-29T13:03:27Z)
~~~

Version-Release number of selected component (if applicable):

Actual results:
--> Multiple worker nodes aren't upgrading.

Expected results:
--> Nodes should be upgraded without any workaround or manual intervention.

Additional info:
The customer wants to know why this issue is coming up.
A few nodes were fixed with the workaround. However, we observed that some nodes upgraded properly after applying the workaround and rebooting, while other nodes failed to reboot after applying the workaround: on those we had to force a reboot and then apply the workaround again. The node reboot was getting stuck with the following errors on the serial console logs:

~~~
[***   ] A stop job is running for OSTree Fi…d Deployment (9min 15s / 10min 2s)[137509.856750] libceph: mon0 (1)9.x.x.x:6789 socket error on read
[*     ] A stop job is running for OSTree Fi…d Deployment (9min 16s / 10min 2s)[137510.944757] libceph: mon1 (1)9.x.x.x:6789 socket error on read
[  *** ] A stop job is running for OSTree Fi…d Deployment (9min 18s / 10min 2s)[137512.800755] libceph: mon1 (1)9.x.x.x:6789 socket error on read
[    **] A stop job is running for OSTree Fi…d Deployment (9min 20s / 10min 2s)[137514.848750] libceph: mon1 (1)9.x.x.x:6789 socket closed (con state CONNECTING)
[*     ] A stop job is running for OSTree Fi…d Deployment (9min 23s / 10min 2s)[137517.792741] libceph: mon1 (1)9.x.x.x:6789 socket error on write
~~~

I found the following Bugzilla earlier for a similar issue, but decided to raise a separate one in case the issue is different. --> https://bugzilla.redhat.com/show_bug.cgi?id=1945274
Hello Team,

Any updates on this? I have increased the Bugzilla severity, as the customer is stuck with the upgrade and wants to know the RCA; that is why they are not applying the workaround, since the required logs will get flushed afterward.

Regards,
Ayush Garg
It looks like the problem may be that ceph is getting stuck during shutdown, and somehow this is actually wedging the ostree I/O process. This looks likely to be a ceph kernel bug. Can you verify a bit more about the block storage configuration here - is ceph only used to back Kubernetes PVs, or is it involved in the host storage?
Moving to Ceph for analysis.
It looks like the network was stopped before ceph was unmounted:

> Sep 18 03:14:55 e2n2.fbond systemd[1]: Stopped target Remote File Systems.
> ...
> Sep 18 03:14:56 e2n2.fbond systemd[1]: Stopped target Network (Pre).
> ...
> Sep 18 03:15:20 e2n2.fbond kernel: libceph: mon1 (1)9.255.165.242:6789 session lost, hunting for new mon

Not sure why ceph is not included in the remote file systems target. Roping in the ceph-csi folks, and apologies for kicking this bz into another component.
(In reply to Colin Walters from comment #3)
> It looks like the problem may be that ceph is getting stuck during shutdown,
> and somehow this is actually wedging the ostree I/O process. This looks
> likely to be a ceph kernel bug.
>
> Can you verify a bit more about the block storage configuration here - is
> ceph only used to back Kubernetes PVs, or is it involved in the host storage?

I have confirmed the ceph usage with the customer: it is used for Kubernetes PVs only.
*** Bug 2012257 has been marked as a duplicate of this bug. ***
It'd be nice to try to figure out a workaround for this without waiting for systemd. We may be able to inject an additional dependency into the generated mount unit or so?
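For illustration, one way to inject such a dependency is a systemd drop-in for the mount unit. This is only a sketch: the mount unit name below is hypothetical and must match the actual escaped mount path on the affected node (`systemd-escape --path --suffix=mount <path>` gives the real name).

```ini
# /etc/systemd/system/mnt-cephfs.mount.d/10-network-dep.conf
# Hypothetical drop-in sketch; the ".mount.d" directory name must match
# the real mount unit generated for the ceph mount point.
[Unit]
# Order the mount after the network is up. On shutdown, systemd reverses
# this ordering, so the mount is unmounted *before* the network is torn
# down, avoiding the hung libceph I/O seen on the serial console.
After=network-online.target
Wants=network-online.target
```

The same effect can usually be achieved without a drop-in by marking the mount as a network mount in its mount options, so that systemd places it under remote-fs.target automatically.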
OK, I put up an ostree-side workaround here: https://github.com/ostreedev/ostree/pull/2519

But it's still clearly better to have systemd fixed, because we need that in order to ensure that ceph can be cleanly unmounted.

And per the above comment, I do think it's probably possible for ceph to inject e.g. `x-systemd.after=network-online.target` into its fstab lines to force a network dependency. Or, alternatively, ceph could stop using fstab at all and generate its own mount units with its own dependencies.
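A sketch of what such an fstab line might look like. The monitor addresses, mount point, and credentials here are hypothetical, not taken from the affected cluster:

```ini
# /etc/fstab -- hypothetical kernel cephfs entry.
# "_netdev" tells systemd this is a network mount (ordered under
# remote-fs.target), and "x-systemd.after=network-online.target" adds
# an explicit ordering so the mount is released before network teardown
# on shutdown.
mon1,mon2,mon3:/  /mnt/cephfs  ceph  _netdev,x-systemd.after=network-online.target,name=admin  0 0
```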
Trying to run verification jobs for the OCP upgrade from OCP 4.10 to OCP 4.11 here:
https://ocs4-jenkins-csb-odf-qe.apps.ocp-c1.prod.psi.redhat.com/job/qe-deploy-ocs-cluster-prod/3435/

and also the ODF upgrade from 4.9 to a 4.10 build here:
https://ocs4-jenkins-csb-odf-qe.apps.ocp-c1.prod.psi.redhat.com/job/qe-deploy-ocs-cluster-prod/3436/

Karthick, I see you had some input in this BZ as well. Do you think we can just verify with those regression runs?

Thanks
I am marking this as verified based on the regression runs above, as we didn't see any issue like this.
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA.

For information on the advisory (Important: Red Hat OpenShift Data Foundation 4.10.0 enhancement, security & bug fix update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2022:1372
Hello Team,

I can see that the following Bugzilla, which is related to this one as it tracks the actual RHEL 8 bug, is still in the Verified state while this Bugzilla is closed. --> https://bugzilla.redhat.com/show_bug.cgi?id=2008825

Can you confirm whether the fix is in 4.10.0, or whether the issue is still unfixed? I am a bit confused, since the RHEL 8 Bugzilla is still in progress.

Regards,
Ayush Garg
The needinfo request[s] on this closed bug have been removed as they have been unresolved for 120 days