Bug 2003532
| Summary: | [Tracker for RHEL BZ #2008825] Node upgrade failed due to "expected target osImageURL" MCD error | | |
|---|---|---|---|
| Product: | [Red Hat Storage] Red Hat OpenShift Data Foundation | Reporter: | aygarg |
| Component: | unclassified | Assignee: | Yug Gupta <ygupta> |
| Status: | CLOSED ERRATA | QA Contact: | Petr Balogh <pbalogh> |
| Severity: | urgent | Docs Contact: | |
| Priority: | urgent | | |
| Version: | 4.6 | CC: | bniver, branto, ceph-eng-bugs, dornelas, jligon, kcleveng, kelwhite, kramdoss, madam, mrajanna, mrussell, muagarwa, musman, nberry, ndevos, nstielau, ocs-bugs, odf-bz-bot, rar, rcyriac, rhcos-triage, sostapov, tdesala, tmuthami, walters, ychoukse, ygupta |
| Target Milestone: | --- | Keywords: | Tracking |
| Target Release: | ODF 4.10.0 | | |
| Hardware: | Unspecified | | |
| OS: | Linux | | |
| Whiteboard: | | | |
| Fixed In Version: | | Doc Type: | No Doc Update |
| Doc Text: | | Story Points: | --- |
| Clone Of: | | | |
| : | 2008825 (view as bug list) | Environment: | |
| Last Closed: | 2022-04-13 18:49:43 UTC | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | | | |
| Bug Depends On: | 2008825 | | |
| Bug Blocks: | | | |
Description (aygarg, 2021-09-13 07:13:51 UTC)
Hello Team,

Any updates on this? I have increased the Bugzilla severity because the customer is stuck with the upgrade and wants to know the RCA; that is why they are not applying the workaround, since the required logs would get flushed afterward.

Regards,
Ayush Garg

It looks like the problem may be that ceph is getting stuck during shutdown, and somehow this is actually wedging the ostree I/O process. This looks likely to be a ceph kernel bug.

Can you verify a bit more about the block storage configuration here - is ceph only used to back Kubernetes PVs, or is it involved in the host storage?

Moving to Ceph for analysis.

It looks like the network was stopped before ceph was unmounted:
> Sep 18 03:14:55 e2n2.fbond systemd[1]: Stopped target Remote File Systems.
> ...
> Sep 18 03:14:56 e2n2.fbond systemd[1]: Stopped target Network (Pre).
> ...
> Sep 18 03:15:20 e2n2.fbond kernel: libceph: mon1 (1)9.255.165.242:6789 session lost, hunting for new mon
Not sure why ceph is not included in the remote file systems target (remote-fs.target). Roping in the ceph-csi folks, and apologies for kicking this BZ into another component.
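For context on how a mount normally ends up under remote-fs.target: systemd generally treats fstab entries carrying the `_netdev` option (or a recognized network filesystem type) as network mounts, hooking the generated mount unit into remote-fs.target and ordering it against the network, which also means it is unmounted before networking is stopped at shutdown. A minimal sketch, assuming a hypothetical CephFS mount with a placeholder monitor address, path, and secret file (none of these are taken from this cluster):

```
# /etc/fstab sketch (hypothetical entry, illustrative only)
# _netdev marks this as a network mount even if the fs type were not recognized as one,
# so the generated mount unit is hooked into remote-fs.target and ordered against the
# network, and therefore unmounted before the network is torn down at shutdown.
192.0.2.10:6789:/volumes/example  /var/lib/example-cephfs  ceph  name=admin,secretfile=/etc/ceph/example.secret,noatime,_netdev  0  0
```

`systemctl list-dependencies remote-fs.target` can be used to confirm which mount units are actually pulled in by that target on a given node.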
(In reply to Colin Walters from comment #3)
> It looks like the problem may be that ceph is getting stuck during shutdown, and somehow this is actually wedging the ostree I/O process. This looks likely to be a ceph kernel bug.
>
> Can you verify a bit more about the block storage configuration here - is ceph only used to back Kubernetes PVs, or is it involved in the host storage?

I have confirmed the ceph usage with the customer; it is used to back Kubernetes PVs only.

*** Bug 2012257 has been marked as a duplicate of this bug. ***

It'd be nice to figure out a workaround for this without waiting for systemd. We may be able to inject an additional dependency into the generated mount unit, or something along those lines?

OK, I put up an ostree-side workaround here: https://github.com/ostreedev/ostree/pull/2519

But it's still clearly better to have systemd fixed, because we need that in order to ensure that ceph can be cleanly unmounted. And per the above comment, I do think it's probably possible for ceph to inject e.g. `x-systemd.after=network-online.target` into its fstab lines to force a network dependency. Or, alternatively, ceph could stop using fstab at all and generate its own mount units with its own dependencies (see the sketch at the end of this report).

Trying to run verification jobs for the OCP upgrade from OCP 4.10 to OCP 4.11 here: https://ocs4-jenkins-csb-odf-qe.apps.ocp-c1.prod.psi.redhat.com/job/qe-deploy-ocs-cluster-prod/3435/ and also the ODF upgrade from 4.9 to the 4.10 build here: https://ocs4-jenkins-csb-odf-qe.apps.ocp-c1.prod.psi.redhat.com/job/qe-deploy-ocs-cluster-prod/3436/

Karthick, I see you had some input in this BZ as well. Do you think we can just test with those regression runs? Thanks.

I am marking this as verified based on the regression runs above, as we didn't see any issue like this.

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (Important: Red Hat OpenShift Data Foundation 4.10.0 enhancement, security & bug fix update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2022:1372

Hello Team,

I can see that the related Bugzilla, which tracks the actual RHEL 8 bug, is still in the Verified state while this Bugzilla is closed: https://bugzilla.redhat.com/show_bug.cgi?id=2008825

Can you confirm whether the fix is included in 4.10.0, or is the issue still not fixed? I am a bit confused since the RHEL 8 Bugzilla is still in progress.

Regards,
Ayush Garg

The needinfo request[s] on this closed bug have been removed as they have been unresolved for 120 days.
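To illustrate the alternative mentioned above (generating dedicated mount units instead of relying on fstab), here is a minimal sketch of a hypothetical systemd mount unit for a CephFS mount. The unit name, mount point, monitor address, and secret file are placeholders, not taken from this cluster; the explicit dependencies roughly correspond to what `_netdev` plus `x-systemd.after=network-online.target` would produce via fstab.

```ini
# /etc/systemd/system/var-lib-example\x2dcephfs.mount (hypothetical; the unit name
# must be the systemd-escaped form of the Where= path)
[Unit]
Description=Example CephFS mount (illustrative sketch only)
# Order the mount after the network so it comes up once the network is online and,
# more importantly for this bug, is unmounted before the network is torn down.
Wants=network-online.target
After=network-online.target remote-fs-pre.target
Before=remote-fs.target

[Mount]
What=192.0.2.10:6789:/volumes/example
Where=/var/lib/example-cephfs
Type=ceph
Options=name=admin,secretfile=/etc/ceph/example.secret,noatime,_netdev

[Install]
WantedBy=remote-fs.target
```

After installing such a unit, `systemctl daemon-reload` followed by `systemctl enable --now` on the (hypothetical) unit name would activate it and hook it into remote-fs.target.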