Description of problem:

The cluster upgrade was triggered from 4.6.32 to 4.6.42 (BareMetal UPI). All of the operators upgraded and became available. However, multiple worker nodes are stuck during the upgrade with the same errors described in the following KCS, and after multiple tries the issue gets resolved using the workaround from that KCS. --> https://access.redhat.com/solutions/5598401

~~~
$ omg logs machine-config-daemon-j5cbm -c machine-config-daemon
2021-09-18T04:40:27.692760894Z E0918 04:40:27.692748 4174 writer.go:135] Marking Degraded due to: unexpected on-disk state validating against rendered-worker-c96de46d04e9074a7ad22d26faa069b1: expected target osImageURL "quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:9f9926665f165d3fbc5f0e089b0bbff0a77440a7aac41da32a69db6a4b21b2cc", have "quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:5a747cb0239e38057336ca8dc4fb528d9cbc93a333409d02151a7ade40aaa4a1"
~~~

~~~
# rpm-ostree status
State: idle
Warning: failed to finalize previous deployment
         ostree-finalize-staged.service: Failed with result 'timeout'.
         check `journalctl -b -1 -u ostree-finalize-staged.service`
Deployments:
● pivot://quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:5a747cb0239e38057336ca8dc4fb528d9cbc93a333409d02151a7ade40aaa4a1
              CustomOrigin: Managed by machine-config-operator
                   Version: 46.82.202105291300-0 (2021-05-29T13:03:27Z)

  pivot://quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:5a747cb0239e38057336ca8dc4fb528d9cbc93a333409d02151a7ade40aaa4a1
              CustomOrigin: Managed by machine-config-operator
                   Version: 46.82.202105291300-0 (2021-05-29T13:03:27Z)
~~~

Version-Release number of selected component (if applicable):

Actual results:
--> Multiple worker nodes aren't upgrading.

Expected results:
--> Nodes should be upgraded without any workaround or manual intervention.

Additional info:
The customer wants to know why this issue is coming up.
A few nodes were fixed with the workaround. However, we observed that some nodes upgraded properly after applying the workaround and rebooting, while other nodes failed to reboot after applying the workaround: on those we had to force a reboot and then apply the workaround again. The node reboot was getting stuck with the following errors on the serial console logs:

~~~
[***   ] A stop job is running for OSTree Fi…d Deployment (9min 15s / 10min 2s)[137509.856750] libceph: mon0 (1)9.x.x.x:6789 socket error on read
[*     ] A stop job is running for OSTree Fi…d Deployment (9min 16s / 10min 2s)[137510.944757] libceph: mon1 (1)9.x.x.x:6789 socket error on read
[  *** ] A stop job is running for OSTree Fi…d Deployment (9min 18s / 10min 2s)[137512.800755] libceph: mon1 (1)9.x.x.x:6789 socket error on read
[    **] A stop job is running for OSTree Fi…d Deployment (9min 20s / 10min 2s)[137514.848750] libceph: mon1 (1)9.x.x.x:6789 socket closed (con state CONNECTING)
[*     ] A stop job is running for OSTree Fi…d Deployment (9min 23s / 10min 2s)[137517.792741] libceph: mon1 (1)9.x.x.x:6789 socket error on write
~~~

I found the following Bugzilla earlier for a similar issue, but decided to raise a separate one in case the issue is different. --> https://bugzilla.redhat.com/show_bug.cgi?id=1945274
Hello Team,

Any updates on this? I have increased the Bugzilla severity, as the customer is stuck with the upgrade and wants to know the RCA; that is why they are not applying the workaround, since the required logs will get flushed afterward.

Regards,
Ayush Garg
It looks like the problem may be that ceph is getting stuck during shutdown, and somehow this is actually wedging the ostree I/O process. This looks likely to be a ceph kernel bug. Can you verify a bit more about the block storage configuration here - is ceph only used to back Kubernetes PVs, or is it involved in the host storage?
Moving to Ceph for analysis.
It looks like the network was stopped before ceph was unmounted:

> Sep 18 03:14:55 e2n2.fbond systemd[1]: Stopped target Remote File Systems.
> ...
> Sep 18 03:14:56 e2n2.fbond systemd[1]: Stopped target Network (Pre).
> ...
> Sep 18 03:15:20 e2n2.fbond kernel: libceph: mon1 (1)9.255.165.242:6789 session lost, hunting for new mon

Not sure why ceph is not included in the remote file systems target. Roping in the ceph-csi folks, and apologies for kicking this bz into another component.
(In reply to Colin Walters from comment #3)
> It looks like the problem may be that ceph is getting stuck during shutdown,
> and somehow this is actually wedging the ostree I/O process. This looks
> likely to be a ceph kernel bug.
>
> Can you verify a bit more about the block storage configuration here - is
> ceph only used to back Kubernetes PVs, or is it involved in the host storage?

I have confirmed the ceph usage with the customer: it is used for Kubernetes PVs only.
*** Bug 2012257 has been marked as a duplicate of this bug. ***
It'd be nice to try to figure out a workaround for this without waiting for systemd. We may be able to inject an additional dependency into the generated mount unit or so?
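For illustration, one way to inject such a dependency is a systemd drop-in for the mount unit. This is only a sketch: the mount unit name below is hypothetical and must match the actual escaped mount path on the affected node (`systemd-escape --path --suffix=mount <path>` gives the real name).

```ini
# /etc/systemd/system/mnt-cephfs.mount.d/10-network-dep.conf
# Hypothetical drop-in sketch; the ".mount.d" directory name must match
# the real mount unit generated for the ceph mount point.
[Unit]
# Order the mount after the network is up. On shutdown, systemd reverses
# this ordering, so the mount is unmounted *before* the network is torn
# down, avoiding the hung libceph I/O seen on the serial console.
After=network-online.target
Wants=network-online.target
```

The same effect can usually be achieved without a drop-in by marking the mount as a network mount in its mount options, so that systemd places it under remote-fs.target automatically.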
OK, I put up an ostree-side workaround here: https://github.com/ostreedev/ostree/pull/2519

But it's still clearly better to have systemd fixed, because we need that in order to ensure that ceph can be cleanly unmounted.

And per the above comment, I do think it's probably possible for ceph to inject e.g. `x-systemd.after=network-online.target` into its fstab lines to force a network dependency. Or, alternatively, ceph could stop using fstab at all and generate its own mount units with its own dependencies.
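A sketch of what such an fstab line might look like. The monitor addresses, mount point, and credentials here are hypothetical, not taken from the affected cluster:

```ini
# /etc/fstab -- hypothetical kernel cephfs entry.
# "_netdev" tells systemd this is a network mount (ordered under
# remote-fs.target), and "x-systemd.after=network-online.target" adds
# an explicit ordering so the mount is released before network teardown
# on shutdown.
mon1,mon2,mon3:/  /mnt/cephfs  ceph  _netdev,x-systemd.after=network-online.target,name=admin  0 0
```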
Trying to run verification jobs for the OCP upgrade from OCP 4.10 to OCP 4.11 here:
https://ocs4-jenkins-csb-odf-qe.apps.ocp-c1.prod.psi.redhat.com/job/qe-deploy-ocs-cluster-prod/3435/

and also the ODF upgrade from 4.9 to a 4.10 build here:
https://ocs4-jenkins-csb-odf-qe.apps.ocp-c1.prod.psi.redhat.com/job/qe-deploy-ocs-cluster-prod/3436/

Karthick, I see you had some input in this BZ as well. Do you think we can just verify with those regression runs?

Thanks
I am marking this as verified based on the regression runs above, as we didn't see any issue like this.
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA.

For information on the advisory (Important: Red Hat OpenShift Data Foundation 4.10.0 enhancement, security & bug fix update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2022:1372
Hello Team,

I can see that the following Bugzilla, which is related to this one as it tracks the actual RHEL 8 bug, is still in the Verified state while this Bugzilla is closed. --> https://bugzilla.redhat.com/show_bug.cgi?id=2008825

Can you confirm whether the fix is in 4.10.0, or whether the issue is still unfixed? I am a bit confused, since the RHEL 8 Bugzilla is still in progress.

Regards,
Ayush Garg
The needinfo request[s] on this closed bug have been removed as they have been unresolved for 120 days