If the OSDs are being restarted continuously one by one, something in the pod spec must be changing that the operator keeps reconciling. We have had similar issues in the past where the order of something in the pod spec kept changing, which then caused the pods to restart. Could you capture the pod spec before and after a pod restart? Then we should be able to diff the two specs to see the cause.
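For example, something along these lines should work (openshift-storage and the ceph-osd-id label are the usual rook-ceph conventions; adjust the OSD id as needed):

# capture the spec of one OSD pod before the operator restarts it
oc get pod -n openshift-storage -l ceph-osd-id=0 -o yaml > osd-0-before.yaml
# after the operator has recreated the pod, capture it again
oc get pod -n openshift-storage -l ceph-osd-id=0 -o yaml > osd-0-after.yaml
# diff the two captures to see what the operator keeps changing
diff osd-0-before.yaml osd-0-after.yaml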
Hey, below are the pod specs before & after the restarts:
http://perf148h.perf.lab.eng.bos.redhat.com/share/osds_reboot/osds_before_reboot/
http://perf148h.perf.lab.eng.bos.redhat.com/share/osds_reboot/osds_after_reboot/
10x.
@bbenshab for clarification, did the workaround you tried (copied below) stop the OSDs from restarting? > oc scale --replicas=0 deploy/ocs-operator -n openshift-storage > oc scale --replicas=0 deploy/rook-ceph-operator -n openshift-storage
yeap, by eliminating the operators the OSD's stop the OSDs from restarting in a loop, but I also lose the operator's functionality.
The last comment was a good example of a terrible copy & paste. What I was trying to say is: eliminating the operators stops the OSDs from restarting in a loop, but I also lose the operators' functionality.
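To get that functionality back later (once there is a fixed build to move to), the same deployments can simply be scaled back up, e.g.:

oc scale --replicas=1 deploy/ocs-operator -n openshift-storage
oc scale --replicas=1 deploy/rook-ceph-operator -n openshift-storage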
@bbenshab I didn't see anything telling from the OSDs before/after. I need to see logs from the OCS cluster, but the must-gather I see is for OCP (and CNV for some reason). Can you add the OCS must-gather for me?
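Something like the following should collect it (the image tag here is only an example for a 4.8 cluster; adjust as needed):

oc adm must-gather --image=registry.redhat.io/ocs4/ocs-must-gather-rhel8:v4.8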
sure: http://perf148h.perf.lab.eng.bos.redhat.com/share/osds_reboot/must-gather.local.5847861408606822530/
While this seems like some reconcile/update type of issue, since the OSD restarts stop when the ocs & rook operators are scaled down, I did want to mention that OSD memory looks a little tight, in case that is of interest here.
From the toolbox: osd_memory_target = 2684354560 (2.5 GiB).
Looking at the pod metric topk(20, sum(container_memory_rss{container!="POD",container!="",pod!=""}) by (pod)), I see some OSDs have reached up to 2.44G in the last week or so.
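In case anyone wants to double-check whether any OSD container has actually been OOM killed (as opposed to being recreated by the operator), the last termination reason can be inspected directly; a rough sketch, assuming the usual app=rook-ceph-osd label:

# print each OSD pod name and the reason its containers last terminated (OOMKilled would show up here)
oc get pods -n openshift-storage -l app=rook-ceph-osd \
  -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.status.containerStatuses[*].lastState.terminated.reason}{"\n"}{end}'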
Thanks, Jenifer. I think that OSD hovering near memory target is to be expected and indicates things are working correctly. If they start getting OOM killed, that would be a different story though. Good news though! I believe I tracked down the source of the issue which is addressed in upstream Rook here: https://github.com/rook/rook/pull/8142. We'll get this merged into openshift/rook as soon as we can. I hope the workaround Boaz was using is good for the next few days until the backport is done and there is an updated 4.8 image to test.
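For anyone who wants to confirm they are hitting the same thing: assuming the fix is about the ordering of the networks listed in the multus annotation on the OSD pods, that annotation should show its order flipping between operator reconciles, e.g.:

# print the multus networks annotation for each OSD pod (compare before/after a restart)
oc get pods -n openshift-storage -l app=rook-ceph-osd \
  -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.metadata.annotations.k8s\.v1\.cni\.cncf\.io/networks}{"\n"}{end}'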
Ah, great find, interesting that there is a multus sort issue! We've seen some other stability issues (for ex. BZ 1973317) after Boaz scaled the operators down. Do you think scaling them down in the middle of one of these update rollouts could confuse the network config? Guessing probably not, but worth checking.
After reading through the BZ you linked, I don't suspect any correlation, especially since the operators are scaled to 0 and the OSDs aren't being restarted. That seems like a Ceph issue or perhaps an issue with how it's configured. But we should be able to confirm they are/aren't related once there is a 4.8 build with the upstream fix.
Fix merged into openshift fork in https://github.com/openshift/rook/pull/261.
Yes it really should be marked as a blocker. Without it, the OSDs in a multus cluster will continuously be restarted by the operator. Thanks for the reminder on waiting for full acks before merging at this point.
Thanks Travis, proposing it as a blocker. Will ask QE to ack.
We have not seen any OSD restarts since installing the new build. Fix looks good, thanks!
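For anyone else verifying, a simple way to watch for further churn is to keep an eye on the OSD pods (again assuming the standard rook-ceph OSD label), e.g.:

# watch OSD pod ages/restarts; pods being recreated repeatedly would indicate the reconcile loop is back
oc get pods -n openshift-storage -l app=rook-ceph-osd -w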
Not observing OSD restarts in a scaled setup with 9 nodes, 2 OSDs, and 70 pods running with I/O. Performed restarts of master and worker nodes and did not observe continuous OSD restarts.
Verified versions:
OCS version: ocs-operator.v4.8.0-456.ci
OCP version: 4.7.0-0.nightly-2021-07-19-144858
ceph version: 14.2.11-184.el8cp
Moving this BZ to the verified state.
It's Multus-related, and we will not be able to run it in the ODF environment.