Bug 1970503 - OCS OSD's rebooting in a loop (OCS 4.8 + OCP 4.7.9)
Summary: OCS OSD's rebooting in a loop (OCS 4.8 + OCP 4.7.9)
Keywords:
Status: VERIFIED
Alias: None
Product: Red Hat OpenShift Container Storage
Classification: Red Hat Storage
Component: rook
Version: 4.8
Hardware: x86_64
OS: Unspecified
Priority: unspecified
Severity: high
Target Milestone: ---
Target Release: OCS 4.8.0
Assignee: Blaine Gardner
QA Contact: Ramakrishnan Periyasamy
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2021-06-10 15:17 UTC by Boaz
Modified: 2023-08-03 08:31 UTC
CC List: 11 users

Fixed In Version: 4.8.0-432.ci
Doc Type: No Doc Update
Doc Text:
Clone Of:
Environment:
Last Closed:
Embargoed:




Links
Github openshift/rook pull 261 (closed): Bug 1970503: sort multus annotation strings when applying (last updated 2021-06-30 05:53:40 UTC)
Github rook/rook pull 8142 (open): core: sort multus annotation strings when applying (last updated 2021-06-18 21:32:35 UTC)

Comment 2 Travis Nielsen 2021-06-10 19:19:43 UTC
If the OSDs are being restarted continuously one-by-one, there must be a change in the pod spec that the operator keeps reconciling. We have had similar issues in the past where something in the pod spec was changing order, which then caused the pod to restart. Could you capture a pod spec before and after the pod restart? Then we should be able to diff the changes in the pod spec to see the cause.
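
For example, capturing the OSD deployment spec (which is what the operator updates) before and after a restart and diffing the two copies should show what changed; a sketch, with rook-ceph-osd-0 as an example deployment name:

  oc get deployment rook-ceph-osd-0 -n openshift-storage -o yaml > osd-0-before.yaml
  (wait for the operator to reconcile and the pod to restart)
  oc get deployment rook-ceph-osd-0 -n openshift-storage -o yaml > osd-0-after.yaml
  diff osd-0-before.yaml osd-0-after.yaml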

Comment 4 Blaine Gardner 2021-06-14 15:58:45 UTC
@bbenshab for clarification, did the workaround you tried (copied below) stop the OSDs from restarting?

> oc scale --replicas=0 deploy/ocs-operator -n openshift-storage
> oc scale --replicas=0 deploy/rook-ceph-operator -n openshift-storage
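
For reference, the reverse of that workaround (assuming the original replica count was 1) would be:

  oc scale --replicas=1 deploy/rook-ceph-operator -n openshift-storage
  oc scale --replicas=1 deploy/ocs-operator -n openshift-storage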

Comment 5 Boaz 2021-06-14 17:44:50 UTC
yeap, by eliminating the operators the OSD's stop the OSDs from restarting in a loop, but I also lose the operator's functionality.

Comment 6 Boaz 2021-06-14 17:47:33 UTC
The last comment was a good example of a terrible copy & paste. What I was trying to say is: eliminating the operators stops the OSDs from restarting in a loop, but I also lose the operators' functionality.

Comment 7 Blaine Gardner 2021-06-14 18:20:52 UTC
@bbenshab I didn't see anything telling from the OSDs before/after. I need to see logs from the OCS cluster, but the must-gather I see is for OCP (and CNV for some reason). Can you add the OCS must-gather for me?
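
For the OCS logs, running the OCS-specific must-gather image should do it; something like the following (assuming the v4.8 image tag matches the installed version):

  oc adm must-gather --image=registry.redhat.io/ocs4/ocs-must-gather-rhel8:v4.8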

Comment 11 Jenifer Abrams 2021-06-17 19:25:27 UTC
While this seems like some reconcile/update type of issue, since the OSD restarts stop when the ocs & rook operators are scaled down, I did want to mention that OSD memory looks a little tight, in case that is of interest here.

From the toolbox:
osd_memory_target                    2684354560

Looking at pod metric: topk (20, sum(container_memory_rss{container!="POD",container!="",pod!=""}) by (pod))
I see some OSDs have reached up to 2.44G in the last week or so.
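
For reference, the target can be read from the rook-ceph toolbox with something like (osd.0 is just an example daemon id):

  ceph config get osd osd_memory_target
  ceph config show osd.0 osd_memory_target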

Comment 12 Blaine Gardner 2021-06-17 21:34:32 UTC
Thanks, Jenifer. I think an OSD hovering near the memory target is to be expected and indicates things are working correctly. If they start getting OOM-killed, that would be a different story though.

Good news though! I believe I tracked down the source of the issue which is addressed in upstream Rook here: https://github.com/rook/rook/pull/8142. We'll get this merged into openshift/rook as soon as we can. I hope the workaround Boaz was using is good for the next few days until the backport is done and there is an updated 4.8 image to test.
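
The symptom shows up in the OSD pods' multus annotation: the selected network names were being applied in a nondeterministic order, so each reconcile saw a "changed" pod spec and rolled the OSDs. A quick way to eyeball the annotation (a sketch):

  oc get pods -n openshift-storage -l app=rook-ceph-osd -o yaml | grep 'k8s.v1.cni.cncf.io/networks'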

Comment 13 Jenifer Abrams 2021-06-17 21:52:57 UTC
Ah, great find; interesting that there is a multus sort issue! We've seen some other stability issues (for ex. BZ1973317) after Boaz scaled the operators down. Do you think scaling down in the middle of one of these update rollouts could confuse the network config? Guessing probably not, but worth checking.

Comment 14 Blaine Gardner 2021-06-17 22:05:03 UTC
After reading through the BZ you linked, I don't suspect any correlation, especially since the operators are scaled to 0 and the OSDs aren't being restarted. That seems like a Ceph issue or perhaps an issue with how it's configured. 

But we should be able to confirm they are/aren't related once there is a 4.8 build with the upstream fix.

Comment 17 Blaine Gardner 2021-06-21 22:55:46 UTC
Fix merged into openshift fork in https://github.com/openshift/rook/pull/261.

Comment 19 Travis Nielsen 2021-06-22 02:57:35 UTC
Yes it really should be marked as a blocker. Without it, the OSDs in a multus cluster will continuously be restarted by the operator. Thanks for the reminder on waiting for full acks before merging at this point.

Comment 20 Mudit Agarwal 2021-06-22 04:01:44 UTC
Thanks Travis, proposing it as a blocker. Will ask QE to ack.

Comment 27 Jenifer Abrams 2021-07-06 17:31:53 UTC
We have not seen any OSD restarts since installing the new build. Fix looks good, thanks!

Comment 28 Ramakrishnan Periyasamy 2021-07-21 15:38:30 UTC
No OSD restarts observed in a scaled setup with 9 nodes, 2 OSDs, and 70 pods running with IO.
Performed restarts of master and worker nodes; no continuous OSD restarts observed.
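One quick way to watch for this is the RESTARTS column on the OSD pods, e.g.:

  oc get pods -n openshift-storage -l app=rook-ceph-osd -w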

Verified version:
OCS version: ocs-operator.v4.8.0-456.ci
OCP version: 4.7.0-0.nightly-2021-07-19-144858
ceph version: 14.2.11-184.el8cp

Moving this bz to verified state.

Comment 29 Shivam Durgbuns 2022-08-25 07:48:59 UTC
It's Multus-related, and we will not be able to run it in an ODF environment.

