If the OSDs are being restarted continuously one by one, something in the pod spec must be changing that the operator keeps reconciling. We have had similar issues in the past where the order of something in the pod spec kept changing, which then caused the pods to restart. Could you capture the pod spec before and after a pod restart? Then we should be able to diff the two specs to see the cause.
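For example, something along these lines should work (openshift-storage and the ceph-osd-id label are the usual rook-ceph conventions; adjust the OSD id as needed):

# capture the spec of one OSD pod before the operator restarts it
oc get pod -n openshift-storage -l ceph-osd-id=0 -o yaml > osd-0-before.yaml
# after the operator has recreated the pod, capture it again
oc get pod -n openshift-storage -l ceph-osd-id=0 -o yaml > osd-0-after.yaml
# diff the two captures to see what the operator keeps changing
diff osd-0-before.yaml osd-0-after.yaml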
Hey, below are the pod specs before & after the restarts:
http://perf148h.perf.lab.eng.bos.redhat.com/share/osds_reboot/osds_before_reboot/
http://perf148h.perf.lab.eng.bos.redhat.com/share/osds_reboot/osds_after_reboot/
10x.
@bbenshab for clarification, did the workaround you tried (copied below) stop the OSDs from restarting? > oc scale --replicas=0 deploy/ocs-operator -n openshift-storage > oc scale --replicas=0 deploy/rook-ceph-operator -n openshift-storage
yeap, by eliminating the operators the OSD's stop the OSDs from restarting in a loop, but I also lose the operator's functionality.
The last comment was a good example of a terrible copy & paste. What I was trying to say is: eliminating the operators stops the OSDs from restarting in a loop, but I also lose the operators' functionality.
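To get that functionality back later (once there is a fixed build to move to), the same deployments can simply be scaled back up, e.g.:

oc scale --replicas=1 deploy/ocs-operator -n openshift-storage
oc scale --replicas=1 deploy/rook-ceph-operator -n openshift-storage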
@bbenshab I didn't see anything telling from the OSDs before/after. I need to see logs from the OCS cluster, but the must-gather I see is for OCP (and CNV for some reason). Can you add the OCS must-gather for me?
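Something like the following should collect it (the image tag here is only an example for a 4.8 cluster; adjust as needed):

oc adm must-gather --image=registry.redhat.io/ocs4/ocs-must-gather-rhel8:v4.8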
sure: http://perf148h.perf.lab.eng.bos.redhat.com/share/osds_reboot/must-gather.local.5847861408606822530/
While this seems like some reconcile/update type of issue, since the OSD restarts stop when the ocs & rook operators are scaled down, I did want to mention that OSD memory looks a little tight, in case that is of interest here.
From the toolbox: osd_memory_target = 2684354560 (2.5 GiB).
Looking at the pod metric topk(20, sum(container_memory_rss{container!="POD",container!="",pod!=""}) by (pod)), I see some OSDs have reached up to 2.44G in the last week or so.
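In case anyone wants to double-check whether any OSD container has actually been OOM killed (as opposed to being recreated by the operator), the last termination reason can be inspected directly; a rough sketch, assuming the usual app=rook-ceph-osd label:

# print each OSD pod name and the reason its containers last terminated (OOMKilled would show up here)
oc get pods -n openshift-storage -l app=rook-ceph-osd \
  -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.status.containerStatuses[*].lastState.terminated.reason}{"\n"}{end}'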
Thanks, Jenifer. I think that OSD hovering near memory target is to be expected and indicates things are working correctly. If they start getting OOM killed, that would be a different story though. Good news though! I believe I tracked down the source of the issue which is addressed in upstream Rook here: https://github.com/rook/rook/pull/8142. We'll get this merged into openshift/rook as soon as we can. I hope the workaround Boaz was using is good for the next few days until the backport is done and there is an updated 4.8 image to test.
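For anyone who wants to confirm they are hitting the same thing: assuming the fix is about the ordering of the networks listed in the multus annotation on the OSD pods, that annotation should show its order flipping between operator reconciles, e.g.:

# print the multus networks annotation for each OSD pod (compare before/after a restart)
oc get pods -n openshift-storage -l app=rook-ceph-osd \
  -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.metadata.annotations.k8s\.v1\.cni\.cncf\.io/networks}{"\n"}{end}'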
Ah, great find, interesting that there is a multus sort issue! We've seen some other stability issues (for ex. BZ 1973317) after Boaz scaled the operators down. Do you think scaling them down in the middle of one of these update rollouts could confuse the network config? Guessing probably not, but worth checking.
After reading through the BZ you linked, I don't suspect any correlation, especially since the operators are scaled to 0 and the OSDs aren't being restarted. That seems like a Ceph issue or perhaps an issue with how it's configured. But we should be able to confirm they are/aren't related once there is a 4.8 build with the upstream fix.
Fix merged into openshift fork in https://github.com/openshift/rook/pull/261.
Yes it really should be marked as a blocker. Without it, the OSDs in a multus cluster will continuously be restarted by the operator. Thanks for the reminder on waiting for full acks before merging at this point.
Thanks Travis, proposing it as a blocker. Will ask QE to ack.
We have not seen any OSD restarts since installing the new build. Fix looks good, thanks!
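For anyone else verifying, a simple way to watch for further churn is to keep an eye on the OSD pods (again assuming the standard rook-ceph OSD label), e.g.:

# watch OSD pod ages/restarts; pods being recreated repeatedly would indicate the reconcile loop is back
oc get pods -n openshift-storage -l app=rook-ceph-osd -w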
Not observing OSD restarts in a scaled setup with 9 nodes, 2 OSDs, and 70 pods running with I/O. Performed restarts of master and worker nodes and did not observe continuous OSD restarts.
Verified versions:
OCS version: ocs-operator.v4.8.0-456.ci
OCP version: 4.7.0-0.nightly-2021-07-19-144858
ceph version: 14.2.11-184.el8cp
Moving this BZ to the verified state.
It's Multus-related, and we will not be able to run it in the ODF environment.