Created attachment 1757046 [details]
must gather logs

Description of problem (please be detailed as possible and provide log snippets):
rook-ceph-rgw pods restart continuously with OCS version 4.6.3 due to liveness probe failures.

Version of all relevant components (if applicable):
OCS 4.6.3-261.ci

Does this issue impact your ability to continue to work with the product (please explain in detail what is the user impact)?
Yes, the rook-ceph-rgw pods cannot be used.

Is there any workaround available to the best of your knowledge?
No

Rate from 1 - 5 the complexity of the scenario you performed that caused this bug (1 - very simple, 5 - very complex)?

Is this issue reproducible?
Yes

Can this issue be reproduced from the UI?

If this is a regression, please provide more details to justify this:

Steps to Reproduce:
1. Install the Storage operator (4.6.3) + Local Storage Operator (4.6) and create the OpenShift storage cluster
2. Observe the rook-ceph-rgw pods

Actual results:
rook-ceph-rgw pods restart continuously.

NAME                                                              READY   STATUS             RESTARTS   AGE
rook-ceph-rgw-ocs-storagecluster-cephobjectstore-a-789f846hx2x7   0/1     CrashLoopBackOff   23         75m
rook-ceph-rgw-ocs-storagecluster-cephobjectstore-b-54c687c656jf   0/1     CrashLoopBackOff   23         75m

----
Events:
  Type     Reason          Age                 From               Message
  ----     ------          ----                ----               -------
  Normal   Scheduled       2m11s               default-scheduler  Successfully assigned openshift-storage/rook-ceph-rgw-ocs-storagecluster-cephobjectstore-a-789f846pszwq to worker-1.m1301015ocs.lnxne.boe
  Normal   AddedInterface  2m10s               multus             Add eth0 [10.131.0.53/23]
  Normal   Pulled          2m10s               kubelet            Container image "quay.io/rhceph-dev/rhceph@sha256:286db60b15548e662bfbf0a7033dc1d73cc93addd64bcdb4d92a4d4802c76f9e" already present on machine
  Normal   Created         2m9s                kubelet            Created container chown-container-data-dir
  Normal   Started         2m9s                kubelet            Started container chown-container-data-dir
  Normal   Pulled          60s (x2 over 2m8s)  kubelet            Container image "quay.io/rhceph-dev/rhceph@sha256:286db60b15548e662bfbf0a7033dc1d73cc93addd64bcdb4d92a4d4802c76f9e" already present on machine
  Normal   Created         60s (x2 over 2m8s)  kubelet            Created container rgw
  Normal   Started         60s (x2 over 2m8s)  kubelet            Started container rgw
  Warning  Unhealthy       30s (x6 over 110s)  kubelet            Liveness probe failed: Get "http://10.131.0.53:8080/swift/healthcheck": dial tcp 10.131.0.53:8080: connect: connection refused
  Normal   Killing         30s (x2 over 90s)   kubelet            Container rgw failed liveness probe, will be restarted
----

Expected results:
All pods run without errors

Additional info:
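A minimal sketch of how the failing probe can be inspected and exercised by hand. The probe path, port, pod IP, and the "rgw" container name all come from the events above; whether exec/curl works while the container is in CrashLoopBackOff, and curl being present in the rhceph image, are assumptions:

----------------
# Show how the liveness probe is configured on the rgw pod
oc -n openshift-storage describe pod rook-ceph-rgw-ocs-storagecluster-cephobjectstore-a-789f846hx2x7 | grep -A3 Liveness

# Check whether anything is listening on the probe endpoint at all (IP/port taken from the events above)
oc -n openshift-storage exec rook-ceph-rgw-ocs-storagecluster-cephobjectstore-a-789f846hx2x7 -c rgw -- curl -sS http://10.131.0.53:8080/swift/healthcheck

# Capture the rgw container log from the previous (crashed) run
oc -n openshift-storage logs rook-ceph-rgw-ocs-storagecluster-cephobjectstore-a-789f846hx2x7 -c rgw --previous
----------------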
Jiffin PTAL
Travis, is this like https://bugzilla.redhat.com/show_bug.cgi?id=1926617? Would you please make a recommendation on how to proceed here?
Christina, I think the RGW issue in https://bugzilla.redhat.com/show_bug.cgi?id=1926617 is a side-effect of the flapping OSDs.
Abdul, we need to get the OSDs in a healthy state first. Are you able to get a cluster running on 4.6.3 where the PGs are showing as active+clean in the ceph status? Usually they are not active+clean if the OSD pods are failing to start, although Jiffin's analysis above shows that the 3 OSDs are all "up" and "in". For troubleshooting unhealthy PGs from the toolbox, this topic may also have some pointers: https://docs.ceph.com/en/nautilus/rados/operations/placement-groups/
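For reference, a minimal sketch of the checks that can be run from the toolbox pod to confirm PG and OSD health; the toolbox pod name is a placeholder, and the choice of commands is my suggestion rather than a prescribed procedure:

----------------
# Overall cluster and PG state
oc -n openshift-storage exec <toolbox-pod> -- ceph status
oc -n openshift-storage exec <toolbox-pod> -- ceph health detail

# Per-PG and per-OSD detail if anything is not active+clean
oc -n openshift-storage exec <toolbox-pod> -- ceph pg stat
oc -n openshift-storage exec <toolbox-pod> -- ceph osd tree
----------------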
Hi Mark,

OSDs are in a healthy state. There are two issues in my cluster as of now:
1. rook-ceph-rgw pods restarting continuously (which is what this bug tracks)
2. OSD restarts due to OOM, which I think is related to bug https://bugzilla.redhat.com/show_bug.cgi?id=1917815 (a sketch of how to confirm the OOM kills follows the output below)

----------------
[root@m1301015 ~]# oc -n openshift-storage exec rook-ceph-tools-6fdd868f75-686g4 -- ceph health
HEALTH_OK
[root@m1301015 ~]# oc -n openshift-storage exec rook-ceph-tools-6fdd868f75-686g4 -- ceph -s
  cluster:
    id:     c6fb52bc-04b0-43f7-8890-a1a7e4f69bee
    health: HEALTH_OK

  services:
    mon: 3 daemons, quorum a,b,c (age 3d)
    mgr: a(active, since 3d)
    mds: ocs-storagecluster-cephfilesystem:1 {0=ocs-storagecluster-cephfilesystem-a=up:active} 1 up:standby-replay
    osd: 3 osds: 3 up (since 3d), 3 in (since 4d)

  task status:
    scrub status:
        mds.ocs-storagecluster-cephfilesystem-a: idle
        mds.ocs-storagecluster-cephfilesystem-b: idle

  data:
    pools:   10 pools, 176 pgs
    objects: 30.71k objects, 117 GiB
    usage:   345 GiB used, 2.7 TiB / 3 TiB avail
    pgs:     176 active+clean

  io:
    client:   3.0 KiB/s rd, 7.7 KiB/s wr, 3 op/s rd, 0 op/s wr

[root@m1301015 ~]# oc -n openshift-storage get csv
NAME                         DISPLAY                       VERSION        REPLACES                     PHASE
ocs-operator.v4.6.3-267.ci   OpenShift Container Storage   4.6.3-267.ci   ocs-operator.v4.6.2-233.ci   Succeeded

[root@m1301015 ~]# oc -n openshift-storage get po
NAME                                                              READY   STATUS             RESTARTS   AGE
csi-cephfsplugin-589zx                                            3/3     Running            0          3d22h
csi-cephfsplugin-9dp8l                                            3/3     Running            0          3d22h
csi-cephfsplugin-jdtk4                                            3/3     Running            0          3d22h
csi-cephfsplugin-lmrfq                                            3/3     Running            0          3d22h
csi-cephfsplugin-provisioner-86bd8cb497-9pd7j                     6/6     Running            0          3d22h
csi-cephfsplugin-provisioner-86bd8cb497-d9z2k                     6/6     Running            0          3d22h
csi-rbdplugin-7549z                                               3/3     Running            0          3d22h
csi-rbdplugin-p5qnw                                               3/3     Running            0          3d22h
csi-rbdplugin-provisioner-6db77bb448-wdfxf                        6/6     Running            0          3d22h
csi-rbdplugin-provisioner-6db77bb448-zb8nx                        6/6     Running            0          3d22h
csi-rbdplugin-sg8vc                                               3/3     Running            0          3d22h
csi-rbdplugin-sqd7n                                               3/3     Running            0          3d22h
noobaa-core-0                                                     1/1     Running            0          3d22h
noobaa-db-0                                                       1/1     Running            0          3d22h
noobaa-endpoint-8f88646bf-75fnz                                   1/1     Running            0          3d22h
noobaa-operator-66dbcf8698-2nv4d                                  1/1     Running            0          3d22h
ocs-metrics-exporter-758dd99c98-67kbn                             1/1     Running            0          3d22h
ocs-operator-58f8bb8dd8-5sppp                                     1/1     Running            0          3d22h
rook-ceph-crashcollector-worker-0.m1301015ocs.lnxne.boe-5dlvx7h   1/1     Running            0          3d22h
rook-ceph-crashcollector-worker-1.m1301015ocs.lnxne.boe-f8ftlzz   1/1     Running            0          3d22h
rook-ceph-crashcollector-worker-2.m1301015ocs.lnxne.boe-7czlgrj   1/1     Running            0          3d22h
rook-ceph-mds-ocs-storagecluster-cephfilesystem-a-55f64484kmjf6   1/1     Running            0          3d22h
rook-ceph-mds-ocs-storagecluster-cephfilesystem-b-69c57477v79tq   1/1     Running            0          3d22h
rook-ceph-mgr-a-6bcc6954cf-l9k72                                  1/1     Running            0          3d22h
rook-ceph-mon-a-65b67f57dd-cmt4s                                  1/1     Running            0          3d22h
rook-ceph-mon-b-57d7fc55bc-jzqp5                                  1/1     Running            0          3d22h
rook-ceph-mon-c-685bf9c59d-746rj                                  1/1     Running            0          3d22h
rook-ceph-operator-88555d7c-2qs6p                                 1/1     Running            0          3d22h
rook-ceph-osd-0-84d767f446-nhmw4                                  1/1     Running            5          3d22h
rook-ceph-osd-1-6c4784b457-ssbdt                                  1/1     Running            4          3d22h
rook-ceph-osd-2-78975b7f45-6jsqm                                  1/1     Running            2          3d22h
rook-ceph-osd-prepare-ocs-deviceset-0-data-0-6r8t2-dj7cd          0/1     Completed          0          4d13h
rook-ceph-osd-prepare-ocs-deviceset-1-data-0-7rkzh-bdxxl          0/1     Completed          0          4d13h
rook-ceph-osd-prepare-ocs-deviceset-2-data-0-dsvxq-22x25          0/1     Completed          0          4d13h
rook-ceph-rgw-ocs-storagecluster-cephobjectstore-a-56b64f8kcwct   0/1     CrashLoopBackOff   1787       3d22h
rook-ceph-rgw-ocs-storagecluster-cephobjectstore-b-74dc9f947j94   0/1     CrashLoopBackOff   1833       3d22h
rook-ceph-tools-6fdd868f75-686g4                                  1/1     Running            0          4d13h
worker-0m1301015ocslnxneboe-debug                                 0/1     Completed          0          4d13h
worker-1m1301015ocslnxneboe-debug                                 0/1     Completed          0          4d13h
worker-2m1301015ocslnxneboe-debug                                 0/1     Completed          0          4d13h
[root@m1301015 ~]#
----------------
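As a side note on issue 2 above, a minimal sketch of how the OOM-driven OSD restarts can be confirmed; the OSD pod name is taken from the output above, and the exact fields shown by describe/top on this cluster are assumptions:

----------------
# "Last State" should show Reason: OOMKilled if the container was killed by the kernel OOM killer
oc -n openshift-storage describe pod rook-ceph-osd-0-84d767f446-nhmw4 | grep -A5 'Last State'

# Compare current OSD memory usage against the configured requests/limits
oc -n openshift-storage get pod rook-ceph-osd-0-84d767f446-nhmw4 -o jsonpath='{.spec.containers[*].resources}'
oc adm top pod -n openshift-storage rook-ceph-osd-0-84d767f446-nhmw4
----------------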
This one is different from the old cluster reported in https://bugzilla.redhat.com/show_bug.cgi?id=1928642#c0. Can you please share the logs from the rgw pods and the rook operator pod, and the output of "oc -n openshift-storage describe pods <rgw pods>" as well?
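A minimal sketch of the commands that would collect the requested information; pod names are placeholders, and using --previous to capture the log of the last crashed run is my suggestion:

----------------
# rgw pod logs, including the previous (crashed) container instance
oc -n openshift-storage logs <rgw-pod> -c rgw
oc -n openshift-storage logs <rgw-pod> -c rgw --previous

# rook operator log
oc -n openshift-storage logs <rook-ceph-operator-pod>

# full pod description, including events and probe configuration
oc -n openshift-storage describe pods <rgw-pod>
----------------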
Created attachment 1758795 [details]
requested data

Please find the requested info attached.
Please see the attachment in comment https://bugzilla.redhat.com/show_bug.cgi?id=1928642#c11 for all the info requested in https://bugzilla.redhat.com/show_bug.cgi?id=1928642#c10. Is any further information needed?
(In reply to Matt Benjamin (redhat) from comment #13)
> Hi Jiffin,
>
> I'm confused how this would happen.
>
> This error implies that the beast front-end /was not selected to be built/
> when this system-z build was run.
>
> The Beast frontend is built unless explicitly disabled (top CMakeLists.txt),
> so if there was any problem, it should have presented itself as a
> compilation error, not a runtime error.
>
> Can we get some help from the folks who do the system-z builds, please?
>
> Matt

Based on the above, Christina, can you assist with this issue since it seems related to the build? Maybe Boris or Ken? I'm not sure... Thanks!

I'm moving this out of Rook since it is not a Rook problem. I've picked the build component.
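To double-check Matt's point at runtime, one option (an assumption on my side, not something that was run here) is to look at how the rgw frontend is configured on the deployment and what the crashing container actually logs about it; the deployment name is inferred from the pod name above:

----------------
# Check how the rgw frontend is set up (assumption: Rook exposes it in the pod spec/args)
oc -n openshift-storage get deployment rook-ceph-rgw-ocs-storagecluster-cephobjectstore-a -o yaml | grep -i frontend

# Any error about an unavailable/unbuilt frontend should appear in the rgw log of the crashed run
oc -n openshift-storage logs <rgw-pod> -c rgw --previous | grep -i -e frontend -e beast
----------------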
In bug 1917592, I found that I accidentally set WITH_BOOST_CONTEXT=OFF for s390x in the RHCS 4.2 Ceph build. This was a regression from RHCS 4.1, and we plan to ship that fix in RHCS 4.2 z1.
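For anyone verifying the build side, a minimal sketch of where the flag can be checked; the dist-git file layout and exact flag spelling in the internal build are assumptions, while the top-level CMakeLists.txt reference comes from Matt's comment above:

----------------
# In a checkout of the Ceph dist-git branch used for the RHCS 4.2 build:
grep -n -i 'BOOST_CONTEXT' ceph.spec

# In the upstream source, the Beast/Boost.Context handling lives in the top-level CMakeLists.txt:
grep -n -i -e 'BOOST_CONTEXT' -e 'BEAST' CMakeLists.txt
----------------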
Is there a BZ tracking the fix mentioned in https://bugzilla.redhat.com/show_bug.cgi?id=1928642#c18?
My bad, please ignore the above comment. Moving this BZ to MODIFIED, as the fix is already in RHCS 4.2z1.
Yes, this was fixed early in 4.2z1 and has been in place for a while. All the 4.2z1-based 4.7.0 builds should have the fix in them.
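A minimal sketch of how to confirm which Ceph build a running cluster is actually using, so the fix can be tied to a concrete version; the toolbox pod name is a placeholder:

----------------
# Ceph version reported per daemon type, including the rgw daemons once they are up
oc -n openshift-storage exec <toolbox-pod> -- ceph versions

# Ceph version baked into the container image itself
oc -n openshift-storage exec <toolbox-pod> -- ceph --version
----------------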
@akandath @tstober can this be moved to VERIFIED?
I just installed OCS 4.7.0-324.ci on OCP 4.7.2 and I no longer see the "rook-ceph-rgw" pod restarts.

[root@m1301015 ~]# oc -n openshift-storage get csv
NAME                         DISPLAY                       VERSION        REPLACES   PHASE
ocs-operator.v4.7.0-324.ci   OpenShift Container Storage   4.7.0-324.ci              Succeeded
[root@m1301015 ~]#
[root@m1301015 ~]# oc -n openshift-storage get po
NAME                                                              READY   STATUS      RESTARTS   AGE
csi-cephfsplugin-5wgkx                                            3/3     Running     0          72m
csi-cephfsplugin-6lx6g                                            3/3     Running     0          72m
csi-cephfsplugin-cc8vm                                            3/3     Running     0          72m
csi-cephfsplugin-j6kjc                                            3/3     Running     0          72m
csi-cephfsplugin-provisioner-76b7c894b9-6z6wk                     6/6     Running     0          72m
csi-cephfsplugin-provisioner-76b7c894b9-xq9j2                     6/6     Running     0          72m
csi-rbdplugin-8z2dp                                               3/3     Running     0          72m
csi-rbdplugin-d8fmx                                               3/3     Running     0          72m
csi-rbdplugin-gxvmj                                               3/3     Running     0          72m
csi-rbdplugin-lmd4q                                               3/3     Running     0          72m
csi-rbdplugin-provisioner-5866f86d44-96wlm                        6/6     Running     0          72m
csi-rbdplugin-provisioner-5866f86d44-k5f54                        6/6     Running     0          72m
noobaa-core-0                                                     1/1     Running     0          69m
noobaa-db-pg-0                                                    1/1     Running     0          69m
noobaa-endpoint-86cffb6848-shn22                                  1/1     Running     0          67m
noobaa-operator-fb44b58b6-rhj24                                   1/1     Running     0          74m
ocs-metrics-exporter-5549d7f894-wgdkw                             1/1     Running     0          74m
ocs-operator-6b76fb4dff-dsjq2                                     1/1     Running     0          74m
rook-ceph-crashcollector-worker-0.m1301015ocs.lnxne.boe-85fzbwh   1/1     Running     0          69m
rook-ceph-crashcollector-worker-1.m1301015ocs.lnxne.boe-75tptnm   1/1     Running     0          70m
rook-ceph-crashcollector-worker-2.m1301015ocs.lnxne.boe-7dvx4cc   1/1     Running     0          70m
rook-ceph-crashcollector-worker-3.m1301015ocs.lnxne.boe-577vm45   1/1     Running     0          71m
rook-ceph-mds-ocs-storagecluster-cephfilesystem-a-5757dc86mvp6g   2/2     Running     0          69m
rook-ceph-mds-ocs-storagecluster-cephfilesystem-b-6cc95ff82j7sb   2/2     Running     0          69m
rook-ceph-mgr-a-689cf446d-mc7st                                   2/2     Running     0          70m
rook-ceph-mon-a-7dccdf8df4-gv8rc                                  2/2     Running     0          72m
rook-ceph-mon-b-9bb666f56-dp8qk                                   2/2     Running     0          70m
rook-ceph-mon-c-57569f8c98-8ngzq                                  2/2     Running     0          70m
rook-ceph-operator-7bd78b8dff-55qzx                               1/1     Running     0          74m
rook-ceph-osd-0-74b9845b7f-sgbx9                                  2/2     Running     0          69m
rook-ceph-osd-1-cc884cd6-qqjr7                                    2/2     Running     0          69m
rook-ceph-osd-2-7574949756-fhq9l                                  2/2     Running     0          69m
rook-ceph-osd-prepare-ocs-deviceset-0-data-0d59dj-m4n4p           0/1     Completed   0          70m
rook-ceph-osd-prepare-ocs-deviceset-1-data-0tgxjx-t7ffn           0/1     Completed   0          70m
rook-ceph-osd-prepare-ocs-deviceset-2-data-0wgvg9-5mkdf           0/1     Completed   0          70m
rook-ceph-rgw-ocs-storagecluster-cephobjectstore-a-6c494495lk8x   2/2     Running     0          68m
rook-ceph-tools-76bc89666b-dv9wk                                  1/1     Running     0          70m
[root@m1301015 ~]#
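For completeness, a minimal sketch of an end-to-end check of the same probe path on this cluster; the rgw service name, the port, and curl being available in the toolbox image are assumptions based on the pod names and the probe URL from the original report:

----------------
# Restart count should stay at 0 for the rgw pod over time
oc -n openshift-storage get pod rook-ceph-rgw-ocs-storagecluster-cephobjectstore-a-6c494495lk8x

# Exercise the same health endpoint the liveness probe uses, from inside the cluster
oc -n openshift-storage exec rook-ceph-tools-76bc89666b-dv9wk -- curl -sS http://rook-ceph-rgw-ocs-storagecluster-cephobjectstore.openshift-storage.svc:8080/swift/healthcheck
----------------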
The first bug has already been verified.
verified
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (Moderate: Red Hat OpenShift Container Storage 4.7.0 security, bug fix, and enhancement update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHSA-2021:2041