Bug 1928642
| Summary: | [IBM Z] rook-ceph-rgw pods restart continuously with OCS version 4.6.3 due to liveness probe failure | | |
|---|---|---|---|
| Product: | [Red Hat Storage] Red Hat OpenShift Container Storage | Reporter: | Abdul Kandathil (IBM) <akandath> |
| Component: | build | Assignee: | Boris Ranto <branto> |
| Status: | CLOSED ERRATA | QA Contact: | Raz Tamir <ratamir> |
| Severity: | urgent | Docs Contact: | |
| Priority: | unspecified | | |
| Version: | 4.6 | CC: | bniver, branto, ebenahar, gmeno, jthottan, kdreyer, lmcfadde, madam, mbenjamin, mkogan, muagarwa, ocs-bugs, ratamir, rcyriac, shan, sostapov, thottanjiffin, tnielsen, tstober |
| Target Milestone: | --- | Keywords: | AutomationBackLog |
| Target Release: | OCS 4.7.0 | | |
| Hardware: | s390x | | |
| OS: | Linux | | |
| Whiteboard: | | | |
| Fixed In Version: | | Doc Type: | If docs needed, set a value |
| Doc Text: | | Story Points: | --- |
| Clone Of: | | Environment: | |
| Last Closed: | 2021-05-19 09:20:00 UTC | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | | | |
| Attachments: | | | |
Description — comments

Abdul Kandathil (IBM), 2021-02-15 09:13:58 UTC
Jiffin, PTAL.

Travis, is this like https://bugzilla.redhat.com/show_bug.cgi?id=1926617? Would you please make a recommendation on how to proceed here?

Christina, I think the RGW issue in https://bugzilla.redhat.com/show_bug.cgi?id=1926617 is a side effect of the flapping OSDs. Abdul, we need to get the OSDs into a healthy state first. Are you able to get a cluster running on 4.6.3 where the PGs show as active+clean in the ceph status? They are usually not active+clean if the OSD pods are failing to start, although Jiffin's analysis above shows that all 3 OSDs are "up" and "in". For troubleshooting unhealthy PGs from the toolbox, this topic may also have some pointers: https://docs.ceph.com/en/nautilus/rados/operations/placement-groups/

Hi Mark, the OSDs are in a healthy state. There are two issues in my cluster right now:

1. rook-ceph-rgw pods restarting continuously (which is what this bug tracks)
2. OSD restarts due to OOM, which I think is related to bug https://bugzilla.redhat.com/show_bug.cgi?id=1917815

```
[root@m1301015 ~]# oc -n openshift-storage exec rook-ceph-tools-6fdd868f75-686g4 -- ceph health
HEALTH_OK
[root@m1301015 ~]# oc -n openshift-storage exec rook-ceph-tools-6fdd868f75-686g4 -- ceph -s
  cluster:
    id:     c6fb52bc-04b0-43f7-8890-a1a7e4f69bee
    health: HEALTH_OK

  services:
    mon: 3 daemons, quorum a,b,c (age 3d)
    mgr: a(active, since 3d)
    mds: ocs-storagecluster-cephfilesystem:1 {0=ocs-storagecluster-cephfilesystem-a=up:active} 1 up:standby-replay
    osd: 3 osds: 3 up (since 3d), 3 in (since 4d)

  task status:
    scrub status:
      mds.ocs-storagecluster-cephfilesystem-a: idle
      mds.ocs-storagecluster-cephfilesystem-b: idle

  data:
    pools:   10 pools, 176 pgs
    objects: 30.71k objects, 117 GiB
    usage:   345 GiB used, 2.7 TiB / 3 TiB avail
    pgs:     176 active+clean

  io:
    client:   3.0 KiB/s rd, 7.7 KiB/s wr, 3 op/s rd, 0 op/s wr

[root@m1301015 ~]# oc -n openshift-storage get csv
NAME                         DISPLAY                       VERSION        REPLACES                     PHASE
ocs-operator.v4.6.3-267.ci   OpenShift Container Storage   4.6.3-267.ci   ocs-operator.v4.6.2-233.ci   Succeeded

[root@m1301015 ~]# oc -n openshift-storage get po
NAME                                                              READY   STATUS             RESTARTS   AGE
csi-cephfsplugin-589zx                                            3/3     Running            0          3d22h
csi-cephfsplugin-9dp8l                                            3/3     Running            0          3d22h
csi-cephfsplugin-jdtk4                                            3/3     Running            0          3d22h
csi-cephfsplugin-lmrfq                                            3/3     Running            0          3d22h
csi-cephfsplugin-provisioner-86bd8cb497-9pd7j                     6/6     Running            0          3d22h
csi-cephfsplugin-provisioner-86bd8cb497-d9z2k                     6/6     Running            0          3d22h
csi-rbdplugin-7549z                                               3/3     Running            0          3d22h
csi-rbdplugin-p5qnw                                               3/3     Running            0          3d22h
csi-rbdplugin-provisioner-6db77bb448-wdfxf                        6/6     Running            0          3d22h
csi-rbdplugin-provisioner-6db77bb448-zb8nx                        6/6     Running            0          3d22h
csi-rbdplugin-sg8vc                                               3/3     Running            0          3d22h
csi-rbdplugin-sqd7n                                               3/3     Running            0          3d22h
noobaa-core-0                                                     1/1     Running            0          3d22h
noobaa-db-0                                                       1/1     Running            0          3d22h
noobaa-endpoint-8f88646bf-75fnz                                   1/1     Running            0          3d22h
noobaa-operator-66dbcf8698-2nv4d                                  1/1     Running            0          3d22h
ocs-metrics-exporter-758dd99c98-67kbn                             1/1     Running            0          3d22h
ocs-operator-58f8bb8dd8-5sppp                                     1/1     Running            0          3d22h
rook-ceph-crashcollector-worker-0.m1301015ocs.lnxne.boe-5dlvx7h   1/1     Running            0          3d22h
rook-ceph-crashcollector-worker-1.m1301015ocs.lnxne.boe-f8ftlzz   1/1     Running            0          3d22h
rook-ceph-crashcollector-worker-2.m1301015ocs.lnxne.boe-7czlgrj   1/1     Running            0          3d22h
rook-ceph-mds-ocs-storagecluster-cephfilesystem-a-55f64484kmjf6   1/1     Running            0          3d22h
rook-ceph-mds-ocs-storagecluster-cephfilesystem-b-69c57477v79tq   1/1     Running            0          3d22h
rook-ceph-mgr-a-6bcc6954cf-l9k72                                  1/1     Running            0          3d22h
rook-ceph-mon-a-65b67f57dd-cmt4s                                  1/1     Running            0          3d22h
rook-ceph-mon-b-57d7fc55bc-jzqp5                                  1/1     Running            0          3d22h
rook-ceph-mon-c-685bf9c59d-746rj                                  1/1     Running            0          3d22h
rook-ceph-operator-88555d7c-2qs6p                                 1/1     Running            0          3d22h
rook-ceph-osd-0-84d767f446-nhmw4                                  1/1     Running            5          3d22h
rook-ceph-osd-1-6c4784b457-ssbdt                                  1/1     Running            4          3d22h
rook-ceph-osd-2-78975b7f45-6jsqm                                  1/1     Running            2          3d22h
rook-ceph-osd-prepare-ocs-deviceset-0-data-0-6r8t2-dj7cd          0/1     Completed          0          4d13h
rook-ceph-osd-prepare-ocs-deviceset-1-data-0-7rkzh-bdxxl          0/1     Completed          0          4d13h
rook-ceph-osd-prepare-ocs-deviceset-2-data-0-dsvxq-22x25          0/1     Completed          0          4d13h
rook-ceph-rgw-ocs-storagecluster-cephobjectstore-a-56b64f8kcwct   0/1     CrashLoopBackOff   1787       3d22h
rook-ceph-rgw-ocs-storagecluster-cephobjectstore-b-74dc9f947j94   0/1     CrashLoopBackOff   1833       3d22h
rook-ceph-tools-6fdd868f75-686g4                                  1/1     Running            0          4d13h
worker-0m1301015ocslnxneboe-debug                                 0/1     Completed          0          4d13h
worker-1m1301015ocslnxneboe-debug                                 0/1     Completed          0          4d13h
worker-2m1301015ocslnxneboe-debug                                 0/1     Completed          0          4d13h
[root@m1301015 ~]#
```

This one is different from the old cluster reported in https://bugzilla.redhat.com/show_bug.cgi?id=1928642#c0. Could you please share the logs from the RGW pods and the rook operator pod, and the output of "oc -n openshift-storage describe pods <rgw pods>" as well?

Created attachment 1758795 [details]
requested data
Please find attached the requested info.
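The attachment contents are not reproduced in this report. As a rough sketch only, data of this kind is typically gathered from the openshift-storage namespace with commands along the following lines; the RGW pod names are taken from the listing above, and the output filenames are illustrative:

```sh
# Sketch only: collect the RGW and operator diagnostics requested above.
NS=openshift-storage

# Current pod list for reference
oc -n "$NS" get pods -o wide > pods.txt

# Logs and descriptions of the crash-looping RGW pods
for pod in rook-ceph-rgw-ocs-storagecluster-cephobjectstore-a-56b64f8kcwct \
           rook-ceph-rgw-ocs-storagecluster-cephobjectstore-b-74dc9f947j94; do
    oc -n "$NS" logs "$pod" --previous > "${pod}.log" 2>&1 || true
    oc -n "$NS" describe pod "$pod" > "${pod}.describe.txt"
done

# Rook operator log
oc -n "$NS" logs deploy/rook-ceph-operator > rook-ceph-operator.log
```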
Please see the attachment in comment https://bugzilla.redhat.com/show_bug.cgi?id=1928642#c11 for all the info asked for in https://bugzilla.redhat.com/show_bug.cgi?id=1928642#c10. Is there any further information needed?

(In reply to Matt Benjamin (redhat) from comment #13)
> Hi Jiffin,
>
> I'm confused how this would happen.
>
> This error implies that the beast front-end /was not selected to be built/
> when this system-z build was run.
>
> The Beast frontend is built unless explicitly disabled (top CMakeLists.txt),
> so if there was any problem, it should have presented itself as a
> compilation error, not a runtime error.
>
> Can we get some help from the folks who do the system-z builds, please?
>
> Matt

Based on the above, Christina, can you assist with this issue since it seems related to the build? Maybe Boris or Ken? I'm not sure... Thanks!

I'm moving this out of Rook since it is not a Rook problem. I've picked the build component.

In bug 1917592, I found that I accidentally set WITH_BOOST_CONTEXT=OFF for s390x in the RHCS 4.2 Ceph build, which is why the Beast frontend was not built for s390x. This was a regression from RHCS 4.1, and we plan to ship that fix in RHCS 4.2 z1. (A sketch of the build flag in question appears at the end of this report.)

Is there a BZ tracking the fix mentioned in https://bugzilla.redhat.com/show_bug.cgi?id=1928642#c18?

My bad, please ignore the above comment. Moving this BZ to MODIFIED, as we already have the fix in RHCS 4.2z1.

Yes, this has been fixed (early) in 4.2z1 for a while. All the 4.2z1-based 4.7.0 builds should have the fix in them. @akandath @tstober can this be moved to VERIFIED?

I just installed OCS 4.7.0-324.ci on OCP 4.7.2 and I am not seeing the "rook-ceph-rgw" pod restarts anymore.

```
[root@m1301015 ~]# oc -n openshift-storage get csv
NAME                         DISPLAY                       VERSION        REPLACES   PHASE
ocs-operator.v4.7.0-324.ci   OpenShift Container Storage   4.7.0-324.ci              Succeeded

[root@m1301015 ~]# oc -n openshift-storage get po
NAME                                                              READY   STATUS      RESTARTS   AGE
csi-cephfsplugin-5wgkx                                            3/3     Running     0          72m
csi-cephfsplugin-6lx6g                                            3/3     Running     0          72m
csi-cephfsplugin-cc8vm                                            3/3     Running     0          72m
csi-cephfsplugin-j6kjc                                            3/3     Running     0          72m
csi-cephfsplugin-provisioner-76b7c894b9-6z6wk                     6/6     Running     0          72m
csi-cephfsplugin-provisioner-76b7c894b9-xq9j2                     6/6     Running     0          72m
csi-rbdplugin-8z2dp                                               3/3     Running     0          72m
csi-rbdplugin-d8fmx                                               3/3     Running     0          72m
csi-rbdplugin-gxvmj                                               3/3     Running     0          72m
csi-rbdplugin-lmd4q                                               3/3     Running     0          72m
csi-rbdplugin-provisioner-5866f86d44-96wlm                        6/6     Running     0          72m
csi-rbdplugin-provisioner-5866f86d44-k5f54                        6/6     Running     0          72m
noobaa-core-0                                                     1/1     Running     0          69m
noobaa-db-pg-0                                                    1/1     Running     0          69m
noobaa-endpoint-86cffb6848-shn22                                  1/1     Running     0          67m
noobaa-operator-fb44b58b6-rhj24                                   1/1     Running     0          74m
ocs-metrics-exporter-5549d7f894-wgdkw                             1/1     Running     0          74m
ocs-operator-6b76fb4dff-dsjq2                                     1/1     Running     0          74m
rook-ceph-crashcollector-worker-0.m1301015ocs.lnxne.boe-85fzbwh   1/1     Running     0          69m
rook-ceph-crashcollector-worker-1.m1301015ocs.lnxne.boe-75tptnm   1/1     Running     0          70m
rook-ceph-crashcollector-worker-2.m1301015ocs.lnxne.boe-7dvx4cc   1/1     Running     0          70m
rook-ceph-crashcollector-worker-3.m1301015ocs.lnxne.boe-577vm45   1/1     Running     0          71m
rook-ceph-mds-ocs-storagecluster-cephfilesystem-a-5757dc86mvp6g   2/2     Running     0          69m
rook-ceph-mds-ocs-storagecluster-cephfilesystem-b-6cc95ff82j7sb   2/2     Running     0          69m
rook-ceph-mgr-a-689cf446d-mc7st                                   2/2     Running     0          70m
rook-ceph-mon-a-7dccdf8df4-gv8rc                                  2/2     Running     0          72m
rook-ceph-mon-b-9bb666f56-dp8qk                                   2/2     Running     0          70m
rook-ceph-mon-c-57569f8c98-8ngzq                                  2/2     Running     0          70m
rook-ceph-operator-7bd78b8dff-55qzx                               1/1     Running     0          74m
rook-ceph-osd-0-74b9845b7f-sgbx9                                  2/2     Running     0          69m
rook-ceph-osd-1-cc884cd6-qqjr7                                    2/2     Running     0          69m
rook-ceph-osd-2-7574949756-fhq9l                                  2/2     Running     0          69m
rook-ceph-osd-prepare-ocs-deviceset-0-data-0d59dj-m4n4p           0/1     Completed   0          70m
rook-ceph-osd-prepare-ocs-deviceset-1-data-0tgxjx-t7ffn           0/1     Completed   0          70m
rook-ceph-osd-prepare-ocs-deviceset-2-data-0wgvg9-5mkdf           0/1     Completed   0          70m
rook-ceph-rgw-ocs-storagecluster-cephobjectstore-a-6c494495lk8x   2/2     Running     0          68m
rook-ceph-tools-76bc89666b-dv9wk                                  1/1     Running     0          70m
[root@m1301015 ~]#
```

The first bug has already been verified.

Verified.

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (Moderate: Red Hat OpenShift Container Storage 4.7.0 security, bug fix, and enhancement update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2021:2041
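For reference, the root cause was the WITH_BOOST_CONTEXT CMake switch named above. The actual RHCS 4.2 build scripts are not shown in this bug, so the lines below are only a minimal sketch, against an upstream Ceph checkout, of how that flag is toggled when configuring a build; ON is what the RHCS 4.2 z1 fix restores for s390x:

```sh
# Sketch only: the CMake option at the heart of this bug, shown for an
# upstream Ceph tree. Paths and options other than WITH_BOOST_CONTEXT
# are illustrative, not the RHCS build invocation.
git clone https://github.com/ceph/ceph.git
cd ceph
./do_cmake.sh -DWITH_BOOST_CONTEXT=ON   # OFF on s390x is what removed the Beast frontend
cd build && make radosgw                # rebuild the RGW daemon with Beast available
```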