Bug 1928642 - [IBM Z] rook-ceph-rgw pods restart continuously with OCS version 4.6.3 due to liveness probe failure
Summary: [IBM Z] rook-ceph-rgw pods restart continuously with OCS version 4.6.3 due to...
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat OpenShift Container Storage
Classification: Red Hat Storage
Component: build
Version: 4.6
Hardware: s390x
OS: Linux
Priority: unspecified
Severity: urgent
Target Milestone: ---
Target Release: OCS 4.7.0
Assignee: Boris Ranto
QA Contact: Raz Tamir
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2021-02-15 09:13 UTC by Abdul Kandathil (IBM)
Modified: 2021-06-01 08:49 UTC (History)
CC: 19 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2021-05-19 09:20:00 UTC
Embargoed:


Attachments (Terms of Use)
must gather logs (9.32 MB, application/zip), 2021-02-15 09:13 UTC, Abdul Kandathil (IBM)
requested data (60.21 KB, application/zip), 2021-02-23 09:14 UTC, Abdul Kandathil (IBM)


Links
Red Hat Product Errata RHSA-2021:2041, last updated 2021-05-19 09:20:29 UTC

Description Abdul Kandathil (IBM) 2021-02-15 09:13:58 UTC
Created attachment 1757046 [details]
must gather logs

Description of problem (please be as detailed as possible and provide log
snippets):
rook-ceph-rgw pods restart continuously with OCS version 4.6.3 due to liveness probe failure


Version of all relevant components (if applicable): 
OCS 4.6.3-261.ci


Does this issue impact your ability to continue to work with the product
(please explain in detail what is the user impact)? 
Yes, the rook-ceph-rgw pods cannot be used.


Is there any workaround available to the best of your knowledge? 
No


Rate from 1 - 5 the complexity of the scenario you performed that caused this
bug (1 - very simple, 5 - very complex)?


Is this issue reproducible?
Yes


Can this issue be reproduced from the UI?


If this is a regression, please provide more details to justify this:


Steps to Reproduce:
1. Install Storage operator (4.6.3) + Local Storage Operator (4.6) + OpenShift storage cluster
2. Observe rook-ceph-rgw pods


Actual results:
rook-ceph-rgw pods restart continuously.
NAME                                                              READY   STATUS             RESTARTS   AGE
rook-ceph-rgw-ocs-storagecluster-cephobjectstore-a-789f846hx2x7   0/1     CrashLoopBackOff   23         75m
rook-ceph-rgw-ocs-storagecluster-cephobjectstore-b-54c687c656jf   0/1     CrashLoopBackOff   23         75m

----
Events:
  Type     Reason          Age                 From               Message
  ----     ------          ----                ----               -------
  Normal   Scheduled       2m11s               default-scheduler  Successfully assigned openshift-storage/rook-ceph-rgw-ocs-storagecluster-cephobjectstore-a-789f846pszwq to worker-1.m1301015ocs.lnxne.boe
  Normal   AddedInterface  2m10s               multus             Add eth0 [10.131.0.53/23]
  Normal   Pulled          2m10s               kubelet            Container image "quay.io/rhceph-dev/rhceph@sha256:286db60b15548e662bfbf0a7033dc1d73cc93addd64bcdb4d92a4d4802c76f9e" already present on machine
  Normal   Created         2m9s                kubelet            Created container chown-container-data-dir
  Normal   Started         2m9s                kubelet            Started container chown-container-data-dir
  Normal   Pulled          60s (x2 over 2m8s)  kubelet            Container image "quay.io/rhceph-dev/rhceph@sha256:286db60b15548e662bfbf0a7033dc1d73cc93addd64bcdb4d92a4d4802c76f9e" already present on machine
  Normal   Created         60s (x2 over 2m8s)  kubelet            Created container rgw
  Normal   Started         60s (x2 over 2m8s)  kubelet            Started container rgw
  Warning  Unhealthy       30s (x6 over 110s)  kubelet            Liveness probe failed: Get "http://10.131.0.53:8080/swift/healthcheck": dial tcp 10.131.0.53:8080: connect: connection refused
  Normal   Killing         30s (x2 over 90s)   kubelet            Container rgw failed liveness probe, will be restarted
----
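For reference, the failing probe in the events above is an HTTP GET against the RGW Swift health-check endpoint. A minimal sketch of what such a liveness probe looks like in the rgw container spec (path and port are taken from the kubelet events above; the timing values are assumptions, and the probe Rook actually generates may differ):

```yaml
# Illustrative liveness probe for the rgw container.
# Path and port come from the kubelet events; timings are assumed values.
livenessProbe:
  httpGet:
    path: /swift/healthcheck
    port: 8080
  initialDelaySeconds: 10
  periodSeconds: 10
  failureThreshold: 3
```

Note that the error is "connection refused" rather than a timeout or a non-200 response: the RGW process never opened port 8080 at all, which points at the daemon failing to start its HTTP frontend rather than responding slowly.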


Expected results:
All pods run without errors


Additional info:

Comment 2 Sébastien Han 2021-02-15 16:45:27 UTC
Jiffin PTAL

Comment 4 Christina Meno 2021-02-16 18:39:29 UTC
Travis is this like https://bugzilla.redhat.com/show_bug.cgi?id=1926617 ? would you please make a recommendation on how to proceed here?

Comment 5 Sébastien Han 2021-02-17 14:39:58 UTC
Christina, I think the RGW issue in https://bugzilla.redhat.com/show_bug.cgi?id=1926617 is a side-effect of the flapping OSDs.

Comment 8 Travis Nielsen 2021-02-22 20:03:06 UTC
Abdul, we need to get the OSDs in a healthy state first. Are you able to get a cluster running on 4.6.3 where the PGs are showing as active+clean in the ceph status? 

Usually they are not active+clean if the OSD pods are failing to start, although Jiffin's analysis above shows that the 3 OSDs are all "up" and "in".

For troubleshooting unhealthy PGs from the toolbox, this topic may also have some pointers: https://docs.ceph.com/en/nautilus/rados/operations/placement-groups/

Comment 9 Abdul Kandathil (IBM) 2021-02-23 08:47:22 UTC
Hi Mark,

OSDs are in a healthy state. There are two issues in my cluster as of now:
1. rook-ceph-rgw pods restarting continuously (which this bug tracks)
2. OSD restarts due to OOM, which I think is related to bug: https://bugzilla.redhat.com/show_bug.cgi?id=1917815

----------------
[root@m1301015 ~]# oc -n openshift-storage exec rook-ceph-tools-6fdd868f75-686g4 -- ceph health
HEALTH_OK
[root@m1301015 ~]# oc -n openshift-storage exec rook-ceph-tools-6fdd868f75-686g4 -- ceph -s
  cluster:
    id:     c6fb52bc-04b0-43f7-8890-a1a7e4f69bee
    health: HEALTH_OK

  services:
    mon: 3 daemons, quorum a,b,c (age 3d)
    mgr: a(active, since 3d)
    mds: ocs-storagecluster-cephfilesystem:1 {0=ocs-storagecluster-cephfilesystem-a=up:active} 1 up:standby-replay
    osd: 3 osds: 3 up (since 3d), 3 in (since 4d)

  task status:
    scrub status:
        mds.ocs-storagecluster-cephfilesystem-a: idle
        mds.ocs-storagecluster-cephfilesystem-b: idle

  data:
    pools:   10 pools, 176 pgs
    objects: 30.71k objects, 117 GiB
    usage:   345 GiB used, 2.7 TiB / 3 TiB avail
    pgs:     176 active+clean

  io:
    client:   3.0 KiB/s rd, 7.7 KiB/s wr, 3 op/s rd, 0 op/s wr

[root@m1301015 ~]#
[root@m1301015 ~]# oc -n openshift-storage get csv
NAME                         DISPLAY                       VERSION        REPLACES                     PHASE
ocs-operator.v4.6.3-267.ci   OpenShift Container Storage   4.6.3-267.ci   ocs-operator.v4.6.2-233.ci   Succeeded
[root@m1301015 ~]#

[root@m1301015 ~]# oc -n openshift-storage get po
NAME                                                              READY   STATUS             RESTARTS   AGE
csi-cephfsplugin-589zx                                            3/3     Running            0          3d22h
csi-cephfsplugin-9dp8l                                            3/3     Running            0          3d22h
csi-cephfsplugin-jdtk4                                            3/3     Running            0          3d22h
csi-cephfsplugin-lmrfq                                            3/3     Running            0          3d22h
csi-cephfsplugin-provisioner-86bd8cb497-9pd7j                     6/6     Running            0          3d22h
csi-cephfsplugin-provisioner-86bd8cb497-d9z2k                     6/6     Running            0          3d22h
csi-rbdplugin-7549z                                               3/3     Running            0          3d22h
csi-rbdplugin-p5qnw                                               3/3     Running            0          3d22h
csi-rbdplugin-provisioner-6db77bb448-wdfxf                        6/6     Running            0          3d22h
csi-rbdplugin-provisioner-6db77bb448-zb8nx                        6/6     Running            0          3d22h
csi-rbdplugin-sg8vc                                               3/3     Running            0          3d22h
csi-rbdplugin-sqd7n                                               3/3     Running            0          3d22h
noobaa-core-0                                                     1/1     Running            0          3d22h
noobaa-db-0                                                       1/1     Running            0          3d22h
noobaa-endpoint-8f88646bf-75fnz                                   1/1     Running            0          3d22h
noobaa-operator-66dbcf8698-2nv4d                                  1/1     Running            0          3d22h
ocs-metrics-exporter-758dd99c98-67kbn                             1/1     Running            0          3d22h
ocs-operator-58f8bb8dd8-5sppp                                     1/1     Running            0          3d22h
rook-ceph-crashcollector-worker-0.m1301015ocs.lnxne.boe-5dlvx7h   1/1     Running            0          3d22h
rook-ceph-crashcollector-worker-1.m1301015ocs.lnxne.boe-f8ftlzz   1/1     Running            0          3d22h
rook-ceph-crashcollector-worker-2.m1301015ocs.lnxne.boe-7czlgrj   1/1     Running            0          3d22h
rook-ceph-mds-ocs-storagecluster-cephfilesystem-a-55f64484kmjf6   1/1     Running            0          3d22h
rook-ceph-mds-ocs-storagecluster-cephfilesystem-b-69c57477v79tq   1/1     Running            0          3d22h
rook-ceph-mgr-a-6bcc6954cf-l9k72                                  1/1     Running            0          3d22h
rook-ceph-mon-a-65b67f57dd-cmt4s                                  1/1     Running            0          3d22h
rook-ceph-mon-b-57d7fc55bc-jzqp5                                  1/1     Running            0          3d22h
rook-ceph-mon-c-685bf9c59d-746rj                                  1/1     Running            0          3d22h
rook-ceph-operator-88555d7c-2qs6p                                 1/1     Running            0          3d22h
rook-ceph-osd-0-84d767f446-nhmw4                                  1/1     Running            5          3d22h
rook-ceph-osd-1-6c4784b457-ssbdt                                  1/1     Running            4          3d22h
rook-ceph-osd-2-78975b7f45-6jsqm                                  1/1     Running            2          3d22h
rook-ceph-osd-prepare-ocs-deviceset-0-data-0-6r8t2-dj7cd          0/1     Completed          0          4d13h
rook-ceph-osd-prepare-ocs-deviceset-1-data-0-7rkzh-bdxxl          0/1     Completed          0          4d13h
rook-ceph-osd-prepare-ocs-deviceset-2-data-0-dsvxq-22x25          0/1     Completed          0          4d13h
rook-ceph-rgw-ocs-storagecluster-cephobjectstore-a-56b64f8kcwct   0/1     CrashLoopBackOff   1787       3d22h
rook-ceph-rgw-ocs-storagecluster-cephobjectstore-b-74dc9f947j94   0/1     CrashLoopBackOff   1833       3d22h
rook-ceph-tools-6fdd868f75-686g4                                  1/1     Running            0          4d13h
worker-0m1301015ocslnxneboe-debug                                 0/1     Completed          0          4d13h
worker-1m1301015ocslnxneboe-debug                                 0/1     Completed          0          4d13h
worker-2m1301015ocslnxneboe-debug                                 0/1     Completed          0          4d13h
[root@m1301015 ~]#

--------------

Comment 10 Jiffin 2021-02-23 08:56:27 UTC
This one is different from the old cluster reported in https://bugzilla.redhat.com/show_bug.cgi?id=1928642#c0. Can you please share the logs from the rgw pods and the rook operator pod,
and the output of "oc -n openshift-storage describe pods <rgw pods>" as well?

Comment 11 Abdul Kandathil (IBM) 2021-02-23 09:14:07 UTC
Created attachment 1758795 [details]
requested data

Please find the requested info attached.

Comment 15 Abdul Kandathil (IBM) 2021-02-24 13:58:09 UTC
Please see the attachment in the comment (https://bugzilla.redhat.com/show_bug.cgi?id=1928642#c11) for all the info asked in https://bugzilla.redhat.com/show_bug.cgi?id=1928642#c10.
Is there any further information needed?

Comment 16 Sébastien Han 2021-02-24 14:00:42 UTC
(In reply to Matt Benjamin (redhat) from comment #13)
> Hi Jiffin,
> 
> I'm confused how this would happen.
> 
> This error implies that the beast front-end /was not selected to be built/
> when this system-z build was run.
> 
> The Beast frontend is built unless explicitly disabled (top CMakeLists.txt),
> so if there was any problem, it should have presented itself as a
> compilation error, not a runtime error.
> 
> Can we get some help from the folks who do the system-z builds, please?
> 
> Matt

Based on the above, Christina, can you assist with this issue since it seems related to the build? Maybe Boris or Ken? I'm not sure...
Thanks!

I'm moving this out of Rook since it is not a Rook problem. I've picked build.

Comment 18 Ken Dreyer (Red Hat) 2021-02-24 18:03:15 UTC
In bug 1917592, I found that I accidentally set WITH_BOOST_CONTEXT=OFF for s390x in the RHCS 4.2 Ceph build. This was a regression from RHCS 4.1, and we plan to ship that fix in RHCS 4.2 z1.
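For context, the Beast frontend's coroutine-based I/O depends on Boost.Context, so a build configured with that support off can end up without the frontend the liveness probe expects to answer on port 8080. A simplified sketch of how such a CMake gate typically works (this is illustrative, not the literal Ceph CMakeLists.txt; the exact option wiring in the Ceph tree may differ):

```cmake
# Simplified sketch of the build-flag dependency, not Ceph's actual build files.
option(WITH_BOOST_CONTEXT "Enable Boost.Context support" ON)

if(WITH_BOOST_CONTEXT)
  # The coroutine-based Beast frontend needs Boost.Context, so it is only
  # compiled in when this option is enabled.
  list(APPEND BOOST_COMPONENTS context coroutine)
  add_definitions(-DHAVE_BOOST_CONTEXT)
endif()
```

This matches the runtime symptom: with the flag off for s390x, the resulting radosgw binary could not bring up the Beast frontend, so the port was never opened and the probe saw "connection refused".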

Comment 20 Mudit Agarwal 2021-03-04 04:37:47 UTC
Is there a BZ tracking the fix mentioned in https://bugzilla.redhat.com/show_bug.cgi?id=1928642#c18?

Comment 21 Mudit Agarwal 2021-03-04 04:46:38 UTC
My bad, please ignore the above comment.

Moving this BZ to MODIFIED, as the fix is already in RHCS 4.2z1.

Comment 23 Boris Ranto 2021-03-04 23:59:10 UTC
Yes, this has been fixed (early) in 4.2z1 for a while. All the 4.2z1-based 4.7.0 builds should have the fix in them.

Comment 24 lmcfadde 2021-03-10 19:30:31 UTC
@akandath @tstober can this be moved to VERIFIED?

Comment 26 Abdul Kandathil (IBM) 2021-03-29 14:11:24 UTC
I just installed OCS 4.7.0-324.ci on OCP 4.7.2 and I am not seeing the "rook-ceph-rgw" pod restarts anymore.


[root@m1301015 ~]# oc -n openshift-storage get csv
NAME                         DISPLAY                       VERSION        REPLACES   PHASE
ocs-operator.v4.7.0-324.ci   OpenShift Container Storage   4.7.0-324.ci              Succeeded
[root@m1301015 ~]#

[root@m1301015 ~]# oc -n openshift-storage get po
NAME                                                              READY   STATUS      RESTARTS   AGE
csi-cephfsplugin-5wgkx                                            3/3     Running     0          72m
csi-cephfsplugin-6lx6g                                            3/3     Running     0          72m
csi-cephfsplugin-cc8vm                                            3/3     Running     0          72m
csi-cephfsplugin-j6kjc                                            3/3     Running     0          72m
csi-cephfsplugin-provisioner-76b7c894b9-6z6wk                     6/6     Running     0          72m
csi-cephfsplugin-provisioner-76b7c894b9-xq9j2                     6/6     Running     0          72m
csi-rbdplugin-8z2dp                                               3/3     Running     0          72m
csi-rbdplugin-d8fmx                                               3/3     Running     0          72m
csi-rbdplugin-gxvmj                                               3/3     Running     0          72m
csi-rbdplugin-lmd4q                                               3/3     Running     0          72m
csi-rbdplugin-provisioner-5866f86d44-96wlm                        6/6     Running     0          72m
csi-rbdplugin-provisioner-5866f86d44-k5f54                        6/6     Running     0          72m
noobaa-core-0                                                     1/1     Running     0          69m
noobaa-db-pg-0                                                    1/1     Running     0          69m
noobaa-endpoint-86cffb6848-shn22                                  1/1     Running     0          67m
noobaa-operator-fb44b58b6-rhj24                                   1/1     Running     0          74m
ocs-metrics-exporter-5549d7f894-wgdkw                             1/1     Running     0          74m
ocs-operator-6b76fb4dff-dsjq2                                     1/1     Running     0          74m
rook-ceph-crashcollector-worker-0.m1301015ocs.lnxne.boe-85fzbwh   1/1     Running     0          69m
rook-ceph-crashcollector-worker-1.m1301015ocs.lnxne.boe-75tptnm   1/1     Running     0          70m
rook-ceph-crashcollector-worker-2.m1301015ocs.lnxne.boe-7dvx4cc   1/1     Running     0          70m
rook-ceph-crashcollector-worker-3.m1301015ocs.lnxne.boe-577vm45   1/1     Running     0          71m
rook-ceph-mds-ocs-storagecluster-cephfilesystem-a-5757dc86mvp6g   2/2     Running     0          69m
rook-ceph-mds-ocs-storagecluster-cephfilesystem-b-6cc95ff82j7sb   2/2     Running     0          69m
rook-ceph-mgr-a-689cf446d-mc7st                                   2/2     Running     0          70m
rook-ceph-mon-a-7dccdf8df4-gv8rc                                  2/2     Running     0          72m
rook-ceph-mon-b-9bb666f56-dp8qk                                   2/2     Running     0          70m
rook-ceph-mon-c-57569f8c98-8ngzq                                  2/2     Running     0          70m
rook-ceph-operator-7bd78b8dff-55qzx                               1/1     Running     0          74m
rook-ceph-osd-0-74b9845b7f-sgbx9                                  2/2     Running     0          69m
rook-ceph-osd-1-cc884cd6-qqjr7                                    2/2     Running     0          69m
rook-ceph-osd-2-7574949756-fhq9l                                  2/2     Running     0          69m
rook-ceph-osd-prepare-ocs-deviceset-0-data-0d59dj-m4n4p           0/1     Completed   0          70m
rook-ceph-osd-prepare-ocs-deviceset-1-data-0tgxjx-t7ffn           0/1     Completed   0          70m
rook-ceph-osd-prepare-ocs-deviceset-2-data-0wgvg9-5mkdf           0/1     Completed   0          70m
rook-ceph-rgw-ocs-storagecluster-cephobjectstore-a-6c494495lk8x   2/2     Running     0          68m
rook-ceph-tools-76bc89666b-dv9wk                                  1/1     Running     0          70m
[root@m1301015 ~]#

Comment 27 tstober 2021-04-28 13:30:32 UTC
The first bug has already been verified.

Comment 28 tstober 2021-04-28 13:45:36 UTC
Verified.

Comment 30 errata-xmlrpc 2021-05-19 09:20:00 UTC
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA.

For information on the advisory (Moderate: Red Hat OpenShift Container Storage 4.7.0 security, bug fix, and enhancement update), and where to find the updated files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2021:2041

