Bug 1870631

Summary: OCS 4.6 Deployment : RGW pods went into 'CrashLoopBackOff' state on Z Platform
Product: [Red Hat Storage] Red Hat OpenShift Container Storage
Component: ceph
Version: 4.6
Hardware: s390x
OS: Linux
Status: CLOSED ERRATA
Severity: high
Priority: unspecified
Reporter: Venkat <vpiniset>
Assignee: Matt Benjamin (redhat) <mbenjamin>
QA Contact: Raz Tamir <ratamir>
CC: bniver, ebenahar, jthottan, kdreyer, madam, muagarwa, ocs-bugs, sostapov, tdesala, uweigand
Keywords: AutomationBackLog
Target Release: OCS 4.6.0
Flags: mbenjamin: needinfo-
Doc Type: No Doc Update
Type: Bug
Last Closed: 2020-12-17 06:23:47 UTC

Description Venkat 2020-08-20 13:41:05 UTC
Description of problem (please be as detailed as possible and provide log snippets):

RGW pods went into 'CrashLoopBackOff' state on Z Platform. For the first 5 minutes
the pods were in the Running state; after some time they entered CrashLoopBackOff.

For installation I used the quay.io/organization/rhceph-dev builds. We did not observe this issue with the quay.io/organization/multi-arch builds.

Version of all relevant components (if applicable):
OCP Version : 4.5.4
OCS Version : 4.6
OCS Builds : https://quay.io/organization/rhceph-dev

[root@ocplnx31 ~]# oc get csv
NAME                         DISPLAY                       VERSION        REPLACES   PHASE
ocs-operator.v4.6.0-533.ci   OpenShift Container Storage   4.6.0-533.ci              Installing

Does this issue impact your ability to continue to work with the product
(please explain in detail what is the user impact)?
Yes. Unable to proceed with OCS-CI Tier1 test execution.

Is there any workaround available to the best of your knowledge?
No

Rate from 1 - 5 the complexity of the scenario you performed that caused this
bug (1 - very simple, 5 - very complex)?
3, considering the failures from the Tier1 execution.

Is this issue reproducible?
Yes

Can this issue be reproduced from the UI?
N/A

If this is a regression, please provide more details to justify this:
N/A

Steps to Reproduce:
https://github.com/venkat-pinisetti/ocs-installation


Actual results:

rook-ceph-rgw-ocs-storagecluster-cephobjectstore-a-75c7557vvfq8   0/1     CrashLoopBackOff   21         56m
rook-ceph-rgw-ocs-storagecluster-cephobjectstore-b-bfc97c99c2tf   0/1     CrashLoopBackOff   21         56m

Expected results:

rook-ceph-rgw-ocs-storagecluster-cephobjectstore-a-75c7557vvfq8   1/1     Running     5          4m
rook-ceph-rgw-ocs-storagecluster-cephobjectstore-b-bfc97c99c2tf   1/1     Running     5          3m54s

Additional info:

[root@ocplnx51 ~]# oc describe po rook-ceph-rgw-ocs-storagecluster-cephobjectstore-a-6bc54fdmlx9d | tail
Events:
  Type     Reason     Age                   From                                Message
  ----     ------     ----                  ----                                -------
  Normal   Pulled     75m (x237 over 13h)   kubelet, worker-2.cluster6.ibm.com  Container image "quay.io/rhceph-dev/rhceph@sha256:eafd1acb0ada5d7cf93699056118aca19ed7a22e4938411d307ef94048746cc8" already present on machine
  Warning  Unhealthy  10m (x774 over 13h)   kubelet, worker-2.cluster6.ibm.com  Liveness probe failed: Get http://10.128.2.141:8080/swift/healthcheck: dial tcp 10.128.2.141:8080: connect: connection refused
  Warning  BackOff    56s (x3133 over 13h)  kubelet, worker-2.cluster6.ibm.com  Back-off restarting failed container
[root@ocplnx51 ~]#
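
The liveness probe polls the Swift health-check endpoint on the pod IP, and the connection is refused because the radosgw process keeps exiting. As a quick sanity check of the endpoint itself, one could probe it from the rook-ceph toolbox (a sketch, assuming the toolbox pod is deployed with its usual app=rook-ceph-tools label and has curl available; the pod IP is taken from the events above):

# run from a host with oc access; toolbox label is an assumption, adjust if needed
TOOLS=$(oc -n openshift-storage get pod -l app=rook-ceph-tools -o name)
oc -n openshift-storage exec "$TOOLS" -- curl -sv http://10.128.2.141:8080/swift/healthcheck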
----------------------------------------------------

[root@ocplnx51 ~]# oc logs rook-ceph-rgw-ocs-storagecluster-cephobjectstore-a-6bc54fdmlx9d
debug 2020-08-18 07:22:09.703 3ffbb8df110  0 framework: beast
debug 2020-08-18 07:22:09.713 3ffbb8df110  0 framework conf key: port, val: 8080
debug 2020-08-18 07:22:09.713 3ffbb8df110  0 deferred set uid:gid to 167:167 (ceph:ceph)
debug 2020-08-18 07:22:09.713 3ffbb8df110  0 ceph version 14.2.8-91.el8cp (75b4845da7d469665bd48d1a49badcc3677bf5cd) nautilus (stable), process radosgw, pid 1
debug 2020-08-18 07:22:12.763 3ffbb8df110  0 WARNING: skipping unknown framework: beast
debug 2020-08-18 07:22:12.763 3ffbb8df110  1 mgrc service_daemon_register rgw.ocs.storagecluster.cephobjectstore.a metadata {arch=s390x,ceph_release=nautilus,ceph_version=ceph version 14.2.8-91.el8cp (75b4845da7d469665bd48d1a49badcc3677bf5cd) nautilus (stable),ceph_version_short=14.2.8-91.el8cp,container_hostname=rook-ceph-rgw-ocs-storagecluster-cephobjectstore-a-6bc54fdmlx9d,container_image=quay.io/rhceph-dev/rhceph@sha256:eafd1acb0ada5d7cf93699056118aca19ed7a22e4938411d307ef94048746cc8,distro=rhel,distro_description=Red Hat Enterprise Linux 8.2 (Ootpa),distro_version=8.2,frontend_config#0=beast port=8080,frontend_type#0=beast,hostname=worker-2.cluster6.ibm.com,kernel_description=#1 SMP Thu Jul 2 11:57:29 EDT 2020,kernel_version=4.18.0-193.12.1.el8_2.s390x,mem_swap_kb=0,mem_total_kb=33006876,num_handles=1,os=Linux,pid=1,pod_name=rook-ceph-rgw-ocs-storagecluster-cephobjectstore-a-6bc54fdmlx9d,pod_namespace=openshift-storage,zone_id=6f766d5c-f8f5-43eb-8ce1-064c336d9c71,zone_name=ocs-storagecluster-cephobjectstore,zonegroup_id=af572c8b-a831-4d12-8d03-f133df9c48aa,zonegroup_name=ocs-storagecluster-cephobjectstore}
debug 2020-08-18 07:22:47.653 3ff9f7fe910 -1 received  signal: Terminated from Kernel ( Could be generated by pthread_kill(), raise(), abort(), alarm() ) UID: 0
debug 2020-08-18 07:22:47.653 3ff9f7fe910  1 handle_sigterm
debug 2020-08-18 07:22:47.653 3ff9f7fe910  1 handle_sigterm set alarm for 120
debug 2020-08-18 07:22:47.663 3ffbb8df110 -1 shutting down
debug 2020-08-18 07:22:48.543 3ffbb8df110  1 final shutdown
[root@ocplnx51 ~]#
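
Note that the daemon does come up far enough to register with the mgr (the service_daemon_register line above, advertising frontend_config#0=beast port=8080) before it is terminated. The registered rgw daemons and their advertised metadata can also be inspected with standard ceph CLI commands (a sketch, run from the rook-ceph toolbox pod):

# inside the rook-ceph toolbox pod
ceph service dump     # lists registered rgw daemons and their advertised metadata
ceph versions         # confirms all daemons run 14.2.8-91.el8cp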

Comment 2 Venkat 2020-08-20 13:47:43 UTC
Mark Kogan <mkogan>
Aug 18, 2020, 5:55 PM
to Yaniv, me, rhocs-eng, poornima.nayak, Chidanand, Ulrich, OCS-QE, Nourhane, Matt

Hello, 

From the log:
"debug 2020-08-18 07:22:12.763 3ffbb8df110  0 WARNING: skipping unknown framework: beast"

When last checked, the Boost library version we used (1.67) did not support the Boost.Context library on the Z platform;
thus s390x builds are configured with the -DWITH_BOOST_CONTEXT=OFF CMake flag, which disables the beast frontend.
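
For reference, this is roughly what that build-time exclusion looks like (an illustrative sketch of the CMake invocation only; the actual flags live in the Ceph spec file and may differ):

# s390x with Boost 1.67: Boost.Context is unavailable, so the beast frontend is compiled out
cmake .. -DWITH_BOOST_CONTEXT=OFF
# the resulting radosgw only knows civetweb, hence "skipping unknown framework: beast"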

Please check that changing the rook-ceph-rgw* pods configuration to use the civetweb frontend resolves the issue.
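
One way to try that (a hedged sketch run from the rook-ceph toolbox; the config targets are derived from the daemon name in the log above and may need adjusting, and Rook may re-apply its own rgw_frontends setting on reconcile):

# inside the rook-ceph toolbox pod
ceph config set client.rgw.ocs.storagecluster.cephobjectstore.a rgw_frontends "civetweb port=8080"
ceph config set client.rgw.ocs.storagecluster.cephobjectstore.b rgw_frontends "civetweb port=8080"
# then restart the rgw pods so they pick up the change (label is an assumption; adjust to your deployment)
oc -n openshift-storage delete pod -l app=rook-ceph-rgw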


Following this mail, I re-checked the current status of this limitation, and the circumstances have changed:
ceph version 14.2.8-91.el8cp was updated to build with a newer version of the Boost library (1.72),
which, per the Boost.Context documentation [1], added support for the s390x architecture.

If it is possible to arrange access to an s390x development VM (with RHEL or Fedora) for a vstart
environment, I would re-test the Beast frontend for compilation and functional issues with Boost 1.72
on the Z platform.


[1] https://www.boost.org/doc/libs/1_72_0/libs/context/doc/html/context/architectures.html


Regards,
Mark



Ulrich Weigand
Aug 18, 2020, 9:02 PM
to Ken, Mark, Chidanand, Matt, Nourhane, OCS-QE, poornima.nayak, rhocs-eng, me, Yaniv

Mark Kogan <mkogan> wrote on 18.08.2020 14:25:12:

> From the log:
> "debug 2020-08-18 07:22:12.763 3ffbb8df110  0 WARNING: skipping
> unknown framework: beast"
[snip]
> Following this mail, I re-checked the current status of this
> limitation, and the circumstances have changed:
> ceph version 14.2.8-91.el8cp was updated to build with a newer
> version of the Boost library (1.72), which, per the Boost.Context
> documentation [1], added support for the s390x architecture.

It turns out the support in 1.72 was incomplete; we added full
support in 1.73. But for the RH Ceph builds, we provided a backport
of the necessary changes as a patch against 1.72, which I understand
should have made Boost.Context (and therefore the beast frontend)
work properly on Z. Ken Dreyer worked on integrating this into the
latest Ceph builds. Ken, is this supposed to be working now?

> If it is possible to arrange access to an s390x development VM
> (with RHEL or Fedora) for a vstart environment,
> I would re-test the Beast frontend for compilation and functional
> issues with Boost 1.72 on the Z platform.

There is supposed to be a dev environment available to Christina
Meno's team, but I'm not sure it is fully set up yet ...

Bye,
Ulrich

Comment 3 Mudit Agarwal 2020-10-07 10:08:06 UTC
Elad, is this really a blocker? Is it seen consistently, or is QE blocked because of it?

There is even a workaround available, if I am not wrong, and this seems more to be a retest issue.
>> Please check that changing the rook-ceph-rgw* pods configuration to use the civetweb frontend resolves the issue.

Comment 5 Mudit Agarwal 2020-10-07 12:12:03 UTC
Hi Scott, can someone help us determine the correct build, and therefore the appropriate bug state and release?
Thanks

Comment 6 Elad 2020-10-07 12:15:29 UTC
Mudit, OCS QE are not the ones who actively test on IBM Z. I added the blocker? flag so this bug will not be pushed out to 4.7, in order to allow the IBM team to have a successful deployment in their test executions.

Comment 7 Venkat 2020-10-07 13:25:12 UTC
Apologies for the delay. I've tested with the latest build and the issue no longer occurs.

[root@ocsvm2 ~]# oc get csv
NAME                         DISPLAY                       VERSION        REPLACES   PHASE
ocs-operator.v4.6.0-585.ci   OpenShift Container Storage   4.6.0-585.ci              Succeeded

[root@ocsvm2 ~]# oc get pods|grep rgw
rook-ceph-rgw-ocs-storagecluster-cephobjectstore-a-78d6f7dw9vbd   1/1     Running     0          31h
rook-ceph-rgw-ocs-storagecluster-cephobjectstore-b-769755drwjtb   1/1     Running     0          31h
[root@ocsvm2 ~]#
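
To confirm that the beast frontend actually initialized on the rebuilt image (a quick check; the pod name is taken from the listing above), the startup log should show "framework: beast" with no "skipping unknown framework" warning:

oc logs rook-ceph-rgw-ocs-storagecluster-cephobjectstore-a-78d6f7dw9vbd | grep -i framework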

Comment 9 Mudit Agarwal 2020-10-07 13:49:27 UTC
Clearing the needinfo.

Providing dev_ack based on https://bugzilla.redhat.com/show_bug.cgi?id=1870631#c8, though there is no fix from the OCS side.

Comment 14 Mudit Agarwal 2020-10-07 14:56:21 UTC
Thanks Matt, I think it is clear now.

OCS 4.6 is already based on RHCS 4.1z2, which means this issue should have been fixed by now, and that is what Venkat's update reflects. Moving the BZ to ON_QA; QE can mark it VERIFIED.

Comment 20 Venkat 2020-10-21 13:19:45 UTC
I've verified this on the latest version of OCS 4.6 and it's fixed now. Hence, this can be closed.

[root@ocplnx31 ~]# oc get csv
NAME                         DISPLAY                       VERSION        REPLACES   PHASE
ocs-operator.v4.6.0-607.ci   OpenShift Container Storage   4.6.0-607.ci              Succeeded

[root@ocplnx31 ~]# oc get pods|grep rgw
rook-ceph-rgw-ocs-storagecluster-cephobjectstore-a-59cbf87jrcn7   1/1     Running     0          4m49s
rook-ceph-rgw-ocs-storagecluster-cephobjectstore-b-7d6fdfdjvqxg   1/1     Running     0          4m49s

[root@ocplnx31 ~]# oc version
Client Version: 4.5.16
Server Version: 4.5.15
Kubernetes Version: v1.18.3+2fbd7c7
[root@ocplnx31 ~]#

Comment 21 Prasad Desala 2020-10-21 14:58:33 UTC
Thank you, Venkat. Moving this BZ to the verified state based on comment 20.

Comment 24 errata-xmlrpc 2020-12-17 06:23:47 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Moderate: Red Hat OpenShift Container Storage 4.6.0 security, bug fix, enhancement update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2020:5605