Description of problem (please be as detailed as possible and provide log snippets):
RGW pods went into the 'CrashLoopBackOff' state on the Z platform. The pods were in the Running state for the first 5 minutes, and the crash loop started after that. The installation used the quay.io/organization/rhceph-dev builds. We did not observe this issue with the quay.io/organization/multi-arch builds.

Version of all relevant components (if applicable):
OCP Version : 4.5.4
OCS Version : 4.6
OCS Builds : https://quay.io/organization/rhceph-dev

[root@ocplnx31 ~]# oc get csv
NAME                         DISPLAY                       VERSION        REPLACES   PHASE
ocs-operator.v4.6.0-533.ci   OpenShift Container Storage   4.6.0-533.ci              Installing

Does this issue impact your ability to continue to work with the product (please explain in detail what is the user impact)?
Yes. Unable to proceed with OCS-CI Tier1 test execution.

Is there any workaround available to the best of your knowledge?
No

Rate from 1 - 5 the complexity of the scenario you performed that caused this bug (1 - very simple, 5 - very complex)?
3. Considering the failures from Tier1 execution.

Is this issue reproducible?
Yes

Can this issue be reproduced from the UI?
N/A

If this is a regression, please provide more details to justify this:
N/A

Steps to Reproduce:
https://github.com/venkat-pinisetti/ocs-installation

Actual results:
rook-ceph-rgw-ocs-storagecluster-cephobjectstore-a-75c7557vvfq8   0/1   CrashLoopBackOff   21   56m
rook-ceph-rgw-ocs-storagecluster-cephobjectstore-b-bfc97c99c2tf   0/1   CrashLoopBackOff   21   56m

Expected results:
rook-ceph-rgw-ocs-storagecluster-cephobjectstore-a-75c7557vvfq8   1/1   Running   5   4m
rook-ceph-rgw-ocs-storagecluster-cephobjectstore-b-bfc97c99c2tf   1/1   Running   5   3m54s

Additional info:
[root@ocplnx51 ~]# oc describe po rook-ceph-rgw-ocs-storagecluster-cephobjectstore-a-6bc54fdmlx9d | tail
Events:
  Type     Reason     Age                   From                                Message
  ----     ------     ----                  ----                                -------
  Normal   Pulled     75m (x237 over 13h)   kubelet, worker-2.cluster6.ibm.com  Container image "quay.io/rhceph-dev/rhceph@sha256:eafd1acb0ada5d7cf93699056118aca19ed7a22e4938411d307ef94048746cc8" already present on machine
  Warning  Unhealthy  10m (x774 over 13h)   kubelet, worker-2.cluster6.ibm.com  Liveness probe failed: Get http://10.128.2.141:8080/swift/healthcheck: dial tcp 10.128.2.141:8080: connect: connection refused
  Warning  BackOff    56s (x3133 over 13h)  kubelet, worker-2.cluster6.ibm.com  Back-off restarting failed container
[root@ocplnx51 ~]#
----------------------------------------------------
[root@ocplnx51 ~]# oc logs rook-ceph-rgw-ocs-storagecluster-cephobjectstore-a-6bc54fdmlx9d
debug 2020-08-18 07:22:09.703 3ffbb8df110 0 framework: beast
debug 2020-08-18 07:22:09.713 3ffbb8df110 0 framework conf key: port, val: 8080
debug 2020-08-18 07:22:09.713 3ffbb8df110 0 deferred set uid:gid to 167:167 (ceph:ceph)
debug 2020-08-18 07:22:09.713 3ffbb8df110 0 ceph version 14.2.8-91.el8cp (75b4845da7d469665bd48d1a49badcc3677bf5cd) nautilus (stable), process radosgw, pid 1
debug 2020-08-18 07:22:12.763 3ffbb8df110 0 WARNING: skipping unknown framework: beast
debug 2020-08-18 07:22:12.763 3ffbb8df110 1 mgrc service_daemon_register rgw.ocs.storagecluster.cephobjectstore.a metadata {arch=s390x,ceph_release=nautilus,ceph_version=ceph version 14.2.8-91.el8cp (75b4845da7d469665bd48d1a49badcc3677bf5cd) nautilus (stable),ceph_version_short=14.2.8-91.el8cp,container_hostname=rook-ceph-rgw-ocs-storagecluster-cephobjectstore-a-6bc54fdmlx9d,container_image=quay.io/rhceph-dev/rhceph@sha256:eafd1acb0ada5d7cf93699056118aca19ed7a22e4938411d307ef94048746cc8,distro=rhel,distro_description=Red Hat Enterprise Linux 8.2 (Ootpa),distro_version=8.2,frontend_config#0=beast port=8080,frontend_type#0=beast,hostname=worker-2.cluster6.ibm.com,kernel_description=#1 SMP Thu Jul 2 11:57:29 EDT 2020,kernel_version=4.18.0-193.12.1.el8_2.s390x,mem_swap_kb=0,mem_total_kb=33006876,num_handles=1,os=Linux,pid=1,pod_name=rook-ceph-rgw-ocs-storagecluster-cephobjectstore-a-6bc54fdmlx9d,pod_namespace=openshift-storage,zone_id=6f766d5c-f8f5-43eb-8ce1-064c336d9c71,zone_name=ocs-storagecluster-cephobjectstore,zonegroup_id=af572c8b-a831-4d12-8d03-f133df9c48aa,zonegroup_name=ocs-storagecluster-cephobjectstore}
debug 2020-08-18 07:22:47.653 3ff9f7fe910 -1 received signal: Terminated from Kernel ( Could be generated by pthread_kill(), raise(), abort(), alarm() ) UID: 0
debug 2020-08-18 07:22:47.653 3ff9f7fe910 1 handle_sigterm
debug 2020-08-18 07:22:47.653 3ff9f7fe910 1 handle_sigterm set alarm for 120
debug 2020-08-18 07:22:47.663 3ffbb8df110 -1 shutting down
debug 2020-08-18 07:22:48.543 3ffbb8df110 1 final shutdown
[root@ocplnx51 ~]#
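The log above first announces the configured frontend ("framework: beast") and then reports it as unknown. As a rough triage aid, the following minimal Python sketch scans saved `oc logs` text for that pattern. The function name `skipped_frontends` and the two regular expressions are illustrative assumptions based only on the log lines in this report, not part of any Ceph or OpenShift tooling.

```python
import re

# Assumption: these patterns mirror the two log lines quoted in this report.
SKIPPED = re.compile(r"WARNING: skipping unknown framework: (\S+)")
CONFIGURED = re.compile(r"\bframework: (\S+)")


def skipped_frontends(log_text):
    """Return configured frontends that radosgw reported as unknown and skipped."""
    configured = set()
    skipped = []
    for line in log_text.splitlines():
        match = SKIPPED.search(line)
        if match:
            skipped.append(match.group(1))
            continue
        match = CONFIGURED.search(line)
        if match:
            configured.add(match.group(1))
    # Only report frontends that were both requested and then skipped.
    return [name for name in skipped if name in configured]


sample = """debug 2020-08-18 07:22:09.703 3ffbb8df110 0 framework: beast
debug 2020-08-18 07:22:09.713 3ffbb8df110 0 framework conf key: port, val: 8080
debug 2020-08-18 07:22:12.763 3ffbb8df110 0 WARNING: skipping unknown framework: beast"""

print(skipped_frontends(sample))  # prints ['beast']
```

A non-empty result means no frontend is listening on the configured port, which is consistent with the "connection refused" liveness probe failures in the events above.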
Mark Kogan <mkogan> Aug 18, 2020, 5:55 PM (2 days ago)
to Yaniv, me, rhocs-eng, poornima.nayak, Chidanand, Ulrich, OCS-QE, Nourhane, Matt

Hello,

From the log:
"debug 2020-08-18 07:22:12.763 3ffbb8df110 0 WARNING: skipping unknown framework: beast"

When last checked, the Boost library version that we used (1.67) did not support the Boost.Context library on the Z platform, so s390x builds are configured with the -DWITH_BOOST_CONTEXT=OFF CMake flag, which disables the beast frontend.
Please check whether changing the rook-ceph-rgw* pods configuration to use the civetweb frontend resolves the issue.

Following this mail, I re-checked the current status of this limitation, and the circumstances have changed: ceph version 14.2.8-91.el8cp was updated to build with a newer version of the Boost library, boost 1.72, which per the Boost.Context library documentation[1] has added support for the s390x architecture.

If it's possible to arrange access to an s390x development VM (with RHEL or Fedora) for a vstart environment, I would re-test the Beast framework for compilation and functional issues with boost 1.72 on the Z platform.

[1] https://www.boost.org/doc/libs/1_72_0/libs/context/doc/html/context/architectures.html

Regards,
Mark

Ulrich Weigand Aug 18, 2020, 9:02 PM (2 days ago)
to Ken, Mark, Chidanand, Matt, Nourhane, OCS-QE, poornima.nayak, rhocs-eng, me, Yaniv

Mark Kogan <mkogan> wrote on 18.08.2020 14:25:12:

> From the log:
> "debug 2020-08-18 07:22:12.763 3ffbb8df110 0 WARNING: skipping
> unknown framework: beast"
[snip]
> Following this mail, re-checked the current status of this
> limitation and the circumstances have changed,
> ceph version 14.2.8-91.el8cp was updated to build with a newer
> version of the boost library - boost 1.72
> which per the boost context library documentation[1] has added
> support for s390x architecture.

It turns out the support in 1.72 was incomplete; we've added full support in 1.73. But for the RH Ceph builds, we provided a backport of the necessary changes as a patch against 1.72, which I understand should have made boost context (and therefore the beast frontend) work properly on Z. Ken Dreyer worked on integrating this into the latest Ceph builds. Ken, is this supposed to be working now?

> If it's possible to arrange access to an s390x development VM (with
> RHEL or Fedora) for vstart environment,
> would re-test the Beast framework for compilation and functional
> issues with boost 1.72 on the Z platform.

There is supposed to be a dev environment available to Christina Meno's team, but I'm not sure this is already fully set up ...

Bye,
Ulrich
Elad, is this really a blocker? Is it seen consistently, or is QE blocked because of this? If I am not wrong, there is even a workaround available, and this seems to be more of a retest issue.

>> Please check that changing the rook-ceph-rgw* pods configuration to use the civetweb frontend resolves the issue.
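For anyone trying the civetweb workaround quoted above, here is a minimal Python sketch of what the override could look like as a Ceph configuration fragment. It assumes the frontend is selected through Ceph's `rgw_frontends` option and that the daemon's config section is `client.` plus the daemon name registered in the log (`rgw.ocs.storagecluster.cephobjectstore.a`). The helper name `civetweb_override` is hypothetical, and how such an override is delivered to the pods in an OCS deployment is not covered in this report.

```python
import configparser
import io


def civetweb_override(section, port=8080):
    """Render a ceph.conf-style fragment selecting the civetweb frontend.

    Assumption: `section` is the RGW daemon's config section name;
    the value format follows the 'beast port=8080' string seen in the log metadata.
    """
    parser = configparser.ConfigParser()
    parser[section] = {"rgw_frontends": f"civetweb port={port}"}
    buffer = io.StringIO()
    parser.write(buffer)
    return buffer.getvalue()


# Assumed section name, derived from the daemon name in the log above.
print(civetweb_override("client.rgw.ocs.storagecluster.cephobjectstore.a"))
```

The port is kept at 8080 so that the existing liveness probe on /swift/healthcheck would still target the right listener.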
Hi Scott, can someone help us determine the correct build, and therefore the appropriate bug state and release? Thanks
Mudit, OCS QE are not the ones who actively test on IBM Z. I added the blocker? flag so that this bug will not be pushed out to 4.7, in order to allow the IBM team to have a successful deployment in their test executions.
Apologies for the delay. I've tested with the latest build and the issue no longer occurs.

[root@ocsvm2 ~]# oc get csv
NAME                         DISPLAY                       VERSION        REPLACES   PHASE
ocs-operator.v4.6.0-585.ci   OpenShift Container Storage   4.6.0-585.ci              Succeeded
[root@ocsvm2 ~]# oc get pods|grep rgw
rook-ceph-rgw-ocs-storagecluster-cephobjectstore-a-78d6f7dw9vbd   1/1   Running   0   31h
rook-ceph-rgw-ocs-storagecluster-cephobjectstore-b-769755drwjtb   1/1   Running   0   31h
[root@ocsvm2 ~]#
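The verification above is a visual check of `oc get pods|grep rgw`. A minimal Python sketch of the same check on captured output follows, assuming the default column order NAME READY STATUS RESTARTS AGE shown in this report. The function name `unhealthy_rgw_pods` is illustrative only.

```python
def unhealthy_rgw_pods(pods_output):
    """Return (name, ready, status) for RGW pods that are not fully ready and Running."""
    bad = []
    for line in pods_output.splitlines():
        fields = line.split()
        # Assumed columns: NAME READY STATUS RESTARTS AGE
        if len(fields) < 5 or not fields[0].startswith("rook-ceph-rgw-"):
            continue
        name, ready, status = fields[0], fields[1], fields[2]
        ready_count, _, total = ready.partition("/")
        if status != "Running" or ready_count != total:
            bad.append((name, ready, status))
    return bad


crashing = """rook-ceph-rgw-ocs-storagecluster-cephobjectstore-a-75c7557vvfq8 0/1 CrashLoopBackOff 21 56m
rook-ceph-rgw-ocs-storagecluster-cephobjectstore-b-bfc97c99c2tf 0/1 CrashLoopBackOff 21 56m"""

healthy = """rook-ceph-rgw-ocs-storagecluster-cephobjectstore-a-78d6f7dw9vbd 1/1 Running 0 31h
rook-ceph-rgw-ocs-storagecluster-cephobjectstore-b-769755drwjtb 1/1 Running 0 31h"""

print(len(unhealthy_rgw_pods(crashing)))  # prints 2
print(unhealthy_rgw_pods(healthy))        # prints []
```

Since the original failure only appeared after the pods had been Running for about 5 minutes, a check like this is more meaningful when run well after deployment, as in the 31h sample above.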
Clearing the needinfo. Providing dev_ack based on https://bugzilla.redhat.com/show_bug.cgi?id=1870631#c8; there is no fix from the OCS side, though.
Thanks Matt, I think it is clear now. OCS 4.6 is already based on RHCS 4.1z2, which means this issue should have been fixed by now, and that is what Venkat's update reflects. Moving the BZ to ON_QA; QE can mark it VERIFIED.
I've verified this on the latest version of OCS 4.6 and it's fixed now. Hence, this can be closed.

[root@ocplnx31 ~]# oc get csv
NAME                         DISPLAY                       VERSION        REPLACES   PHASE
ocs-operator.v4.6.0-607.ci   OpenShift Container Storage   4.6.0-607.ci              Succeeded
[root@ocplnx31 ~]# oc get pods|grep rgw
rook-ceph-rgw-ocs-storagecluster-cephobjectstore-a-59cbf87jrcn7   1/1   Running   0   4m49s
rook-ceph-rgw-ocs-storagecluster-cephobjectstore-b-7d6fdfdjvqxg   1/1   Running   0   4m49s
[root@ocplnx31 ~]# oc version
Client Version: 4.5.16
Server Version: 4.5.15
Kubernetes Version: v1.18.3+2fbd7c7
[root@ocplnx31 ~]#
Thank you Venkat. Moving this BZ to verified state based on Comment20.
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (Moderate: Red Hat OpenShift Container Storage 4.6.0 security, bug fix, enhancement update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHSA-2020:5605