Description of problem (please be as detailed as possible and provide log snippets):

http://magna002.ceph.redhat.com/ocsci-jenkins/openshift-clusters/j-093vuf1cs36-t4a/j-093vuf1cs36-t4a_20211019T123728/logs/deployment_1634647516/ocs_must_gather/quay-io-rhceph-dev-ocs-must-gather-sha256-0619998acac82e7a758421be7fe47a985142f0cf9f2400e89b7f5782a5eab00c/namespaces/openshift-storage/oc_output/pods_-owide

rook-ceph-rgw-ocs-storagecluster-cephobjectstore-a-7bc5987xwj4t   1/2   CrashLoopBackOff   3 (16s ago)   4m21s   10.129.2.10   compute-0   <none>   <none>

We also saw this log message in the CI:
Ceph cluster health is not OK. Health: HEALTH_WARN 9 daemons have recently crashed

Version of all relevant components (if applicable):
4.9.0-193.ci
OCP 4.9 nightly

Does this issue impact your ability to continue to work with the product (please explain in detail what is the user impact)?
Haven't tried, but I expect it will cause issues.

Is there any workaround available to the best of your knowledge?
No

Rate from 1 - 5 the complexity of the scenario you performed that caused this bug (1 - very simple, 5 - very complex)?
1

Can this issue be reproduced?
Haven't tried yet.

Can this issue be reproduced from the UI?

If this is a regression, please provide more details to justify this:

Steps to Reproduce:
1. Install a VSPHERE UPI FIPS 1AZ RHCOS VSAN cluster (3 masters, 6 workers)
2. Observe the CLBO on the rook-ceph-rgw-ocs-storagecluster-cephobjectstore-a pod

Actual results:
CLBO on the rook-ceph-rgw-ocs-storagecluster-cephobjectstore-a pod

Expected results:
No CLBO on the rook-ceph-rgw-ocs-storagecluster-cephobjectstore-a pod

Additional info:
Logs: http://magna002.ceph.redhat.com/ocsci-jenkins/openshift-clusters/j-093vuf1cs36-t4a/j-093vuf1cs36-t4a_20211019T123728/logs/failed_testcase_ocs_logs_1634647516/test_deployment_ocs_logs/
Job: https://ocs4-jenkins-csb-ocsqe.apps.ocp4.prod.psi.redhat.com/job/qe-deploy-ocs-cluster-prod/2095/console
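For reference, a minimal sketch of the commands that should reproduce the diagnostics above on a live cluster (the pod name is taken from the must-gather output; it assumes the default openshift-storage namespace, that the RGW container is named "rgw", and that the rook-ceph-tools toolbox pod is deployed with the app=rook-ceph-tools label):

$ oc -n openshift-storage get pods -o wide | grep rgw
$ oc -n openshift-storage describe pod rook-ceph-rgw-ocs-storagecluster-cephobjectstore-a-7bc5987xwj4t
$ oc -n openshift-storage logs rook-ceph-rgw-ocs-storagecluster-cephobjectstore-a-7bc5987xwj4t -c rgw --previous
# Ceph health and the recently crashed daemons, via the toolbox pod:
$ TOOLS=$(oc -n openshift-storage get pod -l app=rook-ceph-tools -o name | head -n 1)
$ oc -n openshift-storage rsh "$TOOLS" ceph health detail
$ oc -n openshift-storage rsh "$TOOLS" ceph crash ls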
Trying to reproduce here: https://ocs4-jenkins-csb-ocsqe.apps.ocp4.prod.psi.redhat.com/job/qe-trigger-vsphere-upi-fips-1az-rhcos-vsan-3m-6w-tier4a/94/
The crash looks like the one we saw in https://bugzilla.redhat.com/show_bug.cgi?id=2002220; the only difference is that here the crash happened in the RGW pods rather than in the `radosgw-admin` command executed via the rook operator pod. So for the time being, marking this as a duplicate of the tracker bug in ODF for the ceph fix.

*** This bug has been marked as a duplicate of bug 2013326 ***
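For anyone else hitting this, a quick way to check whether the crash signature matches the one in bug 2002220 is to inspect the crash entries from the toolbox pod (a sketch; the rook-ceph-tools deployment name is an assumption, and <crash-id> is a placeholder taken from the `ceph crash ls` output):

$ oc -n openshift-storage rsh deploy/rook-ceph-tools ceph crash ls
$ oc -n openshift-storage rsh deploy/rook-ceph-tools ceph crash info <crash-id>
# Once triaged, the entries can be archived so the "daemons have recently crashed" HEALTH_WARN clears:
$ oc -n openshift-storage rsh deploy/rook-ceph-tools ceph crash archive-all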
I agree; it seems pretty clearly to be a duplicate. As an aside, I do wonder why this test wasn't using the latest ODF build. The ODF change related to how rook applies the period update got into release 4.9-204.2e8a02b.release_4.9, but this test uses build 193.
Blaine, the execution we did, the one mentioned above, was this job: https://ocs4-jenkins-csb-ocsqe.apps.ocp4.prod.psi.redhat.com/job/qe-trigger-vsphere-upi-fips-1az-rhcos-vsan-3m-6w-tier4a/94/ It started 6 days 0 hr ago, as I see from the job page.

Looking at https://quay.io/repository/rhceph-dev/ocs-registry?tab=tags the latest available build at that time was 4.9.0-193.ci (7 days ago). The next one, 4.9.0-194.ci (6 days ago), was probably produced after we triggered the job, or was not yet marked as stable.

Now the latest build is 4.9.0-201.ci (7 hours ago), and the previous one is 4.9.0-196.ci (4 days ago). So I am not sure where you got the 204 build from.

From what I asked Mudit today here: https://chat.google.com/room/AAAAREGEba8/ygaFVSxp2ME we are still waiting for a Ceph image with the fix, so we don't yet have a build that contains it. Our production pipeline still includes this job, including the FIPS one, so whenever we run the pipeline we get a lot of failed jobs because of these FIPS-related issues in Ceph.
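For completeness, a quick way to confirm which build is actually deployed in a given cluster, in case the job parameters and the running bits diverge (a sketch; the rook-ceph-operator deployment name assumes a default OCS/ODF install):

$ oc -n openshift-storage get csv
$ oc -n openshift-storage get deployment rook-ceph-operator -o jsonpath='{.spec.template.spec.containers[0].image}'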