Description of problem (please be as detailed as possible and provide log snippets):

http://magna002.ceph.redhat.com/ocsci-jenkins/openshift-clusters/j-093vuf1cs36-t4a/j-093vuf1cs36-t4a_20211019T123728/logs/deployment_1634647516/ocs_must_gather/quay-io-rhceph-dev-ocs-must-gather-sha256-0619998acac82e7a758421be7fe47a985142f0cf9f2400e89b7f5782a5eab00c/namespaces/openshift-storage/oc_output/pods_-owide

rook-ceph-rgw-ocs-storagecluster-cephobjectstore-a-7bc5987xwj4t   1/2   CrashLoopBackOff   3 (16s ago)   4m21s   10.129.2.10   compute-0   <none>   <none>

We also saw this log message in the CI:
Ceph cluster health is not OK. Health: HEALTH_WARN 9 daemons have recently crashed

Version of all relevant components (if applicable):
4.9.0-193.ci
OCP 4.9 nightly

Does this issue impact your ability to continue to work with the product (please explain in detail what is the user impact)?
Haven't tried, but I expect it will cause issues.

Is there any workaround available to the best of your knowledge?
No

Rate from 1 - 5 the complexity of the scenario you performed that caused this bug (1 - very simple, 5 - very complex)?
1

Can this issue be reproduced?
Haven't tried yet.

Can this issue be reproduced from the UI?

If this is a regression, please provide more details to justify this:

Steps to Reproduce:
1. Install a VSPHERE UPI FIPS 1AZ RHCOS VSAN cluster (3 masters, 6 workers)
2. Observe the CLBO on the rook-ceph-rgw-ocs-storagecluster-cephobjectstore-a pod

Actual results:
CLBO on the rook-ceph-rgw-ocs-storagecluster-cephobjectstore-a pod

Expected results:
No CLBO on the rook-ceph-rgw-ocs-storagecluster-cephobjectstore-a pod

Additional info:
Logs: http://magna002.ceph.redhat.com/ocsci-jenkins/openshift-clusters/j-093vuf1cs36-t4a/j-093vuf1cs36-t4a_20211019T123728/logs/failed_testcase_ocs_logs_1634647516/test_deployment_ocs_logs/
Job: https://ocs4-jenkins-csb-ocsqe.apps.ocp4.prod.psi.redhat.com/job/qe-deploy-ocs-cluster-prod/2095/console
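For reference, a minimal sketch of the commands that should reproduce the diagnostics above on a live cluster (the pod name is taken from the must-gather output; it assumes the default openshift-storage namespace, that the RGW container is named "rgw", and that the rook-ceph-tools toolbox pod is deployed with the app=rook-ceph-tools label):

$ oc -n openshift-storage get pods -o wide | grep rgw
$ oc -n openshift-storage describe pod rook-ceph-rgw-ocs-storagecluster-cephobjectstore-a-7bc5987xwj4t
$ oc -n openshift-storage logs rook-ceph-rgw-ocs-storagecluster-cephobjectstore-a-7bc5987xwj4t -c rgw --previous
# Ceph health and the recently crashed daemons, via the toolbox pod:
$ TOOLS=$(oc -n openshift-storage get pod -l app=rook-ceph-tools -o name | head -n 1)
$ oc -n openshift-storage rsh "$TOOLS" ceph health detail
$ oc -n openshift-storage rsh "$TOOLS" ceph crash ls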
Trying to reproduce here: https://ocs4-jenkins-csb-ocsqe.apps.ocp4.prod.psi.redhat.com/job/qe-trigger-vsphere-upi-fips-1az-rhcos-vsan-3m-6w-tier4a/94/
The crash looks like the one we saw in https://bugzilla.redhat.com/show_bug.cgi?id=2002220; the only difference is that here the crash happened in the RGW pods rather than in the `radosgw-admin` command executed via the rook operator pod. So for the time being, marking this as a duplicate of the tracker bug in ODF for the ceph fix.

*** This bug has been marked as a duplicate of bug 2013326 ***
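For anyone else hitting this, a quick way to check whether the crash signature matches the one in bug 2002220 is to inspect the crash entries from the toolbox pod (a sketch; the rook-ceph-tools deployment name is an assumption, and <crash-id> is a placeholder taken from the `ceph crash ls` output):

$ oc -n openshift-storage rsh deploy/rook-ceph-tools ceph crash ls
$ oc -n openshift-storage rsh deploy/rook-ceph-tools ceph crash info <crash-id>
# Once triaged, the entries can be archived so the "daemons have recently crashed" HEALTH_WARN clears:
$ oc -n openshift-storage rsh deploy/rook-ceph-tools ceph crash archive-all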
I agree; it seems pretty clearly to be a duplicate. As an aside, I do wonder why this test wasn't using the latest ODF build. The ODF change related to how rook applies the period update got into release 4.9-204.2e8a02b.release_4.9, but this test uses build 193.
Blaine, the execution we did, the one mentioned above, was this job: https://ocs4-jenkins-csb-ocsqe.apps.ocp4.prod.psi.redhat.com/job/qe-trigger-vsphere-upi-fips-1az-rhcos-vsan-3m-6w-tier4a/94/ It started 6 days 0 hr ago, as I see from the job page.

Looking at https://quay.io/repository/rhceph-dev/ocs-registry?tab=tags the latest available build at that time was 4.9.0-193.ci (7 days ago). The next one, 4.9.0-194.ci (6 days ago), was probably produced after we triggered the job, or was not yet marked as stable.

Now the latest build is 4.9.0-201.ci (7 hours ago), and the previous one is 4.9.0-196.ci (4 days ago). So I am not sure where you got the 204 build from.

From what I asked Mudit today here: https://chat.google.com/room/AAAAREGEba8/ygaFVSxp2ME we are still waiting for a Ceph image with the fix, so we don't yet have a build that contains it. Our production pipeline still includes this job, including the FIPS one, so whenever we run the pipeline we get a lot of failed jobs because of these FIPS-related issues in Ceph.
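For completeness, a quick way to confirm which build is actually deployed in a given cluster, in case the job parameters and the running bits diverge (a sketch; the rook-ceph-operator deployment name assumes a default OCS/ODF install):

$ oc -n openshift-storage get csv
$ oc -n openshift-storage get deployment rook-ceph-operator -o jsonpath='{.spec.template.spec.containers[0].image}'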