Bug 1885971 - ocs-storagecluster-cephobjectstore doesn't report true state of RGW
Summary: ocs-storagecluster-cephobjectstore doesn't report true state of RGW
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat OpenShift Container Storage
Classification: Red Hat Storage
Component: rook
Version: 4.6
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: high
Target Milestone: ---
Target Release: OCS 4.6.0
Assignee: Sébastien Han
QA Contact: Filip Balák
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2020-10-07 11:37 UTC by Filip Balák
Modified: 2021-06-01 08:48 UTC
CC List: 4 users

Fixed In Version: 4.6.0-127.ci
Doc Type: No Doc Update
Doc Text:
Clone Of:
Environment:
Last Closed: 2020-12-17 06:24:44 UTC
Embargoed:




Links
System ID Private Priority Status Summary Last Updated
Github openshift rook pull 133 0 None closed Bug 1885971: ceph: reduce s3 max retry 2021-01-12 07:53:53 UTC
Github rook rook pull 6408 0 None closed ceph: reduce s3 max retry 2021-01-12 07:53:53 UTC
Red Hat Product Errata RHSA-2020:5605 0 None None None 2020-12-17 06:25:26 UTC

Description Filip Balák 2020-10-07 11:37:50 UTC
Description of problem (please be as detailed as possible and provide log
snippets):
When no rgw pods are available, ocs-storagecluster-cephobjectstore reports:
(...)
status:
  bucketStatus:
    health: Connected
    lastChecked: '2020-10-07T11:14:48Z'
  info:
    endpoint: >-
      http://rook-ceph-rgw-ocs-storagecluster-cephobjectstore.[some-cluster-address]:80
  phase: Connected

Version of all relevant components (if applicable):
OCS: ocs-operator.v4.6.0-113.ci
OCP: 4.6.0-0.nightly-2020-10-07-002702
Platform: VMware

Does this issue impact your ability to continue to work with the product
(please explain in detail what is the user impact)?
Due to this issue, the ClusterObjectStoreState alert is not triggered when RGW is not available.

Is there any workaround available to the best of your knowledge?


Rate from 1 - 5 the complexity of the scenario you performed that caused this
bug (1 - very simple, 5 - very complex)?
2

Is this issue reproducible?
yes

Can this issue be reproduced from the UI?
yes

If this is a regression, please provide more details to justify this:


Steps to Reproduce:
1. In OCP UI navigate to Workloads -> Deployments and filter 'rgw' deployments.
2. Scale all 'rgw' deployments to 0 pods.
3. Navigate to Storage -> Object Bucket Claims.
4. Create OBC that uses ocs-storagecluster-ceph-rgw storageclass.
5. Navigate to Home -> Overview and in Object Service tab click on Object Service.
6. Navigate to Home -> Explore. Search for CephObjectStore. Inspect ocs-storagecluster-cephobjectstore instance.


Actual results:
Object Gateway (RGW) is reported as healthy when no rgw pod is available. The state does not change even when OBCs that use the rgw storageclass are stuck because no RGW is available.

Expected results:
Status of RGW should be correctly reported when no RGW pod is available.

Additional info:
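For illustration only (this is not Rook's actual health check; the endpoint, credentials, region and retry values below are placeholder assumptions): a fail-fast S3 probe against the RGW service, with the SDK's retry count kept low and a short per-attempt timeout, so that a gateway with zero running pods surfaces as an error instead of the status staying 'Connected'. This is in the spirit of the linked fix "ceph: reduce s3 max retry".

package main

import (
	"fmt"
	"net/http"
	"time"

	"github.com/aws/aws-sdk-go/aws"
	"github.com/aws/aws-sdk-go/aws/credentials"
	"github.com/aws/aws-sdk-go/aws/session"
	"github.com/aws/aws-sdk-go/service/s3"
)

// checkRGW returns nil when the object store answers and the underlying
// error when it does not (for example when all RGW pods are scaled to 0).
func checkRGW(endpoint, accessKey, secretKey string) error {
	sess, err := session.NewSession(&aws.Config{
		Endpoint:         aws.String(endpoint),
		Region:           aws.String("us-east-1"), // RGW ignores the region, but the SDK requires one
		Credentials:      credentials.NewStaticCredentials(accessKey, secretKey, ""),
		S3ForcePathStyle: aws.Bool(true),
		MaxRetries:       aws.Int(2),                              // fail fast instead of retrying for a long time
		HTTPClient:       &http.Client{Timeout: 10 * time.Second}, // bound each attempt
	})
	if err != nil {
		return err
	}
	// ListBuckets is a cheap call that proves the gateway is actually reachable.
	_, err = s3.New(sess).ListBuckets(&s3.ListBucketsInput{})
	return err
}

func main() {
	// Placeholder in-cluster service endpoint and credentials.
	err := checkRGW(
		"http://rook-ceph-rgw-ocs-storagecluster-cephobjectstore.openshift-storage.svc:80",
		"ACCESS_KEY", "SECRET_KEY")
	if err != nil {
		fmt.Println("object store health: Failure:", err)
		return
	}
	fmt.Println("object store health: Connected")
}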

Comment 4 Sébastien Han 2020-10-08 08:57:15 UTC
Filip, I'm not sure what's going on but I see rgw deployments here http://magna002.ceph.redhat.com/ocsci-jenkins/openshift-clusters/fbalak-vm7/fbalak-vm7_20201007T061706/logs/testcases_1602073024/ocs_must_gather/quay-io-rhceph-dev-ocs-must-gather-sha256-9dfb64c63dd8f8ee033aff511a4ffd2906ffe2a7b637deb5c81d50b8c20eaffa/namespaces/openshift-storage/apps/deployments.yaml

However they never get ready:

NAME                                                                 READY   UP-TO-DATE   AVAILABLE   AGE
deployment.apps/rook-ceph-rgw-ocs-storagecluster-cephobjectstore-a   0/0     0            0           5h26m
deployment.apps/rook-ceph-rgw-ocs-storagecluster-cephobjectstore-b   0/0     0            0           5h26m


So there are no pods running...

Is it possible to access the env?
Thanks

Comment 5 Filip Balák 2020-10-08 09:14:00 UTC
Hi,

there are no pods because I scaled those deployments to 0 (as part of reproducer '2. Scale all 'rgw' deployments to 0 pods.'). The problem is that it is not reported anywhere that RGW is unavailable.

I don't have the env at the moment and I'm not sure I will have it this week, but if I do, I will ping you.

Comment 6 Sébastien Han 2020-10-08 09:17:39 UTC
Oh ok, that helps a lot actually. How long did you wait?
The check runs every minute, so one minute after scaling down the deployment the status should be updated properly.
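
(For illustration, a minimal sketch of such a once-a-minute probe loop, not the operator's actual reconcile code; the probe function and interval are hypothetical parameters, and the status update is only logged rather than patched onto the CephObjectStore resource.)

package main

import (
	"log"
	"time"
)

// runHealthLoop probes the object store on a fixed interval and derives a
// phase from the result. In the real operator the CephObjectStore status
// would be updated instead of just logged.
func runHealthLoop(probe func() error, interval time.Duration) {
	ticker := time.NewTicker(interval)
	defer ticker.Stop()

	for range ticker.C {
		phase := "Connected"
		if err := probe(); err != nil {
			// A failed probe flips the reported phase so that alerts
			// such as ClusterObjectStoreState can fire.
			phase = "Failure"
			log.Printf("object store probe failed: %v", err)
		}
		log.Printf("object store phase: %s", phase)
	}
}

func main() {
	// Stand-in probe; in practice this would be an S3 call against the
	// RGW endpoint, e.g. the checkRGW sketch in the bug description.
	probe := func() error { return nil }
	runHealthLoop(probe, time.Minute)
}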

Comment 9 Filip Balák 2020-10-14 08:59:22 UTC
Monitoring of RGW health seems fixed now. When RGW pods are scaled to 0, RGW is displayed as red and in an error state on the Object Service dashboard, and the ClusterObjectStoreState alert is triggered.

Tested with:
OCP: 4.6.0-0.nightly-2020-10-13-064047
OCS: ocs-operator.v4.6.0-131.ci

Comment 10 Sébastien Han 2020-10-14 09:58:31 UTC
Thanks for verifying Filip.

Comment 13 errata-xmlrpc 2020-12-17 06:24:44 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Moderate: Red Hat OpenShift Container Storage 4.6.0 security, bug fix, enhancement update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2020:5605

