Bug 2153008
| Summary: | [GSS] False alert of "Cluster Object Store is in unhealthy state for more than 15s" | | |
|---|---|---|---|
| Product: | [Red Hat Storage] Red Hat OpenShift Data Foundation | Reporter: | Sonal <sarora> |
| Component: | ceph | Assignee: | Matt Benjamin (redhat) <mbenjamin> |
| ceph sub component: | RGW | QA Contact: | Elad <ebenahar> |
| Status: | NEW --- | Docs Contact: | |
| Severity: | high | | |
| Priority: | high | CC: | bniver, brgardne, jthottan, mbenjamin, mkasturi, muagarwa, nojha, odf-bz-bot, rzarzyns, sostapov, tnielsen, vumrao |
| Version: | 4.11 | | |
| Target Milestone: | --- | | |
| Target Release: | --- | | |
| Hardware: | x86_64 | | |
| OS: | Linux | | |
| Whiteboard: | | | |
| Fixed In Version: | | Doc Type: | If docs needed, set a value |
| Doc Text: | | Story Points: | --- |
| Clone Of: | | Environment: | |
| Last Closed: | | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | | | |
|
Description
Sonal
2022-12-13 18:45:23 UTC
RGW is returning error 500 and I'm not sure why. I copied a section of the RGW logs below, but I think we need to get someone experienced with RGW to take a look. I don't see anything misconfigured so far.

```
2022-11-22T09:36:30.191955329+01:00 debug 2022-11-22T08:36:30.190+0000 7fa5ac5f3700 1 ====== starting new request req=0x7fa4d5c45630 =====
2022-11-22T09:36:30.197800379+01:00 debug 2022-11-22T08:36:30.196+0000 7fa53d515700 0 WARNING: set_req_state_err err_no=5 resorting to 500
2022-11-22T09:36:30.198185831+01:00 debug 2022-11-22T08:36:30.197+0000 7fa53d515700 1 ====== req done req=0x7fa4d5c45630 op status=-5 http_status=500 latency=0.007000081s ======
2022-11-22T09:36:30.198280655+01:00 debug 2022-11-22T08:36:30.197+0000 7fa53d515700 1 beast: 0x7fa4d5c45630: 10.128.4.21 - noobaa-ceph-objectstore-user [22/Nov/2022:08:36:30.190 +0000] "PUT /nb.1637066608462.apps.ocp-test.openshift-dpc.local/noobaa_blocks/6193a7700e500e002327f8d6/blocks_tree/other.blocks/_test_store_perf HTTP/1.1" 500 1371 - "aws-sdk-nodejs/2.1127.0 linux/v14.18.2 promise" - latency=0.007000081s
2022-11-22T09:36:30.338275062+01:00 debug 2022-11-22T08:36:30.337+0000 7fa552d40700 1 ====== starting new request req=0x7fa4d5c45630 =====
2022-11-22T09:36:30.345020431+01:00 debug 2022-11-22T08:36:30.344+0000 7fa52f4f9700 0 WARNING: set_req_state_err err_no=5 resorting to 500
2022-11-22T09:36:30.345648597+01:00 debug 2022-11-22T08:36:30.344+0000 7fa52f4f9700 1 ====== req done req=0x7fa4d5c45630 op status=-5 http_status=500 latency=0.007000081s ======
2022-11-22T09:36:30.345983860+01:00 debug 2022-11-22T08:36:30.344+0000 7fa52f4f9700 1 beast: 0x7fa4d5c45630: 10.128.4.21 - noobaa-ceph-objectstore-user [22/Nov/2022:08:36:30.337 +0000] "PUT /nb.1637066608462.apps.ocp-test.openshift-dpc.local/noobaa_blocks/6193a7700e500e002327f8d6/blocks_tree/other.blocks/_test_store_perf HTTP/1.1" 500 1371 - "aws-sdk-nodejs/2.1127.0 linux/v14.18.2 promise" - latency=0.007000081s
2022-11-22T09:36:30.607642277+01:00 debug 2022-11-22T08:36:30.606+0000 7fa4f6c88700 1 ====== starting new request req=0x7fa4d5c45630 =====
2022-11-22T09:36:30.614114065+01:00 debug 2022-11-22T08:36:30.613+0000 7fa544523700 0 WARNING: set_req_state_err err_no=5 resorting to 500
2022-11-22T09:36:30.614479947+01:00 debug 2022-11-22T08:36:30.613+0000 7fa544523700 1 ====== req done req=0x7fa4d5c45630 op status=-5 http_status=500 latency=0.007000081s ======
2022-11-22T09:36:30.614586529+01:00 debug 2022-11-22T08:36:30.613+0000 7fa544523700 1 beast: 0x7fa4d5c45630: 10.128.4.21 - noobaa-ceph-objectstore-user [22/Nov/2022:08:36:30.606 +0000] "PUT /nb.1637066608462.apps.ocp-test.openshift-dpc.local/noobaa_blocks/6193a7700e500e002327f8d6/blocks_tree/other.blocks/_test_store_perf HTTP/1.1" 500 1371 - "aws-sdk-nodejs/2.1127.0 linux/v14.18.2 promise" - latency=0.007000081s
2022-11-22T09:36:30.628885237+01:00 debug 2022-11-22T08:36:30.628+0000 7fa5a95ed700 1 ====== starting new request req=0x7fa4d5c45630 =====
2022-11-22T09:36:30.631082542+01:00 debug 2022-11-22T08:36:30.630+0000 7fa5a95ed700 1 ====== req done req=0x7fa4d5c45630 op status=-2 http_status=204 latency=0.002000023s ======
2022-11-22T09:36:30.631371320+01:00 debug 2022-11-22T08:36:30.630+0000 7fa5a95ed700 1 beast: 0x7fa4d5c45630: 10.128.4.21 - noobaa-ceph-objectstore-user [22/Nov/2022:08:36:30.628 +0000] "DELETE /nb.1637066608462.apps.ocp-test.openshift-dpc.local/noobaa_blocks/6193a7700e500e002327f8d6/blocks_tree/other.blocks/test-delete-non-existing-key-1669106190622 HTTP/1.1" 204 0 - "aws-sdk-nodejs/2.1127.0 linux/v14.18.2 promise" - latency=0.002000023s
2022-11-22T09:36:37.089010119+01:00 debug 2022-11-22T08:36:37.088+0000 7fa58c5b3700 1 ====== starting new request req=0x7fa61c6ce630 =====
2022-11-22T09:36:37.089444074+01:00 debug 2022-11-22T08:36:37.088+0000 7fa58c5b3700 1 ====== req done req=0x7fa61c6ce630 op status=0 http_status=200 latency=0.000000000s ======
2022-11-22T09:36:37.089621302+01:00 debug 2022-11-22T08:36:37.088+0000 7fa58c5b3700 1 beast: 0x7fa61c6ce630: 10.128.4.1 - - [22/Nov/2022:08:36:37.088 +0000] "GET /swift/healthcheck HTTP/1.1" 200 0 - "kube-probe/1.24" - latency=0.000000000s
2022-11-22T09:36:47.088527876+01:00 debug 2022-11-22T08:36:47.086+0000 7fa50b4b1700 1 ====== starting new request req=0x7fa61c6ce630 =====
2022-11-22T09:36:47.088995912+01:00 debug 2022-11-22T08:36:47.087+0000 7fa50b4b1700 1 ====== req done req=0x7fa61c6ce630 op status=0 http_status=200 latency=0.001000012s ======
2022-11-22T09:36:47.089135565+01:00 debug 2022-11-22T08:36:47.087+0000 7fa50b4b1700 1 beast: 0x7fa61c6ce630: 10.128.4.1 - - [22/Nov/2022:08:36:47.086 +0000] "GET /swift/healthcheck HTTP/1.1" 200 0 - "kube-probe/1.24" - latency=0.001000012s
2022-11-22T09:36:47.317367259+01:00 debug 2022-11-22T08:36:47.315+0000 7fa607eaa700 0 rgw UsageLogger: WARNING: RGWRados::log_usage(): user name empty (bucket=), skipping
2022-11-22T09:36:57.087894797+01:00 debug 2022-11-22T08:36:57.085+0000 7fa55954d700 1 ====== starting new request req=0x7fa61c6ce630 =====
2022-11-22T09:36:57.088352163+01:00 debug 2022-11-22T08:36:57.086+0000 7fa55954d700 1 ====== req done req=0x7fa61c6ce630 op status=0 http_status=200 latency=0.000000000s ======
2022-11-22T09:36:57.088488432+01:00 debug 2022-11-22T08:36:57.086+0000 7fa55954d700 1 beast: 0x7fa61c6ce630: 10.128.4.1 - - [22/Nov/2022:08:36:57.085 +0000] "GET /swift/healthcheck HTTP/1.1" 200 0 - "kube-probe/1.24" - latency=0.000000000s
2022-11-22T09:37:07.088412567+01:00 debug 2022-11-22T08:37:07.087+0000 7fa51ccd4700 1 ====== starting new request req=0x7fa4d5c45630 =====
2022-11-22T09:37:07.088918407+01:00 debug 2022-11-22T08:37:07.087+0000 7fa51ccd4700 1 ====== req done req=0x7fa4d5c45630 op status=0 http_status=200 latency=0.000000000s ======
2022-11-22T09:37:07.089134150+01:00 debug 2022-11-22T08:37:07.087+0000 7fa51ccd4700 1 beast: 0x7fa4d5c45630: 10.128.4.1 - - [22/Nov/2022:08:37:07.087 +0000] "GET /swift/healthcheck HTTP/1.1" 200 0 - "kube-probe/1.24" - latency=0.000000000s
```

In the meantime, could you increase the log level for the RGW, then restart the RGW pod, and then collect a new must-gather after 10 minutes or so?
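
For convenience, this is roughly how the log level can be raised on a Rook/ODF cluster. A sketch only, assuming the default `openshift-storage` namespace, the stock Rook pod labels, and that the toolbox pod is deployed; the must-gather image reference is a placeholder to fill in for your ODF version:

```
# Raise RGW verbosity from the Rook toolbox pod
oc -n openshift-storage rsh $(oc -n openshift-storage get pod -l app=rook-ceph-tools -o name)
ceph config set client.rgw debug_rgw 20
ceph config set client.rgw debug_ms 1
exit

# Restart the RGW pod; the Rook operator recreates it with the new settings
oc -n openshift-storage delete pod -l app=rook-ceph-rgw

# After ~10 minutes of traffic, collect a fresh must-gather
oc adm must-gather --image=<odf-must-gather-image>
```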
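
It may also help to replay one of the failing PUTs outside NooBaa: err_no=5 / op status=-5 is EIO, presumably surfacing from the backing RADOS call, and the requests themselves are plain S3 PUTs. A minimal sketch with the AWS CLI; the secret name follows Rook's usual `rook-ceph-object-user-<store>-<user>` convention, and the `<store>`, `<rgw-service>`, and credential placeholders have to be filled in for this cluster:

```
# Pull the S3 credentials for the NooBaa object-store user from its Rook secret
oc -n openshift-storage get secret \
  rook-ceph-object-user-<store>-noobaa-ceph-objectstore-user \
  -o jsonpath='{.data.AccessKey}' | base64 -d
oc -n openshift-storage get secret \
  rook-ceph-object-user-<store>-noobaa-ceph-objectstore-user \
  -o jsonpath='{.data.SecretKey}' | base64 -d

# Replay the failing PUT from the logs against the RGW service endpoint
export AWS_ACCESS_KEY_ID=<AccessKey>
export AWS_SECRET_ACCESS_KEY=<SecretKey>
echo test > /tmp/obj
aws --endpoint-url http://<rgw-service>:80 s3api put-object \
  --bucket nb.1637066608462.apps.ocp-test.openshift-dpc.local \
  --key noobaa_blocks/6193a7700e500e002327f8d6/blocks_tree/other.blocks/_test_store_perf \
  --body /tmp/obj
```

If that PUT reproduces the 500 with the raised log level, the new must-gather should capture exactly where the EIO originates.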