Bug 2165395 - RGW Errors on baremetal compact-mode OCP clusters: "Receive failed: Disconnected", "brokers are down"
Keywords: Automation, Reopened
Status: CLOSED WORKSFORME
Alias: None
Product: Red Hat OpenShift Data Foundation
Classification: Red Hat Storage
Component: Multi-Cloud Object Gateway
Sub Component: ---
Version: 4.12
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: unspecified
Target Milestone: ---
Target Release: ---
Assignee: Nimrod Becker
QA Contact: krishnaram Karthick
Docs Contact:
URL:
Whiteboard:
Depends On:
Blocks:
Reported: 2023-01-29 19:45 UTC by Sagi Hirshfeld
Modified: 2023-08-09 16:49 UTC
CC List: 9 users (bniver, jthottan, kramdoss, mbenjamin, muagarwa, nbecker, ocs-bugs, odf-bz-bot, sostapov)
Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Description of problem (please be detailed as possible and provide log snippets):
--------------------------------------------------------------------------------
RGW-based MCG NamespaceStores have shown inconsistent stability in recent ODF-QE tests on baremetal compact-mode OCP clusters on NVME-intel. In some cases they failed to reach the OPTIMAL state within the 3-minute wait during setup, and in others they failed while IO operations were in progress. While some of the must-gather logs show the RGW NSS in a healthy state, the "oc get events" output for all of these cases shows that at some point the RGW NSS reached the Rejected or IO_ERRORS phase:

...output omitted...
19m   Warning   RejectedNamespaceStore   bucketclass/oc-bucketclass-3f2cc10ea3eb497db48d4dad3   NooBaa NamespaceStore "rgw-ns-store-e7794ffc2c4640a68c918b43f19" is in rejected phase
...output omitted...
20m   Warning   NamespaceStorePhaseRejected   namespacestore/rgw-ns-store-e7794ffc2c4640a68c918b43f19   Namespace store mode: IO_ERRORS
...output omitted...

Upon further investigation of the RGW logs, we found the following errors around the time our test failed:

...output omitted...
2023-01-12T07:04:00.624573970Z debug 2023-01-12T07:04:00.623+0000 7fe3f2693700 1 ====== starting new request req=0x7fe326e1c630 =====
2023-01-12T07:04:00.624981105Z debug 2023-01-12T07:04:00.623+0000 7fe3f2693700 1 ====== req done req=0x7fe326e1c630 op status=0 http_status=200 latency=0.000000000s ======
2023-01-12T07:04:00.625019756Z debug 2023-01-12T07:04:00.623+0000 7fe3f2693700 1 beast: 0x7fe326e1c630: 10.131.0.1 - - [12/Jan/2023:07:04:00.623 +0000] "GET /swift/healthcheck HTTP/1.1" 200 0 - "kube-probe/1.25" - latency=0.000000000s
2023-01-12T07:04:07.458514503Z %3|1673507047.458|FAIL|rdkafka#producer-1| [thrd:my-cluster-kafka-1.my-cluster-kafka-brokers.myproject.svc:9092/]: my-cluster-kafka-1.my-cluster-kafka-brokers.myproject.svc:9092/1: Receive failed: Disconnected
2023-01-12T07:04:07.458514503Z %3|1673507047.458|ERROR|rdkafka#producer-1| [thrd:my-cluster-kafka-1.my-cluster-kafka-brokers.myproject.svc:9092/]: my-cluster-kafka-1.my-cluster-kafka-brokers.myproject.svc:9092/1: Receive failed: Disconnected
2023-01-12T07:04:07.466265915Z %3|1673507047.466|FAIL|rdkafka#producer-1| [thrd:my-cluster-kafka-2.my-cluster-kafka-brokers.myproject.svc:9092/]: my-cluster-kafka-2.my-cluster-kafka-brokers.myproject.svc:9092/2: Receive failed: Disconnected
2023-01-12T07:04:07.466265915Z %3|1673507047.466|ERROR|rdkafka#producer-1| [thrd:my-cluster-kafka-2.my-cluster-kafka-brokers.myproject.svc:9092/]: my-cluster-kafka-2.my-cluster-kafka-brokers.myproject.svc:9092/2: Receive failed: Disconnected
2023-01-12T07:04:07.466442247Z %3|1673507047.466|FAIL|rdkafka#producer-1| [thrd:my-cluster-kafka-bootstrap.myproject.svc.cluster.local:9092/boo]: my-cluster-kafka-bootstrap.myproject.svc.cluster.local:9092/bootstrap: Receive failed: Disconnected
2023-01-12T07:04:07.466442247Z %3|1673507047.466|ERROR|rdkafka#producer-1| [thrd:my-cluster-kafka-bootstrap.myproject.svc.cluster.local:9092/boo]: my-cluster-kafka-bootstrap.myproject.svc.cluster.local:9092/bootstrap: Receive failed: Disconnected
2023-01-12T07:04:07.479628654Z %3|1673507047.479|FAIL|rdkafka#producer-1| [thrd:my-cluster-kafka-0.my-cluster-kafka-brokers.myproject.svc:9092/]: my-cluster-kafka-0.my-cluster-kafka-brokers.myproject.svc:9092/0: Receive failed: Disconnected
2023-01-12T07:04:07.479628654Z %3|1673507047.479|ERROR|rdkafka#producer-1| [thrd:my-cluster-kafka-0.my-cluster-kafka-brokers.myproject.svc:9092/]: my-cluster-kafka-0.my-cluster-kafka-brokers.myproject.svc:9092/0: Receive failed: Disconnected
2023-01-12T07:04:07.479628654Z %3|1673507047.479|ERROR|rdkafka#producer-1| [thrd:my-cluster-kafka-0.my-cluster-kafka-brokers.myproject.svc:9092/]: 4/4 brokers are down
2023-01-12T07:04:07.568715942Z %3|1673507047.568|FAIL|rdkafka#producer-1| [thrd:my-cluster-kafka-2.my-cluster-kafka-brokers.myproject.svc:9092/]: my-cluster-kafka-2.my-cluster-kafka-brokers.myproject.svc:9092/2: Failed to resolve 'my-cluster-kafka-2.my-cluster-kafka-brokers.myproject.svc:9092': Name or service not known
2023-01-12T07:04:07.568715942Z %3|1673507047.568|ERROR|rdkafka#producer-1| [thrd:my-cluster-kafka-2.my-cluster-kafka-brokers.myproject.svc:9092/]: my-cluster-kafka-2.my-cluster-kafka-brokers.myproject.svc:9092/2: Failed to resolve 'my-cluster-kafka-2.my-cluster-kafka-brokers.myproject.svc:9092': Name or service not known
2023-01-12T07:04:07.569285598Z %3|1673507047.569|FAIL|rdkafka#producer-1| [thrd:my-cluster-kafka-bootstrap.myproject.svc.cluster.local:9092/boo]: my-cluster-kafka-bootstrap.myproject.svc.cluster.local:9092/bootstrap: Failed to resolve 'my-cluster-kafka-bootstrap.myproject.svc.cluster.local:9092': Name or service not known
2023-01-12T07:04:07.569285598Z %3|1673507047.569|ERROR|rdkafka#producer-1| [thrd:my-cluster-kafka-bootstrap.myproject.svc.cluster.local:9092/boo]: my-cluster-kafka-bootstrap.myproject.svc.cluster.local:9092/bootstrap: Failed to resolve 'my-cluster-kafka-bootstrap.myproject.svc.cluster.local:9092': Name or service not known
2023-01-12T07:04:07.576259250Z %3|1673507047.576|FAIL|rdkafka#producer-1| [thrd:my-cluster-kafka-1.my-cluster-kafka-brokers.myproject.svc:9092/]: my-cluster-kafka-1.my-cluster-kafka-brokers.myproject.svc:9092/1: Failed to resolve 'my-cluster-kafka-1.my-cluster-kafka-brokers.myproject.svc:9092': Name or service not known
2023-01-12T07:04:07.576259250Z %3|1673507047.576|ERROR|rdkafka#producer-1| [thrd:my-cluster-kafka-1.my-cluster-kafka-brokers.myproject.svc:9092/]: my-cluster-kafka-1.my-cluster-kafka-brokers.myproject.svc:9092/1: Failed to resolve 'my-cluster-kafka-1.my-cluster-kafka-brokers.myproject.svc:9092': Name or service not known
2023-01-12T07:04:07.588800431Z %3|1673507047.588|FAIL|rdkafka#producer-1| [thrd:my-cluster-kafka-0.my-cluster-kafka-brokers.myproject.svc:9092/]: my-cluster-kafka-0.my-cluster-kafka-brokers.myproject.svc:9092/0: Failed to resolve 'my-cluster-kafka-0.my-cluster-kafka-brokers.myproject.svc:9092': Name or service not known
2023-01-12T07:04:07.588800431Z %3|1673507047.588|ERROR|rdkafka#producer-1| [thrd:my-cluster-kafka-0.my-cluster-kafka-brokers.myproject.svc:9092/]: my-cluster-kafka-0.my-cluster-kafka-brokers.myproject.svc:9092/0: Failed to resolve 'my-cluster-kafka-0.my-cluster-kafka-brokers.myproject.svc:9092': Name or service not known
2023-01-12T07:04:10.624492268Z debug 2023-01-12T07:04:10.623+0000 7fe3e7e7e700 1 ====== starting new request req=0x7fe47029e630 =====
2023-01-12T07:04:10.624767406Z debug 2023-01-12T07:04:10.623+0000 7fe3e7e7e700 1 ====== req done req=0x7fe47029e630 op status=0 http_status=200 latency=0.000000000s ======
2023-01-12T07:04:10.624808709Z debug 2023-01-12T07:04:10.623+0000 7fe3e7e7e700 1 beast: 0x7fe47029e630: 10.131.0.1 - - [12/Jan/2023:07:04:10.623 +0000] "GET /swift/healthcheck HTTP/1.1" 200 0 - "kube-probe/1.25" - latency=0.000000000s
2023-01-12T07:04:13.607976978Z debug 2023-01-12T07:04:13.606+0000 7fe432713700 0 could not get bucket info for bucket=:rgw-oc-bucket-f3ad456cce9243418cf423148d[e2f56d98-464b-446e-8061-d489bbf92468.24737.3]) r=-2
2023-01-12T07:04:13.607976978Z debug 2023-01-12T07:04:13.606+0000 7fe432713700 0 WARNING: sync_bucket() returned r=-2
2023-01-12T07:04:13.701403520Z debug 2023-01-12T07:04:13.700+0000 7fe45b765700 0 rgw UsageLogger: WARNING: RGWRados::log_usage(): user name empty (bucket=), skipping
2023-01-12T07:04:14.608330105Z debug 2023-01-12T07:04:14.607+0000 7fe3cde4a700 1 ====== starting new request req=0x7fe47029e600 =====
2023-01-12T07:04:14.609928165Z debug 2023-01-12T07:04:14.608+0000 7fe3cde4a700 1 ====== req done req=0x7fe47029e600 op status=-2 http_status=204 latency=0.000999992s ======
2023-01-12T07:04:14.609970519Z debug 2023-01-12T07:04:14.608+0000 7fe3cde4a700 1 beast: 0x7fe47029e600: 10.128.2.19 - noobaa-ceph-objectstore-user [12/Jan/2023:07:04:14.607 +0000] "DELETE /nb.1673505526029.apps.ocp-baremetal-auto.qe.rh-ocs.com/noobaa_blocks/63bfaaf69486b1002914c3b5/blocks_tree/other.blocks/test-delete-non-existing-key-1673507054600 HTTP/1.1" 204 0 - "aws-sdk-nodejs/2.1127.0 linux/v14.18.2 promise" - latency=0.000999992s
2023-01-12T07:04:20.624878849Z debug 2023-01-12T07:04:20.623+0000 7fe3c5639700 1 ====== starting new request req=0x7fe32ad03630 =====
2023-01-12T07:04:20.625385209Z debug 2023-01-12T07:04:20.624+0000 7fe3c5639700 1 ====== req done req=0x7fe32ad03630 op status=0 http_status=200 latency=0.000999992s ======
2023-01-12T07:04:20.625385209Z debug 2023-01-12T07:04:20.624+0000 7fe3c5639700 1 beast: 0x7fe32ad03630: 10.131.0.1 - - [12/Jan/2023:07:04:20.623 +0000] "GET /swift/healthcheck HTTP/1.1" 200 0 - "kube-probe/1.25" - latency=0.000999992s
...output omitted...

Versions of all relevant components (if applicable):
--------------------------------------------------------------------------------
ODF 4.12.0-122.stable

Does this issue impact your ability to continue to work with the product (please explain in detail what is the user impact)?
--------------------------------------------------------------------------------
This instability is consistent enough to interrupt IO on 1 MB files.

Is there any workaround available to the best of your knowledge?
--------------------------------------------------------------------------------
Retry the failed IO operations.

Rate from 1 - 5 the complexity of the scenario you performed that caused this bug (1 - very simple, 5 - very complex)?
--------------------------------------------------------------------------------
2

Is this issue reproducible?
--------------------------------------------------------------------------------
Yes, but not consistently.

Can this issue reproduce from the UI?
--------------------------------------------------------------------------------
NA

Steps to Reproduce:
--------------------------------------------------------------------------------
1. Deploy a baremetal compact-mode OCP 4.12 cluster with ODF 4.12
2. Create an RGW-based MCG NamespaceStore
3. Create a bucketclass and a bucket (OBC) on top of the NamespaceStore
4. Upload 10 files of 1 MB or larger onto the bucket (a rough upload-with-retry sketch follows this report)

Actual results:
--------------------------------------------------------------------------------
Some of the uploads fail with the following error:
"An error occurred (InternalError) when calling the GetObject operation (reached max retries: 2): We encountered an internal error. Please try again."

Expected results:
--------------------------------------------------------------------------------
All of the files are uploaded to the bucket without errors.

Additional info:
--------------------------------------------------------------------------------
So far we have only seen these errors in our automated regression testing, and we have yet to reproduce this bug manually.
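Editor's note: the following is a minimal sketch of repro step 4 combined with the stated workaround (retrying failed IO operations), assuming plain boto3 against the S3 endpoint exposed for the OBC. It is not the ODF-QE automation; the endpoint URL, credentials, bucket name, and object keys are placeholders that would normally come from the OBC's ConfigMap and Secret.

# Sketch only: upload ten 1 MB objects, retrying transient InternalError responses.
# Endpoint, credentials, and bucket name are hypothetical placeholders.
import os
import time

import boto3
from botocore.exceptions import ClientError

s3 = boto3.client(
    "s3",
    endpoint_url=os.environ["OBC_S3_ENDPOINT"],          # placeholder, e.g. the RGW/MCG route
    aws_access_key_id=os.environ["AWS_ACCESS_KEY_ID"],    # from the OBC Secret
    aws_secret_access_key=os.environ["AWS_SECRET_ACCESS_KEY"],
    verify=False,  # test clusters often use self-signed route certificates
)

BUCKET = os.environ["OBC_BUCKET_NAME"]                    # placeholder
PAYLOAD = os.urandom(1024 * 1024)                         # one 1 MB body


def put_with_retry(key: str, body: bytes, attempts: int = 3) -> None:
    """Upload a single object, retrying only on InternalError responses."""
    for attempt in range(1, attempts + 1):
        try:
            s3.put_object(Bucket=BUCKET, Key=key, Body=body)
            return
        except ClientError as err:
            code = err.response.get("Error", {}).get("Code", "")
            if code != "InternalError" or attempt == attempts:
                raise
            time.sleep(2 * attempt)  # simple backoff before retrying


for i in range(10):
    put_with_retry(f"rgw-nss-test/object-{i}", PAYLOAD)

In the failing runs described above, the first attempt occasionally returns InternalError; a retry along these lines has been sufficient as a workaround, per the report.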
Last Closed: 2023-07-14 12:31:26 UTC
Embargoed: