Bug 2165395 - RGW Errors on baremetal compact-mode OCP clusters: "Received failed: Disconnected", "brokers are down"
Summary: RGW Errors on baremetal compact-mode OCP clusters: "Received failed: Disconnected", "brokers are down"
Keywords:
Status: CLOSED WORKSFORME
Alias: None
Product: Red Hat OpenShift Data Foundation
Classification: Red Hat Storage
Component: Multi-Cloud Object Gateway
Version: 4.12
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: unspecified
Target Milestone: ---
Assignee: Nimrod Becker
QA Contact: krishnaram Karthick
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2023-01-29 19:45 UTC by Sagi Hirshfeld
Modified: 2023-08-09 16:49 UTC (History)
CC List: 9 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Description of problem (please be as detailed as possible and provide log snippets):
—————————————————————————————————————————————-
RGW-based MCG NamespaceStores have shown inconsistent stability in recent ODF-QE tests on baremetal compact-mode OCP clusters on NVME-intel. In some cases they failed to reach the OPTIMAL state within the 3-minute wait during setup, and in others they failed while IO operations were being performed. While some of the must-gather logs show the RGW NamespaceStore in a healthy state, the "oc get events" output for these cases all shows that at some point the RGW NamespaceStore reached the Rejected or IO_ERRORS phase:

...output omitted...
19m  Warning  RejectedNamespaceStore       bucketclass/oc-bucketclass-3f2cc10ea3eb497db48d4dad3      NooBaa NamespaceStore "rgw-ns-store-e7794ffc2c4640a68c918b43f19" is in rejected phase
...output omitted...
20m  Warning  NamespaceStorePhaseRejected  namespacestore/rgw-ns-store-e7794ffc2c4640a68c918b43f19   Namespace store mode: IO_ERRORS
...output omitted...

Upon further investigation of the RGW logs, we found the following errors around the time our test failed:

… output omitted …
2023-01-12T07:04:00.624573970Z debug 2023-01-12T07:04:00.623+0000 7fe3f2693700 1 ====== starting new request req=0x7fe326e1c630 =====
2023-01-12T07:04:00.624981105Z debug 2023-01-12T07:04:00.623+0000 7fe3f2693700 1 ====== req done req=0x7fe326e1c630 op status=0 http_status=200 latency=0.000000000s ======
2023-01-12T07:04:00.625019756Z debug 2023-01-12T07:04:00.623+0000 7fe3f2693700 1 beast: 0x7fe326e1c630: 10.131.0.1 - - [12/Jan/2023:07:04:00.623 +0000] "GET /swift/healthcheck HTTP/1.1" 200 0 - "kube-probe/1.25" - latency=0.000000000s
2023-01-12T07:04:07.458514503Z %3|1673507047.458|FAIL|rdkafka#producer-1| [thrd:my-cluster-kafka-1.my-cluster-kafka-brokers.myproject.svc:9092/]: my-cluster-kafka-1.my-cluster-kafka-brokers.myproject.svc:9092/1: Receive failed: Disconnected
2023-01-12T07:04:07.458514503Z %3|1673507047.458|ERROR|rdkafka#producer-1| [thrd:my-cluster-kafka-1.my-cluster-kafka-brokers.myproject.svc:9092/]: my-cluster-kafka-1.my-cluster-kafka-brokers.myproject.svc:9092/1: Receive failed: Disconnected
2023-01-12T07:04:07.466265915Z %3|1673507047.466|FAIL|rdkafka#producer-1| [thrd:my-cluster-kafka-2.my-cluster-kafka-brokers.myproject.svc:9092/]: my-cluster-kafka-2.my-cluster-kafka-brokers.myproject.svc:9092/2: Receive failed: Disconnected
2023-01-12T07:04:07.466265915Z %3|1673507047.466|ERROR|rdkafka#producer-1| [thrd:my-cluster-kafka-2.my-cluster-kafka-brokers.myproject.svc:9092/]: my-cluster-kafka-2.my-cluster-kafka-brokers.myproject.svc:9092/2: Receive failed: Disconnected
2023-01-12T07:04:07.466442247Z %3|1673507047.466|FAIL|rdkafka#producer-1| [thrd:my-cluster-kafka-bootstrap.myproject.svc.cluster.local:9092/boo]: my-cluster-kafka-bootstrap.myproject.svc.cluster.local:9092/bootstrap: Receive failed: Disconnected
2023-01-12T07:04:07.466442247Z %3|1673507047.466|ERROR|rdkafka#producer-1| [thrd:my-cluster-kafka-bootstrap.myproject.svc.cluster.local:9092/boo]: my-cluster-kafka-bootstrap.myproject.svc.cluster.local:9092/bootstrap: Receive failed: Disconnected
2023-01-12T07:04:07.479628654Z %3|1673507047.479|FAIL|rdkafka#producer-1| [thrd:my-cluster-kafka-0.my-cluster-kafka-brokers.myproject.svc:9092/]: my-cluster-kafka-0.my-cluster-kafka-brokers.myproject.svc:9092/0: Receive failed: Disconnected
2023-01-12T07:04:07.479628654Z %3|1673507047.479|ERROR|rdkafka#producer-1| [thrd:my-cluster-kafka-0.my-cluster-kafka-brokers.myproject.svc:9092/]: my-cluster-kafka-0.my-cluster-kafka-brokers.myproject.svc:9092/0: Receive failed: Disconnected
2023-01-12T07:04:07.479628654Z %3|1673507047.479|ERROR|rdkafka#producer-1| [thrd:my-cluster-kafka-0.my-cluster-kafka-brokers.myproject.svc:9092/]: 4/4 brokers are down
2023-01-12T07:04:07.568715942Z %3|1673507047.568|FAIL|rdkafka#producer-1| [thrd:my-cluster-kafka-2.my-cluster-kafka-brokers.myproject.svc:9092/]: my-cluster-kafka-2.my-cluster-kafka-brokers.myproject.svc:9092/2: Failed to resolve 'my-cluster-kafka-2.my-cluster-kafka-brokers.myproject.svc:9092': Name or service not known
2023-01-12T07:04:07.568715942Z %3|1673507047.568|ERROR|rdkafka#producer-1| [thrd:my-cluster-kafka-2.my-cluster-kafka-brokers.myproject.svc:9092/]: my-cluster-kafka-2.my-cluster-kafka-brokers.myproject.svc:9092/2: Failed to resolve 'my-cluster-kafka-2.my-cluster-kafka-brokers.myproject.svc:9092': Name or service not known
2023-01-12T07:04:07.569285598Z %3|1673507047.569|FAIL|rdkafka#producer-1| [thrd:my-cluster-kafka-bootstrap.myproject.svc.cluster.local:9092/boo]: my-cluster-kafka-bootstrap.myproject.svc.cluster.local:9092/bootstrap: Failed to resolve 'my-cluster-kafka-bootstrap.myproject.svc.cluster.local:9092': Name or service not known
2023-01-12T07:04:07.569285598Z %3|1673507047.569|ERROR|rdkafka#producer-1| [thrd:my-cluster-kafka-bootstrap.myproject.svc.cluster.local:9092/boo]: my-cluster-kafka-bootstrap.myproject.svc.cluster.local:9092/bootstrap: Failed to resolve 'my-cluster-kafka-bootstrap.myproject.svc.cluster.local:9092': Name or service not known
2023-01-12T07:04:07.576259250Z %3|1673507047.576|FAIL|rdkafka#producer-1| [thrd:my-cluster-kafka-1.my-cluster-kafka-brokers.myproject.svc:9092/]: my-cluster-kafka-1.my-cluster-kafka-brokers.myproject.svc:9092/1: Failed to resolve 'my-cluster-kafka-1.my-cluster-kafka-brokers.myproject.svc:9092': Name or service not known
2023-01-12T07:04:07.576259250Z %3|1673507047.576|ERROR|rdkafka#producer-1| [thrd:my-cluster-kafka-1.my-cluster-kafka-brokers.myproject.svc:9092/]: my-cluster-kafka-1.my-cluster-kafka-brokers.myproject.svc:9092/1: Failed to resolve 'my-cluster-kafka-1.my-cluster-kafka-brokers.myproject.svc:9092': Name or service not known
2023-01-12T07:04:07.588800431Z %3|1673507047.588|FAIL|rdkafka#producer-1| [thrd:my-cluster-kafka-0.my-cluster-kafka-brokers.myproject.svc:9092/]: my-cluster-kafka-0.my-cluster-kafka-brokers.myproject.svc:9092/0: Failed to resolve 'my-cluster-kafka-0.my-cluster-kafka-brokers.myproject.svc:9092': Name or service not known
2023-01-12T07:04:07.588800431Z %3|1673507047.588|ERROR|rdkafka#producer-1| [thrd:my-cluster-kafka-0.my-cluster-kafka-brokers.myproject.svc:9092/]: my-cluster-kafka-0.my-cluster-kafka-brokers.myproject.svc:9092/0: Failed to resolve 'my-cluster-kafka-0.my-cluster-kafka-brokers.myproject.svc:9092': Name or service not known
2023-01-12T07:04:10.624492268Z debug 2023-01-12T07:04:10.623+0000 7fe3e7e7e700 1 ====== starting new request req=0x7fe47029e630 =====
2023-01-12T07:04:10.624767406Z debug 2023-01-12T07:04:10.623+0000 7fe3e7e7e700 1 ====== req done req=0x7fe47029e630 op status=0 http_status=200 latency=0.000000000s ======
2023-01-12T07:04:10.624808709Z debug 2023-01-12T07:04:10.623+0000 7fe3e7e7e700 1 beast: 0x7fe47029e630: 10.131.0.1 - - [12/Jan/2023:07:04:10.623 +0000] "GET /swift/healthcheck HTTP/1.1" 200 0 - "kube-probe/1.25" - latency=0.000000000s
2023-01-12T07:04:13.607976978Z debug 2023-01-12T07:04:13.606+0000 7fe432713700 0 could not get bucket info for bucket=:rgw-oc-bucket-f3ad456cce9243418cf423148d[e2f56d98-464b-446e-8061-d489bbf92468.24737.3]) r=-2
2023-01-12T07:04:13.607976978Z debug 2023-01-12T07:04:13.606+0000 7fe432713700 0 WARNING: sync_bucket() returned r=-2
2023-01-12T07:04:13.701403520Z debug 2023-01-12T07:04:13.700+0000 7fe45b765700 0 rgw UsageLogger: WARNING: RGWRados::log_usage(): user name empty (bucket=), skipping
2023-01-12T07:04:14.608330105Z debug 2023-01-12T07:04:14.607+0000 7fe3cde4a700 1 ====== starting new request req=0x7fe47029e600 =====
2023-01-12T07:04:14.609928165Z debug 2023-01-12T07:04:14.608+0000 7fe3cde4a700 1 ====== req done req=0x7fe47029e600 op status=-2 http_status=204 latency=0.000999992s ======
2023-01-12T07:04:14.609970519Z debug 2023-01-12T07:04:14.608+0000 7fe3cde4a700 1 beast: 0x7fe47029e600: 10.128.2.19 - noobaa-ceph-objectstore-user [12/Jan/2023:07:04:14.607 +0000] "DELETE /nb.1673505526029.apps.ocp-baremetal-auto.qe.rh-ocs.com/noobaa_blocks/63bfaaf69486b1002914c3b5/blocks_tree/other.blocks/test-delete-non-existing-key-1673507054600 HTTP/1.1" 204 0 - "aws-sdk-nodejs/2.1127.0 linux/v14.18.2 promise" - latency=0.000999992s
2023-01-12T07:04:20.624878849Z debug 2023-01-12T07:04:20.623+0000 7fe3c5639700 1 ====== starting new request req=0x7fe32ad03630 =====
2023-01-12T07:04:20.625385209Z debug 2023-01-12T07:04:20.624+0000 7fe3c5639700 1 ====== req done req=0x7fe32ad03630 op status=0 http_status=200 latency=0.000999992s ======
2023-01-12T07:04:20.625385209Z debug 2023-01-12T07:04:20.624+0000 7fe3c5639700 1 beast: 0x7fe32ad03630: 10.131.0.1 - - [12/Jan/2023:07:04:20.623 +0000] "GET /swift/healthcheck HTTP/1.1" 200 0 - "kube-probe/1.25" - latency=0.000999992s
… output omitted …

Versions of all relevant components (if applicable):
—————————————————————————————————————————————-
ODF 4.12.0-122.stable

Does this issue impact your ability to continue to work with the product (please explain in detail what is the user impact)?
—————————————————————————————————————————————-
The instability occurs frequently enough to interrupt IO on 1 MB files.

Is there any workaround available to the best of your knowledge?
—————————————————————————————————————————————-
Retry the failed IO operations.

Rate from 1 - 5 the complexity of the scenario you performed that caused this bug (1 - very simple, 5 - very complex)?
—————————————————————————————————————————————-
2

Is this issue reproducible?
—————————————————————————————————————————————-
Yes, but not consistently.

Can this issue be reproduced from the UI?
—————————————————————————————————————————————-
NA

Steps to Reproduce:
—————————————————————————————————————————————-
1. Deploy a baremetal compact-mode OCP 4.12 cluster with ODF 4.12
2. Create an RGW-based MCG NamespaceStore
3. Create a bucketclass and a bucket (OBC) on top of the NamespaceStore
4. Upload ten 1 MB (or larger) files to the bucket

Actual results:
—————————————————————————————————————————————-
Some of the uploads fail with the following error:
"An error occurred (InternalError) when calling the GetObject operation (reached max retries: 2): We encountered an internal error. Please try again."

Expected results:
—————————————————————————————————————————————-
All of the files are uploaded to the bucket without any errors.

Additional info:
—————————————————————————————————————————————-
So far we have only seen these errors in our automated regression testing, and we have yet to reproduce this bug manually.
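
For reference, a minimal sketch of how the setup wait described above could be checked, using the Kubernetes Python client to poll the NamespaceStore CR. This is not the actual test automation; the namespace, resource name, and status field paths (status.phase, status.mode.modeCode) are assumptions based on the events quoted above.

import time
from kubernetes import client, config

config.load_kube_config()  # or config.load_incluster_config() when run inside a pod
api = client.CustomObjectsApi()

# Assumed names: the NamespaceStore from the events above, in openshift-storage.
name = "rgw-ns-store-e7794ffc2c4640a68c918b43f19"
deadline = time.time() + 180  # the 3-minute setup wait mentioned in the description

while time.time() < deadline:
    nss = api.get_namespaced_custom_object(
        "noobaa.io", "v1alpha1", "openshift-storage", "namespacestores", name)
    status = nss.get("status", {})
    phase = status.get("phase")
    mode = status.get("mode", {}).get("modeCode")
    print(phase, mode)
    if mode == "OPTIMAL":
        break
    if phase == "Rejected" or mode == "IO_ERRORS":
        raise RuntimeError(f"NamespaceStore {name} degraded: phase={phase} mode={mode}")
    time.sleep(10)

Similarly, a minimal boto3 sketch approximating steps 3-4 plus the retry workaround, written against the generic S3 API. The environment variable names (S3_ENDPOINT, BUCKET_NAME, AWS_ACCESS_KEY_ID, AWS_SECRET_ACCESS_KEY) and object keys are placeholders, assuming the OBC's endpoint and credentials have already been extracted from its ConfigMap and Secret.

import os
import boto3
from botocore.config import Config

# Placeholder environment variables -- in practice these values come from the
# OBC's ConfigMap (bucket name, host, port) and Secret (access/secret keys).
s3 = boto3.client(
    "s3",
    endpoint_url=os.environ["S3_ENDPOINT"],
    aws_access_key_id=os.environ["AWS_ACCESS_KEY_ID"],
    aws_secret_access_key=os.environ["AWS_SECRET_ACCESS_KEY"],
    config=Config(retries={"max_attempts": 5, "mode": "standard"}),  # retry workaround
    verify=False,  # test clusters typically use self-signed route certificates
)

bucket = os.environ["BUCKET_NAME"]
payload = os.urandom(1024 * 1024)  # 1 MB of random data, as in step 4

for i in range(10):
    key = f"reproducer-1mb-{i}"
    s3.put_object(Bucket=bucket, Key=key, Body=payload)
    # Read the object back; this GetObject call is the one that intermittently
    # returns InternalError on the affected clusters.
    s3.get_object(Bucket=bucket, Key=key)["Body"].read()

With SDK retries enabled as above, intermittent InternalError responses are retried automatically, which is effectively the "retry the failed IO operations" workaround noted earlier.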
Last Closed: 2023-07-14 12:31:26 UTC
Embargoed:



