Bug 2165395
| Summary: | RGW Errors on baremetal compact-mode OCP clusters: "Receive failed: Disconnected", "brokers are down" | | |
|---|---|---|---|
| Product: | [Red Hat Storage] Red Hat OpenShift Data Foundation | Reporter: | Sagi Hirshfeld <shirshfe> |
| Component: | Multi-Cloud Object Gateway | Assignee: | Nimrod Becker <nbecker> |
| Status: | CLOSED WORKSFORME | QA Contact: | krishnaram Karthick <kramdoss> |
| Severity: | unspecified | Docs Contact: | |
| Priority: | unspecified | | |
| Version: | 4.12 | CC: | bniver, jthottan, kramdoss, mbenjamin, muagarwa, nbecker, ocs-bugs, odf-bz-bot, sostapov |
| Target Milestone: | --- | Keywords: | Automation, Reopened |
| Target Release: | --- | | |
| Hardware: | Unspecified | | |
| OS: | Unspecified | | |
| Whiteboard: | | | |
| Fixed In Version: | | Doc Type: | If docs needed, set a value |
| Doc Text: | | Story Points: | --- |
| Clone Of: | | Environment: | |
Description of problem (please be as detailed as possible and provide log snippets):
—————————————————————————————————————————————-
RGW-based MCG NamespaceStores have shown inconsistent stability in recent ODF-QE tests on baremetal compact-mode OCP clusters on NVME-intel.
In some cases they failed to reach the OPTIMAL state within the 3-minute wait during setup, and in others they failed while IO operations were in progress.
While some of the logs in the must-gather show the RGW NamespaceStore in a healthy state, the "oc get events" output for each of these cases shows that at some point the RGW NamespaceStore reached the Rejected or IO_ERRORS phase:
...output emitted...
19m Warning RejectedNamespaceStore bucketclass/oc-bucketclass-3f2cc10ea3eb497db48d4dad3 NooBaa NamespaceStore "rgw-ns-store-e7794ffc2c4640a68c918b43f19" is in rejected phase
...output emitted...
20m Warning NamespaceStorePhaseRejected namespacestore/rgw-ns-store-e7794ffc2c4640a68c918b43f19 Namespace store mode: IO_ERRORS
...output emitted...
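For context, the setup phase waits up to 3 minutes for the NamespaceStore to report a healthy state before IO starts. Below is a minimal sketch of that kind of wait loop, assuming the NamespaceStore CRD exposes its health under `status.phase` and `status.mode.modeCode`; the function name, namespace, and timings are illustrative and not the automation's actual code.

```python
import json
import subprocess
import time

def wait_for_namespacestore_optimal(name, namespace="openshift-storage", timeout=180):
    """Poll a NamespaceStore CR until it reports Ready/OPTIMAL, or give up after `timeout` seconds."""
    deadline = time.time() + timeout
    while time.time() < deadline:
        # Fetch the CR as JSON via oc; field names below are assumptions about the CRD status.
        out = subprocess.run(
            ["oc", "-n", namespace, "get", "namespacestore", name, "-o", "json"],
            capture_output=True, text=True, check=True,
        ).stdout
        status = json.loads(out).get("status", {})
        if status.get("phase") == "Ready" and status.get("mode", {}).get("modeCode") == "OPTIMAL":
            return True
        time.sleep(10)
    return False
```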
Upon further investigation of the RGW logs, we found the following errors around the time our test failed:
… output emitted …
2023-01-12T07:04:00.624573970Z debug 2023-01-12T07:04:00.623+0000 7fe3f2693700 1 ====== starting new request req=0x7fe326e1c630 =====
2023-01-12T07:04:00.624981105Z debug 2023-01-12T07:04:00.623+0000 7fe3f2693700 1 ====== req done req=0x7fe326e1c630 op status=0 http_status=200 latency=0.000000000s ======
2023-01-12T07:04:00.625019756Z debug 2023-01-12T07:04:00.623+0000 7fe3f2693700 1 beast: 0x7fe326e1c630: 10.131.0.1 - - [12/Jan/2023:07:04:00.623 +0000] "GET /swift/healthcheck HTTP/1.1" 200 0 - "kube-probe/1.25" - latency=0.000000000s
2023-01-12T07:04:07.458514503Z %3|1673507047.458|FAIL|rdkafka#producer-1| [thrd:my-cluster-kafka-1.my-cluster-kafka-brokers.myproject.svc:9092/]: my-cluster-kafka-1.my-cluster-kafka-brokers.myproject.svc:9092/1: Receive failed: Disconnected
2023-01-12T07:04:07.458514503Z %3|1673507047.458|ERROR|rdkafka#producer-1| [thrd:my-cluster-kafka-1.my-cluster-kafka-brokers.myproject.svc:9092/]: my-cluster-kafka-1.my-cluster-kafka-brokers.myproject.svc:9092/1: Receive failed: Disconnected
2023-01-12T07:04:07.466265915Z %3|1673507047.466|FAIL|rdkafka#producer-1| [thrd:my-cluster-kafka-2.my-cluster-kafka-brokers.myproject.svc:9092/]: my-cluster-kafka-2.my-cluster-kafka-brokers.myproject.svc:9092/2: Receive failed: Disconnected
2023-01-12T07:04:07.466265915Z %3|1673507047.466|ERROR|rdkafka#producer-1| [thrd:my-cluster-kafka-2.my-cluster-kafka-brokers.myproject.svc:9092/]: my-cluster-kafka-2.my-cluster-kafka-brokers.myproject.svc:9092/2: Receive failed: Disconnected
2023-01-12T07:04:07.466442247Z %3|1673507047.466|FAIL|rdkafka#producer-1| [thrd:my-cluster-kafka-bootstrap.myproject.svc.cluster.local:9092/boo]: my-cluster-kafka-bootstrap.myproject.svc.cluster.local:9092/bootstrap: Receive failed: Disconnected
2023-01-12T07:04:07.466442247Z %3|1673507047.466|ERROR|rdkafka#producer-1| [thrd:my-cluster-kafka-bootstrap.myproject.svc.cluster.local:9092/boo]: my-cluster-kafka-bootstrap.myproject.svc.cluster.local:9092/bootstrap: Receive failed: Disconnected
2023-01-12T07:04:07.479628654Z %3|1673507047.479|FAIL|rdkafka#producer-1| [thrd:my-cluster-kafka-0.my-cluster-kafka-brokers.myproject.svc:9092/]: my-cluster-kafka-0.my-cluster-kafka-brokers.myproject.svc:9092/0: Receive failed: Disconnected
2023-01-12T07:04:07.479628654Z %3|1673507047.479|ERROR|rdkafka#producer-1| [thrd:my-cluster-kafka-0.my-cluster-kafka-brokers.myproject.svc:9092/]: my-cluster-kafka-0.my-cluster-kafka-brokers.myproject.svc:9092/0: Receive failed: Disconnected
2023-01-12T07:04:07.479628654Z %3|1673507047.479|ERROR|rdkafka#producer-1| [thrd:my-cluster-kafka-0.my-cluster-kafka-brokers.myproject.svc:9092/]: 4/4 brokers are down
2023-01-12T07:04:07.568715942Z %3|1673507047.568|FAIL|rdkafka#producer-1| [thrd:my-cluster-kafka-2.my-cluster-kafka-brokers.myproject.svc:9092/]: my-cluster-kafka-2.my-cluster-kafka-brokers.myproject.svc:9092/2: Failed to resolve 'my-cluster-kafka-2.my-cluster-kafka-brokers.myproject.svc:9092': Name or service not known
2023-01-12T07:04:07.568715942Z %3|1673507047.568|ERROR|rdkafka#producer-1| [thrd:my-cluster-kafka-2.my-cluster-kafka-brokers.myproject.svc:9092/]: my-cluster-kafka-2.my-cluster-kafka-brokers.myproject.svc:9092/2: Failed to resolve 'my-cluster-kafka-2.my-cluster-kafka-brokers.myproject.svc:9092': Name or service not known
2023-01-12T07:04:07.569285598Z %3|1673507047.569|FAIL|rdkafka#producer-1| [thrd:my-cluster-kafka-bootstrap.myproject.svc.cluster.local:9092/boo]: my-cluster-kafka-bootstrap.myproject.svc.cluster.local:9092/bootstrap: Failed to resolve 'my-cluster-kafka-bootstrap.myproject.svc.cluster.local:9092': Name or service not known
2023-01-12T07:04:07.569285598Z %3|1673507047.569|ERROR|rdkafka#producer-1| [thrd:my-cluster-kafka-bootstrap.myproject.svc.cluster.local:9092/boo]: my-cluster-kafka-bootstrap.myproject.svc.cluster.local:9092/bootstrap: Failed to resolve 'my-cluster-kafka-bootstrap.myproject.svc.cluster.local:9092': Name or service not known
2023-01-12T07:04:07.576259250Z %3|1673507047.576|FAIL|rdkafka#producer-1| [thrd:my-cluster-kafka-1.my-cluster-kafka-brokers.myproject.svc:9092/]: my-cluster-kafka-1.my-cluster-kafka-brokers.myproject.svc:9092/1: Failed to resolve 'my-cluster-kafka-1.my-cluster-kafka-brokers.myproject.svc:9092': Name or service not known
2023-01-12T07:04:07.576259250Z %3|1673507047.576|ERROR|rdkafka#producer-1| [thrd:my-cluster-kafka-1.my-cluster-kafka-brokers.myproject.svc:9092/]: my-cluster-kafka-1.my-cluster-kafka-brokers.myproject.svc:9092/1: Failed to resolve 'my-cluster-kafka-1.my-cluster-kafka-brokers.myproject.svc:9092': Name or service not known
2023-01-12T07:04:07.588800431Z %3|1673507047.588|FAIL|rdkafka#producer-1| [thrd:my-cluster-kafka-0.my-cluster-kafka-brokers.myproject.svc:9092/]: my-cluster-kafka-0.my-cluster-kafka-brokers.myproject.svc:9092/0: Failed to resolve 'my-cluster-kafka-0.my-cluster-kafka-brokers.myproject.svc:9092': Name or service not known
2023-01-12T07:04:07.588800431Z %3|1673507047.588|ERROR|rdkafka#producer-1| [thrd:my-cluster-kafka-0.my-cluster-kafka-brokers.myproject.svc:9092/]: my-cluster-kafka-0.my-cluster-kafka-brokers.myproject.svc:9092/0: Failed to resolve 'my-cluster-kafka-0.my-cluster-kafka-brokers.myproject.svc:9092': Name or service not known
2023-01-12T07:04:10.624492268Z debug 2023-01-12T07:04:10.623+0000 7fe3e7e7e700 1 ====== starting new request req=0x7fe47029e630 =====
2023-01-12T07:04:10.624767406Z debug 2023-01-12T07:04:10.623+0000 7fe3e7e7e700 1 ====== req done req=0x7fe47029e630 op status=0 http_status=200 latency=0.000000000s ======
2023-01-12T07:04:10.624808709Z debug 2023-01-12T07:04:10.623+0000 7fe3e7e7e700 1 beast: 0x7fe47029e630: 10.131.0.1 - - [12/Jan/2023:07:04:10.623 +0000] "GET /swift/healthcheck HTTP/1.1" 200 0 - "kube-probe/1.25" - latency=0.000000000s
2023-01-12T07:04:13.607976978Z debug 2023-01-12T07:04:13.606+0000 7fe432713700 0 could not get bucket info for bucket=:rgw-oc-bucket-f3ad456cce9243418cf423148d[e2f56d98-464b-446e-8061-d489bbf92468.24737.3]) r=-2
2023-01-12T07:04:13.607976978Z debug 2023-01-12T07:04:13.606+0000 7fe432713700 0 WARNING: sync_bucket() returned r=-2
2023-01-12T07:04:13.701403520Z debug 2023-01-12T07:04:13.700+0000 7fe45b765700 0 rgw UsageLogger: WARNING: RGWRados::log_usage(): user name empty (bucket=), skipping
2023-01-12T07:04:14.608330105Z debug 2023-01-12T07:04:14.607+0000 7fe3cde4a700 1 ====== starting new request req=0x7fe47029e600 =====
2023-01-12T07:04:14.609928165Z debug 2023-01-12T07:04:14.608+0000 7fe3cde4a700 1 ====== req done req=0x7fe47029e600 op status=-2 http_status=204 latency=0.000999992s ======
2023-01-12T07:04:14.609970519Z debug 2023-01-12T07:04:14.608+0000 7fe3cde4a700 1 beast: 0x7fe47029e600: 10.128.2.19 - noobaa-ceph-objectstore-user [12/Jan/2023:07:04:14.607 +0000] "DELETE /nb.1673505526029.apps.ocp-baremetal-auto.qe.rh-ocs.com/noobaa_blocks/63bfaaf69486b1002914c3b5/blocks_tree/other.blocks/test-delete-non-existing-key-1673507054600 HTTP/1.1" 204 0 - "aws-sdk-nodejs/2.1127.0 linux/v14.18.2 promise" - latency=0.000999992s
2023-01-12T07:04:20.624878849Z debug 2023-01-12T07:04:20.623+0000 7fe3c5639700 1 ====== starting new request req=0x7fe32ad03630 =====
2023-01-12T07:04:20.625385209Z debug 2023-01-12T07:04:20.624+0000 7fe3c5639700 1 ====== req done req=0x7fe32ad03630 op status=0 http_status=200 latency=0.000999992s ======
2023-01-12T07:04:20.625385209Z debug 2023-01-12T07:04:20.624+0000 7fe3c5639700 1 beast: 0x7fe32ad03630: 10.131.0.1 - - [12/Jan/2023:07:04:20.623 +0000] "GET /swift/healthcheck HTTP/1.1" 200 0 - "kube-probe/1.25" - latency=0.000999992s
… output emitted …
Versions of all relevant components (if applicable):
—————————————————————————————————————————————-
ODF 4.12.0-122.stable
Does this issue impact your ability to continue to work with the product (please explain in detail what the user impact is)?
—————————————————————————————————————————————-
This instability occurs consistently enough to interrupt IO operations on 1 MB files.
Is there any workaround available to the best of your knowledge?
—————————————————————————————————————————————-
Retry the failed IO operations.
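A minimal sketch of such a retry, assuming a boto3 client pointed at the OBC's S3 endpoint with the standard retry mode plus an explicit re-issue on InternalError; the endpoint, credentials, and names below are placeholders, not values from this cluster:

```python
import boto3
from botocore.config import Config
from botocore.exceptions import ClientError

# Placeholders: point the client at the OBC's S3 endpoint and credentials.
s3 = boto3.client(
    "s3",
    endpoint_url="https://rgw-route.example.com",
    aws_access_key_id="<OBC_ACCESS_KEY>",
    aws_secret_access_key="<OBC_SECRET_KEY>",
    config=Config(retries={"max_attempts": 5, "mode": "standard"}),
)

def put_with_retry(bucket, key, path, attempts=3):
    """Re-issue the request a few times when RGW answers with InternalError."""
    for attempt in range(1, attempts + 1):
        try:
            with open(path, "rb") as f:
                s3.put_object(Bucket=bucket, Key=key, Body=f)
            return
        except ClientError as err:
            code = err.response.get("Error", {}).get("Code", "")
            if code != "InternalError" or attempt == attempts:
                raise
```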
Rate from 1 - 5 the complexity of the scenario you performed that caused this
bug (1 - very simple, 5 - very complex)?
—————————————————————————————————————————————-
2
Is this issue reproducible?
—————————————————————————————————————————————-
Yes, but not consistently.
Can this issue be reproduced from the UI?
—————————————————————————————————————————————-
NA
Steps to Reproduce:
—————————————————————————————————————————————-
1. Deploy a baremetal compact-mode OCP 4.12 cluster with ODF 4.12
2. Create an RGW-based MCG NamespaceStore
3. Create a bucketclass and a bucket (OBC) on top of the NamespaceStore
4. Upload ten 1 MB (or larger) files to the bucket (a minimal upload sketch follows these steps)
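The sketch below illustrates step 4 with boto3; the endpoint, credentials, and bucket name are placeholders for the values the OBC provisions (normally read from its ConfigMap and Secret), not values from this cluster.

```python
import os

import boto3

# Placeholders: on a real cluster these come from the OBC's ConfigMap and Secret.
s3 = boto3.client(
    "s3",
    endpoint_url="https://rgw-route.example.com",
    aws_access_key_id="<OBC_ACCESS_KEY>",
    aws_secret_access_key="<OBC_SECRET_KEY>",
)

bucket = "oc-bucket-example"  # placeholder OBC bucket name
for i in range(10):
    # Upload ten 1 MB objects of random data.
    s3.put_object(Bucket=bucket, Key=f"file-{i}.bin", Body=os.urandom(1024 * 1024))
```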
Actual results:
—————————————————————————————————————————————-
Some of the uploads fail with the following error:
“An error occurred (InternalError) when calling the GetObject operation (reached max retries: 2): We encountered an internal error. Please try again.”
Expected results:
—————————————————————————————————————————————-
All of the files are uploaded to the bucket successfully, without any errors.
Additional info:
—————————————————————————————————————————————-
So far, we have only seen these errors in our automated regression testing, and we have not yet reproduced the bug manually.
| Last Closed: | 2023-07-14 12:31:26 UTC | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | | | |