Created attachment 1983453 [details]
noobaa-diagnostics

Description of problem (please be detailed as possible and provide log snippets):

The case started with the customer dealing with poor NooBaa/MCG performance. The MCG tuning guide was provided, and the customer tuned the Endpoint, Core, and DB, but they still reported issues with MCG performance.

The current use case for NooBaa, and our observations, are as follows:

1. The customer's backingstore is s3-compatible and points to RGW.
2. The buckets provided by the ODF stack are used as the backend for Thanos.
3. Thanos is also a datasource for Grafana.
4. There are times when a Grafana dashboard takes 3+ minutes to load.

For context, the customer provided errors from the NooBaa core pod (posted below). It was a bit odd that we were seeing the epoch date/time; I am not sure whether that is an issue to be concerned with. The particular snippet is as follows:

"node offline, storage is full (Thu Jan 01 1970 00:00:00 GMT+0000 (Coordinated Universal Time))"

We inspected the backend storage (Ceph), which was nearfull at ~76%. Additionally, the RGW bucket data pool, where all of their buckets are, was also getting high at ~85%. The customer deleted data and freed up space (usage is ~50% at the moment), but they are still seeing the same errors.

Since storage appeared to have been freed up, to address the NooBaa performance issue again we had them delete the NooBaa pods (the "Restore MCG" procedure from the troubleshooting doc) and wait for them to come back up, but they are still getting the same error with an updated timestamp.

The last item of inspection was the VMware datastore itself. Although it shows ~600 GB of free space, it is actually 92% full (see capacities below):

IDS_EQ_CENTRAL datastore:      <------------- Datastore Storage
  Free:     642.07 GB
  Used:     7.23 TB
  Capacity: 7.86 TB

I am unsure whether the VMware datastore being 92% full will affect anything. From what I understand, VMware lets you take datastore capacity quite high, to nearly full.

I will post some logs below.

NOOBAA-CORE LOGS:

2023-08-11T11:10:01.738737570Z Aug-11 11:10:01.737 [WebServer/42] [L0] core.server.node_services.nodes_monitor:: noobaa-internal-agent-607e9947d19dcf00239536c3 not readable. reasons: node offline, storage is full (Thu Jan 01 1970 00:00:00 GMT+0000 (Coordinated Universal Time)) - [Duplicated message. Suppressing for 30 seconds]
2023-08-11T11:10:01.738737570Z Aug-11 11:10:01.737 [WebServer/42] [L0] core.server.node_services.nodes_monitor:: noobaa-internal-agent-607e9947d19dcf00239536c3 not writable. reasons: node offline, storage is full (Thu Jan 01 1970 00:00:00 GMT+0000 (Coordinated Universal Time)) - [Duplicated message. Suppressing for 30 seconds]
2023-08-11T11:10:32.206999911Z Aug-11 11:10:32.205 [WebServer/42] [L0] core.server.node_services.nodes_monitor:: noobaa-internal-agent-607e9947d19dcf00239536c3 item has issues. reasons: node offline, storage is full (Thu Jan 01 1970 00:00:00 GMT+0000 (Coordinated Universal Time))
2023-08-11T11:10:32.206999911Z Aug-11 11:10:32.205 [WebServer/42] [L0] core.server.node_services.nodes_monitor:: noobaa-internal-agent-607e9947d19dcf00239536c3 not readable. reasons: node offline, storage is full (Thu Jan 01 1970 00:00:00 GMT+0000 (Coordinated Universal Time))
2023-08-11T11:10:32.206999911Z Aug-11 11:10:32.206 [WebServer/42] [L0] core.server.node_services.nodes_monitor:: noobaa-internal-agent-607e9947d19dcf00239536c3 not writable. reasons: node offline, storage is full (Thu Jan 01 1970 00:00:00 GMT+0000 (Coordinated Universal Time))
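For reference, this is roughly how the backend usage can be cross-checked against the "storage is full" messages above. This is a sketch only, assuming the default openshift-storage namespace, that the rook-ceph-tools deployment is enabled, and the stock backingstore name noobaa-default-backing-store; adjust for the customer's actual names:

# Backingstore health as reported by the operator
$ oc get backingstore -n openshift-storage
$ oc describe backingstore noobaa-default-backing-store -n openshift-storage

# Overall MCG status (requires the noobaa CLI)
$ noobaa status -n openshift-storage

# Cluster and per-pool usage on the Ceph side, including the RGW data pool
# (run from the rook-ceph-tools pod)
$ oc rsh -n openshift-storage deploy/rook-ceph-tools
sh-4.4$ ceph df detail
sh-4.4$ ceph osd df
sh-4.4$ ceph osd dump | grep ratio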
2023-08-11T11:10:32.210610455Z Aug-11 11:10:32.209 [WebServer/42] [L0] core.server.node_services.nodes_monitor:: noobaa-internal-agent-607e9947d19dcf00239536c3 item has issues. reasons: node offline, storage is full (Thu Jan 01 1970 00:00:00 GMT+0000 (Coordinated Universal Time)) - [Duplicated message. Suppressing for 30 seconds]
2023-08-11T11:10:32.210610455Z Aug-11 11:10:32.210 [WebServer/42] [L0] core.server.node_services.nodes_monitor:: noobaa-internal-agent-607e9947d19dcf00239536c3 not readable. reasons: node offline, storage is full (Thu Jan 01 1970 00:00:00 GMT+0000 (Coordinated Universal Time)) - [Duplicated message. Suppressing for 30 seconds]
2023-08-11T11:10:32.210610455Z Aug-11 11:10:32.210 [WebServer/42] [L0] core.server.node_services.nodes_monitor:: noobaa-internal-agent-607e9947d19dcf00239536c3 not writable. reasons: node offline, storage is full (Thu Jan 01 1970 00:00:00 GMT+0000 (Coordinated Universal Time)) - [Duplicated message. Suppressing for 30 seconds]

Lastly, if we run $ oc get ob or $ oc get obc, no buckets are listed. This is because the noobaa-default backingstore is currently pointed at RGW. Usually, when I see that there are no OBCs, it is a green light to rebuild NooBaa. However, although the buckets are in RGW, it is unknown whether rebuilding NooBaa would create a scenario where the buckets can no longer be accessed (e.g. the account in the noobaa-db no longer exists because it was re-created), so we decided not to follow through.

Version of all relevant components (if applicable):

$ oc get clusterversion
NAME      VERSION   AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.11.5    True        False         178d    Cluster version is 4.11.5

$ oc get csv -n openshift-storage
NAME                              DISPLAY                       VERSION   REPLACES                           PHASE
mcg-operator.v4.11.8              NooBaa Operator               4.11.8    mcg-operator.v4.10.10              Succeeded
ocs-operator.v4.11.8              OpenShift Container Storage   4.11.8    ocs-operator.v4.10.10              Succeeded
odf-csi-addons-operator.v4.11.8   CSI Addons                    4.11.8    odf-csi-addons-operator.v4.10.10   Succeeded
odf-operator.v4.11.8              OpenShift Data Foundation     4.11.8    odf-operator.v4.11.5               Succeeded

$ ceph versions
{
    "mon": {
        "ceph version 16.2.10-138.el8cp (a63ae467c8e1f7503ea3855893f1e5ca189a71b9) pacific (stable)": 3
    },
    "mgr": {
        "ceph version 16.2.10-138.el8cp (a63ae467c8e1f7503ea3855893f1e5ca189a71b9) pacific (stable)": 1
    },
    "osd": {
        "ceph version 16.2.10-138.el8cp (a63ae467c8e1f7503ea3855893f1e5ca189a71b9) pacific (stable)": 6
    },
    "mds": {
        "ceph version 16.2.10-138.el8cp (a63ae467c8e1f7503ea3855893f1e5ca189a71b9) pacific (stable)": 2
    },
    "rgw": {
        "ceph version 16.2.10-138.el8cp (a63ae467c8e1f7503ea3855893f1e5ca189a71b9) pacific (stable)": 1
    },
    "overall": {
        "ceph version 16.2.10-138.el8cp (a63ae467c8e1f7503ea3855893f1e5ca189a71b9) pacific (stable)": 13
    }
}

Does this issue impact your ability to continue to work with the product (please explain in detail what is the user impact)?

Yes, it has been frustrating for the customer, as there are extended loading times and overall poor performance.

Is there any workaround available to the best of your knowledge?

We attempted an MCG rebuild in a test environment with an s3-compatible backingstore pointed at RGW. Oddly enough, all NooBaa resources came back, but the noobaa-default-backingstore never came back and remained gone, as the target bucket was never re-created.
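If a rebuild is ever attempted on the production cluster, one possible way to get the backingstore back (untested here, sketch only) would be to re-create the s3-compatible backingstore manually against the bucket that already exists in RGW, since in the test environment the operator never re-created the target bucket on its own. The backingstore name below is assumed to match the original, and the endpoint, credentials, and bucket name are placeholders for the customer's actual RGW values:

# Locate the secret that held the RGW S3 credentials for the original backingstore
$ oc get secrets -n openshift-storage | grep -i backing

# Re-create the s3-compatible backingstore against the existing RGW bucket
# (all <...> values are placeholders)
$ noobaa backingstore create s3-compatible noobaa-default-backingstore \
    -n openshift-storage \
    --endpoint '<rgw-endpoint-url>' \
    --access-key '<rgw-access-key>' \
    --secret-key '<rgw-secret-key>' \
    --target-bucket '<existing-rgw-bucket>'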
Rate from 1 - 5 the complexity of the scenario you performed that caused this bug (1 - very simple, 5 - very complex)?

4

Can this issue be reproduced?

We tried to in a test environment, but couldn't.