Bug 2232240 - Poor Performance Persists After Tuning MCG [NEEDINFO]
Summary: Poor Performance Persists After Tuning MCG
Keywords:
Status: NEW
Alias: None
Product: Red Hat OpenShift Data Foundation
Classification: Red Hat Storage
Component: Multi-Cloud Object Gateway
Version: 4.11
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: high
Target Milestone: ---
Target Release: ---
Assignee: Nimrod Becker
QA Contact: krishnaram Karthick
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2023-08-15 21:31 UTC by Craig Wayman
Modified: 2023-08-16 13:48 UTC
CC List: 2 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed:
Embargoed:
crwayman: needinfo? (nbecker)



Description Craig Wayman 2023-08-15 21:31:07 UTC
Created attachment 1983453 [details]
noobaa-diagnostics

Description of problem (please be detailed as possible and provide log
snippets):

  The case started with the customer experiencing poor NooBaa/MCG performance. The MCG tuning guide was provided, and the customer tuned the Endpoint, Core, and DB resources, yet still reported issues with MCG performance (a sketch of the kind of tuning patches involved follows the list below). The current use case for NooBaa, and our observations, are as follows:

1. The customer's backingstore is s3-compatible and pointing to RGW.

2. They are using the buckets provided by the ODF stack as a backend of Thanos.

3. Thanos is also a datasource of Grafana.

4. There are times when a Grafana dashboard takes 3+ minutes to load.
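
  For reference, the tuning that was applied followed the MCG tuning guidance, i.e. raising the resource requests/limits on the NooBaa CR and scaling the endpoint count. A minimal sketch of the kind of patches involved (field names per the NooBaa CRD; the values below are illustrative, not the customer's actual settings):

$ oc patch noobaa noobaa -n openshift-storage --type merge \
    -p '{"spec":{"coreResources":{"requests":{"cpu":"2","memory":"4Gi"},"limits":{"cpu":"2","memory":"4Gi"}}}}'
$ oc patch noobaa noobaa -n openshift-storage --type merge \
    -p '{"spec":{"dbResources":{"requests":{"cpu":"2","memory":"4Gi"},"limits":{"cpu":"2","memory":"4Gi"}}}}'
$ oc patch noobaa noobaa -n openshift-storage --type merge \
    -p '{"spec":{"endpoints":{"minCount":2,"maxCount":4}}}'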


  For context, the customer provided errors from the NooBaa core pod (posted below). The odd date/time in the message stood out (it is the Unix epoch); it is unclear whether that is itself something to be concerned about. The particular snippet is as follows:

"node offline, storage is full (Thu Jan 01 1970 00:00:00 GMT+0000 (Coordinated Universal Time))"

  We inspected the backend storage (Ceph), which was nearfull at ~76%. Additionally, the RGW data pool, which holds all of their buckets, was at ~85%. The customer deleted data and freed up space (usage is now around 50%), but they are still seeing the same errors.
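
  For the record, the utilization checks above were the usual ones, run from the rook-ceph toolbox (assuming the toolbox pod is deployed):

$ ceph health detail    # reports OSD nearfull/backfillfull warnings
$ ceph df detail        # per-pool usage, including the RGW data pool
$ ceph osd df tree      # per-OSD utilization and %USE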

  Since storage had been freed up, to address the NooBaa performance issue again we had them delete the NooBaa pods ("Restore MCG" from the troubleshooting doc) and wait for them to come back up, but they are still getting the same error with an updated timestamp.
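
  The pod restart itself was along these lines (pod names are the ODF 4.11 defaults and may differ on this cluster):

$ oc delete pod noobaa-core-0 noobaa-db-pg-0 -n openshift-storage
$ oc delete $(oc get pods -n openshift-storage -o name | grep noobaa-endpoint) -n openshift-storage
$ oc get pods -n openshift-storage | grep noobaa    # wait for all pods to return to Running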

  The last item inspected was the VMware datastore itself. Although it shows roughly 600 GB of free space, it is in fact 92% full (see capacities below):
 
IDS_EQ_CENTRAL datastore: <------------- Datastore
Storage Free: 642.07 GB
Used: 7.23 TB 
Capacity: 7.86 TB

  I am unsure whether the VMware datastore being 92% full affects anything. From what I understand, VMware lets you fill a datastore nearly to capacity. Some logs are posted below:

NOOBAA-CORE LOGS:
2023-08-11T11:10:01.738737570Z Aug-11 11:10:01.737 [WebServer/42]    [L0] core.server.node_services.nodes_monitor:: noobaa-internal-agent-607e9947d19dcf00239536c3 not readable. reasons: node offline, storage is full (Thu Jan 01 1970 00:00:00 GMT+0000 (Coordinated Universal Time)) - [Duplicated message. Suppressing for 30 seconds]
2023-08-11T11:10:01.738737570Z Aug-11 11:10:01.737 [WebServer/42]    [L0] core.server.node_services.nodes_monitor:: noobaa-internal-agent-607e9947d19dcf00239536c3 not writable. reasons: node offline, storage is full (Thu Jan 01 1970 00:00:00 GMT+0000 (Coordinated Universal Time)) - [Duplicated message. Suppressing for 30 seconds]
2023-08-11T11:10:32.206999911Z Aug-11 11:10:32.205 [WebServer/42]    [L0] core.server.node_services.nodes_monitor:: noobaa-internal-agent-607e9947d19dcf00239536c3 item has issues. reasons: node offline, storage is full (Thu Jan 01 1970 00:00:00 GMT+0000 (Coordinated Universal Time))
2023-08-11T11:10:32.206999911Z Aug-11 11:10:32.205 [WebServer/42]    [L0] core.server.node_services.nodes_monitor:: noobaa-internal-agent-607e9947d19dcf00239536c3 not readable. reasons: node offline, storage is full (Thu Jan 01 1970 00:00:00 GMT+0000 (Coordinated Universal Time))
2023-08-11T11:10:32.206999911Z Aug-11 11:10:32.206 [WebServer/42]    [L0] core.server.node_services.nodes_monitor:: noobaa-internal-agent-607e9947d19dcf00239536c3 not writable. reasons: node offline, storage is full (Thu Jan 01 1970 00:00:00 GMT+0000 (Coordinated Universal Time))
2023-08-11T11:10:32.210610455Z Aug-11 11:10:32.209 [WebServer/42]    [L0] core.server.node_services.nodes_monitor:: noobaa-internal-agent-607e9947d19dcf00239536c3 item has issues. reasons: node offline, storage is full (Thu Jan 01 1970 00:00:00 GMT+0000 (Coordinated Universal Time)) - [Duplicated message. Suppressing for 30 seconds]
2023-08-11T11:10:32.210610455Z Aug-11 11:10:32.210 [WebServer/42]    [L0] core.server.node_services.nodes_monitor:: noobaa-internal-agent-607e9947d19dcf00239536c3 not readable. reasons: node offline, storage is full (Thu Jan 01 1970 00:00:00 GMT+0000 (Coordinated Universal Time)) - [Duplicated message. Suppressing for 30 seconds]
2023-08-11T11:10:32.210610455Z Aug-11 11:10:32.210 [WebServer/42]    [L0] core.server.node_services.nodes_monitor:: noobaa-internal-agent-607e9947d19dcf00239536c3 not writable. reasons: node offline, storage is full (Thu Jan 01 1970 00:00:00 GMT+0000 (Coordinated Universal Time)) - [Duplicated message. Suppressing for 30 seconds]
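
  (The snippets above can be reproduced with something along the lines of the following; the noobaa CLI, if installed, also summarizes system and backingstore health:)

$ oc logs noobaa-core-0 -n openshift-storage | grep nodes_monitor
$ noobaa status -n openshift-storage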


  Lastly, running $ oc get ob or $ oc get obc lists no buckets. This is because the noobaa-default backingstore is currently pointed at RGW. Usually, seeing no OBCs is a green light to rebuild NooBaa. However, although the buckets are in RGW, it is unknown whether rebuilding NooBaa would leave them inaccessible (e.g. the account in the noobaa-db would no longer exist after it is re-created), so we decided not to follow through. The specific checks are listed below.
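
  For reference, the checks mentioned above (the default backingstore name is assumed and may differ on this cluster):

$ oc get obc -A
$ oc get ob
$ oc get backingstore -n openshift-storage
$ oc describe backingstore noobaa-default-backing-store -n openshift-storage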



Version of all relevant components (if applicable):

$ oc get clusterversion
NAME      VERSION   AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.11.5    True        False         178d    Cluster version is 4.11.5


$ oc get csv -n openshift-storage

NAME                              DISPLAY                       VERSION   REPLACES                           PHASE
mcg-operator.v4.11.8              NooBaa Operator               4.11.8    mcg-operator.v4.10.10              Succeeded
ocs-operator.v4.11.8              OpenShift Container Storage   4.11.8    ocs-operator.v4.10.10              Succeeded
odf-csi-addons-operator.v4.11.8   CSI Addons                    4.11.8    odf-csi-addons-operator.v4.10.10   Succeeded
odf-operator.v4.11.8              OpenShift Data Foundation     4.11.8    odf-operator.v4.11.5               Succeeded


$ ceph versions

{
    "mon": {
        "ceph version 16.2.10-138.el8cp (a63ae467c8e1f7503ea3855893f1e5ca189a71b9) pacific (stable)": 3
    },
    "mgr": {
        "ceph version 16.2.10-138.el8cp (a63ae467c8e1f7503ea3855893f1e5ca189a71b9) pacific (stable)": 1
    },
    "osd": {
        "ceph version 16.2.10-138.el8cp (a63ae467c8e1f7503ea3855893f1e5ca189a71b9) pacific (stable)": 6
    },
    "mds": {
        "ceph version 16.2.10-138.el8cp (a63ae467c8e1f7503ea3855893f1e5ca189a71b9) pacific (stable)": 2
    },
    "rgw": {
        "ceph version 16.2.10-138.el8cp (a63ae467c8e1f7503ea3855893f1e5ca189a71b9) pacific (stable)": 1
    },
    "overall": {
        "ceph version 16.2.10-138.el8cp (a63ae467c8e1f7503ea3855893f1e5ca189a71b9) pacific (stable)": 13
    }
}




Does this issue impact your ability to continue to work with the product
(please explain in detail what is the user impact)?

  Yes, it has been frustrating for the customer as there are extended loading times and overall poor performance.

Is there any workaround available to the best of your knowledge?

  We attempted an MCG rebuild in a test environment with an s3-compatible backingstore pointed at RGW. Oddly enough, all other NooBaa resources came back, but the noobaa-default backingstore never did; it remained gone because its target bucket was never re-created.
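
  If a rebuild were ever attempted in production, the missing default backingstore could presumably be recreated by hand as an s3-compatible store pointing back at RGW. This is a rough sketch only; the endpoint, target bucket, and secret below are placeholders, not the customer's values:

$ cat <<EOF | oc apply -f -
apiVersion: noobaa.io/v1alpha1
kind: BackingStore
metadata:
  name: noobaa-default-backing-store
  namespace: openshift-storage
spec:
  type: s3-compatible
  s3Compatible:
    endpoint: https://<rgw-service-endpoint>
    targetBucket: <existing-rgw-bucket>
    signatureVersion: v4
    secret:
      name: <rgw-credentials-secret>
      namespace: openshift-storage
EOF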


Rate from 1 - 5 the complexity of the scenario you performed that caused this
bug (1 - very simple, 5 - very complex)?

4


Can this issue be reproduced?

We tried to reproduce it in a test environment but could not.

