Created attachment 2009631 [details]
resource constraint to perform resource profile change

Description of problem (please be detailed as possible and provide log snippets):

Deploy ODF; the StorageCluster reaches the Ready state. Use the UI option to change the resource profile to performance mode. This fails with the error "Aggregate resource requirements for the selected performance profile not met" (attached Screenshot1.png).

Added additional memory to the worker nodes (attached Screenshot2.png); the node details page shows the newly added memory on the worker nodes (Screenshot2.png).

Now try again to change the resource profile to performance, but the popup still says the resource requirements are not met; the new memory additions are not reflected (Screenshot3.png).

Version of all relevant components (if applicable):

[root@nara7-aacc-bastion-0 ~]# oc get csv -A
NAMESPACE                              NAME                                         DISPLAY                       VERSION               REPLACES                                PHASE
openshift-local-storage                local-storage-operator.v4.14.0-202311031050  Local Storage                 4.14.0-202311031050                                           Succeeded
openshift-operator-lifecycle-manager   packageserver                                Package Server                0.0.1-snapshot                                                Succeeded
openshift-storage                      mcg-operator.v4.15.0-120.stable              NooBaa Operator               4.15.0-120.stable     mcg-operator.v4.14.3-rhodf              Succeeded
openshift-storage                      ocs-operator.v4.15.0-120.stable              OpenShift Container Storage   4.15.0-120.stable     ocs-operator.v4.14.3-rhodf              Succeeded
openshift-storage                      odf-csi-addons-operator.v4.15.0-120.stable   CSI Addons                    4.15.0-120.stable     odf-csi-addons-operator.v4.14.3-rhodf   Succeeded
openshift-storage                      odf-operator.v4.15.0-120.stable              OpenShift Data Foundation     4.15.0-120.stable     odf-operator.v4.14.3-rhodf              Succeeded
[root@nara7-aacc-bastion-0 ~]#

[root@nara7-aacc-bastion-0 ~]# oc get clusterversion
NAME      VERSION       AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.15.0-rc.1   True        False         3d6h    Cluster version is 4.15.0-rc.1
[root@nara7-aacc-bastion-0 ~]#

Does this issue impact your ability to continue to work with the product (please explain in detail what is the user impact)?
Yes

Is there any workaround available to the best of your knowledge?
No

Rate from 1 - 5 the complexity of the scenario you performed that caused this bug (1 - very simple, 5 - very complex)?

Is this issue reproducible?
Yes

Can this issue reproduce from the UI?
Yes

If this is a regression, please provide more details to justify this:

Steps to Reproduce:
1. Deploy ODF and try to change the resource profile on a cluster with less than 96 GB of aggregate memory.
2. Increase the worker node memory; the new memory is reflected in the worker node details page.
3. Try again to change the resource profile to performance mode; the modal still shows the old memory values, not the new ones.

Actual results:
Memory changes on the worker nodes are not reflected during resource profile changes.

Expected results:
Memory changes on the worker nodes should be reflected during resource profile changes, allowing the resource profile to be changed to performance mode.

Additional info:
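For context on the "Aggregate resource requirements ... not met" popup: below is a minimal sketch of an aggregate memory check like the one the modal performs, assuming it sums allocatable memory across the worker nodes and compares it against the ~96 GB figure mentioned in step 1. The threshold constant, label selector, and helper function are illustrative assumptions, not the actual console code.

# Minimal sketch (not the ODF console code) of an aggregate memory check,
# assuming the modal sums allocatable memory of the worker nodes and compares
# it against a ~96 GiB requirement for the "performance" profile.
from kubernetes import client, config

REQUIRED_AGGREGATE_GIB = 96  # assumed threshold for the "performance" profile

def parse_quantity_to_gib(quantity: str) -> float:
    """Convert a Kubernetes memory quantity such as '32068908Ki' to GiB."""
    units = {"Ki": 1 / (1024 ** 2), "Mi": 1 / 1024, "Gi": 1.0, "Ti": 1024.0}
    for suffix, factor in units.items():
        if quantity.endswith(suffix):
            return float(quantity[:-len(suffix)]) * factor
    return float(quantity) / (1024 ** 3)  # plain bytes

config.load_kube_config()
v1 = client.CoreV1Api()
workers = v1.list_node(label_selector="node-role.kubernetes.io/worker").items

total_gib = sum(parse_quantity_to_gib(n.status.allocatable["memory"]) for n in workers)
print(f"Aggregate allocatable memory: {total_gib:.1f} GiB")
print("profile change allowed" if total_gib >= REQUIRED_AGGREGATE_GIB else "requirements not met")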
thanks for sharing the YAMLs... can you please also run the following under "Observe > Metrics" and share the output:

1. sum by (instance) (node_memory_MemTotal_bytes)
2. sum by (instance) (node_memory_MemAvailable_bytes)
3. sum by (instance) (node_memory_MemTotal_bytes - node_memory_MemAvailable_bytes)
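For reference, the same queries can be run outside the console against the cluster monitoring Prometheus HTTP API. The sketch below is illustrative only: the route host and token are placeholders (on OpenShift the thanos-querier route in openshift-monitoring is commonly used with a bearer token, e.g. from `oc whoami -t`), while /api/v1/query is the standard Prometheus query endpoint.

# Minimal sketch: run the three PromQL queries against the cluster's Prometheus API.
# PROM_URL and TOKEN are placeholders/assumptions.
import requests

PROM_URL = "https://thanos-querier-openshift-monitoring.apps.example.com"  # placeholder
TOKEN = "sha256~REPLACE_ME"  # placeholder bearer token

QUERIES = [
    "sum by (instance) (node_memory_MemTotal_bytes)",
    "sum by (instance) (node_memory_MemAvailable_bytes)",
    "sum by (instance) (node_memory_MemTotal_bytes - node_memory_MemAvailable_bytes)",
]

for query in QUERIES:
    resp = requests.get(
        f"{PROM_URL}/api/v1/query",
        params={"query": query},
        headers={"Authorization": f"Bearer {TOKEN}"},
        verify=False,  # lab clusters often use self-signed certs; adjust as needed
    )
    resp.raise_for_status()
    print(query)
    for result in resp.json()["data"]["result"]:
        instance = result["metric"].get("instance", "<none>")
        value_bytes = float(result["value"][1])
        print(f"  {instance}: {value_bytes / 1024 ** 3:.1f} GiB")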
also, what's the cluster infrastructure (BareMetal, vSphere, etc.)?
There are 3 ways to determine a node's capacity:

1. The "status.allocatable.memory" field in the Node's CR, which represents the resources of a node that are available for scheduling.
2. The "status.capacity.memory" field in the Node's CR, which represents the total resources of a node.
3. The "node_memory_MemTotal_bytes" metric from Prometheus.

Typically options "2" and "3" should report similar values, and option "1" should be slightly lower than the other two. The OCP "Compute > Nodes" list page uses option "3", whereas in ODF (the performance profile modal and even during StorageSystem deployment) we use option "1".

Checking the YAML shared above, the Node's CR is reporting around 30.7 GiB of allocatable capacity and around 31.8 GiB of total capacity, whereas the "node_" metric is reporting around 36.8 GiB. Hence the mismatch between what's seen/calculated in OCP and ODF.

May I know the exact steps used for increasing resources on the nodes? Also, https://bugzilla.redhat.com/show_bug.cgi?id=2259616#c6 got missed, can you please answer this as well?
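To see this gap on a given cluster, here is a quick sketch (using the Python kubernetes client; an illustration, not the console code path) that prints options 1 and 2 side by side for each node, so they can be compared against the node_memory_MemTotal_bytes values from Prometheus (option 3):

# Minimal sketch: print allocatable vs. capacity memory from each Node CR
# (options 1 and 2 above) for comparison with the Prometheus metric (option 3).
from kubernetes import client, config

def ki_to_gib(quantity: str) -> float:
    """Node memory is normally reported in Ki; convert to GiB."""
    assert quantity.endswith("Ki"), f"unexpected unit in {quantity!r}"
    return float(quantity[:-2]) / (1024 ** 2)

config.load_kube_config()
for node in client.CoreV1Api().list_node().items:
    alloc = node.status.allocatable["memory"]   # option 1: schedulable memory
    cap = node.status.capacity["memory"]        # option 2: total memory
    print(f"{node.metadata.name}: allocatable={ki_to_gib(alloc):.1f} GiB, "
          f"capacity={ki_to_gib(cap):.1f} GiB")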
(In reply to Sanjal Katiyar from comment #8)
> Typically options "2" and "3" should report similar values, and option "1"
> should be slightly lower than the other two. The OCP "Compute > Nodes" list
> page uses option "3", whereas in ODF (the performance profile modal and even
> during StorageSystem deployment) we use option "1".
> Checking the YAML shared above, the Node's CR is reporting around 30.7 GiB of
> allocatable capacity and around 31.8 GiB of total capacity, whereas the
> "node_" metric is reporting around 36.8 GiB. Hence the mismatch between
> what's seen/calculated in OCP and ODF.

Checked AWS/BareMetal clusters for testing, and both were reporting correct node capacities (the CR and the metric reported nearly identical values). But, for some reason, in the above "PowerVM" cluster the CR is reporting a different value than the "node_" metric.

Moving it to 4.16 for now; please raise it as a blocker if we are sure that this is a bug and something that needs to be fixed in the 4.15.0 version itself.
As a fix/enhancement we will rely on the metric for this nodes table as well, instead of the CR (just like the OCP list page and our Topology page)...
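Roughly, this means the modal's aggregate check would be fed by the per-instance result of `sum by (instance) (node_memory_MemTotal_bytes)` rather than by `status.allocatable.memory`. A hedged sketch of that aggregation follows; the 96 GiB threshold, node names, and values are illustrative assumptions, not the exact console logic.

# Minimal sketch of the metric-based approach: given a per-instance result of
# `sum by (instance) (node_memory_MemTotal_bytes)`, aggregate the selected nodes
# and compare against the profile requirement.
REQUIRED_AGGREGATE_GIB = 96  # assumed requirement for the "performance" profile

def aggregate_from_metric(per_instance_bytes: dict, selected_nodes: list) -> float:
    """Sum node_memory_MemTotal_bytes (in bytes) for the selected nodes, in GiB."""
    return sum(per_instance_bytes[n] for n in selected_nodes) / 1024 ** 3

# Example values shaped like a Prometheus vector result keyed by the instance label
# (~36.8 GiB per node, as reported by the metric in this cluster).
per_instance = {"worker-0": 39.5e9, "worker-1": 39.5e9, "worker-2": 39.5e9}
total = aggregate_from_metric(per_instance, ["worker-0", "worker-1", "worker-2"])
print(f"Aggregate: {total:.1f} GiB ->",
      "profile change allowed" if total >= REQUIRED_AGGREGATE_GIB else "requirements not met")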
Moving the BZ to verified state based on comment 16.
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (Important: Red Hat OpenShift Data Foundation 4.16.0 security, enhancement & bug fix update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHSA-2024:4591