Description of problem:

When installing ODF v4.15, we now have the option to specify "Lean mode", "Balanced mode", or "Performance mode". Depending on what the user selects, different CPU/memory values are allocated (particularly for OSDs) when ODF is installed.

This approach is OK, but it has a noticeable issue: at this stage we do not know how many CPUs are present on the OCP nodes. If the user selects "Performance mode", which allocates 4 CPUs per OSD, and the system has many OSDs such that numOSD x 4 > TotalCPU_Available, then many OSDs will fail to start and the ODF installation will fail.

Version-Release number of selected component (if applicable):

OCP v4.15

oc get csv
NAME                                         DISPLAY                       VERSION             REPLACES   PHASE
mcg-operator.v4.15.0-130.stable              NooBaa Operator               4.15.0-130.stable              Succeeded
ocs-operator.v4.15.0-130.stable              OpenShift Container Storage   4.15.0-130.stable              Succeeded
odf-csi-addons-operator.v4.15.0-130.stable   CSI Addons                    4.15.0-130.stable              Succeeded
odf-operator.v4.15.0-130.stable              OpenShift Data Foundation     4.15.0-130.stable              Succeeded

How reproducible:

Always

Steps to Reproduce:

Easy to reproduce if the "Performance" profile CPU requests exceed the CPU resources available on the nodes.
1. Try to install ODF v4.15
2. Select the Performance profile
3. ODF installation will fail

Actual results:

ODF installation fails

Expected results:

ODF installation works

Additional info:

NA
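The failure condition from the description (numOSD x 4 > TotalCPU_Available) can be sketched as a quick arithmetic check. This is only an illustration, not ODF code: the function name and the sample numbers are hypothetical, and the 4 CPUs per OSD figure is the one quoted in the description for "Performance mode".

```python
def osd_cpu_shortfall(num_osds, cpu_per_osd, total_cpu):
    """Return by how many CPUs the OSD requests exceed the total (0 if they fit)."""
    requested = num_osds * cpu_per_osd
    return max(0, requested - total_cpu)


# Hypothetical cluster: 16 OSDs at 4 CPUs each against 48 CPUs in total.
shortfall = osd_cpu_shortfall(num_osds=16, cpu_per_osd=4, total_cpu=48)
if shortfall > 0:
    print(f"OSD requests exceed available CPU by {shortfall}")
else:
    print("OSD requests fit within available CPU")
```

Note that this only covers the OSD requests. Other ODF and OCP components also need CPU, so a zero shortfall here does not guarantee a successful installation.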
We had a lot of discussion about this situation during the design phase of the feature. We considered adding validation in the UI before a profile is chosen. The validation was intended to calculate whether the selected profile could be applied successfully given the cluster's available resources. But after long discussion and digging, we found it is impossible to predict whether a pod can be scheduled before actually doing it. K8s scheduling is a hard computational problem that we can't solve.

We decided to do 2 things:
1. Basic validation in the UI that checks the total resources available on the cluster against a minimum threshold we have decided on.
2. Clear messaging in the docs stating that, if you don't have sufficient resources, choosing a higher resource profile will result in failure, so please choose accordingly.
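To illustrate why a check on cluster-wide totals can only be a basic guard, the sketch below compares an aggregate check with a per-node fit. It is a simplified model under stated assumptions, not the UI's actual validation: only CPU requests are considered, all pods are identical, and the function names and numbers are hypothetical. The real scheduler also takes memory, other workloads, and placement constraints into account.

```python
from typing import List


def fits_in_aggregate(node_cpus: List[float], pods: int, cpu_per_pod: float) -> bool:
    """Basic check: do the CPUs summed over all nodes cover the total request?"""
    return sum(node_cpus) >= pods * cpu_per_pod


def fits_per_node(node_cpus: List[float], pods: int, cpu_per_pod: float) -> bool:
    """Stricter check: count how many whole pods fit on each node individually."""
    slots = sum(int(cpu // cpu_per_pod) for cpu in node_cpus)
    return slots >= pods


# Hypothetical cluster: 3 nodes with 6 free CPUs each, 4 pods requesting 4 CPUs each.
nodes = [6, 6, 6]
print("aggregate check passes:", fits_in_aggregate(nodes, pods=4, cpu_per_pod=4))
print("per-node check passes: ", fits_per_node(nodes, pods=4, cpu_per_pod=4))
```

In this example the totals look sufficient (18 CPUs for a 16 CPU request), yet each node has room for only one 4-CPU pod, so the fourth pod cannot be placed. This is the kind of gap the documentation warning is meant to cover.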
This warning, if implemented, would be on the console/UI. Moving the bug there and assigning it to Alfonso, as he has the context.
Verified with OCP 4.16.0-0.nightly-2024-05-23-173505 and ODF 4.16.0-113.stable provided by Red Hat.

Steps performed during verification:
1. Installed ODF on IBM Cloud machines with instance type bx2-16x64, giving total resources of 48 CPUs and 188.7 GiB across 3 zones.
2. Increased the number of OSDs from the existing 3 to 12 using the Add Capacity option.
3. With 12 OSDs, the total resources requested by the performance profile were greater than the available resources, so a proper error message was displayed and the mode selection option was greyed out. Refer to 12osd_resource_req_not met.png.
4. The required CPU and memory values for each performance profile also kept increasing as the number of OSDs was increased beyond 6. Refer to 6_osd.png and 9_osd.png.
5. Tried applying the "Balanced mode" performance profile, which was successful.
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (Important: Red Hat OpenShift Data Foundation 4.16.0 security, enhancement & bug fix update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHSA-2024:4591