Description of problem (please be as detailed as possible and provide log snippets):

Customer has 5 worker nodes but, due to constraints, cannot dedicate nodes to ODF. Because of the workload on the cluster, the customer states the following:

We have an OCP/ODF cluster with a large number of vCPU requests. A MachineConfig update is unable to complete because the OCS/ODF rook-ceph-mgr pod is stuck Pending due to insufficient CPU. This pod really should have a higher priority to prevent these sorts of issues; infrastructure and storage are cluster-critical and should schedule first in case of vCPU contention.

Version of all relevant components (if applicable):

Does this issue impact your ability to continue to work with the product (please explain in detail what is the user impact)?

Is there any workaround available to the best of your knowledge?
The only workaround is to remove some of the other apps to relieve the vCPU contention, after which the mgr pod will schedule.

Rate from 1 - 5 the complexity of the scenario you performed that caused this bug (1 - very simple, 5 - very complex)?

Can this issue be reproduced?

Can this issue be reproduced from the UI?

If this is a regression, please provide more details to justify this:

Steps to Reproduce:
1.
2.
3.

Actual results:

Expected results:

Additional info:
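For triage, the stuck mgr pod and the scheduler's reason for it can be confirmed with something like the following (a sketch; the openshift-storage namespace is assumed and the pod name suffix is illustrative). The Events section should show a FailedScheduling event with an "Insufficient cpu" message:

# oc get pods -n openshift-storage --field-selector=status.phase=Pending
# oc describe pod rook-ceph-mgr-a-<suffix> -n openshift-storage | grep -A5 Events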
OCS-op is setting the priority class of mon/mgr/osd to "system-node-critical". Is there anything else we can do to increase the priority?

Moving back to OCS-Op, since it is the component assigning the priority class, and setting needinfo.

Kevan, any recommendation here? Thanks.
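For reference, system-node-critical is already the highest of the built-in priority classes, so there is no higher class available to assign. The classes and their values can be inspected with the following (output illustrative):

# oc get priorityclass
NAME                      VALUE        GLOBAL-DEFAULT
openshift-user-critical   1000000000   false
system-cluster-critical   2000000000   false
system-node-critical      2000001000   false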
(In reply to Sébastien Han from comment #4)
> OCS-op is setting the priority class of mon/mgr/osd to
> "system-node-critical".
> Is there anything else we can do to increase the priority?
>
> Moving back to OCS-Op since it is assigning the priority class and needinfo.
>
> Kevan, any recommendation here?
> Thanks.

I did see that the OCS pods had preemptionPolicy: PreemptLowerPriority, but I could not find where the priority class was being set in OCS-Op or in the StorageCluster YAML, so I was not sure whether the pods were set to "system-node-critical" by default.

I had the customer try the following in their test cluster, but now that I know the default, it won't make any difference:

# oc patch StorageCluster ocs-storagecluster --type='merge' -p '{"spec":{"priorityClassName":{"mon":"system-node-critical","mgr":"system-node-critical","osd":"system-node-critical","mds":"system-node-critical"}}}'

Need further assistance on this for the customer.
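A quick way to confirm what the pods are actually running with, independent of what the StorageCluster accepts, is to read the fields straight off the pods; a sketch, assuming the default openshift-storage namespace:

# oc get pods -n openshift-storage -o custom-columns=NAME:.metadata.name,PRIORITY_CLASS:.spec.priorityClassName,PRIORITY:.spec.priority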
Now I remember, I worked on the PR for this. Here it is: https://github.com/red-hat-storage/ocs-operator/pull/1173/files

We are setting all the priority classes by default in the Ceph resources via ocs-operator. These values cannot be fed in from the StorageCluster at all. Please find the list of pods and the priority classes we set:

Mgr: systemNodeCritical
Mon: systemNodeCritical
OSD: systemNodeCritical
MDS: openshiftUserCritical
RGW: openshiftUserCritical
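Those defaults should be visible on the CephCluster CR that ocs-operator generates (the MDS and RGW classes live on the CephFilesystem and CephObjectStore CRs instead). A sketch, assuming the openshift-storage namespace; output illustrative:

# oc get cephcluster -n openshift-storage -o jsonpath='{.items[0].spec.priorityClassNames}'
{"mgr":"system-node-critical","mon":"system-node-critical","osd":"system-node-critical"}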
Customer is on OCS version 4.6.5 in the test cluster. Asked if they could upgrade to 4.8.
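For completeness, the installed OCS version can be read from the ClusterServiceVersion in the storage namespace; a sketch assuming the default namespace:

# oc get csv -n openshift-storage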