Bug 2028984

Summary: [GSS][RFE] Option for ODF pods to be system-node-critical
Product: [Red Hat Storage] Red Hat OpenShift Data Foundation
Reporter: khover
Component: ocs-operator
Assignee: Jose A. Rivera <jrivera>
Status: CLOSED INSUFFICIENT_DATA
QA Contact: Elad <ebenahar>
Severity: unspecified
Docs Contact:
Priority: high
Version: 4.7
CC: hnallurv, jrivera, madam, mmuench, muagarwa, nigoyal, ocs-bugs, odf-bz-bot, rcyriac, shan, sostapov, tnielsen
Target Milestone: ---
Keywords: FutureFeature
Target Release: ---
Flags: khover: needinfo? (jrivera)
       muagarwa: needinfo? (khover)
Hardware: Unspecified
OS: Unspecified
Whiteboard:
Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of:
Environment:
Last Closed: 2022-05-30 10:53:24 UTC
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Embargoed:

Description khover 2021-12-03 22:08:20 UTC
Description of problem (please be as detailed as possible and provide log
snippets):

Customer has 5 worker nodes but, due to constraints, cannot dedicate nodes to ODF.
Given the workload on the cluster, the customer states the following:

We have an OCP/ODF cluster in which there are many vCPU requests.

A MachineConfig update is unable to complete because the OCS/ODF rook-ceph-mgr pod is stuck in Pending due to insufficient CPU. However, this pod really should have a higher priority to prevent these sorts of issues; infrastructure and storage are cluster-critical and should be scheduled first in case of vCPU contention.
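
For reference, the pod's current priority class and the scheduler's reason for keeping it Pending can be checked directly (a sketch; assumes the default openshift-storage namespace and the standard app=rook-ceph-mgr label):

# openshift-storage is the default ODF namespace; adjust if installed elsewhere
oc -n openshift-storage get pod -l app=rook-ceph-mgr \
  -o custom-columns=NAME:.metadata.name,PHASE:.status.phase,PRIORITY_CLASS:.spec.priorityClassName

# The "Insufficient cpu" scheduling reason shows up in the pod events
oc -n openshift-storage describe pod -l app=rook-ceph-mgr | grep -A 5 Events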


Version of all relevant components (if applicable):


Does this issue impact your ability to continue to work with the product
(please explain in detail what is the user impact)?



Is there any workaround available to the best of your knowledge?

The only workaround would be to remove some of the other apps to relieve the vCPU contention, after which the mgr pod will schedule. To confirm the contention, see the sketch below.
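
Requested CPU can be compared against each node's allocatable capacity, for example (a sketch; <worker-node> is a placeholder):

# "Allocated resources" near the end of the output shows requested vs. allocatable CPU
oc describe node <worker-node> | grep -A 8 "Allocated resources"

# Actual usage (requires the cluster metrics stack)
oc adm top nodes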

Rate from 1 - 5 the complexity of the scenario you performed that caused this
bug (1 - very simple, 5 - very complex)?


Is this issue reproducible?


Can this issue be reproduced from the UI?


If this is a regression, please provide more details to justify this:


Steps to Reproduce:
1.
2.
3.


Actual results:


Expected results:


Additional info:

Comment 4 Sébastien Han 2021-12-06 09:24:19 UTC
OCS-op is setting the priority class of mon/mgr/osd to "system-node-critical".
Is there anything else we can do to increase the priority?

Moving back to OCS-Op since it is assigning the priority class and needinfo.


Kevan, any recommendation here?
Thanks.
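
For reference, the cluster-level priority classes and their numeric values can be listed like this (a sketch, not specific to any ODF version):

# List all priority classes defined in the cluster
oc get priorityclass

# Show only the value of system-node-critical
oc get priorityclass system-node-critical -o jsonpath='{.value}{"\n"}'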

Comment 5 khover 2021-12-06 12:04:00 UTC
(In reply to Sébastien Han from comment #4)
> OCS-op is setting the priority class of mon/mgr/osd to
> "system-node-critical".
> Is there anything else we can do to increase the priority?
> 
> Moving back to OCS-Op since it is assigning the priority class and needinfo.
> 
> 
> Kevan, any recommendation here?
> Thanks.

I did see that the OCS pods had preemptionPolicy: PreemptLowerPriority.

But I could not find where the priority class was being set, either in OCS-Op or in the StorageCluster YAML.

So I was not sure whether the pods were set to "system-node-critical" by default.

I had the customer try the following in their test cluster, but now that I know the default, it won't make any difference.

# oc patch StorageCluster ocs-storagecluster --type='merge' -p '{"spec":{"priorityClassName":{"mon":"system-node-critical","mgr":"system-node-critical","osd":"system-node-critical","mds":"system-node-critical"}}}'
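
For reference, whether the defaults (or the patch above) actually landed on the pods can be checked on the pod specs themselves, which also shows the preemptionPolicy mentioned above (a sketch; assumes the default openshift-storage namespace):

# Priority class and preemption policy actually applied to the running pods
oc -n openshift-storage get pods \
  -o custom-columns=NAME:.metadata.name,PRIORITY_CLASS:.spec.priorityClassName,PREEMPTION:.spec.preemptionPolicy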



Need further assistance on this for the customer.

Comment 6 Nitin Goyal 2021-12-06 12:58:40 UTC
Now I remember: I worked on the PR for this. Here it is: https://github.com/red-hat-storage/ocs-operator/pull/1173/files

We are setting all of these priority classes by default in the Ceph resources via ocs-operator. These values cannot be fed in from the StorageCluster at all.

Please find below the list of pods and the priority classes we set:

Mgr: systemNodeCritical
Mon: systemNodeCritical
OSD: systemNodeCritical
MDS: openshiftUserCritical
RGW: openshiftUserCritical
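
For reference, what the operator actually rendered can be read back from the CephCluster resource (a sketch; assumes the default openshift-storage namespace, the generated name ocs-storagecluster-cephcluster, and that the Rook CephCluster CRD exposes spec.priorityClassNames as upstream Rook does):

# Priority classes written by ocs-operator into the CephCluster spec
oc -n openshift-storage get cephcluster ocs-storagecluster-cephcluster \
  -o jsonpath='{.spec.priorityClassNames}{"\n"}'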

Comment 7 khover 2021-12-06 17:22:24 UTC
The customer is on OCS version 4.6.5 in the test cluster.

Asked if he could upgrade to 4.8.