Description of problem (please be as detailed as possible and provide log snippets):

Customer has 5 worker nodes but, due to constraints, cannot dedicate nodes to ODF. Because of the workload on the cluster, the customer states the following:

We have an OCP/ODF cluster with a large number of vCPU requests. A MachineConfig update is unable to complete because the OCS/ODF rook-ceph-mgr pod is stuck Pending due to insufficient CPU. This pod really should have a higher priority to prevent these sorts of issues; infrastructure and storage are cluster-critical and should schedule first in case of vCPU contention.

Version of all relevant components (if applicable):

Does this issue impact your ability to continue to work with the product (please explain in detail what is the user impact)?

Is there any workaround available to the best of your knowledge?
The only workaround is to remove some of the other apps to relieve the vCPU contention, after which the mgr pod will schedule.

Rate from 1 - 5 the complexity of the scenario you performed that caused this bug (1 - very simple, 5 - very complex)?

Can this issue be reproduced?

Can this issue be reproduced from the UI?

If this is a regression, please provide more details to justify this:

Steps to Reproduce:
1.
2.
3.

Actual results:

Expected results:

Additional info:
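For triage, the stuck mgr pod and the scheduler's reason for it can be confirmed with something like the following (a sketch; the openshift-storage namespace is assumed and the pod name suffix is illustrative). The Events section should show a FailedScheduling event with an "Insufficient cpu" message:

# oc get pods -n openshift-storage --field-selector=status.phase=Pending
# oc describe pod rook-ceph-mgr-a-<suffix> -n openshift-storage | grep -A5 Events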
OCS-op is setting the priority class of mon/mgr/osd to "system-node-critical". Is there anything else we can do to increase the priority?

Moving back to OCS-Op, since it is the component assigning the priority class, and setting needinfo.

Kevan, any recommendation here? Thanks.
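For reference, system-node-critical is already the highest of the built-in priority classes, so there is no higher class available to assign. The classes and their values can be inspected with the following (output illustrative):

# oc get priorityclass
NAME                      VALUE        GLOBAL-DEFAULT
openshift-user-critical   1000000000   false
system-cluster-critical   2000000000   false
system-node-critical      2000001000   false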
(In reply to Sébastien Han from comment #4)
> OCS-op is setting the priority class of mon/mgr/osd to
> "system-node-critical".
> Is there anything else we can do to increase the priority?
>
> Moving back to OCS-Op since it is assigning the priority class and needinfo.
>
> Kevan, any recommendation here?
> Thanks.

I did see that the OCS pods had preemptionPolicy: PreemptLowerPriority, but I could not find where the priority class was being set in OCS-Op or in the StorageCluster YAML, so I was not sure whether the pods were set to "system-node-critical" by default.

I had the customer try the following in their test cluster, but now that I know the default, it won't make any difference:

# oc patch StorageCluster ocs-storagecluster --type='merge' -p '{"spec":{"priorityClassName":{"mon":"system-node-critical","mgr":"system-node-critical","osd":"system-node-critical","mds":"system-node-critical"}}}'

Need further assistance on this for the customer.
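A quick way to confirm what the pods are actually running with, independent of what the StorageCluster accepts, is to read the fields straight off the pods; a sketch, assuming the default openshift-storage namespace:

# oc get pods -n openshift-storage -o custom-columns=NAME:.metadata.name,PRIORITY_CLASS:.spec.priorityClassName,PRIORITY:.spec.priority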
Now I remember, I worked on the PR for this. Here it is: https://github.com/red-hat-storage/ocs-operator/pull/1173/files

We are setting all the priority classes by default in the Ceph resources via ocs-operator. These values cannot be fed in from the StorageCluster at all. Please find the list of pods and the priority classes we set:

Mgr: systemNodeCritical
Mon: systemNodeCritical
OSD: systemNodeCritical
MDS: openshiftUserCritical
RGW: openshiftUserCritical
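Those defaults should be visible on the CephCluster CR that ocs-operator generates (the MDS and RGW classes live on the CephFilesystem and CephObjectStore CRs instead). A sketch, assuming the openshift-storage namespace; output illustrative:

# oc get cephcluster -n openshift-storage -o jsonpath='{.items[0].spec.priorityClassNames}'
{"mgr":"system-node-critical","mon":"system-node-critical","osd":"system-node-critical"}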
Customer is on OCS version 4.6.5 in the test cluster. Asked if they could upgrade to 4.8.
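For completeness, the installed OCS version can be read from the ClusterServiceVersion in the storage namespace; a sketch assuming the default namespace:

# oc get csv -n openshift-storage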