Bug 2273039 - Storage nodes run out of capacity for Ceph OSDs when the `infra` node role is applied
Summary: Storage nodes run out of capacity for Ceph OSDs when the `infra` node role is applied
Keywords:
Status: NEW
Alias: None
Product: Red Hat OpenShift Data Foundation
Classification: Red Hat Storage
Component: documentation
Version: 4.12
Hardware: x86_64
OS: Unspecified
Priority: unspecified
Severity: high
Target Milestone: ---
Target Release: ---
Assignee: Anjana Suparna Sriram
QA Contact: Neha Berry
URL:
Whiteboard:
Depends On:
Blocks:
TreeView+ depends on / blocked
 
Reported: 2024-04-03 19:12 UTC by Sam Yangsao
Modified: 2024-04-03 19:23 UTC
CC List: 1 user

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed:
Embargoed:



Description Sam Yangsao 2024-04-03 19:12:54 UTC
Description of problem (please be as detailed as possible and provide log
snippets):

Storage nodes run out of capacity for Ceph OSDs when the `infra` node role is applied. Applying the role causes infra-related components to be scheduled onto these nodes, where they consume additional CPU/memory resources.

Version of all relevant components (if applicable):

4.12+

Does this issue impact your ability to continue to work with the product
(please explain in detail what is the user impact)?

The customer applies the `infra` node role to the storage worker nodes, following the best practice in [1], so that these nodes do not count toward their OCP subscription/entitlement.

[1] https://access.redhat.com/documentation/en-us/red_hat_openshift_data_foundation/4.15/html-single/managing_and_allocating_storage_resources/index#how-to-use-dedicated-worker-nodes-for-openshift-data-foundation_rhodf 
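
For context, a minimal sketch of the labeling step described in [1]. The node name is a placeholder, and whether the customer ran exactly these commands is an assumption; the labels themselves are the ones named in [1] and the standard infra role label:

  # Mark the storage worker as an infra node (so it does not consume an OCP subscription)
  oc label node <node-name> node-role.kubernetes.io/infra=""
  # ODF storage label from [1], so the node is selected for ODF components
  oc label node <node-name> cluster.ocs.openshift.io/openshift-storage=""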

Is there any workaround available to the best of your knowledge?

None

Rate from 1 - 5 the complexity of the scenario you performed that caused this
bug (1 - very simple, 5 - very complex)?

2

Is this issue reproducible?

Always

Can this issue be reproduced from the UI?

Unsure

If this is a regression, please provide more details to justify this:

Steps to Reproduce:

1.  Configure ODF and a storage system, with projects using the Ceph LUNs
2.  Set up node labels to apply the `infra` role to the storage nodes
3.  Set up the default router pods to run only on `infra` nodes, migrating the router pod workloads onto the storage nodes (see the node placement sketch after this list)
4.  Increase the traffic going to the router pods; the router pods then take CPU/memory away from the pods running the ODF storage components (the Ceph OSDs, for example), causing instability in ODF
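
For step 3, a rough sketch of the kind of node placement involved, using the standard OpenShift IngressController `nodePlacement` field; the customer's exact spec is an assumption:

  apiVersion: operator.openshift.io/v1
  kind: IngressController
  metadata:
    name: default
    namespace: openshift-ingress-operator
  spec:
    nodePlacement:
      # Schedule the default router pods only onto nodes carrying the infra role
      nodeSelector:
        matchLabels:
          node-role.kubernetes.io/infra: ""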

Actual results:

Ceph becomes unstable, dropping OSDs, since there is not enough CPU/memory left for the OSD pods.

Expected results:

Include instructions on how to prevent any additional `infra` workloads from running when this node role is applied to storage nodes.
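
One candidate for such instructions is the taint already documented in [1]: ODF components tolerate it, while other workloads (including the router pods) would need an explicit toleration to land on the node. Whether this fully covers the customer's infra workloads is an assumption to verify; the node name is a placeholder:

  # Taint from [1]; ODF pods carry the matching toleration, generic infra workloads do not
  oc adm taint nodes <node-name> node.ocs.openshift.io/storage="true":NoSchedule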

Additional info:

A quick workaround would be to remove the `infra` node role from these storage nodes so that `infra`-related workloads no longer land on them. Would that be a supported and documented option, other than having the customer set up taints/tolerations across all `infra`-related components?
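
For reference, a sketch of that quick workaround (the trailing `-` removes the label; the node name is a placeholder):

  # Remove the infra role so workloads selecting node-role.kubernetes.io/infra no longer schedule here
  oc label node <node-name> node-role.kubernetes.io/infra-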

