Bug 2249678 - the multus network address detection job does not derive placement configs from CephCluster "all" placement
Summary: the multus network address detection job does not derive placement configs from CephCluster "all" placement
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat OpenShift Data Foundation
Classification: Red Hat Storage
Component: rook
Version: 4.14
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: unspecified
Target Milestone: ---
Target Release: ODF 4.15.0
Assignee: Blaine Gardner
QA Contact: Coady LaCroix
URL:
Whiteboard:
Depends On:
Blocks: 2246375 2249735
 
Reported: 2023-11-14 19:54 UTC by Blaine Gardner
Modified: 2024-03-19 15:29 UTC
CC: 6 users

Fixed In Version: 4.15.0-103
Doc Type: Bug Fix
Doc Text:
.Incorrect pod placement configurations while detecting Multus Network Attachment Definition CIDRs
Previously, some OpenShift Data Foundation clusters failed because the network "canary" pods were scheduled on nodes without Multus cluster networks, as OpenShift Data Foundation did not process pod placement configurations correctly while detecting Multus Network Attachment Definition CIDRs. With this fix, OpenShift Data Foundation processes pod placement correctly for the Multus network "canary" pods. As a result, network "canary" scheduling errors are no longer experienced.
Clone Of:
: 2249735
Environment:
Last Closed: 2024-03-19 15:28:51 UTC
Embargoed:




Links
- Github rook issue 13138 (open): rook-ceph-network-cluster-canary pod does not get toleration settings from the cephcluster spec. (last updated 2023-11-14 19:54:10 UTC)
- Github rook pull 13206 (open): multus: fix placement error for net addr detect job (last updated 2023-11-14 19:54:10 UTC)
- Red Hat Knowledge Base (Solution) 7044764 (last updated 2023-11-20 02:45:41 UTC)
- Red Hat Product Errata RHSA-2024:1383 (last updated 2024-03-19 15:29:03 UTC)

Description Blaine Gardner 2023-11-14 19:54:11 UTC
Description of problem (please be as detailed as possible and provide log
snippets):

The multus network address detection job does not derive placement from the CephCluster's "all" placement, only from "osd". This is a bug reported upstream here: https://github.com/rook/rook/issues/13138


This is also in the process of being fixed upstream here: https://github.com/rook/rook/pull/13206



Version of all relevant components (if applicable): ODF v4.14.0


Does this issue impact your ability to continue to work with the product
(please explain in detail what is the user impact)?

No, but it might be an upgrade issue for some existing customers using Multus under a support exception.


Is there any workaround available to the best of your knowledge?

A valid workaround for a user who is hitting this issue with the 'all' placement is to manually specify cephcluster.spec.network.addressRanges for the cluster/public networks. This causes Rook to skip its network address autodetection process.
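
For illustration, here is a minimal sketch of that setting on the CephCluster spec; the CIDRs below are placeholders and must be replaced with the address ranges the Multus public/cluster networks actually use:

  network:
    addressRanges:
      # example ranges only -- use the real Multus network CIDRs
      public:
        - "192.168.20.0/24"
      cluster:
        - "192.168.21.0/24"

With addressRanges set explicitly, Rook skips its network address autodetection, so the canary placement problem never comes into play.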



Rate from 1 - 5 the complexity of the scenario you performed that caused this
bug (1 - very simple, 5 - very complex)?

3 - somewhat complex since this requires multus AND CephCluster 'all' placement configs


Is this issue reproducible?

Yes. 


Can this issue be reproduced from the UI?

Not sure


If this is a regression, please provide more details to justify this:

I believe this is a regression. Customers who are currently using Multus and the 'all' placement spec might hit this issue. Not all users will hit it; that depends on whether the spec allows the detection job to run on another node in the cluster that has the requisite host networks.


Steps to Reproduce:

Taint all nodes in the OpenShift cluster, and then add the toleration for that taint only to the "all" section of the CephCluster placement.

For example, use this taint...

kubectl taint nodes --all node-role.kubernetes.io/storage=true:NoSchedule


And this placement spec on CephCluster...

  placement:
    all:
      tolerations:
        - effect: NoSchedule
          key: node-role.kubernetes.io/storage
          operator: Equal
          value: "true"


Actual results:

rook-ceph-network-*-canary jobs will remain Pending with an error event like the one below:

Warning  FailedScheduling  12s   default-scheduler  0/3 nodes are available: 3 node(s) had untolerated taint {node-role.kubernetes.io/storage: true}. preemption: 0/3 nodes are available: 3 Preemption is not helpful for scheduling..
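
For reference, the pending canary pods and this scheduling event can be inspected with standard commands (assuming the default openshift-storage namespace):

  oc -n openshift-storage get pods | grep network
  oc -n openshift-storage describe pod <rook-ceph-network-*-canary pod name>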



Expected results:

rook-ceph-network-*-canary jobs should be schedulable with 'all' placement settings.

Comment 3 Blaine Gardner 2023-11-15 16:36:58 UTC
I'd like to offer an alternative, and better, workaround for 4.14.0:

The StorageCluster has various placement options for different components. The StorageCluster defaults are safe. If users do not modify StorageCluster placement configs, nothing needs to be done.

If the customer is using Multus and custom placement options are specified in the StorageCluster, then users need to consider the workaround here:
  Any placement configs in the 'all' section should be duplicated in the 'osd' section to prevent issues with multus network detection.
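
For illustration, a duplicated placement fragment might look like the following. This is only a sketch reusing the toleration from the reproducer above; confirm the exact placement keys the StorageCluster honors against the ocs-operator documentation for the installed version:

  placement:
    all:
      tolerations:
        - effect: NoSchedule
          key: node-role.kubernetes.io/storage
          operator: Equal
          value: "true"
    osd:
      # duplicate of the 'all' tolerations so the multus network address
      # detection job, which only derives placement from 'osd', can schedule
      tolerations:
        - effect: NoSchedule
          key: node-role.kubernetes.io/storage
          operator: Equal
          value: "true"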

Comment 4 Blaine Gardner 2023-11-15 16:43:36 UTC
This is now fixed in the upstream code that will become 4.15. Moving to MODIFIED.

Comment 8 Coady LaCroix 2024-02-09 01:35:00 UTC
I have manually verified that this issue has been fixed. I've been working on automation in ocs-ci to ensure this doesn't regress, but we have been hitting issues with our deployment environments. I wanted to wait until the automation was in place before closing this, but in the interest of time I am marking it as verified and will open a ticket on our own board to implement the automation changes.

Comment 10 Sunil Kumar Acharya 2024-03-06 12:43:23 UTC
Please provide the Doc text.

Comment 11 Blaine Gardner 2024-03-07 19:25:05 UTC
Done. Sorry for the delay.

Comment 12 errata-xmlrpc 2024-03-19 15:28:51 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Important: Red Hat OpenShift Data Foundation 4.15.0 security, enhancement, & bug fix update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2024:1383

