Bug 2249678 - the multus network address detection job does not derive placement configs from CephCluster "all" placement
Summary: the multus network address detection job does not derive placement configs from CephCluster "all" placement
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat OpenShift Data Foundation
Classification: Red Hat Storage
Component: rook
Version: 4.14
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: unspecified
Target Milestone: ---
Target Release: ODF 4.15.0
Assignee: Blaine Gardner
QA Contact: Coady LaCroix
URL:
Whiteboard:
Depends On:
Blocks: 2246375 2249735
 
Reported: 2023-11-14 19:54 UTC by Blaine Gardner
Modified: 2024-03-19 15:29 UTC
CC: 6 users

Fixed In Version: 4.15.0-103
Doc Type: Bug Fix
Doc Text:
.Incorrect pod placement configurations while detecting Multus Network Attachment Definition CIDRs
Previously, some OpenShift Data Foundation clusters failed because the network "canary" pods were scheduled on nodes without Multus cluster networks, as OpenShift Data Foundation did not process pod placement configurations correctly while detecting Multus Network Attachment Definition CIDRs. With this fix, OpenShift Data Foundation processes pod placement correctly for the Multus network "canary" pods. As a result, network "canary" scheduling errors are no longer experienced.
Clone Of:
: 2249735
Environment:
Last Closed: 2024-03-19 15:28:51 UTC
Embargoed:




Links
- Github rook issue 13138 (open): rook-ceph-network-cluster-canary pod does not get toleration settings from the cephcluster spec. (last updated 2023-11-14 19:54:10 UTC)
- Github rook pull 13206 (open): multus: fix placement error for net addr detect job (last updated 2023-11-14 19:54:10 UTC)
- Red Hat Knowledge Base (Solution) 7044764 (last updated 2023-11-20 02:45:41 UTC)
- Red Hat Product Errata RHSA-2024:1383 (last updated 2024-03-19 15:29:03 UTC)

Description Blaine Gardner 2023-11-14 19:54:11 UTC
Description of problem (please be as detailed as possible and provide log
snippets):

The multus network address detection job does not derive placement from the CephCluster's "all" placement, only from "osd". This is a bug reported upstream here: https://github.com/rook/rook/issues/13138


This is also in the process of being fixed upstream here: https://github.com/rook/rook/pull/13206



Version of all relevant components (if applicable): ODF v4.14.0


Does this issue impact your ability to continue to work with the product
(please explain in detail what is the user impact)?

No, but it might be an upgrade issue for some existing customers using Multus under a support exception.


Is there any workaround available to the best of your knowledge?

A valid workaround for a user who is hitting this issue with the 'all' placement is to manually specify cephcluster.spec.network.addressRanges for the cluster/public networks. This causes Rook to skip its network address autodetection process.
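
For illustration, here is a minimal sketch of that setting on the CephCluster spec; the CIDRs below are placeholders and must be replaced with the address ranges the Multus public/cluster networks actually use:

  network:
    addressRanges:
      # example ranges only -- use the real Multus network CIDRs
      public:
        - "192.168.20.0/24"
      cluster:
        - "192.168.21.0/24"

With addressRanges set explicitly, Rook skips its network address autodetection, so the canary placement problem never comes into play.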



Rate from 1 - 5 the complexity of the scenario you performed that caused this
bug (1 - very simple, 5 - very complex)?

3 - somewhat complex since this requires multus AND CephCluster 'all' placement configs


Is this issue reproducible?

Yes. 


Can this issue be reproduced from the UI?

Not sure


If this is a regression, please provide more details to justify this:

I believe this is a regression. Customers who are currently using Multus and the 'all' placement spec might hit this issue. Not all users will hit it; that depends on whether the spec allows the detection job to run on another node in the cluster that has the requisite host networks.


Steps to Reproduce:

Taint all nodes in the OpenShift cluster, and then add the toleration for that taint only to the "all" section of the CephCluster placement.

For example, use this taint...

kubectl taint nodes --all node-role.kubernetes.io/storage=true:NoSchedule


And this placement spec on CephCluster...

  placement:
    all:
      tolerations:
        - effect: NoSchedule
          key: node-role.kubernetes.io/storage
          operator: Equal
          value: "true"


Actual results:

rook-ceph-network-*-canary jobs will remain Pending with an error event like the one below:

Warning  FailedScheduling  12s   default-scheduler  0/3 nodes are available: 3 node(s) had untolerated taint {node-role.kubernetes.io/storage: true}. preemption: 0/3 nodes are available: 3 Preemption is not helpful for scheduling..
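
For reference, the pending canary pods and this scheduling event can be inspected with standard commands (assuming the default openshift-storage namespace):

  oc -n openshift-storage get pods | grep network
  oc -n openshift-storage describe pod <rook-ceph-network-*-canary pod name>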



Expected results:

rook-ceph-network-*-canary jobs should be schedulable with 'all' placement settings.

Comment 3 Blaine Gardner 2023-11-15 16:36:58 UTC
I'd like to offer an alternative, and better, workaround for 4.14.0:

The StorageCluster has various placement options for different components. The StorageCluster defaults are safe. If users do not modify StorageCluster placement configs, nothing needs to be done.

If the customer is using Multus and custom placement options are specified in the StorageCluster, then users need to consider the workaround here:
  Any placement configs in the 'all' section should be duplicated in the 'osd' section to prevent issues with multus network detection.
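
For illustration, a duplicated placement fragment might look like the following. This is only a sketch reusing the toleration from the reproducer above; confirm the exact placement keys the StorageCluster honors against the ocs-operator documentation for the installed version:

  placement:
    all:
      tolerations:
        - effect: NoSchedule
          key: node-role.kubernetes.io/storage
          operator: Equal
          value: "true"
    osd:
      # duplicate of the 'all' tolerations so the multus network address
      # detection job, which only derives placement from 'osd', can schedule
      tolerations:
        - effect: NoSchedule
          key: node-role.kubernetes.io/storage
          operator: Equal
          value: "true"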

Comment 4 Blaine Gardner 2023-11-15 16:43:36 UTC
This is now fixed in the upstream code that will become 4.15. Moving to MODIFIED.

Comment 8 Coady LaCroix 2024-02-09 01:35:00 UTC
I have manually verified that this issue has been fixed. I've been working on automation in ocs-ci to ensure this doesn't regress, but we have been hitting issues with our deployment environments. I wanted to wait until the automation was in place before closing this, but in the interest of time I am marking it as verified and will open a ticket on our own board to implement the automation changes.

Comment 10 Sunil Kumar Acharya 2024-03-06 12:43:23 UTC
Please provide the Doc text.

Comment 11 Blaine Gardner 2024-03-07 19:25:05 UTC
Done. Sorry for the delay.

Comment 12 errata-xmlrpc 2024-03-19 15:28:51 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Important: Red Hat OpenShift Data Foundation 4.15.0 security, enhancement, & bug fix update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2024:1383

