Bug 2249678

Summary: the multus network address detection job does not derive placement configs from CephCluster "all" placement
Product: [Red Hat Storage] Red Hat OpenShift Data Foundation Reporter: Blaine Gardner <brgardne>
Component: rook    Assignee: Blaine Gardner <brgardne>
Status: CLOSED ERRATA QA Contact: Coady LaCroix <clacroix>
Severity: unspecified Docs Contact:
Priority: unspecified    
Version: 4.14    CC: ebenahar, kbg, mcaldeir, muagarwa, odf-bz-bot, tnielsen
Target Milestone: ---   
Target Release: ODF 4.15.0   
Hardware: Unspecified   
OS: Unspecified   
Whiteboard:
Fixed In Version: 4.15.0-103 Doc Type: Bug Fix
Doc Text:
.Incorrect pod placement configurations while detecting Multus Network Attachment Definition CIDRs
Previously, some OpenShift Data Foundation clusters failed because the network "canary" pods were scheduled on nodes without Multus cluster networks, as OpenShift Data Foundation did not process pod placement configurations correctly while detecting Multus Network Attachment Definition CIDRs. With this fix, OpenShift Data Foundation processes pod placement correctly for the Multus network "canary" pods. As a result, network "canary" scheduling errors are no longer experienced.
Story Points: ---
Clone Of:
: 2249735    Environment:
Last Closed: 2024-03-19 15:28:51 UTC Type: Bug
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: --- RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Embargoed:
Bug Depends On:    
Bug Blocks: 2246375, 2249735    

Description Blaine Gardner 2023-11-14 19:54:11 UTC
Description of problem (please be as detailed as possible and provide log
snippets):

The multus network address detection job does not derive placement from the CephCluster's "all" placement, only from "osd". This is a bug reported upstream here: https://github.com/rook/rook/issues/13138


This is also in the process of being fixed upstream here: https://github.com/rook/rook/pull/13206



Version of all relevant components (if applicable): ODF v4.14.0


Does this issue impact your ability to continue to work with the product
(please explain in detail what is the user impact)?

No, but it might be an upgrade issue for some existing customers using Multus support.


Is there any workaround available to the best of your knowledge?

A valid workaround for a user who is using the 'all' placement and experiencing issues is to manually specify cephcluster.spec.network.addressRanges for the cluster/public networks, as sketched below. This causes Rook to skip its network address autodetection process.
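
For example, a minimal sketch of that workaround on the CephCluster (the CIDR values below are placeholders and must be replaced with the cluster's actual Multus public/cluster network ranges):

  spec:
    network:
      # When addressRanges is set explicitly, Rook skips the canary-based
      # network address autodetection, so the placement bug is never hit.
      addressRanges:
        public:
          - "192.168.20.0/24"
        cluster:
          - "192.168.21.0/24"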



Rate from 1 - 5 the complexity of the scenario you performed that caused this
bug (1 - very simple, 5 - very complex)?

3 - somewhat complex since this requires multus AND CephCluster 'all' placement configs


Is this issue reproducible?

Yes. 


Can this issue be reproduced from the UI?

Not sure


If this is a regression, please provide more details to justify this:

I believe this is a regression. Customers who are currently using Multus and the 'all' placement spec might hit this issue. Not all users will hit it; that depends on whether the spec allows the detection job to run on another node in the cluster that has the requisite host networks.


Steps to Reproduce:

Taint all nodes in the OpenShift cluster, and then set the toleration for that taint only in the "all" section of the CephCluster placement.

For example, use this taint...

kubectl taint nodes --all node-role.kubernetes.io/storage=true:NoSchedule


And this placement spec on CephCluster...

  placement:
    all:
      tolerations:
        - effect: NoSchedule
          key: node-role.kubernetes.io/storage
          operator: Equal
          value: "true"


Actual results:

rook-ceph-network-*-canary jobs will remain in pending with an error event like below:

Warning  FailedScheduling  12s   default-scheduler  0/3 nodes are available: 3 node(s) had untolerated taint {node-role.kubernetes.io/storage: true}. preemption: 0/3 nodes are available: 3 Preemption is not helpful for scheduling..
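
For reference, the pending canary pods and their FailedScheduling events can be inspected like this (a sketch; openshift-storage is assumed to be the ODF namespace, and <canary-pod-name> is a placeholder):

# list the multus canary pods created by the network detection job
kubectl -n openshift-storage get pods | grep 'rook-ceph-network-.*-canary'

# show the scheduling events for one of the pending canary pods
kubectl -n openshift-storage describe pod <canary-pod-name>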



Expected results:

rook-ceph-network-*-canary jobs should be schedulable with 'all' placement settings.

Comment 3 Blaine Gardner 2023-11-15 16:36:58 UTC
I'd like to offer an alternative, better workaround for 4.14.0:

The StorageCluster has various placement options for different components. The StorageCluster defaults are safe. If users do not modify StorageCluster placement configs, nothing needs to be done.

If the customer is using Multus and custom placement options are specified in the StorageCluster, then users need to consider the workaround here:
  Any placement configs in the 'all' section should be duplicated in the 'osd' section to prevent issues with multus network detection.
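
To illustrate, here is the CephCluster placement from the reproducer above with the 'all' toleration duplicated under 'osd' (a sketch only; the equivalent change should be made wherever the customer manages their placement configs, e.g. the StorageCluster):

  placement:
    all:
      tolerations:
        - effect: NoSchedule
          key: node-role.kubernetes.io/storage
          operator: Equal
          value: "true"
    # duplicate of the 'all' toleration, so the network detection job,
    # which only derives placement from 'osd', can also be scheduled
    osd:
      tolerations:
        - effect: NoSchedule
          key: node-role.kubernetes.io/storage
          operator: Equal
          value: "true"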

Comment 4 Blaine Gardner 2023-11-15 16:43:36 UTC
This is now fixed in the upstream code that will become 4.15. Moving to MODIFIED.

Comment 8 Coady LaCroix 2024-02-09 01:35:00 UTC
I have manually verified this issue has been fixed. I've been working on automation in ocs-ci to ensure we check that this doesn't regress, but we have been hitting issues with our deployment environments. I wanted to wait until automation was in place before closing this, but in the interest of time I am marking this as verified and will open a ticket on our own board to implement the changes to our automation for this.

Comment 10 Sunil Kumar Acharya 2024-03-06 12:43:23 UTC
Please provide the Doc text.

Comment 11 Blaine Gardner 2024-03-07 19:25:05 UTC
Done. Sorry for the delay.

Comment 12 errata-xmlrpc 2024-03-19 15:28:51 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Important: Red Hat OpenShift Data Foundation 4.15.0 security, enhancement, & bug fix update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2024:1383