Description of problem (please be as detailed as possible and provide log snippets):

The multus network address detection job does not derive its placement from the CephCluster's "all" placement, only from "osd". This bug was reported upstream here: https://github.com/rook/rook/issues/13138 and is in the process of being fixed upstream here: https://github.com/rook/rook/pull/13206

Version of all relevant components (if applicable):
ODF v4.14.0

Does this issue impact your ability to continue to work with the product (please explain in detail what is the user impact)?
No, but it might be an upgrade issue for some existing customers using Multus.

Is there any workaround available to the best of your knowledge?
A valid workaround for a user who is experiencing issues and is using the 'all' placement is to manually specify cephcluster.spec.network.addressRanges for the cluster/public networks. This causes Rook to skip its network address autodetection process (see the sketch at the end of this comment).

Rate from 1 - 5 the complexity of the scenario you performed that caused this bug (1 - very simple, 5 - very complex)?
3 - somewhat complex, since this requires both Multus and CephCluster 'all' placement configs.

Is this issue reproducible?
Yes.

Can this issue be reproduced from the UI?
Not sure.

If this is a regression, please provide more details to justify this:
I believe this is a regression. Customers who are currently using Multus and the 'all' placement spec might hit this issue. Not all users will hit it; that depends on whether the spec allows the detection job to run on another node in the cluster that has the requisite host networks.

Steps to Reproduce:
Taint all nodes in the OpenShift cluster, and then set the toleration for that taint only in the "all" section of the CephCluster. For example, use this taint...

  kubectl taint nodes --all node-role.kubernetes.io/storage=true:NoSchedule

...and this placement spec on the CephCluster:

  placement:
    all:
      tolerations:
        - effect: NoSchedule
          key: node-role.kubernetes.io/storage
          operator: Equal
          value: "true"

Actual results:
rook-ceph-network-*-canary jobs remain Pending with an error event like the one below:

  Warning  FailedScheduling  12s  default-scheduler  0/3 nodes are available: 3 node(s) had untolerated taint {node-role.kubernetes.io/storage: true}. preemption: 0/3 nodes are available: 3 Preemption is not helpful for scheduling.

Expected results:
rook-ceph-network-*-canary jobs should be schedulable with 'all' placement settings.
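For reference, a minimal sketch of the addressRanges workaround described above, shown as a fragment of the CephCluster spec. The CIDR values and NetworkAttachmentDefinition selectors are placeholders and must be adjusted to match the public/cluster networks actually attached to the affected cluster:

  network:
    provider: multus
    selectors:
      public: openshift-storage/public-net
      cluster: openshift-storage/cluster-net
    addressRanges:
      public:
        - "192.168.20.0/24"
      cluster:
        - "192.168.21.0/24"

With addressRanges set explicitly, Rook skips network address autodetection, so the placement of the rook-ceph-network-*-canary jobs no longer matters.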
I'd like to offer an alternative, better workaround for 4.14.0: The StorageCluster has various placement options for different components, and the defaults are safe; if users do not modify the StorageCluster placement configs, nothing needs to be done. If the customer is using Multus and custom placement options are specified in the StorageCluster, then the following workaround applies: any placement configs in the 'all' section should be duplicated in the 'osd' section to prevent issues with Multus network detection, as sketched below.
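An illustrative sketch of that workaround, reusing the toleration from the reproduction steps and assuming the StorageCluster honors an 'osd' entry in spec.placement as the workaround above implies; names and values other than those already mentioned in this bug are placeholders:

  apiVersion: ocs.openshift.io/v1
  kind: StorageCluster
  metadata:
    name: ocs-storagecluster
    namespace: openshift-storage
  spec:
    placement:
      all:
        tolerations:
          - effect: NoSchedule
            key: node-role.kubernetes.io/storage
            operator: Equal
            value: "true"
      osd:
        tolerations:
          - effect: NoSchedule
            key: node-role.kubernetes.io/storage
            operator: Equal
            value: "true"

Mirroring the 'all' tolerations under 'osd' lets the network detection jobs, which only inherit 'osd' placement, schedule onto the tainted nodes.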
This is now fixed in the upstream code that will become 4.15. Moving to MODIFIED.
I have manually verified that this issue has been fixed. I've been working on automation in ocs-ci to ensure this doesn't regress, but we have been hitting issues with our deployment environments. I wanted to wait until the automation was in place before closing this, but in the interest of time I am marking this as verified and will open a ticket on our own board to implement the automation changes.
Please provide the Doc text.
Done. Sorry for the delay.
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (Important: Red Hat OpenShift Data Foundation 4.15.0 security, enhancement, & bug fix update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHSA-2024:1383