Description of problem:

After installing the performance-addon-operator CSV in an OCP cluster, a PerformanceProfile was applied. After waiting some time, several nodes remained in SchedulingDisabled status, and the currently applied MachineConfig did not transition to the new, desired MachineConfig. Moreover, some ovnkube-node pods were failing because the br-ex link was not found. We use the Ansible community.kubernetes.k8s module to automate the creation of the resources, and running it twice or more causes this behavior.

This only happens with OCP 4.9.38; with the previous version, OCP 4.9.37, we do not experience this issue. In fact, the issue is similar to the one reported in BZ 2077900 (which duplicates 2078866), but those BZs occur on OCP 4.11, while this one occurs on OCP 4.9.38.

Version-Release number of selected component (if applicable):

OCP 4.9.38

How reproducible:

100% so far with the tests we've done in our labs.

Steps to Reproduce:

1. Deploy OCP 4.9.38 on a cluster composed of 3 master nodes and 4 worker nodes, using an IPI installation and the Ansible playbooks from baremetal-deployment.

2. Install the performance-addon-operator.

3. Create the following PerformanceProfile resource:

---
kind: PerformanceProfile
apiVersion: "performance.openshift.io/v2"
metadata:
  name: cnf-basic-profile
  namespace: openshift-performance-addon-operator
spec:
  additionalKernelArgs:
    - "nmi_watchdog=0"
    - "audit=0"
    - "mce=off"
    - "processor.max_cstate=1"
    - "idle=poll"
    - "intel_idle.max_cstate=0"
  cpu:
    isolated: "2-19,22-39,42-59,62-79"
    reserved: "0,1,40,41,20,21,60,61"
  hugepages:
    pages:
      - size: "1G"
        count: 32
        node: 0
      - size: "1G"
        count: 32
        node: 1
      - size: "2M"
        count: 12000
        node: 0
      - size: "2M"
        count: 12000
        node: 1
  numa:
    topologyPolicy: "single-numa-node"
  nodeSelector:
    node-role.kubernetes.io/worker: ""
  machineConfigPoolSelector:
    pools.operator.machineconfiguration.openshift.io/worker: ""
...

4.
Wait some time to check whether the PerformanceProfile is applied correctly.

Actual results:

This report is based on the following OCP installation done with Distributed CI (DCI): https://www.distributed-ci.io/jobs/2f2dd76f-21f9-4a5a-9f77-c188f03b591c/jobStates

After deploying OCP, the performance-addon-operator, and the PerformanceProfile shown above, and waiting some time, checking the MCP status shows that the MCPs created are not in Ready status. Looking at the node status, some nodes are in SchedulingDisabled status, and some even in NotReady status:

NAME       STATUS                        ROLES    AGE    VERSION           INTERNAL-IP     EXTERNAL-IP   OS-IMAGE                                                       KERNEL-VERSION                 CONTAINER-RUNTIME
master-0   Ready                         master   108m   v1.22.8+f34b40c   192.168.12.11   <none>        Red Hat Enterprise Linux CoreOS 49.84.202206082248-0 (Ootpa)   4.18.0-305.49.1.el8_4.x86_64   cri-o://1.22.5-3.rhaos4.9.gitb6d3a87.el8
master-1   Ready                         master   108m   v1.22.8+f34b40c   192.168.12.12   <none>        Red Hat Enterprise Linux CoreOS 49.84.202206082248-0 (Ootpa)   4.18.0-305.49.1.el8_4.x86_64   cri-o://1.22.5-3.rhaos4.9.gitb6d3a87.el8
master-2   Ready                         master   108m   v1.22.8+f34b40c   192.168.12.13   <none>        Red Hat Enterprise Linux CoreOS 49.84.202206082248-0 (Ootpa)   4.18.0-305.49.1.el8_4.x86_64   cri-o://1.22.5-3.rhaos4.9.gitb6d3a87.el8
worker-0   NotReady,SchedulingDisabled   worker   71m    v1.22.8+f34b40c   192.168.12.20   <none>        Red Hat Enterprise Linux CoreOS 49.84.202206082248-0 (Ootpa)   4.18.0-305.49.1.el8_4.x86_64   cri-o://1.22.5-3.rhaos4.9.gitb6d3a87.el8
worker-1   Ready                         worker   70m    v1.22.8+f34b40c   192.168.12.21   <none>        Red Hat Enterprise Linux CoreOS 49.84.202206082248-0 (Ootpa)   4.18.0-305.49.1.el8_4.x86_64   cri-o://1.22.5-3.rhaos4.9.gitb6d3a87.el8
worker-2   Ready,SchedulingDisabled      worker   72m    v1.22.8+f34b40c   192.168.12.22   <none>        Red Hat Enterprise Linux CoreOS 49.84.202206082248-0 (Ootpa)   4.18.0-305.49.1.el8_4.x86_64   cri-o://1.22.5-3.rhaos4.9.gitb6d3a87.el8
worker-3   Ready,SchedulingDisabled      worker   73m    v1.22.8+f34b40c   192.168.12.23   <none>        Red Hat Enterprise Linux CoreOS 49.84.202206082248-0 (Ootpa)   4.18.0-305.49.1.el8_4.x86_64   cri-o://1.22.5-3.rhaos4.9.gitb6d3a87.el8

When this happens, we try to uncordon the nodes in SchedulingDisabled status in order to move them back to Ready status, and it usually works; however, we only try this if the ovnkube pods are Ready and in a correct state, because otherwise it would not work. In this case, some ovnkube-node pods were not working fine:

ovnkube-node-kc649   3/4   CrashLoopBackOff   20 (2m15s ago)   72m
ovnkube-node-pcbt4   3/4   CrashLoopBackOff   20 (2m23s ago)   71m

Also, by checking the events of the pods created in the system, we can see many pods failing with the message already described in BZ 2077900 for OCP 4.11: "error adding pod XXX to CNI network "multus-cni-network" (...) /var/run/multus/cni/net.d/10-ovn-kubernetes.conf. pollimmediate error: timed out waiting for the condition". These are some examples:

NAMESPACE   LAST SEEN   TYPE   REASON   OBJECT   MESSAGE
openshift-network-diagnostics   53m   Warning   FailedCreatePodSandBox   pod/network-check-target-wl4x6   Failed to create pod sandbox: rpc error: code = Unknown desc = failed to create pod network sandbox k8s_network-check-target-wl4x6_openshift-network-diagnostics_e3530250-1e62-4ddc-b46d-13c75f59982b_0(608a03d4358a98df9f66d5b9d1815f234de76295858c7a8d3f7019a24539845f): error adding pod openshift-network-diagnostics_network-check-target-wl4x6 to CNI network "multus-cni-network": Multus: [openshift-network-diagnostics/network-check-target-wl4x6/e3530250-1e62-4ddc-b46d-13c75f59982b]: have you checked that your default network is ready? still waiting for readinessindicatorfile @ /var/run/multus/cni/net.d/10-ovn-kubernetes.conf.
pollimmediate error: timed out waiting for the condition
openshift-dns   53m   Warning   FailedCreatePodSandBox   pod/dns-default-4nfnd   Failed to create pod sandbox: rpc error: code = Unknown desc = failed to create pod network sandbox k8s_dns-default-4nfnd_openshift-dns_bb18e302-9658-4723-a3d8-e57b55a6ac56_0(9ebb3a7b84bf341567ae3536cc8d98c2efc0b0ea0d82ddf4bbe86b802a58173f): error adding pod openshift-dns_dns-default-4nfnd to CNI network "multus-cni-network": Multus: [openshift-dns/dns-default-4nfnd/bb18e302-9658-4723-a3d8-e57b55a6ac56]: have you checked that your default network is ready? still waiting for readinessindicatorfile @ /var/run/multus/cni/net.d/10-ovn-kubernetes.conf. pollimmediate error: timed out waiting for the condition
openshift-multus   53m   Warning   FailedCreatePodSandBox   pod/network-metrics-daemon-bj478   Failed to create pod sandbox: rpc error: code = Unknown desc = failed to create pod network sandbox k8s_network-metrics-daemon-bj478_openshift-multus_09df9c99-451b-4e41-940e-011dc8cb3974_0(76392a2caf518e485d171ef066b637e863a2f457e1b89405e34cc1e9c4adac76): error adding pod openshift-multus_network-metrics-daemon-bj478 to CNI network "multus-cni-network": Multus: [openshift-multus/network-metrics-daemon-bj478/09df9c99-451b-4e41-940e-011dc8cb3974]: have you checked that your default network is ready? still waiting for readinessindicatorfile @ /var/run/multus/cni/net.d/10-ovn-kubernetes.conf. pollimmediate error: timed out waiting for the condition
openshift-ingress-canary   53m   Warning   FailedCreatePodSandBox   pod/ingress-canary-knhzk   Failed to create pod sandbox: rpc error: code = Unknown desc = failed to create pod network sandbox k8s_ingress-canary-knhzk_openshift-ingress-canary_ab6917e8-8956-41fc-bb8d-7a7e77f1da47_0(76b4e49b301fc9bacff876c932eae8818356495f32e649435c6714441b6a3a5a): error adding pod openshift-ingress-canary_ingress-canary-knhzk to CNI network "multus-cni-network": Multus: [openshift-ingress-canary/ingress-canary-knhzk/ab6917e8-8956-41fc-bb8d-7a7e77f1da47]: have you checked that your default network is ready? still waiting for readinessindicatorfile @ /var/run/multus/cni/net.d/10-ovn-kubernetes.conf. pollimmediate error: timed out waiting for the condition
openshift-network-diagnostics   53m   Warning   FailedCreatePodSandBox   pod/network-check-target-rz2j4   Failed to create pod sandbox: rpc error: code = Unknown desc = failed to create pod network sandbox k8s_network-check-target-rz2j4_openshift-network-diagnostics_41553745-d6ae-4132-bf6d-b6c70f6bd528_0(17540bf4ea1c259eaa72d1ce157fb5b5423aca344380485a6c66679aa19e5ffb): error adding pod openshift-network-diagnostics_network-check-target-rz2j4 to CNI network "multus-cni-network": Multus: [openshift-network-diagnostics/network-check-target-rz2j4/41553745-d6ae-4132-bf6d-b6c70f6bd528]: have you checked that your default network is ready? still waiting for readinessindicatorfile @ /var/run/multus/cni/net.d/10-ovn-kubernetes.conf. pollimmediate error: timed out waiting for the condition
openshift-ingress-canary   53m   Warning   FailedCreatePodSandBox   pod/ingress-canary-574w7   Failed to create pod sandbox: rpc error: code = Unknown desc = failed to create pod network sandbox k8s_ingress-canary-574w7_openshift-ingress-canary_eb6fc672-66f0-43df-8884-a73012863687_0(d00c5b25a63119ba31d3255cf348918d402773f964376844ea03c552747d28b7): error adding pod openshift-ingress-canary_ingress-canary-574w7 to CNI network "multus-cni-network": Multus: [openshift-ingress-canary/ingress-canary-574w7/eb6fc672-66f0-43df-8884-a73012863687]: have you checked that your default network is ready? still waiting for readinessindicatorfile @ /var/run/multus/cni/net.d/10-ovn-kubernetes.conf. pollimmediate error: timed out waiting for the condition
openshift-dns   53m   Warning   FailedCreatePodSandBox   pod/dns-default-ghfxm   Failed to create pod sandbox: rpc error: code = Unknown desc = failed to create pod network sandbox k8s_dns-default-ghfxm_openshift-dns_c835c4c8-430f-492c-a485-4f5e953d1d57_0(a2bf07778b234fcacf3ef36908d7c7135ef59331a136516cda5a411a71ac4716): error adding pod openshift-dns_dns-default-ghfxm to CNI network "multus-cni-network": Multus: [openshift-dns/dns-default-ghfxm/c835c4c8-430f-492c-a485-4f5e953d1d57]: have you checked that your default network is ready? still waiting for readinessindicatorfile @ /var/run/multus/cni/net.d/10-ovn-kubernetes.conf. pollimmediate error: timed out waiting for the condition
openshift-multus   52m   Warning   FailedCreatePodSandBox   pod/network-metrics-daemon-4lkwt   Failed to create pod sandbox: rpc error: code = Unknown desc = failed to create pod network sandbox k8s_network-metrics-daemon-4lkwt_openshift-multus_74f2048f-6701-4567-a4b9-5a54d4215968_0(52d20c06c0a1a4f0838f1cca75dbf82aa5e8b34d583dd7c66c2c465f9c0a64f5): error adding pod openshift-multus_network-metrics-daemon-4lkwt to CNI network "multus-cni-network": Multus: [openshift-multus/network-metrics-daemon-4lkwt/74f2048f-6701-4567-a4b9-5a54d4215968]: have you checked that your default network is ready? still waiting for readinessindicatorfile @ /var/run/multus/cni/net.d/10-ovn-kubernetes.conf. pollimmediate error: timed out waiting for the condition
openshift-dns   52m   Warning   FailedCreatePodSandBox   pod/dns-default-4nfnd   Failed to create pod sandbox: rpc error: code = Unknown desc = failed to create pod network sandbox k8s_dns-default-4nfnd_openshift-dns_bb18e302-9658-4723-a3d8-e57b55a6ac56_0(9c1640162aece94a0bd95ed688cdd288147bc6b6933132b60ac9d3c5c5bc81e0): error adding pod openshift-dns_dns-default-4nfnd to CNI network "multus-cni-network": Multus: [openshift-dns/dns-default-4nfnd/bb18e302-9658-4723-a3d8-e57b55a6ac56]: have you checked that your default network is ready? still waiting for readinessindicatorfile @ /var/run/multus/cni/net.d/10-ovn-kubernetes.conf. pollimmediate error: timed out waiting for the condition
openshift-ingress-canary   52m   Warning   FailedCreatePodSandBox   pod/ingress-canary-knhzk   Failed to create pod sandbox: rpc error: code = Unknown desc = failed to create pod network sandbox k8s_ingress-canary-knhzk_openshift-ingress-canary_ab6917e8-8956-41fc-bb8d-7a7e77f1da47_0(47a613d8221eff780d29e132546cfb2dbcdf80a315e3b631b67f15b327defde7): error adding pod openshift-ingress-canary_ingress-canary-knhzk to CNI network "multus-cni-network": Multus: [openshift-ingress-canary/ingress-canary-knhzk/ab6917e8-8956-41fc-bb8d-7a7e77f1da47]: have you checked that your default network is ready? still waiting for readinessindicatorfile @ /var/run/multus/cni/net.d/10-ovn-kubernetes.conf. pollimmediate error: timed out waiting for the condition
openshift-network-diagnostics   52m   Warning   FailedCreatePodSandBox   pod/network-check-target-wl4x6   Failed to create pod sandbox: rpc error: code = Unknown desc = failed to create pod network sandbox k8s_network-check-target-wl4x6_openshift-network-diagnostics_e3530250-1e62-4ddc-b46d-13c75f59982b_0(1fb62bfcaac24a0f60ea654fcd688c28df814e9c328adb536c72048e5c3c26e5): error adding pod openshift-network-diagnostics_network-check-target-wl4x6 to CNI network "multus-cni-network": Multus: [openshift-network-diagnostics/network-check-target-wl4x6/e3530250-1e62-4ddc-b46d-13c75f59982b]: have you checked that your default network is ready? still waiting for readinessindicatorfile @ /var/run/multus/cni/net.d/10-ovn-kubernetes.conf. pollimmediate error: timed out waiting for the condition
openshift-multus   52m   Warning   FailedCreatePodSandBox   pod/network-metrics-daemon-bj478   Failed to create pod sandbox: rpc error: code = Unknown desc = failed to create pod network sandbox k8s_network-metrics-daemon-bj478_openshift-multus_09df9c99-451b-4e41-940e-011dc8cb3974_0(4f5861770b1dfe9e3290786e8c05849518f96698d321494e7a225ad2c7b00878): error adding pod openshift-multus_network-metrics-daemon-bj478 to CNI network "multus-cni-network": Multus: [openshift-multus/network-metrics-daemon-bj478/09df9c99-451b-4e41-940e-011dc8cb3974]: have you checked that your default network is ready? still waiting for readinessindicatorfile @ /var/run/multus/cni/net.d/10-ovn-kubernetes.conf. pollimmediate error: timed out waiting for the condition
openshift-network-diagnostics   51m   Warning   FailedCreatePodSandBox   pod/network-check-target-rz2j4   Failed to create pod sandbox: rpc error: code = Unknown desc = failed to create pod network sandbox k8s_network-check-target-rz2j4_openshift-network-diagnostics_41553745-d6ae-4132-bf6d-b6c70f6bd528_0(3a0c06c4f8ac3bb9b36da49d37239f0e79fe487d08a72bea09e8a33e2be36d40): error adding pod openshift-network-diagnostics_network-check-target-rz2j4 to CNI network "multus-cni-network": Multus: [openshift-network-diagnostics/network-check-target-rz2j4/41553745-d6ae-4132-bf6d-b6c70f6bd528]: have you checked that your default network is ready? still waiting for readinessindicatorfile @ /var/run/multus/cni/net.d/10-ovn-kubernetes.conf. pollimmediate error: timed out waiting for the condition
openshift-ingress-canary   51m   Warning   FailedCreatePodSandBox   pod/ingress-canary-574w7   Failed to create pod sandbox: rpc error: code = Unknown desc = failed to create pod network sandbox k8s_ingress-canary-574w7_openshift-ingress-canary_eb6fc672-66f0-43df-8884-a73012863687_0(2603ca2981c436e1e0453345f4ae77010018d1130542bb9a1d031add59d3abdb): error adding pod openshift-ingress-canary_ingress-canary-574w7 to CNI network "multus-cni-network": Multus: [openshift-ingress-canary/ingress-canary-574w7/eb6fc672-66f0-43df-8884-a73012863687]: have you checked that your default network is ready? still waiting for readinessindicatorfile @ /var/run/multus/cni/net.d/10-ovn-kubernetes.conf. pollimmediate error: timed out waiting for the condition
openshift-dns   51m   Warning   FailedCreatePodSandBox   pod/dns-default-ghfxm   Failed to create pod sandbox: rpc error: code = Unknown desc = failed to create pod network sandbox k8s_dns-default-ghfxm_openshift-dns_c835c4c8-430f-492c-a485-4f5e953d1d57_0(8a1ae6725c6d528aa8ac7b9f058f104f3d40fd66ecc755fe271d6a71e8f706f4): error adding pod openshift-dns_dns-default-ghfxm to CNI network "multus-cni-network": Multus: [openshift-dns/dns-default-ghfxm/c835c4c8-430f-492c-a485-4f5e953d1d57]: have you checked that your default network is ready? still waiting for readinessindicatorfile @ /var/run/multus/cni/net.d/10-ovn-kubernetes.conf. pollimmediate error: timed out waiting for the condition

By inspecting the logs of the failing ovnkube pods, we can see that they complain about the br-ex link not being found.
For example, with ovnkube-node-kc649:

2022-06-15T02:57:51.566341230Z I0615 02:57:51.566298 189133 gateway_localnet.go:173] Node local addresses initialized to: map[10.131.0.2:{10.131.0.0 fffffe00} 127.0.0.1:{127.0.0.0 ff000000} 16.1.15.2:{16.1.15.0 fffffffc} 172.22.2.196:{172.22.0.0 fffff800} 192.168.12.23:{192.168.12.0 ffffff00} 192.168.13.20:{192.168.13.0 ffffff00} ::1:{::1 ffffffffffffffffffffffffffffffff} fe80::440c:84ff:fe02:80fc:{fe80:: ffffffffffffffff0000000000000000} fe80::4cdd:55ff:fe87:975f:{fe80:: ffffffffffffffff0000000000000000} fe80::cc8d:ddff:fe76:ed26:{fe80:: ffffffffffffffff0000000000000000} fe80::f603:43ff:fecc:c0b0:{fe80:: ffffffffffffffff0000000000000000}]
2022-06-15T02:57:51.566518040Z I0615 02:57:51.566508 189133 helper_linux.go:73] Found default gateway interface bond0 192.168.12.1
2022-06-15T02:57:51.566551517Z F0615 02:57:51.566544 189133 ovnkube.go:130] could not find IP addresses: failed to lookup link br-ex: Link not found

Expected results:

The PerformanceProfile should be applied correctly, with a correct transition to the new MachineConfig on all nodes, all nodes in Ready status, and the ovnkube pods working fine. Again, this was working on OCP 4.9.37, but not on OCP 4.9.38.

Additional info:

In the DCI job reported in this BZ, you have access to the must-gather of the installation at the following link: https://www.distributed-ci.io/jobs/2f2dd76f-21f9-4a5a-9f77-c188f03b591c/files -> look for the file named must_gather.tar.gz and download it.
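For reference, the resources above are created with the Ansible community.kubernetes.k8s module mentioned in the description; a minimal sketch of such a task follows (the task name and file layout are assumptions for illustration, not the exact playbooks from baremetal-deployment):

```yaml
# Hedged sketch: applies the PerformanceProfile shown in the reproduction
# steps. The file name and task wording are illustrative only; the real
# tasks live in the baremetal-deployment playbooks.
- name: Apply the cnf-basic-profile PerformanceProfile
  community.kubernetes.k8s:
    state: present
    definition: "{{ lookup('file', 'cnf-basic-profile.yml') | from_yaml }}"
```

Running such a task a second time simply re-applies the same definition; on OCP 4.9.38, this repeated (idempotent) run is what appears to trigger the behavior described above.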
We introduced this issue with the first PR attached to https://bugzilla.redhat.com/show_bug.cgi?id=2089763. A solution is already attached to that BZ as well and is being considered in its verification. Linking the solving PR to this BZ for reference and marking this as a duplicate.

*** This bug has been marked as a duplicate of bug 2089763 ***