Bug 2097315 - OCP 4.9.38 - Some ovnkube-node pods are not in a running state after applying a PerformanceProfile - br-ex link not found
Keywords:
Status: CLOSED DUPLICATE of bug 2089763
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Networking
Version: 4.9
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: unspecified
Target Milestone: ---
Assignee: Jaime Caamaño Ruiz
QA Contact: Anurag saxena
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2022-06-15 12:32 UTC by Ramon Perez
Modified: 2022-06-15 14:59 UTC
CC: 6 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2022-06-15 14:59:02 UTC
Target Upstream Version:
Embargoed:




Links
System ID Private Priority Status Summary Last Updated
Github openshift machine-config-operator pull 3183 0 None Merged [release-4.9] Bug 2089763: configure-ovs: persist profiles after auto-connect has been set 2022-06-15 14:54:31 UTC

Description Ramon Perez 2022-06-15 12:32:14 UTC
Description of problem:

After installing the performance-addon-operator CSV in an OCP cluster, a PerformanceProfile was applied. After some time, several nodes remained in SchedulingDisabled status, and the currently applied MachineConfig did not transition to the new, desired MachineConfig. Moreover, some ovnkube-node pods were failing because the br-ex link could not be found.

We use the Ansible community.kubernetes.k8s module to automate the creation of these resources, and running the automation twice or more triggers this behavior.
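
For illustration, running the automation twice amounts to re-applying the same manifests, roughly equivalent to the following (the filename is hypothetical; in our case the resources are created through the Ansible module):

# First run creates the resources; a second, idempotent re-apply of the
# same manifest is enough to trigger the behavior described below
oc apply -f performance-profile.yml
oc apply -f performance-profile.yml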

This only happens with OCP 4.9.38; with the previous version, OCP 4.9.37, we do not experience this issue.

In fact, the issue we are seeing is similar to the one reported in BZ 2077900 (which duplicates BZ 2078866), but those BZs occur on OCP 4.11, while this one occurs on OCP 4.9.38.

Version-Release number of selected component (if applicable):

OCP 4.9.38

How reproducible:

100% so far with the tests we've done in our labs.

Steps to Reproduce:

1. Deploy OCP 4.9.38 in a cluster composed of 3 master nodes and 4 worker nodes, using IPI installation and the Ansible playbooks from baremetal-deployment.

2. Install the performance-addon-operator.

3. Create the following PerformanceProfile resource:

---
kind: PerformanceProfile
apiVersion: "performance.openshift.io/v2"
metadata:
  name: cnf-basic-profile
  namespace: openshift-performance-addon-operator
spec:
  additionalKernelArgs:
    - "nmi_watchdog=0"
    - "audit=0"
    - "mce=off"
    - "processor.max_cstate=1"
    - "idle=poll"
    - "intel_idle.max_cstate=0"
  cpu:
    isolated: "2-19,22-39,42-59,62-79"
    reserved: "0,1,40,41,20,21,60,61"
  hugepages:
    pages:
      - size: "1G"
        count: 32
        node: 0
      - size: "1G"
        count: 32
        node: 1
      - size: "2M"
        count: 12000
        node: 0
      - size: "2M"
        count: 12000
        node: 1
  numa:
    topologyPolicy: "single-numa-node"
  nodeSelector:
    node-role.kubernetes.io/worker: ""
  machineConfigPoolSelector:
    pools.operator.machineconfiguration.openshift.io/worker: ""

...

4. Wait some time and check whether the PerformanceProfile is applied correctly.
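
For reference, steps 2-4 can also be driven directly from the CLI. A minimal sketch (the manifest filename is hypothetical; we actually drive this through Ansible):

# Apply the PerformanceProfile manifest from step 3
oc apply -f performance-profile.yml
# Watch the worker MachineConfigPool roll out the generated MachineConfig
oc get mcp worker -w
# Check that the nodes pick up the new MachineConfig and stay Ready
oc get nodes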

Actual results:

This report is based on the following OCP installation done with Distributed CI (DCI): https://www.distributed-ci.io/jobs/2f2dd76f-21f9-4a5a-9f77-c188f03b591c/jobStates.

After deploying OCP, the performance-addon-operator, and the PerformanceProfile shown above, and waiting some time, checking the MCP status shows that the MCPs that were created are not in Ready status. Looking at the node status, some nodes are in SchedulingDisabled status, and one is even NotReady:

NAME       STATUS                        ROLES    AGE    VERSION           INTERNAL-IP     EXTERNAL-IP   OS-IMAGE                                                       KERNEL-VERSION                 CONTAINER-RUNTIME
master-0   Ready                         master   108m   v1.22.8+f34b40c   192.168.12.11   <none>        Red Hat Enterprise Linux CoreOS 49.84.202206082248-0 (Ootpa)   4.18.0-305.49.1.el8_4.x86_64   cri-o://1.22.5-3.rhaos4.9.gitb6d3a87.el8
master-1   Ready                         master   108m   v1.22.8+f34b40c   192.168.12.12   <none>        Red Hat Enterprise Linux CoreOS 49.84.202206082248-0 (Ootpa)   4.18.0-305.49.1.el8_4.x86_64   cri-o://1.22.5-3.rhaos4.9.gitb6d3a87.el8
master-2   Ready                         master   108m   v1.22.8+f34b40c   192.168.12.13   <none>        Red Hat Enterprise Linux CoreOS 49.84.202206082248-0 (Ootpa)   4.18.0-305.49.1.el8_4.x86_64   cri-o://1.22.5-3.rhaos4.9.gitb6d3a87.el8
worker-0   NotReady,SchedulingDisabled   worker   71m    v1.22.8+f34b40c   192.168.12.20   <none>        Red Hat Enterprise Linux CoreOS 49.84.202206082248-0 (Ootpa)   4.18.0-305.49.1.el8_4.x86_64   cri-o://1.22.5-3.rhaos4.9.gitb6d3a87.el8
worker-1   Ready                         worker   70m    v1.22.8+f34b40c   192.168.12.21   <none>        Red Hat Enterprise Linux CoreOS 49.84.202206082248-0 (Ootpa)   4.18.0-305.49.1.el8_4.x86_64   cri-o://1.22.5-3.rhaos4.9.gitb6d3a87.el8
worker-2   Ready,SchedulingDisabled      worker   72m    v1.22.8+f34b40c   192.168.12.22   <none>        Red Hat Enterprise Linux CoreOS 49.84.202206082248-0 (Ootpa)   4.18.0-305.49.1.el8_4.x86_64   cri-o://1.22.5-3.rhaos4.9.gitb6d3a87.el8
worker-3   Ready,SchedulingDisabled      worker   73m    v1.22.8+f34b40c   192.168.12.23   <none>        Red Hat Enterprise Linux CoreOS 49.84.202206082248-0 (Ootpa)   4.18.0-305.49.1.el8_4.x86_64   cri-o://1.22.5-3.rhaos4.9.gitb6d3a87.el8
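
For reference, the status above was gathered with the standard commands:

oc get mcp
oc get nodes -o wide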

When this happens, we try to uncordon the nodes in SchedulingDisabled status to bring them back to Ready, and this usually works. However, we only attempt it when the ovnkube pods are Ready and otherwise healthy, because uncordoning would not help in that case. In this instance, some ovnkube-node pods were not working correctly:

ovnkube-node-kc649     3/4   CrashLoopBackOff   20 (2m15s ago)   72m
ovnkube-node-pcbt4     3/4   CrashLoopBackOff   20 (2m23s ago)   71m
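
The uncordon attempt mentioned above is simply (node names taken from the listing earlier):

# Only attempted once the ovnkube pods on the node are healthy
oc adm uncordon worker-0
oc adm uncordon worker-2
oc adm uncordon worker-3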

In addition, checking the events of the pods created in the system, we can see many pods failing with the message already described in BZ 2077900 for OCP 4.11: "error adding pod XXX to CNI network "multus-cni-network" (...) /var/run/multus/cni/net.d/10-ovn-kubernetes.conf. pollimmediate error: timed out waiting for the condition". Some examples:

NAMESPACE                                          LAST SEEN   TYPE      REASON                                        OBJECT                                                                MESSAGE
openshift-network-diagnostics                      53m         Warning   FailedCreatePodSandBox                        pod/network-check-target-wl4x6                                        Failed to create pod sandbox: rpc error: code = Unknown desc = failed to create pod network sandbox k8s_network-check-target-wl4x6_openshift-network-diagnostics_e3530250-1e62-4ddc-b46d-13c75f59982b_0(608a03d4358a98df9f66d5b9d1815f234de76295858c7a8d3f7019a24539845f): error adding pod openshift-network-diagnostics_network-check-target-wl4x6 to CNI network "multus-cni-network": Multus: [openshift-network-diagnostics/network-check-target-wl4x6/e3530250-1e62-4ddc-b46d-13c75f59982b]: have you checked that your default network is ready? still waiting for readinessindicatorfile @ /var/run/multus/cni/net.d/10-ovn-kubernetes.conf. pollimmediate error: timed out waiting for the condition
openshift-dns                                      53m         Warning   FailedCreatePodSandBox                        pod/dns-default-4nfnd                                                 Failed to create pod sandbox: rpc error: code = Unknown desc = failed to create pod network sandbox k8s_dns-default-4nfnd_openshift-dns_bb18e302-9658-4723-a3d8-e57b55a6ac56_0(9ebb3a7b84bf341567ae3536cc8d98c2efc0b0ea0d82ddf4bbe86b802a58173f): error adding pod openshift-dns_dns-default-4nfnd to CNI network "multus-cni-network": Multus: [openshift-dns/dns-default-4nfnd/bb18e302-9658-4723-a3d8-e57b55a6ac56]: have you checked that your default network is ready? still waiting for readinessindicatorfile @ /var/run/multus/cni/net.d/10-ovn-kubernetes.conf. pollimmediate error: timed out waiting for the condition
openshift-multus                                   53m         Warning   FailedCreatePodSandBox                        pod/network-metrics-daemon-bj478                                      Failed to create pod sandbox: rpc error: code = Unknown desc = failed to create pod network sandbox k8s_network-metrics-daemon-bj478_openshift-multus_09df9c99-451b-4e41-940e-011dc8cb3974_0(76392a2caf518e485d171ef066b637e863a2f457e1b89405e34cc1e9c4adac76): error adding pod openshift-multus_network-metrics-daemon-bj478 to CNI network "multus-cni-network": Multus: [openshift-multus/network-metrics-daemon-bj478/09df9c99-451b-4e41-940e-011dc8cb3974]: have you checked that your default network is ready? still waiting for readinessindicatorfile @ /var/run/multus/cni/net.d/10-ovn-kubernetes.conf. pollimmediate error: timed out waiting for the condition
openshift-ingress-canary                           53m         Warning   FailedCreatePodSandBox                        pod/ingress-canary-knhzk                                              Failed to create pod sandbox: rpc error: code = Unknown desc = failed to create pod network sandbox k8s_ingress-canary-knhzk_openshift-ingress-canary_ab6917e8-8956-41fc-bb8d-7a7e77f1da47_0(76b4e49b301fc9bacff876c932eae8818356495f32e649435c6714441b6a3a5a): error adding pod openshift-ingress-canary_ingress-canary-knhzk to CNI network "multus-cni-network": Multus: [openshift-ingress-canary/ingress-canary-knhzk/ab6917e8-8956-41fc-bb8d-7a7e77f1da47]: have you checked that your default network is ready? still waiting for readinessindicatorfile @ /var/run/multus/cni/net.d/10-ovn-kubernetes.conf. pollimmediate error: timed out waiting for the condition
openshift-network-diagnostics                      53m         Warning   FailedCreatePodSandBox                        pod/network-check-target-rz2j4                                        Failed to create pod sandbox: rpc error: code = Unknown desc = failed to create pod network sandbox k8s_network-check-target-rz2j4_openshift-network-diagnostics_41553745-d6ae-4132-bf6d-b6c70f6bd528_0(17540bf4ea1c259eaa72d1ce157fb5b5423aca344380485a6c66679aa19e5ffb): error adding pod openshift-network-diagnostics_network-check-target-rz2j4 to CNI network "multus-cni-network": Multus: [openshift-network-diagnostics/network-check-target-rz2j4/41553745-d6ae-4132-bf6d-b6c70f6bd528]: have you checked that your default network is ready? still waiting for readinessindicatorfile @ /var/run/multus/cni/net.d/10-ovn-kubernetes.conf. pollimmediate error: timed out waiting for the condition
openshift-ingress-canary                           53m         Warning   FailedCreatePodSandBox                        pod/ingress-canary-574w7                                              Failed to create pod sandbox: rpc error: code = Unknown desc = failed to create pod network sandbox k8s_ingress-canary-574w7_openshift-ingress-canary_eb6fc672-66f0-43df-8884-a73012863687_0(d00c5b25a63119ba31d3255cf348918d402773f964376844ea03c552747d28b7): error adding pod openshift-ingress-canary_ingress-canary-574w7 to CNI network "multus-cni-network": Multus: [openshift-ingress-canary/ingress-canary-574w7/eb6fc672-66f0-43df-8884-a73012863687]: have you checked that your default network is ready? still waiting for readinessindicatorfile @ /var/run/multus/cni/net.d/10-ovn-kubernetes.conf. pollimmediate error: timed out waiting for the condition
openshift-dns                                      53m         Warning   FailedCreatePodSandBox                        pod/dns-default-ghfxm                                                 Failed to create pod sandbox: rpc error: code = Unknown desc = failed to create pod network sandbox k8s_dns-default-ghfxm_openshift-dns_c835c4c8-430f-492c-a485-4f5e953d1d57_0(a2bf07778b234fcacf3ef36908d7c7135ef59331a136516cda5a411a71ac4716): error adding pod openshift-dns_dns-default-ghfxm to CNI network "multus-cni-network": Multus: [openshift-dns/dns-default-ghfxm/c835c4c8-430f-492c-a485-4f5e953d1d57]: have you checked that your default network is ready? still waiting for readinessindicatorfile @ /var/run/multus/cni/net.d/10-ovn-kubernetes.conf. pollimmediate error: timed out waiting for the condition
openshift-multus                                   52m         Warning   FailedCreatePodSandBox                        pod/network-metrics-daemon-4lkwt                                      Failed to create pod sandbox: rpc error: code = Unknown desc = failed to create pod network sandbox k8s_network-metrics-daemon-4lkwt_openshift-multus_74f2048f-6701-4567-a4b9-5a54d4215968_0(52d20c06c0a1a4f0838f1cca75dbf82aa5e8b34d583dd7c66c2c465f9c0a64f5): error adding pod openshift-multus_network-metrics-daemon-4lkwt to CNI network "multus-cni-network": Multus: [openshift-multus/network-metrics-daemon-4lkwt/74f2048f-6701-4567-a4b9-5a54d4215968]: have you checked that your default network is ready? still waiting for readinessindicatorfile @ /var/run/multus/cni/net.d/10-ovn-kubernetes.conf. pollimmediate error: timed out waiting for the condition
openshift-dns                                      52m         Warning   FailedCreatePodSandBox                        pod/dns-default-4nfnd                                                 Failed to create pod sandbox: rpc error: code = Unknown desc = failed to create pod network sandbox k8s_dns-default-4nfnd_openshift-dns_bb18e302-9658-4723-a3d8-e57b55a6ac56_0(9c1640162aece94a0bd95ed688cdd288147bc6b6933132b60ac9d3c5c5bc81e0): error adding pod openshift-dns_dns-default-4nfnd to CNI network "multus-cni-network": Multus: [openshift-dns/dns-default-4nfnd/bb18e302-9658-4723-a3d8-e57b55a6ac56]: have you checked that your default network is ready? still waiting for readinessindicatorfile @ /var/run/multus/cni/net.d/10-ovn-kubernetes.conf. pollimmediate error: timed out waiting for the condition
openshift-ingress-canary                           52m         Warning   FailedCreatePodSandBox                        pod/ingress-canary-knhzk                                              Failed to create pod sandbox: rpc error: code = Unknown desc = failed to create pod network sandbox k8s_ingress-canary-knhzk_openshift-ingress-canary_ab6917e8-8956-41fc-bb8d-7a7e77f1da47_0(47a613d8221eff780d29e132546cfb2dbcdf80a315e3b631b67f15b327defde7): error adding pod openshift-ingress-canary_ingress-canary-knhzk to CNI network "multus-cni-network": Multus: [openshift-ingress-canary/ingress-canary-knhzk/ab6917e8-8956-41fc-bb8d-7a7e77f1da47]: have you checked that your default network is ready? still waiting for readinessindicatorfile @ /var/run/multus/cni/net.d/10-ovn-kubernetes.conf. pollimmediate error: timed out waiting for the condition
openshift-network-diagnostics                      52m         Warning   FailedCreatePodSandBox                        pod/network-check-target-wl4x6                                        Failed to create pod sandbox: rpc error: code = Unknown desc = failed to create pod network sandbox k8s_network-check-target-wl4x6_openshift-network-diagnostics_e3530250-1e62-4ddc-b46d-13c75f59982b_0(1fb62bfcaac24a0f60ea654fcd688c28df814e9c328adb536c72048e5c3c26e5): error adding pod openshift-network-diagnostics_network-check-target-wl4x6 to CNI network "multus-cni-network": Multus: [openshift-network-diagnostics/network-check-target-wl4x6/e3530250-1e62-4ddc-b46d-13c75f59982b]: have you checked that your default network is ready? still waiting for readinessindicatorfile @ /var/run/multus/cni/net.d/10-ovn-kubernetes.conf. pollimmediate error: timed out waiting for the condition
openshift-multus                                   52m         Warning   FailedCreatePodSandBox                        pod/network-metrics-daemon-bj478                                      Failed to create pod sandbox: rpc error: code = Unknown desc = failed to create pod network sandbox k8s_network-metrics-daemon-bj478_openshift-multus_09df9c99-451b-4e41-940e-011dc8cb3974_0(4f5861770b1dfe9e3290786e8c05849518f96698d321494e7a225ad2c7b00878): error adding pod openshift-multus_network-metrics-daemon-bj478 to CNI network "multus-cni-network": Multus: [openshift-multus/network-metrics-daemon-bj478/09df9c99-451b-4e41-940e-011dc8cb3974]: have you checked that your default network is ready? still waiting for readinessindicatorfile @ /var/run/multus/cni/net.d/10-ovn-kubernetes.conf. pollimmediate error: timed out waiting for the condition
openshift-network-diagnostics                      51m         Warning   FailedCreatePodSandBox                        pod/network-check-target-rz2j4                                        Failed to create pod sandbox: rpc error: code = Unknown desc = failed to create pod network sandbox k8s_network-check-target-rz2j4_openshift-network-diagnostics_41553745-d6ae-4132-bf6d-b6c70f6bd528_0(3a0c06c4f8ac3bb9b36da49d37239f0e79fe487d08a72bea09e8a33e2be36d40): error adding pod openshift-network-diagnostics_network-check-target-rz2j4 to CNI network "multus-cni-network": Multus: [openshift-network-diagnostics/network-check-target-rz2j4/41553745-d6ae-4132-bf6d-b6c70f6bd528]: have you checked that your default network is ready? still waiting for readinessindicatorfile @ /var/run/multus/cni/net.d/10-ovn-kubernetes.conf. pollimmediate error: timed out waiting for the condition
openshift-ingress-canary                           51m         Warning   FailedCreatePodSandBox                        pod/ingress-canary-574w7                                              Failed to create pod sandbox: rpc error: code = Unknown desc = failed to create pod network sandbox k8s_ingress-canary-574w7_openshift-ingress-canary_eb6fc672-66f0-43df-8884-a73012863687_0(2603ca2981c436e1e0453345f4ae77010018d1130542bb9a1d031add59d3abdb): error adding pod openshift-ingress-canary_ingress-canary-574w7 to CNI network "multus-cni-network": Multus: [openshift-ingress-canary/ingress-canary-574w7/eb6fc672-66f0-43df-8884-a73012863687]: have you checked that your default network is ready? still waiting for readinessindicatorfile @ /var/run/multus/cni/net.d/10-ovn-kubernetes.conf. pollimmediate error: timed out waiting for the condition
openshift-dns                                      51m         Warning   FailedCreatePodSandBox                        pod/dns-default-ghfxm                                                 Failed to create pod sandbox: rpc error: code = Unknown desc = failed to create pod network sandbox k8s_dns-default-ghfxm_openshift-dns_c835c4c8-430f-492c-a485-4f5e953d1d57_0(8a1ae6725c6d528aa8ac7b9f058f104f3d40fd66ecc755fe271d6a71e8f706f4): error adding pod openshift-dns_dns-default-ghfxm to CNI network "multus-cni-network": Multus: [openshift-dns/dns-default-ghfxm/c835c4c8-430f-492c-a485-4f5e953d1d57]: have you checked that your default network is ready? still waiting for readinessindicatorfile @ /var/run/multus/cni/net.d/10-ovn-kubernetes.conf. pollimmediate error: timed out waiting for the condition
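
These events can be collected cluster-wide with, for example:

# List the FailedCreatePodSandBox warnings across all namespaces
oc get events -A --field-selector reason=FailedCreatePodSandBox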

Inspecting the logs of the failing ovnkube pods, we can see they complain about the br-ex link not being found. For example, from ovnkube-node-kc649:

2022-06-15T02:57:51.566341230Z I0615 02:57:51.566298  189133 gateway_localnet.go:173] Node local addresses initialized to: map[10.131.0.2:{10.131.0.0 fffffe00} 127.0.0.1:{127.0.0.0 ff000000} 16.1.15.2:{16.1.15.0 fffffffc} 172.22.2.196:{172.22.0.0 fffff800} 192.168.12.23:{192.168.12.0 ffffff00} 192.168.13.20:{192.168.13.0 ffffff00} ::1:{::1 ffffffffffffffffffffffffffffffff} fe80::440c:84ff:fe02:80fc:{fe80:: ffffffffffffffff0000000000000000} fe80::4cdd:55ff:fe87:975f:{fe80:: ffffffffffffffff0000000000000000} fe80::cc8d:ddff:fe76:ed26:{fe80:: ffffffffffffffff0000000000000000} fe80::f603:43ff:fecc:c0b0:{fe80:: ffffffffffffffff0000000000000000}]
2022-06-15T02:57:51.566518040Z I0615 02:57:51.566508  189133 helper_linux.go:73] Found default gateway interface bond0 192.168.12.1
2022-06-15T02:57:51.566551517Z F0615 02:57:51.566544  189133 ovnkube.go:130] could not find IP addresses: failed to lookup link br-ex: Link not found
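
For reference, the log excerpt above comes from the failing pod, and the missing bridge can be confirmed on the node itself (openshift-ovn-kubernetes and the ovnkube-node container are the OCP defaults; worker-3 matches the 192.168.12.23 address in the log):

# Logs of the node component of the failing pod
oc logs -n openshift-ovn-kubernetes ovnkube-node-kc649 -c ovnkube-node
# Confirm on the node that the br-ex bridge is missing
oc debug node/worker-3 -- chroot /host ip link show br-ex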

Expected results:

The PerformanceProfile should be applied correctly, with all nodes transitioning to the new MachineConfig, ending up in Ready status, and with the ovnkube pods working correctly. Again, this works on OCP 4.9.37 but not on OCP 4.9.38.

Additional info:

In the DCI job referenced in this BZ, the must-gather of the installation is available at the following link: https://www.distributed-ci.io/jobs/2f2dd76f-21f9-4a5a-9f77-c188f03b591c/files (look for the file named must_gather.tar.gz and download it).

Comment 1 Jaime Caamaño Ruiz 2022-06-15 14:59:02 UTC
This issue was introduced by the first PR attached to https://bugzilla.redhat.com/show_bug.cgi?id=2089763. A fix is already attached to that BZ as well and is being considered in its verification. Linking the fixing PR to this BZ for reference and marking this as a duplicate.

*** This bug has been marked as a duplicate of bug 2089763 ***

