Description of problem:
When CNV is installed, the kubemacpool mutating webhook interferes with pods in the openshift-ovn-kubernetes namespace because that namespace is not labeled with kubemacpool/ignoreAdmission: "". If the kubemacpool controller is down or unavailable and a pod is deleted or a node is restarted, the pods cannot be recreated or restarted; they stay blocked waiting for kubemacpool.

Version-Release number of selected component (if applicable):
OCP 4.3.5
CNV 2.2

Steps to Reproduce:
1. Put kubemacpool into CrashLoopBackOff.
2. Delete one of the ovnkube-node pods in the openshift-ovn-kubernetes namespace.
3. The container will not be able to run.

Expected results:
Two options:
1) The CNV operator should label the openshift-ovn-kubernetes namespace with kubemacpool/ignoreAdmission: "" during deployment (a manual sketch of this is shown below).
2) kubemacpool should use a whitelist model instead of a blacklist model when determining in which namespaces to apply.
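For reference, a minimal sketch of option 1 applied by hand as a stop-gap. This assumes the webhook's namespaceSelector actually honors the kubemacpool/ignoreAdmission label named above:

# Label the namespace so the kubemacpool webhook skips it (assumed label key)
$ oc label namespace openshift-ovn-kubernetes kubemacpool/ignoreAdmission=""

# Confirm the label is present
$ oc get namespace openshift-ovn-kubernetes --show-labels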
It seems we need to backport https://github.com/k8snetworkplumbingwg/kubemacpool/commit/02a7388b7c98336674f7425aab30686e69536966 to 2.3. I'm on it.
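To check which model (opt-in or opt-out) the deployed webhook is using, the namespaceSelector of the kubemacpool MutatingWebhookConfiguration can be inspected. The object name below is an assumption; use whatever name the first command returns:

$ oc get mutatingwebhookconfigurations | grep -i kubemacpool

# "kubemacpool-mutator" is an assumed name; substitute the one listed above
$ oc get mutatingwebhookconfiguration kubemacpool-mutator -o jsonpath='{.webhooks[*].namespaceSelector}'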
Test Environment:
==================
$ oc version
Client Version: 4.4.0-0.nightly-2020-02-17-022408
Server Version: 4.4.0-rc.4
Kubernetes Version: v1.17.1

CNV Version:
$ oc get csv -n openshift-cnv | awk ' { print $4 } ' | tail -n1
2.3.0

Steps:
=====
Bug Summary: When kubemacpool is broken, it blocks openshift-ovn-kubernetes and the cluster may become unusable.
Fix: Even when kubemacpool is in CrashLoopBackOff, pods that get killed or deleted in the openshift-ovn-kubernetes namespace should be started again.

1. Put kubemacpool into CrashLoopBackOff:
--
oc edit -n openshift-cnv deployment kubemacpool-mac-controller-manager
Change the /manager command to "false" (or anything invalid) and save the file. Then check the pod status; the controller pod should move to CrashLoopBackOff:
--
$ oc get pods -n openshift-cnv | grep -i crash
kubemacpool-mac-controller-manager-6767f6c687-g98n5   0/1   CrashLoopBackOff   13   43m

2. Delete one of the ovnkube-node pods in the openshift-ovn-kubernetes namespace:
--
oc delete pods ovnkube-node-kjx8z -n openshift-ovn-kubernetes

3. Make sure the pod comes up:
oc get pods -n openshift-ovn-kubernetes

Test Case 2: kill all pods and make sure they come back up (a variant without hardcoded pod names is sketched after the output below).
============
for i in ovnkube-master-l8tc9 ovnkube-master-sgx6b ovnkube-master-zsxlf ovnkube-node-625gd ovnkube-node-6fd2d ovnkube-node-7n7x5 ovnkube-node-dn7x8 ovnkube-node-hxgx8 ovnkube-node-j4pkn; do oc delete pods $i -n openshift-ovn-kubernetes; done

Check the status of the pods afterwards. The number of pods before and after should be the same.

$ oc get pods -n openshift-ovn-kubernetes
NAME                   READY   STATUS    RESTARTS   AGE
ovnkube-master-5655k   4/4     Running   0          87s
ovnkube-master-9mv4v   4/4     Running   0          84s
ovnkube-master-c29gg   4/4     Running   0          75s
ovnkube-node-2cg64     2/2     Running   0          46s
ovnkube-node-2n86d     2/2     Running   0          55s
ovnkube-node-5fhqk     2/2     Running   0          70s
ovnkube-node-74cqs     2/2     Running   0          57s
ovnkube-node-mqbrr     2/2     Running   0          68s
ovnkube-node-tdjvg     2/2     Running   0          53s
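A variant of Test Case 2 that avoids hardcoding pod names; this assumes every pod in the namespace is owned by the ovnkube daemonsets/deployments (as on this cluster), so deleted pods are recreated automatically:

# Record the pod count, delete everything, wait for the owners to recreate the pods, then compare counts
$ oc get pods -n openshift-ovn-kubernetes --no-headers | wc -l
$ oc delete pods --all -n openshift-ovn-kubernetes
$ oc wait pods --all -n openshift-ovn-kubernetes --for=condition=Ready --timeout=300s
$ oc get pods -n openshift-ovn-kubernetes --no-headers | wc -l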
As this is becoming important, I raised the Customer Escalation Flag.
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHEA-2020:2011