Created attachment 1847607 [details]
log from kubemacpool-cert-manager

Description of problem:
On a specific big cluster (~500 nodes), the kubemacpool-mac-controller-manager pod never reaches the Ready state, making it impossible to define VMs.

Version-Release number of selected component (if applicable):
CNV-4.9.1

$ oc version
Client Version: 4.9.12
Server Version: 4.9.12
Kubernetes Version: v1.22.3+e790d7f

How reproducible:
Repeatedly, on one specific cluster.

Steps to Reproduce:
1. Install OpenShift Virtualization

Actual results:
$ oc get pod -n openshift-cnv -l app=kubemacpool
NAME                                                 READY   STATUS    RESTARTS        AGE
kubemacpool-cert-manager-7b7bcfc9db-2c8p6            1/1     Running   0               3h13m
kubemacpool-mac-controller-manager-88b9c5b99-tt4tk   0/1     Running   33 (112s ago)   5h27m

$ ./vm.sh | oc apply -f -
Error from server (InternalError): error when creating "STDIN": Internal error occurred: failed calling webhook "mutatevirtualmachines.kubemacpool.io": failed to call webhook: Post "https://kubemacpool-service.openshift-cnv.svc:443/mutate-virtualmachines?timeout=10s": dial tcp 10.129.0.72:8000: connect: connection refused

Expected results:
kubemacpool is READY, serving unique MAC addresses to VMs. The VM is defined.

Additional info:
Use
$ oc label namespace mynamespace mutatevirtualmachines.kubemacpool.io=ignore
to disable kubemacpool in your namespace. Make sure that your VMs do not refer to kubemacpool.io in their finalizers.
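One quick way to spot VMs that still carry a kubemacpool.io finalizer (not from the original report; assumes the standard vms short name for KubeVirt VirtualMachines):

$ oc get vms -A -o jsonpath='{range .items[*]}{.metadata.namespace}/{.metadata.name} {.metadata.finalizers}{"\n"}{end}' | grep kubemacpool.io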
Created attachment 1847608 [details] log from non-ready kubemacpool-mac-controller-manager
Created attachment 1847610 [details] describe non-ready kubemacpool-mac-controller-manager
After some digging, it looks like InitMap takes a long time to finish (it performs an API access per pod/VM), so the process never reaches the webhook start and the readiness probe hits its timeout.

Possible solutions:
1. Increase the readiness probe timeout.
2. Remove the per-pod/VM API accesses in InitMap by:
   a. Using the controller-runtime client
   b. Caching namespaces and the webhook configuration at the beginning
3. Parallelize InitMap and use a sync.Map for the data structure.

I suggest we go for 2.a: we get the cache for free and it is already well tested. A rough sketch of that approach follows.
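To illustrate 2.a, a minimal sketch (not the actual kubemacpool code; names such as initMacMap and the pool package are made up for illustration):

// Sketch only: pods are listed through the controller-runtime client.
// When c is the manager's client (mgr.GetClient()) and the cache has
// synced, the List call is answered from the shared informer cache in
// memory rather than with one API round trip per pod, as happens today.
package pool

import (
	"context"

	corev1 "k8s.io/api/core/v1"
	"sigs.k8s.io/controller-runtime/pkg/client"
)

func initMacMap(ctx context.Context, c client.Client, macMap map[string]struct{}) error {
	podList := &corev1.PodList{}
	if err := c.List(ctx, podList); err != nil {
		return err
	}
	for _, pod := range podList.Items {
		// Parse MAC addresses out of the pod's network annotations and
		// record them in macMap (parsing omitted here).
		_ = pod
	}
	return nil
}

The same pattern would apply to VMs (listed via the KubeVirt API types) instead of fetching each object individually.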
Deployment of KubeMacPool can be avoided completely with:

$ kubectl annotate --overwrite -n openshift-cnv hco kubevirt-hyperconverged 'networkaddonsconfigs.kubevirt.io/jsonpatch=[{"op": "replace","path": "/spec/kubeMacPool","value": null}]'

(note the plural form of networkaddonsconfigs)
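Once the annotation is applied, the cluster-network-addons operator should remove the KubeMacPool deployment; this can be confirmed with the same label used in the report above:

$ oc get pod -n openshift-cnv -l app=kubemacpool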
Targeting 4.11. We want to take our time to properly design the solution. The workaround is described above.
https://github.com/k8snetworkplumbingwg/kubemacpool/pull/354
@ralavi Please update the Doc Type and Doc Text fields. Because this issue is now resolved, it is no longer a known issue. The documentation team will exclude the known issue from the 4.11 release notes. On a large cluster, the OpenShift Virtualization MAC pool manager might take too much time to boot and OpenShift Virtualization might not become ready. As a workaround, if you do not require MAC pooling functionality, then disable this sub-component by running the following command: `oc annotate --overwrite -n openshift-cnv hco kubevirt-hyperconverged 'networkaddonsconfigs.kubevirt.io/jsonpatch=[{"op": "replace","path": "/spec/kubeMacPool","value": null}]'`.
@ctomc Is removing the release note from the BZ and setting the status to "If docs needed, set a value" good enough?
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (Important: OpenShift Virtualization 4.11.0 Images security and bug fix update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHSA-2022:6526