Description of problem:
A downtime of 2-3 minutes is seen during CA rotation, during which no new VMs can be created. The length of the downtime depends on the kubelet configuration, the certificate key size, and the TLS handshake process.
https://kubernetes.io/docs/concepts/configuration/secret/#mounted-secrets-are-updated-automatically
During this time the CA bundle is unstable, so TLS connections break and VM creation fails if the VM is created in a namespace that has the label mutatevirtualmachines.kubemacpool.io=allocate.

Common failures seen during this time:

Error from server (InternalError): error when creating "vm_create.yaml": Internal error occurred: failed calling webhook "mutatevirtualmachines.kubemacpool.io": Post https://kubemacpool-service.openshift-cnv.svc:443/mutate-virtualmachines?timeout=30s: x509: certificate signed by unknown authority

Error from server (InternalError): error when creating "vm_create.yaml": Internal error occurred: failed calling webhook "mutatevirtualmachines.kubemacpool.io": Post https://kubemacpool-service.openshift-cnv.svc:443/mutate-virtualmachines?timeout=30s: dial tcp 10.128.2.32:8000: connect: connection refused

Version-Release number of selected component (if applicable):
$ oc get csv -n openshift-cnv | awk ' { print $4 } ' | tail -n1
2.4.0

How reproducible:
Always

Steps to Reproduce:
1. Create a certificate and apply it to the caBundle of the mutatingwebhookconfiguration.
2. Scripts used: https://github.com/k8snetworkplumbingwg/kubemacpool/pull/193
3.

Actual results:
KMP becomes unstable and is unable to process requests while the CA bundle is unstable.

Expected results:
Ideally, the customer/admin should be able to configure the certificate rotation policy based on their downtime, needs, and availability requirements. Also, the downtime while the system is in an unstable state should be reduced, or some rescheduling policy should be applied.

Additional info:
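To put a number on the downtime window described above, one option is to attempt a VM creation at regular intervals during the rotation and classify the output. The following is only a minimal sketch, not part of the scripts linked in the report. The function names and the sampling format (timestamp, captured error text) are hypothetical; the matched substrings are copied from the two error messages quoted above.

```python
# Substrings copied from the two webhook errors quoted in this report.
WEBHOOK_NAME = "mutatevirtualmachines.kubemacpool.io"
FAILURE_MARKERS = (
    "x509: certificate signed by unknown authority",
    "connect: connection refused",
)


def is_ca_rotation_failure(error_text: str) -> bool:
    """Return True if the text matches one of the webhook failures above."""
    called_kmp_webhook = f'failed calling webhook "{WEBHOOK_NAME}"' in error_text
    has_marker = any(marker in error_text for marker in FAILURE_MARKERS)
    return called_kmp_webhook and has_marker


def downtime_seconds(samples) -> float:
    """Estimate the downtime from (timestamp_seconds, error_text) samples.

    Samples are assumed to be ordered by time. The result is the span
    between the first and the last matching failure, or 0.0 if no sample
    matches. This is an assumption-laden estimate: its resolution is
    limited by the sampling interval.
    """
    hits = [ts for ts, text in samples if is_ca_rotation_failure(text)]
    if not hits:
        return 0.0
    return float(hits[-1] - hits[0])
```

The samples would typically come from a loop that runs the create command every few seconds and records its stderr. How that loop is driven is left open here.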
Thanks for opening this. Since the rotation interval is quite long and the downtime happens only in opted-in namespaces, I suggest we handle this in 2.5 (and not as a 2.4 blocker).
We need HCO to expose rotation parameters on its API. That will happen only in 2.6.
Created attachment 1742246: vm-fedora
VM Fedora with namespace: kmp-ns-bug
Created attachment 1742247: kmp-namespace
Verified.

Steps verified:
1. Create a certificate with the given scripts.
2. Create a namespace with the label "mutatevirtualmachines.kubemacpool.io: allocate" and apply it (oc apply -f namespace.yaml). See the attached namespace.yaml.
3. Create a VM in the namespace created above. See the attached vm-fedora.yaml.
4. Check that the VM is created successfully and is running, and that the KMP pods are running.
5. Deleting the VM works without latency or downtime.
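For readers without access to the attachments, the opted-in namespace from step 2 can be sketched as follows. This is an assumption about the shape of the attached namespace.yaml, not a copy of it: only the namespace name (kmp-ns-bug) and the label come from this report, and the helper name is hypothetical. The manifest is emitted as JSON, which oc apply -f also accepts.

```python
import json


def kmp_namespace_manifest(name: str) -> dict:
    """Build a Namespace manifest that opts in to KubeMacPool MAC allocation.

    The label key and value are the ones quoted in this report; the rest is
    the standard core/v1 Namespace structure.
    """
    return {
        "apiVersion": "v1",
        "kind": "Namespace",
        "metadata": {
            "name": name,
            "labels": {"mutatevirtualmachines.kubemacpool.io": "allocate"},
        },
    }


if __name__ == "__main__":
    # Redirect the output to a file and pass that file to `oc apply -f`.
    print(json.dumps(kmp_namespace_manifest("kmp-ns-bug"), indent=2))
```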
Comment on attachment 1742247: kmp-namespace
KMP namespace example: kmp-ns-bug. Has label: "mutatevirtualmachines.kubemacpool.io: allocate"
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (Moderate: OpenShift Virtualization 2.6.0 security and bug fix update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHSA-2021:0799