+++ This bug was initially created as a clone of Bug #2011376 +++

+++ This bug was initially created as a clone of Bug #2011369 +++

Description of problem:
caBundles are stored on the mutatingWebhookConfiguration instance. The kubemacpool and kubernetes-nmstate components use the kube-admission-webhook (https://github.com/qinqon/kube-admission-webhook) pod to rotate CA certs and append them to the appropriate mutatingWebhookConfiguration.
After a large number of rotations (~700+), the caBundle appended to the mutatingWebhookConfiguration can no longer be applied, due to etcd's request size restriction (1.5 MiB).
As a result, the kubemacpool/kubernetes-nmstate pods decline requests with:
```
x509: certificate has expired or is not yet valid
```

In the kube-admission-webhook logs one can see that it fails to apply the caBundle:
```
failed rotating all certs: failed adding new CA cert to CA bundle at webhook: failed to update webhook CABundle: failed to update webhook CABundle: etcdserver: request is too large
```

Version-Release number of selected component (if applicable):

How reproducible:

Steps to Reproduce:
1. Manually cause a lot of CA rotations on kubemacpool/kubernetes-nmstate.
2.
3.

Actual results:
VM cannot be created, due to "x509: certificate has expired or is not yet valid"

Expected results:
VM should be able to be created with no errors.

Additional info:

--- Additional comment from Ram Lavi on 2021-10-06 14:08:35 UTC ---

https://github.com/qinqon/kube-admission-webhook/pull/54

--- Additional comment from Ram Lavi on 2021-10-06 14:12:36 UTC ---

https://github.com/qinqon/kube-admission-webhook/pull/54
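For anyone triaging a cluster that may be affected, one way to see how far a caBundle has grown is to count the PEM certificates it holds. This is only a minimal sketch and not part of the original report: the helper names `count_pem_certs` and `ca_bundle_cert_count` are hypothetical, the default resource name `nmstate` and the choice of the first webhook entry are assumptions, and `base64 -d` assumes GNU coreutils.

```shell
#!/usr/bin/env bash

# Count the PEM certificates read from stdin.
count_pem_certs() {
  # grep -c prints 0 and exits non-zero when nothing matches;
  # "|| true" keeps the exit status clean in that case.
  grep -c -- '-----BEGIN CERTIFICATE-----' || true
}

# Print the number of certificates in the caBundle of the first webhook
# of a MutatingWebhookConfiguration.
# The resource name is an assumption and defaults to "nmstate".
ca_bundle_cert_count() {
  local name="${1:-nmstate}"
  # caBundle is a byte field, so it is base64-encoded in the API output.
  oc get mutatingwebhookconfiguration "$name" \
    -o jsonpath='{.webhooks[0].clientConfig.caBundle}' \
    | base64 -d \
    | count_pem_certs
}
```

Running `ca_bundle_cert_count nmstate` before and after a few rotations should show whether the count keeps climbing or levels off.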
KAW PR: https://github.com/qinqon/kube-admission-webhook/pull/56
KAW bump on KMP: https://github.com/k8snetworkplumbingwg/kubemacpool/pull/340
KAW bump in KNMSTATE: https://github.com/nmstate/kubernetes-nmstate/pull/852
@ralavi, are all the needed patches merged on the U/S CNAO stable branch? We should be prepared, so that the only thing missing to get this to D/S is merging an M/S patch.
The reproduction scenario is the exact one from https://bugzilla.redhat.com/show_bug.cgi?id=2011376#c5:

1. Check the preliminary contents (specifically the size) of the nmstate mutatingwebhookconfiguration resource:

$ oc describe mutatingwebhookconfiguration nmstate > mutatingwebhookconfiguration-nmstate.orig
$ ll
total 16
-rw-rw-r--. 1 cnv-qe-jenkins cnv-qe-jenkins 14368 Jan 10 12:03 mutatingwebhookconfiguration-nmstate.orig

2. Using the attached script, I deleted the nmstate-ca secret (in the openshift-cnv namespace) every 10 seconds, for 240 iterations. The script also stores the contents of the mutatingwebhookconfiguration resource in a file every 10 iterations.

3. After the script finished, I checked how the size of the mutatingwebhookconfiguration resource progressed:

$ ll -tr
total 8192
-rw-rw-r--. 1 cnv-qe-jenkins cnv-qe-jenkins  14368 Jan 10 12:16 mutatingwebhookconfiguration-nmstate.0
-rw-rw-r--. 1 cnv-qe-jenkins cnv-qe-jenkins  56681 Jan 10 12:18 mutatingwebhookconfiguration-nmstate.10
-rw-rw-r--. 1 cnv-qe-jenkins cnv-qe-jenkins  99005 Jan 10 12:20 mutatingwebhookconfiguration-nmstate.20
-rw-rw-r--. 1 cnv-qe-jenkins cnv-qe-jenkins 141329 Jan 10 12:21 mutatingwebhookconfiguration-nmstate.30
-rw-rw-r--. 1 cnv-qe-jenkins cnv-qe-jenkins 183641 Jan 10 12:23 mutatingwebhookconfiguration-nmstate.40
-rw-rw-r--. 1 cnv-qe-jenkins cnv-qe-jenkins 225965 Jan 10 12:25 mutatingwebhookconfiguration-nmstate.50
-rw-rw-r--. 1 cnv-qe-jenkins cnv-qe-jenkins 268289 Jan 10 12:27 mutatingwebhookconfiguration-nmstate.60
-rw-rw-r--. 1 cnv-qe-jenkins cnv-qe-jenkins 310601 Jan 10 12:28 mutatingwebhookconfiguration-nmstate.70
-rw-rw-r--. 1 cnv-qe-jenkins cnv-qe-jenkins 352925 Jan 10 12:30 mutatingwebhookconfiguration-nmstate.80
-rw-rw-r--. 1 cnv-qe-jenkins cnv-qe-jenkins 395249 Jan 10 12:32 mutatingwebhookconfiguration-nmstate.90
-rw-rw-r--. 1 cnv-qe-jenkins cnv-qe-jenkins 437562 Jan 10 12:34 mutatingwebhookconfiguration-nmstate.100
-rw-rw-r--. 1 cnv-qe-jenkins cnv-qe-jenkins 479886 Jan 10 12:36 mutatingwebhookconfiguration-nmstate.110
-rw-rw-r--. 1 cnv-qe-jenkins cnv-qe-jenkins 522210 Jan 10 12:41 mutatingwebhookconfiguration-nmstate.120
-rw-rw-r--. 1 cnv-qe-jenkins cnv-qe-jenkins 564522 Jan 10 12:43 mutatingwebhookconfiguration-nmstate.130
-rw-rw-r--. 1 cnv-qe-jenkins cnv-qe-jenkins 606846 Jan 10 12:45 mutatingwebhookconfiguration-nmstate.140
-rw-rw-r--. 1 cnv-qe-jenkins cnv-qe-jenkins 649170 Jan 10 12:47 mutatingwebhookconfiguration-nmstate.150
-rw-rw-r--. 1 cnv-qe-jenkins cnv-qe-jenkins 691482 Jan 10 12:48 mutatingwebhookconfiguration-nmstate.160
-rw-rw-r--. 1 cnv-qe-jenkins cnv-qe-jenkins 733806 Jan 10 12:50 mutatingwebhookconfiguration-nmstate.170
-rw-rw-r--. 1 cnv-qe-jenkins cnv-qe-jenkins 776130 Jan 10 12:52 mutatingwebhookconfiguration-nmstate.180
-rw-rw-r--. 1 cnv-qe-jenkins cnv-qe-jenkins 818442 Jan 10 12:54 mutatingwebhookconfiguration-nmstate.190
-rw-rw-r--. 1 cnv-qe-jenkins cnv-qe-jenkins 860766 Jan 10 12:55 mutatingwebhookconfiguration-nmstate.200
-rw-rw-r--. 1 cnv-qe-jenkins cnv-qe-jenkins 903090 Jan 10 12:57 mutatingwebhookconfiguration-nmstate.210
-rw-rw-r--. 1 cnv-qe-jenkins cnv-qe-jenkins 945402 Jan 10 12:59 mutatingwebhookconfiguration-nmstate.220
-rw-rw-r--. 1 cnv-qe-jenkins cnv-qe-jenkins 987726 Jan 10 13:01 mutatingwebhookconfiguration-nmstate.230

The resource size should have remained steady after ~100 iterations, but instead it kept increasing.

OCP version: 4.7.37
CNV Version: 2.6.9
nmstate-handler: v2.6.9-2
kubemacpool: v2.6.9-2
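The attached script is not reproduced in this thread. For readers without access to the attachment, the loop described in step 2 could look roughly like the sketch below. It is a hypothetical reconstruction under stated assumptions, not the actual script: the function name `rotate_and_snapshot` and the overridable `OC` variable are invented for illustration, and it takes one extra snapshot after the final iteration.

```shell
#!/usr/bin/env bash
# Hypothetical reconstruction of the reproduction loop; the script attached
# to the bug is not shown in this thread.
# OC can be overridden (for example with a stub) for dry runs.
OC="${OC:-oc}"

# Usage: rotate_and_snapshot ITERATIONS INTERVAL_SECONDS SNAPSHOT_EVERY
rotate_and_snapshot() {
  local iterations="$1" interval="$2" every="$3" i
  for ((i = 0; i <= iterations; i++)); do
    # Store the webhook configuration every SNAPSHOT_EVERY iterations,
    # starting with iteration 0 (before any secret has been deleted).
    if ((i % every == 0)); then
      "$OC" describe mutatingwebhookconfiguration nmstate \
        > "mutatingwebhookconfiguration-nmstate.$i"
    fi
    # Deleting the CA secret is what forces a CA rotation.
    if ((i < iterations)); then
      "$OC" delete secret nmstate-ca -n openshift-cnv
      sleep "$interval"
    fi
  done
}
```

With these assumptions, `rotate_and_snapshot 240 10 10` would match the parameters described in step 2.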
@Yossi, I see that the version of kubemacpool you used was 2.6.9-2 while the build where this issue was fixed was -5. The fix coming from kube-admission-webhook was added after -2: https://github.com/k8snetworkplumbingwg/kubemacpool/compare/d0b92cfd8807c470bcee65a40c550b4359365245..6fe75e4a5fdd87b2c4ef8cc7851ac6bb65f5461e (diff between the two builds). Could you please try to verify it again with kubemacpool 2.6.9-5?
Sorry, my bad. I missed the '-5'. Moving back to ON_QA, I will verify with the correct version. Petr, can you tell if kmp 2.6.9-5 is already available on D/S? CNV on this cluster was installed yesterday.
Verified by running again the same scenario as in comment #5, with the exception of modifying the script to run 120 iterations (instead of 240).
This time I verified the cluster has kubemacpool v2.6.9-5.

Output samples:

$ ll -tr
total 3256
-rw-rw-r--. 1 cnv-qe-jenkins cnv-qe-jenkins  10130 Jan 11 19:21 mutatingwebhookconfiguration-nmstate.orig
-rw-rw-r--. 1 cnv-qe-jenkins cnv-qe-jenkins  10130 Jan 11 19:21 mutatingwebhookconfiguration-nmstate.0
-rw-rw-r--. 1 cnv-qe-jenkins cnv-qe-jenkins  52455 Jan 11 19:23 mutatingwebhookconfiguration-nmstate.10
-rw-rw-r--. 1 cnv-qe-jenkins cnv-qe-jenkins  94767 Jan 11 19:25 mutatingwebhookconfiguration-nmstate.20
-rw-rw-r--. 1 cnv-qe-jenkins cnv-qe-jenkins 137091 Jan 11 19:26 mutatingwebhookconfiguration-nmstate.30
-rw-rw-r--. 1 cnv-qe-jenkins cnv-qe-jenkins 179415 Jan 11 19:28 mutatingwebhookconfiguration-nmstate.40
-rw-rw-r--. 1 cnv-qe-jenkins cnv-qe-jenkins 221727 Jan 11 19:30 mutatingwebhookconfiguration-nmstate.50
-rw-rw-r--. 1 cnv-qe-jenkins cnv-qe-jenkins 264051 Jan 11 19:32 mutatingwebhookconfiguration-nmstate.60
-rw-rw-r--. 1 cnv-qe-jenkins cnv-qe-jenkins 306376 Jan 11 19:33 mutatingwebhookconfiguration-nmstate.70
-rw-rw-r--. 1 cnv-qe-jenkins cnv-qe-jenkins 348688 Jan 11 19:35 mutatingwebhookconfiguration-nmstate.80
-rw-rw-r--. 1 cnv-qe-jenkins cnv-qe-jenkins 391012 Jan 11 19:37 mutatingwebhookconfiguration-nmstate.90
-rw-rw-r--. 1 cnv-qe-jenkins cnv-qe-jenkins 429101 Jan 11 19:39 mutatingwebhookconfiguration-nmstate.100
-rw-rw-r--. 1 cnv-qe-jenkins cnv-qe-jenkins 429101 Jan 11 19:40 mutatingwebhookconfiguration-nmstate.110
-rw-rw-r--. 1 cnv-qe-jenkins cnv-qe-jenkins 429101 Jan 11 19:44 mutatingwebhookconfiguration-nmstate.120

As can be seen, the resource size kept increasing until, after 90-100 iterations, it remained steady.
Finally, I created a simple VM and verified that it is running, to ensure this CA rotation didn't break the CNV functionality.

OCP version: 4.7.37
CNV Version: 2.6.9 (HCO Version: v2.6.9-40, taken from the deployment job console output)
nmstate-handler: v2.6.9-5
kubemacpool: v2.6.9-5
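The plateau can also be read off mechanically instead of by eye. The sketch below prints each snapshot's size and its growth over the previous snapshot, so a delta of 0 marks the point where the bundle stopped growing. The function name `snapshot_growth` is hypothetical, and the only assumption about the files is the `<prefix>.<iteration>` naming used in the listings in this bug.

```shell
#!/usr/bin/env bash

# Print "<iteration> <size-in-bytes> <delta>" for every snapshot file named
# <prefix>.<iteration>, in numeric iteration order. Files whose suffix is not
# a number (such as ".orig") are skipped. The first row has "-" as its delta.
snapshot_growth() {
  local prefix="${1:-mutatingwebhookconfiguration-nmstate}"
  local f n size prev=""
  for f in "$prefix".*; do
    n="${f##*.}"
    if [[ "$n" =~ ^[0-9]+$ ]]; then
      printf '%s\n' "$n"
    fi
  done | sort -n | while read -r n; do
    size="$(wc -c < "$prefix.$n")"
    size="$((size))" # strip the padding some wc implementations add
    if [ -n "$prev" ]; then
      echo "$n $size $((size - prev))"
    else
      echo "$n $size -"
    fi
    prev="$size"
  done
}
```

Going by the sizes in the listing above, the deltas for this run are between 42312 and 42325 bytes per snapshot up to iteration 90, 38089 between iterations 90 and 100, and 0 from then on.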
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (OpenShift Virtualization 2.6.9 Images), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHBA-2022:0414