Bug 1848956

Summary: KMP requires downtime for CA stabilization during certificate rotation
Product: Container Native Virtualization (CNV)
Reporter: Geetika Kapoor <gkapoor>
Component: Networking
Assignee: Petr Horáček <phoracek>
Status: CLOSED ERRATA
QA Contact: Ofir Nash <onash>
Severity: medium
Docs Contact:
Priority: medium
Version: 2.4.0
CC: cnv-qe-bugs, ncredi, onash
Target Milestone: ---
Target Release: 2.6.0
Hardware: Unspecified
OS: Unspecified
Whiteboard:
Fixed In Version: cluster-network-addons-operator-container-v2.5.0-8
Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of:
Environment:
Last Closed: 2021-03-10 11:16:12 UTC
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Embargoed:
Attachments:
Description      Flags
vm-fedora        none
kmp-namespace    none

Description Geetika Kapoor 2020-06-19 11:33:10 UTC
Description of problem:

A downtime of 2-3 minutes is seen during CA rotation, during which no new VMs can be created. The length of the downtime depends on the kubelet configuration, the certificate key size, and the TLS handshake process.
https://kubernetes.io/docs/concepts/configuration/secret/#mounted-secrets-are-updated-automatically
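
Note: the Kubernetes page linked above describes the delay for a mounted Secret update as, at worst, the kubelet sync period plus the cache propagation delay. A minimal sketch of that arithmetic follows. Both numbers are illustrative assumptions, not values measured on the affected cluster, and `KubeletSecretSync` is a hypothetical helper name.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class KubeletSecretSync:
    """Hypothetical container for the two delays that bound Secret propagation."""
    sync_period_s: float   # assumed kubelet sync period, in seconds
    cache_delay_s: float   # assumed cache propagation delay, in seconds


def worst_case_propagation_s(cfg: KubeletSecretSync) -> float:
    """Upper bound on how long a pod may keep seeing the old Secret content."""
    return cfg.sync_period_s + cfg.cache_delay_s


# Illustrative values only; read the real ones from the cluster's kubelet config.
example = KubeletSecretSync(sync_period_s=60.0, cache_delay_s=60.0)
print(worst_case_propagation_s(example))
```

With these assumed values the bound is two minutes, the same order of magnitude as the downtime reported here.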

During this time the CA is unstable, so TLS connections break and VM creation fails if the VM is created in a namespace that has the label mutatevirtualmachines.kubemacpool.io=allocate

Common failures seen during this time are:

Error from server (InternalError): error when creating "vm_create.yaml": Internal error occurred: failed calling webhook "mutatevirtualmachines.kubemacpool.io": Post https://kubemacpool-service.openshift-cnv.svc:443/mutate-virtualmachines?timeout=30s: x509: certificate signed by unknown authority 

Error from server (InternalError): error when creating "vm_create.yaml": Internal error occurred: failed calling webhook "mutatevirtualmachines.kubemacpool.io": Post https://kubemacpool-service.openshift-cnv.svc:443/mutate-virtualmachines?timeout=30s: dial tcp 10.128.2.32:8000: connect: connection refused
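
Note: the two failures above point at different stages of the rotation. The first means the API server does not yet trust the certificate the webhook presents, and the second means nothing is accepting connections on the webhook endpoint. A small sketch for sorting captured error lines into those two groups follows. The category names are made up for illustration, and the matching relies only on the substrings visible in the messages above.

```python
def classify_webhook_error(message: str) -> str:
    """Map a webhook failure message to a coarse, illustrative category."""
    if "x509: certificate signed by unknown authority" in message:
        # The caller's CA bundle does not match the serving certificate.
        return "ca-mismatch"
    if "connection refused" in message:
        # The webhook endpoint is not accepting connections.
        return "endpoint-down"
    return "unknown"


sample = (
    'failed calling webhook "mutatevirtualmachines.kubemacpool.io": '
    "x509: certificate signed by unknown authority"
)
print(classify_webhook_error(sample))
```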



Version-Release number of selected component (if applicable):


$ oc get csv -n openshift-cnv | awk ' { print $4 } ' | tail -n1
2.4.0

How reproducible:

always 

Steps to Reproduce:
1. Create a certificate and apply it to the caBundle of the MutatingWebhookConfiguration
2. Scripts used: https://github.com/k8snetworkplumbingwg/kubemacpool/pull/193
3.
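
Note: the scripts from the linked PR are not reproduced here. As a rough, independent sketch of step 1, the snippet below builds a JSON Patch that replaces the caBundle of one webhook entry with a base64-encoded PEM certificate. The function name, the webhook index, and the placeholder PEM content are assumptions for illustration.

```python
import base64
import json


def ca_bundle_patch(pem_bytes: bytes, webhook_index: int = 0) -> str:
    """Return a JSON Patch document that swaps in a new caBundle.

    caBundle is carried as base64-encoded PEM in the webhook configuration,
    so the PEM bytes are encoded before being placed in the patch.
    """
    encoded = base64.b64encode(pem_bytes).decode("ascii")
    patch = [
        {
            "op": "replace",
            "path": f"/webhooks/{webhook_index}/clientConfig/caBundle",
            "value": encoded,
        }
    ]
    return json.dumps(patch)


# Placeholder PEM content; a real run would read the generated CA file.
print(ca_bundle_patch(b"-----BEGIN CERTIFICATE-----\n...\n-----END CERTIFICATE-----\n"))
```

The resulting string could be passed to `oc patch mutatingwebhookconfiguration <name> --type=json -p '<patch>'`, where `<name>` is the webhook configuration on the cluster under test.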

Actual results:

KMP becomes unstable and is unable to process requests while the CA bundle is unstable


Expected results:

Ideally, the customer/admin should be able to configure the certificate rotation policy based on their downtime tolerance, needs, and availability requirements. Also, the downtime while the system is in an unstable state should be reduced, or some rescheduling policy should be applied.
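
Note: one possible shape for such a policy is sketched below. It only illustrates the request and is not the API that was eventually shipped. All field names are hypothetical. The idea is that the old and new CA stay trusted together for an overlap window, and that the window should be at least as long as the time a new bundle needs to reach every consumer.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class RotationPolicy:
    # All field names are hypothetical; they do not mirror an existing API.
    ca_duration_s: int    # lifetime of a CA certificate
    ca_overlap_s: int     # how long the old and new CA are both trusted
    propagation_s: int    # assumed worst-case time for a new bundle to propagate


def policy_problems(policy: RotationPolicy) -> List[str]:
    """Return human-readable problems; an empty list means the policy looks sane."""
    problems = []
    if policy.ca_overlap_s >= policy.ca_duration_s:
        problems.append("overlap must be shorter than the CA lifetime")
    if policy.ca_overlap_s < policy.propagation_s:
        problems.append("overlap is shorter than the propagation delay; a trust gap is possible")
    return problems


# Illustrative numbers: one-hour CA, ten-minute overlap, three-minute propagation.
print(policy_problems(RotationPolicy(ca_duration_s=3600, ca_overlap_s=600, propagation_s=180)))
```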

Additional info:

Comment 1 Petr Horáček 2020-06-19 11:36:24 UTC
Thanks for opening this.

Since the rotation interval is quite long and the downtime happens only in opted-in namespaces, I suggest we handle this in 2.5 (and not as a 2.4 blocker).

Comment 3 Petr Horáček 2020-09-03 12:07:19 UTC
We need HCO to expose rotation parameters on its API. That will happen only in 2.6.

Comment 4 Ofir Nash 2020-12-27 10:04:36 UTC
Created attachment 1742246 [details]
vm-fedora

VM Fedora with namespace: kmp-ns-bug

Comment 5 Ofir Nash 2020-12-27 10:06:49 UTC
Created attachment 1742247 [details]
kmp-namespace

Comment 6 Ofir Nash 2020-12-27 10:07:52 UTC
Verified.

Steps verified:
1. Create a certificate with the given scripts.
2. Create namespace with label: "mutatevirtualmachines.kubemacpool.io: allocate" and apply (oc apply -f namespace.yaml) - Attached namespace.yaml
3. Create VM under the namespace created - Attached vm-fedora.yaml
4. Check that VM is created successfully and running, KMP pods are running.
5. Deleting the VM works successfully, without latency/downtime.
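
Note: the attached namespace.yaml is not shown inline. A sketch of an equivalent manifest for step 2 follows, generated as JSON so it can be piped to `oc apply -f -`. The namespace name kmp-ns-bug and the label come from this report; the helper name is made up, and the attachment itself may contain additional fields.

```python
import json


def kmp_namespace_manifest(name: str) -> dict:
    """Build a Namespace manifest that opts in to KMP MAC allocation."""
    return {
        "apiVersion": "v1",
        "kind": "Namespace",
        "metadata": {
            "name": name,
            "labels": {"mutatevirtualmachines.kubemacpool.io": "allocate"},
        },
    }


# Emit the manifest for the namespace used during verification.
print(json.dumps(kmp_namespace_manifest("kmp-ns-bug"), indent=2))
```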

Comment 7 Ofir Nash 2020-12-27 10:09:27 UTC
Comment on attachment 1742247 [details]
kmp-namespace

KMP Namespace example - kmp-ns-bug.
Has label: "mutatevirtualmachines.kubemacpool.io: allocate"

Comment 10 errata-xmlrpc 2021-03-10 11:16:12 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Moderate: OpenShift Virtualization 2.6.0 security and bug fix update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2021:0799