Description of problem:
OCP upgrade from 4.6.26 to 4.7.8 stuck while trying to evict PODs from nodes to be drained. kube-apiserver logs showed `configuration:virt-api-validator,webhook:virt-launcher-eviction-interceptor.kubevirt.io` failing with `context canceled`.

Version-Release number of selected component (if applicable):
- OCP version 4.6.26
- OCP Virtualization 2.5.5
- Fresh cluster with no customer workloads, no OCP-Virt VMs

How reproducible:
Currently being seen in customer environment.

Steps to Reproduce:
1. Have an OCP 4.6.x cluster with CNV 2.5.x installed with no VMs at that point
2. Trigger the upgrade to 4.7.x

Actual results:
Cluster upgrade is not progressing since nodes cannot be drained.

Expected results:
Since webhook:virt-launcher-eviction-interceptor has a failurePolicy set to Ignore, drain should not fail.

Additional info:
must-gather couldn't be collected due to apiserver unresponsiveness.
Assigned to the only kubevirt component I see in the list. Please forward to the right one.
Hi, I am also not sure which component is correct. Moving to Virtualization; if that is the wrong component, please forward to the right one.
"context canceled" means that the webhook does not answer within 30s. Then it is too late to continue. The apiserver cannot do more than doing the request. If you think your webhook should answer quicker, set the TimeoutSeconds value in the webhook configuration, leaving enough time to ignore it.
(In reply to Stefan Schimanski from comment #9)
> "context canceled" means that the webhook does not answer within 30s. Then
> it is too late to continue. The apiserver cannot do more than doing the
> request. If you think your webhook should answer quicker, set the
> TimeoutSeconds value in the webhook configuration, leaving enough time to
> ignore it.

I see. So, our validation webhook does not set a timeout. The v1 version of the webhook registration API documents a default timeout of 10 seconds, whereas the v1beta1 API documents a default of 30 seconds. I gather the 30-second default is what we're hitting here. Explicitly setting the timeout on our webhooks to 10 seconds seems like a reasonable change from our end.
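To make the timing argument concrete, here is a minimal Python sketch of the defaults discussed above. It is a simplified model and not apiserver code: the function names are hypothetical, and the 30-second request budget is an assumption taken from this thread.

```python
# Simplified model of the timing discussed in this thread; not apiserver code.

# Default webhook timeouts as documented for each registration API version.
DEFAULT_TIMEOUT_SECONDS = {
    "admissionregistration.k8s.io/v1": 10,
    "admissionregistration.k8s.io/v1beta1": 30,
}

# Assumption from this thread: the request context is canceled after 30s.
REQUEST_BUDGET_SECONDS = 30


def effective_timeout(api_version, timeout_seconds=None):
    """Return the explicit timeout if one is set, else the documented default."""
    if timeout_seconds is not None:
        return timeout_seconds
    return DEFAULT_TIMEOUT_SECONDS[api_version]


def ignore_can_take_effect(api_version, timeout_seconds=None):
    """True if the webhook call gives up before the request budget is spent,
    leaving time for failurePolicy: Ignore to let the request continue."""
    return effective_timeout(api_version, timeout_seconds) < REQUEST_BUDGET_SECONDS
```

Under this model, a v1beta1 registration without an explicit timeout uses up the whole request budget, while the same registration with an explicit 10-second timeout leaves room for the failure to be ignored.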
I've posted a PR related to this issue: https://github.com/kubevirt/kubevirt/pull/5661

In this PR, our webhooks now have a 10-second timeout explicitly defined, which should keep our component from hitting the API server's 30-second request context timeout.
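For anyone wanting to check their own webhook registrations for the same gap, a rough Python sketch follows. It is not the code from the PR. It assumes the configuration has already been loaded into a dict (for example from JSON output), and the function name and the second webhook entry are hypothetical; `webhooks`, `name`, `failurePolicy` and `timeoutSeconds` are the field names used by the webhook configuration API.

```python
def webhooks_missing_timeout(config, max_seconds=10):
    """Return names of webhooks with no explicit timeoutSeconds,
    or with a timeout larger than max_seconds."""
    flagged = []
    for hook in config.get("webhooks", []):
        timeout = hook.get("timeoutSeconds")
        if timeout is None or timeout > max_seconds:
            flagged.append(hook["name"])
    return flagged


# Illustrative data only: the first entry mirrors the webhook named in this
# bug (no explicit timeout); the second entry is a made-up example.
example = {
    "kind": "ValidatingWebhookConfiguration",
    "metadata": {"name": "virt-api-validator"},
    "webhooks": [
        {
            "name": "virt-launcher-eviction-interceptor.kubevirt.io",
            "failurePolicy": "Ignore",
        },
        {
            "name": "with-timeout.example.com",
            "failurePolicy": "Ignore",
            "timeoutSeconds": 10,
        },
    ],
}

print(webhooks_missing_timeout(example))
# prints ['virt-launcher-eviction-interceptor.kubevirt.io']
```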
@sttts I just want to confirm that a 10-second webhook timeout (with failurePolicy: Ignore) is expected to avoid a total request timeout on the API server. We don't want evictions to fail if our component is unreachable. Is there anything else we should be aware of here, or will explicitly defining a 10-second timeout give us the behavior we're looking for, where evictions are intercepted on a best-effort basis?
Verified with build: hco:v2.5.7-54 virt-operator-container-v2.5.6-8

Steps:
1. Deploy OCP 4.6.36 with CNV 2.5.7.
2. Check that there are no VMs on the cluster.
3. Upgrade OCP to 4.7.18.

Checked the kube-apiserver log: no error messages like "virt-launcher-eviction-interceptor.kubevirt.io". The upgrade succeeded without error.

# oc get co
NAME             VERSION   AVAILABLE   PROGRESSING   DEGRADED   SINCE
kube-apiserver   4.7.18    True        False         False      40h

# oc describe clusterversion
History:
  Completion Time:  2021-07-09T02:42:45Z
  Image:            quay.io/openshift-release-dev/ocp-release@sha256:afcb309425d45a240de2df8e376f9632e6144052177fd62a0347934657b3573f
  Started Time:     2021-07-09T01:36:39Z
  State:            Completed
  Verified:         true
  Version:          4.7.18
  Completion Time:  2021-07-07T11:05:39Z
  Image:            quay.io/openshift-release-dev/ocp-release@sha256:4205c6709ec4b8523eb18144f7c5bed17a32ba71348fd4c2b6ab43a636cf028e
  Started Time:     2021-07-07T10:13:02Z
  State:            Completed
  Verified:         false
  Version:          4.6.36

Checked the CNV pods in openshift-cnv: all are in Running status. Moving this to verified.
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (OpenShift Virtualization 2.5.7 Images), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHEA-2021:2934
The needinfo request[s] on this closed bug have been removed as they have been unresolved for 500 days