Bug 1957477 - Upgrade hung due to a failure to drain with several pods unable to be evicted
Summary: Upgrade hung due to a failure to drain with several pods unable to be evicted
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Container Native Virtualization (CNV)
Classification: Red Hat
Component: Virtualization
Version: 2.5.5
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: high
Target Milestone: ---
Target Release: 2.5.7
Assignee: sgott
QA Contact: Israel Pinto
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2021-05-05 20:25 UTC by Javier Coscia
Modified: 2024-06-14 01:31 UTC
CC: 7 users

Fixed In Version: virt-operator-container-v2.5.6-8 hco-bundle-registry-container-v2.5.7-27
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2021-07-28 11:29:38 UTC
Target Upstream Version:
Embargoed:


Attachments


Links
System                   ID               Private  Priority  Status  Summary  Last Updated
Red Hat Issue Tracker    CNV-11891        0        None      None    None     2024-06-14 01:31:29 UTC
Red Hat Product Errata   RHEA-2021:2934   0        None      None    None     2021-07-28 11:29:54 UTC

Description Javier Coscia 2021-05-05 20:25:06 UTC
Description of problem:
OCP upgrade from 4.6.26 to 4.7.8 is stuck while trying to evict pods from the nodes being drained.

The kube-apiserver logs showed `configuration:virt-api-validator,webhook:virt-launcher-eviction-interceptor.kubevirt.io` failing with `context canceled`.

Version-Release number of selected component (if applicable):
 - OCP version 4.6.26
 - OpenShift Virtualization (CNV) 2.5.5
 - Fresh cluster with no customer workloads and no OCP-Virt VMs

How reproducible:
Currently being seen in a customer environment.

Steps to Reproduce:
1. Have an OCP 4.6.x cluster with CNV 2.5.x installed with no VMs at that point
2. Trigger the upgrade to 4.7.x


Actual results:
Cluster upgrade is not progressing since nodes cannot be drained.

Expected results:
Since webhook:virt-launcher-eviction-interceptor has its failurePolicy set to Ignore, the drain should not fail.

Additional info:

must-gather couldn't be collected due to apiserver unresponsiveness

Comment 3 Stefan Schimanski 2021-05-06 09:03:43 UTC
Assigned to the only kubevirt component I see in the list. Please forward to the right one.

Comment 4 Yaacov Zamir 2021-05-06 12:50:41 UTC
Hi, I am also not sure which component is correct, so I am moving this to Virtualization; if it's the wrong component, please forward it to the right one.

Comment 9 Stefan Schimanski 2021-05-17 16:28:42 UTC
"context canceled" means that the webhook does not answer within 30s. Then it is too late to continue. The apiserver cannot do more than doing the request. If you think your webhook should answer quicker, set the TimeoutSeconds value in the webhook configuration, leaving enough time to ignore it.

Comment 10 David Vossel 2021-05-17 19:23:46 UTC
(In reply to Stefan Schimanski from comment #9)
> "context canceled" means that the webhook does not answer within 30s. Then
> it is too late to continue. The apiserver cannot do more than doing the
> request. If you think your webhook should answer quicker, set the
> TimeoutSeconds value in the webhook configuration, leaving enough time to
> ignore it.


i see. 

So, our validation webhook does not set a timeout. The v1 version of the webhook registration API documents a default timeout of 10 seconds, while the v1beta1 API documents a default of 30 seconds. I gather the 30-second default is what we're hitting here.

Explicitly setting this timeout on our webhooks to 10 seconds seems like a reasonable change from our end.
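For reference, a quick way to see which of these webhooks carry an explicit timeout is sketched below. The configuration name virt-api-validator comes from the apiserver log message quoted in the description; the exact jsonpath expression is only an illustration, not a documented procedure.

# Illustrative check: print name, timeoutSeconds, and failurePolicy for each webhook
# in the virt-api-validator configuration. An empty timeoutSeconds column means the
# field is unset, so the API-version default (30s for v1beta1) applies.
oc get validatingwebhookconfiguration virt-api-validator \
  -o jsonpath='{range .webhooks[*]}{.name}{"\t"}{.timeoutSeconds}{"\t"}{.failurePolicy}{"\n"}{end}'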

Comment 11 David Vossel 2021-05-17 19:56:59 UTC

I've posted a PR related to this issue. https://github.com/kubevirt/kubevirt/pull/5661

In this PR, our webhooks now have a 10-second timeout explicitly defined, which should avoid our component hitting the API server's 30-second request context timeout.
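For illustration only, the change amounts to setting timeoutSeconds explicitly on each webhook entry. A manual equivalent would look roughly like the patch below; the index 0 is an assumption, and in practice virt-operator owns this configuration and would reconcile away hand edits.

# Hypothetical sketch: explicitly set a 10-second timeout on the first webhook entry.
oc patch validatingwebhookconfiguration virt-api-validator --type=json \
  -p='[{"op": "add", "path": "/webhooks/0/timeoutSeconds", "value": 10}]'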

Comment 12 David Vossel 2021-05-17 20:10:19 UTC
@sttts 

I just want to confirm that a 10-second webhook timeout (with failurePolicy: Ignore) is expected to avoid a total request timeout on the api-server. We don't want evictions to fail if our component is unreachable.

Is there anything else we should be aware of here, or will explicitly defining a 10-second timeout give us the behavior we're looking for, where evictions are intercepted on a best-effort basis?

Comment 13 zhe peng 2021-07-09 03:12:53 UTC
Verified with build:
hco:v2.5.7-54
virt-operator-container-v2.5.6-8

Steps:
1. Deploy OCP 4.6.36 with CNV 2.5.7.
2. Check that there are no VMs on the cluster.
3. Upgrade OCP to 4.7.18.

Checked the kube-apiserver logs: there are no error messages mentioning "virt-launcher-eviction-interceptor.kubevirt.io", and the upgrade succeeded without error.
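One way to perform that log check is sketched below (not necessarily the exact commands used; the pod-name filter assumes the usual static-pod naming in the openshift-kube-apiserver namespace):

# Scan each kube-apiserver pod's logs for the webhook name; no matches is the expected result.
for pod in $(oc -n openshift-kube-apiserver get pods -o name | grep kube-apiserver); do
  echo "== $pod"
  oc -n openshift-kube-apiserver logs "$pod" --all-containers \
    | grep 'virt-launcher-eviction-interceptor.kubevirt.io' || echo "no matches"
done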

#oc get co
NAME                                       VERSION   AVAILABLE   PROGRESSING   DEGRADED   SINCE
kube-apiserver                             4.7.18    True        False         False      40h

#oc describe clusterversion
History:
    Completion Time:    2021-07-09T02:42:45Z
    Image:              quay.io/openshift-release-dev/ocp-release@sha256:afcb309425d45a240de2df8e376f9632e6144052177fd62a0347934657b3573f
    Started Time:       2021-07-09T01:36:39Z
    State:              Completed
    Verified:           true
    Version:            4.7.18
    Completion Time:    2021-07-07T11:05:39Z
    Image:              quay.io/openshift-release-dev/ocp-release@sha256:4205c6709ec4b8523eb18144f7c5bed17a32ba71348fd4c2b6ab43a636cf028e
    Started Time:       2021-07-07T10:13:02Z
    State:              Completed
    Verified:           false
    Version:            4.6.36

Checked the CNV pods in the openshift-cnv namespace; all are in Running status.

Moving this to VERIFIED.

Comment 19 errata-xmlrpc 2021-07-28 11:29:38 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (OpenShift Virtualization 2.5.7 Images), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHEA-2021:2934

Comment 20 Red Hat Bugzilla 2023-09-15 01:06:07 UTC
The needinfo request[s] on this closed bug have been removed as they have been unresolved for 500 days

