Bug 1909983 - Cluster upgrade fails because of vpa webhook
Summary: Cluster upgrade fails because of vpa webhook
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Node
Version: 4.6
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: medium
Target Milestone: ---
Target Release: 4.7.0
Assignee: Joel Smith
QA Contact: Weinan Liu
URL:
Whiteboard:
Duplicates: 1909982
Depends On:
Blocks: 1927854
Reported: 2020-12-22 08:47 UTC by Apoorva Jagtap
Modified: 2021-03-22 18:41 UTC (History)

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2021-02-11 16:51:10 UTC
Target Upstream Version:




Links
System                             ID       Private  Priority  Status  Summary  Last Updated
Red Hat Knowledge Base (Solution)  5374231  0        None      None    None     2021-01-01 16:16:09 UTC

Internal Links: 1925260

Description Apoorva Jagtap 2020-12-22 08:47:54 UTC
Description of problem:

While the upgrade is in progress, pods in the `openshift-vertical-pod-autoscaler` namespace fail to be created, with the error `no endpoints available for service "vpa-webhook"`. The vpa-admission-plugin-default pod, which backs that service, fails to be created with the same error, so the situation is a deadlock: the webhook cannot be served until the pod exists, and the pod cannot be created until the webhook can be served.


Steps to Reproduce:
1. Install VPA operator.
2. Start an upgrade from OCP v4.6.x to v4.6.y

Actual results:

During the upgrade, the pod listing for the namespace showed no vpa-admission-plugin-default pod:
~~~
$ oc get pod -n openshift-vertical-pod-autoscaler 
NAME                                                    READY   STATUS    RESTARTS   AGE
pod/vertical-pod-autoscaler-operator-6c64cd877b-46rmd   1/1     Running   0          14h
pod/vpa-recommender-default-649f9f4479-jd4jx            1/1     Running   0          13h
pod/vpa-updater-default-59bf95f4db-bvwld                1/1     Running   0          13h
~~~ 
- The endpoints of the vpa-webhook service are the IPs of the vpa-admission-plugin-default pods. No such pod is running, so the service has no endpoints and calls to it fail with 'no endpoints available for service "vpa-webhook"'.

[*] Description of the ReplicaSet for the vpa-admission-plugin-default pod:
~~~
$ oc describe replicaset.apps/vpa-admission-plugin-default-7d4c654465
Name:           vpa-admission-plugin-default-7d4c654465
Namespace:      openshift-vertical-pod-autoscaler
Selector:       app=vpa-admission-controller,pod-template-hash=7d4c654465,vertical-pod-autoscaler=default
...
Controlled By:  Deployment/vpa-admission-plugin-default
Replicas:       0 current / 1 desired
Pods Status:    0 Running / 0 Waiting / 0 Succeeded / 0 Failed
Events:
  Type     Reason            Age                  From                   Message
  ----     ------            ----                 ----                   -------
  Normal   SuccessfulCreate  145m                 replicaset-controller  Created pod: vpa-admission-plugin-default-7d4c654465-nq9ht
  Warning  FailedCreate      105s (x21 over 24m)  replicaset-controller  Error creating: Internal error occurred: failed calling webhook "vpa.k8s.io": Post "https://vpa-webhook.openshift-vertical-pod-autoscaler.svc:443/?timeout=10s": no endpoints available for service "vpa-webhook"
~~~
- The ReplicaSet cannot create its replica because the pod creation request is sent to the vpa-webhook service, which has no endpoints.


Expected results:

The pods should be created without these errors and the upgrade should complete.

[ Workaround ] Deleting the vpa-webhook-config MutatingWebhookConfiguration works around the issue:
      $ oc delete mutatingwebhookconfigurations vpa-webhook-config
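
The circular dependency and the effect of the workaround can be illustrated with a small toy model. This is only a sketch, not cluster code: the `ToyCluster` class, its methods, and the `WebhookUnavailable` exception are hypothetical names made up for illustration, and the model assumes that the webhook intercepts every pod creation in the namespace and rejects the request when it cannot be reached, which is what the FailedCreate event above suggests.

```python
"""Toy model of the admission deadlock described in this bug (not real cluster code)."""


class WebhookUnavailable(Exception):
    """Raised when the webhook is registered but has no backing endpoints."""


class ToyCluster:
    def __init__(self):
        self.pods = set()               # names of running pods
        self.webhook_registered = True  # stands in for vpa-webhook-config

    def webhook_endpoints(self):
        # The vpa-webhook service is backed only by admission plugin pods.
        return {p for p in self.pods if p.startswith("vpa-admission-plugin")}

    def create_pod(self, name):
        # Assumption: every pod creation is sent to the webhook, and the
        # request is rejected if the webhook cannot be reached.
        if self.webhook_registered and not self.webhook_endpoints():
            raise WebhookUnavailable('no endpoints available for service "vpa-webhook"')
        self.pods.add(name)


cluster = ToyCluster()

# During the upgrade the admission plugin pod is gone, so even its own
# replacement is rejected: the webhook needs the pod, the pod needs the webhook.
try:
    cluster.create_pod("vpa-admission-plugin-default")
    deadlocked = False
except WebhookUnavailable:
    deadlocked = True

# Workaround from this report: remove the webhook registration, then retry.
cluster.webhook_registered = False
cluster.create_pod("vpa-admission-plugin-default")
recovered = "vpa-admission-plugin-default" in cluster.pods
```

In this model the first creation attempt is rejected and the retry succeeds once the registration is removed, which mirrors what the `oc delete` command above does on the cluster. The model says nothing about whether or when the webhook configuration is recreated afterwards.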

Comment 1 Neelesh Agrawal 2021-01-04 19:35:05 UTC
*** Bug 1909982 has been marked as a duplicate of this bug. ***

Comment 4 Joel Smith 2021-02-11 16:45:39 UTC
This should be working in 4.7 already. We have a 4.6 PR open to fix it which we're working to merge:

https://github.com/openshift/kubernetes-autoscaler/pull/186

