Bug 1909983

Summary: Cluster upgrade fails because of vpa webhook
Product: OpenShift Container Platform
Reporter: Apoorva Jagtap <apjagtap>
Component: Node
Assignee: Joel Smith <joelsmith>
Node sub component: Autoscaler (HPA, VPA)
QA Contact: Weinan Liu <weinliu>
Status: CLOSED ERRATA
Docs Contact:
Severity: medium
Priority: unspecified
CC: aos-bugs, joelsmith, nagrawal, tsweeney
Version: 4.6
Target Milestone: ---
Target Release: 4.7.0
Hardware: Unspecified
OS: Unspecified
Whiteboard:
Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of:
Environment:
Last Closed: 2021-02-11 16:51:10 UTC
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Embargoed:
Bug Depends On:
Bug Blocks: 1927854

Description Apoorva Jagtap 2020-12-22 08:47:54 UTC
Description of problem:

While the upgrade is in progress, pods in the `openshift-vertical-pod-autoscaler` namespace fail to be created, with the error `no endpoints available for service "vpa-webhook"`. The vpa-admission-plugin-default pod, which backs that service, fails to be created with the same error, resulting in a deadlock.


Steps to Reproduce:
1. Install VPA operator.
2. Start an upgrade from OCP v4.6.x to v4.6.y

Actual results:

Observed the following messages while upgrading:
~~~
$ oc get pod -n openshift-vertical-pod-autoscaler 
NAME                                                    READY   STATUS    RESTARTS   AGE
pod/vertical-pod-autoscaler-operator-6c64cd877b-46rmd   1/1     Running   0          14h
pod/vpa-recommender-default-649f9f4479-jd4jx            1/1     Running   0          13h
pod/vpa-updater-default-59bf95f4db-bvwld                1/1     Running   0          13h
~~~ 
- The endpoints of the vpa-webhook svc are the IPs of the vpa-admission-plugin-default pods. Because no such pod is running (it is missing from the listing above), calls to the webhook fail with 'no endpoints available for service "vpa-webhook"'.

[*] The description of replicaset for vpa-admission-plugin-default pod:
~~~
$ oc describe replicaset.apps/vpa-admission-plugin-default-7d4c654465
Name:           vpa-admission-plugin-default-7d4c654465
Namespace:      openshift-vertical-pod-autoscaler
Selector:       app=vpa-admission-controller,pod-template-hash=7d4c654465,vertical-pod-autoscaler=default
...
Controlled By:  Deployment/vpa-admission-plugin-default
Replicas:       0 current / 1 desired
Pods Status:    0 Running / 0 Waiting / 0 Succeeded / 0 Failed
Events:
  Type     Reason            Age                  From                   Message
  ----     ------            ----                 ----                   -------
  Normal   SuccessfulCreate  145m                 replicaset-controller  Created pod: vpa-admission-plugin-default-7d4c654465-nq9ht
  Warning  FailedCreate      105s (x21 over 24m)  replicaset-controller  Error creating: Internal error occurred: failed calling webhook "vpa.k8s.io": Post "https://vpa-webhook.openshift-vertical-pod-autoscaler.svc:443/?timeout=10s": no endpoints available for service "vpa-webhook"
~~~
- The replicaset cannot create its replica because the vpa-webhook svc has no endpoints, and the svc has no endpoints because that replica does not exist.
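
The circular dependency described above can be checked from the command line. The following is a minimal sketch, not taken from the original report: the function names are hypothetical, and it assumes the object names shown in the output above (`vpa-webhook`, `vpa-webhook-config`). A `failurePolicy` of `Fail` would be consistent with pod creation being rejected when the webhook is unreachable.

```shell
# Namespace and object names are taken from the output in this report.
NS=openshift-vertical-pod-autoscaler

# Hypothetical helper: print the failurePolicy of each webhook in vpa-webhook-config.
# "Fail" means a pod create request is rejected when the webhook cannot be called.
webhook_failure_policy() {
  oc get mutatingwebhookconfigurations vpa-webhook-config \
    -o jsonpath='{.webhooks[*].failurePolicy}'
}

# Hypothetical helper: print the pod IPs currently backing the vpa-webhook
# service (prints nothing when the service has no endpoints).
webhook_endpoint_ips() {
  oc get endpoints vpa-webhook -n "$NS" \
    -o jsonpath='{.subsets[*].addresses[*].ip}'
}
```

Calling `webhook_failure_policy` and then `webhook_endpoint_ips` on an affected cluster would be expected to show a failing policy together with an empty endpoint list.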


Expected results:

The pods should be created without these errors, and the upgrade should complete.

[ Workaround ] Deleting the mutatingwebhookconfiguration works around the issue:
      $ oc delete mutatingwebhookconfigurations vpa-webhook-config
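
A more cautious variant of the workaround might delete the webhook configuration only when the service really has no endpoints. This is a sketch under that assumption, with hypothetical function names; it was not part of the original report.

```shell
# Namespace and object names are taken from the output in this report.
NS=openshift-vertical-pod-autoscaler

# Hypothetical helper: succeeds when the vpa-webhook service has at least one endpoint IP.
webhook_has_endpoints() {
  ips=$(oc get endpoints vpa-webhook -n "$NS" \
        -o jsonpath='{.subsets[*].addresses[*].ip}' 2>/dev/null)
  [ -n "$ips" ]
}

# Hypothetical helper: apply the workaround only when the webhook is unreachable.
remove_webhook_if_stuck() {
  if webhook_has_endpoints; then
    echo "vpa-webhook has endpoints; leaving vpa-webhook-config in place"
  else
    oc delete mutatingwebhookconfigurations vpa-webhook-config
  fi
}
```

Running `remove_webhook_if_stuck` leaves a healthy cluster untouched and performs the same deletion as the command above when the service has no endpoints.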

Comment 1 Neelesh Agrawal 2021-01-04 19:35:05 UTC
*** Bug 1909982 has been marked as a duplicate of this bug. ***

Comment 4 Joel Smith 2021-02-11 16:45:39 UTC
This should be working in 4.7 already. We have a 4.6 PR open to fix it which we're working to merge:

https://github.com/openshift/kubernetes-autoscaler/pull/186