Bug 2008713

Summary: VPA webhook timeout prevents all pods from starting
Product: OpenShift Container Platform Reporter: Joel Smith <joelsmith>
Component: Node Assignee: Joel Smith <joelsmith>
Node sub component: Autoscaler (HPA, VPA) QA Contact: Weinan Liu <weinliu>
Status: CLOSED ERRATA Docs Contact:
Severity: medium    
Priority: medium CC: aos-bugs
Version: 4.7   
Target Milestone: ---   
Target Release: 4.9.z   
Hardware: Unspecified   
OS: Unspecified   
Whiteboard:
Fixed In Version: Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of: 2008712 Environment:
Last Closed: 2022-05-03 07:35:33 UTC Type: ---
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: --- RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Embargoed:
Bug Depends On: 2008712    
Bug Blocks: 2083758    

Description Joel Smith 2021-09-29 00:36:24 UTC
+++ This bug was initially created as a clone of Bug #2008712 +++

Description of problem:
When VPA is installed, if its webhook service becomes slow or unreachable, the API server should ignore the failure and instead continue with pod creation.

The VPA webhook's timeout is so long (30 seconds) that, when the webhook service is in a bad state, the entire pod creation request fails before the webhook timeout expires, so the API server never gets the chance to ignore the failure and proceed.

How reproducible:
100%

Steps to Reproduce:
1. Install VPA
2. oc get deployment -n openshift-vertical-pod-autoscaler vpa-admission-plugin-default -o jsonpath='{.spec.template.spec.containers[0].args}' | jq
3. oc get mutatingwebhookconfiguration vpa-webhook-config -o jsonpath='{.webhooks[0].timeoutSeconds}{"\n"}'

Actual results:
[
  "--logtostderr",
  "--v=1",
  "--tls-cert-file=/data/tls-certs/tls.crt",
  "--tls-private-key=/data/tls-certs/tls.key",
  "--client-ca-file=/data/tls-ca-certs/service-ca.crt"
]
30

Expected results:
[
  "--logtostderr",
  "--v=1",
  "--tls-cert-file=/data/tls-certs/tls.crt",
  "--tls-private-key=/data/tls-certs/tls.key",
  "--client-ca-file=/data/tls-ca-certs/service-ca.crt",
  "--webhook-timeout-seconds=10"
]
10
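
The check in step 3 could be wrapped in a small pass/fail helper, for example in a regression script. This is only a sketch: the function name check_webhook_timeout and its default limit of 10 seconds (taken from the expected results above) are assumptions for illustration, not part of the operator or of the fix.

```shell
# Hypothetical helper (not part of VPA or the operator): succeeds only when
# the given webhook timeout is a whole number at or below the limit.
# $1 = observed timeoutSeconds, $2 = allowed maximum (assumed default: 10).
check_webhook_timeout() {
  timeout="$1"
  limit="${2:-10}"

  # Reject empty or non-numeric input, e.g. when the webhook config is missing.
  case "$timeout" in
    ''|*[!0-9]*)
      echo "ERROR: timeoutSeconds is not a number: '$timeout'" >&2
      return 2
      ;;
  esac

  if [ "$timeout" -le "$limit" ]; then
    echo "OK: timeoutSeconds=$timeout (limit $limit)"
  else
    echo "FAIL: timeoutSeconds=$timeout exceeds limit $limit"
    return 1
  fi
}

# Against a live cluster, feed it the value from step 3:
# check_webhook_timeout "$(oc get mutatingwebhookconfiguration vpa-webhook-config \
#   -o jsonpath='{.webhooks[0].timeoutSeconds}')"
```

With the values in this report, the helper would fail for the actual result (30) and pass for the expected result (10).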

Comment 4 Weinan Liu 2022-04-29 08:45:47 UTC
$ oc get deployment -n openshift-vertical-pod-autoscaler vpa-admission-plugin-default -o jsonpath='{.spec.template.spec.containers[0].args}' | jq
[
  "--logtostderr",
  "--v=1",
  "--tls-cert-file=/data/tls-certs/tls.crt",
  "--tls-private-key=/data/tls-certs/tls.key",
  "--client-ca-file=/data/tls-ca-certs/service-ca.crt",
  "--webhook-timeout-seconds=10"
]
$ oc get mutatingwebhookconfiguration vpa-webhook-config -o jsonpath='{.webhooks[0].timeoutSeconds}{"\n"}'
10

ose-vertical-pod-autoscaler-operator-metadata-container-v4.9.0.202204220627

Comment 6 errata-xmlrpc 2022-05-03 07:35:33 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (OpenShift Container Platform 4.9.31 bug fix update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2022:1605

Comment 7 Red Hat Bugzilla 2023-09-15 01:36:16 UTC
The needinfo request[s] on this closed bug have been removed as they have been unresolved for 365 days