Bug 1836017 - vSphere UPI: Both Internal and External load balancers for kube-apiserver should use /readyz
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Installer
Version: 4.5
Hardware: Unspecified
OS: Unspecified
Target Milestone: ---
Target Release: 4.7.0
Assignee: aos-install
QA Contact: jima
Duplicates: 1870183
Depends On:
Reported: 2020-05-14 23:44 UTC by W. Trevor King
Modified: 2024-03-25 15:56 UTC (History)

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Cause: The reference vSphere UPI load balancer was configured with a simple TCP health check. Consequence: The health checks did not consider the health of the API server, which could lead to failed API requests whenever the API server restarted. Fix: The health checks now verify API server health against the /readyz endpoint. Result: The reference API load balancer now handles requests gracefully during API server restarts.
Clone Of: 1828382
Last Closed: 2021-02-24 15:12:13 UTC
Target Upstream Version:
abodhe: needinfo+


System ID Private Priority Status Summary Last Updated
Github openshift installer pull 4012 0 None closed Bug 1836017: Configure haproxy to check /readyz 2021-02-19 14:13:48 UTC
Red Hat Product Errata RHSA-2020:5633 0 None None None 2021-02-24 15:13:33 UTC

Description W. Trevor King 2020-05-14 23:44:49 UTC
+++ This bug was initially created as a clone of Bug #1828382 +++

Description of problem:
Load balancers (both internal and external) for kube-apiserver should use /readyz for back-end health checks.

After scanning the GitHub repo, we found that the following platforms do not use /readyz for back-end health checks.



Please investigate the following; I am not sure whether it needs to be addressed.

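The above health checks use the default TCP rule, which breaks graceful termination. When the apiserver receives a termination signal (SIGTERM), /readyz starts reporting failure while the instance is still accepting connections. An HTTPS check against /readyz therefore gives the load balancer a chance to detect an instance that is rolling out and take appropriate action, whereas a plain TCP check keeps passing until the port closes.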
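To illustrate the difference, here is a minimal toy model, not the actual kube-apiserver or load balancer code. The `Backend` class and the two check functions are hypothetical names, and the model assumes the listener stays open while readiness has already failed during graceful termination.

```python
from dataclasses import dataclass


@dataclass
class Backend:
    """Toy model of one kube-apiserver instance (hypothetical, for illustration)."""

    listening: bool = True  # the TCP port accepts connections
    ready: bool = True      # /readyz would return 200

    def begin_graceful_shutdown(self) -> None:
        # Assumption: readiness fails first; the listener stays open to drain traffic.
        self.ready = False

    def stop(self) -> None:
        # The process has exited: nothing is listening any more.
        self.listening = False
        self.ready = False


def tcp_check(backend: Backend) -> bool:
    """A plain TCP check only learns whether the port accepts connections."""
    return backend.listening


def readyz_check(backend: Backend) -> bool:
    """An HTTP(S) check on /readyz also learns whether the server reports ready."""
    return backend.listening and backend.ready


backend = Backend()
backend.begin_graceful_shutdown()
# During the shutdown window the two checks disagree.
print("tcp check passes:", tcp_check(backend))
print("/readyz check passes:", readyz_check(backend))
```

In this model the TCP check only fails once `stop()` has been called, which is when clients would already be seeing "connection refused".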

We should use the AWS LB rules as a reference for consistency: https://github.com/openshift/installer/blob/master/data/data/aws/vpc/master-elb.tf#L87

health_check {
    healthy_threshold   = 2
    unhealthy_threshold = 2
    interval            = 10
    port                = 6443
    protocol            = "HTTPS"
    path                = "/readyz"
}
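As a rough sketch of what these thresholds imply, the snippet below estimates how long a failing backend can keep receiving traffic. It assumes a backend is marked unhealthy after `unhealthy_threshold` consecutive failed probes spaced `interval` seconds apart and ignores probe timeouts and propagation delay; the function name is hypothetical.

```python
def detection_window_seconds(interval: int, unhealthy_threshold: int) -> int:
    """Approximate upper bound on the time needed to mark a backend unhealthy.

    Simplification: each probe fails instantly, so the bound is just the
    number of required consecutive failures times the probe interval.
    """
    return interval * unhealthy_threshold


# Values from the AWS reference health_check above.
print(detection_window_seconds(interval=10, unhealthy_threshold=2))  # 20
```

The window has to fit inside the period during which the apiserver keeps serving after /readyz starts failing; otherwise some requests will still reach an instance that has already stopped.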

Version-Release number of the following components:
OpenShift 4.5. 

How reproducible:

Steps to Reproduce:
1. Run an upgrade job on the specified infrastructure. 

You will see that the load balancer sends requests to an apiserver instance while it is down.

Actual results:
An example: https://prow.svc.ci.openshift.org/view/gcs/origin-ci-test/logs/release-openshift-ocp-installer-e2e-azure-serial-4.4/1542

Apr 26 21:45:14.063: INFO: Unexpected error listing nodes: Get https://api.ci-op-hvqc8pgw-0ba00.ci.azure.devcluster.openshift.com:6443/api/v1/nodes?fieldSelector=spec.unschedulable%3Dfalse&resourceVersion=0: dial tcp connect: connection refused

Expected results:
The load balancer should identify the kube-apiserver instance that is being rolled out in time and forward requests to another instance that is serving. Ideally, we should not see any "dial tcp connect: connection refused" errors while kube-apiserver is being rolled out.

Bug 1828382 is about installer-provisioned Azure.  This clone is about user-provisioned vSphere.
Description W. Trevor King 2020-05-14 23:42:02 UTC

Bug 1828382 handled installer-provisioned Azure.  This clone is for the user-provisioned Azure recommendations.  Dropping severity down to medium, because user-provisioned recommendations are suggesting approaches, not telling the user what they should be doing without thinking.

Comment 1 Abhinav Dahiya 2020-05-18 17:25:11 UTC
Won't be able to get to it this sprint.

Comment 11 Abhinav Dahiya 2020-09-10 22:14:21 UTC
*** Bug 1870183 has been marked as a duplicate of this bug. ***

Comment 12 Stefan Schimanski 2020-09-11 14:54:28 UTC
*** Bug 1873816 has been marked as a duplicate of this bug. ***

Comment 28 jima 2020-10-27 06:33:33 UTC
Used the updated transformer scripts, which include this fix, to install UPI on vSphere with 4.7.0-0.nightly-2020-10-26-152308. The installation succeeded, and rebooting a master node also worked well. Moving the bug to VERIFIED.

On the LB server, the api-server backend in the haproxy configuration now looks like this:
backend api-server
    option  httpchk GET /readyz HTTP/1.0
    option  log-health-checks
    balance roundrobin
        server xxx.xx.248.138 xxx.xx.248.138:6443 weight 1 verify none check check-ssl inter 1s fall 2 rise 3
        server xxx.xx.248.139 xxx.xx.248.139:6443 weight 1 verify none check check-ssl inter 1s fall 2 rise 3
        server xxx.xx.248.137 xxx.xx.248.137:6443 weight 1 verify none check check-ssl inter 1s fall 2 rise 3
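For readers unfamiliar with the `fall 2 rise 3` settings above, the following sketch models how consecutive check results flip a server between up and down. It is a simplified illustration, not haproxy's implementation; `HealthTracker` is a hypothetical helper, and timing (`inter 1s`) is left out.

```python
class HealthTracker:
    """Tracks up/down state from consecutive check results (hypothetical helper)."""

    def __init__(self, rise: int = 3, fall: int = 2, up: bool = True) -> None:
        self.rise = rise  # consecutive passes needed to mark a down server up
        self.fall = fall  # consecutive failures needed to mark an up server down
        self.up = up
        self._streak = 0  # consecutive results that contradict the current state

    def observe(self, check_passed: bool) -> bool:
        """Record one check result and return the resulting up/down state."""
        if check_passed == self.up:
            # The result agrees with the current state: reset the streak.
            self._streak = 0
        else:
            self._streak += 1
            needed = self.fall if self.up else self.rise
            if self._streak >= needed:
                self.up = not self.up
                self._streak = 0
        return self.up


tracker = HealthTracker(rise=3, fall=2)
# One pass, two failures (server goes down), then three passes (server comes back).
results = [True, False, False, True, True, True]
states = [tracker.observe(r) for r in results]
print(states)  # [True, True, False, False, False, True]
```

With one check per second, this model takes the server out of rotation about two seconds after /readyz starts failing and puts it back about three seconds after it recovers.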

Comment 38 errata-xmlrpc 2021-02-24 15:12:13 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Moderate: OpenShift Container Platform 4.7.0 security, bug fix, and enhancement update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

