Bug 1836017 - vSphere UPI: Both Internal and External load balancers for kube-apiserver should use /readyz
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Installer
Version: 4.5
Hardware: Unspecified
OS: Unspecified
Priority: high
Severity: high
Target Milestone: ---
Target Release: 4.7.0
Assignee: aos-install
QA Contact: jima
URL:
Whiteboard:
Duplicates: 1870183
Depends On:
Blocks:
Reported: 2020-05-14 23:44 UTC by W. Trevor King
Modified: 2021-02-24 15:13 UTC (History)

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Cause: The reference vSphere UPI load balancer was configured with a simple TCP health check.
Consequence: The health checks did not consider the health of the API server, which could lead to failed API requests whenever the API server restarted.
Fix: The health checks now verify API server health against the /readyz endpoint.
Result: The reference API load balancer now handles requests gracefully during API server restarts.
Clone Of: 1828382
Environment:
Last Closed: 2021-02-24 15:12:13 UTC
Target Upstream Version:
abodhe: needinfo+




Links
System | ID | Private | Priority | Status | Summary | Last Updated
GitHub | openshift installer pull 4012 | 0 | None | closed | Bug 1836017: Configure haproxy to check /readyz | 2021-02-19 14:13:48 UTC
Red Hat Product Errata | RHSA-2020:5633 | 0 | None | None | None | 2021-02-24 15:13:33 UTC

Description W. Trevor King 2020-05-14 23:44:49 UTC
+++ This bug was initially created as a clone of Bug #1828382 +++

Description of problem:
Load balancers (both internal and external) for kube-apiserver should use /readyz for back-end health check.

After scanning the GitHub repository, we found that the following platforms do not use /readyz for back-end health checks.

Azure: 
https://github.com/openshift/installer/blob/master/data/data/azure/vnet/internal-lb.tf#L138
https://github.com/openshift/installer/blob/master/data/data/azure/vnet/public-lb.tf#L164
https://github.com/openshift/installer/blob/master/upi/azure/03_infra.json#L101
https://github.com/openshift/installer/blob/master/upi/azure/03_infra.json#L184


vSphere:
https://github.com/openshift/installer/blob/master/upi/vsphere/lb/haproxy.tmpl#L33


Please also investigate the following; we are not sure whether it needs to be addressed:
https://github.com/openshift/installer/blob/master/upi/aws/cloudformation/02_cluster_infra.yaml#L179

The health checks above use the default TCP rule, which breaks graceful termination. When the apiserver receives a termination signal (SIGTERM), /readyz starts reporting failure. This gives the load balancer a chance to detect an instance that is rolling out and take appropriate action, whereas a TCP check keeps passing for as long as the port accepts connections.
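
A minimal sketch of the difference, not the logic of any real load balancer. It models a hypothetical backend that, during graceful termination, keeps its port open while /readyz reports failure. The `Backend` class and both check functions are illustrative assumptions, not part of the installer.

```python
from dataclasses import dataclass


@dataclass
class Backend:
    """Hypothetical model of one kube-apiserver instance (illustration only)."""
    listening: bool = True  # the TCP port accepts connections
    ready: bool = True      # /readyz would report success

    def begin_graceful_termination(self) -> None:
        # Assumption drawn from the report: /readyz starts failing first,
        # while the listener is still up.
        self.ready = False

    def stop(self) -> None:
        # The process has exited: nothing listens and nothing is ready.
        self.listening = False
        self.ready = False


def tcp_check(backend: Backend) -> bool:
    """A plain TCP check only learns whether the port accepts connections."""
    return backend.listening


def readyz_check(backend: Backend) -> bool:
    """An HTTP(S) check on /readyz needs an open port and a success response."""
    return backend.listening and backend.ready


backend = Backend()
backend.begin_graceful_termination()
print(tcp_check(backend), readyz_check(backend))  # True False
```

In this model the TCP check still reports the terminating instance as usable, while the /readyz check already excludes it.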


For consistency, we should use the AWS load balancer rules as a reference: https://github.com/openshift/installer/blob/master/data/data/aws/vpc/master-elb.tf#L87

health_check {
    healthy_threshold   = 2
    unhealthy_threshold = 2
    interval            = 10
    port                = 6443
    protocol            = "HTTPS"
    path                = "/readyz"
}
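
The thresholds quoted above can be read as a small state machine: a backend changes state only after enough consecutive probe results disagree with its current state. The following is a sketch under that assumption. `HealthTracker` is a hypothetical helper, not AWS or installer code, and real load balancers may differ in details such as timeouts.

```python
class HealthTracker:
    """Tracks one backend's state from a stream of probe results.

    Hypothetical helper; the default thresholds mirror the health_check
    values quoted above.
    """

    def __init__(self, healthy_threshold: int = 2, unhealthy_threshold: int = 2,
                 initially_healthy: bool = True) -> None:
        self.healthy_threshold = healthy_threshold
        self.unhealthy_threshold = unhealthy_threshold
        self.healthy = initially_healthy
        self._streak = 0  # consecutive results that disagree with the state

    def observe(self, probe_ok: bool) -> bool:
        """Record one probe result and return the (possibly updated) state."""
        if probe_ok == self.healthy:
            self._streak = 0  # result agrees with the current state
            return self.healthy
        self._streak += 1
        needed = self.unhealthy_threshold if self.healthy else self.healthy_threshold
        if self._streak >= needed:
            self.healthy = probe_ok
            self._streak = 0
        return self.healthy


tracker = HealthTracker()
results = [True, False, False, True, True]
states = [tracker.observe(r) for r in results]
print(states)  # [True, True, False, False, True]
```

With an interval of 10 seconds between probes, two consecutive failures would take roughly 20 seconds to register under this model.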



Version-Release number of the following components:
OpenShift 4.5. 

How reproducible:
Always

Steps to Reproduce:
1. Run an upgrade job on the specified infrastructure. 

You will see that the load balancer sends requests to an apiserver instance while it is down.


Actual results:
An example: https://prow.svc.ci.openshift.org/view/gcs/origin-ci-test/logs/release-openshift-ocp-installer-e2e-azure-serial-4.4/1542

Apr 26 21:45:14.063: INFO: Unexpected error listing nodes: Get https://api.ci-op-hvqc8pgw-0ba00.ci.azure.devcluster.openshift.com:6443/api/v1/nodes?fieldSelector=spec.unschedulable%3Dfalse&resourceVersion=0: dial tcp 52.154.154.255:6443: connect: connection refused



Expected results:
The load balancer should identify, in time, the kube-apiserver instance that is being rolled out and forward requests to another instance that is serving. Ideally, we should not see any "dial tcp 52.154.154.255:6443: connect: connection refused" errors while kube-apiserver is being rolled out.


Bug 1828382 is about installer-provisioned Azure.  This clone is about user-provisioned vSphere.
Description W. Trevor King 2020-05-14 23:42:02 UTC

+++ This bug was initially created as a clone of Bug #1828382 +++

Bug 1828382 handled installer-provisioned Azure.  This clone is for the user-provisioned Azure recommendations.  Dropping severity down to medium, because user-provisioned recommendations are suggesting approaches, not telling the user what they should be doing without thinking.

Comment 1 Abhinav Dahiya 2020-05-18 17:25:11 UTC
Won't be able to get to it this sprint.

Comment 11 Abhinav Dahiya 2020-09-10 22:14:21 UTC
*** Bug 1870183 has been marked as a duplicate of this bug. ***

Comment 12 Stefan Schimanski 2020-09-11 14:54:28 UTC
*** Bug 1873816 has been marked as a duplicate of this bug. ***

Comment 28 jima 2020-10-27 06:33:33 UTC
Used the updated transformer scripts, which include this fix, to install UPI on vSphere with 4.7.0-0.nightly-2020-10-26-152308. The installation succeeded. Rebooting a master node also worked well. Moving the bug to VERIFIED.

On the load balancer server, the api-server backend in the haproxy configuration is changed as follows:
backend api-server
    option  httpchk GET /readyz HTTP/1.0
    option  log-health-checks
    balance roundrobin
        server xxx.xx.248.138 xxx.xx.248.138:6443 weight 1 verify none check check-ssl inter 1s fall 2 rise 3
        server xxx.xx.248.139 xxx.xx.248.139:6443 weight 1 verify none check check-ssl inter 1s fall 2 rise 3
        server xxx.xx.248.137 xxx.xx.248.137:6443 weight 1 verify none check check-ssl inter 1s fall 2 rise 3
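
As a rough reading of the `inter`, `fall`, and `rise` values in the configuration above, the sketch below parses a `server` line and estimates how long HAProxy needs to change a backend's state. The function names are hypothetical, and the estimate is an approximation that ignores check timeouts and settings such as `fastinter` or `downinter`.

```python
import re


def parse_duration_seconds(value: str) -> float:
    """Convert an HAProxy-style time value to seconds.

    Handles only the ms/s/m suffixes and treats a bare number as
    milliseconds (HAProxy's default unit for `inter`).
    """
    match = re.fullmatch(r"(\d+)(ms|s|m)?", value)
    if match is None:
        raise ValueError(f"unsupported duration: {value!r}")
    number, unit = int(match.group(1)), match.group(2) or "ms"
    return number * {"ms": 0.001, "s": 1.0, "m": 60.0}[unit]


def check_timing(server_line: str) -> dict:
    """Extract inter/fall/rise from a `server` line and estimate timings."""
    tokens = server_line.split()
    # HAProxy defaults when a keyword is absent: inter 2s, fall 3, rise 2.
    options = {"inter": "2s", "fall": "3", "rise": "2"}
    for keyword in options:
        if keyword in tokens:
            options[keyword] = tokens[tokens.index(keyword) + 1]
    inter = parse_duration_seconds(options["inter"])
    fall, rise = int(options["fall"]), int(options["rise"])
    return {
        "inter_s": inter,
        "approx_seconds_to_mark_down": inter * fall,
        "approx_seconds_to_mark_up": inter * rise,
    }


line = ("server xxx.xx.248.138 xxx.xx.248.138:6443 weight 1 "
        "verify none check check-ssl inter 1s fall 2 rise 3")
print(check_timing(line))
# {'inter_s': 1.0, 'approx_seconds_to_mark_down': 2.0, 'approx_seconds_to_mark_up': 3.0}
```

Under these assumptions, a backend whose /readyz starts failing is taken out of rotation after about two seconds and returns after about three seconds of successful checks.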

Comment 38 errata-xmlrpc 2021-02-24 15:12:13 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Moderate: OpenShift Container Platform 4.7.0 security, bug fix, and enhancement update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2020:5633

