I've noticed that this issue is still present in the 4.5 Azure UPI templates. The IPI fix has been backported all the way to 4.3 (https://github.com/openshift/installer/pull/3665), but the UPI fix is currently only in 4.6, so 4.5 is lagging. Backporting to 4.5 would fully resolve https://bugzilla.redhat.com/show_bug.cgi?id=1856729, which is currently resolved for 4.6.

+++ This bug was initially created as a clone of Bug #1836016 +++

+++ This bug was initially created as a clone of Bug #1828382 +++

Description of problem:

Load balancers (both internal and external) for kube-apiserver should use /readyz for the back-end health check. After scanning the GitHub repo, we found that the following platforms do not use /readyz for the back-end health check.

Azure:
https://github.com/openshift/installer/blob/master/data/data/azure/vnet/internal-lb.tf#L138
https://github.com/openshift/installer/blob/master/data/data/azure/vnet/public-lb.tf#L164
https://github.com/openshift/installer/blob/master/upi/azure/03_infra.json#L101
https://github.com/openshift/installer/blob/master/upi/azure/03_infra.json#L184

VSphere:
https://github.com/openshift/installer/blob/master/upi/vsphere/lb/haproxy.tmpl#L33

Please investigate the following; we are not sure whether it needs to be addressed:
https://github.com/openshift/installer/blob/master/upi/aws/cloudformation/02_cluster_infra.yaml#L179

The above health checks use the default TCP rule, which breaks graceful termination. When the apiserver receives a KILL signal, /readyz will start reporting failure. This gives the load balancer a chance to detect an instance that is rolling out and take appropriate action.

We should use the AWS lb rules as a reference for consistency:
https://github.com/openshift/installer/blob/master/data/data/aws/vpc/master-elb.tf#L87

  health_check {
    healthy_threshold   = 2
    unhealthy_threshold = 2
    interval            = 10
    port                = 6443
    protocol            = "HTTPS"
    path                = "/readyz"
  }

Version-Release number of the following components:
OpenShift 4.5.
How reproducible:
Always

Steps to Reproduce:
1. Run an upgrade job on the specified infrastructure. You will see that the load balancer is sending requests to an apiserver while it is down.

Actual results:
An example: https://prow.svc.ci.openshift.org/view/gcs/origin-ci-test/logs/release-openshift-ocp-installer-e2e-azure-serial-4.4/1542

Apr 26 21:45:14.063: INFO: Unexpected error listing nodes: Get https://api.ci-op-hvqc8pgw-0ba00.ci.azure.devcluster.openshift.com:6443/api/v1/nodes?fieldSelector=spec.unschedulable%3Dfalse&resourceVersion=0: dial tcp 52.154.154.255:6443: connect: connection refused

Expected results:
The load balancer should identify the kube-apiserver instance that is being rolled out in time and forward requests to another instance that is serving. Ideally, we should not see any "dial tcp 52.154.154.255:6443: connect: connection refused" error while kube-apiserver is being rolled out.

Bug 1828382 handled installer-provisioned Azure. This clone is for the user-provisioned Azure recommendations. Dropping severity down to medium, because the user-provisioned recommendations suggest approaches rather than tell users what to do without thinking.

--- Additional comment from Abhinav Dahiya on 2020-05-18 17:24:35 UTC ---

Won't be able to get to it this sprint.

--- Additional comment from Brenton Leanhardt on 2020-05-18 17:59:29 UTC ---

We discussed this bug during today's bug scrub and decided that it should be deferred to an upcoming sprint.

--- Additional comment from John Hixson on 2020-06-08 16:56:01 UTC ---

PR: https://github.com/openshift/installer/pull/3720

--- Additional comment from errata-xmlrpc on 2020-06-10 00:04:35 UTC ---

This bug has been added to advisory RHBA-2020:54579 by OpenShift Release Team Bot (ocp-build/buildvm.openshift.eng.bos.redhat.com)

--- Additional comment from errata-xmlrpc on 2020-06-10 00:04:42 UTC ---

Bug report changed to ON_QA status by Errata System.
A QE request has been submitted for advisory RHBA-2020:54579-02
https://errata.devel.redhat.com/advisory/54579

--- Additional comment from Mike Gahagan on 2020-06-11 20:54:16 UTC ---

Confirmed both internal and external loadbalancers are using http/https and the /readyz endpoint in UPI Azure using 4.5.0-0.nightly-2020-06-10-224736

public lb:
  "name": "api-internal-probe",
  "numberOfProbes": 3,
  "port": 6443,
  "protocol": "Https",
  "provisioningState": "Succeeded",
  "requestPath": "/readyz",
  "resourceGroup": "esimardupi-4zmfb-rg",
  "type": "Microsoft.Network/loadBalancers/probes"

internal lb:
  "name": "api-internal-probe",
  "numberOfProbes": 3,
  "port": 6443,
  "protocol": "Https",
  "provisioningState": "Succeeded",
  "requestPath": "/readyz",
  "resourceGroup": "esimardupi-4zmfb-rg",
  "type": "Microsoft.Network/loadBalancers/probes"

--- Additional comment from Abhinav Dahiya on 2020-08-24 16:50:48 UTC ---
*** This bug has been marked as a duplicate of bug 1874582 ***
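
For anyone re-verifying a 4.5 backport, the manual check from the QE comment above (API probes on port 6443 use Http/Https with requestPath /readyz) could be scripted roughly as follows. This is only a sketch: the helper names `probe_uses_readyz` and `non_readyz_api_probes` are made up for illustration, the input is assumed to be a JSON list of probe objects with the same field names as the output quoted above, and the "api-tcp-probe" entry in the sample is invented to show a failing case.

```python
import json

API_PORT = 6443  # kube-apiserver port used by the probes quoted in this bug


def probe_uses_readyz(probe: dict) -> bool:
    """Return True if the probe is an HTTP(S) check against /readyz."""
    protocol = str(probe.get("protocol") or "").lower()
    return protocol in ("http", "https") and probe.get("requestPath") == "/readyz"


def non_readyz_api_probes(probes_json: str) -> list:
    """Return names of port-6443 probes that do not check /readyz.

    Probes on other ports are ignored, since this bug only concerns
    the kube-apiserver health check.
    """
    probes = json.loads(probes_json)
    return [
        p.get("name", "<unnamed>")
        for p in probes
        if p.get("port") == API_PORT and not probe_uses_readyz(p)
    ]


if __name__ == "__main__":
    # First entry mirrors the fields from the QE comment; the second is a
    # hypothetical TCP probe of the kind this bug asks to replace.
    sample = json.dumps([
        {"name": "api-internal-probe", "port": 6443,
         "protocol": "Https", "requestPath": "/readyz"},
        {"name": "api-tcp-probe", "port": 6443, "protocol": "Tcp"},
    ])
    print(non_readyz_api_probes(sample))
```

An empty list means every API probe in the input already checks /readyz; any names returned are probes that would still route traffic to an apiserver that is shutting down.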