Opening this bugzilla for 4.3 as well.
+++ This bug was initially created as a clone of Bug #1872887 +++
I've noticed that this issue is still present in the 4.5 Azure UPI templates. The IPI fix has been backported all the way to 4.3 (https://github.com/openshift/installer/pull/3665), but the UPI fix is only currently in 4.6 and lagging.
Backporting to 4.5 would fully resolve https://bugzilla.redhat.com/show_bug.cgi?id=1856729 which is currently resolved for 4.6.
+++ This bug was initially created as a clone of Bug #1836016 +++
+++ This bug was initially created as a clone of Bug #1828382 +++
Description of problem:
Load balancers (both internal and external) for kube-apiserver should use /readyz for back-end health check.
After scanning the GitHub repo, we found that the following platforms do not use /readyz for the back-end health check.
Azure:
https://github.com/openshift/installer/blob/master/data/data/azure/vnet/internal-lb.tf#L138
https://github.com/openshift/installer/blob/master/data/data/azure/vnet/public-lb.tf#L164
https://github.com/openshift/installer/blob/master/upi/azure/03_infra.json#L101
https://github.com/openshift/installer/blob/master/upi/azure/03_infra.json#L184
VSphere:
https://github.com/openshift/installer/blob/master/upi/vsphere/lb/haproxy.tmpl#L33
Please investigate the following; we are not sure whether it needs to be addressed:
https://github.com/openshift/installer/blob/master/upi/aws/cloudformation/02_cluster_infra.yaml#L179
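The above health checks use the default TCP rule, which breaks graceful termination. When the apiserver receives a termination signal (SIGTERM), /readyz starts reporting failure while the server continues to serve in-flight requests. This gives the load balancer a chance to detect an instance that is rolling out and take appropriate action. A TCP check, by contrast, keeps passing until the port actually closes.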
We should use AWS lb rules as a reference for consistency - https://github.com/openshift/installer/blob/master/data/data/aws/vpc/master-elb.tf#L87
health_check {
  healthy_threshold   = 2
  unhealthy_threshold = 2
  interval            = 10
  port                = 6443
  protocol            = "HTTPS"
  path                = "/readyz"
}
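To illustrate why the probe type matters, here is a minimal simulation sketch of the health_check settings above (interval = 10, unhealthy_threshold = 2). The timeline is an assumption for illustration only: the hypothetical `Backend` class, the signal arriving at t=5s, and the listener closing at t=75s are not taken from the installer or from kube-apiserver.

```python
from dataclasses import dataclass


@dataclass
class Backend:
    """Hypothetical apiserver timeline, in seconds (illustrative values only)."""
    signal_at: int = 5            # termination signal arrives; /readyz starts failing
    listener_closes_at: int = 75  # assumed end of graceful shutdown; port refuses connections

    def tcp_ok(self, t: int) -> bool:
        # A TCP probe succeeds as long as the port still accepts connections.
        return t < self.listener_closes_at

    def readyz_ok(self, t: int) -> bool:
        # An HTTPS /readyz probe fails as soon as shutdown begins.
        return t < self.signal_at


def removal_time(probe, interval=10, unhealthy_threshold=2, horizon=300):
    """Return the probe time at which the backend is taken out of rotation.

    The backend is removed after `unhealthy_threshold` consecutive failed
    probes, with one probe every `interval` seconds starting at t=0.
    Returns None if the backend is never removed within `horizon` seconds.
    """
    failures = 0
    for t in range(0, horizon + 1, interval):
        failures = 0 if probe(t) else failures + 1
        if failures >= unhealthy_threshold:
            return t
    return None


backend = Backend()
readyz_removed = removal_time(backend.readyz_ok)  # removed while still serving
tcp_removed = removal_time(backend.tcp_ok)        # removed only after the port closed
```

Under these assumed timings, the /readyz probe removes the backend at t=20s, well before the listener closes at t=75s, while the TCP probe removes it at t=90s, leaving a window after the port closes in which clients can still be routed to it and see "connection refused".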
Version-Release number of the following components:
OpenShift 4.5.
How reproducible:
Always
Steps to Reproduce:
1. Run an upgrade job on the specified infrastructure.
You will see that the load balancer sends requests to an apiserver instance while it is down.
Actual results:
An example: https://prow.svc.ci.openshift.org/view/gcs/origin-ci-test/logs/release-openshift-ocp-installer-e2e-azure-serial-4.4/1542
Apr 26 21:45:14.063: INFO: Unexpected error listing nodes: Get https://api.ci-op-hvqc8pgw-0ba00.ci.azure.devcluster.openshift.com:6443/api/v1/nodes?fieldSelector=spec.unschedulable%3Dfalse&resourceVersion=0: dial tcp 52.154.154.255:6443: connect: connection refused
Expected results:
The load balancer should identify the kube-apiserver instance that is being rolled out in time and forward requests to another instance that is serving. Ideally, we should not see any "dial tcp 52.154.154.255:6443: connect: connection refused" error while kube-apiserver is being rolled out.
Bug 1828382 handled installer-provisioned Azure. This clone is for the user-provisioned Azure recommendations. Dropping severity down to medium, because the user-provisioned recommendations suggest approaches rather than tell users exactly what to do without thinking.
--- Additional comment from Abhinav Dahiya on 2020-05-18 17:24:35 UTC ---
Won't be able to get to it this sprint.
--- Additional comment from Brenton Leanhardt on 2020-05-18 17:59:29 UTC ---
We discussed this bug during today's bug scrub and decided that it should be deferred to an upcoming sprint.
--- Additional comment from John Hixson on 2020-06-08 16:56:01 UTC ---
PR: https://github.com/openshift/installer/pull/3720
--- Additional comment from errata-xmlrpc on 2020-06-10 00:04:35 UTC ---
This bug has been added to advisory RHBA-2020:54579 by OpenShift Release Team Bot (ocp-build/buildvm.openshift.eng.bos.redhat.com)
--- Additional comment from errata-xmlrpc on 2020-06-10 00:04:42 UTC ---
Bug report changed to ON_QA status by Errata System.
A QE request has been submitted for advisory RHBA-2020:54579-02
https://errata.devel.redhat.com/advisory/54579
--- Additional comment from Mike Gahagan on 2020-06-11 20:54:16 UTC ---
Confirmed that both the internal and external load balancers use http/https probes against the /readyz endpoint in UPI Azure, using 4.5.0-0.nightly-2020-06-10-224736
public lb:
"name": "api-internal-probe",
"numberOfProbes": 3,
"port": 6443,
"protocol": "Https",
"provisioningState": "Succeeded",
"requestPath": "/readyz",
"resourceGroup": "esimardupi-4zmfb-rg",
"type": "Microsoft.Network/loadBalancers/probes"
internal lb:
"name": "api-internal-probe",
"numberOfProbes": 3,
"port": 6443,
"protocol": "Https",
"provisioningState": "Succeeded",
"requestPath": "/readyz",
"resourceGroup": "esimardupi-4zmfb-rg",
"type": "Microsoft.Network/loadBalancers/probes"
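The manual check above could be automated along these lines. This is a rough sketch, assuming the probe definitions are available as a JSON array of objects with the same fields shown in the output above (`name`, `port`, `protocol`, `requestPath`); the function names `probe_uses_readyz` and `non_compliant_probes` are hypothetical helpers, not part of any Azure tooling.

```python
import json


def probe_uses_readyz(probe: dict, port: int = 6443) -> bool:
    """Return True if the probe is an HTTPS check of /readyz on the API port."""
    protocol = str(probe.get("protocol", "")).lower()
    return (
        protocol == "https"
        and probe.get("requestPath") == "/readyz"
        and probe.get("port") == port
    )


def non_compliant_probes(probes_json: str) -> list:
    """Return the names of probes in a JSON array that do not check /readyz."""
    probes = json.loads(probes_json)
    return [p.get("name", "<unnamed>") for p in probes if not probe_uses_readyz(p)]
```

An empty list from `non_compliant_probes` would mean every probe passed to it matches the configuration verified in this comment. Probes for other ports, if present in the input, would need to be filtered out first.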
--- Additional comment from Abhinav Dahiya on 2020-08-24 16:50:48 UTC ---
Since this only affects clusters at install time and 4.3 is going EOL at 4.6 GA, I'm closing this bug WONTFIX. I don't think we're going to get to this before 4.3 goes EOL, and even if we did, the value of doing so would be minimal: there aren't a lot of new 4.3 clusters being created today.