Bug 1873280

Summary:	[4.3] Azure UPI: Both Internal and External load balancers for kube-apiserver should use /readyz
Product:	OpenShift Container Platform	Reporter:	Etienne Simard <esimard>
Component:	Installer	Assignee:	Etienne Simard <esimard>
Installer sub component:	openshift-installer	QA Contact:	Etienne Simard <esimard>
Status:	CLOSED WONTFIX	Docs Contact:
Severity:	medium
Priority:	medium	CC:	adahiya, akashem, aos-install, bleanhar, choag, esimard, ffranz, jminter, mgahagan, mjudeiki, wking, xxia
Version:	4.3.z	Keywords:	Upgrades
Target Milestone:	---
Target Release:	4.3.z
Hardware:	Unspecified
OS:	Unspecified
Whiteboard:
Fixed In Version:		Doc Type:	If docs needed, set a value
Doc Text:		Story Points:	---
Clone Of:	1872887	Environment:
Last Closed:	2020-09-29 15:34:26 UTC	Type:	---
Regression:	---	Mount Type:	---
Documentation:	---	CRM:
Verified Versions:		Category:	---
oVirt Team:	---	RHEL 7.3 requirements from Atomic Host:
Cloudforms Team:	---	Target Upstream Version:
Embargoed:
Bug Depends On:	1836016, 1836018, 1872822, 1872887, 1874582
Bug Blocks:

Description Etienne Simard 2020-08-27 18:43:32 UTC

Opening this bugzilla for 4.3 as well.

+++ This bug was initially created as a clone of Bug #1872887 +++

I've noticed that this issue is still present in the 4.5 Azure UPI templates. The IPI fix has been backported all the way to 4.3 (https://github.com/openshift/installer/pull/3665), but the UPI fix is currently only in 4.6, so the UPI backports are lagging behind.

Backporting to 4.5 would fully resolve https://bugzilla.redhat.com/show_bug.cgi?id=1856729 which is currently resolved for 4.6.

+++ This bug was initially created as a clone of Bug #1836016 +++

+++ This bug was initially created as a clone of Bug #1828382 +++

Description of problem:
Load balancers (both internal and external) for kube-apiserver should use /readyz for back-end health check.

After scanning the GitHub repo, we found that the following platforms do not use /readyz for back-end health checks.

Azure: 
https://github.com/openshift/installer/blob/master/data/data/azure/vnet/internal-lb.tf#L138
https://github.com/openshift/installer/blob/master/data/data/azure/vnet/public-lb.tf#L164
https://github.com/openshift/installer/blob/master/upi/azure/03_infra.json#L101
https://github.com/openshift/installer/blob/master/upi/azure/03_infra.json#L184


VSphere: 
https://github.com/openshift/installer/blob/master/upi/vsphere/lb/haproxy.tmpl#L33


Please also investigate the following; I am not sure whether it needs to be addressed:
https://github.com/openshift/installer/blob/master/upi/aws/cloudformation/02_cluster_infra.yaml#L179

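The above health checks use the default TCP rule, which breaks graceful termination. When the apiserver receives a termination signal (SIGTERM), /readyz starts reporting failure while the server keeps serving for a short period. A TCP check only sees that the port still accepts connections, so it cannot notice this. Checking /readyz gives the load balancer a chance to detect an instance that is rolling out and take appropriate action.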
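To make the difference between the two probe types concrete, here is a minimal Python sketch. The function names `tcp_probe` and `readyz_probe` and the `host` argument are hypothetical and not part of the installer. Certificate verification is disabled only because this sketch assumes a probe that, like a load balancer health check, looks solely at the HTTP status.

```python
import http.client
import socket
import ssl
import urllib.error
import urllib.request


def tcp_probe(host: str, port: int = 6443, timeout: float = 2.0) -> bool:
    """TCP-style check: passes as long as something accepts the connection."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


def readyz_probe(host: str, port: int = 6443, timeout: float = 2.0) -> bool:
    """HTTPS-style check: passes only if GET /readyz answers with HTTP 200."""
    ctx = ssl.create_default_context()
    # Assumption for this sketch: only the status code matters, not the certificate.
    ctx.check_hostname = False
    ctx.verify_mode = ssl.CERT_NONE
    url = f"https://{host}:{port}/readyz"
    try:
        with urllib.request.urlopen(url, timeout=timeout, context=ctx) as resp:
            return resp.status == 200
    except (urllib.error.URLError, http.client.HTTPException, OSError):
        # Non-2xx answers, TLS failures, timeouts and refused connections all count as "not ready".
        return False
```

During a graceful shutdown the port is expected to stay open for a while, so in this sketch `tcp_probe` would keep returning True while `readyz_probe` already returns False.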


We should use the AWS LB rules as a reference for consistency: https://github.com/openshift/installer/blob/master/data/data/aws/vpc/master-elb.tf#L87

health_check {
    healthy_threshold   = 2
    unhealthy_threshold = 2
    interval            = 10
    port                = 6443
    protocol            = "HTTPS"
    path                = "/readyz"
}
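As a rough illustration of what the two thresholds in the health check above mean, here is a minimal Python sketch. It assumes a simplified model in which a backend changes state only after the configured number of consecutive contradicting probe results; real load balancers differ in the details. The class name `BackendHealth` is hypothetical.

```python
from dataclasses import dataclass


@dataclass
class BackendHealth:
    """Simplified model of threshold-based health tracking for one backend."""

    healthy_threshold: int = 2
    unhealthy_threshold: int = 2
    in_rotation: bool = True
    _streak: int = 0  # consecutive probe results that contradict the current state

    def observe(self, probe_ok: bool) -> bool:
        """Record one probe result and return whether the backend is in rotation."""
        if probe_ok == self.in_rotation:
            # The result agrees with the current state, so any streak is reset.
            self._streak = 0
        else:
            self._streak += 1
            needed = self.healthy_threshold if probe_ok else self.unhealthy_threshold
            if self._streak >= needed:
                self.in_rotation = probe_ok
                self._streak = 0
        return self.in_rotation
```

Under this model, with an interval of 10 and an unhealthy threshold of 2, a backend whose /readyz starts failing leaves rotation after two failed probes, that is, after roughly two probe intervals.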



Version-Release number of the following components:
OpenShift 4.5. 

How reproducible:
Always

Steps to Reproduce:
1. Run an upgrade job on the specified infrastructure. 

You will see that the load balancer sends requests to an apiserver instance while it is down.


Actual results:
An example: https://prow.svc.ci.openshift.org/view/gcs/origin-ci-test/logs/release-openshift-ocp-installer-e2e-azure-serial-4.4/1542

Apr 26 21:45:14.063: INFO: Unexpected error listing nodes: Get https://api.ci-op-hvqc8pgw-0ba00.ci.azure.devcluster.openshift.com:6443/api/v1/nodes?fieldSelector=spec.unschedulable%3Dfalse&resourceVersion=0: dial tcp 52.154.154.255:6443: connect: connection refused



Expected results:
The load balancer should identify the kube-apiserver instance that is being rolled out in time and forward requests to another instance that is serving. Ideally, we should not see any "dial tcp 52.154.154.255:6443: connect: connection refused" errors while kube-apiserver is being rolled out.


Bug 1828382 handled installer-provisioned Azure. This clone is for the user-provisioned Azure recommendations. Dropping severity down to medium, because the user-provisioned recommendations suggest approaches; they do not tell users what to do without thinking.

--- Additional comment from Abhinav Dahiya on 2020-05-18 17:24:35 UTC ---

Won't be able to get to it this sprint.

--- Additional comment from Brenton Leanhardt on 2020-05-18 17:59:29 UTC ---

We discussed this bug during today's bug scrub and decided that it should be deferred to an upcoming sprint.

--- Additional comment from John Hixson on 2020-06-08 16:56:01 UTC ---

PR: https://github.com/openshift/installer/pull/3720

--- Additional comment from errata-xmlrpc on 2020-06-10 00:04:35 UTC ---

This bug has been added to advisory RHBA-2020:54579 by OpenShift Release Team Bot (ocp-build/buildvm.openshift.eng.bos.redhat.com)

--- Additional comment from errata-xmlrpc on 2020-06-10 00:04:42 UTC ---

Bug report changed to ON_QA status by Errata System.
A QE request has been submitted for advisory RHBA-2020:54579-02
https://errata.devel.redhat.com/advisory/54579

--- Additional comment from Mike Gahagan on 2020-06-11 20:54:16 UTC ---

Confirmed both internal and external loadbalancers are using http/https and the /readyz endpoint in UPI Azure using 4.5.0-0.nightly-2020-06-10-224736

public lb:



    "name": "api-internal-probe",
    "numberOfProbes": 3,
    "port": 6443,
    "protocol": "Https",
    "provisioningState": "Succeeded",
    "requestPath": "/readyz",
    "resourceGroup": "esimardupi-4zmfb-rg",
    "type": "Microsoft.Network/loadBalancers/probes"

internal lb:


    "name": "api-internal-probe",
    "numberOfProbes": 3,
    "port": 6443,
    "protocol": "Https",
    "provisioningState": "Succeeded",
    "requestPath": "/readyz",
    "resourceGroup": "esimardupi-4zmfb-rg",
    "type": "Microsoft.Network/loadBalancers/probes"

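A check like the one above can also be scripted. The following is a minimal Python sketch, assuming the probe list has been saved as a JSON array of objects shaped like the fragments above (for example, exported with the Azure CLI). The function names are hypothetical, and the sketch accepts only Https because port 6443 serves TLS.

```python
import json


def probe_uses_readyz(probe: dict, port: int = 6443) -> bool:
    """Return True if the probe checks /readyz over HTTPS on the API port."""
    return (
        str(probe.get("protocol", "")).lower() == "https"
        and probe.get("requestPath") == "/readyz"
        and probe.get("port") == port
    )


def failing_probes(probes_json: str) -> list:
    """Return the names of probes that do not use the /readyz health check."""
    probes = json.loads(probes_json)
    return [p.get("name", "<unnamed>") for p in probes if not probe_uses_readyz(p)]
```

An empty list from `failing_probes` would mean every probe in the input matches the expected configuration.
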
--- Additional comment from Abhinav Dahiya on 2020-08-24 16:50:48 UTC ---

Comment 1 Scott Dodson 2020-09-29 15:34:26 UTC

Since this only affects clusters at install time and 4.3 is going EOL at 4.6 GA, I'm closing this bug WONTFIX. I don't think we're going to get to this before 4.3 goes EOL, and even if we did, the value of doing so would be minimal: there are not a lot of new 4.3 clusters being created today.