Bug 1872887

Summary: [4.4] Azure UPI: Both Internal and External load balancers for kube-apiserver should use /readyz
Product: OpenShift Container Platform
Component: Installer
Installer sub component: openshift-installer
Reporter: Etienne Simard <esimard>
Assignee: Etienne Simard <esimard>
QA Contact: Etienne Simard <esimard>
Status: CLOSED ERRATA
Severity: medium
Priority: medium
CC: adahiya, akashem, aos-install, bleanhar, choag, esimard, ffranz, jminter, mgahagan, mjudeiki, openshift-bugzilla-robot, wking, xxia
Version: 4.4
Keywords: Upgrades
Target Release: 4.4.z
Hardware: Unspecified
OS: Unspecified
Doc Type: If docs needed, set a value
Clone Of: 1872822
Last Closed: 2020-10-28 08:11:50 UTC
Bug Depends On: 1874582
Bug Blocks: 1873280

Description Etienne Simard 2020-08-26 20:04:06 UTC
Opening this bugzilla for 4.4 as well.

+++ This bug was initially created as a clone of Bug #1872822 +++

I've noticed that this issue is still present in the 4.5 Azure UPI templates. The IPI fix has been backported all the way to 4.3 (https://github.com/openshift/installer/pull/3665), but the UPI fix is only currently in 4.6 and lagging.

Backporting to 4.5 would fully resolve https://bugzilla.redhat.com/show_bug.cgi?id=1856729 which is currently resolved for 4.6.

+++ This bug was initially created as a clone of Bug #1836016 +++

+++ This bug was initially created as a clone of Bug #1828382 +++

Description of problem:
Load balancers (both internal and external) for kube-apiserver should use /readyz for back-end health check.

After scanning the github repo, we found the following platforms do not use /readyz for backend health check.

Azure: 
https://github.com/openshift/installer/blob/master/data/data/azure/vnet/internal-lb.tf#L138
https://github.com/openshift/installer/blob/master/data/data/azure/vnet/public-lb.tf#L164
https://github.com/openshift/installer/blob/master/upi/azure/03_infra.json#L101
https://github.com/openshift/installer/blob/master/upi/azure/03_infra.json#L184


VSphere: 
https://github.com/openshift/installer/blob/master/upi/vsphere/lb/haproxy.tmpl#L33


Please investigate the following, not sure if it needs to be addressed 
https://github.com/openshift/installer/blob/master/upi/aws/cloudformation/02_cluster_infra.yaml#L179

The health checks above use the default TCP rule, which breaks graceful termination. When the apiserver receives a termination signal, /readyz starts reporting failure. This gives the load balancer a chance to detect an instance that is rolling out and take appropriate action.
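
To make the difference between the two probe styles concrete, here is a minimal Python sketch against a toy model of one apiserver instance. The `Backend` class and its `listening` and `ready` flags are hypothetical stand-ins, not installer or kube-apiserver code, and the model assumes the process keeps accepting connections for a while after /readyz starts failing.

```python
from dataclasses import dataclass


@dataclass
class Backend:
    """Toy model of one kube-apiserver instance (hypothetical, for illustration)."""

    listening: bool = True  # the TCP port accepts connections
    ready: bool = True      # /readyz would return HTTP 200

    def begin_graceful_termination(self) -> None:
        # Assumption: the server keeps listening while /readyz already fails.
        self.ready = False

    def stop(self) -> None:
        self.listening = False
        self.ready = False


def tcp_probe(backend: Backend) -> bool:
    """A TCP health check only learns whether the port accepts connections."""
    return backend.listening


def readyz_probe(backend: Backend) -> bool:
    """An HTTPS /readyz check also learns whether the server wants traffic."""
    return backend.listening and backend.ready


backend = Backend()
backend.begin_graceful_termination()
print("tcp probe healthy:   ", tcp_probe(backend))
print("readyz probe healthy:", readyz_probe(backend))
```

In this model the TCP probe keeps reporting the terminating instance as healthy until the port closes, while the /readyz probe flags it as soon as termination begins.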


We should use AWS lb rules as a reference for consistency - https://github.com/openshift/installer/blob/master/data/data/aws/vpc/master-elb.tf#L87

health_check {
    healthy_threshold   = 2
    unhealthy_threshold = 2
    interval            = 10
    port                = 6443
    protocol            = "HTTPS"
    path                = "/readyz"
}
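
As a rough illustration of what the two thresholds in that rule mean, the Python sketch below models a backend whose state only flips after enough consecutive probe results disagree with it. The `HealthTracker` class is a hypothetical, simplified model and not how AWS or Azure actually implement health checking; real load balancers may differ in timing details.

```python
class HealthTracker:
    """Simplified model of threshold-based health checking (illustrative only)."""

    def __init__(self, healthy_threshold: int = 2, unhealthy_threshold: int = 2) -> None:
        self.healthy_threshold = healthy_threshold
        self.unhealthy_threshold = unhealthy_threshold
        self.healthy = True
        self._streak = 0  # consecutive results that contradict the current state

    def observe(self, probe_ok: bool) -> bool:
        """Record one probe result and return the resulting health state."""
        if probe_ok == self.healthy:
            # The result agrees with the current state, so any streak is reset.
            self._streak = 0
            return self.healthy
        self._streak += 1
        needed = self.healthy_threshold if probe_ok else self.unhealthy_threshold
        if self._streak >= needed:
            self.healthy = probe_ok
            self._streak = 0
        return self.healthy


INTERVAL_SECONDS = 10  # same value as `interval` in the health_check rule

tracker = HealthTracker()
states = [tracker.observe(ok) for ok in (True, False, False, True, True)]
print(states)
print("approx. seconds of failed probes before removal:",
      tracker.unhealthy_threshold * INTERVAL_SECONDS)
```

Under this model, with both thresholds at 2 and a 10-second interval, an instance is taken out of rotation after roughly 20 seconds of failing /readyz responses, which is why the apiserver needs to keep serving for a while after it starts reporting not-ready.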



Version-Release number of the following components:
OpenShift 4.5. 

How reproducible:
Always

Steps to Reproduce:
1. Run an upgrade job on the specified infrastructure. 

You will see that the load balancer is sending requests to an apiserver while it is down.


Actual results:
An example: https://prow.svc.ci.openshift.org/view/gcs/origin-ci-test/logs/release-openshift-ocp-installer-e2e-azure-serial-4.4/1542

Apr 26 21:45:14.063: INFO: Unexpected error listing nodes: Get https://api.ci-op-hvqc8pgw-0ba00.ci.azure.devcluster.openshift.com:6443/api/v1/nodes?fieldSelector=spec.unschedulable%3Dfalse&resourceVersion=0: dial tcp 52.154.154.255:6443: connect: connection refused



Expected results:
The load balancer should identify the kube-apiserver instance that is being rolled out in time and forward requests to another instance that is serving. Ideally, we should not see any "dial tcp 52.154.154.255:6443: connect: connection refused" error while kube-apiserver is being rolled out.


Bug 1828382 handled installer-provisioned Azure.  This clone is for the user-provisioned Azure recommendations.  Dropping severity down to medium, because user-provisioned recommendations are suggesting approaches, not telling the user what they should be doing without thinking.

--- Additional comment from Abhinav Dahiya on 2020-05-18 17:24:35 UTC ---

Won't be able to get to it this sprint.

--- Additional comment from Brenton Leanhardt on 2020-05-18 17:59:29 UTC ---

We discussed this bug during today's bug scrub and decided that it should be deferred to an upcoming sprint.

--- Additional comment from John Hixson on 2020-06-08 16:56:01 UTC ---

PR: https://github.com/openshift/installer/pull/3720

--- Additional comment from errata-xmlrpc on 2020-06-10 00:04:35 UTC ---

This bug has been added to advisory RHBA-2020:54579 by OpenShift Release Team Bot (ocp-build/buildvm.openshift.eng.bos.redhat.com)

--- Additional comment from errata-xmlrpc on 2020-06-10 00:04:42 UTC ---

Bug report changed to ON_QA status by Errata System.
A QE request has been submitted for advisory RHBA-2020:54579-02
https://errata.devel.redhat.com/advisory/54579

--- Additional comment from Mike Gahagan on 2020-06-11 20:54:16 UTC ---

Confirmed that both the internal and external load balancers use HTTP/HTTPS probes against the /readyz endpoint in UPI Azure, using 4.5.0-0.nightly-2020-06-10-224736.

public lb:



    "name": "api-internal-probe",
    "numberOfProbes": 3,
    "port": 6443,
    "protocol": "Https",
    "provisioningState": "Succeeded",
    "requestPath": "/readyz",
    "resourceGroup": "esimardupi-4zmfb-rg",
    "type": "Microsoft.Network/loadBalancers/probes"

internal lb:


    "name": "api-internal-probe",
    "numberOfProbes": 3,
    "port": 6443,
    "protocol": "Https",
    "provisioningState": "Succeeded",
    "requestPath": "/readyz",
    "resourceGroup": "esimardupi-4zmfb-rg",
    "type": "Microsoft.Network/loadBalancers/probes"
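
A check like the one above could be scripted. The following Python sketch is one possible approach, assuming the probe definitions are available as a JSON list of objects with the `name`, `protocol`, `port`, and `requestPath` fields shown in this comment. The function name `probes_not_using_readyz` and the sample data (including the `legacy-tcp-probe` entry) are hypothetical.

```python
import json
from typing import Any, Dict, List


def probes_not_using_readyz(probes: List[Dict[str, Any]], port: int = 6443) -> List[str]:
    """Return the names of probes on `port` that are not HTTP(S) checks of /readyz."""
    offenders = []
    for probe in probes:
        if probe.get("port") != port:
            continue  # only the kube-apiserver port is of interest here
        protocol = str(probe.get("protocol", "")).lower()
        if protocol not in ("http", "https") or probe.get("requestPath") != "/readyz":
            offenders.append(str(probe.get("name", "<unnamed>")))
    return offenders


# Sample input: the first entry mirrors the fields shown in this comment,
# the second is a made-up TCP probe included to show a failing case.
sample = json.loads("""
[
  {"name": "api-internal-probe", "numberOfProbes": 3, "port": 6443,
   "protocol": "Https", "requestPath": "/readyz"},
  {"name": "legacy-tcp-probe", "numberOfProbes": 3, "port": 6443,
   "protocol": "Tcp", "requestPath": null}
]
""")

print(probes_not_using_readyz(sample))
```

An empty result would mean every probe on port 6443 in the input checks /readyz over HTTP or HTTPS.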

--- Additional comment from Abhinav Dahiya on 2020-08-24 16:50:48 UTC ---

Comment 1 Scott Dodson 2020-09-29 15:36:40 UTC
*** Bug 1880726 has been marked as a duplicate of this bug. ***

Comment 3 Etienne Simard 2020-10-07 18:25:08 UTC
Verified with 4.4.0-0.nightly-2020-10-05-235326 and templates from https://github.com/openshift/installer/pull/4202

Confirmed the cluster works well with the changes to the internal-lb and public-lb. Also had Mike Gahagan confirm it since I've initiated the backport.

Comment 6 errata-xmlrpc 2020-10-28 08:11:50 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (OpenShift Container Platform 4.4.29 bug fix update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2020:4224