Bug 1872822 - [4.5] Azure UPI: Both Internal and External load balancers for kube-apiserver should use /readyz
Keywords:
Status: CLOSED DUPLICATE of bug 1874582
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Installer
Version: 4.5
Hardware: Unspecified
OS: Unspecified
Priority: medium
Severity: medium
Target Milestone: ---
Target Release: 4.5.z
Assignee: Etienne Simard
QA Contact: Etienne Simard
URL:
Whiteboard:
Depends On: 1836016
Blocks: 1873280
Reported: 2020-08-26 16:55 UTC by Etienne Simard
Modified: 2020-12-04 18:25 UTC

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of: 1836016
Clones: 1872887
Environment:
Last Closed: 2020-12-04 18:25:08 UTC
Target Upstream Version:
Embargoed:



Description Etienne Simard 2020-08-26 16:55:18 UTC
I've noticed that this issue is still present in the 4.5 Azure UPI templates. The IPI fix has been backported all the way to 4.3 (https://github.com/openshift/installer/pull/3665), but the UPI fix is currently only in 4.6.

Backporting it to 4.5 would fully resolve https://bugzilla.redhat.com/show_bug.cgi?id=1856729, which is currently resolved for 4.6.

+++ This bug was initially created as a clone of Bug #1836016 +++

+++ This bug was initially created as a clone of Bug #1828382 +++

Description of problem:
Load balancers (both internal and external) for kube-apiserver should use /readyz for back-end health check.

After scanning the GitHub repo, we found that the following platforms do not use /readyz for back-end health checks.

Azure: 
https://github.com/openshift/installer/blob/master/data/data/azure/vnet/internal-lb.tf#L138
https://github.com/openshift/installer/blob/master/data/data/azure/vnet/public-lb.tf#L164
https://github.com/openshift/installer/blob/master/upi/azure/03_infra.json#L101
https://github.com/openshift/installer/blob/master/upi/azure/03_infra.json#L184


VSphere: 
https://github.com/openshift/installer/blob/master/upi/vsphere/lb/haproxy.tmpl#L33


Please also investigate the following; I'm not sure whether it needs to be addressed:
https://github.com/openshift/installer/blob/master/upi/aws/cloudformation/02_cluster_infra.yaml#L179

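The above health checks use the default TCP rule, which breaks graceful termination. A TCP check only confirms that the port accepts connections, so it keeps passing while an instance is shutting down. When the apiserver receives a termination signal (SIGTERM), /readyz starts reporting failure. This gives the load balancer a chance to detect an instance that is being rolled out and stop sending new requests to it.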
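
To illustrate the difference between the two kinds of check, here is a minimal Python sketch. It is not taken from the installer or from any load balancer implementation. It assumes a local stand-in server on plain HTTP and an ephemeral port (the real probe targets HTTPS on 6443), and the names DrainingHandler, tcp_probe, readyz_probe and demo are hypothetical.

```python
import http.client
import socket
import threading
from http.server import BaseHTTPRequestHandler, HTTPServer


class DrainingHandler(BaseHTTPRequestHandler):
    """Stand-in for an apiserver that is shutting down gracefully."""

    def do_GET(self):
        # Readiness reports failure, but the listener is still open.
        status = 500 if self.path == "/readyz" else 200
        self.send_response(status)
        self.send_header("Content-Length", "0")
        self.end_headers()

    def log_message(self, *args):
        pass  # keep the demo quiet


def tcp_probe(host, port, timeout=2.0):
    """TCP-style check: healthy if the port accepts a connection."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


def readyz_probe(host, port, timeout=2.0):
    """HTTP-style check: healthy only if GET /readyz returns 200."""
    conn = http.client.HTTPConnection(host, port, timeout=timeout)
    try:
        conn.request("GET", "/readyz")
        return conn.getresponse().status == 200
    except (OSError, http.client.HTTPException):
        return False
    finally:
        conn.close()


def demo():
    """Run both probes against the draining stand-in server."""
    server = HTTPServer(("127.0.0.1", 0), DrainingHandler)
    threading.Thread(target=server.serve_forever, daemon=True).start()
    host, port = server.server_address
    try:
        return tcp_probe(host, port), readyz_probe(host, port)
    finally:
        server.shutdown()
        server.server_close()
```

Calling demo() returns (True, False): the TCP check still considers the draining instance healthy, while the /readyz check does not.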


We should use the AWS LB rules as a reference for consistency - https://github.com/openshift/installer/blob/master/data/data/aws/vpc/master-elb.tf#L87

health_check {
    healthy_threshold   = 2
    unhealthy_threshold = 2
    interval            = 10
    port                = 6443
    protocol            = "HTTPS"
    path                = "/readyz"
}

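As a rough illustration of what the thresholds and interval in the AWS rule imply, the following Python sketch models a backend that changes state only after a run of consecutive probe results. This is a simplified model written for this report, not the actual AWS or Azure algorithm, and the names track_health and detection_delay are hypothetical.

```python
def track_health(results, healthy_threshold=2, unhealthy_threshold=2,
                 start_healthy=True):
    """Return the backend state after each probe result.

    results holds one boolean per probe interval (True = probe passed).
    """
    healthy = start_healthy
    streak = 0  # consecutive results that disagree with the current state
    states = []
    for ok in results:
        if ok == healthy:
            streak = 0
        else:
            streak += 1
            needed = healthy_threshold if ok else unhealthy_threshold
            if streak >= needed:
                healthy = ok
                streak = 0
        states.append(healthy)
    return states


def detection_delay(interval_seconds=10, unhealthy_threshold=2):
    """Approximate worst-case seconds before a failing backend is removed."""
    return interval_seconds * unhealthy_threshold
```

Under this model, with the values from the AWS rule, a backend whose /readyz starts failing is taken out of rotation after two failed probes, that is, after roughly 20 seconds at most (probe timeouts not included), and a single failed probe does not remove it.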


Version-Release number of the following components:
OpenShift 4.5. 

How reproducible:
Always

Steps to Reproduce:
1. Run an upgrade job on the specified infrastructure. 

You will see that the load balancer keeps sending requests to an apiserver instance while it is down.


Actual results:
An example: https://prow.svc.ci.openshift.org/view/gcs/origin-ci-test/logs/release-openshift-ocp-installer-e2e-azure-serial-4.4/1542

Apr 26 21:45:14.063: INFO: Unexpected error listing nodes: Get https://api.ci-op-hvqc8pgw-0ba00.ci.azure.devcluster.openshift.com:6443/api/v1/nodes?fieldSelector=spec.unschedulable%3Dfalse&resourceVersion=0: dial tcp 52.154.154.255:6443: connect: connection refused



Expected results:
The load balancer should identify the kube-apiserver instance that is being rolled out in time and forward requests to another instance that is serving. Ideally, we should not see any "dial tcp 52.154.154.255:6443: connect: connection refused" error while kube-apiserver is being rolled out.


Bug 1828382 handled installer-provisioned Azure. This clone is for the user-provisioned Azure recommendations. Dropping severity down to medium, because the user-provisioned recommendations suggest approaches rather than tell users what they should do without thinking.

--- Additional comment from Abhinav Dahiya on 2020-05-18 17:24:35 UTC ---

Won't be able to get to it this sprint.

--- Additional comment from Brenton Leanhardt on 2020-05-18 17:59:29 UTC ---

We discussed this bug during today's bug scrub and decided that it should be deferred to an upcoming sprint.

--- Additional comment from John Hixson on 2020-06-08 16:56:01 UTC ---

PR: https://github.com/openshift/installer/pull/3720

--- Additional comment from errata-xmlrpc on 2020-06-10 00:04:35 UTC ---

This bug has been added to advisory RHBA-2020:54579 by OpenShift Release Team Bot (ocp-build/buildvm.openshift.eng.bos.redhat.com)

--- Additional comment from errata-xmlrpc on 2020-06-10 00:04:42 UTC ---

Bug report changed to ON_QA status by Errata System.
A QE request has been submitted for advisory RHBA-2020:54579-02
https://errata.devel.redhat.com/advisory/54579

--- Additional comment from Mike Gahagan on 2020-06-11 20:54:16 UTC ---

Confirmed that both the internal and external load balancers are using http/https and the /readyz endpoint in UPI Azure, using 4.5.0-0.nightly-2020-06-10-224736

public lb:



    "name": "api-internal-probe",
    "numberOfProbes": 3,
    "port": 6443,
    "protocol": "Https",
    "provisioningState": "Succeeded",
    "requestPath": "/readyz",
    "resourceGroup": "esimardupi-4zmfb-rg",
    "type": "Microsoft.Network/loadBalancers/probes"

internal lb:


    "name": "api-internal-probe",
    "numberOfProbes": 3,
    "port": 6443,
    "protocol": "Https",
    "provisioningState": "Succeeded",
    "requestPath": "/readyz",
    "resourceGroup": "esimardupi-4zmfb-rg",
    "type": "Microsoft.Network/loadBalancers/probes"

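A check like the one above could be scripted. The following Python sketch assumes the probe definitions are available as a JSON array with the field names shown in the output above (for example, as printed by a command such as `az network lb probe list`). The function name api_probe_problems and the EXPECTED table are hypothetical.

```python
import json

# Values reported in the verification output above.
EXPECTED = {"protocol": "Https", "requestPath": "/readyz"}


def api_probe_problems(raw_json, api_port=6443):
    """Map probe name -> {field: (actual, expected)} for probes on api_port."""
    problems = {}
    for probe in json.loads(raw_json):
        if probe.get("port") != api_port:
            continue  # not a kube-apiserver probe; ignored in this sketch
        wrong = {
            field: (probe.get(field), want)
            for field, want in EXPECTED.items()
            if probe.get(field) != want
        }
        if wrong:
            problems[probe.get("name", "<unnamed>")] = wrong
    return problems
```

An empty result means every probe on port 6443 uses Https and /readyz; any other result names the probe and the fields that differ.
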
--- Additional comment from Abhinav Dahiya on 2020-08-24 16:50:48 UTC ---

Comment 1 Scott Dodson 2020-09-29 15:47:53 UTC

*** This bug has been marked as a duplicate of bug 1874582 ***

Comment 2 Scott Dodson 2020-12-04 18:25:08 UTC

*** This bug has been marked as a duplicate of bug 1874582 ***

