Bug 1828382 - Azure IPI: Both Internal and External load balancers for kube-apiserver should use /readyz
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Installer
Version: 4.5
Hardware: Unspecified
OS: Unspecified
Priority: high
Severity: high
Target Milestone: ---
Target Release: 4.5.0
Assignee: Abu Kashem
QA Contact: Etienne Simard
URL:
Whiteboard:
Duplicates: 1820577
Depends On: 1831760
Blocks: 1836038
Reported: 2020-04-27 15:53 UTC by Abu Kashem
Modified: 2020-07-13 17:32 UTC (History)

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
: 1836016 1836017 1836018 (view as bug list)
Environment:
Last Closed: 2020-07-13 17:31:52 UTC
Target Upstream Version:




Links
System | ID | Private | Priority | Status | Summary | Last Updated
Github | openshift installer pull 3600 | 0 | None | closed | Bug 1828382: data/azure/vnet: switch to HTTPS probes for lbs | 2021-02-17 13:12:30 UTC
Red Hat Product Errata | RHBA-2020:2409 | 0 | None | None | None | 2020-07-13 17:32:09 UTC

Description Abu Kashem 2020-04-27 15:53:21 UTC
Description of problem:
Load balancers (both internal and external) for kube-apiserver should use /readyz for back-end health check.

After scanning the GitHub repo, we found that the following platforms do not use /readyz for the backend health check:

Azure: 
https://github.com/openshift/installer/blob/master/data/data/azure/vnet/internal-lb.tf#L138
https://github.com/openshift/installer/blob/master/data/data/azure/vnet/public-lb.tf#L164
https://github.com/openshift/installer/blob/master/upi/azure/03_infra.json#L101
https://github.com/openshift/installer/blob/master/upi/azure/03_infra.json#L184


VSphere: 
https://github.com/openshift/installer/blob/master/upi/vsphere/lb/haproxy.tmpl#L33


Please also investigate the following; we are not sure whether it needs to be addressed:
https://github.com/openshift/installer/blob/master/upi/aws/cloudformation/02_cluster_infra.yaml#L179

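The above health checks use the default TCP rule, which breaks graceful termination. When the apiserver receives a termination signal (SIGTERM), /readyz starts reporting failure while the server keeps accepting connections. This gives the load balancer a chance to detect an instance that is being rolled out and stop sending new requests to it. A plain TCP check keeps succeeding until the listener actually closes, so the load balancer reacts too late.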
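
The difference between the two probe types can be illustrated with a toy model. This is only a sketch: the ApiServer class and both probe functions are hypothetical names made up for illustration, not installer or kube-apiserver code, and the model assumes that readiness flips before the listener closes, as described above.

```python
from dataclasses import dataclass


@dataclass
class ApiServer:
    """Toy stand-in for one kube-apiserver instance (hypothetical, for illustration)."""
    listening: bool = True  # TCP port 6443 still accepts connections
    ready: bool = True      # /readyz would return 200

    def begin_graceful_termination(self) -> None:
        # Readiness flips first; the listener stays open so that
        # in-flight requests can complete.
        self.ready = False

    def stop(self) -> None:
        # The process has exited; nothing accepts connections any more.
        self.listening = False
        self.ready = False


def tcp_probe(server: ApiServer) -> bool:
    """A TCP health check only tests whether the port accepts connections."""
    return server.listening


def readyz_probe(server: ApiServer) -> bool:
    """An HTTPS /readyz check needs a connection and a success response."""
    return server.listening and server.ready


if __name__ == "__main__":
    server = ApiServer()
    server.begin_graceful_termination()
    # During the termination window the TCP probe still passes,
    # while the /readyz probe already fails.
    print("tcp:", tcp_probe(server), "readyz:", readyz_probe(server))
```

In this model the TCP probe only turns negative after stop(), which corresponds to the point where clients start seeing "connection refused".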


We should use the AWS LB rules as a reference for consistency - https://github.com/openshift/installer/blob/master/data/data/aws/vpc/master-elb.tf#L87

health_check {
    healthy_threshold   = 2
    unhealthy_threshold = 2
    interval            = 10
    port                = 6443
    protocol            = "HTTPS"
    path                = "/readyz"
}
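
To make the threshold settings concrete, here is a simplified model of how consecutive probe results could move a backend in and out of rotation. It is a sketch under assumptions: ProbeTracker and worst_case_detection_seconds are hypothetical names, and real load balancers may count probes differently and also add probe timeouts.

```python
from dataclasses import dataclass


@dataclass
class ProbeTracker:
    """Simplified health state for one backend (illustrative model only)."""
    healthy_threshold: int = 2
    unhealthy_threshold: int = 2
    healthy: bool = True
    streak: int = 0  # consecutive results that contradict the current state

    def observe(self, probe_ok: bool) -> bool:
        """Record one probe result and return the resulting health state."""
        if probe_ok == self.healthy:
            self.streak = 0  # result agrees with the current state
            return self.healthy
        self.streak += 1
        limit = self.unhealthy_threshold if self.healthy else self.healthy_threshold
        if self.streak >= limit:
            self.healthy = probe_ok  # enough consecutive results: flip state
            self.streak = 0
        return self.healthy


def worst_case_detection_seconds(interval: int, unhealthy_threshold: int) -> int:
    """Approximate upper bound on detection time, ignoring probe timeouts."""
    return interval * unhealthy_threshold


if __name__ == "__main__":
    tracker = ProbeTracker()
    for result in (False, False, True, True):
        print(result, "->", tracker.observe(result))
    print("worst case:", worst_case_detection_seconds(10, 2), "seconds")
```

With the values above (interval 10, unhealthy_threshold 2), this model takes roughly 20 seconds in the worst case to drop a backend, so /readyz has to start failing at least that long before the apiserver stops listening.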



Version-Release number of the following components:
OpenShift 4.5. 

How reproducible:
Always

Steps to Reproduce:
1. Run an upgrade job on the specified infrastructure. 

You will see that the load balancer keeps sending requests to an apiserver instance while it is down.


Actual results:
An example: https://prow.svc.ci.openshift.org/view/gcs/origin-ci-test/logs/release-openshift-ocp-installer-e2e-azure-serial-4.4/1542

Apr 26 21:45:14.063: INFO: Unexpected error listing nodes: Get https://api.ci-op-hvqc8pgw-0ba00.ci.azure.devcluster.openshift.com:6443/api/v1/nodes?fieldSelector=spec.unschedulable%3Dfalse&resourceVersion=0: dial tcp 52.154.154.255:6443: connect: connection refused



Expected results:
The load balancer should identify the kube-apiserver instance that is being rolled out in time and forward requests to another instance that is serving. Ideally, we should not see any "dial tcp 52.154.154.255:6443: connect: connection refused" errors while kube-apiserver is being rolled out.

Comment 1 Abhinav Dahiya 2020-04-27 17:00:08 UTC
*** Bug 1820577 has been marked as a duplicate of this bug. ***

Comment 2 Jim Minter 2020-04-27 17:21:18 UTC
The fix for this will be made for 4.4 too, right?

Comment 3 Jim Minter 2020-04-27 20:50:24 UTC
I tried and failed to make this work on ARO today.  AFAICS the gap is as follows: the bootstrap node never indicates /readyz so it never joins the ILB and install fails.  AFAICS Azure does not permit different backends to listen on the same frontend port with different probe configurations.  So you'll need to get the bootstrap node to indicate /readyz as a precondition for making this work.

Comment 4 Abu Kashem 2020-04-28 16:17:05 UTC
Hi jminter@redhat.com,
If I am not mistaken, the bootstrap node should run the version of kube-apiserver in the release image, which should offer /readyz.
You mentioned in the slack thread that you have tried with "4.3.13" which has /readyz. 

Also, bootstrap logic seems to be the same for aws/gcp/azure - https://github.com/openshift/installer/blob/master/data/data/bootstrap/files/usr/local/bin/bootkube.sh.template#L202

We would like to debug this further. Is it possible to ssh to the bootstrap node and directly probe the kube-apiserver (while installation is in progress)? We expect to see an "ok" response in this case.
# curl -k https://localhost:6443/readyz
ok

This will help us narrow it down.

Comment 5 W. Trevor King 2020-04-28 22:40:41 UTC
This is not purely an update-time issue, but updates need node reboots, and node reboots need LB target adjustment.  So this issue will impact API connectivity on Azure and other platforms which aren't using /readyz.

Comment 6 Jim Minter 2020-04-28 23:04:05 UTC
@Abu Kashem

> We would like to debug this further. Is it possible to ssh to the bootstrap node and directly probe the kube-apiserver (while installation is in progress)? We expect to see an "ok" response in this case.
> # curl -k https://localhost:6443/readyz
> ok

I did that yesterday and got a '404 not found' text, hence my message on Slack.

Comment 7 Scott Dodson 2020-04-29 13:27:21 UTC
Leaving this under Component Installer but assigning it to Abu from the apiserver team to define the exact implementation details, since the apiserver team should outline exactly how this works. The Installer team can help if necessary.

Comment 8 Abu Kashem 2020-04-29 22:06:02 UTC
Hi jminter@redhat.com,
I ran a test and the bootstrap node does offer /readyz with 4.3.13. These are the steps I followed.

I don't have access to Azure, so I did this on GCP.

- kick off a 4.4 cluster with "oc adm release extract --tools quay.io/openshift-release-dev/ocp-release:4.3.13-x86_64"
- ssh into the bootstrap node as soon as it comes up in gce console.
- run the probe - while true; do curl -k https://localhost:6443/readyz; sleep 2; done

[core@akashe-2fxlj-b ~]$ while true; do curl -k https://localhost:6443/readyz; sleep 2; done
curl: (7) Failed to connect to localhost port 6443: Connection refused
curl: (7) Failed to connect to localhost port 6443: Connection refused
curl: (7) Failed to connect to localhost port 6443: Connection refused
curl: (7) Failed to connect to localhost port 6443: Connection refused
curl: (7) Failed to connect to localhost port 6443: Connection refused
curl: (7) Failed to connect to localhost port 6443: Connection refused
curl: (7) Failed to connect to localhost port 6443: Connection refused
{
  "kind": "Status",
  "apiVersion": "v1",
  "metadata": {
    
  },
  "status": "Failure",
  "message": "forbidden: User \"system:anonymous\" cannot get path \"/readyz\"",
  "reason": "Forbidden",
  "details": {
    
  },
  "code": 403
}okokokokokokokokokokokokokokokokokokokokokokokokokokokokokokokokokokokokokokokokokokokokokokokokokokokokokokokokok
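
The same probe can be scripted so that each result is classified the way a load balancer health check would see it. This is a rough sketch, not part of the installer: probe_readyz and classify are hypothetical helper names, certificate verification is disabled to mirror curl -k, and the URL is the one used in the loop above.

```python
import ssl
import urllib.error
import urllib.request


def probe_readyz(url="https://localhost:6443/readyz", timeout=5.0):
    """Return (status_code, body), or (None, reason) if no HTTP response arrived."""
    ctx = ssl.create_default_context()
    ctx.check_hostname = False       # equivalent of curl -k: accept any certificate
    ctx.verify_mode = ssl.CERT_NONE
    try:
        with urllib.request.urlopen(url, timeout=timeout, context=ctx) as resp:
            return resp.status, resp.read().decode("utf-8", "replace")
    except urllib.error.HTTPError as err:
        # The server answered, but with an error status such as 403 or 500.
        return err.code, err.read().decode("utf-8", "replace")
    except OSError as err:
        # Connection refused, timeout, TLS failure, and similar.
        return None, str(err)


def classify(status, body):
    """Map one probe result to 'down', 'not ready', or 'ready'."""
    if status is None:
        return "down"
    if status == 200 and body.strip() == "ok":
        return "ready"
    return "not ready"


if __name__ == "__main__":
    status, body = probe_readyz()
    print(classify(status, body), status)
```

Applied to the output above, the "Connection refused" lines classify as down, the single 403 response as not ready, and each "ok" as ready.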

Comment 10 Abu Kashem 2020-05-04 14:53:30 UTC
> While we do this, can somebody give more background if https endpoint check was working in the first place at any time?

Yes, kube-apiserver offers /readyz over HTTPS only. To my knowledge, AWS and GCP use a /readyz probe over HTTPS.

Comment 13 W. Trevor King 2020-05-14 21:33:42 UTC
Looks like installer#3600 addressed installer-provisioned Azure.  Are we spinning off a new bug for vSphere (which, per comment 0, was also not using /readyz) and for user-provisioned Azure (also called out in comment 0)?

Comment 14 Etienne Simard 2020-05-14 23:34:49 UTC
(In reply to W. Trevor King from comment #13)
> Looks like installer#3600 addressed installer-provisioned Azure.  Are we
> spinning off a new bug for vSphere (which, per comment 0, was also not using
> /readyz) and for user-provisioned Azure (also called out in comment 0)?

It would be a good idea. I will only be able to verify the fix for IPI on Azure.

Comment 15 W. Trevor King 2020-05-14 23:49:08 UTC
Spun off into:

* Bug 1836016: user-provisioned Azure
* Bug 1836017: user-provisioned vSphere
* Bug 1836018: user-provisioned AWS

Comment 16 Etienne Simard 2020-05-15 00:12:16 UTC
Thanks Trevor!

This Azure IPI bug fix also depends on https://bugzilla.redhat.com/show_bug.cgi?id=1831760 and it would be best to test those together for any backport.

Comment 17 W. Trevor King 2020-05-15 01:39:26 UTC
> This Azure IPI bug fix also depends on https://bugzilla.redhat.com/show_bug.cgi?id=1831760 and it would be best to test those together for any backport.

Bug 1832137 is already ON_QA in 4.4, so we should be good to go there.  I dunno if these are going to go all the way back to 4.3 or not.

Comment 18 Etienne Simard 2020-05-15 01:41:28 UTC
Verified with:

https://openshift-release.svc.ci.openshift.org/releasestream/4.5.0-0.nightly/release/4.5.0-0.nightly-2020-05-14-231228

The cluster installed without problems and I did not notice any visible issues.

Load balancer probes seen after installation:

Internal:

    "name": "sint-probe",
    "numberOfProbes": 2,
    "port": 22623,
    "protocol": "Https",
    "provisioningState": "Succeeded",
    "requestPath": "/healthz",
    "resourceGroup": "qeipi-98s24-rg",
    "type": "Microsoft.Network/loadBalancers/probes"

    "name": "api-internal-probe",
    "numberOfProbes": 2,
    "port": 6443,
    "protocol": "Https",
    "provisioningState": "Succeeded",
    "requestPath": "/readyz",
    "resourceGroup": "qeipi-98s24-rg",
    "type": "Microsoft.Network/loadBalancers/probes"

Public:

    "name": "api-internal-probe",
    "numberOfProbes": 2,
    "port": 6443,
    "protocol": "Https",
    "provisioningState": "Succeeded",
    "requestPath": "/readyz",
    "resourceGroup": "qeipi-98s24-rg",
    "type": "Microsoft.Network/loadBalancers/probes"

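A check like the one in comment 18 could be automated along these lines. This is a minimal sketch: check_api_probes is a hypothetical helper, and the sample data only reuses the fields shown in the listing (the comment does not say how the JSON was obtained).

```python
import json


def check_api_probes(probes, api_port=6443, expected_path="/readyz"):
    """Return a list of problems found in probes that target the API port."""
    problems = []
    for probe in probes:
        if probe.get("port") != api_port:
            continue  # e.g. the probe on port 22623, which uses /healthz
        name = probe.get("name", "<unnamed>")
        protocol = probe.get("protocol") or ""
        if protocol.lower() != "https":
            problems.append(f"{name}: protocol is {protocol!r}, expected 'Https'")
        if probe.get("requestPath") != expected_path:
            problems.append(
                f"{name}: requestPath is {probe.get('requestPath')!r}, "
                f"expected {expected_path!r}"
            )
    return problems


# Fields copied from the probe listing in comment 18.
SAMPLE = json.loads("""
[
  {"name": "sint-probe", "numberOfProbes": 2, "port": 22623,
   "protocol": "Https", "requestPath": "/healthz"},
  {"name": "api-internal-probe", "numberOfProbes": 2, "port": 6443,
   "protocol": "Https", "requestPath": "/readyz"}
]
""")

if __name__ == "__main__":
    print(check_api_probes(SAMPLE) or "all API probes use HTTPS /readyz")
```

A probe that still uses the old TCP rule would show up with two problems: the wrong protocol and a missing request path.
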
Comment 19 Abhinav Dahiya 2020-05-22 23:52:23 UTC
*** Bug 1820577 has been marked as a duplicate of this bug. ***

Comment 20 errata-xmlrpc 2020-07-13 17:31:52 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2020:2409

