Description of problem:

Load balancers (both internal and external) for kube-apiserver should use /readyz for the back-end health check. After scanning the GitHub repo, we found that the following platforms do not use /readyz for the back-end health check.

Azure:
https://github.com/openshift/installer/blob/master/data/data/azure/vnet/internal-lb.tf#L138
https://github.com/openshift/installer/blob/master/data/data/azure/vnet/public-lb.tf#L164
https://github.com/openshift/installer/blob/master/upi/azure/03_infra.json#L101
https://github.com/openshift/installer/blob/master/upi/azure/03_infra.json#L184

vSphere:
https://github.com/openshift/installer/blob/master/upi/vsphere/lb/haproxy.tmpl#L33

Please also investigate the following; we are not sure whether it needs to be addressed:
https://github.com/openshift/installer/blob/master/upi/aws/cloudformation/02_cluster_infra.yaml#L179

The above health checks use the default TCP rule, which breaks graceful termination. When the apiserver receives a termination signal, /readyz starts reporting failure. This gives the load balancer a chance to detect an instance that is rolling out and take appropriate action.

We should use the AWS LB rules as a reference for consistency - https://github.com/openshift/installer/blob/master/data/data/aws/vpc/master-elb.tf#L87

  health_check {
    healthy_threshold   = 2
    unhealthy_threshold = 2
    interval            = 10
    port                = 6443
    protocol            = "HTTPS"
    path                = "/readyz"
  }

Version-Release number of the following components:
OpenShift 4.5

How reproducible:
Always

Steps to Reproduce:
1. Run an upgrade job on the specified infrastructure. You will see that the load balancer keeps sending requests to an apiserver while it is down.
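To make the thresholds above concrete, here is a minimal sketch (plain Python, not tied to any cloud SDK; the ProbeTracker name and the boolean probe results are hypothetical) of the state machine a load balancer applies to consecutive health-check results. With healthy_threshold = unhealthy_threshold = 2 and a 10s interval, a target that starts failing /readyz during graceful termination is taken out of rotation after two failed probes, i.e. roughly 20 seconds:

```python
class ProbeTracker:
    """Sketch of an LB health-check state machine: a target changes state
    only after N consecutive probe results contradict its current state,
    mirroring healthy_threshold/unhealthy_threshold above."""

    def __init__(self, healthy_threshold=2, unhealthy_threshold=2):
        self.healthy_threshold = healthy_threshold
        self.unhealthy_threshold = unhealthy_threshold
        self.healthy = True   # assume the target starts in service
        self.streak = 0       # consecutive results contradicting current state

    def observe(self, probe_ok: bool) -> bool:
        """Feed one probe result (True = /readyz returned 200); return current state."""
        if probe_ok == self.healthy:
            self.streak = 0   # result agrees with current state, reset streak
        else:
            self.streak += 1
            needed = self.unhealthy_threshold if self.healthy else self.healthy_threshold
            if self.streak >= needed:
                self.healthy = not self.healthy
                self.streak = 0
        return self.healthy

# During a graceful rollout, /readyz starts failing before the listener
# closes, so the LB marks the target unhealthy after two failed probes
# and stops routing new connections to it.
t = ProbeTracker()
states = [t.observe(ok) for ok in [True, True, False, False, False]]
print(states)  # -> [True, True, True, False, False]
```

A plain TCP check, by contrast, keeps succeeding until the listener actually closes, so the LB only notices after connections start being refused.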
Actual results:

An example: https://prow.svc.ci.openshift.org/view/gcs/origin-ci-test/logs/release-openshift-ocp-installer-e2e-azure-serial-4.4/1542

  Apr 26 21:45:14.063: INFO: Unexpected error listing nodes: Get https://api.ci-op-hvqc8pgw-0ba00.ci.azure.devcluster.openshift.com:6443/api/v1/nodes?fieldSelector=spec.unschedulable%3Dfalse&resourceVersion=0: dial tcp 52.154.154.255:6443: connect: connection refused

Expected results:

The load balancer should identify the kube-apiserver instance that is being rolled out in time and forward requests to another instance that is serving. Ideally, we should not see any "dial tcp 52.154.154.255:6443: connect: connection refused" errors while kube-apiserver is being rolled out.
*** Bug 1820577 has been marked as a duplicate of this bug. ***
The fix for this will be made for 4.4 too, right?
I tried and failed to make this work on ARO today. AFAICS the gap is as follows: the bootstrap node never indicates /readyz, so it never joins the ILB and the install fails. AFAICS Azure does not permit different backends to listen on the same frontend port with different probe configurations. So you'll need to get the bootstrap node to indicate /readyz as a precondition for making this work.
Hi jminter,

If I am not mistaken, the bootstrap node should run the version of kube-apiserver in the release image, which should offer /readyz. You mentioned in the Slack thread that you tried with "4.3.13", which has /readyz. Also, the bootstrap logic seems to be the same for aws/gcp/azure - https://github.com/openshift/installer/blob/master/data/data/bootstrap/files/usr/local/bin/bootkube.sh.template#L202

We would like to debug this further. Is it possible to ssh to the bootstrap node and directly probe the kube-apiserver (while installation is in progress)? We expect to see an "ok" response in this case:

  # curl -k https://localhost:6443/readyz
  ok

This will help us narrow it down.
This is not purely an update-time issue, but updates require node reboots, and node reboots require LB target adjustment. So this issue will impact API connectivity on Azure and the other platforms that aren't using /readyz.
@Abu Kashem

> We would like to debug this further. Is it possible to ssh to the bootstrap node and directly probe the kube-apiserver (while installation is in progress)? We expect to see an "ok" response in this case.
> # curl -k https://localhost:6443/readyz
> ok

I did that yesterday and got a '404 not found' text, hence my message on Slack.
Leaving this with Component set to Installer, but assigning to Abu from the apiserver team to define the exact implementation details, since the apiserver team should outline exactly how this works. The installer team can help if necessary.
Hi jminter,

I ran a test and the bootstrap node does offer /readyz with 4.3.13. These are the steps I followed (I don't have access to Azure, so I did this on GCP):

- Kick off a 4.4 cluster with "oc adm release extract --tools quay.io/openshift-release-dev/ocp-release:4.3.13-x86_64"
- ssh into the bootstrap node as soon as it comes up in the GCE console.
- Run the probe:

  [core@akashe-2fxlj-b ~]$ while true; do curl -k https://localhost:6443/readyz; sleep 2; done
  curl: (7) Failed to connect to localhost port 6443: Connection refused
  curl: (7) Failed to connect to localhost port 6443: Connection refused
  curl: (7) Failed to connect to localhost port 6443: Connection refused
  curl: (7) Failed to connect to localhost port 6443: Connection refused
  curl: (7) Failed to connect to localhost port 6443: Connection refused
  curl: (7) Failed to connect to localhost port 6443: Connection refused
  curl: (7) Failed to connect to localhost port 6443: Connection refused
  {
    "kind": "Status",
    "apiVersion": "v1",
    "metadata": {},
    "status": "Failure",
    "message": "forbidden: User \"system:anonymous\" cannot get path \"/readyz\"",
    "reason": "Forbidden",
    "details": {},
    "code": 403
  }okokokokokokokokokokokokokokokokokokokokokokokokokokokokokokokokokokokokokokokokokokokokokokokokokokokokokokokokok
> While we do this, can somebody give more background on whether the https endpoint check was working in the first place at any time?

Yes, kube-apiserver offers /readyz over HTTPS only. To my knowledge, AWS and GCP use the /readyz probe over HTTPS.
Looks like installer#3600 addressed installer-provisioned Azure. Are we spinning off a new bug for vSphere (which, per comment 0, was also not using /readyz) and for user-provisioned Azure (also called out in comment 0)?
(In reply to W. Trevor King from comment #13)
> Looks like installer#3600 addressed installer-provisioned Azure. Are we
> spinning off a new bug for vSphere (which, per comment 0, was also not using
> /readyz) and for user-provisioned Azure (also called out in comment 0)?

It would be a good idea. I will only be able to verify the fix for IPI on Azure.
Spun off into: * Bug 1836016: user-provisioned Azure * Bug 1836017: user-provisioned vSphere * Bug 1836018: user-provisioned AWS
Thanks Trevor! This Azure IPI bug fix also depends on https://bugzilla.redhat.com/show_bug.cgi?id=1831760 and it would be best to test those together for any backport.
> This Azure IPI bug fix also depends on https://bugzilla.redhat.com/show_bug.cgi?id=1831760 and it would be best to test those together for any backport. Bug 1832137 is already ON_QA in 4.4, so we should be good to go there. I dunno if these are going to go all the way back to 4.3 or not.
Verified with:
https://openshift-release.svc.ci.openshift.org/releasestream/4.5.0-0.nightly/release/4.5.0-0.nightly-2020-05-14-231228

The cluster was installed without issue and I did not observe any visible problems. Load balancer probes seen after installation:

Internal:

  "name": "sint-probe",
  "numberOfProbes": 2,
  "port": 22623,
  "protocol": "Https",
  "provisioningState": "Succeeded",
  "requestPath": "/healthz",
  "resourceGroup": "qeipi-98s24-rg",
  "type": "Microsoft.Network/loadBalancers/probes"

  "name": "api-internal-probe",
  "numberOfProbes": 2,
  "port": 6443,
  "protocol": "Https",
  "provisioningState": "Succeeded",
  "requestPath": "/readyz",
  "resourceGroup": "qeipi-98s24-rg",
  "type": "Microsoft.Network/loadBalancers/probes"

Public:

  "name": "api-internal-probe",
  "numberOfProbes": 2,
  "port": 6443,
  "protocol": "Https",
  "provisioningState": "Succeeded",
  "requestPath": "/readyz",
  "resourceGroup": "qeipi-98s24-rg",
  "type": "Microsoft.Network/loadBalancers/probes"
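For anyone verifying this on other clusters: the probe JSON above (as returned by, e.g., "az network lb probe list") can be checked mechanically. This is a small illustrative script, not part of the fix; the JSON fragment below is a trimmed hypothetical sample in the same shape as the output above:

```python
import json

# Hypothetical trimmed probe list in the same shape as the output above.
probes_json = """
[
  {"name": "sint-probe", "port": 22623, "protocol": "Https", "requestPath": "/healthz"},
  {"name": "api-internal-probe", "port": 6443, "protocol": "Https", "requestPath": "/readyz"}
]
"""

def bad_apiserver_probes(probes):
    """Return names of probes on the kube-apiserver port (6443) that are
    not HTTPS checks against /readyz."""
    return [
        p["name"] for p in probes
        if p.get("port") == 6443
        and not (p.get("protocol") == "Https" and p.get("requestPath") == "/readyz")
    ]

probes = json.loads(probes_json)
print(bad_apiserver_probes(probes))  # -> [] means all 6443 probes use HTTPS /readyz
```

An empty list means the fix is in place; a pre-fix cluster would list its default-TCP 6443 probes here.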
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHBA-2020:2409