+++ This bug was initially created as a clone of Bug #1828382 +++

Description of problem:

Load balancers (both internal and external) for the kube-apiserver should use /readyz for back-end health checks. After scanning the GitHub repo, we found that the following platforms do not use /readyz for back-end health checks.

Azure:
https://github.com/openshift/installer/blob/master/data/data/azure/vnet/internal-lb.tf#L138
https://github.com/openshift/installer/blob/master/data/data/azure/vnet/public-lb.tf#L164
https://github.com/openshift/installer/blob/master/upi/azure/03_infra.json#L101
https://github.com/openshift/installer/blob/master/upi/azure/03_infra.json#L184

vSphere:
https://github.com/openshift/installer/blob/master/upi/vsphere/lb/haproxy.tmpl#L33

Please also investigate the following; we are not sure whether it needs to be addressed:
https://github.com/openshift/installer/blob/master/upi/aws/cloudformation/02_cluster_infra.yaml#L179

The above health checks use the default TCP rule, which breaks graceful termination. When the apiserver receives a termination signal, /readyz starts reporting failure. This gives the load balancer a chance to detect an instance that is rolling out and take appropriate action. For consistency, we should use the AWS load balancer rules as a reference:
https://github.com/openshift/installer/blob/master/data/data/aws/vpc/master-elb.tf#L87

  health_check {
    healthy_threshold   = 2
    unhealthy_threshold = 2
    interval            = 10
    port                = 6443
    protocol            = "HTTPS"
    path                = "/readyz"
  }

Version-Release number of the following components:
OpenShift 4.5

How reproducible:
Always

Steps to Reproduce:
1. Run an upgrade job on the specified infrastructure. You will see that the load balancer keeps sending requests to an apiserver while it is down.

Actual results:
An example: https://prow.svc.ci.openshift.org/view/gcs/origin-ci-test/logs/release-openshift-ocp-installer-e2e-azure-serial-4.4/1542

Apr 26 21:45:14.063: INFO: Unexpected error listing nodes: Get https://api.ci-op-hvqc8pgw-0ba00.ci.azure.devcluster.openshift.com:6443/api/v1/nodes?fieldSelector=spec.unschedulable%3Dfalse&resourceVersion=0: dial tcp 52.154.154.255:6443: connect: connection refused

Expected results:
The load balancer should identify the kube-apiserver instance that is being rolled out in time and forward requests to another instance that is still serving. Ideally, we should not see any "dial tcp 52.154.154.255:6443: connect: connection refused" errors while kube-apiserver is being rolled out.

Bug 1828382 handled installer-provisioned Azure. This clone is for the user-provisioned AWS recommendations. Dropping severity down to medium, because the user-provisioned recommendations suggest approaches rather than telling the user exactly what they must do.
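For the user-provisioned AWS CloudFormation path, the Terraform settings above would translate into the target group roughly as in the sketch below. This is a minimal sketch, not the exact contents of 02_cluster_infra.yaml: the resource name InternalApiTargetGroup and the VpcId parameter are illustrative.

  # Minimal sketch of a target group whose health check probes /readyz over
  # HTTPS while the listener still forwards raw TCP to port 6443.
  # Resource and parameter names are illustrative, not taken from the
  # actual 02_cluster_infra.yaml.
  InternalApiTargetGroup:
    Type: AWS::ElasticLoadBalancingV2::TargetGroup
    Properties:
      Port: 6443
      Protocol: TCP                 # NLB passes the TLS traffic through untouched
      TargetType: ip
      VpcId: !Ref VpcId             # illustrative parameter reference
      HealthCheckProtocol: HTTPS    # probe speaks HTTPS instead of a bare TCP connect
      HealthCheckPath: "/readyz"
      HealthCheckPort: "6443"
      HealthCheckIntervalSeconds: 10
      HealthyThresholdCount: 2
      UnhealthyThresholdCount: 2

Note that NLB health checks do not validate the serving certificate, so the apiserver's internally signed certificate should not interfere with the HTTPS probe.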
Won't be able to get to it this sprint.
Created attachment 1696713 [details] install log
> https://github.com/openshift/installer/pull/3709/files#diff-c60d83f04b0b1233e1739f06c356299fR221

I think that one should have been /healthz:
https://github.com/openshift/installer/blob/12320ec6536b5145ef49c825205e4d487e0f3c4d/data/data/aws/vpc/master-elb.tf#L114-L120
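Going by the Terraform reference above, the port 22623 (machine config server) endpoint serves /healthz rather than /readyz, so a sketch of the corresponding CloudFormation target group would differ only in the port and path. Again, the resource name is illustrative, not the exact one in the template:

  # Sketch of the machine config server target group; port 22623 serves
  # /healthz, not /readyz, so the path differs from the apiserver group.
  # "InternalServiceTargetGroup" is an illustrative name.
  InternalServiceTargetGroup:
    Type: AWS::ElasticLoadBalancingV2::TargetGroup
    Properties:
      Port: 22623
      Protocol: TCP
      TargetType: ip
      VpcId: !Ref VpcId             # illustrative parameter reference
      HealthCheckProtocol: HTTPS
      HealthCheckPath: "/healthz"   # the MCS does not expose /readyz
      HealthCheckPort: "22623"
      HealthCheckIntervalSeconds: 10
      HealthyThresholdCount: 2
      UnhealthyThresholdCount: 2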
Verified on 4.6.0-0.nightly-2020-06-16-214732: FAILED. Same problem as described in comment #5 of bug 1836018: https://bugzilla.redhat.com/show_bug.cgi?id=1836018#c5
Can you provide a must-gather and info on the AWS target groups, like the backends, health probe settings, etc.?
After several attempts to resolve this issue, an adequate fix has not yet been identified. Flagging with UpcomingSprint for review.
Hello Abhinav, the following is the target group info; the log-bundle is attached. The vpc-id is vpc-00bb6a475a9898cd2. Let me know if you need further information.

1. Checking the health check settings:

aws elbv2 describe-target-groups --names yunji-Inter-F33KVHQZNMLM
{
    "TargetGroups": [
        {
            "TargetGroupArn": "arn:aws:elasticloadbalancing:us-east-2:301721915996:targetgroup/yunji-Inter-F33KVHQZNMLM/0a5d717106179edd",
            "TargetGroupName": "yunji-Inter-F33KVHQZNMLM",
            "Protocol": "TCP",
            "Port": 22623,
            "VpcId": "vpc-00bb6a475a9898cd2",
            "HealthCheckProtocol": "HTTPS",
            "HealthCheckPort": "22623",
            "HealthCheckEnabled": true,
            "HealthCheckIntervalSeconds": 10,
            "HealthCheckTimeoutSeconds": 10,
            "HealthyThresholdCount": 2,
            "UnhealthyThresholdCount": 2,
            "HealthCheckPath": "/readyz",
            "Matcher": {
                "HttpCode": "200-399"
            },
            "LoadBalancerArns": [
                "arn:aws:elasticloadbalancing:us-east-2:301721915996:loadbalancer/net/yunjiang-22bz0185-c4wkt-int/bd15404005f6e841"
            ],
            "TargetType": "ip"
        }
    ]
}

2. Checking the health status:

aws elbv2 describe-target-health --target-group-arn arn:aws:elasticloadbalancing:us-east-2:301721915996:targetgroup/yunji-Inter-F33KVHQZNMLM/0a5d717106179edd | jq .TargetHealthDescriptions[].TargetHealth.State
"unhealthy"
"unhealthy"
"unhealthy"
> https://github.com/openshift/installer/pull/3757/files#diff-c60d83f04b0b1233e1739f06c356299fR221

We changed the health endpoint, but it looks like you are still using the old CF template.

> aws elbv2 describe-target-groups --names yunji-Inter-F33KVHQZNMLM
> {
>     "TargetGroups": [
>         {
>             "TargetGroupArn": "arn:aws:elasticloadbalancing:us-east-2:301721915996:targetgroup/yunji-Inter-F33KVHQZNMLM/0a5d717106179edd",
>             "TargetGroupName": "yunji-Inter-F33KVHQZNMLM",
>             "Protocol": "TCP",
>             "Port": 22623,
>             "VpcId": "vpc-00bb6a475a9898cd2",
>             "HealthCheckProtocol": "HTTPS",
>             "HealthCheckPort": "22623",
>             "HealthCheckEnabled": true,
>             "HealthCheckIntervalSeconds": 10,
>             "HealthCheckTimeoutSeconds": 10,
>             "HealthyThresholdCount": 2,
>             "UnhealthyThresholdCount": 2,
>             "HealthCheckPath": "/readyz",
>             "Matcher": {
>                 "HttpCode": "200-399"
>             },
>             "LoadBalancerArns": [
>                 "arn:aws:elasticloadbalancing:us-east-2:301721915996:loadbalancer/net/yunjiang-22bz0185-c4wkt-int/bd15404005f6e841"
>             ],
>             "TargetType": "ip"
>         }
>     ]
> }

Please use the latest CF template and try again.
Verified on registry.svc.ci.openshift.org/ocp/release:4.6.0-0.ci-2020-06-28-020831: PASS.
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (OpenShift Container Platform 4.6 GA Images), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHBA-2020:4196