+++ This bug was initially created as a clone of Bug #1828382 +++

Description of problem:

Load balancers (both internal and external) for the kube-apiserver should use /readyz for back-end health checks. After scanning the GitHub repo, we found that the following platforms do not use /readyz for back-end health checks.

Azure:
https://github.com/openshift/installer/blob/master/data/data/azure/vnet/internal-lb.tf#L138
https://github.com/openshift/installer/blob/master/data/data/azure/vnet/public-lb.tf#L164
https://github.com/openshift/installer/blob/master/upi/azure/03_infra.json#L101
https://github.com/openshift/installer/blob/master/upi/azure/03_infra.json#L184

vSphere:
https://github.com/openshift/installer/blob/master/upi/vsphere/lb/haproxy.tmpl#L33

Please also investigate the following; we are not sure whether it needs to be addressed:
https://github.com/openshift/installer/blob/master/upi/aws/cloudformation/02_cluster_infra.yaml#L179

The above health checks use the default TCP rule, which breaks graceful termination. When the apiserver receives a termination signal, /readyz starts reporting failure. This gives the load balancer a chance to detect an instance that is rolling out and take appropriate action. For consistency, we should use the AWS load balancer rules as a reference:
https://github.com/openshift/installer/blob/master/data/data/aws/vpc/master-elb.tf#L87

  health_check {
    healthy_threshold   = 2
    unhealthy_threshold = 2
    interval            = 10
    port                = 6443
    protocol            = "HTTPS"
    path                = "/readyz"
  }

Version-Release number of the following components:
OpenShift 4.5

How reproducible:
Always

Steps to Reproduce:
1. Run an upgrade job on the specified infrastructure. You will see that the load balancer keeps sending requests to an apiserver while it is down.

Actual results:
An example: https://prow.svc.ci.openshift.org/view/gcs/origin-ci-test/logs/release-openshift-ocp-installer-e2e-azure-serial-4.4/1542

Apr 26 21:45:14.063: INFO: Unexpected error listing nodes: Get https://api.ci-op-hvqc8pgw-0ba00.ci.azure.devcluster.openshift.com:6443/api/v1/nodes?fieldSelector=spec.unschedulable%3Dfalse&resourceVersion=0: dial tcp 52.154.154.255:6443: connect: connection refused

Expected results:
The load balancer should identify the kube-apiserver instance that is being rolled out in time and forward requests to another instance that is still serving. Ideally, we should not see any "dial tcp 52.154.154.255:6443: connect: connection refused" errors while kube-apiserver is being rolled out.

Bug 1828382 handled installer-provisioned Azure. This clone is for the user-provisioned AWS recommendations. Dropping severity down to medium, because the user-provisioned recommendations suggest approaches rather than telling the user exactly what they must do.
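For the user-provisioned AWS CloudFormation path, the Terraform settings above would translate into the target group roughly as in the sketch below. This is a minimal sketch, not the exact contents of 02_cluster_infra.yaml: the resource name InternalApiTargetGroup and the VpcId parameter are illustrative.

  # Minimal sketch of a target group whose health check probes /readyz over
  # HTTPS while the listener still forwards raw TCP to port 6443.
  # Resource and parameter names are illustrative, not taken from the
  # actual 02_cluster_infra.yaml.
  InternalApiTargetGroup:
    Type: AWS::ElasticLoadBalancingV2::TargetGroup
    Properties:
      Port: 6443
      Protocol: TCP                 # NLB passes the TLS traffic through untouched
      TargetType: ip
      VpcId: !Ref VpcId             # illustrative parameter reference
      HealthCheckProtocol: HTTPS    # probe speaks HTTPS instead of a bare TCP connect
      HealthCheckPath: "/readyz"
      HealthCheckPort: "6443"
      HealthCheckIntervalSeconds: 10
      HealthyThresholdCount: 2
      UnhealthyThresholdCount: 2

Note that NLB health checks do not validate the serving certificate, so the apiserver's internally signed certificate should not interfere with the HTTPS probe.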
Won't be able to get to it this sprint.
Created attachment 1696713 [details] install log
> https://github.com/openshift/installer/pull/3709/files#diff-c60d83f04b0b1233e1739f06c356299fR221

I think that one should have been /healthz:
https://github.com/openshift/installer/blob/12320ec6536b5145ef49c825205e4d487e0f3c4d/data/data/aws/vpc/master-elb.tf#L114-L120
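Going by the Terraform reference above, the port 22623 (machine config server) endpoint serves /healthz rather than /readyz, so a sketch of the corresponding CloudFormation target group would differ only in the port and path. Again, the resource name is illustrative, not the exact one in the template:

  # Sketch of the machine config server target group; port 22623 serves
  # /healthz, not /readyz, so the path differs from the apiserver group.
  # "InternalServiceTargetGroup" is an illustrative name.
  InternalServiceTargetGroup:
    Type: AWS::ElasticLoadBalancingV2::TargetGroup
    Properties:
      Port: 22623
      Protocol: TCP
      TargetType: ip
      VpcId: !Ref VpcId             # illustrative parameter reference
      HealthCheckProtocol: HTTPS
      HealthCheckPath: "/healthz"   # the MCS does not expose /readyz
      HealthCheckPort: "22623"
      HealthCheckIntervalSeconds: 10
      HealthyThresholdCount: 2
      UnhealthyThresholdCount: 2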
Verified on 4.6.0-0.nightly-2020-06-16-214732: FAILED. Same problem as described in comment #5 of bug 1836018: https://bugzilla.redhat.com/show_bug.cgi?id=1836018#c5
Can you provide a must-gather and info on the AWS target groups, like the backends, health probe settings, etc.?
After several attempts to resolve this issue, an adequate fix has not yet been identified. Flagging with UpcomingSprint for review.
Hello Abhinav, the following is the target group info; the log-bundle is attached. The vpc-id is vpc-00bb6a475a9898cd2. Let me know if you need further information.

1. Checking the health check settings:

aws elbv2 describe-target-groups --names yunji-Inter-F33KVHQZNMLM
{
    "TargetGroups": [
        {
            "TargetGroupArn": "arn:aws:elasticloadbalancing:us-east-2:301721915996:targetgroup/yunji-Inter-F33KVHQZNMLM/0a5d717106179edd",
            "TargetGroupName": "yunji-Inter-F33KVHQZNMLM",
            "Protocol": "TCP",
            "Port": 22623,
            "VpcId": "vpc-00bb6a475a9898cd2",
            "HealthCheckProtocol": "HTTPS",
            "HealthCheckPort": "22623",
            "HealthCheckEnabled": true,
            "HealthCheckIntervalSeconds": 10,
            "HealthCheckTimeoutSeconds": 10,
            "HealthyThresholdCount": 2,
            "UnhealthyThresholdCount": 2,
            "HealthCheckPath": "/readyz",
            "Matcher": {
                "HttpCode": "200-399"
            },
            "LoadBalancerArns": [
                "arn:aws:elasticloadbalancing:us-east-2:301721915996:loadbalancer/net/yunjiang-22bz0185-c4wkt-int/bd15404005f6e841"
            ],
            "TargetType": "ip"
        }
    ]
}

2. Checking the health status:

aws elbv2 describe-target-health --target-group-arn arn:aws:elasticloadbalancing:us-east-2:301721915996:targetgroup/yunji-Inter-F33KVHQZNMLM/0a5d717106179edd | jq .TargetHealthDescriptions[].TargetHealth.State
"unhealthy"
"unhealthy"
"unhealthy"
> https://github.com/openshift/installer/pull/3757/files#diff-c60d83f04b0b1233e1739f06c356299fR221

We changed the health endpoint, but it looks like you are still using the old CF template.

> aws elbv2 describe-target-groups --names yunji-Inter-F33KVHQZNMLM
> {
>     "TargetGroups": [
>         {
>             "TargetGroupArn": "arn:aws:elasticloadbalancing:us-east-2:301721915996:targetgroup/yunji-Inter-F33KVHQZNMLM/0a5d717106179edd",
>             "TargetGroupName": "yunji-Inter-F33KVHQZNMLM",
>             "Protocol": "TCP",
>             "Port": 22623,
>             "VpcId": "vpc-00bb6a475a9898cd2",
>             "HealthCheckProtocol": "HTTPS",
>             "HealthCheckPort": "22623",
>             "HealthCheckEnabled": true,
>             "HealthCheckIntervalSeconds": 10,
>             "HealthCheckTimeoutSeconds": 10,
>             "HealthyThresholdCount": 2,
>             "UnhealthyThresholdCount": 2,
>             "HealthCheckPath": "/readyz",
>             "Matcher": {
>                 "HttpCode": "200-399"
>             },
>             "LoadBalancerArns": [
>                 "arn:aws:elasticloadbalancing:us-east-2:301721915996:loadbalancer/net/yunjiang-22bz0185-c4wkt-int/bd15404005f6e841"
>             ],
>             "TargetType": "ip"
>         }
>     ]
> }

Please use the latest CF template and try again.
Verified on registry.svc.ci.openshift.org/ocp/release:4.6.0-0.ci-2020-06-28-020831: PASS.
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (OpenShift Container Platform 4.6 GA Images), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHBA-2020:4196