Bug 1970231
| Summary: | Google Cloud is not reflecting correct backends information for load balancer services of the OCP cluster | | |
|---|---|---|---|
| Product: | OpenShift Container Platform | Reporter: | Pamela Escorza <pescorza> |
| Component: | Cloud Compute | Assignee: | Joel Speed <jspeed> |
| Cloud Compute sub component: | Cloud Controller Manager | QA Contact: | sunzhaohua <zhsun> |
| Status: | CLOSED NOTABUG | Docs Contact: | |
| Severity: | low | | |
| Priority: | unspecified | CC: | aos-bugs |
| Version: | 4.6 | | |
| Target Milestone: | --- | | |
| Target Release: | --- | | |
| Hardware: | All | | |
| OS: | All | | |
| Whiteboard: | | | |
| Fixed In Version: | | Doc Type: | If docs needed, set a value |
| Doc Text: | | Story Points: | --- |
| Clone Of: | | Environment: | |
| Last Closed: | 2021-06-14 09:42:34 UTC | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | | | |
| Attachments: | | | |
Created attachment 1789732 [details]
LB health check information
This is part of the design of the core Kubernetes service controllers and not something we are going to be able to change. For some context: it is intended that all instances in a cluster are targets for the load balancer. When a service is created, `externalTrafficPolicy` can be set to make sure that only instances running the router pods actually accept traffic. Those instances pass the health checks, and the instances that don't host the pods fail them. It is then the responsibility of the cloud load balancer to route the traffic appropriately. If you don't set `externalTrafficPolicy`, all instances will accept the traffic and proxy it within the cluster, so there would be no failures in the health checks, but some requests would take an extra hop.

If you want to prevent instances from being registered at all, you can use the node label `node.kubernetes.io/exclude-from-external-load-balancers: ""`, but this excludes the node from ALL services that use load balancers; there is no way to set it for a single service object.

This isn't a bug and, for the most part, is actually beneficial. For example, with 3 pods spread across 6 nodes and all of those nodes registered, if the pods move between nodes they become available on the load balancer faster than if we had to register and deregister targets every time the pods moved.

To be extra clear, this is core Kubernetes design, in use across all of GCP and other platforms too. This is not something we will be able to change.

Customer has closed their support case after receiving the feedback above, closing this one out too.
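For illustration, here is a minimal sketch of the two controls described above using the `oc` CLI; the service, namespace, and node names are hypothetical:

~~~
# Set the traffic policy on a LoadBalancer service so that only nodes hosting
# its pods pass the cloud health checks (service "my-app" in namespace
# "my-namespace" is hypothetical):
$ oc -n my-namespace patch service my-app \
    --type merge -p '{"spec":{"externalTrafficPolicy":"Local"}}'

# Exclude a node from ALL load balancer services (this affects every
# LoadBalancer service, not just one; node "worker-0" is hypothetical):
$ oc label node worker-0 node.kubernetes.io/exclude-from-external-load-balancers=""
~~~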
Description of problem:

When an internal OCP IPI cluster is deployed on Google Cloud, the cloud provider creates load balancers with their respective health checks:

~~~
$ gcloud compute health-checks list
NAME                              PROTOCOL
aa973131c99c242a58428421be18b117  HTTP
ac5159cb260fc427691ed33f57446bdf  HTTP
k8s-9dbb74ab8cd189b6-node         HTTP
ocp-int-79462-api-internal        HTTPS
~~~

One of them is for the ingress default route, in this case "ac5159cb260fc427691ed33f57446bdf". Its health check status remains in warning[0] because it verifies router availability across all the instance groups of the cluster, which are included as backends of the load balancer created by Google Cloud. This doesn't match the information for the default router from the OCP cluster.

Below is the detailed information about the default route load balancer from the Google Cloud Console:

Forwarding rule name: ac5159cb260fc427691ed33f57446bdf
Scope: Regional (europe-west2)
Address: 10.17.0.32:80,443
Protocol: TCP (Internal)
Network Tier: Premium
Load balancer: ac5159cb260fc427691ed33f57446bdf

Regional backend service details
ac5159cb260fc427691ed33f57446bdf {"kubernetes.io/service-name":"openshift-ingress/router-default"}
- General properties
Region: europe-west2
Protocol: TCP
Session affinity: None
In use by: ac5159cb260fc427691ed33f57446bdf
Backends:
ocp-int-79462-master-europe-west2-c
ocp-int-79462-master-europe-west2-b
ocp-int-79462-master-europe-west2-a
k8s-ig--9dbb74ab8cd189b6
k8s-ig--9dbb74ab8cd189b6
k8s-ig--9dbb74ab8cd189b6

Information from the OCP cluster:

$ oc describe service router-default -n openshift-ingress
Name:                     router-default
Namespace:                openshift-ingress
Labels:                   app=router
                          ingresscontroller.operator.openshift.io/owning-ingresscontroller=default
                          router=router-default
Annotations:              cloud.google.com/load-balancer-type: Internal
Selector:                 ingresscontroller.operator.openshift.io/deployment-ingresscontroller=default
Type:                     LoadBalancer
IP:                       172.30.114.68
LoadBalancer Ingress:     10.17.0.32
Port:                     http  80/TCP
TargetPort:               http/TCP
NodePort:                 http  30389/TCP
Endpoints:                10.153.4.4:80,10.155.4.4:80
Port:                     https  443/TCP
TargetPort:               https/TCP
NodePort:                 https  30406/TCP
Endpoints:                10.153.4.4:443,10.155.4.4:443
Session Affinity:         None
External Traffic Policy:  Local
HealthCheck NodePort:     32265
Events:                   <none>

$ oc -n openshift-ingress get svc
NAME                      TYPE           CLUSTER-IP      EXTERNAL-IP   PORT(S)                      AGE
router-default            LoadBalancer   172.30.114.68   10.17.0.32    80:30389/TCP,443:30406/TCP   25h
router-internal-default   ClusterIP      172.30.41.13    <none>        80/TCP,443/TCP,1936/TCP      25h

Taints and tolerations are configured so that the default router runs on the infra nodes:

$ oc get ingresscontroller/default -n openshift-ingress-operator -o jsonpath='{.spec.nodePlacement}' | jq -r
{
  "nodeSelector": {
    "matchLabels": {
      "node-role.kubernetes.io/infra": ""
    }
  },
  "tolerations": [
    {
      "effect": "NoSchedule",
      "key": "infra",
      "value": "reserved"
    },
    {
      "effect": "NoExecute",
      "key": "infra",
      "value": "reserved"
    }
  ]
}

This bug is opened to request information on how to remediate this warning: is there any annotation that can be sent to the Cloud Controller Manager in order to reflect the correct configuration for the load balancer? Manual modification of the load balancer details from the GCP Console is not preserved. There is also a feature gate that Kubernetes offers to disable the HTTP load balancer verification, but it does not seem helpful, as it would affect the HTTPS load balancer services that the customer is using for their applications.
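One way to see why the health check reports a warning rather than an outright failure is to probe the HealthCheck NodePort shown above (32265): with `externalTrafficPolicy: Local`, kube-proxy serves a health endpoint on that port and only answers 200 on nodes that actually host a router pod, 503 elsewhere. The node IPs below are placeholders:

~~~
# Node running a router pod -> 200
$ curl -s -o /dev/null -w '%{http_code}\n' http://<infra-node-ip>:32265/healthz
# Node without a router pod -> 503 (seen by GCP as an unhealthy backend)
$ curl -s -o /dev/null -w '%{http_code}\n' http://<worker-node-ip>:32265/healthz
~~~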
The customer's OCP IPI clusters are published as Internal, but the behavior is the same for clusters published as External. The customer has also opened a case with Google Cloud Support to obtain further information about this.

Version-Release number of selected component (if applicable):
OpenShift 4.6 IPI on Google Cloud

How reproducible:
Deploy an IPI OCP 4.6 cluster on GCP.

Steps to Reproduce:
1. Once the cluster is installed, go to the Load Balancer section of the Google Cloud Console and verify the load balancer created for the default router.

Actual results:
Wrong assignment of the backends for the default router load balancer created by the cloud provider: all instances of the cluster are included as backends when it should be only the instances hosting the endpoints of the default router service.

Expected results:
Correct information about the HTTP load balancer backends in the Google Cloud Console, based on the information provided by the OCP cluster.

Additional info:
[0] Pictures 1 and 2 of the PDF attached to this bug
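As an alternative to the Console, the registered backends and their per-instance health can also be inspected from the CLI; the backend service name and region below are taken from the description above:

~~~
# List the instance groups registered as backends of the router's backend service
$ gcloud compute backend-services describe ac5159cb260fc427691ed33f57446bdf \
    --region europe-west2 --format='value(backends[].group)'

# Show per-instance health for those backends (healthy = node with a router pod)
$ gcloud compute backend-services get-health ac5159cb260fc427691ed33f57446bdf \
    --region europe-west2
~~~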