1821671 – GCP: the load balancer is not ready when installing private cluster in GCP

Summary: GCP: the load balancer is not ready when installing private cluster in GCP

Keywords:
Status:	CLOSED ERRATA
Alias:	None
Product:	OpenShift Container Platform
Classification:	Red Hat
Component:	Networking
Sub Component:
Version:	4.5
Hardware:	Unspecified
OS:	Unspecified
Priority:	high
Severity:	urgent
Target Milestone:	---
Target Release:	4.5.0
Assignee:	Daneyon Hansen
QA Contact:	Hongan Li
Docs Contact:
URL:
Whiteboard:
Depends On:
Blocks:	1816806

Reported:	2020-04-07 12:01 UTC by Hongan Li
Modified:	2022-08-04 22:27 UTC
CC List:	5 users
Fixed In Version:
Doc Type:	If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed:	2020-07-13 17:26:04 UTC
Target Upstream Version:
Embargoed:

Links
System	ID	Private	Priority	Status	Summary	Last Updated
Github	openshift cluster-kube-controller-manager-operator pull 415	0	None	closed	Bug 1821671: Let cloud-provider update services/status on GCP	2021-01-28 12:35:34 UTC
Red Hat Product Errata	RHBA-2020:2409	0	None	None	None	2020-07-13 17:26:31 UTC

Description Hongan Li 2020-04-07 12:01:13 UTC

Description of problem:
The load balancer for apps is not ready when installing a private cluster in GCP, and the events show:
  Warning  SyncLoadBalancerFailed  42m (x19 over 107m)    service-controller  Error syncing load balancer: failed to ensure load balancer: services "router-default" is forbidden: User "system:serviceaccount:kube-system:cloud-provider" cannot patch resource "services/status" in API group "" in the namespace "openshift-ingress"
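
The error comes down to how Kubernetes RBAC treats subresources. `services/status` is matched as its own resource string, so a rule that grants `patch` on `services` does not also grant `patch` on `services/status`. The Python sketch below is a simplified model of RBAC rule matching (it ignores resourceNames, role aggregation, and bindings). The `before`/`after` rule sets are hypothetical and are not the ClusterRole actually changed by the fix in pull 415.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class PolicyRule:
    # Simplified stand-in for an RBAC PolicyRule; resourceNames and nonResourceURLs are omitted.
    api_groups: List[str]
    resources: List[str]
    verbs: List[str]


def _value_matches(values: List[str], wanted: str) -> bool:
    # "*" matches anything; otherwise the value must be listed exactly.
    return "*" in values or wanted in values


def _resource_matches(resources: List[str], resource: str, subresource: str) -> bool:
    combined = f"{resource}/{subresource}" if subresource else resource
    for entry in resources:
        if entry == "*" or entry == combined:
            return True
        # An entry such as "*/status" matches that subresource of every resource.
        if subresource and entry == f"*/{subresource}":
            return True
    return False


def allowed(rules: List[PolicyRule], verb: str, api_group: str,
            resource: str, subresource: str = "") -> bool:
    """Return True if any rule grants `verb` on the given (sub)resource."""
    return any(
        _value_matches(rule.verbs, verb)
        and _value_matches(rule.api_groups, api_group)
        and _resource_matches(rule.resources, resource, subresource)
        for rule in rules
    )


# Hypothetical rule sets; "" is the core API group that services belong to.
before = [PolicyRule([""], ["services"], ["get", "list", "watch", "patch", "update"])]
after = before + [PolicyRule([""], ["services/status"], ["patch", "update"])]

print(allowed(before, "patch", "", "services", "status"))  # False, as in the event above
print(allowed(after, "patch", "", "services", "status"))   # True
```

The subresource must be listed explicitly, either as `services/status` or as the wildcard form `*/status`. Access to the parent resource alone is not enough.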


Version-Release number of selected component (if applicable):
4.5.0-0.nightly-2020-04-07-06214

How reproducible:
100%

Steps to Reproduce:
1. Install a private cluster in GCP.
2. Check the ingress operator.

Actual results:
1. The installation fails, and some operators are degraded.
2. 
$ oc -n openshift-ingress-operator get ingresscontroller/default -o yaml
<---snip--->
spec:
  endpointPublishingStrategy:
    loadBalancer:
      scope: Internal
    type: LoadBalancerService
status:
  availableReplicas: 2
  conditions:
  - lastTransitionTime: "2020-04-07T09:50:56Z"
    reason: Valid
    status: "True"
    type: Admitted
  - lastTransitionTime: "2020-04-07T09:55:30Z"
    status: "True"
    type: Available
  - lastTransitionTime: "2020-04-07T09:55:30Z"
    message: The deployment has Available status condition set to True
    reason: DeploymentAvailable
    status: "False"
    type: DeploymentDegraded
  - lastTransitionTime: "2020-04-07T09:51:00Z"
    message: The endpoint publishing strategy supports a managed load balancer
    reason: WantedByEndpointPublishingStrategy
    status: "True"
    type: LoadBalancerManaged
  - lastTransitionTime: "2020-04-07T09:51:00Z"
    message: |-
      The service-controller component is reporting SyncLoadBalancerFailed events like: Error syncing load balancer: failed to ensure load balancer: services "router-default" is forbidden: User "system:serviceaccount:kube-system:cloud-provider" cannot patch resource "services/status" in API group "" in the namespace "openshift-ingress"
      The kube-controller-manager logs may contain more details.
    reason: SyncLoadBalancerFailed
    status: "False"
    type: LoadBalancerReady
  - lastTransitionTime: "2020-04-07T09:51:00Z"
    message: DNS management is supported and zones are specified in the cluster DNS
      config.
    reason: Normal
    status: "True"
    type: DNSManaged
  - lastTransitionTime: "2020-04-07T09:51:00Z"
    message: The wildcard record resource was not found.
    reason: RecordNotFound
    status: "False"
    type: DNSReady
  - lastTransitionTime: "2020-04-07T09:55:30Z"
    message: 'One or more other status conditions indicate a degraded state: LoadBalancerReady=False'
    reason: DegradedConditions
    status: "True"
    type: Degraded
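
When triaging status output like the above, it can help to list only the conditions that are not in their healthy state. The following is a minimal Python sketch. It assumes the JSON form of the same object, for example from `oc -n openshift-ingress-operator get ingresscontroller/default -o json`. The healthy-polarity table is an assumption made for illustration and is not taken from the ingress operator's code.

```python
import json

# Assumed polarity of each condition type seen in the status above (illustrative only).
HEALTHY_WHEN_TRUE = {"Admitted", "Available", "LoadBalancerManaged",
                     "LoadBalancerReady", "DNSManaged", "DNSReady"}
HEALTHY_WHEN_FALSE = {"Degraded", "DeploymentDegraded"}


def unhealthy_conditions(ingresscontroller: dict) -> list:
    """Return (type, status, reason) for every condition not in its healthy state."""
    result = []
    for cond in ingresscontroller.get("status", {}).get("conditions", []):
        ctype, status = cond.get("type"), cond.get("status")
        if (ctype in HEALTHY_WHEN_TRUE and status != "True") or (
            ctype in HEALTHY_WHEN_FALSE and status != "False"
        ):
            result.append((ctype, status, cond.get("reason", "")))
    return result


# Trimmed JSON equivalent of part of the status shown above.
sample = json.loads("""
{"status": {"conditions": [
  {"type": "Available", "status": "True"},
  {"type": "DeploymentDegraded", "status": "False", "reason": "DeploymentAvailable"},
  {"type": "LoadBalancerReady", "status": "False", "reason": "SyncLoadBalancerFailed"},
  {"type": "DNSReady", "status": "False", "reason": "RecordNotFound"},
  {"type": "Degraded", "status": "True", "reason": "DegradedConditions"}
]}}
""")

for ctype, status, reason in unhealthy_conditions(sample):
    print(f"{ctype}={status} ({reason})")
```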

$ oc -n openshift-ingress get svc
NAME                      TYPE           CLUSTER-IP       EXTERNAL-IP   PORT(S)                      AGE
router-default            LoadBalancer   172.30.202.91    <pending>     80:32341/TCP,443:32574/TCP   114m
router-internal-default   ClusterIP      172.30.191.124   <none>        80/TCP,443/TCP,1936/TCP      114m
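
For a `LoadBalancer` service, the assigned address is recorded in `status.loadBalancer.ingress`. Until that field is populated, `oc get svc` shows `<pending>` under EXTERNAL-IP. In this bug the cloud provider could not patch `services/status`, so the field stayed empty. The helper below is a hypothetical sketch that reads the address from the service's JSON (for example from `oc -n openshift-ingress get svc router-default -o json`). The "ready" sample uses the address reported after the fix in Comment 10.

```python
import json
from typing import Optional


def external_address(service: dict) -> Optional[str]:
    """Return the first load balancer IP or hostname, or None while it is still <pending>."""
    if service.get("spec", {}).get("type") != "LoadBalancer":
        return None
    ingress = service.get("status", {}).get("loadBalancer", {}).get("ingress") or []
    for entry in ingress:
        addr = entry.get("ip") or entry.get("hostname")
        if addr:
            return addr
    return None


# Service as it looked during the failure: the status was never populated.
pending = json.loads('{"spec": {"type": "LoadBalancer"}, "status": {"loadBalancer": {}}}')
# Service after the fix (address from the verification in Comment 10).
ready = json.loads('{"spec": {"type": "LoadBalancer"}, '
                   '"status": {"loadBalancer": {"ingress": [{"ip": "10.0.32.58"}]}}}')

print(external_address(pending))  # None
print(external_address(ready))     # 10.0.32.58
```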

$ oc -n openshift-ingress describe svc router-default
Name:                     router-default
Namespace:                openshift-ingress
Labels:                   app=router
                          ingresscontroller.operator.openshift.io/owning-ingresscontroller=default
                          router=router-default
Annotations:              cloud.google.com/load-balancer-type: Internal
Selector:                 ingresscontroller.operator.openshift.io/deployment-ingresscontroller=default
Type:                     LoadBalancer
IP:                       172.30.202.91
Port:                     http  80/TCP
TargetPort:               http/TCP
NodePort:                 http  32341/TCP
Endpoints:                10.128.2.9:80,10.131.0.8:80
Port:                     https  443/TCP
TargetPort:               https/TCP
NodePort:                 https  32574/TCP
Endpoints:                10.128.2.9:443,10.131.0.8:443
Session Affinity:         None
External Traffic Policy:  Local
HealthCheck NodePort:     30684
Events:
  Type     Reason                  Age                    From                Message
  ----     ------                  ----                   ----                -------
  Normal   EnsuringLoadBalancer    113m (x5 over 114m)    service-controller  Ensuring load balancer
  Warning  SyncLoadBalancerFailed  113m (x5 over 114m)    service-controller  Error syncing load balancer: failed to ensure load balancer: services "router-default" is forbidden: User "system:serviceaccount:kube-system:cloud-provider" cannot patch resource "services/status" in API group "" in the namespace "openshift-ingress"
  Normal   EnsuringLoadBalancer    111m (x4 over 112m)    service-controller  Ensuring load balancer
  Warning  SyncLoadBalancerFailed  111m (x4 over 112m)    service-controller  Error syncing load balancer: failed to ensure load balancer: services "router-default" is forbidden: User "system:serviceaccount:kube-system:cloud-provider" cannot patch resource "services/status" in API group "" in the namespace "openshift-ingress"
  Normal   EnsuringLoadBalancer    108m (x6 over 110m)    service-controller  Ensuring load balancer
  Warning  SyncLoadBalancerFailed  108m (x6 over 110m)    service-controller  Error syncing load balancer: failed to ensure load balancer: services "router-default" is forbidden: User "system:serviceaccount:kube-system:cloud-provider" cannot patch resource "services/status" in API group "" in the namespace "openshift-ingress"
  Warning  SyncLoadBalancerFailed  42m (x19 over 107m)    service-controller  Error syncing load balancer: failed to ensure load balancer: services "router-default" is forbidden: User "system:serviceaccount:kube-system:cloud-provider" cannot patch resource "services/status" in API group "" in the namespace "openshift-ingress"
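
In the Age column, the `(xN over M)` suffix is the event's `count`, which is the number of times that event was recorded. To total the warnings per reason for one object, a sketch like the following can be run on the JSON from `oc -n openshift-ingress get events -o json`. The sample items are reconstructed from the rows above and trimmed to the fields the function reads. `_event` is a hypothetical helper used only to build that sample.

```python
from collections import Counter


def warning_counts(events: dict, object_name: str) -> Counter:
    """Total the `count` of Warning events per reason for one involved object."""
    totals = Counter()
    for ev in events.get("items", []):
        if ev.get("type") != "Warning":
            continue
        if ev.get("involvedObject", {}).get("name") != object_name:
            continue
        # Treat a missing or zero count as a single occurrence.
        totals[ev.get("reason", "")] += ev.get("count") or 1
    return totals


def _event(etype, reason, count):
    # Hypothetical helper that builds a trimmed Event item for the sample below.
    return {"type": etype, "reason": reason, "count": count,
            "involvedObject": {"kind": "Service", "name": "router-default"}}


sample = {"items": [
    _event("Normal", "EnsuringLoadBalancer", 5),
    _event("Warning", "SyncLoadBalancerFailed", 5),
    _event("Normal", "EnsuringLoadBalancer", 4),
    _event("Warning", "SyncLoadBalancerFailed", 4),
    _event("Normal", "EnsuringLoadBalancer", 6),
    _event("Warning", "SyncLoadBalancerFailed", 6),
    _event("Warning", "SyncLoadBalancerFailed", 19),
]}

print(warning_counts(sample, "router-default"))  # Counter({'SyncLoadBalancerFailed': 34})
```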


Expected results:
The load balancer should be ready.

Additional info:
Installing a 4.5 private cluster in AWS and Azure succeeds.

Comment 1 Daneyon Hansen 2020-04-08 16:20:37 UTC

It appears that the kube-apiserver does not have permission to update the status of the service named "router-default". Reassigning to the apiserver team for further investigation.

Comment 2 Yang Yang 2020-04-28 02:13:37 UTC

Adding testblocker as it blocks GCP private cluster installation.

Comment 3 Stefan Schimanski 2020-05-06 11:28:28 UTC

As far as I can see from the logs and comments, this is not an apiserver issue. The ingress controller watches events from the service controller in kube-controller-manager, which, for load balancer services, also falls under the edge team's responsibility. Hence, moving over to the edge team.

Comment 4 Andrew McDermott 2020-05-07 15:55:26 UTC

Assigning to Dane to take another look.

Comment 6 Andrew McDermott 2020-05-21 16:34:01 UTC

Marking as urgent as this will be a release blocker unless the k-c-m fix merges.

Comment 10 Yang Yang 2020-05-27 09:30:30 UTC

The GCP private cluster was installed successfully with 4.5.0-0.nightly-2020-05-27-075521.

# oc -n openshift-ingress get svc
NAME                      TYPE           CLUSTER-IP    EXTERNAL-IP   PORT(S)                      AGE
router-default            LoadBalancer   172.30.9.22   10.0.32.58    80:31552/TCP,443:30302/TCP   34m
router-internal-default   ClusterIP      172.30.80.8   <none>        80/TCP,443/TCP,1936/TCP      34m

Comment 11 Hongan Li 2020-05-28 01:45:09 UTC

Thanks, yangyang, for the confirmation; moving to VERIFIED.

Comment 12 errata-xmlrpc 2020-07-13 17:26:04 UTC

Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2020:2409
