Bug 1821671

Summary: GCP: the load balancer is not ready when installing private cluster in GCP
Product: OpenShift Container Platform Reporter: Hongan Li <hongli>
Component: Networking Assignee: Daneyon Hansen <dhansen>
Networking sub component: router QA Contact: Hongan Li <hongli>
Status: CLOSED ERRATA Docs Contact:
Severity: urgent    
Priority: high CC: amcdermo, aos-bugs, dhansen, mfojtik, yanyang
Version: 4.5 Keywords: TestBlocker
Target Milestone: ---   
Target Release: 4.5.0   
Hardware: Unspecified   
OS: Unspecified   
Whiteboard:
Fixed In Version: Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of: Environment:
Last Closed: 2020-07-13 17:26:04 UTC Type: Bug
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: --- RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Embargoed:
Bug Depends On:    
Bug Blocks: 1816806    

Description Hongan Li 2020-04-07 12:01:13 UTC
Description of problem:
The load balancer for apps is not ready when installing a private cluster on GCP, and the events show:
  Warning  SyncLoadBalancerFailed  42m (x19 over 107m)    service-controller  Error syncing load balancer: failed to ensure load balancer: services "router-default" is forbidden: User "system:serviceaccount:kube-system:cloud-provider" cannot patch resource "services/status" in API group "" in the namespace "openshift-ingress"


Version-Release number of selected component (if applicable):
4.5.0-0.nightly-2020-04-07-06214

How reproducible:
100%

Steps to Reproduce:
1. Install a private cluster on GCP.
2. Check the ingress operator.

Actual results:
1. The installation fails, with some operators degraded.
2. 
$ oc -n openshift-ingress-operator get ingresscontroller/default -o yaml
<---snip--->
spec:
  endpointPublishingStrategy:
    loadBalancer:
      scope: Internal
    type: LoadBalancerService
status:
  availableReplicas: 2
  conditions:
  - lastTransitionTime: "2020-04-07T09:50:56Z"
    reason: Valid
    status: "True"
    type: Admitted
  - lastTransitionTime: "2020-04-07T09:55:30Z"
    status: "True"
    type: Available
  - lastTransitionTime: "2020-04-07T09:55:30Z"
    message: The deployment has Available status condition set to True
    reason: DeploymentAvailable
    status: "False"
    type: DeploymentDegraded
  - lastTransitionTime: "2020-04-07T09:51:00Z"
    message: The endpoint publishing strategy supports a managed load balancer
    reason: WantedByEndpointPublishingStrategy
    status: "True"
    type: LoadBalancerManaged
  - lastTransitionTime: "2020-04-07T09:51:00Z"
    message: |-
      The service-controller component is reporting SyncLoadBalancerFailed events like: Error syncing load balancer: failed to ensure load balancer: services "router-default" is forbidden: User "system:serviceaccount:kube-system:cloud-provider" cannot patch resource "services/status" in API group "" in the namespace "openshift-ingress"
      The kube-controller-manager logs may contain more details.
    reason: SyncLoadBalancerFailed
    status: "False"
    type: LoadBalancerReady
  - lastTransitionTime: "2020-04-07T09:51:00Z"
    message: DNS management is supported and zones are specified in the cluster DNS
      config.
    reason: Normal
    status: "True"
    type: DNSManaged
  - lastTransitionTime: "2020-04-07T09:51:00Z"
    message: The wildcard record resource was not found.
    reason: RecordNotFound
    status: "False"
    type: DNSReady
  - lastTransitionTime: "2020-04-07T09:55:30Z"
    message: 'One or more other status conditions indicate a degraded state: LoadBalancerReady=False'
    reason: DegradedConditions
    status: "True"
    type: Degraded

$ oc -n openshift-ingress get svc
NAME                      TYPE           CLUSTER-IP       EXTERNAL-IP   PORT(S)                      AGE
router-default            LoadBalancer   172.30.202.91    <pending>     80:32341/TCP,443:32574/TCP   114m
router-internal-default   ClusterIP      172.30.191.124   <none>        80/TCP,443/TCP,1936/TCP      114m

$ oc -n openshift-ingress describe svc router-default
Name:                     router-default
Namespace:                openshift-ingress
Labels:                   app=router
                          ingresscontroller.operator.openshift.io/owning-ingresscontroller=default
                          router=router-default
Annotations:              cloud.google.com/load-balancer-type: Internal
Selector:                 ingresscontroller.operator.openshift.io/deployment-ingresscontroller=default
Type:                     LoadBalancer
IP:                       172.30.202.91
Port:                     http  80/TCP
TargetPort:               http/TCP
NodePort:                 http  32341/TCP
Endpoints:                10.128.2.9:80,10.131.0.8:80
Port:                     https  443/TCP
TargetPort:               https/TCP
NodePort:                 https  32574/TCP
Endpoints:                10.128.2.9:443,10.131.0.8:443
Session Affinity:         None
External Traffic Policy:  Local
HealthCheck NodePort:     30684
Events:
  Type     Reason                  Age                    From                Message
  ----     ------                  ----                   ----                -------
  Normal   EnsuringLoadBalancer    113m (x5 over 114m)    service-controller  Ensuring load balancer
  Warning  SyncLoadBalancerFailed  113m (x5 over 114m)    service-controller  Error syncing load balancer: failed to ensure load balancer: services "router-default" is forbidden: User "system:serviceaccount:kube-system:cloud-provider" cannot patch resource "services/status" in API group "" in the namespace "openshift-ingress"
  Normal   EnsuringLoadBalancer    111m (x4 over 112m)    service-controller  Ensuring load balancer
  Warning  SyncLoadBalancerFailed  111m (x4 over 112m)    service-controller  Error syncing load balancer: failed to ensure load balancer: services "router-default" is forbidden: User "system:serviceaccount:kube-system:cloud-provider" cannot patch resource "services/status" in API group "" in the namespace "openshift-ingress"
  Normal   EnsuringLoadBalancer    108m (x6 over 110m)    service-controller  Ensuring load balancer
  Warning  SyncLoadBalancerFailed  108m (x6 over 110m)    service-controller  Error syncing load balancer: failed to ensure load balancer: services "router-default" is forbidden: User "system:serviceaccount:kube-system:cloud-provider" cannot patch resource "services/status" in API group "" in the namespace "openshift-ingress"
  Warning  SyncLoadBalancerFailed  42m (x19 over 107m)    service-controller  Error syncing load balancer: failed to ensure load balancer: services "router-default" is forbidden: User "system:serviceaccount:kube-system:cloud-provider" cannot patch resource "services/status" in API group "" in the namespace "openshift-ingress"
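
For triage, the denied request can be pulled out of the event text above. The following is a minimal sketch, not part of the original report: the helper names `parse_forbidden` and `can_i_args` are hypothetical, and the regular expression assumes the message keeps the wording shown in these events. The second helper only assembles an argument list for `oc auth can-i`; it does not run anything.

```python
import re

# Matches the "forbidden" wording seen in the SyncLoadBalancerFailed events.
# Assumption: the wording is the same as in the events quoted in this report.
FORBIDDEN_RE = re.compile(
    r'User "(?P<user>[^"]+)" cannot (?P<verb>\S+) resource "(?P<resource>[^"]+)" '
    r'in API group "(?P<group>[^"]*)" in the namespace "(?P<namespace>[^"]+)"'
)


def parse_forbidden(message):
    """Return the denied user, verb, resource, subresource, group and namespace,
    or None if the message does not match the expected wording."""
    match = FORBIDDEN_RE.search(message)
    if match is None:
        return None
    fields = match.groupdict()
    # "services/status" names the "status" subresource of "services".
    resource, _, subresource = fields["resource"].partition("/")
    fields["resource"] = resource
    fields["subresource"] = subresource
    return fields


def can_i_args(fields):
    """Build (but do not run) an `oc auth can-i` argument list for the request."""
    args = ["oc", "auth", "can-i", fields["verb"], fields["resource"]]
    if fields["subresource"]:
        args.append("--subresource=" + fields["subresource"])
    args += ["--as=" + fields["user"], "-n", fields["namespace"]]
    return args
```

Passing the event message to `parse_forbidden` and the result to `can_i_args` gives a command line that could be run by hand to check whether the service account has the permission the service-controller is asking for.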


Expected results:
The load balancer should be ready.

Additional info:
Installing a 4.5 private cluster on AWS and on Azure succeeds.

Comment 1 Daneyon Hansen 2020-04-08 16:20:37 UTC
It appears that the kube-apiserver does not have the permissions to update status for the service named "router-default". Reassigning to the apiserver team for further investigation.

Comment 2 Yang Yang 2020-04-28 02:13:37 UTC
Adding testblocker as it blocks GCP private cluster installation.

Comment 3 Stefan Schimanski 2020-05-06 11:28:28 UTC
As far as I can see from the logs and comments, this is not an apiserver issue. The ingress controller watches events from the service controller in kube-controller-manager, which in turn also falls under the responsibility of the edge team in the case of load balancer services. Hence, moving over to the edge team.

Comment 4 Andrew McDermott 2020-05-07 15:55:26 UTC
Assigning to Dane to take another look.

Comment 6 Andrew McDermott 2020-05-21 16:34:01 UTC
Marking as urgent as this will be a release blocker unless the k-c-m fix merges.

Comment 10 Yang Yang 2020-05-27 09:30:30 UTC
A GCP private cluster installs successfully with 4.5.0-0.nightly-2020-05-27-075521.

# oc -n openshift-ingress get svc
NAME                      TYPE           CLUSTER-IP    EXTERNAL-IP   PORT(S)                      AGE
router-default            LoadBalancer   172.30.9.22   10.0.32.58    80:31552/TCP,443:30302/TCP   34m
router-internal-default   ClusterIP      172.30.80.8   <none>        80/TCP,443/TCP,1936/TCP      34m
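
The same check can be scripted against the table output. This is a rough sketch under stated assumptions: the function name `pending_load_balancers` is hypothetical, and it assumes the default `oc get svc` column layout shown here, with no empty cells.

```python
def pending_load_balancers(table):
    """Return the names of LoadBalancer services whose EXTERNAL-IP is <pending>.

    `table` is the plain-text output of `oc get svc`, header line included.
    """
    lines = [line for line in table.splitlines() if line.strip()]
    header = lines[0].split()
    type_col = header.index("TYPE")
    ip_col = header.index("EXTERNAL-IP")
    pending = []
    for line in lines[1:]:
        cols = line.split()
        if cols[type_col] == "LoadBalancer" and cols[ip_col] == "<pending>":
            pending.append(cols[0])
    return pending
```

An empty result for the output above would indicate that router-default has been assigned an address.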

Comment 11 Hongan Li 2020-05-28 01:45:09 UTC
Thanks, yangyang, for the confirmation. Moving to verified.

Comment 12 errata-xmlrpc 2020-07-13 17:26:04 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2020:2409