Description of problem:
The load balancer for apps is not ready when installing a private cluster in GCP, and the events show:

Warning  SyncLoadBalancerFailed  42m (x19 over 107m)  service-controller  Error syncing load balancer: failed to ensure load balancer: services "router-default" is forbidden: User "system:serviceaccount:kube-system:cloud-provider" cannot patch resource "services/status" in API group "" in the namespace "openshift-ingress"

Version-Release number of selected component (if applicable):
4.5.0-0.nightly-2020-04-07-06214

How reproducible:
100%

Steps to Reproduce:
1. Install a private cluster in GCP
2. Check the ingress operator
3.

Actual results:
1. The installation fails with some operators degraded.
2.
$ oc -n openshift-ingress-operator get ingresscontroller/default -o yaml
<---snip--->
spec:
  endpointPublishingStrategy:
    loadBalancer:
      scope: Internal
    type: LoadBalancerService
status:
  availableReplicas: 2
  conditions:
  - lastTransitionTime: "2020-04-07T09:50:56Z"
    reason: Valid
    status: "True"
    type: Admitted
  - lastTransitionTime: "2020-04-07T09:55:30Z"
    status: "True"
    type: Available
  - lastTransitionTime: "2020-04-07T09:55:30Z"
    message: The deployment has Available status condition set to True
    reason: DeploymentAvailable
    status: "False"
    type: DeploymentDegraded
  - lastTransitionTime: "2020-04-07T09:51:00Z"
    message: The endpoint publishing strategy supports a managed load balancer
    reason: WantedByEndpointPublishingStrategy
    status: "True"
    type: LoadBalancerManaged
  - lastTransitionTime: "2020-04-07T09:51:00Z"
    message: |-
      The service-controller component is reporting SyncLoadBalancerFailed events like: Error syncing load balancer: failed to ensure load balancer: services "router-default" is forbidden: User "system:serviceaccount:kube-system:cloud-provider" cannot patch resource "services/status" in API group "" in the namespace "openshift-ingress"
      The kube-controller-manager logs may contain more details.
    reason: SyncLoadBalancerFailed
    status: "False"
    type: LoadBalancerReady
  - lastTransitionTime: "2020-04-07T09:51:00Z"
    message: DNS management is supported and zones are specified in the cluster DNS config.
    reason: Normal
    status: "True"
    type: DNSManaged
  - lastTransitionTime: "2020-04-07T09:51:00Z"
    message: The wildcard record resource was not found.
    reason: RecordNotFound
    status: "False"
    type: DNSReady
  - lastTransitionTime: "2020-04-07T09:55:30Z"
    message: 'One or more other status conditions indicate a degraded state: LoadBalancerReady=False'
    reason: DegradedConditions
    status: "True"
    type: Degraded

$ oc -n openshift-ingress get svc
NAME                      TYPE           CLUSTER-IP       EXTERNAL-IP   PORT(S)                      AGE
router-default            LoadBalancer   172.30.202.91    <pending>     80:32341/TCP,443:32574/TCP   114m
router-internal-default   ClusterIP      172.30.191.124   <none>        80/TCP,443/TCP,1936/TCP      114m

$ oc -n openshift-ingress describe svc router-default
Name:                     router-default
Namespace:                openshift-ingress
Labels:                   app=router
                          ingresscontroller.operator.openshift.io/owning-ingresscontroller=default
                          router=router-default
Annotations:              cloud.google.com/load-balancer-type: Internal
Selector:                 ingresscontroller.operator.openshift.io/deployment-ingresscontroller=default
Type:                     LoadBalancer
IP:                       172.30.202.91
Port:                     http  80/TCP
TargetPort:               http/TCP
NodePort:                 http  32341/TCP
Endpoints:                10.128.2.9:80,10.131.0.8:80
Port:                     https  443/TCP
TargetPort:               https/TCP
NodePort:                 https  32574/TCP
Endpoints:                10.128.2.9:443,10.131.0.8:443
Session Affinity:         None
External Traffic Policy:  Local
HealthCheck NodePort:     30684
Events:
  Type     Reason                  Age                  From                Message
  ----     ------                  ----                 ----                -------
  Normal   EnsuringLoadBalancer    113m (x5 over 114m)  service-controller  Ensuring load balancer
  Warning  SyncLoadBalancerFailed  113m (x5 over 114m)  service-controller  Error syncing load balancer: failed to ensure load balancer: services "router-default" is forbidden: User "system:serviceaccount:kube-system:cloud-provider" cannot patch resource "services/status" in API group "" in the namespace "openshift-ingress"
  Normal   EnsuringLoadBalancer    111m (x4 over 112m)  service-controller  Ensuring load balancer
  Warning  SyncLoadBalancerFailed  111m (x4 over 112m)  service-controller  Error syncing load balancer: failed to ensure load balancer: services "router-default" is forbidden: User "system:serviceaccount:kube-system:cloud-provider" cannot patch resource "services/status" in API group "" in the namespace "openshift-ingress"
  Normal   EnsuringLoadBalancer    108m (x6 over 110m)  service-controller  Ensuring load balancer
  Warning  SyncLoadBalancerFailed  108m (x6 over 110m)  service-controller  Error syncing load balancer: failed to ensure load balancer: services "router-default" is forbidden: User "system:serviceaccount:kube-system:cloud-provider" cannot patch resource "services/status" in API group "" in the namespace "openshift-ingress"
  Warning  SyncLoadBalancerFailed  42m (x19 over 107m)  service-controller  Error syncing load balancer: failed to ensure load balancer: services "router-default" is forbidden: User "system:serviceaccount:kube-system:cloud-provider" cannot patch resource "services/status" in API group "" in the namespace "openshift-ingress"

Expected results:
The load balancer should be ready.

Additional info:
Installing a 4.5 private cluster on AWS and Azure succeeds.
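For triage, it may help to split the "forbidden" message from the events above into its parts (subject, verb, resource, namespace), since these identify which account is missing which permission. The following is only a minimal sketch: the regular expression and the helper name parse_forbidden are illustrative assumptions based on the message text quoted in this report, not part of any OpenShift or Kubernetes tooling.

```python
import re

# Pattern written against the message quoted in this report (an assumption,
# not an official format specification).
FORBIDDEN_RE = re.compile(
    r'(?P<resource_type>\S+) "(?P<name>[^"]+)" is forbidden: '
    r'User "(?P<user>[^"]+)" cannot (?P<verb>\S+) resource "(?P<resource>[^"]+)" '
    r'in API group "(?P<api_group>[^"]*)" in the namespace "(?P<namespace>[^"]+)"'
)


def parse_forbidden(message):
    """Return the fields of a 'forbidden' message as a dict, or None if no match."""
    match = FORBIDDEN_RE.search(message)
    return match.groupdict() if match else None


# Event text copied from the description above.
EVENT = (
    'Error syncing load balancer: failed to ensure load balancer: '
    'services "router-default" is forbidden: User '
    '"system:serviceaccount:kube-system:cloud-provider" cannot patch '
    'resource "services/status" in API group "" in the namespace '
    '"openshift-ingress"'
)

info = parse_forbidden(EVENT)
for key, value in info.items():
    print(f"{key}: {value!r}")
```

Read this way, the denied request comes from the cloud-provider service account in kube-system, and the missing permission is "patch" on the "services/status" subresource in the core API group.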
It appears that the kube-apiserver does not have the permissions to update status for the service named "router-default". Reassigning to the apiserver team for further investigation.
Adding testblocker as it blocks GCP private cluster installation.
As far as I can see from the logs and comments, this is not an apiserver issue. The ingress controller watches events from the service controller in kube-controller-manager, which, for load balancer services, also falls under the edge team's responsibility. Hence, moving this over to the edge team.
Assigning to Dane to take another look.
Marking as urgent as this will be a release blocker unless the k-c-m fix merges.
GCP private cluster could be installed successfully with 4.5.0-0.nightly-2020-05-27-075521.

# oc -n openshift-ingress get svc
NAME                      TYPE           CLUSTER-IP     EXTERNAL-IP   PORT(S)                      AGE
router-default            LoadBalancer   172.30.9.22    10.0.32.58    80:31552/TCP,443:30302/TCP   34m
router-internal-default   ClusterIP      172.30.80.8    <none>        80/TCP,443/TCP,1936/TCP      34m
Thanks, yangyang, for the confirmation. Moving to verified.
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHBA-2020:2409