Bug 1767295 - [GCP] Internal scope of loadBalancer doesn't work
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Installer
Version: 4.3.0
Hardware: Unspecified
OS: Unspecified
Priority: high
Severity: high
Target Milestone: ---
Target Release: 4.3.0
Assignee: Abhinav Dahiya
QA Contact: Hongan Li
URL:
Whiteboard:
Depends On:
Blocks:
Reported: 2019-10-31 05:55 UTC by Hongan Li
Modified: 2020-05-28 12:30 UTC
CC: 6 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2020-01-23 11:10:15 UTC
Target Upstream Version:
Embargoed:




Links
System ID Private Priority Status Summary Last Updated
Github openshift origin pull 24135 0 'None' closed Bug 1767295: UPSTREAM: 84466: gce: skip ensureInstanceGroup for a zone that has no remaining nodes for k8s managed IG 2021-02-08 09:09:18 UTC
Red Hat Product Errata RHBA-2020:0062 0 None None None 2020-01-23 11:10:30 UTC

Description Hongan Li 2019-10-31 05:55:23 UTC
Description of problem:
Created a test ingresscontroller with Internal scope, but the load balancer's EXTERNAL-IP always shows <pending>.

Version-Release number of selected component (if applicable):
4.3.0-0.nightly-2019-10-30-223128

How reproducible:
100%

Steps to Reproduce:
1. create a test ingresscontroller with Internal scope
spec:
  defaultCertificate:
    name: router-certs-default
  domain: test.xxx.com
  replicas: 1
  endpointPublishingStrategy:
    loadBalancer:
      scope: Internal
    type: LoadBalancerService

2. check the svc in the openshift-ingress namespace
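
For reference, step 1's spec fragment as a complete, apply-able manifest (the metadata values are assumptions inferred from the router-testint service names shown below, not from the original report):

```yaml
# Complete manifest for step 1; name/namespace are inferred, not
# taken verbatim from the original report.
apiVersion: operator.openshift.io/v1
kind: IngressController
metadata:
  name: testint
  namespace: openshift-ingress-operator
spec:
  defaultCertificate:
    name: router-certs-default
  domain: test.xxx.com
  replicas: 1
  endpointPublishingStrategy:
    loadBalancer:
      scope: Internal
    type: LoadBalancerService
```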


Actual results:
$ oc get svc -n openshift-ingress
NAME                      TYPE           CLUSTER-IP       EXTERNAL-IP      PORT(S)                      AGE
router-default            LoadBalancer   172.30.208.157   35.224.196.234   80:30348/TCP,443:32226/TCP   154m
router-internal-default   ClusterIP      172.30.109.101   <none>           80/TCP,443/TCP,1936/TCP      154m
router-internal-testint   ClusterIP      172.30.122.106   <none>           80/TCP,443/TCP,1936/TCP      6m21s
router-testint            LoadBalancer   172.30.162.87    <pending>        80:31594/TCP,443:30218/TCP   6m21s

$ oc describe svc router-testint -n openshift-ingress
Name:                     router-testint
Namespace:                openshift-ingress
Labels:                   app=router
                          ingresscontroller.operator.openshift.io/owning-ingresscontroller=testint
                          router=router-testint
Annotations:              cloud.google.com/load-balancer-type: Internal
Selector:                 ingresscontroller.operator.openshift.io/deployment-ingresscontroller=testint
Type:                     LoadBalancer
IP:                       172.30.162.87
Port:                     http  80/TCP
TargetPort:               http/TCP
NodePort:                 http  31594/TCP
Endpoints:                10.128.2.18:80,10.128.2.19:80,10.131.0.17:80 + 1 more...
Port:                     https  443/TCP
TargetPort:               https/TCP
NodePort:                 https  30218/TCP
Endpoints:                10.128.2.18:443,10.128.2.19:443,10.131.0.17:443 + 1 more...
Session Affinity:         None
External Traffic Policy:  Local
HealthCheck NodePort:     31625
Events:
  Type     Reason                  Age                    From                Message
  ----     ------                  ----                   ----                -------
  Warning  SyncLoadBalancerFailed  6m14s (x2 over 6m38s)  service-controller  Error syncing load balancer: failed to ensure load balancer: googleapi: Error 400: Resource 'projects/openshift-qe/zones/us-central1-a/instances/hongli-9wdh9-w-a-x5wcm' is expected to be in the subnetwork 'projects/openshift-qe/regions/us-central1/subnetworks/hongli-9wdh9-master-subnet' but is in the subnetwork 'projects/openshift-qe/regions/us-central1/subnetworks/hongli-9wdh9-worker-subnet'., wrongSubnetwork
  Normal   EnsuringLoadBalancer    69s (x7 over 6m53s)    service-controller  Ensuring load balancer
  Warning  SyncLoadBalancerFailed  68s (x5 over 6m26s)    service-controller  Error syncing load balancer: failed to ensure load balancer: googleapi: Error 400: Resource 'projects/openshift-qe/zones/us-central1-b/instances/hongli-9wdh9-w-b-kfw7c' is expected to be in the subnetwork 'projects/openshift-qe/regions/us-central1/subnetworks/hongli-9wdh9-master-subnet' but is in the subnetwork 'projects/openshift-qe/regions/us-central1/subnetworks/hongli-9wdh9-worker-subnet'., wrongSubnetwork


Expected results:
Internal LB should be provisioned on GCP.

Additional info:
This works with 4.2.2.
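
To make the failure in the events above explicit, the two subnetwork names can be pulled out of the error text: the worker node lives in the worker subnet, but the cloud provider expects it in the master subnet. A text-extraction sketch only, run against the event message quoted above:

```shell
# Extract the "expected" vs "actual" subnetworks from the wrongSubnetwork
# event text (copied from the service events above).
msg="Resource 'projects/openshift-qe/zones/us-central1-a/instances/hongli-9wdh9-w-a-x5wcm' is expected to be in the subnetwork 'projects/openshift-qe/regions/us-central1/subnetworks/hongli-9wdh9-master-subnet' but is in the subnetwork 'projects/openshift-qe/regions/us-central1/subnetworks/hongli-9wdh9-worker-subnet'., wrongSubnetwork"

# The first subnetworks/... match is the expected subnet, the second is the
# subnet the instance is actually in.
expected=$(printf '%s\n' "$msg" | grep -o "subnetworks/[^']*" | sed -n 1p)
actual=$(printf '%s\n' "$msg" | grep -o "subnetworks/[^']*" | sed -n 2p)

echo "expected=$expected"   # prints expected=subnetworks/hongli-9wdh9-master-subnet
echo "actual=$actual"       # prints actual=subnetworks/hongli-9wdh9-worker-subnet
```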

Comment 1 Dan Mace 2019-10-31 16:04:20 UTC
My suspicion is that we picked up some sort of regression from the k8s 1.16 rebase, probably in the GCP cloud provider code[1].

The cloud provider config[2] in my cluster seems okay, but there's some level of subnet inference happening in the cloud provider code which makes me suspicious of the wrong subnet being discovered for internal load balancers. It's also not yet clear to me what the _correct_ subnet is for internal LBs given the lack of a discrete internal subnet in the infrastructure config. I'll investigate the IPI topology on GCP in this regard and work on getting some more info from the cloud provider code through logging, etc.

We apparently also have an e2e test gap on GCP, as this issue could easily have been discovered through automated testing.

[1] https://github.com/kubernetes/kubernetes/blob/master/staging/src/k8s.io/legacy-cloud-providers/gce/gce_loadbalancer_internal.go#L45
[2] oc get -n openshift-kube-controller-manager configmaps -o yaml
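
For context on [2]: the legacy GCE cloud provider reads its subnetwork from the `subnetwork-name` key of the `gce.conf` embedded in that config map. A purely illustrative fragment of what a misconfigured file consistent with the wrongSubnetwork events would look like (values are assumptions, not captured from the affected cluster):

```ini
; Illustrative sketch only -- not taken from the affected cluster.
; If the installer writes the *master* subnet into subnetwork-name, worker
; instances fail GCP's wrongSubnetwork check when an internal LB is ensured.
[global]
project-id      = openshift-qe
network-name    = hongli-9wdh9-network        ; assumed name
subnetwork-name = hongli-9wdh9-master-subnet  ; nodes are in the worker subnet
regional        = true
```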

Comment 2 Dan Mace 2019-10-31 17:40:52 UTC
This is collateral damage of an installer issue being tracked in https://jira.coreos.com/browse/CORS-1258.

Abhinav, is there an installer BZ I can associate as a blocker for this report? I can't link Jira. For now, I'm going to keep the bug on our board but assign it to you. I don't know what bugzilla number you want to associate with whatever commits fix the problem.

Comment 3 Gaoyun Pei 2019-11-04 06:31:19 UTC
Also hit this issue in a private-cluster installation on GCP.

# oc get svc -n openshift-ingress
NAME                      TYPE           CLUSTER-IP       EXTERNAL-IP   PORT(S)                      AGE
router-default            LoadBalancer   172.30.192.146   <pending>     80:31075/TCP,443:32412/TCP   3h14m
router-internal-default   ClusterIP      172.30.7.90      <none>        80/TCP,443/TCP,1936/TCP      3h14m

# oc describe svc router-default -n openshift-ingress
Name:                     router-default
Namespace:                openshift-ingress
Labels:                   app=router
                          ingresscontroller.operator.openshift.io/owning-ingresscontroller=default
                          router=router-default
Annotations:              cloud.google.com/load-balancer-type: Internal
Selector:                 ingresscontroller.operator.openshift.io/deployment-ingresscontroller=default
Type:                     LoadBalancer
IP:                       172.30.192.146
Port:                     http  80/TCP
TargetPort:               http/TCP
NodePort:                 http  31075/TCP
Endpoints:                10.128.2.9:80,10.131.0.9:80
Port:                     https  443/TCP
TargetPort:               https/TCP
NodePort:                 https  32412/TCP
Endpoints:                10.128.2.9:443,10.131.0.9:443
Session Affinity:         None
External Traffic Policy:  Local
HealthCheck NodePort:     31371
Events:
  Type     Reason                  Age                   From                Message
  ----     ------                  ----                  ----                -------
  Warning  SyncLoadBalancerFailed  172m                  service-controller  Error syncing load balancer: failed to ensure load balancer: googleapi: Error 400: Resource 'projects/openshift-qe/zones/us-central1-b/instances/gpei-4-dk97n-w-b-bmsxx' is expected to be in the subnetwork 'projects/openshift-qe/regions/us-central1/subnetworks/gpei-432-11040234-master-subnet' but is in the subnetwork 'projects/openshift-qe/regions/us-central1/subnetworks/gpei-432-11040234-worker-subnet'., wrongSubnetwork
  Warning  SyncLoadBalancerFailed  121m (x18 over 3h7m)  service-controller  Error syncing load balancer: failed to ensure load balancer: googleapi: Error 400: Resource 'projects/openshift-qe/zones/us-central1-a/instances/gpei-4-dk97n-w-a-ssf4l' is expected to be in the subnetwork 'projects/openshift-qe/regions/us-central1/subnetworks/gpei-432-11040234-master-subnet' but is in the subnetwork 'projects/openshift-qe/regions/us-central1/subnetworks/gpei-432-11040234-worker-subnet'., wrongSubnetwork
  Normal   EnsuringLoadBalancer    63s (x43 over 3h7m)   service-controller  Ensuring load balancer

Comment 4 Dan Mace 2019-11-06 22:04:37 UTC
I'm reassigning this to installer consistent with https://bugzilla.redhat.com/show_bug.cgi?id=1763727 which is basically the same issue on another platform.

Comment 5 Dan Mace 2019-11-06 22:22:40 UTC
https://bugzilla.redhat.com/show_bug.cgi?id=1763727#c8

(In reply to Dan Mace from comment #8)
> Trevor shows the correct config in these release runs:
> 
> https://storage.googleapis.com/origin-ci-test/logs/release-openshift-ocp-
> installer-e2e-azure-4.3/328/artifacts/e2e-azure/must-gather/quay-io-
> openshift-release-dev-ocp-v4-0-art-dev-sha256-
> 68bc9e16f9c3f085718ecf04e7876015be795206f80ebc32fff21f4e017623a8/namespaces/
> openshift-kube-controller-manager/core/configmaps.yaml
> 
> https://storage.googleapis.com/origin-ci-test/logs/release-openshift-ocp-
> installer-e2e-azure-4.3/327/artifacts/e2e-azure/must-gather/quay-io-
> openshift-release-dev-ocp-v4-0-art-dev-sha256-
> 68bc9e16f9c3f085718ecf04e7876015be795206f80ebc32fff21f4e017623a8/namespaces/
> openshift-kube-controller-manager/core/configmaps.yaml
> 
> I was able to reproduce with an accepted CI build
> (registry.svc.ci.openshift.org/ocp/release@sha256:
> c7fe500453fc2a0d194f3b72ab91dbaa43cb48649240476dffc9a96a726a305d). Bottom
> line is it looks like if you're seeing this problem, you're probably using a
> stale release image. How one can easily come to use the wrong image is
> another discussion...

Comment 6 Abhinav Dahiya 2019-11-06 22:56:54 UTC
Installed the latest 4.3 nightly cluster on GCP:

```
$ oc get clusterversion version
NAME      VERSION                             AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.3.0-0.nightly-2019-11-06-184828   True        False         8m54s   Cluster version is 4.3.0-0.nightly-2019-11-06-184828
```

```
$ cat > ilb.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: hello-app
spec:
  selector:
    matchLabels:
      app: hello
  replicas: 3
  template:
    metadata:
      labels:
        app: hello
    spec:
      containers:
      - name: hello
        image: "gcr.io/google-samples/hello-app:2.0"
---
apiVersion: v1
kind: Service
metadata:
  name: ilb-service
  annotations:
    cloud.google.com/load-balancer-type: "Internal"
  labels:
    app: hello
spec:
  type: LoadBalancer
  selector:
    app: hello
  ports:
  - port: 80
    targetPort: 8080
    protocol: TCP
$ oc apply -f ilb.yaml
deployment.apps/hello-app created
service/ilb-service created
$ oc get svc
NAME          TYPE           CLUSTER-IP       EXTERNAL-IP                            PORT(S)        AGE
ilb-service   LoadBalancer   172.30.128.161   <pending>                              80:30218/TCP   3s
kubernetes    ClusterIP      172.30.0.1       <none>                                 443/TCP        23m
openshift     ExternalName   <none>           kubernetes.default.svc.cluster.local   <none>         14m

### after a few minutes
$ oc get svc
NAME          TYPE           CLUSTER-IP       EXTERNAL-IP                            PORT(S)        AGE
ilb-service   LoadBalancer   172.30.128.161   10.0.32.5                              80:30218/TCP   2m41s
kubernetes    ClusterIP      172.30.0.1       <none>                                 443/TCP        25m
openshift     ExternalName   <none>           kubernetes.default.svc.cluster.local   <none>         17m
```

So the basic cloud provider functionality is working correctly.

Comment 7 Abhinav Dahiya 2019-11-06 23:08:53 UTC
Now following the docs at https://docs.openshift.com/container-platform/4.2/release_notes/ocp-4-2-release-notes.html#ocp-4-2-enable-ingress-controllers
on the same cluster as https://bugzilla.redhat.com/show_bug.cgi?id=1767295#c6


```
oc get ingresscontroller -A
NAMESPACE                    NAME      AGE
openshift-ingress-operator   default   33s

oc get svc -n openshift-ingress
NAME                      TYPE           CLUSTER-IP      EXTERNAL-IP    PORT(S)                      AGE
router-default            LoadBalancer   172.30.78.204   35.237.45.20   80:30242/TCP,443:30367/TCP   62s
router-internal-default   ClusterIP      172.30.88.10    <none>         80/TCP,443/TCP,1936/TCP      62s


cat i-ingress.yaml
apiVersion: operator.openshift.io/v1
kind: IngressController
metadata:
  namespace: openshift-ingress-operator
  name: internal
spec:
  domain: apps.example.com
  endpointPublishingStrategy:
    type: LoadBalancerService
    loadBalancer:
      scope: Internal

oc create -f i-ingress.yaml
ingresscontroller.operator.openshift.io/internal created

oc get ingresscontroller internal -n openshift-ingress-operator -oyaml
apiVersion: operator.openshift.io/v1
kind: IngressController
metadata:
  creationTimestamp: "2019-11-06T23:05:41Z"
  finalizers:
  - ingresscontroller.operator.openshift.io/finalizer-ingresscontroller
  generation: 1
  name: internal
  namespace: openshift-ingress-operator
  resourceVersion: "22474"
  selfLink: /apis/operator.openshift.io/v1/namespaces/openshift-ingress-operator/ingresscontrollers/internal
  uid: ba9f5050-fc15-4d26-a938-a0d20d1b64f9
spec:
  domain: apps.example.com
  endpointPublishingStrategy:
    loadBalancer:
      scope: Internal
    type: LoadBalancerService
status:
  availableReplicas: 0
  conditions:
  - lastTransitionTime: "2019-11-06T23:05:41Z"
    reason: Valid
    status: "True"
    type: Admitted
  - lastTransitionTime: "2019-11-06T23:05:42Z"
    message: 'The deployment is unavailable: Deployment does not have minimum availability.'
    reason: MinimumReplicasUnavailable
    status: "False"
    type: Available
  - lastTransitionTime: "2019-11-06T23:05:42Z"
    message: The endpoint publishing strategy supports a managed load balancer
    reason: WantedByEndpointPublishingStrategy
    status: "True"
    type: LoadBalancerManaged
  - lastTransitionTime: "2019-11-06T23:05:42Z"
    message: The LoadBalancer service is pending
    reason: LoadBalancerPending
    status: "False"
    type: LoadBalancerReady
  - lastTransitionTime: "2019-11-06T23:05:42Z"
    message: DNS management is supported and zones are specified in the cluster DNS
      config.
    reason: Normal
    status: "True"
    type: DNSManaged
  - lastTransitionTime: "2019-11-06T23:05:42Z"
    message: The wildcard record resource was not found.
    reason: RecordNotFound
    status: "False"
    type: DNSReady
  - lastTransitionTime: "2019-11-06T23:05:42Z"
    status: "False"
    type: Degraded
  domain: apps.example.com
  endpointPublishingStrategy:
    loadBalancer:
      scope: Internal
    type: LoadBalancerService
  observedGeneration: 1
  selector: ingresscontroller.operator.openshift.io/deployment-ingresscontroller=internal
  tlsProfile:
    ciphers:
    - TLS_AES_128_GCM_SHA256
    - TLS_AES_256_GCM_SHA384
    - TLS_CHACHA20_POLY1305_SHA256
    - ECDHE-ECDSA-AES128-GCM-SHA256
    - ECDHE-RSA-AES128-GCM-SHA256
    - ECDHE-ECDSA-AES256-GCM-SHA384
    - ECDHE-RSA-AES256-GCM-SHA384
    - ECDHE-ECDSA-CHACHA20-POLY1305
    - ECDHE-RSA-CHACHA20-POLY1305
    - DHE-RSA-AES128-GCM-SHA256
    - DHE-RSA-AES256-GCM-SHA384
    minTLSVersion: VersionTLS12


oc get svc -n openshift-ingress
NAME                       TYPE           CLUSTER-IP       EXTERNAL-IP    PORT(S)                      AGE
router-default             LoadBalancer   172.30.78.204    35.237.45.20   80:30242/TCP,443:30367/TCP   3m1s
router-internal            LoadBalancer   172.30.188.126   10.0.32.6      80:31967/TCP,443:31286/TCP   53s
router-internal-default    ClusterIP      172.30.88.10     <none>         80/TCP,443/TCP,1936/TCP      3m1s
router-internal-internal   ClusterIP      172.30.91.150    <none>         80/TCP,443/TCP,1936/TCP      53s


```

Comment 8 Abhinav Dahiya 2019-11-06 23:33:12 UTC
Based on the previous comments, please, if possible, try QA again by installing a new cluster from the latest 4.3 nightly.

Comment 9 Hongan Li 2019-11-07 02:16:24 UTC
Using the same nightly build, it still fails, though with different error messages. Is something wrong with the QE test environment?

$ oc get clusterversion
NAME      VERSION                             AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.3.0-0.nightly-2019-11-06-184828   True        False         12m     Cluster version is 4.3.0-0.nightly-2019-11-06-184828

$ oc get svc -n openshift-ingress
NAME                       TYPE           CLUSTER-IP       EXTERNAL-IP   PORT(S)                      AGE
router-default             LoadBalancer   172.30.145.245   <pending>     80:30602/TCP,443:30278/TCP   15m
router-internal            LoadBalancer   172.30.88.78     <pending>     80:30232/TCP,443:32293/TCP   7m18s
router-internal-default    ClusterIP      172.30.209.171   <none>        80/TCP,443/TCP,1936/TCP      15m
router-internal-internal   ClusterIP      172.30.241.122   <none>        80/TCP,443/TCP,1936/TCP      7m18s

$ oc -n openshift-ingress describe svc router-default
<---snip--->
Events:
  Type     Reason                  Age                  From                Message
  ----     ------                  ----                 ----                -------
  Warning  SyncLoadBalancerFailed  10m (x5 over 16m)    service-controller  Error syncing load balancer: failed to ensure load balancer: googleapi: Error 400: Invalid value for field 'resource.backends[5].group': 'https://www.googleapis.com/compute/v1/projects/openshift-qe/zones/us-central1-c/instanceGroups/k8s-ig--8d8682bc12c7a717'. Instance group must have a network to be attached to a backend service. Add an instance to give the instance group a network., invalid
  Warning  SyncLoadBalancerFailed  5m21s (x3 over 16m)  service-controller  Error syncing load balancer: failed to ensure load balancer: googleapi: Error 400: Invalid value for field 'resource.backends[4].group': 'https://www.googleapis.com/compute/v1/projects/openshift-qe/zones/us-central1-c/instanceGroups/k8s-ig--8d8682bc12c7a717'. Instance group must have a network to be attached to a backend service. Add an instance to give the instance group a network., invalid
  Normal   EnsuringLoadBalancer    21s (x9 over 16m)    service-controller  Ensuring load balancer
  Warning  SyncLoadBalancerFailed  16s                  service-controller  Error syncing load balancer: failed to ensure load balancer: googleapi: Error 400: Invalid value for field 'resource.backends[3].group': 'https://www.googleapis.com/compute/v1/projects/openshift-qe/zones/us-central1-c/instanceGroups/k8s-ig--8d8682bc12c7a717'. Instance group must have a network to be attached to a backend service. Add an instance to give the instance group a network., invalid


$ oc -n openshift-ingress describe svc router-internal
Events:
  Type     Reason                  Age                From                Message
  ----     ------                  ----               ----                -------
  Warning  SyncLoadBalancerFailed  41s (x2 over 67s)  service-controller  Error syncing load balancer: failed to ensure load balancer: googleapi: Error 400: Invalid value for field 'resource.backends[5].group': 'https://www.googleapis.com/compute/v1/projects/openshift-qe/zones/us-central1-c/instanceGroups/k8s-ig--8d8682bc12c7a717'. Instance group must have a network to be attached to a backend service. Add an instance to give the instance group a network., invalid
  Normal   EnsuringLoadBalancer    21s (x4 over 92s)  service-controller  Ensuring load balancer
  Warning  SyncLoadBalancerFailed  16s (x2 over 57s)  service-controller  Error syncing load balancer: failed to ensure load balancer: googleapi: Error 400: Invalid value for field 'resource.backends[3].group': 'https://www.googleapis.com/compute/v1/projects/openshift-qe/zones/us-central1-c/instanceGroups/k8s-ig--8d8682bc12c7a717'. Instance group must have a network to be attached to a backend service. Add an instance to give the instance group a network., invalid

Comment 11 Hongan Li 2019-11-18 05:21:49 UTC
Verified with 4.3.0-0.nightly-2019-11-17-224250; the issue has been fixed.

$ oc get svc -n openshift-ingress
NAME                      TYPE           CLUSTER-IP       EXTERNAL-IP   PORT(S)                      AGE
router-default            LoadBalancer   172.30.129.165   10.0.32.4     80:31806/TCP,443:30884/TCP   4m33s

$ oc get ingresscontroller/default -o yaml -n openshift-ingress-operator
<---snip--->
spec:
  endpointPublishingStrategy:
    loadBalancer:
      scope: Internal
    type: LoadBalancerService

Comment 13 errata-xmlrpc 2020-01-23 11:10:15 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2020:0062

