Description of problem:
Created a test ingresscontroller with Internal scope, but the LB EXTERNAL-IP always shows <pending>.

Version-Release number of selected component (if applicable):
4.3.0-0.nightly-2019-10-30-223128

How reproducible:
100%

Steps to Reproduce:
1. Create a test ingresscontroller with Internal scope:
```
spec:
  defaultCertificate:
    name: router-certs-default
  domain: test.xxx.com
  replicas: 1
  endpointPublishingStrategy:
    loadBalancer:
      scope: Internal
    type: LoadBalancerService
```
2. Check the svc.

Actual results:
```
$ oc get svc -n openshift-ingress
NAME                      TYPE           CLUSTER-IP       EXTERNAL-IP      PORT(S)                      AGE
router-default            LoadBalancer   172.30.208.157   35.224.196.234   80:30348/TCP,443:32226/TCP   154m
router-internal-default   ClusterIP      172.30.109.101   <none>           80/TCP,443/TCP,1936/TCP      154m
router-internal-testint   ClusterIP      172.30.122.106   <none>           80/TCP,443/TCP,1936/TCP      6m21s
router-testint            LoadBalancer   172.30.162.87    <pending>        80:31594/TCP,443:30218/TCP   6m21s

$ oc describe svc router-testint -n openshift-ingress
Name:                     router-testint
Namespace:                openshift-ingress
Labels:                   app=router
                          ingresscontroller.operator.openshift.io/owning-ingresscontroller=testint
                          router=router-testint
Annotations:              cloud.google.com/load-balancer-type: Internal
Selector:                 ingresscontroller.operator.openshift.io/deployment-ingresscontroller=testint
Type:                     LoadBalancer
IP:                       172.30.162.87
Port:                     http  80/TCP
TargetPort:               http/TCP
NodePort:                 http  31594/TCP
Endpoints:                10.128.2.18:80,10.128.2.19:80,10.131.0.17:80 + 1 more...
Port:                     https  443/TCP
TargetPort:               https/TCP
NodePort:                 https  30218/TCP
Endpoints:                10.128.2.18:443,10.128.2.19:443,10.131.0.17:443 + 1 more...
Session Affinity:         None
External Traffic Policy:  Local
HealthCheck NodePort:     31625
Events:
  Type     Reason                  Age                    From                Message
  ----     ------                  ----                   ----                -------
  Warning  SyncLoadBalancerFailed  6m14s (x2 over 6m38s)  service-controller  Error syncing load balancer: failed to ensure load balancer: googleapi: Error 400: Resource 'projects/openshift-qe/zones/us-central1-a/instances/hongli-9wdh9-w-a-x5wcm' is expected to be in the subnetwork 'projects/openshift-qe/regions/us-central1/subnetworks/hongli-9wdh9-master-subnet' but is in the subnetwork 'projects/openshift-qe/regions/us-central1/subnetworks/hongli-9wdh9-worker-subnet'., wrongSubnetwork
  Normal   EnsuringLoadBalancer    69s (x7 over 6m53s)    service-controller  Ensuring load balancer
  Warning  SyncLoadBalancerFailed  68s (x5 over 6m26s)    service-controller  Error syncing load balancer: failed to ensure load balancer: googleapi: Error 400: Resource 'projects/openshift-qe/zones/us-central1-b/instances/hongli-9wdh9-w-b-kfw7c' is expected to be in the subnetwork 'projects/openshift-qe/regions/us-central1/subnetworks/hongli-9wdh9-master-subnet' but is in the subnetwork 'projects/openshift-qe/regions/us-central1/subnetworks/hongli-9wdh9-worker-subnet'., wrongSubnetwork
```

Expected results:
The internal LB should be provisioned on GCP.

Additional info:
It works with 4.2.2.
My suspicion is that we picked up some sort of regression from the k8s 1.16 rebase, probably in the GCP cloud provider code[1]. The cloud provider config[2] in my cluster seems okay, but there's some level of subnet inference happening in the cloud provider code which makes me suspicious of the wrong subnet being discovered for internal load balancers. It's also not yet clear to me what the _correct_ subnet is for internal LBs given the lack of a discrete internal subnet in the infrastructure config. I'll investigate the IPI topology on GCP in this regard and work on getting some more info from the cloud provider code through logging, etc. We apparently also have an e2e test gap on GCP, as this issue could easily have been discovered through automated testing. [1] https://github.com/kubernetes/kubernetes/blob/master/staging/src/k8s.io/legacy-cloud-providers/gce/gce_loadbalancer_internal.go#L45 [2] oc get -n openshift-kube-controller-manager configmaps -o yaml
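For reference, the config from [2] is a gcfg/INI file consumed by the legacy GCE cloud provider. A hypothetical sketch of the relevant section is below; the field names follow the legacy provider's config struct, but every value here is illustrative, not taken from this cluster:

```ini
; Hypothetical gce.conf sketch -- values are illustrative only.
; Field names per k8s.io/legacy-cloud-providers/gce config parsing.
[global]
project-id      = example-project
network-name    = example-network
; When subnetwork-name is unset, the provider falls back to inferring a
; subnetwork -- the inference path suspected here of picking the wrong
; subnet for internal load balancers.
subnetwork-name = example-worker-subnet
node-tags       = example-worker
regional        = true
```

If `subnetwork-name` is absent from the cluster's actual config, that would point at the inference path in [1] rather than the config itself.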
This is collateral damage from an installer issue being tracked in https://jira.coreos.com/browse/CORS-1258. Abhinav, is there an installer BZ I can associate as a blocker for this report? I can't link Jira. For now, I'm going to keep the bug on our board but assign it to you. I don't know which bugzilla number you want to associate with whatever commits fix the problem.
Also hit this issue in a private-cluster installation on GCP.

```
# oc get svc -n openshift-ingress
NAME                      TYPE           CLUSTER-IP       EXTERNAL-IP   PORT(S)                      AGE
router-default            LoadBalancer   172.30.192.146   <pending>     80:31075/TCP,443:32412/TCP   3h14m
router-internal-default   ClusterIP      172.30.7.90      <none>        80/TCP,443/TCP,1936/TCP      3h14m

# oc describe svc router-default -n openshift-ingress
Name:                     router-default
Namespace:                openshift-ingress
Labels:                   app=router
                          ingresscontroller.operator.openshift.io/owning-ingresscontroller=default
                          router=router-default
Annotations:              cloud.google.com/load-balancer-type: Internal
Selector:                 ingresscontroller.operator.openshift.io/deployment-ingresscontroller=default
Type:                     LoadBalancer
IP:                       172.30.192.146
Port:                     http  80/TCP
TargetPort:               http/TCP
NodePort:                 http  31075/TCP
Endpoints:                10.128.2.9:80,10.131.0.9:80
Port:                     https  443/TCP
TargetPort:               https/TCP
NodePort:                 https  32412/TCP
Endpoints:                10.128.2.9:443,10.131.0.9:443
Session Affinity:         None
External Traffic Policy:  Local
HealthCheck NodePort:     31371
Events:
  Type     Reason                  Age                   From                Message
  ----     ------                  ----                  ----                -------
  Warning  SyncLoadBalancerFailed  172m                  service-controller  Error syncing load balancer: failed to ensure load balancer: googleapi: Error 400: Resource 'projects/openshift-qe/zones/us-central1-b/instances/gpei-4-dk97n-w-b-bmsxx' is expected to be in the subnetwork 'projects/openshift-qe/regions/us-central1/subnetworks/gpei-432-11040234-master-subnet' but is in the subnetwork 'projects/openshift-qe/regions/us-central1/subnetworks/gpei-432-11040234-worker-subnet'., wrongSubnetwork
  Warning  SyncLoadBalancerFailed  121m (x18 over 3h7m)  service-controller  Error syncing load balancer: failed to ensure load balancer: googleapi: Error 400: Resource 'projects/openshift-qe/zones/us-central1-a/instances/gpei-4-dk97n-w-a-ssf4l' is expected to be in the subnetwork 'projects/openshift-qe/regions/us-central1/subnetworks/gpei-432-11040234-master-subnet' but is in the subnetwork 'projects/openshift-qe/regions/us-central1/subnetworks/gpei-432-11040234-worker-subnet'., wrongSubnetwork
  Normal   EnsuringLoadBalancer    63s (x43 over 3h7m)   service-controller  Ensuring load balancer
```
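As background on the private-cluster case: on GCP a private cluster is requested at install time through the install-config. The fragment below is only a hedged sketch, assuming the installer's `publish` field; the base domain, cluster name, project ID, and region are hypothetical.

```yaml
# Hypothetical install-config.yaml fragment for a private GCP cluster.
# All names/values here are illustrative, not from the failing cluster.
apiVersion: v1
baseDomain: example.com
metadata:
  name: mycluster
platform:
  gcp:
    projectID: example-project
    region: us-central1
# "publish: Internal" asks for internal-only endpoints, so the default
# router's LoadBalancer Service is created with scope Internal and hits
# the same internal-LB provisioning path that fails above.
publish: Internal
```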
I'm reassigning this to installer consistent with https://bugzilla.redhat.com/show_bug.cgi?id=1763727 which is basically the same issue on another platform.
https://bugzilla.redhat.com/show_bug.cgi?id=1763727#c8

(In reply to Dan Mace from comment #8)
> Trevor shows the correct config in these release runs:
>
> https://storage.googleapis.com/origin-ci-test/logs/release-openshift-ocp-installer-e2e-azure-4.3/328/artifacts/e2e-azure/must-gather/quay-io-openshift-release-dev-ocp-v4-0-art-dev-sha256-68bc9e16f9c3f085718ecf04e7876015be795206f80ebc32fff21f4e017623a8/namespaces/openshift-kube-controller-manager/core/configmaps.yaml
>
> https://storage.googleapis.com/origin-ci-test/logs/release-openshift-ocp-installer-e2e-azure-4.3/327/artifacts/e2e-azure/must-gather/quay-io-openshift-release-dev-ocp-v4-0-art-dev-sha256-68bc9e16f9c3f085718ecf04e7876015be795206f80ebc32fff21f4e017623a8/namespaces/openshift-kube-controller-manager/core/configmaps.yaml
>
> I was able to reproduce with an accepted CI build
> (registry.svc.ci.openshift.org/ocp/release@sha256:c7fe500453fc2a0d194f3b72ab91dbaa43cb48649240476dffc9a96a726a305d).
> Bottom line is it looks like if you're seeing this problem, you're probably
> using a stale release image. How one can easily come to use the wrong image
> is another discussion...
installed a latest 4.3-nightly cluster on gcp

```
$ oc get clusterversion version
NAME      VERSION                             AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.3.0-0.nightly-2019-11-06-184828   True        False         8m54s   Cluster version is 4.3.0-0.nightly-2019-11-06-184828
```

```
$ cat > ilb.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: hello-app
spec:
  selector:
    matchLabels:
      app: hello
  replicas: 3
  template:
    metadata:
      labels:
        app: hello
    spec:
      containers:
      - name: hello
        image: "gcr.io/google-samples/hello-app:2.0"
---
apiVersion: v1
kind: Service
metadata:
  name: ilb-service
  annotations:
    cloud.google.com/load-balancer-type: "Internal"
  labels:
    app: hello
spec:
  type: LoadBalancer
  selector:
    app: hello
  ports:
  - port: 80
    targetPort: 8080
    protocol: TCP

$ oc apply -f ilb.yaml
deployment.apps/hello-app created
service/ilb-service created

$ oc get svc
NAME          TYPE           CLUSTER-IP       EXTERNAL-IP                            PORT(S)        AGE
ilb-service   LoadBalancer   172.30.128.161   <pending>                              80:30218/TCP   3s
kubernetes    ClusterIP      172.30.0.1       <none>                                 443/TCP        23m
openshift     ExternalName   <none>           kubernetes.default.svc.cluster.local   <none>         14m

### after a few minutes
$ oc get svc
NAME          TYPE           CLUSTER-IP       EXTERNAL-IP                            PORT(S)        AGE
ilb-service   LoadBalancer   172.30.128.161   10.0.32.5                              80:30218/TCP   2m41s
kubernetes    ClusterIP      172.30.0.1       <none>                                 443/TCP        25m
openshift     ExternalName   <none>           kubernetes.default.svc.cluster.local   <none>         17m
```

So the basic cloud provider functionality is working correctly.
now following the docs at https://docs.openshift.com/container-platform/4.2/release_notes/ocp-4-2-release-notes.html#ocp-4-2-enable-ingress-controllers on the same cluster as https://bugzilla.redhat.com/show_bug.cgi?id=1767295#c6

```
$ oc get ingresscontroller -A
NAMESPACE                    NAME      AGE
openshift-ingress-operator   default   33s

$ oc get svc -n openshift-ingress
NAME                      TYPE           CLUSTER-IP      EXTERNAL-IP    PORT(S)                      AGE
router-default            LoadBalancer   172.30.78.204   35.237.45.20   80:30242/TCP,443:30367/TCP   62s
router-internal-default   ClusterIP      172.30.88.10    <none>         80/TCP,443/TCP,1936/TCP      62s

$ cat i-ingress.yaml
apiVersion: operator.openshift.io/v1
kind: IngressController
metadata:
  namespace: openshift-ingress-operator
  name: internal
spec:
  domain: apps.example.com
  endpointPublishingStrategy:
    type: LoadBalancerService
    loadBalancer:
      scope: Internal

$ oc create -f i-ingress.yaml
ingresscontroller.operator.openshift.io/internal created

$ oc get ingresscontroller internal -n openshift-ingress-operator -oyaml
apiVersion: operator.openshift.io/v1
kind: IngressController
metadata:
  creationTimestamp: "2019-11-06T23:05:41Z"
  finalizers:
  - ingresscontroller.operator.openshift.io/finalizer-ingresscontroller
  generation: 1
  name: internal
  namespace: openshift-ingress-operator
  resourceVersion: "22474"
  selfLink: /apis/operator.openshift.io/v1/namespaces/openshift-ingress-operator/ingresscontrollers/internal
  uid: ba9f5050-fc15-4d26-a938-a0d20d1b64f9
spec:
  domain: apps.example.com
  endpointPublishingStrategy:
    loadBalancer:
      scope: Internal
    type: LoadBalancerService
status:
  availableReplicas: 0
  conditions:
  - lastTransitionTime: "2019-11-06T23:05:41Z"
    reason: Valid
    status: "True"
    type: Admitted
  - lastTransitionTime: "2019-11-06T23:05:42Z"
    message: 'The deployment is unavailable: Deployment does not have minimum availability.'
    reason: MinimumReplicasUnavailable
    status: "False"
    type: Available
  - lastTransitionTime: "2019-11-06T23:05:42Z"
    message: The endpoint publishing strategy supports a managed load balancer
    reason: WantedByEndpointPublishingStrategy
    status: "True"
    type: LoadBalancerManaged
  - lastTransitionTime: "2019-11-06T23:05:42Z"
    message: The LoadBalancer service is pending
    reason: LoadBalancerPending
    status: "False"
    type: LoadBalancerReady
  - lastTransitionTime: "2019-11-06T23:05:42Z"
    message: DNS management is supported and zones are specified in the cluster DNS config.
    reason: Normal
    status: "True"
    type: DNSManaged
  - lastTransitionTime: "2019-11-06T23:05:42Z"
    message: The wildcard record resource was not found.
    reason: RecordNotFound
    status: "False"
    type: DNSReady
  - lastTransitionTime: "2019-11-06T23:05:42Z"
    status: "False"
    type: Degraded
  domain: apps.example.com
  endpointPublishingStrategy:
    loadBalancer:
      scope: Internal
    type: LoadBalancerService
  observedGeneration: 1
  selector: ingresscontroller.operator.openshift.io/deployment-ingresscontroller=internal
  tlsProfile:
    ciphers:
    - TLS_AES_128_GCM_SHA256
    - TLS_AES_256_GCM_SHA384
    - TLS_CHACHA20_POLY1305_SHA256
    - ECDHE-ECDSA-AES128-GCM-SHA256
    - ECDHE-RSA-AES128-GCM-SHA256
    - ECDHE-ECDSA-AES256-GCM-SHA384
    - ECDHE-RSA-AES256-GCM-SHA384
    - ECDHE-ECDSA-CHACHA20-POLY1305
    - ECDHE-RSA-CHACHA20-POLY1305
    - DHE-RSA-AES128-GCM-SHA256
    - DHE-RSA-AES256-GCM-SHA384
    minTLSVersion: VersionTLS12

$ oc get svc -n openshift-ingress
NAME                       TYPE           CLUSTER-IP       EXTERNAL-IP    PORT(S)                      AGE
router-default             LoadBalancer   172.30.78.204    35.237.45.20   80:30242/TCP,443:30367/TCP   3m1s
router-internal            LoadBalancer   172.30.188.126   10.0.32.6      80:31967/TCP,443:31286/TCP   53s
router-internal-default    ClusterIP      172.30.88.10     <none>         80/TCP,443/TCP,1936/TCP      3m1s
router-internal-internal   ClusterIP      172.30.91.150    <none>         80/TCP,443/TCP,1936/TCP      53s
```
Based on the previous comments, please retest if possible by installing a new cluster with the latest 4.3 nightly.
Used the same nightly build, but it still failed, with different error messages. Is something wrong with the QE test env?

```
$ oc get clusterversion
NAME      VERSION                             AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.3.0-0.nightly-2019-11-06-184828   True        False         12m     Cluster version is 4.3.0-0.nightly-2019-11-06-184828

$ oc get svc -n openshift-ingress
NAME                       TYPE           CLUSTER-IP       EXTERNAL-IP   PORT(S)                      AGE
router-default             LoadBalancer   172.30.145.245   <pending>     80:30602/TCP,443:30278/TCP   15m
router-internal            LoadBalancer   172.30.88.78     <pending>     80:30232/TCP,443:32293/TCP   7m18s
router-internal-default    ClusterIP      172.30.209.171   <none>        80/TCP,443/TCP,1936/TCP      15m
router-internal-internal   ClusterIP      172.30.241.122   <none>        80/TCP,443/TCP,1936/TCP      7m18s

$ oc -n openshift-ingress describe svc router-default
<---snip--->
Events:
  Type     Reason                  Age                  From                Message
  ----     ------                  ----                 ----                -------
  Warning  SyncLoadBalancerFailed  10m (x5 over 16m)    service-controller  Error syncing load balancer: failed to ensure load balancer: googleapi: Error 400: Invalid value for field 'resource.backends[5].group': 'https://www.googleapis.com/compute/v1/projects/openshift-qe/zones/us-central1-c/instanceGroups/k8s-ig--8d8682bc12c7a717'. Instance group must have a network to be attached to a backend service. Add an instance to give the instance group a network., invalid
  Warning  SyncLoadBalancerFailed  5m21s (x3 over 16m)  service-controller  Error syncing load balancer: failed to ensure load balancer: googleapi: Error 400: Invalid value for field 'resource.backends[4].group': 'https://www.googleapis.com/compute/v1/projects/openshift-qe/zones/us-central1-c/instanceGroups/k8s-ig--8d8682bc12c7a717'. Instance group must have a network to be attached to a backend service. Add an instance to give the instance group a network., invalid
  Normal   EnsuringLoadBalancer    21s (x9 over 16m)    service-controller  Ensuring load balancer
  Warning  SyncLoadBalancerFailed  16s                  service-controller  Error syncing load balancer: failed to ensure load balancer: googleapi: Error 400: Invalid value for field 'resource.backends[3].group': 'https://www.googleapis.com/compute/v1/projects/openshift-qe/zones/us-central1-c/instanceGroups/k8s-ig--8d8682bc12c7a717'. Instance group must have a network to be attached to a backend service. Add an instance to give the instance group a network., invalid

$ oc -n openshift-ingress describe svc router-internal
Events:
  Type     Reason                  Age                From                Message
  ----     ------                  ----               ----                -------
  Warning  SyncLoadBalancerFailed  41s (x2 over 67s)  service-controller  Error syncing load balancer: failed to ensure load balancer: googleapi: Error 400: Invalid value for field 'resource.backends[5].group': 'https://www.googleapis.com/compute/v1/projects/openshift-qe/zones/us-central1-c/instanceGroups/k8s-ig--8d8682bc12c7a717'. Instance group must have a network to be attached to a backend service. Add an instance to give the instance group a network., invalid
  Normal   EnsuringLoadBalancer    21s (x4 over 92s)  service-controller  Ensuring load balancer
  Warning  SyncLoadBalancerFailed  16s (x2 over 57s)  service-controller  Error syncing load balancer: failed to ensure load balancer: googleapi: Error 400: Invalid value for field 'resource.backends[3].group': 'https://www.googleapis.com/compute/v1/projects/openshift-qe/zones/us-central1-c/instanceGroups/k8s-ig--8d8682bc12c7a717'. Instance group must have a network to be attached to a backend service. Add an instance to give the instance group a network., invalid
```
verified with 4.3.0-0.nightly-2019-11-17-224250 and the issue has been fixed.

```
$ oc get svc -n openshift-ingress
NAME             TYPE           CLUSTER-IP       EXTERNAL-IP   PORT(S)                      AGE
router-default   LoadBalancer   172.30.129.165   10.0.32.4     80:31806/TCP,443:30884/TCP   4m33s

$ oc get ingresscontroller/default -o yaml -n openshift-ingress-operator
<---snip--->
spec:
  endpointPublishingStrategy:
    loadBalancer:
      scope: Internal
    type: LoadBalancerService
```
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHBA-2020:0062