Description of problem:
Has anyone looked at deploying a private LB for the ingress router yet? Per https://docs.openshift.com/container-platform/4.2/release_notes/ocp-4-2-release-notes.html#ocp-4-2-enable-ingress-controllers, this works on AWS but fails on Azure. It fails because the subnet cluster63-99qm4-vnet/cluster63-99qm4-node-subnet does not exist; the subnet that actually gets created in Azure is worker-subnet, not node-subnet. Possibly a bug in the naming standards?

Version-Release number of selected component (if applicable):
4.2.0 on Azure

How reproducible:
Every time

Steps to Reproduce:
1. Destroy the default ingress router and re-create it as an internal load balancer using https://docs.openshift.com/container-platform/4.2/release_notes/ocp-4-2-release-notes.html#ocp-4-2-enable-ingress-controllers (a sketch of this manifest follows this report)

Actual results:
The Service for the internal load balancer sits in Pending with this error:

Events:
  Type     Reason                      Age                  From                Message
  ----     ------                      ----                 ----                -------
  Normal   EnsuringLoadBalancer        2m45s (x9 over 18m)  service-controller  Ensuring load balancer
  Warning  CreatingLoadBalancerFailed  2m45s (x9 over 18m)  service-controller  Error creating load balancer (will retry): failed to ensure load balancer for service openshift-ingress/router-default: ensure(openshift-ingress/router-default): lb(cluster63-99qm4-internal) - failed to get subnet: cluster63-99qm4-vnet/cluster63-99qm4-node-subnet

The subnet cluster63-99qm4-vnet/cluster63-99qm4-node-subnet does not exist. The subnets that get created are:
  clustername-UID-worker-subnet
  clustername-UID-master-subnet
Interestingly, the NSG for clustername-UID-worker-subnet is called clustername-UID-node-nsg.

Expected results:
The service starts correctly with an internal LB IP on Azure; presumably this should be applied against clustername-UID-worker-subnet.

Additional info:
This works on AWS; the issue is specific to Azure.
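For reference, a minimal sketch of the kind of IngressController manifest the linked documentation describes for an internal load balancer (this mirrors the manifest used to reproduce later in this bug; it is not copied from the affected cluster):

```
# Sketch: re-create the default IngressController with an internal (private) load balancer,
# as described in the linked 4.2 release notes.
apiVersion: operator.openshift.io/v1
kind: IngressController
metadata:
  namespace: openshift-ingress-operator
  name: default
spec:
  endpointPublishingStrategy:
    type: LoadBalancerService
    loadBalancer:
      scope: Internal
```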
I was able to reproduce this with 4.3.0-0.ci-2019-10-22-114945 on Azure.

$ oc get -n openshift-ingress events
6s   Warning   SyncLoadBalancerFailed   service/router-internal-apps   Error syncing load balancer: failed to ensure load balancer: ensure(openshift-ingress/router-internal-apps): lb(dmace-kj94v-internal) - failed to get subnet: dmace-kj94v-vnet/dmace-kj94v-node-subnet

This could be an issue with the upstream cloud provider code, as our only interface to Azure in this regard is through the `service.beta.kubernetes.io/azure-load-balancer-internal` annotation on a LoadBalancer Service (see the sketch below). From there it's up to the k8s cloud provider code to do the right thing.
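For context, a minimal sketch of a LoadBalancer Service carrying that annotation (the name, selector, and ports here are illustrative, not taken from the affected cluster):

```
apiVersion: v1
kind: Service
metadata:
  name: example-internal   # illustrative name
  annotations:
    # This annotation is the only Azure-specific signal set on the Service;
    # the k8s Azure cloud provider is responsible for creating the internal LB
    # in the configured subnet.
    service.beta.kubernetes.io/azure-load-balancer-internal: "true"
spec:
  type: LoadBalancer
  selector:
    app: example
  ports:
  - port: 80
    targetPort: 8080
```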
The problem is invalid cloud provider configuration data provided by the installer and consumed by the kube-controller-manager. I've fixed the installer in https://github.com/openshift/installer/pull/2556, but existing 4.2 Azure clusters will have invalid cloud provider ConfigMap contents which will need to be repaired by the operator responsible for kube-controller-manager. I'm going to reassign the bug to that component.
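For illustration, the relevant line in the cloud provider config would change roughly like this, assuming the installer fix simply points subnetName at the worker subnet that actually gets created (the infra ID below is taken from the original report; treat the exact new value as an assumption):

```
# sketch of the kube-controller-manager cloud config change, before vs. after the installer fix
-  "subnetName": "cluster63-99qm4-node-subnet",    # subnet is never created
+  "subnetName": "cluster63-99qm4-worker-subnet",  # subnet the installer actually creates
```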
I don't think there's more to be done here than documenting this. Moving to QA.
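For existing clusters, the documented manual repair would presumably be along these lines, assuming (as on other OCP 4 clusters) that the source-of-truth ConfigMap is cloud-provider-config in openshift-config and that it gets re-synced into openshift-kube-controller-manager/cloud-config:

```
# Inspect the current cloud provider config (the "config" key holds the JSON shown later in this bug):
oc get configmap cloud-provider-config -n openshift-config -o yaml

# Edit the "config" key so subnetName matches the subnet that actually exists,
# e.g. <infraID>-worker-subnet instead of <infraID>-node-subnet:
oc edit configmap cloud-provider-config -n openshift-config
```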
Checked with 4.3.0-0.nightly-2019-10-29-040037 but it still fails.

$ oc get svc -n openshift-ingress
NAME             TYPE           CLUSTER-IP       EXTERNAL-IP   PORT(S)                      AGE
router-default   LoadBalancer   172.30.105.170   <pending>     80:30024/TCP,443:31426/TCP   10m

$ oc describe svc router-default -n openshift-ingress
Events:
  Type     Reason                  Age                    From                Message
  ----     ------                  ----                   ----                -------
  Normal   EnsuringLoadBalancer    96s (x4 over 2m11s)    service-controller  Ensuring load balancer
  Warning  SyncLoadBalancerFailed  96s (x4 over 2m11s)    service-controller  Error syncing load balancer: failed to ensure load balancer: ensure(openshift-ingress/router-default): lb(yinzhou-azu-r89xk-internal) - failed to get subnet: yinzhou-azu-r89xk-vnet/yinzhou-azu-r89xk-node-subnet
  Normal   EnsuringLoadBalancer    15s (x4 over 50s)      service-controller  Ensuring load balancer
  Warning  SyncLoadBalancerFailed  14s (x4 over 50s)      service-controller  Error syncing load balancer: failed to ensure load balancer: ensure(openshift-ingress/router-default): lb(yinzhou-azu-r89xk-internal) - failed to get subnet: yinzhou-azu-r89xk-vnet/yinzhou-azu-r89xk-node-subnet
$ oc version
Client Version: v4.2.0-alpha.0-249-gc276ecb
Server Version: 4.3.0-0.okd-2019-10-29-180250
Kubernetes Version: v1.16.2

Reproduce with:

$ oc replace --force --wait --filename - <<EOF
apiVersion: operator.openshift.io/v1
kind: IngressController
metadata:
  namespace: openshift-ingress-operator
  name: default
spec:
  endpointPublishingStrategy:
    type: LoadBalancerService
    loadBalancer:
      scope: Internal
EOF

Notice the incorrect subnet in the cloud provider config:

$ oc extract -n openshift-kube-controller-manager configmaps/cloud-config --keys config --to=-
# config
{
  "cloud": "AzurePublicCloud",
  // ...
  "location": "centralus",
  "vnetName": "dmace-7nprq-vnet",
  "vnetResourceGroup": "dmace-7nprq-rg",
  "subnetName": "dmace-7nprq-node-subnet",
  "securityGroupName": "dmace-7nprq-node-nsg",
  "routeTableName": "dmace-7nprq-node-routetable",
  // ...
}

I thought https://github.com/openshift/installer/pull/2556 was the easy fix, but at this point I'd like installer to take a look, because the incorrect cluster config here breaks the cloud provider config on Azure. Ingress is collateral damage.
Trevor shows the correct config in these release runs:

https://storage.googleapis.com/origin-ci-test/logs/release-openshift-ocp-installer-e2e-azure-4.3/328/artifacts/e2e-azure/must-gather/quay-io-openshift-release-dev-ocp-v4-0-art-dev-sha256-68bc9e16f9c3f085718ecf04e7876015be795206f80ebc32fff21f4e017623a8/namespaces/openshift-kube-controller-manager/core/configmaps.yaml
https://storage.googleapis.com/origin-ci-test/logs/release-openshift-ocp-installer-e2e-azure-4.3/327/artifacts/e2e-azure/must-gather/quay-io-openshift-release-dev-ocp-v4-0-art-dev-sha256-68bc9e16f9c3f085718ecf04e7876015be795206f80ebc32fff21f4e017623a8/namespaces/openshift-kube-controller-manager/core/configmaps.yaml

I was able to reproduce with an accepted CI build (registry.svc.ci.openshift.org/ocp/release@sha256:c7fe500453fc2a0d194f3b72ab91dbaa43cb48649240476dffc9a96a726a305d).

Bottom line: if you're seeing this problem, you're probably using a stale release image. How one can easily come to use the wrong image is another discussion...
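One way to rule out a stale image is to read the release image directly off the ClusterVersion object and cross-check its installer commit; a sketch (the jsonpath field is assumed from the standard ClusterVersion schema):

```
# Exact release image the cluster is running (by digest):
oc get clusterversion version -o jsonpath='{.status.desired.image}{"\n"}'

# Which installer commit that release was built from:
oc adm release info --commits <release-image> | grep ' installer '
```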
(In reply to Hongan Li from comment #5)
> checked with 4.3.0-0.nightly-2019-10-29-040037 but failed.

This release seems to have been garbage-collected, but looking for something close in [1] gives the bracketing [2,3]. Checking them:

$ oc adm release info --commits quay.io/openshift-release-dev/ocp-release-nightly@sha256:a1af6df78ebf893f6781f6b0ce821fc4631dd0afdf421c3fc6d4b32acb94be4e | grep ' installer '
  installer  https://github.com/openshift/installer  b87ca03305176def4bd0443ec1be96e01972d1ac
$ oc adm release info --commits quay.io/openshift-release-dev/ocp-release-nightly@sha256:87e2e5095d8efd339f4eb4bf200ab07e5f6274a461af337d72bd62a09dd37fc9 | grep ' installer '
  installer  https://github.com/openshift/installer  a9d73356bfc5046b1d66f674bb46df10199b83a4

Both of those should have the fix:

$ git log --first-parent --oneline cfa5d59c6c431..a9d73356bfc504
a9d73356b Merge pull request #2506 from JAORMX/add-gosec
...
b87ca0330 Merge pull request #2572 from iamemilio/openstack-comments
...
37fbe86e7 Merge pull request #2556 from ironcladlou/azure-cloudprovider-subnet-fix
...
1657e940e Merge pull request #2490 from Fedosin/kubeconfig_path

So I'm not clear on how we were still seeing the *-node-subnet form with 4.3.0-0.nightly-2019-10-29-040037. Can you run 'openshift-install create manifests' and look at manifests/cloud-provider-config.yaml (or something like that)?

[1]: https://mirror.openshift.com/pub/openshift-v4/clients/ocp-dev-preview/
[2]: https://mirror.openshift.com/pub/openshift-v4/clients/ocp-dev-preview/4.3.0-0.nightly-2019-10-28-144345/release.txt
[3]: https://mirror.openshift.com/pub/openshift-v4/clients/ocp-dev-preview/4.3.0-0.nightly-2019-10-29-073252/release.txt
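For example, the check suggested above could look roughly like this (the manifest filename is the one guessed at in the comment above, so treat it as an assumption):

```
# Render manifests from the installer being used and inspect the generated
# Azure cloud provider config for the subnet name it will hand to kube-controller-manager:
openshift-install create manifests --dir=mycluster
grep -i subnet mycluster/manifests/cloud-provider-config.yaml
```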
Created a cluster using the latest nightly from today on Azure.

```
oc get clusterversion version
NAME      VERSION                        AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.3.0-0.ci-2019-11-06-181827   True        False         2m19s   Cluster version is 4.3.0-0.ci-2019-11-06-181827
```

Testing the normal cloud provider first:

```
cat > ilb.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: hello-app
spec:
  selector:
    matchLabels:
      app: hello
  replicas: 3
  template:
    metadata:
      labels:
        app: hello
    spec:
      containers:
      - name: hello
        image: "gcr.io/google-samples/hello-app:2.0"
---
apiVersion: v1
kind: Service
metadata:
  name: ilb-service
  annotations:
    service.beta.kubernetes.io/azure-load-balancer-internal: "true"
  labels:
    app: hello
spec:
  type: LoadBalancer
  selector:
    app: hello
  ports:
  - port: 80
    targetPort: 8080
    protocol: TCP

oc create -f ilb.yaml
deployment.apps/hello-app created
service/ilb-service created

oc get svc
NAME          TYPE           CLUSTER-IP      EXTERNAL-IP                            PORT(S)        AGE
ilb-service   LoadBalancer   172.30.134.64   10.0.32.7                              80:32628/TCP   73s
kubernetes    ClusterIP      172.30.0.1      <none>                                 443/TCP        28m
openshift     ExternalName   <none>          kubernetes.default.svc.cluster.local   <none>         19m
```

So the cloud provider is set up correctly to do private ILBs.
So now, following the docs from https://docs.openshift.com/container-platform/4.2/release_notes/ocp-4-2-release-notes.html#ocp-4-2-enable-ingress-controllers and using the same cluster from https://bugzilla.redhat.com/show_bug.cgi?id=1763727#c10:

```
cat i-ingress.yaml
apiVersion: operator.openshift.io/v1
kind: IngressController
metadata:
  namespace: openshift-ingress-operator
  name: internal
spec:
  domain: apps.example.com
  endpointPublishingStrategy:
    type: LoadBalancerService
    loadBalancer:
      scope: Internal

oc create -f i-ingress.yaml
ingresscontroller.operator.openshift.io/internal created

oc get ingresscontroller -A
NAMESPACE                    NAME       AGE
openshift-ingress-operator   default    22m
openshift-ingress-operator   internal   7s

oc get ingresscontroller internal -n openshift-ingress-operator -oyaml
apiVersion: operator.openshift.io/v1
kind: IngressController
metadata:
  creationTimestamp: "2019-11-06T23:21:52Z"
  finalizers:
  - ingresscontroller.operator.openshift.io/finalizer-ingresscontroller
  generation: 1
  name: internal
  namespace: openshift-ingress-operator
  resourceVersion: "21139"
  selfLink: /apis/operator.openshift.io/v1/namespaces/openshift-ingress-operator/ingresscontrollers/internal
  uid: 89f35ff4-a943-4f8b-9b99-98106e97eb6d
spec:
  domain: apps.example.com
  endpointPublishingStrategy:
    loadBalancer:
      scope: Internal
    type: LoadBalancerService
status:
  availableReplicas: 0
  conditions:
  - lastTransitionTime: "2019-11-06T23:21:52Z"
    reason: Valid
    status: "True"
    type: Admitted
  - lastTransitionTime: "2019-11-06T23:21:53Z"
    message: 'The deployment is unavailable: Deployment does not have minimum availability.'
    reason: MinimumReplicasUnavailable
    status: "False"
    type: Available
  - lastTransitionTime: "2019-11-06T23:21:52Z"
    message: The endpoint publishing strategy supports a managed load balancer
    reason: WantedByEndpointPublishingStrategy
    status: "True"
    type: LoadBalancerManaged
  - lastTransitionTime: "2019-11-06T23:22:03Z"
    message: The LoadBalancer service is provisioned
    reason: LoadBalancerProvisioned
    status: "True"
    type: LoadBalancerReady
  - lastTransitionTime: "2019-11-06T23:21:52Z"
    message: DNS management is supported and zones are specified in the cluster DNS config.
    reason: Normal
    status: "True"
    type: DNSManaged
  - lastTransitionTime: "2019-11-06T23:22:04Z"
    message: The record is provisioned in all reported zones.
    reason: NoFailedZones
    status: "True"
    type: DNSReady
  - lastTransitionTime: "2019-11-06T23:21:52Z"
    status: "False"
    type: Degraded
  domain: apps.example.com
  endpointPublishingStrategy:
    loadBalancer:
      scope: Internal
    type: LoadBalancerService
  observedGeneration: 1
  selector: ingresscontroller.operator.openshift.io/deployment-ingresscontroller=internal
  tlsProfile:
    ciphers:
    - TLS_AES_128_GCM_SHA256
    - TLS_AES_256_GCM_SHA384
    - TLS_CHACHA20_POLY1305_SHA256
    - ECDHE-ECDSA-AES128-GCM-SHA256
    - ECDHE-RSA-AES128-GCM-SHA256
    - ECDHE-ECDSA-AES256-GCM-SHA384
    - ECDHE-RSA-AES256-GCM-SHA384
    - ECDHE-ECDSA-CHACHA20-POLY1305
    - ECDHE-RSA-CHACHA20-POLY1305
    - DHE-RSA-AES128-GCM-SHA256
    - DHE-RSA-AES256-GCM-SHA384
    minTLSVersion: VersionTLS12

oc get svc -n openshift-ingress
NAME                       TYPE           CLUSTER-IP       EXTERNAL-IP    PORT(S)                      AGE
router-default             LoadBalancer   172.30.194.93    40.67.186.92   80:30057/TCP,443:32644/TCP   22m
router-internal            LoadBalancer   172.30.66.26     10.0.32.8      80:32035/TCP,443:30927/TCP   22s
router-internal-default    ClusterIP      172.30.237.73    <none>         80/TCP,443/TCP,1936/TCP      22m
router-internal-internal   ClusterIP      172.30.142.93    <none>         80/TCP,443/TCP,1936/TCP      22s
```

The internal ingress is working as expected.
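As an additional (assumed) verification, one could probe the internal router from a host inside the cluster's VNet; with no matching route, the HAProxy router would be expected to answer with a 503 from its default backend. The IP below is the router-internal EXTERNAL-IP from the output above:

```
# Run from a VM or pod inside the VNet; the ILB IP is not reachable externally.
curl -s -o /dev/null -w '%{http_code}\n' http://10.0.32.8/
```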
Based on the previous comments, please retry QA, if possible, by installing a new cluster using the latest 4.3 nightly.
Thank you, Abhinav. Verified with 4.3.0-0.nightly-2019-11-06-184828; the issue has been fixed.

$ oc get svc -n openshift-ingress
NAME                       TYPE           CLUSTER-IP       EXTERNAL-IP   PORT(S)                      AGE
router-default             LoadBalancer   172.30.130.232   10.0.32.7     80:31085/TCP,443:31309/TCP   36s
router-internal            LoadBalancer   172.30.171.240   10.0.32.6     80:30424/TCP,443:32364/TCP   7m47s
router-internal-default    ClusterIP      172.30.126.177   <none>        80/TCP,443/TCP,1936/TCP      36s
router-internal-internal   ClusterIP      172.30.251.51    <none>        80/TCP,443/TCP,1936/TCP      7m47s
*** Bug 1776672 has been marked as a duplicate of this bug. ***
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHBA-2020:0062