Bug 1763727 - Ingress controller private issues: per https://docs.openshift.com/container-platform/4.2/release_notes/ocp-4-2-release-notes.html#ocp-4-2-enable-ingress-controllers, works on AWS but fails on Azure
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Installer
Version: 4.2.0
Hardware: All
OS: All
Priority: high
Severity: high
Target Milestone: ---
Target Release: 4.3.0
Assignee: Abhinav Dahiya
QA Contact: Hongan Li
URL:
Whiteboard:
Duplicates: 1776672
Depends On:
Blocks:
 
Reported: 2019-10-21 12:51 UTC by alex glenn
Modified: 2023-10-06 18:41 UTC
CC: 9 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2020-01-23 11:08:26 UTC
Target Upstream Version:
Embargoed:


Attachments: none


Links:
- GitHub: openshift/installer pull 2556 (closed) - Fix Azure cloud provider subnet reference (last updated 2021-02-17 17:45:39 UTC)
- Red Hat Product Errata RHBA-2020:0062 (last updated 2020-01-23 11:08:56 UTC)

Description alex glenn 2019-10-21 12:51:00 UTC
Description of problem:

Has anyone looked at deploying a private LB for the ingress router yet? Following https://docs.openshift.com/container-platform/4.2/release_notes/ocp-4-2-release-notes.html#ocp-4-2-enable-ingress-controllers, it works on AWS but fails on Azure.

It fails because the referenced subnet does not exist:

cluster63-99qm4-vnet/cluster63-99qm4-node-subnet

The subnet that actually gets created in Azure is worker-subnet, not node-subnet; maybe a bug with naming standards?


Version-Release number of selected component (if applicable):

4.2.0 on Azure


How reproducible:

every time


Steps to Reproduce:
1. Destroy ingress router default and re-create using https://docs.openshift.com/container-platform/4.2/release_notes/ocp-4-2-release-notes.html#ocp-4-2-enable-ingress-controllers
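
For reference, a minimal sketch of that recreate step, mirroring the command Dan Mace uses in comment 7 below (the IngressController name and namespace are the defaults; adjust as needed):

```
$ oc replace --force --wait --filename - <<EOF
apiVersion: operator.openshift.io/v1
kind: IngressController
metadata:
  namespace: openshift-ingress-operator
  name: default
spec:
  endpointPublishingStrategy:
    type: LoadBalancerService
    loadBalancer:
      scope: Internal
EOF
```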


Actual results:

The Service for the internal load balancer sits in pending with the following error:


Events:
  Type     Reason                      Age                  From                Message
  ----     ------                      ----                 ----                -------
  Normal   EnsuringLoadBalancer        2m45s (x9 over 18m)  service-controller  Ensuring load balancer
  Warning  CreatingLoadBalancerFailed  2m45s (x9 over 18m)  service-controller  Error creating load balancer (will retry): failed to ensure load balancer for service openshift-ingress/router-default: ensure(openshift-ingress/router-default): lb(cluster63-99qm4-internal) - failed to get subnet: cluster63-99qm4-vnet/cluster63-99qm4-node-subnet

The subnet

cluster63-99qm4-vnet/cluster63-99qm4-node-subnet

does not exist. The subnets that actually get created are:

clustername-UID-worker-subnet
clustername-UID-master-subnet

Interestingly, the NSG for clustername-UID-worker-subnet is called clustername-UID-node-nsg.
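
For anyone confirming what actually exists in Azure, a hedged sketch using the Azure CLI (the resource group name is an assumption based on the installer's usual <infraID>-rg pattern; substitute the cluster's real values):

```
# List the subnets in the cluster VNet; per this report, only the
# -master-subnet and -worker-subnet names should appear.
$ az network vnet subnet list \
    --resource-group cluster63-99qm4-rg \
    --vnet-name cluster63-99qm4-vnet \
    --query '[].name' --output tsv
```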


Expected results:

The service starts correctly with an internal LB IP on Azure; presumably this should be applied against clustername-UID-worker-subnet.


Additional info: This works fine on AWS; the issue is specific to Azure.

Comment 1 Dan Mace 2019-10-23 17:41:27 UTC
I was able to reproduce this with 4.3.0-0.ci-2019-10-22-114945 on Azure.

$ oc get -n openshift-ingress events

6s          Warning   SyncLoadBalancerFailed            service/router-internal-apps                 Error syncing load balancer: failed to ensure load balancer: ensure(openshift-ingress/router-internal-apps): lb(dmace-kj94v-internal) - failed to get subnet: dmace-kj94v-vnet/dmace-kj94v-node-subnet


This could be an issue with the upstream cloud provider code, as our only interface to Azure in this regard is the `service.beta.kubernetes.io/azure-load-balancer-internal` annotation on a LoadBalancer Service. From there it's up to the Kubernetes cloud provider code to do the right thing.
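
For context, a minimal sketch of a LoadBalancer Service carrying that annotation (names are illustrative; comment 10 below exercises essentially the same shape):

```
apiVersion: v1
kind: Service
metadata:
  name: example-internal
  annotations:
    service.beta.kubernetes.io/azure-load-balancer-internal: "true"
spec:
  type: LoadBalancer
  selector:
    app: example
  ports:
  - port: 80
    targetPort: 8080
    protocol: TCP
```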

Comment 2 Dan Mace 2019-10-23 18:49:45 UTC
The problem is invalid cloud provider configuration data provided by the installer and consumed by the kube-controller-manager.

I've fixed the installer in https://github.com/openshift/installer/pull/2556, but existing 4.2 Azure clusters will have invalid cloud provider ConfigMap contents that will need to be repaired by the operator responsible for the kube-controller-manager. I'm going to reassign the bug to that component.
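
A quick, hedged way to check whether an existing cluster carries the bad value, mirroring the extraction shown in comment 7 (a subnetName ending in -node-subnet indicates the broken config):

```
$ oc extract -n openshift-kube-controller-manager configmaps/cloud-config \
    --keys config --to=- | grep -E 'subnetName|securityGroupName|routeTableName'
```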

Comment 3 Maciej Szulik 2019-10-28 13:56:02 UTC
I don't think there's more to be done here than documenting this. Moving to QA.

Comment 5 Hongan Li 2019-10-29 09:26:15 UTC
Checked with 4.3.0-0.nightly-2019-10-29-040037, but it still failed.

$ oc get svc -n openshift-ingress
NAME                      TYPE           CLUSTER-IP       EXTERNAL-IP   PORT(S)                      AGE
router-default            LoadBalancer   172.30.105.170   <pending>     80:30024/TCP,443:31426/TCP   10m

$ oc describe svc router-default -n openshift-ingress
Events:
  Type     Reason                  Age                  From                Message
  ----     ------                  ----                 ----                -------
  Normal   EnsuringLoadBalancer    96s (x4 over 2m11s)  service-controller  Ensuring load balancer
  Warning  SyncLoadBalancerFailed  96s (x4 over 2m11s)  service-controller  Error syncing load balancer: failed to ensure load balancer: ensure(openshift-ingress/router-default): lb(yinzhou-azu-r89xk-internal) - failed to get subnet: yinzhou-azu-r89xk-vnet/yinzhou-azu-r89xk-node-subnet
  Normal   EnsuringLoadBalancer    15s (x4 over 50s)    service-controller  Ensuring load balancer
  Warning  SyncLoadBalancerFailed  14s (x4 over 50s)    service-controller  Error syncing load balancer: failed to ensure load balancer: ensure(openshift-ingress/router-default): lb(yinzhou-azu-r89xk-internal) - failed to get subnet: yinzhou-azu-r89xk-vnet/yinzhou-azu-r89xk-node-subnet

Comment 7 Dan Mace 2019-11-06 22:01:03 UTC
$ oc version
Client Version: v4.2.0-alpha.0-249-gc276ecb
Server Version: 4.3.0-0.okd-2019-10-29-180250
Kubernetes Version: v1.16.2

Reproduce with:

$ oc replace --force --wait --filename - <<EOF
apiVersion: operator.openshift.io/v1
kind: IngressController
metadata:
  namespace: openshift-ingress-operator
  name: default
spec:
  endpointPublishingStrategy:
    type: LoadBalancerService
    loadBalancer:
      scope: Internal
EOF

Notice the incorrect subnet in the cloud provider config:

$ oc extract -n openshift-kube-controller-manager configmaps/cloud-config --keys config --to=-
# config
{
        "cloud": "AzurePublicCloud",
        // ...
        "location": "centralus",
        "vnetName": "dmace-7nprq-vnet",
        "vnetResourceGroup": "dmace-7nprq-rg",
        "subnetName": "dmace-7nprq-node-subnet",
        "securityGroupName": "dmace-7nprq-node-nsg",
        "routeTableName": "dmace-7nprq-node-routetable",
        // ...
}

I thought https://github.com/openshift/installer/pull/2556 was the easy fix, but at this point I'd like the installer team to take a look, because the incorrect cluster config here breaks the cloud provider config on Azure. Ingress is collateral damage.

Comment 8 Dan Mace 2019-11-06 22:22:02 UTC
Trevor shows the correct config in these release runs:

https://storage.googleapis.com/origin-ci-test/logs/release-openshift-ocp-installer-e2e-azure-4.3/328/artifacts/e2e-azure/must-gather/quay-io-openshift-release-dev-ocp-v4-0-art-dev-sha256-68bc9e16f9c3f085718ecf04e7876015be795206f80ebc32fff21f4e017623a8/namespaces/openshift-kube-controller-manager/core/configmaps.yaml

https://storage.googleapis.com/origin-ci-test/logs/release-openshift-ocp-installer-e2e-azure-4.3/327/artifacts/e2e-azure/must-gather/quay-io-openshift-release-dev-ocp-v4-0-art-dev-sha256-68bc9e16f9c3f085718ecf04e7876015be795206f80ebc32fff21f4e017623a8/namespaces/openshift-kube-controller-manager/core/configmaps.yaml

I was able to reproduce with an accepted CI build (registry.svc.ci.openshift.org/ocp/release@sha256:c7fe500453fc2a0d194f3b72ab91dbaa43cb48649240476dffc9a96a726a305d). Bottom line: it looks like if you're seeing this problem, you're probably using a stale release image. How one can easily come to use the wrong image is another discussion...

Comment 9 W. Trevor King 2019-11-06 22:34:47 UTC
(In reply to Hongan Li from comment #5)
> checked with 4.3.0-0.nightly-2019-10-29-040037 but failed.

This release seems to have been garbage-collected, but looking for something close in [1] gives the bracketing [2,3].  Checking them:

$ oc adm release info --commits quay.io/openshift-release-dev/ocp-release-nightly@sha256:a1af6df78ebf893f6781f6b0ce821fc4631dd0afdf421c3fc6d4b32acb94be4e | grep ' installer '
  installer                                     https://github.com/openshift/installer                                     b87ca03305176def4bd0443ec1be96e01972d1ac
$ oc adm release info --commits quay.io/openshift-release-dev/ocp-release-nightly@sha256:87e2e5095d8efd339f4eb4bf200ab07e5f6274a461af337d72bd62a09dd37fc9 | grep ' installer '
  installer                                     https://github.com/openshift/installer                                     a9d73356bfc5046b1d66f674bb46df10199b83a4

Both of those should have the fix:

$ git log --first-parent --oneline cfa5d59c6c431..a9d73356bfc504
a9d73356b Merge pull request #2506 from JAORMX/add-gosec
...
b87ca0330 Merge pull request #2572 from iamemilio/openstack-comments
...
37fbe86e7 Merge pull request #2556 from ironcladlou/azure-cloudprovider-subnet-fix
...
1657e940e Merge pull request #2490 from Fedosin/kubeconfig_path

So I'm not clear on how we were still seeing the *-node-subnet form with 4.3.0-0.nightly-2019-10-29-040037. Can you run 'openshift-install create manifests' and look at manifests/cloud-provider-config.yaml (or something like that)? A sketch of that check follows the reference links below.

[1]: https://mirror.openshift.com/pub/openshift-v4/clients/ocp-dev-preview/
[2]: https://mirror.openshift.com/pub/openshift-v4/clients/ocp-dev-preview/4.3.0-0.nightly-2019-10-28-144345/release.txt
[3]: https://mirror.openshift.com/pub/openshift-v4/clients/ocp-dev-preview/4.3.0-0.nightly-2019-10-29-073252/release.txt
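
A sketch of that check (the --dir value is a placeholder, and the manifest filename is as guessed above, so it may differ):

```
$ openshift-install create manifests --dir=mycluster
$ grep -i subnetName mycluster/manifests/cloud-provider-config.yaml
# Expect the -worker-subnet form if the fix from installer PR 2556 is present.
```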

Comment 10 Abhinav Dahiya 2019-11-06 23:21:24 UTC
Created a cluster using latest nightly from today on Azure.

```
oc get clusterversion version
NAME      VERSION                        AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.3.0-0.ci-2019-11-06-181827   True        False         2m19s   Cluster version is 4.3.0-0.ci-2019-11-06-181827
```

Testing the normal cloud provider:

```
cat > ilb.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: hello-app
spec:
  selector:
    matchLabels:
      app: hello
  replicas: 3
  template:
    metadata:
      labels:
        app: hello
    spec:
      containers:
      - name: hello
        image: "gcr.io/google-samples/hello-app:2.0"
---
apiVersion: v1
kind: Service
metadata:
  name: ilb-service
  annotations:
    service.beta.kubernetes.io/azure-load-balancer-internal: "true"
  labels:
    app: hello
spec:
  type: LoadBalancer
  selector:
    app: hello
  ports:
  - port: 80
    targetPort: 8080
    protocol: TCP


oc create -f ilb.yaml
deployment.apps/hello-app created
service/ilb-service created

oc get svc
NAME          TYPE           CLUSTER-IP      EXTERNAL-IP                            PORT(S)        AGE
ilb-service   LoadBalancer   172.30.134.64   10.0.32.7                              80:32628/TCP   73s
kubernetes    ClusterIP      172.30.0.1      <none>                                 443/TCP        28m
openshift     ExternalName   <none>          kubernetes.default.svc.cluster.local   <none>         19m

```

So the cloud provider is set up correctly to create private ILBs.

Comment 11 Abhinav Dahiya 2019-11-06 23:24:18 UTC
Now, following the docs from https://docs.openshift.com/container-platform/4.2/release_notes/ocp-4-2-release-notes.html#ocp-4-2-enable-ingress-controllers and using the same cluster as in https://bugzilla.redhat.com/show_bug.cgi?id=1763727#c10:

```
cat i-ingress.yaml
apiVersion: operator.openshift.io/v1
kind: IngressController
metadata:
  namespace: openshift-ingress-operator
  name: internal
spec:
  domain: apps.example.com
  endpointPublishingStrategy:
    type: LoadBalancerService
    loadBalancer:
      scope: Internal


oc create -f i-ingress.yaml
ingresscontroller.operator.openshift.io/internal created

oc get ingresscontroller -A
NAMESPACE                    NAME       AGE
openshift-ingress-operator   default    22m
openshift-ingress-operator   internal   7s

oc get ingresscontroller internal -n openshift-ingress-operator -oyaml
apiVersion: operator.openshift.io/v1
kind: IngressController
metadata:
  creationTimestamp: "2019-11-06T23:21:52Z"
  finalizers:
  - ingresscontroller.operator.openshift.io/finalizer-ingresscontroller
  generation: 1
  name: internal
  namespace: openshift-ingress-operator
  resourceVersion: "21139"
  selfLink: /apis/operator.openshift.io/v1/namespaces/openshift-ingress-operator/ingresscontrollers/internal
  uid: 89f35ff4-a943-4f8b-9b99-98106e97eb6d
spec:
  domain: apps.example.com
  endpointPublishingStrategy:
    loadBalancer:
      scope: Internal
    type: LoadBalancerService
status:
  availableReplicas: 0
  conditions:
  - lastTransitionTime: "2019-11-06T23:21:52Z"
    reason: Valid
    status: "True"
    type: Admitted
  - lastTransitionTime: "2019-11-06T23:21:53Z"
    message: 'The deployment is unavailable: Deployment does not have minimum availability.'
    reason: MinimumReplicasUnavailable
    status: "False"
    type: Available
  - lastTransitionTime: "2019-11-06T23:21:52Z"
    message: The endpoint publishing strategy supports a managed load balancer
    reason: WantedByEndpointPublishingStrategy
    status: "True"
    type: LoadBalancerManaged
  - lastTransitionTime: "2019-11-06T23:22:03Z"
    message: The LoadBalancer service is provisioned
    reason: LoadBalancerProvisioned
    status: "True"
    type: LoadBalancerReady
  - lastTransitionTime: "2019-11-06T23:21:52Z"
    message: DNS management is supported and zones are specified in the cluster DNS
      config.
    reason: Normal
    status: "True"
    type: DNSManaged
  - lastTransitionTime: "2019-11-06T23:22:04Z"
    message: The record is provisioned in all reported zones.
    reason: NoFailedZones
    status: "True"
    type: DNSReady
  - lastTransitionTime: "2019-11-06T23:21:52Z"
    status: "False"
    type: Degraded
  domain: apps.example.com
  endpointPublishingStrategy:
    loadBalancer:
      scope: Internal
    type: LoadBalancerService
  observedGeneration: 1
  selector: ingresscontroller.operator.openshift.io/deployment-ingresscontroller=internal
  tlsProfile:
    ciphers:
    - TLS_AES_128_GCM_SHA256
    - TLS_AES_256_GCM_SHA384
    - TLS_CHACHA20_POLY1305_SHA256
    - ECDHE-ECDSA-AES128-GCM-SHA256
    - ECDHE-RSA-AES128-GCM-SHA256
    - ECDHE-ECDSA-AES256-GCM-SHA384
    - ECDHE-RSA-AES256-GCM-SHA384
    - ECDHE-ECDSA-CHACHA20-POLY1305
    - ECDHE-RSA-CHACHA20-POLY1305
    - DHE-RSA-AES128-GCM-SHA256
    - DHE-RSA-AES256-GCM-SHA384
    minTLSVersion: VersionTLS12

oc get svc -n openshift-ingress
NAME                       TYPE           CLUSTER-IP      EXTERNAL-IP    PORT(S)                      AGE
router-default             LoadBalancer   172.30.194.93   40.67.186.92   80:30057/TCP,443:32644/TCP   22m
router-internal            LoadBalancer   172.30.66.26    10.0.32.8      80:32035/TCP,443:30927/TCP   22s
router-internal-default    ClusterIP      172.30.237.73   <none>         80/TCP,443/TCP,1936/TCP      22m
router-internal-internal   ClusterIP      172.30.142.93   <none>         80/TCP,443/TCP,1936/TCP      22s

```

The internal ingress is working as expected.

Comment 12 Abhinav Dahiya 2019-11-06 23:33:25 UTC
Based on the previous comments, please retry QA, if possible, by installing a new cluster with the latest 4.3 nightly.
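
To confirm that the nightly under test actually contains the installer fix, the commit check from comment 9 can be reused (the release pullspec is a placeholder):

```
$ oc adm release info --commits <release-pullspec> | grep ' installer '
# The installer commit should be at or after the merge of
# https://github.com/openshift/installer/pull/2556.
```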

Comment 13 Hongan Li 2019-11-07 03:10:25 UTC
Thank you, Abhinav.

Verified with 4.3.0-0.nightly-2019-11-06-184828; the issue has been fixed.

$ oc get svc -n openshift-ingress
NAME                       TYPE           CLUSTER-IP       EXTERNAL-IP   PORT(S)                      AGE
router-default             LoadBalancer   172.30.130.232   10.0.32.7     80:31085/TCP,443:31309/TCP   36s
router-internal            LoadBalancer   172.30.171.240   10.0.32.6     80:30424/TCP,443:32364/TCP   7m47s
router-internal-default    ClusterIP      172.30.126.177   <none>        80/TCP,443/TCP,1936/TCP      36s
router-internal-internal   ClusterIP      172.30.251.51    <none>        80/TCP,443/TCP,1936/TCP      7m47s

Comment 15 Dan Mace 2019-12-03 13:54:44 UTC
*** Bug 1776672 has been marked as a duplicate of this bug. ***

Comment 17 errata-xmlrpc 2020-01-23 11:08:26 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2020:0062

