Bug 1763727
| Summary: | Ingress controller private issues: as per https://docs.openshift.com/container-platform/4.2/release_notes/ocp-4-2-release-notes.html#ocp-4-2-enable-ingress-controllers, this looks to work OK on AWS but we are getting issues on Azure. | ||
|---|---|---|---|
| Product: | OpenShift Container Platform | Reporter: | alex glenn <aglenn> |
| Component: | Installer | Assignee: | Abhinav Dahiya <adahiya> |
| Installer sub component: | openshift-installer | QA Contact: | Hongan Li <hongli> |
| Status: | CLOSED ERRATA | Docs Contact: | |
| Severity: | high | ||
| Priority: | high | CC: | aos-bugs, arghosh, ckyriaki, dmace, hongli, jialiu, mfojtik, mharri, wking |
| Version: | 4.2.0 | Keywords: | TestBlocker |
| Target Milestone: | --- | ||
| Target Release: | 4.3.0 | ||
| Hardware: | All | ||
| OS: | All | ||
| Whiteboard: | |||
| Fixed In Version: | | Doc Type: | If docs needed, set a value |
| Doc Text: | | Story Points: | --- |
| Clone Of: | | Environment: | |
| Last Closed: | 2020-01-23 11:08:26 UTC | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | |||
Description
alex glenn
2019-10-21 12:51:00 UTC
I was able to reproduce this with 4.3.0-0.ci-2019-10-22-114945 on Azure:

```
$ oc get -n openshift-ingress events
6s   Warning   SyncLoadBalancerFailed   service/router-internal-apps   Error syncing load balancer: failed to ensure load balancer: ensure(openshift-ingress/router-internal-apps): lb(dmace-kj94v-internal) - failed to get subnet: dmace-kj94v-vnet/dmace-kj94v-node-subnet
```

This could be an issue with the upstream cloud provider code, as our only interface to Azure in this regard is the `service.beta.kubernetes.io/azure-load-balancer-internal` annotation on a LoadBalancer Service. From there it is up to the Kubernetes cloud provider code to do the right thing.

The problem is invalid cloud provider configuration data provided by the installer and consumed by the kube-controller-manager. I've fixed the installer in https://github.com/openshift/installer/pull/2556, but existing 4.2 Azure clusters will have invalid cloud provider ConfigMap contents which will need to be repaired by the operator responsible for kube-controller-manager. I'm going to reassign the bug to that component.

I don't think there's more to be done here than documenting this. Moving to QA.

Checked with 4.3.0-0.nightly-2019-10-29-040037 but it failed:
```
$ oc get svc -n openshift-ingress
NAME             TYPE           CLUSTER-IP       EXTERNAL-IP   PORT(S)                      AGE
router-default   LoadBalancer   172.30.105.170   <pending>     80:30024/TCP,443:31426/TCP   10m

$ oc describe svc router-default -n openshift-ingress
Events:
  Type     Reason                  Age                  From                Message
  ----     ------                  ----                 ----                -------
  Normal   EnsuringLoadBalancer    96s (x4 over 2m11s)  service-controller  Ensuring load balancer
  Warning  SyncLoadBalancerFailed  96s (x4 over 2m11s)  service-controller  Error syncing load balancer: failed to ensure load balancer: ensure(openshift-ingress/router-default): lb(yinzhou-azu-r89xk-internal) - failed to get subnet: yinzhou-azu-r89xk-vnet/yinzhou-azu-r89xk-node-subnet
  Normal   EnsuringLoadBalancer    15s (x4 over 50s)    service-controller  Ensuring load balancer
  Warning  SyncLoadBalancerFailed  14s (x4 over 50s)    service-controller  Error syncing load balancer: failed to ensure load balancer: ensure(openshift-ingress/router-default): lb(yinzhou-azu-r89xk-internal) - failed to get subnet: yinzhou-azu-r89xk-vnet/yinzhou-azu-r89xk-node-subnet

$ oc version
Client Version: v4.2.0-alpha.0-249-gc276ecb
Server Version: 4.3.0-0.okd-2019-10-29-180250
Kubernetes Version: v1.16.2
```
Reproduce with:
```
$ oc replace --force --wait --filename - <<EOF
apiVersion: operator.openshift.io/v1
kind: IngressController
metadata:
  namespace: openshift-ingress-operator
  name: default
spec:
  endpointPublishingStrategy:
    type: LoadBalancerService
    loadBalancer:
      scope: Internal
EOF
```
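For context, setting `scope: Internal` works because, as noted above, the only interface to Azure here is the internal-LB annotation on a LoadBalancer Service. A minimal illustrative Service carrying that annotation might look like this (the name and selector are placeholders, not the operator's actual manifest):

```yaml
apiVersion: v1
kind: Service
metadata:
  name: example-internal-lb        # placeholder name
  annotations:
    # Tells the Azure cloud provider to provision an internal load balancer
    # on the cluster's node subnet instead of a public one.
    service.beta.kubernetes.io/azure-load-balancer-internal: "true"
spec:
  type: LoadBalancer
  selector:
    app: example                   # placeholder selector
  ports:
  - port: 80
    targetPort: 8080
```

The cloud provider then resolves the subnet for the internal frontend from the cloud provider config, which is where this bug bites.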
Notice the incorrect subnet in the cloud provider config:
```
$ oc extract -n openshift-kube-controller-manager configmaps/cloud-config --keys config --to=-
# config
{
  "cloud": "AzurePublicCloud",
  // ...
  "location": "centralus",
  "vnetName": "dmace-7nprq-vnet",
  "vnetResourceGroup": "dmace-7nprq-rg",
  "subnetName": "dmace-7nprq-node-subnet",
  "securityGroupName": "dmace-7nprq-node-nsg",
  "routeTableName": "dmace-7nprq-node-routetable",
  // ...
}
```
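The failure mode above can be checked mechanically: strip the `//` comment lines from the extracted config, parse the JSON, and look for the `*-node-subnet` form that this bug identifies as not matching any real subnet. A minimal sketch (the helper name and trimmed sample are ours, not OpenShift's):

```python
import json
import re

def load_azure_cloud_config(text):
    # Strip the //-style comment lines seen in the extract above;
    # the remainder is plain JSON.
    stripped = re.sub(r"^\s*//.*$", "", text, flags=re.MULTILINE)
    return json.loads(stripped)

# Illustrative sample modeled on the ConfigMap contents above (not a full config).
sample = """
{
  "cloud": "AzurePublicCloud",
  // trimmed for brevity
  "location": "centralus",
  "vnetName": "dmace-7nprq-vnet",
  "subnetName": "dmace-7nprq-node-subnet"
}
"""

cfg = load_azure_cloud_config(sample)
# Per this bug, the "*-node-subnet" name did not exist in the VNet, so the
# cloud provider's subnet lookup failed when ensuring the internal LB.
suspect = cfg["subnetName"].endswith("-node-subnet")
print(cfg["subnetName"], "suspect:", suspect)
```

On a cluster fixed by installer PR #2556, the parsed `subnetName` would instead name a subnet that actually exists in the VNet.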
I thought https://github.com/openshift/installer/pull/2556 was the easy fix, but at this point I'd like installer to take a look because the incorrect cluster config here breaks cloud provider config on Azure. Ingress is collateral damage.
Trevor shows the correct config in these release runs:

https://storage.googleapis.com/origin-ci-test/logs/release-openshift-ocp-installer-e2e-azure-4.3/328/artifacts/e2e-azure/must-gather/quay-io-openshift-release-dev-ocp-v4-0-art-dev-sha256-68bc9e16f9c3f085718ecf04e7876015be795206f80ebc32fff21f4e017623a8/namespaces/openshift-kube-controller-manager/core/configmaps.yaml
https://storage.googleapis.com/origin-ci-test/logs/release-openshift-ocp-installer-e2e-azure-4.3/327/artifacts/e2e-azure/must-gather/quay-io-openshift-release-dev-ocp-v4-0-art-dev-sha256-68bc9e16f9c3f085718ecf04e7876015be795206f80ebc32fff21f4e017623a8/namespaces/openshift-kube-controller-manager/core/configmaps.yaml

I was able to reproduce with an accepted CI build (registry.svc.ci.openshift.org/ocp/release@sha256:c7fe500453fc2a0d194f3b72ab91dbaa43cb48649240476dffc9a96a726a305d). Bottom line: if you're seeing this problem, you're probably using a stale release image. How one can easily come to use the wrong image is another discussion...

(In reply to Hongan Li from comment #5)
> checked with 4.3.0-0.nightly-2019-10-29-040037 but failed.

This release seems to have been garbage-collected, but looking for something close in [1] gives the bracketing [2,3]. Checking them:

```
$ oc adm release info --commits quay.io/openshift-release-dev/ocp-release-nightly@sha256:a1af6df78ebf893f6781f6b0ce821fc4631dd0afdf421c3fc6d4b32acb94be4e | grep ' installer '
  installer   https://github.com/openshift/installer   b87ca03305176def4bd0443ec1be96e01972d1ac
$ oc adm release info --commits quay.io/openshift-release-dev/ocp-release-nightly@sha256:87e2e5095d8efd339f4eb4bf200ab07e5f6274a461af337d72bd62a09dd37fc9 | grep ' installer '
  installer   https://github.com/openshift/installer   a9d73356bfc5046b1d66f674bb46df10199b83a4
```

Both of those should have the fix:

```
$ git log --first-parent --oneline cfa5d59c6c431..a9d73356bfc504
a9d73356b Merge pull request #2506 from JAORMX/add-gosec
...
b87ca0330 Merge pull request #2572 from iamemilio/openstack-comments
...
37fbe86e7 Merge pull request #2556 from ironcladlou/azure-cloudprovider-subnet-fix
...
1657e940e Merge pull request #2490 from Fedosin/kubeconfig_path
```

So I'm not clear on how we were still seeing the *-node-subnet form with 4.3.0-0.nightly-2019-10-29-040037. Can you run 'openshift-install create manifests' and look at manifests/cloud-provider-config.yaml (or something like that)?

[1]: https://mirror.openshift.com/pub/openshift-v4/clients/ocp-dev-preview/
[2]: https://mirror.openshift.com/pub/openshift-v4/clients/ocp-dev-preview/4.3.0-0.nightly-2019-10-28-144345/release.txt
[3]: https://mirror.openshift.com/pub/openshift-v4/clients/ocp-dev-preview/4.3.0-0.nightly-2019-10-29-073252/release.txt

Created a cluster using the latest nightly from today on Azure:
```
oc get clusterversion version
NAME VERSION AVAILABLE PROGRESSING SINCE STATUS
version 4.3.0-0.ci-2019-11-06-181827 True False 2m19s Cluster version is 4.3.0-0.ci-2019-11-06-181827
```
Testing the normal cloud provider:
```
cat > ilb.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: hello-app
spec:
  selector:
    matchLabels:
      app: hello
  replicas: 3
  template:
    metadata:
      labels:
        app: hello
    spec:
      containers:
      - name: hello
        image: "gcr.io/google-samples/hello-app:2.0"
---
apiVersion: v1
kind: Service
metadata:
  name: ilb-service
  annotations:
    service.beta.kubernetes.io/azure-load-balancer-internal: "true"
  labels:
    app: hello
spec:
  type: LoadBalancer
  selector:
    app: hello
  ports:
  - port: 80
    targetPort: 8080
    protocol: TCP

oc create -f ilb.yaml
deployment.apps/hello-app created
service/ilb-service created

oc get svc
NAME          TYPE           CLUSTER-IP      EXTERNAL-IP                            PORT(S)        AGE
ilb-service   LoadBalancer   172.30.134.64   10.0.32.7                              80:32628/TCP   73s
kubernetes    ClusterIP      172.30.0.1      <none>                                 443/TCP        28m
openshift     ExternalName   <none>          kubernetes.default.svc.cluster.local   <none>         19m
```
So the cloud provider is set up correctly to provision private internal load balancers (ILBs).
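The quick sanity check being performed here is that the Service's EXTERNAL-IP is a private (RFC 1918) address rather than a public one. A small sketch of that check (the helper name is illustrative):

```python
import ipaddress

def is_internal_frontend(external_ip):
    # An internal Azure LB frontend is a private address drawn from the node
    # subnet; a public LB frontend is globally routable.
    return ipaddress.ip_address(external_ip).is_private

# EXTERNAL-IP values taken from oc get svc output in this bug.
print(is_internal_frontend("10.0.32.7"))     # ilb-service above -> True
print(is_internal_frontend("40.67.186.92"))  # a public router frontend -> False
```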
So now, following the docs from https://docs.openshift.com/container-platform/4.2/release_notes/ocp-4-2-release-notes.html#ocp-4-2-enable-ingress-controllers and using the same cluster from https://bugzilla.redhat.com/show_bug.cgi?id=1763727#c10:

```
cat i-ingress.yaml
apiVersion: operator.openshift.io/v1
kind: IngressController
metadata:
  namespace: openshift-ingress-operator
  name: internal
spec:
  domain: apps.example.com
  endpointPublishingStrategy:
    type: LoadBalancerService
    loadBalancer:
      scope: Internal

oc create -f i-ingress.yaml
ingresscontroller.operator.openshift.io/internal created

oc get ingresscontroller -A
NAMESPACE                    NAME       AGE
openshift-ingress-operator   default    22m
openshift-ingress-operator   internal   7s

oc get ingresscontroller internal -n openshift-ingress-operator -oyaml
apiVersion: operator.openshift.io/v1
kind: IngressController
metadata:
  creationTimestamp: "2019-11-06T23:21:52Z"
  finalizers:
  - ingresscontroller.operator.openshift.io/finalizer-ingresscontroller
  generation: 1
  name: internal
  namespace: openshift-ingress-operator
  resourceVersion: "21139"
  selfLink: /apis/operator.openshift.io/v1/namespaces/openshift-ingress-operator/ingresscontrollers/internal
  uid: 89f35ff4-a943-4f8b-9b99-98106e97eb6d
spec:
  domain: apps.example.com
  endpointPublishingStrategy:
    loadBalancer:
      scope: Internal
    type: LoadBalancerService
status:
  availableReplicas: 0
  conditions:
  - lastTransitionTime: "2019-11-06T23:21:52Z"
    reason: Valid
    status: "True"
    type: Admitted
  - lastTransitionTime: "2019-11-06T23:21:53Z"
    message: 'The deployment is unavailable: Deployment does not have minimum availability.'
    reason: MinimumReplicasUnavailable
    status: "False"
    type: Available
  - lastTransitionTime: "2019-11-06T23:21:52Z"
    message: The endpoint publishing strategy supports a managed load balancer
    reason: WantedByEndpointPublishingStrategy
    status: "True"
    type: LoadBalancerManaged
  - lastTransitionTime: "2019-11-06T23:22:03Z"
    message: The LoadBalancer service is provisioned
    reason: LoadBalancerProvisioned
    status: "True"
    type: LoadBalancerReady
  - lastTransitionTime: "2019-11-06T23:21:52Z"
    message: DNS management is supported and zones are specified in the cluster DNS config.
    reason: Normal
    status: "True"
    type: DNSManaged
  - lastTransitionTime: "2019-11-06T23:22:04Z"
    message: The record is provisioned in all reported zones.
    reason: NoFailedZones
    status: "True"
    type: DNSReady
  - lastTransitionTime: "2019-11-06T23:21:52Z"
    status: "False"
    type: Degraded
  domain: apps.example.com
  endpointPublishingStrategy:
    loadBalancer:
      scope: Internal
    type: LoadBalancerService
  observedGeneration: 1
  selector: ingresscontroller.operator.openshift.io/deployment-ingresscontroller=internal
  tlsProfile:
    ciphers:
    - TLS_AES_128_GCM_SHA256
    - TLS_AES_256_GCM_SHA384
    - TLS_CHACHA20_POLY1305_SHA256
    - ECDHE-ECDSA-AES128-GCM-SHA256
    - ECDHE-RSA-AES128-GCM-SHA256
    - ECDHE-ECDSA-AES256-GCM-SHA384
    - ECDHE-RSA-AES256-GCM-SHA384
    - ECDHE-ECDSA-CHACHA20-POLY1305
    - ECDHE-RSA-CHACHA20-POLY1305
    - DHE-RSA-AES128-GCM-SHA256
    - DHE-RSA-AES256-GCM-SHA384
    minTLSVersion: VersionTLS12

oc get svc -n openshift-ingress
NAME                       TYPE           CLUSTER-IP      EXTERNAL-IP    PORT(S)                      AGE
router-default             LoadBalancer   172.30.194.93   40.67.186.92   80:30057/TCP,443:32644/TCP   22m
router-internal            LoadBalancer   172.30.66.26    10.0.32.8      80:32035/TCP,443:30927/TCP   22s
router-internal-default    ClusterIP      172.30.237.73   <none>         80/TCP,443/TCP,1936/TCP      22m
router-internal-internal   ClusterIP      172.30.142.93   <none>         80/TCP,443/TCP,1936/TCP      22s
```

The internal ingress is working as expected.
Based on the previous comments, please try the QA again, if possible, by installing a new cluster using the latest 4.3 nightly.

Thank you, Abhinav. Verified with 4.3.0-0.nightly-2019-11-06-184828; the issue has been fixed.

```
$ oc get svc -n openshift-ingress
NAME                       TYPE           CLUSTER-IP       EXTERNAL-IP   PORT(S)                      AGE
router-default             LoadBalancer   172.30.130.232   10.0.32.7     80:31085/TCP,443:31309/TCP   36s
router-internal            LoadBalancer   172.30.171.240   10.0.32.6     80:30424/TCP,443:32364/TCP   7m47s
router-internal-default    ClusterIP      172.30.126.177   <none>        80/TCP,443/TCP,1936/TCP      36s
router-internal-internal   ClusterIP      172.30.251.51    <none>        80/TCP,443/TCP,1936/TCP      7m47s
```

*** Bug 1776672 has been marked as a duplicate of this bug. ***

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2020:0062