Description of problem:

Installing a private Azure cluster on 4.5, all cluster operators appear to come up except monitoring, and the installation fails after that.

Version-Release number of the following components:

./openshift-install 4.5.0-rc.1
built from commit 5a90620ee0c6316e4137a75d4eea18b21a87fd3f
release image registry.svc.ci.openshift.org/ocp/release@sha256:7ea01a3c4d91f852f480ea40189f1762fcd2e77b8843a0662c471889f0b72028

Also tested with 4.5.0-0.nightly-2020-06-16-045437.

How reproducible:

Steps to Reproduce:
1. Install a private cluster on Azure with currently available 4.5 builds.
2. Example of the install-config.yaml used for the private cluster:

~~~
apiVersion: v1
baseDomain: qe.azure.devcluster.openshift.com
compute:
- architecture: amd64
  hyperthreading: Enabled
  name: worker
  platform: {}
  replicas: 3
controlPlane:
  architecture: amd64
  hyperthreading: Enabled
  name: master
  platform: {}
  replicas: 3
metadata:
  creationTimestamp: null
  name: esstest01
networking:
  clusterNetwork:
  - cidr: 10.128.0.0/14
    hostPrefix: 23
  machineNetwork:
  - cidr: 10.0.0.0/16
  networkType: OpenShiftSDN
  serviceNetwork:
  - 172.30.0.0/16
platform:
  azure:
    baseDomainResourceGroupName: os4-common
    region: centralus
    networkResourceGroupName: esimardpvt
    virtualNetwork: esimard_test_vnet
    controlPlaneSubnet: esimard_test_master_snet
    computeSubnet: esimard_test_worker_snet
publish: Internal
~~~

Actual results:

~~~
level=info msg="Waiting up to 30m0s for the cluster at https://api.qesspvt4502.qe.azure.devcluster.openshift.com:6443 to initialize..."
level=debug msg="Still waiting for the cluster to initialize: Working towards 4.5.0-0.nightly-2020-06-16-045437: 85% complete"
level=debug msg="Still waiting for the cluster to initialize: Working towards 4.5.0-0.nightly-2020-06-16-045437: 85% complete, waiting on authentication, cluster-autoscaler, console, csi-snapshot-controller, image-registry, ingress, kube-storage-version-migrator, machine-api, monitoring"
level=debug msg="Still waiting for the cluster to initialize: Working towards 4.5.0-0.nightly-2020-06-16-045437: 85% complete"
level=debug msg="Still waiting for the cluster to initialize: Working towards 4.5.0-0.nightly-2020-06-16-045437: 86% complete, waiting on authentication, console, csi-snapshot-controller, image-registry, ingress, kube-storage-version-migrator, monitoring"
level=debug msg="Still waiting for the cluster to initialize: Working towards 4.5.0-0.nightly-2020-06-16-045437"
level=debug msg="Still waiting for the cluster to initialize: Working towards 4.5.0-0.nightly-2020-06-16-045437: downloading update"
level=debug msg="Still waiting for the cluster to initialize: Working towards 4.5.0-0.nightly-2020-06-16-045437"
level=debug msg="Still waiting for the cluster to initialize: Working towards 4.5.0-0.nightly-2020-06-16-045437: 4% complete"
level=debug msg="Still waiting for the cluster to initialize: Working towards 4.5.0-0.nightly-2020-06-16-045437: 10% complete"
level=debug msg="Still waiting for the cluster to initialize: Working towards 4.5.0-0.nightly-2020-06-16-045437: 12% complete"
level=debug msg="Still waiting for the cluster to initialize: Working towards 4.5.0-0.nightly-2020-06-16-045437: 86% complete"
level=debug msg="Still waiting for the cluster to initialize: Working towards 4.5.0-0.nightly-2020-06-16-045437: 87% complete"
level=debug msg="Still waiting for the cluster to initialize: Working towards 4.5.0-0.nightly-2020-06-16-045437: 87% complete, waiting on authentication, monitoring"
level=debug msg="Still waiting for the cluster to initialize: Working towards 4.5.0-0.nightly-2020-06-16-045437: 87% complete, waiting on monitoring"
level=debug msg="Still waiting for the cluster to initialize: Working towards 4.5.0-0.nightly-2020-06-16-045437"
level=debug msg="Still waiting for the cluster to initialize: Working towards 4.5.0-0.nightly-2020-06-16-045437: downloading update"
level=debug msg="Still waiting for the cluster to initialize: Working towards 4.5.0-0.nightly-2020-06-16-045437"
level=debug msg="Still waiting for the cluster to initialize: Working towards 4.5.0-0.nightly-2020-06-16-045437: 5% complete"
level=debug msg="Still waiting for the cluster to initialize: Working towards 4.5.0-0.nightly-2020-06-16-045437: 12% complete"
level=debug msg="Still waiting for the cluster to initialize: Working towards 4.5.0-0.nightly-2020-06-16-045437: 87% complete"
level=debug msg="Still waiting for the cluster to initialize: Working towards 4.5.0-0.nightly-2020-06-16-045437: 87% complete, waiting on monitoring"
level=debug msg="Still waiting for the cluster to initialize: Cluster operator monitoring is still updating"
level=info msg="Cluster operator insights Disabled is False with : "
level=info msg="Cluster operator monitoring Available is False with : "
level=info msg="Cluster operator monitoring Progressing is True with RollOutInProgress: Rolling out the stack."
level=error msg="Cluster operator monitoring Degraded is True with UpdatingAlertmanagerFailed: Failed to rollout the stack. Error: running task Updating Alertmanager failed: waiting for Alertmanager object changes failed: waiting for Alertmanager: expected 3 replicas, updated 2 and available 2"
level=fatal msg="failed to initialize the cluster: Cluster operator monitoring is still updating"
~~~

Expected results:

Successful cluster installation.

Additional info:

I could not reproduce this issue with 4.4.*, including 4.4.6.
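For reference, a minimal sketch of the reproduction steps once the install-config.yaml above is in place (the directory name is illustrative):

~~~
# Put the install-config.yaml shown above into an empty install directory.
mkdir esstest01 && cp install-config.yaml esstest01/

# Run the installer against that directory; debug logging produces output like the logs quoted above.
./openshift-install create cluster --dir esstest01 --log-level debug
~~~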
Can you attach the must-gather? Also, can you test the 4.6 nightly? https://openshift-release.svc.ci.openshift.org/#4.6.0-0.nightly

I have a feeling this has something to do with https://github.com/openshift/installer/pull/3440.
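For completeness, a rough sketch of collecting the requested must-gather from the failed cluster (the destination directory and archive name are illustrative):

~~~
export KUBECONFIG=<installation-dir>/auth/kubeconfig

# Collect the must-gather and package it for attachment to the bug.
oc adm must-gather --dest-dir=./must-gather
tar czf must-gather.tar.gz must-gather
~~~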
Attached the must-gather, and I am currently testing the latest 4.6 nightly.
I see similar issues with 4.6:

./openshift-install 4.6.0-0.nightly-2020-06-16-214732
built from commit 4e46d0a347533263903beb3349a33f53eee7a6c2
release image registry.svc.ci.openshift.org/ocp/release@sha256:b36eccf60e3a3cedc1208d5049ba552b311b27c3ccc0eb20ed01ab0815a68b01

~~~
level=debug msg="Still waiting for the cluster to initialize: Working towards 4.6.0-0.nightly-2020-06-16-214732: 85% complete, waiting on authentication, cluster-autoscaler, console, csi-snapshot-controller, image-registry, ingress, kube-storage-version-migrator, machine-api, monitoring"
level=debug msg="Still waiting for the cluster to initialize: Working towards 4.6.0-0.nightly-2020-06-16-214732: 85% complete"
level=debug msg="Still waiting for the cluster to initialize: Working towards 4.6.0-0.nightly-2020-06-16-214732: 86% complete, waiting on authentication, console, csi-snapshot-controller, image-registry, ingress, kube-storage-version-migrator, monitoring"
level=debug msg="Still waiting for the cluster to initialize: Some cluster operators are still updating: authentication, console, csi-snapshot-controller, image-registry, ingress, kube-storage-version-migrator, monitoring"
level=debug msg="Still waiting for the cluster to initialize: Some cluster operators are still updating: authentication, console, csi-snapshot-controller, image-registry, ingress, kube-storage-version-migrator, monitoring"
level=debug msg="Still waiting for the cluster to initialize: Working towards 4.6.0-0.nightly-2020-06-16-214732"
level=debug msg="Still waiting for the cluster to initialize: Working towards 4.6.0-0.nightly-2020-06-16-214732: downloading update"
level=debug msg="Still waiting for the cluster to initialize: Working towards 4.6.0-0.nightly-2020-06-16-214732"
level=debug msg="Still waiting for the cluster to initialize: Working towards 4.6.0-0.nightly-2020-06-16-214732: 9% complete"
level=debug msg="Still waiting for the cluster to initialize: Working towards 4.6.0-0.nightly-2020-06-16-214732: 11% complete"
level=debug msg="Still waiting for the cluster to initialize: Working towards 4.6.0-0.nightly-2020-06-16-214732: 13% complete"
level=debug msg="Still waiting for the cluster to initialize: Working towards 4.6.0-0.nightly-2020-06-16-214732: 85% complete"
level=debug msg="Still waiting for the cluster to initialize: Working towards 4.6.0-0.nightly-2020-06-16-214732: 86% complete, waiting on authentication, console, csi-snapshot-controller, image-registry, ingress, kube-storage-version-migrator, monitoring"
level=debug msg="Still waiting for the cluster to initialize: Some cluster operators are still updating: authentication, console, csi-snapshot-controller, image-registry, ingress, kube-storage-version-migrator, monitoring"
level=error msg="Cluster operator authentication Degraded is True with IngressStateEndpoints_MissingEndpoints::RouteStatus_FailedHost: IngressStateEndpointsDegraded: No endpoints found for oauth-server\nRouteStatusDegraded: route is not available at canonical host oauth-openshift.apps.qesspvt4602.qe.azure.devcluster.openshift.com: []"
level=info msg="Cluster operator authentication Progressing is Unknown with NoData: "
level=info msg="Cluster operator authentication Available is Unknown with NoData: "
level=info msg="Cluster operator console Progressing is True with DefaultRouteSync_FailedAdmitDefaultRoute::OAuthClientSync_FailedHost: DefaultRouteSyncProgressing: route \"console\" is not available at canonical host []\nOAuthClientSyncProgressing: route \"console\" is not available at canonical host []"
level=info msg="Cluster operator console Available is Unknown with NoData: "
level=info msg="Cluster operator image-registry Available is False with NoReplicasAvailable: Available: The deployment does not have available replicas\nImagePrunerAvailable: Pruner CronJob has been created"
level=info msg="Cluster operator image-registry Progressing is True with DeploymentNotCompleted: Progressing: The deployment has not completed"
level=info msg="Cluster operator ingress Available is False with IngressUnavailable: Not all ingress controllers are available."
level=info msg="Cluster operator ingress Progressing is True with Reconciling: Not all ingress controllers are available."
level=error msg="Cluster operator ingress Degraded is True with IngressControllersDegraded: Some ingresscontrollers are degraded: default"
level=info msg="Cluster operator insights Disabled is False with AsExpected: "
level=info msg="Cluster operator kube-storage-version-migrator Available is False with _NoMigratorPod: Available: deployment/migrator.openshift-kube-storage-version-migrator: no replicas are available"
level=error msg="Cluster operator monitoring Degraded is True with UpdatingAlertmanagerFailed: Failed to rollout the stack. Error: running task Updating Alertmanager failed: waiting for Alertmanager Route to become ready failed: waiting for route openshift-monitoring/alertmanager-main: no status available"
level=info msg="Cluster operator monitoring Available is False with : "
level=info msg="Cluster operator monitoring Progressing is True with RollOutInProgress: Rolling out the stack."
level=info msg="Cluster operator network Progressing is True with Deploying: DaemonSet \"openshift-multus/network-metrics-daemon\" is waiting for other operators to become ready"
level=fatal msg="failed to initialize the cluster: Some cluster operators are still updating: authentication, console, csi-snapshot-controller, image-registry, ingress, kube-storage-version-migrator, monitoring"
~~~
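Since several operators are stuck waiting on routes and the ingress load balancer, one way to dig further (a sketch; on these releases the in-tree Azure cloud provider runs inside the kube-controller-manager, and the pod name below is illustrative) is to look at the service load balancer sync events and logs:

~~~
# Events for services in openshift-config-managed show load balancer sync failures.
oc get events -n openshift-config-managed

# Grep the kube-controller-manager logs for cloud load balancer reconciliation errors.
oc get pods -n openshift-kube-controller-manager
oc logs -n openshift-kube-controller-manager kube-controller-manager-<master-node> \
  -c kube-controller-manager | grep -i loadbalancer
~~~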
Etienne:
> I see similar issues with 4.6:

There were known problems with Azure on master (4.6) that caused the compute nodes not to be created; that was fixed by https://github.com/openshift/machine-api-operator/pull/616

I have a PR testing internal clusters on master: https://deck-ci.apps.ci.l2s4.p1.openshiftapps.com/view/gcs/origin-ci-test/pr-logs/pull/openshift_release/9553/rehearse-9553-pull-ci-openshift-installer-master-e2e-azure-internal/1273681333717045248

Let's see how that goes so we can narrow down where the bug is.
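To distinguish the known 4.6 compute-provisioning problem from the load balancer issue seen on 4.5, a quick sketch of checking whether the worker machines were actually created (the deployment and container names are assumptions based on the standard machine-api layout):

~~~
# Machines should reach the Running phase and each should have a corresponding Node.
oc get machines -n openshift-machine-api -o wide
oc get nodes

# If workers are missing, the machine controller logs usually show why provisioning failed.
oc logs -n openshift-machine-api deployment/machine-api-controllers -c machine-controller
~~~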
> http://file.rdu.redhat.com/~esimard/bz1848106/esstest03.tar.gz

```
2020-06-17T01:48:36.741071573Z E0617 01:48:36.741045 1 azure_loadbalancer.go:159] reconcileLoadBalancer(openshift-config-managed/outbound-provider) failed: Retriable: false, RetryAfter: 0s, HTTPStatusCode: 400, RawError: Retriable: false, RetryAfter: 0s, HTTPStatusCode: 400, RawError: {
2020-06-17T01:48:36.741071573Z   "error": {
2020-06-17T01:48:36.741071573Z     "code": "RulesUseSameBackendPortProtocolAndPool",
2020-06-17T01:48:36.741071573Z     "message": "Load balancing rules /subscriptions/53b8f551-f0fc-4bea-8cba-6d1fefd54c8a/resourceGroups/esstest03-2hn2f-rg/providers/Microsoft.Network/loadBalancers/esstest03-2hn2f/loadBalancingRules/internal_outbound_rule_v4 and /subscriptions/53b8f551-f0fc-4bea-8cba-6d1fefd54c8a/resourceGroups/esstest03-2hn2f-rg/providers/Microsoft.Network/loadBalancers/esstest03-2hn2f/loadBalancingRules/a180c6ccc20b4415cacb9eb31376fffb-TCP-27627 with floating IP disabled use the same protocol Tcp and backend port 27627, and must not be used with the same backend address pool /subscriptions/53b8f551-f0fc-4bea-8cba-6d1fefd54c8a/resourceGroups/esstest03-2hn2f-rg/providers/Microsoft.Network/loadBalancers/esstest03-2hn2f/backendAddressPools/esstest03-2hn2f.",
2020-06-17T01:48:36.741071573Z     "details": []
2020-06-17T01:48:36.741071573Z   }
2020-06-17T01:48:36.741071573Z }
2020-06-17T01:48:36.741152075Z E0617 01:48:36.741096 1 controller.go:244] error processing service openshift-config-managed/outbound-provider (will retry): failed to ensure load balancer: Retriable: false, RetryAfter: 0s, HTTPStatusCode: 400, RawError: Retriable: false, RetryAfter: 0s, HTTPStatusCode: 400, RawError: {
2020-06-17T01:48:36.741152075Z   "error": {
2020-06-17T01:48:36.741152075Z     "code": "RulesUseSameBackendPortProtocolAndPool",
2020-06-17T01:48:36.741152075Z     "message": "Load balancing rules /subscriptions/53b8f551-f0fc-4bea-8cba-6d1fefd54c8a/resourceGroups/esstest03-2hn2f-rg/providers/Microsoft.Network/loadBalancers/esstest03-2hn2f/loadBalancingRules/internal_outbound_rule_v4 and /subscriptions/53b8f551-f0fc-4bea-8cba-6d1fefd54c8a/resourceGroups/esstest03-2hn2f-rg/providers/Microsoft.Network/loadBalancers/esstest03-2hn2f/loadBalancingRules/a180c6ccc20b4415cacb9eb31376fffb-TCP-27627 with floating IP disabled use the same protocol Tcp and backend port 27627, and must not be used with the same backend address pool /subscriptions/53b8f551-f0fc-4bea-8cba-6d1fefd54c8a/resourceGroups/esstest03-2hn2f-rg/providers/Microsoft.Network/loadBalancers/esstest03-2hn2f/backendAddressPools/esstest03-2hn2f.",
2020-06-17T01:48:36.741152075Z     "details": []
2020-06-17T01:48:36.741152075Z   }
2020-06-17T01:48:36.741152075Z }
2020-06-17T01:48:36.741305781Z I0617 01:48:36.741241 1 event.go:278] Event(v1.ObjectReference{Kind:"Service", Namespace:"openshift-config-managed", Name:"outbound-provider", UID:"180c6ccc-20b4-415c-acb9-eb31376fffbf", APIVersion:"v1", ResourceVersion:"9340", FieldPath:""}): type: 'Warning' reason: 'SyncLoadBalancerFailed' Error syncing load balancer: failed to ensure load balancer: Retriable: false, RetryAfter: 0s, HTTPStatusCode: 400, RawError: Retriable: false, RetryAfter: 0s, HTTPStatusCode: 400, RawError: {
2020-06-17T01:48:36.741305781Z   "error": {
2020-06-17T01:48:36.741305781Z     "code": "RulesUseSameBackendPortProtocolAndPool",
2020-06-17T01:48:36.741305781Z     "message": "Load balancing rules /subscriptions/53b8f551-f0fc-4bea-8cba-6d1fefd54c8a/resourceGroups/esstest03-2hn2f-rg/providers/Microsoft.Network/loadBalancers/esstest03-2hn2f/loadBalancingRules/internal_outbound_rule_v4 and /subscriptions/53b8f551-f0fc-4bea-8cba-6d1fefd54c8a/resourceGroups/esstest03-2hn2f-rg/providers/Microsoft.Network/loadBalancers/esstest03-2hn2f/loadBalancingRules/a180c6ccc20b4415cacb9eb31376fffb-TCP-27627 with floating IP disabled use the same protocol Tcp and backend port 27627, and must not be used with the same backend address pool /subscriptions/53b8f551-f0fc-4bea-8cba-6d1fefd54c8a/resourceGroups/esstest03-2hn2f-rg/providers/Microsoft.Network/loadBalancers/esstest03-2hn2f/backendAddressPools/esstest03-2hn2f.",
2020-06-17T01:48:36.741305781Z     "details": []
2020-06-17T01:48:36.741305781Z   }
2020-06-17T01:48:36.741305781Z }
```

So this makes it clear that this bug comes from the change in https://github.com/openshift/installer/pull/3440: we must not have two rules for the same port, one created by the installer and the other created by the Kubernetes Service of type LoadBalancer.

I think we can:
- either use two different ports, which should be easy, or
- use the master changes that switch to outbound rules for egress.
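To confirm the conflict from the Azure side, the rules on the cluster load balancer can be listed directly; a sketch using the Azure CLI with the resource group and load balancer names taken from the error above:

~~~
# Both internal_outbound_rule_v4 (created by the installer) and
# a180c6ccc20b4415cacb9eb31376fffb-TCP-27627 (created by the Kubernetes service controller)
# show up here with the same protocol, backend port, and backend address pool.
az network lb rule list \
  --resource-group esstest03-2hn2f-rg \
  --lb-name esstest03-2hn2f \
  --output table

# Outbound rules (the approach used on master for egress) are listed separately.
az network lb outbound-rule list \
  --resource-group esstest03-2hn2f-rg \
  --lb-name esstest03-2hn2f \
  --output table
~~~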
Verified with: 4.5.0-0.nightly-2020-06-22-193506

I was able to provision a private cluster on Azure with that build, without errors. No other issues were detected during basic health checks.

~~~
oc get nodes
NAME                                          STATUS   ROLES    AGE   VERSION
qesimardpvt45-6kqtv-master-0                  Ready    master   49m   v1.18.3+1b98519
qesimardpvt45-6kqtv-master-1                  Ready    master   49m   v1.18.3+1b98519
qesimardpvt45-6kqtv-master-2                  Ready    master   49m   v1.18.3+1b98519
qesimardpvt45-6kqtv-worker-centralus1-jx8xb   Ready    worker   34m   v1.18.3+1b98519
qesimardpvt45-6kqtv-worker-centralus2-28sr6   Ready    worker   34m   v1.18.3+1b98519
qesimardpvt45-6kqtv-worker-centralus3-wwr9w   Ready    worker   35m   v1.18.3+1b98519

oc get co
NAME                                       VERSION                             AVAILABLE   PROGRESSING   DEGRADED   SINCE
authentication                             4.5.0-0.nightly-2020-06-22-193506   True        False         False      25m
cloud-credential                           4.5.0-0.nightly-2020-06-22-193506   True        False         False      52m
cluster-autoscaler                         4.5.0-0.nightly-2020-06-22-193506   True        False         False      41m
config-operator                            4.5.0-0.nightly-2020-06-22-193506   True        False         False      41m
console                                    4.5.0-0.nightly-2020-06-22-193506   True        False         False      27m
csi-snapshot-controller                    4.5.0-0.nightly-2020-06-22-193506   True        False         False      32m
dns                                        4.5.0-0.nightly-2020-06-22-193506   True        False         False      47m
etcd                                       4.5.0-0.nightly-2020-06-22-193506   True        False         False      46m
image-registry                             4.5.0-0.nightly-2020-06-22-193506   True        False         False      32m
ingress                                    4.5.0-0.nightly-2020-06-22-193506   True        False         False      32m
insights                                   4.5.0-0.nightly-2020-06-22-193506   True        False         False      42m
kube-apiserver                             4.5.0-0.nightly-2020-06-22-193506   True        False         False      46m
kube-controller-manager                    4.5.0-0.nightly-2020-06-22-193506   True        False         False      46m
kube-scheduler                             4.5.0-0.nightly-2020-06-22-193506   True        False         False      46m
kube-storage-version-migrator              4.5.0-0.nightly-2020-06-22-193506   True        False         False      32m
machine-api                                4.5.0-0.nightly-2020-06-22-193506   True        False         False      39m
machine-approver                           4.5.0-0.nightly-2020-06-22-193506   True        False         False      46m
machine-config                             4.5.0-0.nightly-2020-06-22-193506   True        False         False      39m
marketplace                                4.5.0-0.nightly-2020-06-22-193506   True        False         False      42m
monitoring                                 4.5.0-0.nightly-2020-06-22-193506   True        False         False      15m
network                                    4.5.0-0.nightly-2020-06-22-193506   True        False         False      48m
node-tuning                                4.5.0-0.nightly-2020-06-22-193506   True        False         False      48m
openshift-apiserver                        4.5.0-0.nightly-2020-06-22-193506   True        False         False      43m
openshift-controller-manager               4.5.0-0.nightly-2020-06-22-193506   True        False         False      42m
openshift-samples                          4.5.0-0.nightly-2020-06-22-193506   True        False         False      41m
operator-lifecycle-manager                 4.5.0-0.nightly-2020-06-22-193506   True        False         False      47m
operator-lifecycle-manager-catalog         4.5.0-0.nightly-2020-06-22-193506   True        False         False      47m
operator-lifecycle-manager-packageserver   4.5.0-0.nightly-2020-06-22-193506   True        False         False      43m
service-ca                                 4.5.0-0.nightly-2020-06-22-193506   True        False         False      48m
storage                                    4.5.0-0.nightly-2020-06-22-193506   True        False         False      42m
~~~
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA.

For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2020:2409