Bug 1848106 - [Azure][4.5] Private cluster installation fails with Cluster operator monitoring is still updating
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Installer
Version: 4.5
Hardware: Unspecified
OS: Unspecified
Priority: high
Severity: high
Target Milestone: ---
Target Release: 4.5.0
Assignee: Abhinav Dahiya
QA Contact: Etienne Simard
URL:
Whiteboard:
Depends On: 1848781
Blocks:
 
Reported: 2020-06-17 17:13 UTC by Etienne Simard
Modified: 2020-07-13 17:44 UTC
CC List: 2 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Cloned As: 1848781
Environment:
Last Closed: 2020-07-13 17:44:18 UTC
Target Upstream Version:
Embargoed:




Links:
- GitHub: openshift/installer pull 3775 (closed): Bug 1848106: data/azure: use outbound_rule instead of dummy inbound lb_rule (last updated 2020-12-04 15:48:21 UTC)
- Red Hat Product Errata: RHBA-2020:2409 (last updated 2020-07-13 17:44:40 UTC)

Description Etienne Simard 2020-06-17 17:13:10 UTC
Description of problem: When installing a private Azure cluster on 4.5, all cluster operators appear to come up except monitoring, and the installation fails as a result.

Version-Release number of the following components:

./openshift-install 4.5.0-rc.1
built from commit 5a90620ee0c6316e4137a75d4eea18b21a87fd3f
release image registry.svc.ci.openshift.org/ocp/release@sha256:7ea01a3c4d91f852f480ea40189f1762fcd2e77b8843a0662c471889f0b72028

Also reproduced with 4.5.0-0.nightly-2020-06-16-045437.

How reproducible:

Steps to Reproduce:
1. Install a private cluster on Azure with currently available 4.5 builds
2. Example of install-config.yaml used for the private cluster:

~~~
apiVersion: v1
baseDomain: qe.azure.devcluster.openshift.com
compute:
- architecture: amd64
  hyperthreading: Enabled
  name: worker
  platform: {}
  replicas: 3
controlPlane:
  architecture: amd64
  hyperthreading: Enabled
  name: master
  platform: {}
  replicas: 3
metadata:
  creationTimestamp: null
  name: esstest01
networking:
  clusterNetwork:
  - cidr: 10.128.0.0/14
    hostPrefix: 23
  machineNetwork:
  - cidr: 10.0.0.0/16
  networkType: OpenShiftSDN
  serviceNetwork:
  - 172.30.0.0/16
platform:
  azure:
    baseDomainResourceGroupName: os4-common
    region: centralus
    networkResourceGroupName: esimardpvt
    virtualNetwork: esimard_test_vnet
    controlPlaneSubnet: esimard_test_master_snet
    computeSubnet: esimard_test_worker_snet
publish: Internal
~~~

Actual results:

~~~
level=info msg="Waiting up to 30m0s for the cluster at https://api.qesspvt4502.qe.azure.devcluster.openshift.com:6443 to initialize..."
level=debug msg="Still waiting for the cluster to initialize: Working towards 4.5.0-0.nightly-2020-06-16-045437: 85% complete"
level=debug msg="Still waiting for the cluster to initialize: Working towards 4.5.0-0.nightly-2020-06-16-045437: 85% complete, waiting on authentication, cluster-autoscaler, console, csi-snapshot-controller, image-registry, ingress, kube-storage-version-migrator, machine-api, monitoring"
level=debug msg="Still waiting for the cluster to initialize: Working towards 4.5.0-0.nightly-2020-06-16-045437: 85% complete"
level=debug msg="Still waiting for the cluster to initialize: Working towards 4.5.0-0.nightly-2020-06-16-045437: 86% complete, waiting on authentication, console, csi-snapshot-controller, image-registry, ingress, kube-storage-version-migrator, monitoring"
level=debug msg="Still waiting for the cluster to initialize: Working towards 4.5.0-0.nightly-2020-06-16-045437"
level=debug msg="Still waiting for the cluster to initialize: Working towards 4.5.0-0.nightly-2020-06-16-045437: downloading update"
level=debug msg="Still waiting for the cluster to initialize: Working towards 4.5.0-0.nightly-2020-06-16-045437"
level=debug msg="Still waiting for the cluster to initialize: Working towards 4.5.0-0.nightly-2020-06-16-045437: 4% complete"
level=debug msg="Still waiting for the cluster to initialize: Working towards 4.5.0-0.nightly-2020-06-16-045437: 10% complete"
level=debug msg="Still waiting for the cluster to initialize: Working towards 4.5.0-0.nightly-2020-06-16-045437: 12% complete"
level=debug msg="Still waiting for the cluster to initialize: Working towards 4.5.0-0.nightly-2020-06-16-045437: 86% complete"
level=debug msg="Still waiting for the cluster to initialize: Working towards 4.5.0-0.nightly-2020-06-16-045437: 87% complete"
level=debug msg="Still waiting for the cluster to initialize: Working towards 4.5.0-0.nightly-2020-06-16-045437: 87% complete, waiting on authentication, monitoring"
level=debug msg="Still waiting for the cluster to initialize: Working towards 4.5.0-0.nightly-2020-06-16-045437: 87% complete, waiting on monitoring"
level=debug msg="Still waiting for the cluster to initialize: Working towards 4.5.0-0.nightly-2020-06-16-045437"
level=debug msg="Still waiting for the cluster to initialize: Working towards 4.5.0-0.nightly-2020-06-16-045437: downloading update"
level=debug msg="Still waiting for the cluster to initialize: Working towards 4.5.0-0.nightly-2020-06-16-045437"
level=debug msg="Still waiting for the cluster to initialize: Working towards 4.5.0-0.nightly-2020-06-16-045437: 5% complete"
level=debug msg="Still waiting for the cluster to initialize: Working towards 4.5.0-0.nightly-2020-06-16-045437: 12% complete"
level=debug msg="Still waiting for the cluster to initialize: Working towards 4.5.0-0.nightly-2020-06-16-045437: 87% complete"
level=debug msg="Still waiting for the cluster to initialize: Working towards 4.5.0-0.nightly-2020-06-16-045437: 87% complete, waiting on monitoring"
level=debug msg="Still waiting for the cluster to initialize: Cluster operator monitoring is still updating"
level=info msg="Cluster operator insights Disabled is False with : "
level=info msg="Cluster operator monitoring Available is False with : "
level=info msg="Cluster operator monitoring Progressing is True with RollOutInProgress: Rolling out the stack."
level=error msg="Cluster operator monitoring Degraded is True with UpdatingAlertmanagerFailed: Failed to rollout the stack. Error: running task Updating Alertmanager failed: waiting for Alertmanager object changes failed: waiting for Alertmanager: expected 3 replicas, updated 2 and available 2"
level=fatal msg="failed to initialize the cluster: Cluster operator monitoring is still updating"
~~~

Expected results:

Successful cluster installation. 

Additional info:

I could not reproduce this issue with 4.4.*, including 4.4.6.

Comment 1 Abhinav Dahiya 2020-06-17 17:20:50 UTC
Can you attach the must-gather?
Also, can you test the 4.6 nightly? https://openshift-release.svc.ci.openshift.org/#4.6.0-0.nightly

I have a feeling this has something to do with https://github.com/openshift/installer/pull/3440
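
For reference, collecting and packaging a must-gather typically looks like this (the destination path here is just an example):

```
oc adm must-gather --dest-dir=/tmp/must-gather
tar czf must-gather.tar.gz -C /tmp must-gather
```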

Comment 3 Etienne Simard 2020-06-17 17:34:48 UTC
I attached the must-gather and am currently testing the latest 4.6 nightly.

Comment 4 Etienne Simard 2020-06-17 18:38:02 UTC
I see similar issues with 4.6:


./openshift-install 4.6.0-0.nightly-2020-06-16-214732
built from commit 4e46d0a347533263903beb3349a33f53eee7a6c2
release image registry.svc.ci.openshift.org/ocp/release@sha256:b36eccf60e3a3cedc1208d5049ba552b311b27c3ccc0eb20ed01ab0815a68b01

~~~
level=debug msg="Still waiting for the cluster to initialize: Working towards 4.6.0-0.nightly-2020-06-16-214732: 85% complete, waiting on authentication, cluster-autoscaler, console, csi-snapshot-controller, image-registry, ingress, kube-storage-version-migrator, machine-api, monitoring"
level=debug msg="Still waiting for the cluster to initialize: Working towards 4.6.0-0.nightly-2020-06-16-214732: 85% complete"
level=debug msg="Still waiting for the cluster to initialize: Working towards 4.6.0-0.nightly-2020-06-16-214732: 86% complete, waiting on authentication, console, csi-snapshot-controller, image-registry, ingress, kube-storage-version-migrator, monitoring"
level=debug msg="Still waiting for the cluster to initialize: Some cluster operators are still updating: authentication, console, csi-snapshot-controller, image-registry, ingress, kube-storage-version-migrator, monitoring"
level=debug msg="Still waiting for the cluster to initialize: Some cluster operators are still updating: authentication, console, csi-snapshot-controller, image-registry, ingress, kube-storage-version-migrator, monitoring"
level=debug msg="Still waiting for the cluster to initialize: Working towards 4.6.0-0.nightly-2020-06-16-214732"
level=debug msg="Still waiting for the cluster to initialize: Working towards 4.6.0-0.nightly-2020-06-16-214732: downloading update"
level=debug msg="Still waiting for the cluster to initialize: Working towards 4.6.0-0.nightly-2020-06-16-214732"
level=debug msg="Still waiting for the cluster to initialize: Working towards 4.6.0-0.nightly-2020-06-16-214732: 9% complete"
level=debug msg="Still waiting for the cluster to initialize: Working towards 4.6.0-0.nightly-2020-06-16-214732: 11% complete"
level=debug msg="Still waiting for the cluster to initialize: Working towards 4.6.0-0.nightly-2020-06-16-214732: 13% complete"
level=debug msg="Still waiting for the cluster to initialize: Working towards 4.6.0-0.nightly-2020-06-16-214732: 85% complete"
level=debug msg="Still waiting for the cluster to initialize: Working towards 4.6.0-0.nightly-2020-06-16-214732: 86% complete, waiting on authentication, console, csi-snapshot-controller, image-registry, ingress, kube-storage-version-migrator, monitoring"
level=debug msg="Still waiting for the cluster to initialize: Some cluster operators are still updating: authentication, console, csi-snapshot-controller, image-registry, ingress, kube-storage-version-migrator, monitoring"
level=error msg="Cluster operator authentication Degraded is True with IngressStateEndpoints_MissingEndpoints::RouteStatus_FailedHost: IngressStateEndpointsDegraded: No endpoints found for oauth-server\nRouteStatusDegraded: route is not available at canonical host oauth-openshift.apps.qesspvt4602.qe.azure.devcluster.openshift.com: []"
level=info msg="Cluster operator authentication Progressing is Unknown with NoData: "
level=info msg="Cluster operator authentication Available is Unknown with NoData: "
level=info msg="Cluster operator console Progressing is True with DefaultRouteSync_FailedAdmitDefaultRoute::OAuthClientSync_FailedHost: DefaultRouteSyncProgressing: route \"console\" is not available at canonical host []\nOAuthClientSyncProgressing: route \"console\" is not available at canonical host []"
level=info msg="Cluster operator console Available is Unknown with NoData: "
level=info msg="Cluster operator image-registry Available is False with NoReplicasAvailable: Available: The deployment does not have available replicas\nImagePrunerAvailable: Pruner CronJob has been created"
level=info msg="Cluster operator image-registry Progressing is True with DeploymentNotCompleted: Progressing: The deployment has not completed"
level=info msg="Cluster operator ingress Available is False with IngressUnavailable: Not all ingress controllers are available."
level=info msg="Cluster operator ingress Progressing is True with Reconciling: Not all ingress controllers are available."
level=error msg="Cluster operator ingress Degraded is True with IngressControllersDegraded: Some ingresscontrollers are degraded: default"
level=info msg="Cluster operator insights Disabled is False with AsExpected: "
level=info msg="Cluster operator kube-storage-version-migrator Available is False with _NoMigratorPod: Available: deployment/migrator.openshift-kube-storage-version-migrator: no replicas are available"
level=error msg="Cluster operator monitoring Degraded is True with UpdatingAlertmanagerFailed: Failed to rollout the stack. Error: running task Updating Alertmanager failed: waiting for Alertmanager Route to become ready failed: waiting for route openshift-monitoring/alertmanager-main: no status available"
level=info msg="Cluster operator monitoring Available is False with : "
level=info msg="Cluster operator monitoring Progressing is True with RollOutInProgress: Rolling out the stack."
level=info msg="Cluster operator network Progressing is True with Deploying: DaemonSet \"openshift-multus/network-metrics-daemon\" is waiting for other operators to become ready"
level=fatal msg="failed to initialize the cluster: Some cluster operators are still updating: authentication, console, csi-snapshot-controller, image-registry, ingress, kube-storage-version-migrator, monitoring"
~~~

Comment 6 Abhinav Dahiya 2020-06-18 18:36:59 UTC
Etienne,
> I see similar issues with 4.6:

There were known problems with Azure on master (4.6) that caused the compute nodes not to be created; that was fixed by https://github.com/openshift/machine-api-operator/pull/616

I have a PR testing an internal install against master: https://deck-ci.apps.ci.l2s4.p1.openshiftapps.com/view/gcs/origin-ci-test/pr-logs/pull/openshift_release/9553/rehearse-9553-pull-ci-openshift-installer-master-e2e-azure-internal/1273681333717045248
Let's see how that goes so we can narrow down where the bug is.

Comment 7 Abhinav Dahiya 2020-06-18 18:56:03 UTC
> http://file.rdu.redhat.com/~esimard/bz1848106/esstest03.tar.gz

```
2020-06-17T01:48:36.741071573Z E0617 01:48:36.741045       1 azure_loadbalancer.go:159] reconcileLoadBalancer(openshift-config-managed/outbound-provider) failed: Retriable: false, RetryAfter: 0s, HTTPStatusCode: 400, RawError: Retriable: false, RetryAfter: 0s, HTTPStatusCode: 400, RawError: {
2020-06-17T01:48:36.741071573Z   "error": {
2020-06-17T01:48:36.741071573Z     "code": "RulesUseSameBackendPortProtocolAndPool",
2020-06-17T01:48:36.741071573Z     "message": "Load balancing rules /subscriptions/53b8f551-f0fc-4bea-8cba-6d1fefd54c8a/resourceGroups/esstest03-2hn2f-rg/providers/Microsoft.Network/loadBalancers/esstest03-2hn2f/loadBalancingRules/internal_outbound_rule_v4 and /subscriptions/53b8f551-f0fc-4bea-8cba-6d1fefd54c8a/resourceGroups/esstest03-2hn2f-rg/providers/Microsoft.Network/loadBalancers/esstest03-2hn2f/loadBalancingRules/a180c6ccc20b4415cacb9eb31376fffb-TCP-27627 with floating IP disabled use the same protocol Tcp and backend port 27627, and must not be used with the same backend address pool /subscriptions/53b8f551-f0fc-4bea-8cba-6d1fefd54c8a/resourceGroups/esstest03-2hn2f-rg/providers/Microsoft.Network/loadBalancers/esstest03-2hn2f/backendAddressPools/esstest03-2hn2f.",
2020-06-17T01:48:36.741071573Z     "details": []
2020-06-17T01:48:36.741071573Z   }
2020-06-17T01:48:36.741071573Z }
2020-06-17T01:48:36.741152075Z E0617 01:48:36.741096       1 controller.go:244] error processing service openshift-config-managed/outbound-provider (will retry): failed to ensure load balancer: Retriable: false, RetryAfter: 0s, HTTPStatusCode: 400, RawError: Retriable: false, RetryAfter: 0s, HTTPStatusCode: 400, RawError: {
2020-06-17T01:48:36.741152075Z   "error": {
2020-06-17T01:48:36.741152075Z     "code": "RulesUseSameBackendPortProtocolAndPool",
2020-06-17T01:48:36.741152075Z     "message": "Load balancing rules /subscriptions/53b8f551-f0fc-4bea-8cba-6d1fefd54c8a/resourceGroups/esstest03-2hn2f-rg/providers/Microsoft.Network/loadBalancers/esstest03-2hn2f/loadBalancingRules/internal_outbound_rule_v4 and /subscriptions/53b8f551-f0fc-4bea-8cba-6d1fefd54c8a/resourceGroups/esstest03-2hn2f-rg/providers/Microsoft.Network/loadBalancers/esstest03-2hn2f/loadBalancingRules/a180c6ccc20b4415cacb9eb31376fffb-TCP-27627 with floating IP disabled use the same protocol Tcp and backend port 27627, and must not be used with the same backend address pool /subscriptions/53b8f551-f0fc-4bea-8cba-6d1fefd54c8a/resourceGroups/esstest03-2hn2f-rg/providers/Microsoft.Network/loadBalancers/esstest03-2hn2f/backendAddressPools/esstest03-2hn2f.",
2020-06-17T01:48:36.741152075Z     "details": []
2020-06-17T01:48:36.741152075Z   }
2020-06-17T01:48:36.741152075Z }
2020-06-17T01:48:36.741305781Z I0617 01:48:36.741241       1 event.go:278] Event(v1.ObjectReference{Kind:"Service", Namespace:"openshift-config-managed", Name:"outbound-provider", UID:"180c6ccc-20b4-415c-acb9-eb31376fffbf", APIVersion:"v1", ResourceVersion:"9340", FieldPath:""}): type: 'Warning' reason: 'SyncLoadBalancerFailed' Error syncing load balancer: failed to ensure load balancer: Retriable: false, RetryAfter: 0s, HTTPStatusCode: 400, RawError: Retriable: false, RetryAfter: 0s, HTTPStatusCode: 400, RawError: {
2020-06-17T01:48:36.741305781Z   "error": {
2020-06-17T01:48:36.741305781Z     "code": "RulesUseSameBackendPortProtocolAndPool",
2020-06-17T01:48:36.741305781Z     "message": "Load balancing rules /subscriptions/53b8f551-f0fc-4bea-8cba-6d1fefd54c8a/resourceGroups/esstest03-2hn2f-rg/providers/Microsoft.Network/loadBalancers/esstest03-2hn2f/loadBalancingRules/internal_outbound_rule_v4 and /subscriptions/53b8f551-f0fc-4bea-8cba-6d1fefd54c8a/resourceGroups/esstest03-2hn2f-rg/providers/Microsoft.Network/loadBalancers/esstest03-2hn2f/loadBalancingRules/a180c6ccc20b4415cacb9eb31376fffb-TCP-27627 with floating IP disabled use the same protocol Tcp and backend port 27627, and must not be used with the same backend address pool /subscriptions/53b8f551-f0fc-4bea-8cba-6d1fefd54c8a/resourceGroups/esstest03-2hn2f-rg/providers/Microsoft.Network/loadBalancers/esstest03-2hn2f/backendAddressPools/esstest03-2hn2f.",
2020-06-17T01:48:36.741305781Z     "details": []
2020-06-17T01:48:36.741305781Z   }
2020-06-17T01:48:36.741305781Z }
```

So this makes it clear that this bug comes from the change in https://github.com/openshift/installer/pull/3440: we end up with two load-balancing rules for the same backend port, protocol, and backend pool, one created by the installer and the other created by the Kubernetes service of type LoadBalancer.

So I think we can:
- either use two different ports, which should be easy,
- or pull in the master changes that use outbound rules for egress (sketched below).
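
To illustrate the second option, here is a minimal Terraform sketch of the difference, loosely modeled on the installer's data/azure templates and assuming a recent azurerm provider; the resource labels, the referenced load balancer, and the frontend IP configuration name are illustrative, not the installer's actual identifiers (the port value is the one from the log above):

```
# Before (4.5): a dummy inbound load-balancing rule whose only purpose is to
# give the backend pool SNAT egress. It claims a concrete protocol/port/pool
# tuple, so a rule that kube-controller-manager later creates for a
# LoadBalancer service can collide with it
# (RulesUseSameBackendPortProtocolAndPool).
resource "azurerm_lb_rule" "internal_outbound" {
  name                           = "internal_outbound_rule_v4"
  loadbalancer_id                = azurerm_lb.public.id
  protocol                       = "Tcp"
  frontend_port                  = 27627
  backend_port                   = 27627
  frontend_ip_configuration_name = "public-lb-ip"
  backend_address_pool_ids       = [azurerm_lb_backend_address_pool.cluster.id]
}

# After (master): a real outbound rule. It provides egress without occupying
# any inbound protocol/port, so it cannot conflict with service-created rules.
resource "azurerm_lb_outbound_rule" "internal_outbound" {
  name                    = "internal_outbound_rule_v4"
  loadbalancer_id         = azurerm_lb.public.id
  protocol                = "All"
  backend_address_pool_id = azurerm_lb_backend_address_pool.cluster.id

  frontend_ip_configuration {
    name = "public-lb-ip"
  }
}
```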

Comment 10 Etienne Simard 2020-06-22 22:26:18 UTC
Verified with: 4.5.0-0.nightly-2020-06-22-193506

I was able to provision a private cluster on Azure with that build, without errors. No other issues were detected during basic health checks.

~~~

oc get nodes
NAME                                          STATUS   ROLES    AGE   VERSION
qesimardpvt45-6kqtv-master-0                  Ready    master   49m   v1.18.3+1b98519
qesimardpvt45-6kqtv-master-1                  Ready    master   49m   v1.18.3+1b98519
qesimardpvt45-6kqtv-master-2                  Ready    master   49m   v1.18.3+1b98519
qesimardpvt45-6kqtv-worker-centralus1-jx8xb   Ready    worker   34m   v1.18.3+1b98519
qesimardpvt45-6kqtv-worker-centralus2-28sr6   Ready    worker   34m   v1.18.3+1b98519
qesimardpvt45-6kqtv-worker-centralus3-wwr9w   Ready    worker   35m   v1.18.3+1b98519

oc get co
NAME                                       VERSION                             AVAILABLE   PROGRESSING   DEGRADED   SINCE
authentication                             4.5.0-0.nightly-2020-06-22-193506   True        False         False      25m
cloud-credential                           4.5.0-0.nightly-2020-06-22-193506   True        False         False      52m
cluster-autoscaler                         4.5.0-0.nightly-2020-06-22-193506   True        False         False      41m
config-operator                            4.5.0-0.nightly-2020-06-22-193506   True        False         False      41m
console                                    4.5.0-0.nightly-2020-06-22-193506   True        False         False      27m
csi-snapshot-controller                    4.5.0-0.nightly-2020-06-22-193506   True        False         False      32m
dns                                        4.5.0-0.nightly-2020-06-22-193506   True        False         False      47m
etcd                                       4.5.0-0.nightly-2020-06-22-193506   True        False         False      46m
image-registry                             4.5.0-0.nightly-2020-06-22-193506   True        False         False      32m
ingress                                    4.5.0-0.nightly-2020-06-22-193506   True        False         False      32m
insights                                   4.5.0-0.nightly-2020-06-22-193506   True        False         False      42m
kube-apiserver                             4.5.0-0.nightly-2020-06-22-193506   True        False         False      46m
kube-controller-manager                    4.5.0-0.nightly-2020-06-22-193506   True        False         False      46m
kube-scheduler                             4.5.0-0.nightly-2020-06-22-193506   True        False         False      46m
kube-storage-version-migrator              4.5.0-0.nightly-2020-06-22-193506   True        False         False      32m
machine-api                                4.5.0-0.nightly-2020-06-22-193506   True        False         False      39m
machine-approver                           4.5.0-0.nightly-2020-06-22-193506   True        False         False      46m
machine-config                             4.5.0-0.nightly-2020-06-22-193506   True        False         False      39m
marketplace                                4.5.0-0.nightly-2020-06-22-193506   True        False         False      42m
monitoring                                 4.5.0-0.nightly-2020-06-22-193506   True        False         False      15m
network                                    4.5.0-0.nightly-2020-06-22-193506   True        False         False      48m
node-tuning                                4.5.0-0.nightly-2020-06-22-193506   True        False         False      48m
openshift-apiserver                        4.5.0-0.nightly-2020-06-22-193506   True        False         False      43m
openshift-controller-manager               4.5.0-0.nightly-2020-06-22-193506   True        False         False      42m
openshift-samples                          4.5.0-0.nightly-2020-06-22-193506   True        False         False      41m
operator-lifecycle-manager                 4.5.0-0.nightly-2020-06-22-193506   True        False         False      47m
operator-lifecycle-manager-catalog         4.5.0-0.nightly-2020-06-22-193506   True        False         False      47m
operator-lifecycle-manager-packageserver   4.5.0-0.nightly-2020-06-22-193506   True        False         False      43m
service-ca                                 4.5.0-0.nightly-2020-06-22-193506   True        False         False      48m
storage                                    4.5.0-0.nightly-2020-06-22-193506   True        False         False      42m
~~~

Comment 11 errata-xmlrpc 2020-07-13 17:44:18 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2020:2409

