Bug 1909006

Summary: [OCP4.7] Installation fails in Azure environment with "Error syncing load balancer: failed to ensure load balancer: invalid ip config ID" errors
Product: OpenShift Container Platform Reporter: Arvind iyengar <aiyengar>
Component: Networking Assignee: aos-network-edge-staff <aos-network-edge-staff>
Networking sub component: router QA Contact: Hongan Li <hongli>
Status: CLOSED DUPLICATE Docs Contact:
Severity: urgent    
Priority: unspecified CC: aos-bugs, sgreene
Version: 4.7 Keywords: TestBlocker
Target Milestone: ---   
Target Release: ---   
Hardware: Unspecified   
OS: Unspecified   
Whiteboard:
Fixed In Version: Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of: Environment:
Last Closed: 2020-12-18 16:35:40 UTC Type: Bug
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: --- RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Embargoed:
Attachments:
Description: Reference must-gather data from one of the failing clusters
Flags: none

Description Arvind iyengar 2020-12-18 07:07:59 UTC
Description of problem:
Installation in an Azure environment fails. The failure reproduces consistently across multiple installation attempts with different nightly images, and the following error is the one most commonly seen in the ingress controller deployment logs:
-----
2020-12-18T01:49:09.371Z        INFO    operator.ingress_controller     controller/controller.go:235    reconciling     {"request": "openshift-ingress-operator/default"}
2020-12-18T01:49:09.568Z        ERROR   operator.ingress_controller     controller/controller.go:235    got retryable error; requeueing {"after": "1m0s", "error": "IngressController is degraded: LoadBalancerReady=False (SyncLoadBalancerFailed: The service-controller component is reporting SyncLoadBalancerFailed events like: Error syncing load balancer: failed to ensure load balancer: invalid ip config ID /subscriptions/53b8f551-f0fc-4bea-8cba-6d1fefd54c8a/resourceGroups/xxia18az-djhkb-rg/providers/Microsoft.Network/networkInterfaces/xxia18az-djhkb-master0-nic/ipConfigurations/pipConfig\nThe kube-controller-manager logs may contain more details.), CanaryChecksSucceeding=False (CanaryChecksRepetitiveFailures: Canary route checks for the default ingress controller are failing)"}
-----

The problem is not seen in other environments such as AWS or GCP.

Version-Release number of selected component (if applicable):
4.7.0-0.nightly-2020-12-17-224915
4.7.0-0.nightly-2020-12-17-201522


How reproducible:
Frequently

Steps to Reproduce:
1. Deploy a cluster in an Azure environment using the latest OCP v4.7 nightly images.

Actual results:
The deployment fails, and the following errors appear in the ingress operator logs:
-----
2020-12-18T04:37:18.108Z        ERROR   operator.ingress_controller     controller/controller.go:235    got retryable error; requeueing {"after": "48.906300155s", "error": "IngressController may become degraded soon: LoadBalancerReady=False, CanaryChecksSucceeding=False"}
2020-12-18T04:38:07.002Z        INFO    operator.ingress_controller     controller/controller.go:235    reconciling     {"request": "openshift-ingress-operator/default"}
2020-12-18T04:38:07.118Z        ERROR   operator.canary_controller      wait/wait.go:155        error performing canary route check     {"error": "error sending canary HTTP request: DNS error: Get \"http://canary-openshift-ingress-canary.apps.hongli-az47.qe.azure.devcluster.openshift.com\": dial tcp: lookup canary-openshift-ingress-canary.apps.hongli-az47.qe.azure.devcluster.openshift.com on 172.30.0.10:53: no such host"}
2020-12-18T04:38:07.150Z        INFO    operator.status_controller      controller/controller.go:235    Reconciling     {"request": "openshift-ingress-operator/default"}

2020-12-18T04:39:07.301Z        ERROR   operator.ingress_controller     controller/controller.go:235    got retryable error; requeueing {"after": "1m0s", "error": "IngressController is degraded: CanaryChecksSucceeding=False (CanaryChecksRepetitiveFailures: Canary route checks for the default ingress controller are failing)"}
2020-12-18T04:40:07.301Z        INFO    operator.ingress_controller     controller/controller.go:235    reconciling     {"request": "openshift-ingress-operator/default"}
2020-12-18T04:40:07.344Z        ERROR   operator.canary_controller      wait/wait.go:155        error performing canary route check     {"error": "error sending canary HTTP request: DNS error: Get \"http://canary-openshift-ingress-canary.apps.hongli-az47.qe.azure.devcluster.openshift.com\": dial tcp: lookup canary-openshift-ingress-canary.apps.hongli-az47.qe.azure.devcluster.openshift.com on 172.30.0.10:53: no such host"}
2020-12-18T04:40:07.462Z        ERROR   operator.ingress_controller     controller/controller.go:235    got retryable error; requeueing {"after": "1m0s", "error": "IngressController is degraded: LoadBalancerReady=False (SyncLoadBalancerFailed: The service-controller component is reporting SyncLoadBalancerFailed events like: Error syncing load balancer: failed to ensure load balancer: invalid ip config ID /subscriptions/53b8f551-f0fc-4bea-8cba-6d1fefd54c8a/resourceGroups/hongli-az47-b4tcb-rg/providers/Microsoft.Network/networkInterfaces/hongli-az47-b4tcb-bootstrap-nic/ipConfigurations/bootstrap-nic-ip-v4\nThe kube-controller-manager logs may contain more details.), CanaryChecksSucceeding=False (CanaryChecksRepetitiveFailures: Canary route checks for the default ingress controller are failing)"}
------
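
For triage, the NIC and IP configuration named in the "invalid ip config ID" error can be pulled out of these log lines programmatically. The following is a minimal Python sketch, assuming the whitespace-separated layout seen in the logs above (timestamp, level, logger, caller, message, then a JSON object). The function names `parse_operator_line` and `invalid_ip_config` are hypothetical helpers written for this illustration, not part of any OpenShift tooling.

```python
import json
import re

# Matches the Azure resource ID quoted in the error and captures the
# NIC name and the IP configuration name from its path segments.
IP_CONFIG_RE = re.compile(
    r"invalid ip config ID \S*/networkInterfaces/([^/\s]+)/ipConfigurations/([^/\s]+)"
)


def parse_operator_line(line):
    """Split one operator log line into its fields (assumed layout).

    Returns None when the line does not have the five expected parts.
    """
    parts = line.split(None, 4)
    if len(parts) < 5:
        return None
    timestamp, level, logger, caller, rest = parts
    brace = rest.find("{")
    if brace == -1:
        message, fields = rest.strip(), {}
    else:
        message = rest[:brace].strip()
        try:
            fields = json.loads(rest[brace:])
        except json.JSONDecodeError:
            # Truncated or wrapped line: keep the message, drop the payload.
            fields = {}
    return {
        "timestamp": timestamp,
        "level": level,
        "logger": logger,
        "caller": caller,
        "message": message,
        "fields": fields,
    }


def invalid_ip_config(entry):
    """Return (nic_name, ip_config_name) if the entry reports the error."""
    match = IP_CONFIG_RE.search(entry["fields"].get("error", ""))
    return (match.group(1), match.group(2)) if match else None
```

Feeding each line of the operator log through `parse_operator_line` and then `invalid_ip_config` lists the interfaces being rejected. In the logs above, these are a master NIC in one cluster and the bootstrap NIC in another.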

Expected results:
The installation should succeed in the Azure environment.

Comment 1 Arvind iyengar 2020-12-18 08:01:20 UTC
Created attachment 1740167
Reference must-gather data from one of the failing clusters

Comment 2 Stephen Greene 2020-12-18 16:35:40 UTC

*** This bug has been marked as a duplicate of bug 1908389 ***