Bug 1908389

Summary: Loadbalancer Sync failing on Azure
Product: OpenShift Container Platform Reporter: Fabian von Feilitzsch <fabian>
Component: Networking Assignee: Stephen Greene <sgreene>
Networking sub component: router QA Contact: Hongan Li <hongli>
Status: CLOSED ERRATA Docs Contact:
Severity: urgent    
Priority: urgent CC: aiyengar, aos-bugs, aravindh, dhansen, esimard, fdeutsch, ffranz, hongli, htariq, jhou, jluhrsen, jspeed, mfojtik, mgugino, mstaeble, sdodson, sgreene, sjenning, wking
Version: 4.7 Keywords: TestBlocker
Target Milestone: ---   
Target Release: 4.7.0   
Hardware: Unspecified   
OS: Unspecified   
Whiteboard:
Fixed In Version: Doc Type: No Doc Update
Doc Text:
Story Points: ---
Clone Of: Environment:
operator conditions authentication
operator conditions console
operator conditions ingress
operator install authentication
operator install console
operator install ingress
Last Closed: 2021-02-24 15:45:23 UTC Type: Bug
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: --- RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Embargoed:

Description Fabian von Feilitzsch 2020-12-16 15:19:44 UTC
Description of problem:
Ingress failing on Azure with 'SyncLoadBalancerFailed'
Azure cluster setup fails because ingress is broken. KCM reports:

level=error msg=Cluster operator ingress Degraded is True with IngressControllersDegraded: Some ingresscontrollers are degraded: ingresscontroller "default" is degraded: DegradedConditions: One or more other status conditions indicate a degraded state: LoadBalancerReady=False (SyncLoadBalancerFailed: The service-controller component is reporting SyncLoadBalancerFailed events like: Error syncing load balancer: failed to ensure load balancer: invalid ip config ID /subscriptions/d38f1e38-4bed-438e-b227-833f997adf6a/resourceGroups/ci-op-xsr7hy3v-9b656-xrfd2-rg/providers/Microsoft.Network/networkInterfaces/ci-op-xsr7hy3v-9b656-xrfd2-master0-nic/ipConfigurations/pipConfig
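
For anyone triaging similar errors: the ID in the message is an ordinary Azure resource path, and the NIC name is the segment that follows networkInterfaces. Below is a minimal shell sketch for pulling that name out of a logged ID. The function name nic_from_ipconfig_id is hypothetical, and the snippet only illustrates the layout of the ID; it does not reproduce the cloud provider's own parsing.

```shell
# Hypothetical helper: print the NIC name embedded in an Azure
# ipConfigurations resource ID passed as the first argument.
nic_from_ipconfig_id() {
  # Split the ID on "/" and print the segment right after "networkInterfaces".
  printf '%s\n' "$1" | awk -F/ '{
    for (i = 1; i < NF; i++)
      if ($i == "networkInterfaces") { print $(i + 1); exit }
  }'
}

# Example with the ID from the error above:
nic_from_ipconfig_id "/subscriptions/d38f1e38-4bed-438e-b227-833f997adf6a/resourceGroups/ci-op-xsr7hy3v-9b656-xrfd2-rg/providers/Microsoft.Network/networkInterfaces/ci-op-xsr7hy3v-9b656-xrfd2-master0-nic/ipConfigurations/pipConfig"
```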


Version-Release number of selected component (if applicable):
4.7

How reproducible:
100%

Additional info:
Example failing job: https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/release-openshift-origin-installer-e2e-azure-4.7/1339191619219361792

https://search.ci.openshift.org/?search=failed+to+ensure+load+balancer%3A+invalid+ip+config+ID&maxAge=336h&context=1&type=bug%2Bjunit&name=azure&maxMatches=5&maxBytes=20971520&groupBy=job

First appeared shortly after the 1.20 rebase: https://github.com/openshift/kubernetes/pull/471#event-4110268165

Comment 1 Maciej Szulik 2020-12-16 15:26:26 UTC
Sending this over to the network team, who own the ingress operator, to identify what is missing and needs updating after the move to k8s 1.20.

Comment 2 aaleman 2020-12-16 16:10:22 UTC
*** Bug 1908052 has been marked as a duplicate of this bug. ***

Comment 6 Michael Gugino 2020-12-17 15:44:46 UTC
We created an issue upstream: https://github.com/kubernetes/enhancements/pull/1116

The person who introduced the breaking change has assigned themselves. The timetable is unclear, so we might want to land a patch downstream first, with an upstream fix hopefully in the works.

Comment 12 W. Trevor King 2020-12-18 01:15:57 UTC
I've filed [1] upstream with the Availability Set issue.

[1]: https://github.com/kubernetes/kubernetes/issues/97375

Comment 13 Stephen Greene 2020-12-18 16:35:39 UTC
*** Bug 1909006 has been marked as a duplicate of this bug. ***

Comment 15 Scott Dodson 2021-01-05 14:16:29 UTC
*** Bug 1908489 has been marked as a duplicate of this bug. ***

Comment 16 Haseeb Tariq 2021-01-06 22:30:47 UTC
Commenting for the benefit of build watchers and Sippy, to link this BZ to the following tests, which are currently failing because cluster installation fails in Azure environments:
- operator conditions authentication
- operator conditions console
- operator conditions ingress
- operator install authentication
- operator install console
- operator install ingress

Latest failure: https://prow.ci.openshift.org/view/gcs/origin-ci-test/logs/release-openshift-ocp-installer-e2e-azure-4.7/1346875212452335616

Comment 18 Hongan Li 2021-01-07 12:03:56 UTC
Verified with 4.7.0-0.nightly-2021-01-07-080803; passed.

# oc -n openshift-ingress get svc
NAME                      TYPE           CLUSTER-IP       EXTERNAL-IP     PORT(S)                      AGE
router-default            LoadBalancer   172.30.104.234   52.252.144.92   80:32233/TCP,443:32292/TCP   36m
router-internal-default   ClusterIP      172.30.208.255   <none>          80/TCP,443/TCP,1936/TCP      36m

# oc get co/ingress
NAME      VERSION                             AVAILABLE   PROGRESSING   DEGRADED   SINCE
ingress   4.7.0-0.nightly-2021-01-07-080803   True        False         False      30m

Creating a custom ingresscontroller also works:
# oc -n openshift-ingress get svc
NAME                      TYPE           CLUSTER-IP       EXTERNAL-IP     PORT(S)                      AGE
router-default            LoadBalancer   172.30.104.234   52.252.144.92   80:32233/TCP,443:32292/TCP   38m
router-internal-default   ClusterIP      172.30.208.255   <none>          80/TCP,443/TCP,1936/TCP      38m
router-internal-test      ClusterIP      172.30.211.21    <none>          80/TCP,443/TCP,1936/TCP      53s
router-test               LoadBalancer   172.30.211.65    10.0.32.7       80:30966/TCP,443:32636/TCP   53s
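
The check above can be scripted. This is a minimal sketch, assuming the default column layout of `oc get svc` shown here; the helper name pending_lb_services is hypothetical. It prints any LoadBalancer service whose EXTERNAL-IP is still <pending>, so empty output means every load balancer received an address.

```shell
# Hypothetical helper: read `oc get svc` table output on stdin and print
# the names of LoadBalancer services that have no external IP yet.
pending_lb_services() {
  # Skip the header row; column 2 is TYPE and column 4 is EXTERNAL-IP.
  awk 'NR > 1 && $2 == "LoadBalancer" && $4 == "<pending>" { print $1 }'
}

# Usage against a live cluster (requires a logged-in oc client):
#   oc -n openshift-ingress get svc | pending_lb_services
```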

Comment 21 errata-xmlrpc 2021-02-24 15:45:23 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Moderate: OpenShift Container Platform 4.7.0 security, bug fix, and enhancement update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2020:5633