Description of problem:

Fresh IPI 4.3.1 install on Azure shows the following event in the openshift-ingress namespace:

Error syncing load balancer: failed to ensure load balancer: ensure(openshift-ingress/router-default): backendPoolID(/subscriptions/d0050fd8-4253-465a-bbbc-83fab9da5e49/resourceGroups/node-gllzw-rg/providers/Microsoft.Network/loadBalancers/node-gllzw/backendAddressPools/node-gllzw) - failed to ensure host in pool: "azure - cloud provider rate limited(read) for operation:NicGet"

This appears to have no consequences; however, it is very easy to reproduce, and the message is alarming to customers.

The (in-progress) fix seems to be editing the machine configs to set CloudProviderRateLimit to 'false':

# oc edit machineconfig -o yaml 01-master-kubelet
# oc edit machineconfig -o yaml 01-worker-kubelet

This then overwrites the line in /etc/kubernetes/cloud.conf that is allegedly causing this message.

Version-Release number of selected component (if applicable):
4.3.1, Azure IPI

How reproducible:
Very

Steps to Reproduce:
1. Install 4.3.1 on Azure IPI
2. Run "oc get events -n openshift-ingress", or wait for the events to populate in the main console events feed

Actual results:
Rate limiting message

Expected results:
No rate limiting message

Additional info:
A must-gather of my testing environment and/or the customer's environment is available on request.
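For context, the rate limiting knobs live in the Azure cloud provider config at /etc/kubernetes/cloud.conf. A minimal sketch of the relevant fragment, assuming the field names from the upstream Azure cloud provider (the numeric values here are purely illustrative, not what a given install ships with):

```json
{
  "cloudProviderRateLimit": false,
  "cloudProviderRateLimitQPS": 6,
  "cloudProviderRateLimitBucket": 10
}
```

With "cloudProviderRateLimit" set to false, the QPS/bucket fields are ignored and the cloud provider stops throttling its own ARM API calls, which is why disabling it makes the event disappear (at the cost of losing protection against Azure-side throttling).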
We've seen this from the k8s cloud provider since day one with Azure, so it's not a blocking regression, and is always transient. Disabling rate limiting as a fix seems suspicious, though. Moving to 4.5. There may be an old bug of which this is a duplicate.
Pedro,

The ingress operator periodically ensures the default ingresscontroller is present. You should be able to recreate the default-wildcard DNSRecord by deleting ingresscontroller/default:

$ oc delete ingresscontroller/default -n openshift-ingress-operator
ingresscontroller.operator.openshift.io "default" deleted

Wait a minute.

$ oc get ingresscontroller/default -n openshift-ingress-operator
NAME      AGE
default   2m20s

$ oc get dnsrecord/default-wildcard -n openshift-ingress-operator -o json | jq ".status.zones[0].conditions[0]"
{
  "lastTransitionTime": "2020-04-30T19:28:15Z",
  "message": "The DNS provider succeeded in ensuring the record",
  "reason": "ProviderSuccess",
  "status": "False",
  "type": "Failed"
}
Pedro,

After looking into this issue further, here's the proper way to fix your issue:

# Remove the finalizer:
$ oc patch -n openshift-ingress-operator dnsrecord/default-wildcard --patch '{"metadata":{"finalizers": []}}' --type=merge
dnsrecord.ingress.operator.openshift.io/default-wildcard patched

# Delete the stuck dnsrecord:
$ oc delete dnsrecord/default-wildcard -n openshift-ingress-operator
dnsrecord.ingress.operator.openshift.io "default-wildcard" deleted

# Verify the record has been recreated:
$ oc get dnsrecord/default-wildcard -n openshift-ingress-operator
NAME               AGE
default-wildcard   4s

# Verify the status of co/ingress:
$ oc get co/ingress
NAME      VERSION                             AVAILABLE   PROGRESSING   DEGRADED   SINCE
ingress   4.5.0-0.nightly-2020-04-03-194832   True        False         False      6h1m

A dnsrecord is only created during ingresscontroller creation [0].

[0] https://github.com/openshift/api/blob/master/operator/v1/types_ingress.go#L268-L274
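As a side note, the reason a plain delete hangs is a finalizer on the dnsrecord. Before blanking the finalizer list, you can inspect what is holding deletion; this is just one way to print it and assumes nothing beyond standard oc jsonpath output:

```shell
# List any finalizers on the stuck dnsrecord (run against your cluster)
$ oc get dnsrecord/default-wildcard -n openshift-ingress-operator \
    -o jsonpath='{.metadata.finalizers}'
```

If this prints a non-empty list while the object has a deletionTimestamp set, the object is waiting on a controller to remove its finalizer, which is exactly the state the patch above clears manually.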
Thanks Daneyon, we'll try that and get back to you ASAP to confirm the workaround. Best Regards.
Hi again Daneyon, please also note that we have found the related BZ#1782516, which already contains a PR [1] that appears to simply deactivate "CloudProviderRateLimit", right? Please note that once the patch is successfully merged into 4.5, we'll need a proper 4.3 backport, thanks.

NOTE: I will also link our case with that BZ; we can continue there if you prefer.

[1] https://github.com/openshift/installer/pull/3259

Best Regards.
[UPDATE] Please disregard the 4.3 backport comment, I can see that BZ#1826073 is already in place for that, thanks.
*** Bug 1837324 has been marked as a duplicate of this bug. ***
Verified with 4.5.0-0.nightly-2020-05-27-202943; the issue has been fixed. I didn't see any "rate limited" events or the problem described in https://bugzilla.redhat.com/show_bug.cgi?id=1837324.
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHBA-2020:2409
The needinfo request[s] on this closed bug have been removed as they have been unresolved for 120 days