Created attachment 1644095: kube-controller-manager must-gather log

Description of problem:
On bringing up a 4.3 cluster with networkType set to OVNKubernetes, I can see rate limiting on Azure API calls. The Azure activity logs give little information beyond an error thrown by the API call for 'Create or Update Load Balancer', but the OpenShift logs show the error below repeatedly.

Version-Release number of selected component (if applicable):
openshift-client-linux-4.3.0-0.nightly-2019-12-09-035405

How reproducible:
Always

Steps to Reproduce:
1. Create an OpenShift cluster with the latest 4.3 nightly in Azure, with networkType set to 'OVNKubernetes'
2. Look at the Activity logs in Azure, as well as the events on the OpenShift console

Actual results:
OpenShift Console events repeatedly show:

Service router-default, Namespace openshift-ingress
3 minutes ago
Generated from service-controller
46 times in the last hour
Error updating load balancer with new hosts map[sumehta-winc8-2tk76-master-0:{} sumehta-winc8-2tk76-master-1:{} sumehta-winc8-2tk76-master-2:{} sumehta-winc8-2tk76-worker-centralus1-mmjkr:{} sumehta-winc8-2tk76-worker-centralus2-qptsj:{} sumehta-winc8-2tk76-worker-centralus3-pq5sk:{} winnode:{}]: ensure(openshift-ingress/router-default): backendPoolID(/subscriptions/5f675811-04fa-483f-9709-ffd8a9da03f0/resourceGroups/sumehta-winc8-2tk76-rg/providers/Microsoft.Network/loadBalancers/sumehta-winc8-2tk76/backendAddressPools/sumehta-winc8-2tk76) - failed to ensure host in pool: "azure - cloud provider rate limited(read) for operation:NicGet"

In the must-gather logs, this error appears in namespaces/openshift-kube-controller-manager/pods/kube-controller-manager-sumehta-winc8-2tk76-master-1/kube-controller-manager-6/kube-controller-manager-6/logs/current.log, which contains multiple occurrences of the following:

2019-12-09T18:40:48.5418863Z E1209 18:40:48.541874 1 azure_standard.go:714] error: az.EnsureHostInPool(sumehta-winc8-2tk76-master-0), az.vmSet.GetPrimaryInterface.Get(sumehta-winc8-2tk76-master-0, ), err=azure - cloud provider rate limited(read) for operation:NicGet
2019-12-09T18:40:48.5760855Z I1209 18:40:48.576020 1 azure_backoff.go:80] GetVirtualMachineWithRetry(sumehta-winc8-2tk76-master-1): backoff success
2019-12-09T18:40:48.5761176Z E1209 18:40:48.576101 1 azure_standard.go:714] error: az.EnsureHostInPool(sumehta-winc8-2tk76-master-1), az.vmSet.GetPrimaryInterface.Get(sumehta-winc8-2tk76-master-1, ), err=azure - cloud provider rate limited(read) for operation:NicGet
2019-12-09T18:40:48.5762061Z E1209 18:40:48.576176 1 service_controller.go:255] error processing service openshift-ingress/router-default (will retry): failed to ensure load balancer: ensure(openshift-ingress/router-default): backendPoolID(/subscriptions/5f675811-04fa-483f-9709-ffd8a9da03f0/resourceGroups/sumehta-winc8-2tk76-rg/providers/Microsoft.Network/loadBalancers/sumehta-winc8-2tk76/backendAddressPools/sumehta-winc8-2tk76) - failed to ensure host in pool: "azure - cloud provider rate limited(read) for operation:NicGet"
2019-12-09T18:40:48.5764134Z I1209 18:40:48.576376 1 event.go:255] Event(v1.ObjectReference{Kind:"Service", Namespace:"openshift-ingress", Name:"router-default", UID:"9147d958-b25a-46d3-89c2-f96390d5eb12", APIVersion:"v1", ResourceVersion:"9338", FieldPath:""}): type: 'Warning' reason: 'SyncLoadBalancerFailed' Error syncing load balancer: failed to ensure load balancer: ensure(openshift-ingress/router-default): backendPoolID(/subscriptions/5f675811-04fa-483f-9709-ffd8a9da03f0/resourceGroups/sumehta-winc8-2tk76-rg/providers/Microsoft.Network/loadBalancers/sumehta-winc8-2tk76/backendAddressPools/sumehta-winc8-2tk76) - failed to ensure host in pool: "azure - cloud provider rate limited(read) for operation:NicGet"

Expected results:
Load balancer sync/update should not throw any errors.

Additional info:
The must-gather logs mentioned above are added as an attachment.
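
For readers unfamiliar with this failure mode: the "rate limited(read)" text in the log above looks like a client-side rejection rather than an HTTP 429 from Azure, although the report itself does not confirm that. The Go sketch below is a simplified, hypothetical token bucket (the type, the function names, and the numbers 1 QPS / bucket of 5 are all assumptions for illustration, not the cloud provider's code) showing how a burst of NIC reads, one per node, can exhaust a small bucket.

```go
package main

import "fmt"

// tokenBucket is a hypothetical, simplified stand-in for a client-side
// rate limiter. It holds up to `capacity` tokens and refills at `qps`
// tokens per second. Time is passed in explicitly so the example is
// deterministic.
type tokenBucket struct {
	qps      float64
	capacity float64
	tokens   float64
	last     float64 // timestamp of the last refill, in seconds
}

func newTokenBucket(qps, capacity float64) *tokenBucket {
	return &tokenBucket{qps: qps, capacity: capacity, tokens: capacity}
}

// tryAccept refills the bucket for the elapsed time, then takes one token
// if one is available. It never blocks; a false return models the
// "rate limited" error seen in the log.
func (b *tokenBucket) tryAccept(now float64) bool {
	b.tokens += (now - b.last) * b.qps
	if b.tokens > b.capacity {
		b.tokens = b.capacity
	}
	b.last = now
	if b.tokens < 1 {
		return false
	}
	b.tokens--
	return true
}

// countRejected issues n back-to-back reads at time zero and reports
// how many of them are rejected.
func countRejected(b *tokenBucket, n int) int {
	rejected := 0
	for i := 0; i < n; i++ {
		if !b.tryAccept(0) {
			rejected++
		}
	}
	return rejected
}

func main() {
	// Assumed numbers: 7 hosts (as in the event above), bucket of 5, 1 QPS.
	b := newTokenBucket(1, 5)
	fmt.Println("rejected:", countRejected(b, 7))
}
```

With these assumed numbers, the first five reads drain the bucket and the remaining two are rejected immediately. If the same bucket is shared by reads for other resource types, the pool update may keep failing on every retry.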
That seems to be caused by a limitation in the cloud provider: https://github.com/Azure/aks-engine/issues/420. I'd expect the config to support separate limits for each Azure resource type (e.g. VMSS, LoadBalancer and RouteTable). Note that this configuration is editable by users in-cluster (https://github.com/openshift/installer/blob/789b53e43085ce4e5459ede8d88561c737c2809a/pkg/asset/manifests/azure/cloudproviderconfig.go#L47-L60); presumably increasing CloudProviderRateLimitQPS might help.
According to [1], this bug occurs in both 4.3 and 4.4.

[1] https://search.svc.ci.openshift.org/?search=Error+updating+load+balancer+with+new+hosts&maxAge=168h&context=2&type=bug%2Bjunit
Changing to High due to https://bugzilla.redhat.com/show_bug.cgi?id=1782516#c24
This is also affecting the Quay Setup Operator when it tries to create load balancers on Azure: https://issues.redhat.com/browse/PROJQUAY-638?focusedCommentId=14060390&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-14060390
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHBA-2020:2409