Bug 1782516 - Rate limiting on Azure
Summary: Rate limiting on Azure
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Cloud Compute
Version: 4.3.0
Hardware: Unspecified
OS: Unspecified
Priority: high
Severity: medium
Target Milestone: ---
Target Release: 4.5.0
Assignee: Alberto
QA Contact: Jianwei Hou
URL:
Whiteboard:
Depends On:
Blocks: 1810446 1826069
Reported: 2019-12-11 18:50 UTC by sumehta
Modified: 2023-10-06 18:53 UTC (History)

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Clones: 1810446 1826069
Environment:
Last Closed: 2020-07-13 17:12:38 UTC
Target Upstream Version:
Embargoed:


Attachments
kube-controller-manager must-gather log (615.30 KB, text/plain)
2019-12-11 18:50 UTC, sumehta


Links
System ID Private Priority Status Summary Last Updated
Github openshift installer pull 3259 0 None closed BUG 1782516: Disable client side rate limiting in Azure. 2021-02-10 05:04:36 UTC
Red Hat Knowledge Base (Solution) 4861541 0 None None None 2020-05-19 07:50:41 UTC
Red Hat Product Errata RHBA-2020:2409 0 None None None 2020-07-13 17:13:05 UTC

Description sumehta 2019-12-11 18:50:03 UTC
Created attachment 1644095
kube-controller-manager must-gather log

Description of problem:
When bringing up a 4.3 cluster with networkType set to OVNKubernetes, I see Azure API calls being rate limited. The Azure activity logs give little information beyond an error thrown by the 'Create or Update Load Balancer' API call, but the OpenShift logs show the same rate-limiting error many times.

Version-Release number of selected component (if applicable):
openshift-client-linux-4.3.0-0.nightly-2019-12-09-035405

How reproducible:
Always

Steps to Reproduce:
1. Create an OpenShift cluster in Azure with the latest 4.3 nightly and networkType set to 'OVNKubernetes'
2. Look at the Activity logs in Azure and at the events in the OpenShift console

Actual results:
OpenShift Console events show repeatedly:

Service router-default, Namespace openshift-ingress
3 minutes ago
Generated from service-controller
46 times in the last hour
Error updating load balancer with new hosts map[sumehta-winc8-2tk76-master-0:{} sumehta-winc8-2tk76-master-1:{} sumehta-winc8-2tk76-master-2:{} sumehta-winc8-2tk76-worker-centralus1-mmjkr:{} sumehta-winc8-2tk76-worker-centralus2-qptsj:{} sumehta-winc8-2tk76-worker-centralus3-pq5sk:{} winnode:{}]: ensure(openshift-ingress/router-default): backendPoolID(/subscriptions/5f675811-04fa-483f-9709-ffd8a9da03f0/resourceGroups/sumehta-winc8-2tk76-rg/providers/Microsoft.Network/loadBalancers/sumehta-winc8-2tk76/backendAddressPools/sumehta-winc8-2tk76) - failed to ensure host in pool: "azure - cloud provider rate limited(read) for operation:NicGet"

In must-gather logs, this error was present in namespaces/openshift-kube-controller-manager/pods/kube-controller-manager-sumehta-winc8-2tk76-master-1/kube-controller-manager-6/kube-controller-manager-6/logs/current.log, where I could see multiple occurrences of the following:

2019-12-09T18:40:48.5418863Z E1209 18:40:48.541874       1 azure_standard.go:714] error: az.EnsureHostInPool(sumehta-winc8-2tk76-master-0), az.vmSet.GetPrimaryInterface.Get(sumehta-winc8-2tk76-master-0, ), err=azure - cloud provider rate limited(read) for operation:NicGet
2019-12-09T18:40:48.5760855Z I1209 18:40:48.576020       1 azure_backoff.go:80] GetVirtualMachineWithRetry(sumehta-winc8-2tk76-master-1): backoff success
2019-12-09T18:40:48.5761176Z E1209 18:40:48.576101       1 azure_standard.go:714] error: az.EnsureHostInPool(sumehta-winc8-2tk76-master-1), az.vmSet.GetPrimaryInterface.Get(sumehta-winc8-2tk76-master-1, ), err=azure - cloud provider rate limited(read) for operation:NicGet
2019-12-09T18:40:48.5762061Z E1209 18:40:48.576176       1 service_controller.go:255] error processing service openshift-ingress/router-default (will retry): failed to ensure load balancer: ensure(openshift-ingress/router-default): backendPoolID(/subscriptions/5f675811-04fa-483f-9709-ffd8a9da03f0/resourceGroups/sumehta-winc8-2tk76-rg/providers/Microsoft.Network/loadBalancers/sumehta-winc8-2tk76/backendAddressPools/sumehta-winc8-2tk76) - failed to ensure host in pool: "azure - cloud provider rate limited(read) for operation:NicGet"
2019-12-09T18:40:48.5764134Z I1209 18:40:48.576376       1 event.go:255] Event(v1.ObjectReference{Kind:"Service", Namespace:"openshift-ingress", Name:"router-default", UID:"9147d958-b25a-46d3-89c2-f96390d5eb12", APIVersion:"v1", ResourceVersion:"9338", FieldPath:""}): type: 'Warning' reason: 'SyncLoadBalancerFailed' Error syncing load balancer: failed to ensure load balancer: ensure(openshift-ingress/router-default): backendPoolID(/subscriptions/5f675811-04fa-483f-9709-ffd8a9da03f0/resourceGroups/sumehta-winc8-2tk76-rg/providers/Microsoft.Network/loadBalancers/sumehta-winc8-2tk76/backendAddressPools/sumehta-winc8-2tk76) - failed to ensure host in pool: "azure - cloud provider rate limited(read) for operation:NicGet"

Expected results:
Load balancer sync/update should not throw any errors.

Additional info:
The must-gather logs mentioned above are added as an attachment.
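
To gauge how often each Azure operation is being throttled, the rate-limit messages in a kube-controller-manager log can be tallied. The following is a minimal sketch, not part of the original report: the regular expression is an assumption based only on the message text quoted above ("cloud provider rate limited(read) for operation:NicGet") and may need adjusting for other releases, and the function name is hypothetical.

```python
import re
from collections import Counter

# Assumed message shape, taken from the log lines quoted in this report.
RATE_LIMIT_RE = re.compile(
    r"cloud provider rate limited\((read|write)\) for operation:(\w+)"
)


def count_rate_limited(lines):
    """Count rate-limited Azure operations, keyed by (kind, operation)."""
    counts = Counter()
    for line in lines:
        for kind, operation in RATE_LIMIT_RE.findall(line):
            counts[(kind, operation)] += 1
    return counts


# Two lines copied from the excerpt above (timestamps shortened).
sample = [
    "E1209 18:40:48.541874       1 azure_standard.go:714] error: "
    "az.EnsureHostInPool(sumehta-winc8-2tk76-master-0), "
    "az.vmSet.GetPrimaryInterface.Get(sumehta-winc8-2tk76-master-0, ), "
    "err=azure - cloud provider rate limited(read) for operation:NicGet",
    "I1209 18:40:48.576020       1 azure_backoff.go:80] "
    "GetVirtualMachineWithRetry(sumehta-winc8-2tk76-master-1): backoff success",
]

for (kind, operation), n in count_rate_limited(sample).items():
    print(f"{operation} ({kind}): {n}")
```

Running this over the full current.log from the must-gather attachment (for example with `count_rate_limited(open(path))`) would show whether NicGet is the only throttled operation or one of several.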

Comment 1 Alberto 2019-12-17 14:25:06 UTC
That seems to be caused by a limitation in the cloud provider (https://github.com/Azure/aks-engine/issues/420). I'd probably expect the config to support limits for each Azure resource type (e.g. VMSS, LoadBalancer and RouteTable).

Note that this configuration is editable by users in-cluster (https://github.com/openshift/installer/blob/789b53e43085ce4e5459ede8d88561c737c2809a/pkg/asset/manifests/azure/cloudproviderconfig.go#L47-L60); presumably increasing CloudProviderRateLimitQPS might help.
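
As an illustration of the kind of change discussed here, the sketch below edits the rate-limit fields of a cloud provider config held as JSON text. It is hedged throughout: the key names (cloudProviderRateLimit, cloudProviderRateLimitQPS, cloudProviderRateLimitBucket) are assumed to match the JSON form of the upstream Azure cloud provider config, the starting values are placeholders rather than the installer's actual defaults, and the helper name is hypothetical. Where the config lives in the cluster and how it is applied are not covered.

```python
import json


def tune_rate_limit(config_text, qps=None, bucket=None, enabled=True):
    """Return config_text with the client-side rate-limit fields adjusted.

    Key names are assumptions about the Azure cloud provider config format.
    """
    config = json.loads(config_text)
    config["cloudProviderRateLimit"] = enabled
    if qps is not None:
        config["cloudProviderRateLimitQPS"] = qps
    if bucket is not None:
        config["cloudProviderRateLimitBucket"] = bucket
    return json.dumps(config, indent=2, sort_keys=True)


# Placeholder starting values, not taken from the installer source.
original = json.dumps(
    {
        "cloudProviderRateLimit": True,
        "cloudProviderRateLimitQPS": 6,
        "cloudProviderRateLimitBucket": 10,
    }
)

# Option 1: raise the client-side limits, as suggested in this comment.
raised = tune_rate_limit(original, qps=20, bucket=40)

# Option 2: turn client-side rate limiting off entirely.
disabled = tune_rate_limit(original, enabled=False)

print(raised)
print(disabled)
```

The second option corresponds to the approach named in the title of the linked installer pull request 3259 ("Disable client side rate limiting in Azure"); the first is the workaround this comment proposes.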

Comment 24 Daneyon Hansen 2020-04-20 19:39:26 UTC
According to [1], this BZ exists in 4.4 and 4.3.

[1] https://search.svc.ci.openshift.org/?search=Error+updating+load+balancer+with+new+hosts&maxAge=168h&context=2&type=bug%2Bjunit

Comment 25 Daneyon Hansen 2020-04-20 19:42:23 UTC
Changing to High due to https://bugzilla.redhat.com/show_bug.cgi?id=1782516#c24

Comment 26 Daniel Messer 2020-04-28 12:23:41 UTC
This is also impacting the Quay Setup Operator trying to create Loadbalancers on Azure: https://issues.redhat.com/browse/PROJQUAY-638?focusedCommentId=14060390&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-14060390

Comment 31 errata-xmlrpc 2020-07-13 17:12:38 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2020:2409
