1782516 – Rate limiting on Azure

Keywords:
Status:	CLOSED ERRATA
Alias:	None
Product:	OpenShift Container Platform
Classification:	Red Hat
Component:	Cloud Compute
Sub Component:
Version:	4.3.0
Hardware:	Unspecified
OS:	Unspecified
Priority:	high
Severity:	medium
Target Milestone:	---
Target Release:	4.5.0
Assignee:	Alberto
QA Contact:	Jianwei Hou
Docs Contact:
URL:
Whiteboard:
Depends On:
Blocks:	1810446 1826069

Reported:	2019-12-11 18:50 UTC by sumehta
Modified:	2024-06-13 22:20 UTC
CC List:	11 users
Fixed In Version:
Doc Type:	If docs needed, set a value
Doc Text:
Clone Of:
Clones:	1810446 1826069
Environment:
Last Closed:	2020-07-13 17:12:38 UTC
Target Upstream Version:
Embargoed:

Attachments
kube-controller-manager must-gather log (615.30 KB, text/plain) 2019-12-11 18:50 UTC, sumehta

Links
System	ID	Priority	Status	Summary	Last Updated
Github	openshift installer pull 3259	None	closed	BUG 1782516: Disable client side rate limiting in Azure.	2021-02-10 05:04:36 UTC
Red Hat Knowledge Base (Solution)	4861541	None	None	None	2020-05-19 07:50:41 UTC
Red Hat Product Errata	RHBA-2020:2409	None	None	None	2020-07-13 17:13:05 UTC

Description sumehta 2019-12-11 18:50:03 UTC

Created attachment 1644095
kube-controller-manager must-gather log

Description of problem:
When bringing up a 4.3 cluster with networkType set to OVNKubernetes, I see Azure API calls being rate limited. The Azure activity logs give little information beyond an error thrown by the 'Create or Update Load Balancer' API call, but the OpenShift logs show the error below repeatedly.

Version-Release number of selected component (if applicable):
openshift-client-linux-4.3.0-0.nightly-2019-12-09-035405

How reproducible:
Always

Steps to Reproduce:
1. Create an OpenShift cluster on Azure with the latest 4.3 nightly, with networkType set to 'OVNKubernetes'
2. Look at the Activity logs in Azure, as well as the events in the OpenShift console

Actual results:
OpenShift Console events show repeatedly:

Service: router-default, Namespace: openshift-ingress
3 minutes ago
Generated from service-controller
46 times in the last hour
Error updating load balancer with new hosts map[sumehta-winc8-2tk76-master-0:{} sumehta-winc8-2tk76-master-1:{} sumehta-winc8-2tk76-master-2:{} sumehta-winc8-2tk76-worker-centralus1-mmjkr:{} sumehta-winc8-2tk76-worker-centralus2-qptsj:{} sumehta-winc8-2tk76-worker-centralus3-pq5sk:{} winnode:{}]: ensure(openshift-ingress/router-default): backendPoolID(/subscriptions/5f675811-04fa-483f-9709-ffd8a9da03f0/resourceGroups/sumehta-winc8-2tk76-rg/providers/Microsoft.Network/loadBalancers/sumehta-winc8-2tk76/backendAddressPools/sumehta-winc8-2tk76) - failed to ensure host in pool: "azure - cloud provider rate limited(read) for operation:NicGet"

In must-gather logs, this error was present in namespaces/openshift-kube-controller-manager/pods/kube-controller-manager-sumehta-winc8-2tk76-master-1/kube-controller-manager-6/kube-controller-manager-6/logs/current.log, where I could see multiple occurrences of the following:

2019-12-09T18:40:48.5418863Z E1209 18:40:48.541874       1 azure_standard.go:714] error: az.EnsureHostInPool(sumehta-winc8-2tk76-master-0), az.vmSet.GetPrimaryInterface.Get(sumehta-winc8-2tk76-master-0, ), err=azure - cloud provider rate limited(read) for operation:NicGet
2019-12-09T18:40:48.5760855Z I1209 18:40:48.576020       1 azure_backoff.go:80] GetVirtualMachineWithRetry(sumehta-winc8-2tk76-master-1): backoff success
2019-12-09T18:40:48.5761176Z E1209 18:40:48.576101       1 azure_standard.go:714] error: az.EnsureHostInPool(sumehta-winc8-2tk76-master-1), az.vmSet.GetPrimaryInterface.Get(sumehta-winc8-2tk76-master-1, ), err=azure - cloud provider rate limited(read) for operation:NicGet
2019-12-09T18:40:48.5762061Z E1209 18:40:48.576176       1 service_controller.go:255] error processing service openshift-ingress/router-default (will retry): failed to ensure load balancer: ensure(openshift-ingress/router-default): backendPoolID(/subscriptions/5f675811-04fa-483f-9709-ffd8a9da03f0/resourceGroups/sumehta-winc8-2tk76-rg/providers/Microsoft.Network/loadBalancers/sumehta-winc8-2tk76/backendAddressPools/sumehta-winc8-2tk76) - failed to ensure host in pool: "azure - cloud provider rate limited(read) for operation:NicGet"
2019-12-09T18:40:48.5764134Z I1209 18:40:48.576376       1 event.go:255] Event(v1.ObjectReference{Kind:"Service", Namespace:"openshift-ingress", Name:"router-default", UID:"9147d958-b25a-46d3-89c2-f96390d5eb12", APIVersion:"v1", ResourceVersion:"9338", FieldPath:""}): type: 'Warning' reason: 'SyncLoadBalancerFailed' Error syncing load balancer: failed to ensure load balancer: ensure(openshift-ingress/router-default): backendPoolID(/subscriptions/5f675811-04fa-483f-9709-ffd8a9da03f0/resourceGroups/sumehta-winc8-2tk76-rg/providers/Microsoft.Network/loadBalancers/sumehta-winc8-2tk76/backendAddressPools/sumehta-winc8-2tk76) - failed to ensure host in pool: "azure - cloud provider rate limited(read) for operation:NicGet"
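To gauge how often each operation is throttled, the attached log can be scanned for the rate-limit message. The following is a minimal sketch, not part of the original report: the function name count_rate_limited is hypothetical, the sample lines are shortened copies of the entries above, and the regular expression assumes the message always has the form "cloud provider rate limited(<access>) for operation:<name>".

```python
import re
from collections import Counter

# Matches the cloud provider message seen in the log lines above, e.g.
# "azure - cloud provider rate limited(read) for operation:NicGet".
# Assumption: the access type and operation name are plain word characters.
RATE_LIMIT_RE = re.compile(
    r"cloud provider rate limited\((\w+)\) for operation:(\w+)"
)


def count_rate_limited(lines):
    """Count rate-limit messages per (access type, operation) pair."""
    counts = Counter()
    for line in lines:
        # A single line may quote the message more than once.
        for access, operation in RATE_LIMIT_RE.findall(line):
            counts[(access, operation)] += 1
    return counts


# Shortened copies of log lines from this report, used as sample input.
sample = [
    "E1209 18:40:48.541874 1 azure_standard.go:714] error: "
    "az.EnsureHostInPool(sumehta-winc8-2tk76-master-0), "
    "err=azure - cloud provider rate limited(read) for operation:NicGet",
    "I1209 18:40:48.576020 1 azure_backoff.go:80] "
    "GetVirtualMachineWithRetry(sumehta-winc8-2tk76-master-1): backoff success",
    "E1209 18:40:48.576101 1 azure_standard.go:714] error: "
    "az.EnsureHostInPool(sumehta-winc8-2tk76-master-1), "
    "err=azure - cloud provider rate limited(read) for operation:NicGet",
]

for (access, operation), n in count_rate_limited(sample).items():
    print(f"{operation} ({access}): {n}")
```

To run it against the real log, pass an open file object (for example the current.log file from the must-gather archive) instead of the sample list.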

Expected results:
Load balancer sync/update should not throw any errors.

Additional info:
The must-gather logs mentioned above are added as an attachment.

Comment 1 Alberto 2019-12-17 14:25:06 UTC

That seems to be caused by a limitation in the cloud provider: https://github.com/Azure/aks-engine/issues/420. I'd expect the config to support limits per Azure resource type (e.g. VMSS, LoadBalancer and RouteTable).

Note that this configuration is editable by users in-cluster: https://github.com/openshift/installer/blob/789b53e43085ce4e5459ede8d88561c737c2809a/pkg/asset/manifests/azure/cloudproviderconfig.go#L47-L60. Presumably, increasing CloudProviderRateLimitQPS might help.
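As an illustration of the kind of edit meant here, the sketch below rewrites the rate-limit settings in a cloud provider config held as JSON. It rests on several assumptions: the JSON keys are taken to be the lower camel case spellings of the Go field names (cloudProviderRateLimit, cloudProviderRateLimitQPS, cloudProviderRateLimitBucket), the helper set_rate_limit is hypothetical, and the values and the resource group name are placeholders rather than recommendations. Check the keys and current values in the cluster's actual config before applying anything like this.

```python
import json


def set_rate_limit(config_json, enabled, qps, bucket):
    """Return a copy of the config with the client-side rate limit adjusted.

    Assumption: the key names below mirror the Go struct fields of the
    Azure cloud provider config; all other keys are left untouched.
    """
    cfg = json.loads(config_json)
    cfg["cloudProviderRateLimit"] = enabled
    cfg["cloudProviderRateLimitQPS"] = qps
    cfg["cloudProviderRateLimitBucket"] = bucket
    return json.dumps(cfg, indent=2, sort_keys=True)


# Placeholder input; a real config carries many more fields.
original = json.dumps({
    "resourceGroup": "example-rg",
    "cloudProviderRateLimit": True,
    "cloudProviderRateLimitQPS": 6,
    "cloudProviderRateLimitBucket": 10,
})

# Illustrative values only: raise the allowed queries per second and burst.
updated = set_rate_limit(original, enabled=True, qps=20, bucket=100)
print(updated)
```

Passing enabled=False instead would switch the client-side limiter off altogether, which is the direction named in the title of the linked installer pull request.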

Comment 24 Daneyon Hansen 2020-04-20 19:39:26 UTC

According to [1], this BZ exists in 4.4 and 4.3.

[1] https://search.svc.ci.openshift.org/?search=Error+updating+load+balancer+with+new+hosts&maxAge=168h&context=2&type=bug%2Bjunit

Comment 25 Daneyon Hansen 2020-04-20 19:42:23 UTC

Changing to High due to https://bugzilla.redhat.com/show_bug.cgi?id=1782516#c24

Comment 26 Daniel Messer 2020-04-28 12:23:41 UTC

This is also impacting the Quay Setup Operator when it tries to create load balancers on Azure: https://issues.redhat.com/browse/PROJQUAY-638?focusedCommentId=14060390&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-14060390

Comment 31 errata-xmlrpc 2020-07-13 17:12:38 UTC

Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2020:2409
