1809354 – 4.3.1 Azure IPI installs show "cloud provider rate limited(read) for operation:NicGet" in openshift-ingress namespace

Keywords:
Status:	CLOSED ERRATA
Alias:	None
Product:	OpenShift Container Platform
Classification:	Red Hat
Component:	Networking
Sub Component:
Version:	4.3.z
Hardware:	x86_64
OS:	Other
Priority:	low
Severity:	high
Target Milestone:	---
Target Release:	4.5.0
Assignee:	Miciah Dashiel Butler Masters
QA Contact:	Hongan Li
Docs Contact:
URL:
Whiteboard:
Duplicates (1):	1837324
Depends On:
Blocks:

Reported:	2020-03-02 22:38 UTC by Caden Marchese
Modified:	2024-06-13 22:29 UTC
CC List:	7 users
Fixed In Version:
Doc Type:	Bug Fix
Doc Text:	Cause: The ingress operator was continuously upserting DNS records that it managed on Azure and GCP. Consequence: The cloud-provider API sometimes rate-limited the upsert API calls, causing alarming "cloud provider rate limited" events in the "openshift-ingress" namespace. In addition, the ingress operator's logs showed repeated "upserted DNS record" log messages. Fix: Logic was added to the ingress operator's DNS controller to avoid upserting a DNS record if it is already published and neither the record nor the DNS zone configuration has changed since the controller last upserted the record. Result: The ingress operator makes fewer calls to the cloud-provider API, the operator's logs show fewer "upserted DNS record" log messages, and the operator should not cause "cloud provider rate limited" events.
Clone Of:
Environment:
Last Closed:	2020-07-13 17:17:29 UTC
Target Upstream Version:
Embargoed:

Links
System	ID	Priority	Status	Summary	Last Updated
Github	openshift api pull 644	None	closed	Bug 1809354: operatoringress/dnsrecord: Add observedGeneration	2021-02-01 04:18:09 UTC
Github	openshift cluster-ingress-operator pull 390	None	closed	Bug 1809354: dns: Avoid unnecessary updates	2021-02-01 04:18:09 UTC
Github	openshift okd issues 128	None	closed	[Azure] complains about load balancer update in events: "Provider rate limited(read) for operation:NicGet"	2021-02-01 04:18:09 UTC
Red Hat Product Errata	RHBA-2020:2409	None	None	None	2020-07-13 17:17:57 UTC
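
The check described in the Doc Text (skip the upsert when the record is already published and neither the record nor the DNS zone configuration has changed) can be sketched roughly as follows. This is a minimal illustration, not the operator's actual code: the types DNSRecord and ZoneStatus, the function needsUpsert, and the zoneConfigChanged flag are hypothetical stand-ins, and the only detail taken from this report is that an observedGeneration field was added to the DNSRecord status.

```go
package main

import "fmt"

// ZoneStatus and DNSRecord are simplified, hypothetical stand-ins for the
// operator's API types; only the fields needed for the check are modelled.
type ZoneStatus struct {
	Zone   string // identifier of the DNS zone
	Failed bool   // true if the last publish attempt to this zone failed
}

type DNSRecord struct {
	Generation         int64 // incremented whenever the record's spec changes
	ObservedGeneration int64 // generation the controller last upserted
	Zones              []ZoneStatus
}

// needsUpsert reports whether the record has to be sent to the cloud
// provider again. It returns false only when nothing has changed and the
// record is already published to every wanted zone.
func needsUpsert(rec DNSRecord, wantZones []string, zoneConfigChanged bool) bool {
	if zoneConfigChanged || rec.Generation != rec.ObservedGeneration {
		return true
	}
	published := make(map[string]bool, len(rec.Zones))
	for _, z := range rec.Zones {
		if !z.Failed {
			published[z.Zone] = true
		}
	}
	for _, z := range wantZones {
		if !published[z] {
			return true
		}
	}
	return false
}

func main() {
	zones := []string{"public", "private"}
	rec := DNSRecord{
		Generation:         2,
		ObservedGeneration: 2,
		Zones:              []ZoneStatus{{Zone: "public"}, {Zone: "private"}},
	}
	fmt.Println(needsUpsert(rec, zones, false)) // false: nothing changed

	rec.Generation = 3 // the record's spec was edited
	fmt.Println(needsUpsert(rec, zones, false)) // true: must upsert again
}
```

Comparing generations is the usual Kubernetes way to detect spec changes without diffing the whole object, which is presumably why the API change added the field.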

Description Caden Marchese 2020-03-02 22:38:17 UTC

Description of problem:
Fresh IPI 4.3.1 install on Azure shows the following event in the openshift-ingress namespace:

Error syncing load balancer: failed to ensure load balancer: ensure(openshift-ingress/router-default): backendPoolID(/subscriptions/d0050fd8-4253-465a-bbbc-83fab9da5e49/resourceGroups/node-gllzw-rg/providers/Microsoft.Network/loadBalancers/node-gllzw/backendAddressPools/node-gllzw) - failed to ensure host in pool: "azure - cloud provider rate limited(read) for operation:NicGet"

This appears to have no consequences; however, it is very easy to reproduce and the message is alarming to customers. The (in-progress) fix seems to be editing the kubelet machine configs to set CloudProviderRateLimit to 'false':

  # oc edit machineconfig -o yaml 01-master-kubelet
  # oc edit machineconfig -o yaml 01-worker-kubelet

This then overwrites the line in /etc/kubernetes/cloud.conf that is allegedly causing this message.
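
To make the effect of that workaround concrete, here is a small sketch of the change it makes to the cloud provider configuration. It assumes the file is JSON and that the setting is stored under the key cloudProviderRateLimit; the sample content is hypothetical and heavily trimmed, and the helper names disableRateLimit and rateLimitEnabled are illustrative only. Note that Comment 2 below questions whether disabling rate limiting is an appropriate fix at all.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// disableRateLimit returns the given cloud provider config (assumed to be a
// JSON object) with cloudProviderRateLimit set to false. Other keys are kept
// as raw JSON so their values are not reinterpreted.
func disableRateLimit(conf []byte) ([]byte, error) {
	var cfg map[string]json.RawMessage
	if err := json.Unmarshal(conf, &cfg); err != nil {
		return nil, fmt.Errorf("parsing cloud config: %w", err)
	}
	if cfg == nil {
		cfg = map[string]json.RawMessage{}
	}
	cfg["cloudProviderRateLimit"] = json.RawMessage("false")
	return json.MarshalIndent(cfg, "", "  ")
}

// rateLimitEnabled reads the current value of cloudProviderRateLimit.
func rateLimitEnabled(conf []byte) (bool, error) {
	var cfg struct {
		CloudProviderRateLimit bool `json:"cloudProviderRateLimit"`
	}
	if err := json.Unmarshal(conf, &cfg); err != nil {
		return false, fmt.Errorf("parsing cloud config: %w", err)
	}
	return cfg.CloudProviderRateLimit, nil
}

func main() {
	// Hypothetical sample; a real cloud.conf carries many more keys.
	sample := []byte(`{"cloud": "AzurePublicCloud", "cloudProviderRateLimit": true}`)

	out, err := disableRateLimit(sample)
	if err != nil {
		panic(err)
	}
	fmt.Println(string(out))

	enabled, err := rateLimitEnabled(out)
	if err != nil {
		panic(err)
	}
	fmt.Println("rate limiting enabled:", enabled)
}
```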

Version-Release number of selected component (if applicable):
4.3.1, Azure IPI

How reproducible:
Very

Steps to Reproduce:
1. Install 4.3.1 on Azure IPI
2. Run "oc get events -n openshift-ingress", or wait for the events to populate in the main console events feed

Actual results:
Rate limiting message

Expected results:
No rate limiting message

Additional info:
Must-gather of my testing environment and/or the customer's environment is available on request.

Comment 2 Dan Mace 2020-03-03 19:20:50 UTC

We've seen this from the k8s cloud provider since day one with Azure, so it's not a blocking regression, and is always transient. Disabling rate limiting as a fix seems suspicious, though. Moving to 4.5. There may be an old bug of which this is a duplicate.

Comment 7 Daneyon Hansen 2020-04-30 19:34:47 UTC

Pedro,

The ingress operator periodically ensures the default ingresscontroller is present. You should be able to recreate the default-wildcard DNSRecord by deleting ingresscontroller/default:

$ oc delete ingresscontroller/default -n openshift-ingress-operator
ingresscontroller.operator.openshift.io "default" deleted

Wait about a minute for the operator to recreate it.

$ oc get ingresscontroller/default -n openshift-ingress-operator
NAME      AGE
default   2m20s

$ oc get dnsrecord/default-wildcard -n openshift-ingress-operator -o json|jq ".status.zones[0].conditions[0]"
{
  "lastTransitionTime": "2020-04-30T19:28:15Z",
  "message": "The DNS provider succeeded in ensuring the record",
  "reason": "ProviderSuccess",
  "status": "False",
  "type": "Failed"
}
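
The condition above is easy to misread: the type is "Failed" and the status is "False", which means the record did not fail, i.e. it was published. A minimal sketch of that interpretation follows; the Condition struct and the helpers parseCondition and zoneFailed are illustrative names, not the operator's API.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// Condition mirrors only the fields shown in the jq output above.
type Condition struct {
	Type    string `json:"type"`
	Status  string `json:"status"`
	Reason  string `json:"reason"`
	Message string `json:"message"`
}

// parseCondition decodes a single condition from JSON.
func parseCondition(raw string) (Condition, error) {
	var c Condition
	err := json.Unmarshal([]byte(raw), &c)
	return c, err
}

// zoneFailed reports whether any condition of type "Failed" has status
// "True". A "Failed" condition with status "False" means success.
func zoneFailed(conds []Condition) bool {
	for _, c := range conds {
		if c.Type == "Failed" && c.Status == "True" {
			return true
		}
	}
	return false
}

func main() {
	raw := `{
	  "lastTransitionTime": "2020-04-30T19:28:15Z",
	  "message": "The DNS provider succeeded in ensuring the record",
	  "reason": "ProviderSuccess",
	  "status": "False",
	  "type": "Failed"
	}`
	c, err := parseCondition(raw)
	if err != nil {
		panic(err)
	}
	fmt.Println(zoneFailed([]Condition{c})) // prints false: the record was published
}
```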

Comment 9 Daneyon Hansen 2020-05-01 22:30:31 UTC

Pedro,

After looking into this issue further, here's the proper way to fix your issue:

# Remove the finalizer:
$ oc patch -n openshift-ingress-operator dnsrecord/default-wildcard --patch '{"metadata":{"finalizers": []}}' --type=merge
dnsrecord.ingress.operator.openshift.io/default-wildcard patched

# Delete the stuck dnsrecord
$ oc delete dnsrecord/default-wildcard -n openshift-ingress-operator
dnsrecord.ingress.operator.openshift.io "default-wildcard" deleted

# Verify the record has been recreated.
$ oc get dnsrecord/default-wildcard -n openshift-ingress-operator
NAME               AGE
default-wildcard   4s

# Verify the status of co/ingress:
$ oc get co/ingress
NAME      VERSION                             AVAILABLE   PROGRESSING   DEGRADED   SINCE
ingress   4.5.0-0.nightly-2020-04-03-194832   True        False         False      6h1m

A dnsrecord is only created during ingresscontroller creation [0].

[0] https://github.com/openshift/api/blob/master/operator/v1/types_ingress.go#L268-L274
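
The first step above works because a JSON merge patch (RFC 7386) replaces arrays wholesale, so sending an empty finalizers list clears every finalizer while leaving the rest of the metadata alone. The sketch below implements the merge rules in a few lines to show the effect; it is not how oc or the API server applies patches internally, and the finalizer name in the sample object is hypothetical.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// mergePatch applies an RFC 7386 JSON merge patch to target and returns the
// result. Objects are merged key by key (target maps are modified in place),
// a null value deletes the key, and any other value, including an array,
// replaces the target value outright.
func mergePatch(target, patch interface{}) interface{} {
	p, ok := patch.(map[string]interface{})
	if !ok {
		return patch
	}
	t, ok := target.(map[string]interface{})
	if !ok {
		t = map[string]interface{}{}
	}
	for k, v := range p {
		if v == nil {
			delete(t, k)
		} else {
			t[k] = mergePatch(t[k], v)
		}
	}
	return t
}

func must(err error) {
	if err != nil {
		panic(err)
	}
}

func main() {
	var obj, patch interface{}
	// Hypothetical object; the finalizer name is made up for illustration.
	must(json.Unmarshal([]byte(`{"metadata":{"name":"default-wildcard","finalizers":["example.com/hypothetical-finalizer"]}}`), &obj))
	// The same patch body that is passed to "oc patch" above.
	must(json.Unmarshal([]byte(`{"metadata":{"finalizers": []}}`), &patch))

	out, err := json.Marshal(mergePatch(obj, patch))
	must(err)
	fmt.Println(string(out))
}
```

With no finalizers left, the pending deletion can complete, which is why the subsequent delete and recreate succeed.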

Comment 10 Pedro Amoedo 2020-05-04 11:26:48 UTC

Thanks Daneyon, we'll try that and get back to you ASAP to confirm the workaround.

Best Regards.

Comment 22 Pedro Amoedo 2020-05-11 15:32:51 UTC

Hi again Daneyon, please also note that we have found the related BZ#1782516, which already contains a PR [1] that seems to basically deactivate "CloudProviderRateLimit", right?

Please note that when the patch is successfully merged to 4.5, we'll need a proper 4.3 backport, thanks.

NOTE: I will also link our case with that BZ, we can continue there if you prefer.

[1] - https://github.com/openshift/installer/pull/3259

Best Regards.

Comment 23 Pedro Amoedo 2020-05-11 15:36:04 UTC

[UPDATE]

Please disregard the 4.3 backport comment, I can see that BZ#1826073 is already in place for that, thanks.

Comment 24 Andrew McDermott 2020-05-21 16:25:53 UTC

*** Bug 1837324 has been marked as a duplicate of this bug. ***

Comment 27 Hongan Li 2020-05-28 07:21:49 UTC

Verified with 4.5.0-0.nightly-2020-05-27-202943; the issue has been fixed.

Did not see any "rate limited" events or the problem described in https://bugzilla.redhat.com/show_bug.cgi?id=1837324.

Comment 29 errata-xmlrpc 2020-07-13 17:17:29 UTC

Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2020:2409

Comment 30 Red Hat Bugzilla 2024-01-06 04:28:18 UTC

The needinfo request[s] on this closed bug have been removed as they have been unresolved for 120 days.