Bug 1809354

Summary: 4.3.1 Azure IPI installs show "cloud provider rate limited(read) for operation:NicGet" in openshift-ingress namespace
Product: OpenShift Container Platform
Reporter: Caden Marchese <cmarches>
Component: Networking
Assignee: Miciah Dashiel Butler Masters <mmasters>
Networking sub component: router
QA Contact: Hongan Li <hongli>
Status: CLOSED ERRATA
Docs Contact:
Severity: high
Priority: low
CC: amcdermo, aos-bugs, dhansen, hongli, mharri, pamoedom, rdomnu
Version: 4.3.z
Target Milestone: ---
Target Release: 4.5.0
Hardware: x86_64
OS: Other
Whiteboard:
Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Cause: The ingress operator was continuously upserting DNS records that it managed on Azure and GCP.
Consequence: The cloud-provider API sometimes rate-limited the upsert API calls, causing alarming "cloud provider rate limited" events in the "openshift-ingress" namespace. In addition, the ingress operator's logs showed repeated "upserted DNS record" log messages.
Fix: Logic was added to the ingress operator's DNS controller to avoid upserting a DNS record if it is already published and neither the record nor the DNS zone configuration has changed since the controller last upserted the record (see the sketch after these header fields).
Result: The ingress operator makes fewer calls to the cloud-provider API, the operator's logs show fewer "upserted DNS record" log messages, and the operator should not cause "cloud provider rate limited" events.
Story Points: ---
Clone Of:
Environment:
Last Closed: 2020-07-13 17:17:29 UTC
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Embargoed:
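
For readers curious how the fix works, the following is a minimal, hypothetical Go sketch of the "skip unchanged upserts" idea described in the Fix field above. It is not the ingress operator's actual code: the record, zone, and publisher types and the in-memory bookkeeping map are invented here for illustration.

package main

import (
	"fmt"
	"reflect"
)

// record and zone are invented stand-ins for the operator's DNSRecord spec
// and DNS zone configuration; the real types live in the openshift/api repo.
type record struct {
	DNSName    string
	RecordType string
	Targets    []string
	TTL        int64
}

type zone struct {
	ID string
}

// publisher remembers the last record it successfully upserted per
// (zone, DNS name) pair.
type publisher struct {
	published map[string]record
}

func newPublisher() *publisher {
	return &publisher{published: map[string]record{}}
}

// ensure calls upsert only when the record differs from what was last
// published for this zone; otherwise it skips the cloud-provider call.
func (p *publisher) ensure(z zone, r record, upsert func(zone, record) error) error {
	key := z.ID + "/" + r.DNSName
	if prev, ok := p.published[key]; ok && reflect.DeepEqual(prev, r) {
		return nil // already published and unchanged: no API call
	}
	if err := upsert(z, r); err != nil {
		return err
	}
	p.published[key] = r
	return nil
}

func main() {
	p := newPublisher()
	z := zone{ID: "/subscriptions/.../dnszones/example.com"}
	r := record{
		DNSName:    "*.apps.example.com.",
		RecordType: "A",
		Targets:    []string{"203.0.113.10"},
		TTL:        30,
	}

	// Fake cloud-provider upsert; the real operator talks to Azure/GCP DNS.
	upsert := func(z zone, r record) error {
		fmt.Printf("upserting %s in zone %s\n", r.DNSName, z.ID)
		return nil
	}

	_ = p.ensure(z, r, upsert) // first reconcile: upserts
	_ = p.ensure(z, r, upsert) // later reconciles with no change: skipped
}

The essential point is only that the cloud-provider upsert is skipped whenever the record and its zone match what was last published successfully, which is what cuts down the API call volume and the rate-limited events.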

Description Caden Marchese 2020-03-02 22:38:17 UTC
Description of problem:
Fresh IPI 4.3.1 install on Azure shows the following event in the openshift-ingress namespace:

Error syncing load balancer: failed to ensure load balancer: ensure(openshift-ingress/router-default): backendPoolID(/subscriptions/d0050fd8-4253-465a-bbbc-83fab9da5e49/resourceGroups/node-gllzw-rg/providers/Microsoft.Network/loadBalancers/node-gllzw/backendAddressPools/node-gllzw) - failed to ensure host in pool: "azure - cloud provider rate limited(read) for operation:NicGet"

This appears to have no functional consequences, but it is very easy to reproduce and the message is alarming to customers. The (in-progress) fix seems to be editing the kubelet machine configs to set CloudProviderRateLimit to 'false':

  # oc edit machineconfig -o yaml 01-master-kubelet
  # oc edit machineconfig -o yaml 01-worker-kubelet

This overwrites the setting in /etc/kubernetes/cloud.conf that is allegedly causing the message.
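
For context, the rate-limiting knobs live in the Azure cloud-provider configuration in /etc/kubernetes/cloud.conf. The excerpt below is illustrative only: it assumes the JSON field names used by the upstream Kubernetes Azure cloud provider (cloudProviderRateLimit, cloudProviderRateLimitQPS, cloudProviderRateLimitBucket), and the values are made up rather than copied from the affected cluster. The workaround above amounts to flipping cloudProviderRateLimit to false:

  {
    ...
    "cloudProviderRateLimit": true,
    "cloudProviderRateLimitQPS": 6,
    "cloudProviderRateLimitBucket": 10,
    ...
  }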

Version-Release number of selected component (if applicable):
4.3.1, Azure IPI

How reproducible:
Very

Steps to Reproduce:
1. Install 4.3.1 on Azure IPI
2. "oc get events -n openshift-ingress," or wait for the events to populate in the main console events feed

Actual results:
Rate limiting message

Expected results:
No rate limiting message

Additional info:
Must-gather of my testing environment and/or the customer's environment is available on request.

Comment 2 Dan Mace 2020-03-03 19:20:50 UTC
We've seen this from the k8s cloud provider since day one with Azure, so it's not a blocking regression, and is always transient. Disabling rate limiting as a fix seems suspicious, though. Moving to 4.5. There may be an old bug of which this is a duplicate.

Comment 7 Daneyon Hansen 2020-04-30 19:34:47 UTC
Pedro,

The ingress operator periodically ensures the default ingresscontroller is present. You should be able to recreate the default-wildcard DNSRecord by deleting ingresscontroller/default:

$ oc delete ingresscontroller/default -n openshift-ingress-operator
ingresscontroller.operator.openshift.io "default" deleted

Wait a minute.

$ oc get ingresscontroller/default -n openshift-ingress-operator
NAME      AGE
default   2m20s

$ oc get dnsrecord/default-wildcard -n openshift-ingress-operator -o json|jq ".status.zones[0].conditions[0]"
{
  "lastTransitionTime": "2020-04-30T19:28:15Z",
  "message": "The DNS provider succeeded in ensuring the record",
  "reason": "ProviderSuccess",
  "status": "False",
  "type": "Failed"
}

Comment 9 Daneyon Hansen 2020-05-01 22:30:31 UTC
Pedro,

After looking into this further, here's the proper way to fix your issue:

# Remove the finalizer:
$ oc patch -n openshift-ingress-operator dnsrecord/default-wildcard --patch '{"metadata":{"finalizers": []}}' --type=merge
dnsrecord.ingress.operator.openshift.io/default-wildcard patched

# Delete the stuck dnsrecord
$ oc delete dnsrecord/default-wildcard -n openshift-ingress-operator
dnsrecord.ingress.operator.openshift.io "default-wildcard" deleted

# Verify the record has been recreated.
$ oc get dnsrecord/default-wildcard -n openshift-ingress-operator
NAME               AGE
default-wildcard   4s

# Verify the status of co/ingress:
$ oc get co/ingress
NAME      VERSION                             AVAILABLE   PROGRESSING   DEGRADED   SINCE
ingress   4.5.0-0.nightly-2020-04-03-194832   True        False         False      6h1m

A dnsrecord is only created during ingresscontroller creation [0].

[0] https://github.com/openshift/api/blob/master/operator/v1/types_ingress.go#L268-L274

Comment 10 Pedro Amoedo 2020-05-04 11:26:48 UTC
Thanks Daneyon, we'll try that and get back to you ASAP to confirm the workaround.

Best Regards.

Comment 22 Pedro Amoedo 2020-05-11 15:32:51 UTC
Hi again Daneyon, please also note that we have found the related BZ#1782516, which already contains a PR[1] that seems to basically deactivate "CloudProviderRateLimit", right?

Please note that once the patch is successfully merged to 4.5, we'll need a proper 4.3 backport, thanks.

NOTE: I will also link our case with that BZ; we can continue there if you prefer.

[1] - https://github.com/openshift/installer/pull/3259

Best Regards.

Comment 23 Pedro Amoedo 2020-05-11 15:36:04 UTC
[UPDATE]

Please disregard the 4.3 backport comment; I can see that BZ#1826073 is already in place for that, thanks.

Comment 24 Andrew McDermott 2020-05-21 16:25:53 UTC
*** Bug 1837324 has been marked as a duplicate of this bug. ***

Comment 27 Hongan Li 2020-05-28 07:21:49 UTC
Verified with 4.5.0-0.nightly-2020-05-27-202943 and the issue has been fixed.

Did not see any "rate limited" events or the problem described in https://bugzilla.redhat.com/show_bug.cgi?id=1837324.

Comment 29 errata-xmlrpc 2020-07-13 17:17:29 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2020:2409

Comment 30 Red Hat Bugzilla 2024-01-06 04:28:18 UTC
The needinfo request[s] on this closed bug have been removed as they have been unresolved for 120 days.