Bug 1853734

Summary: cluster-image-registry-operator calls ListKeys excessively
Product: OpenShift Container Platform Reporter: Jim Minter <jminter>
Component: Image Registry Assignee: Ricardo Maraschini <rmarasch>
Status: CLOSED ERRATA QA Contact: XiuJuan Wang <xiuwang>
Severity: medium Docs Contact:
Priority: unspecified    
Version: 4.3.z CC: aos-bugs, jminter, mjudeiki, obulatov, pasik, rmarasch, xiuwang, xtian
Target Milestone: ---   
Target Release: 4.6.0   
Hardware: Unspecified   
OS: Unspecified   
Whiteboard:
Fixed In Version: Doc Type: Enhancement
Doc Text:
Feature: Avoid calling Azure endpoints too many times. Reason: Azure enforces quotas and the operator was constantly querying for storage account keys. Result: The keys are cached locally for a period of time, so the operator no longer fetches them remotely on every use.
Story Points: ---
Clone Of: Environment:
Last Closed: 2020-10-27 16:12:04 UTC Type: Bug
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: --- RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Embargoed:

Description Jim Minter 2020-07-03 17:41:18 UTC
Description of problem:
  In IPI mode, cluster-image-registry-operator calls ListKeys on the Azure storage account incessantly.  This is bad behaviour, especially so because the quotas on ListKeys() calls are low (https://aka.ms/srpthrottlinglimits) and others could be affected.

  Ideally, cluster-image-registry-operator should call ListKeys once at start-up, and then subsequently in a rate-limited way if it detects an authentication error using the key that it has cached (e.g. in the unlikely event that the end user has rotated the storage key).
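  The behaviour requested above can be sketched roughly as follows. This is a minimal illustration, not the operator's actual code: CachedKey, KeyFetcher, ErrRefreshThrottled and RefreshAfterAuthError are hypothetical names, and the fetch function merely stands in for the real ListKeys call.

```go
package main

import (
	"errors"
	"sync"
	"time"
)

// ErrRefreshThrottled reports that a refresh was requested too soon
// after the previous fetch (hypothetical error, for illustration only).
var ErrRefreshThrottled = errors.New("key refresh throttled")

// KeyFetcher stands in for the remote ListKeys call (assumed signature).
type KeyFetcher func() (string, error)

// CachedKey holds one storage account key and limits how often it is re-fetched.
type CachedKey struct {
	mu          sync.Mutex
	fetch       KeyFetcher
	minInterval time.Duration
	key         string
	lastFetch   time.Time
}

// NewCachedKey fetches the key exactly once, at start-up.
func NewCachedKey(fetch KeyFetcher, minInterval time.Duration) (*CachedKey, error) {
	key, err := fetch()
	if err != nil {
		return nil, err
	}
	return &CachedKey{fetch: fetch, minInterval: minInterval, key: key, lastFetch: time.Now()}, nil
}

// Key returns the cached key without making any remote call.
func (c *CachedKey) Key() string {
	c.mu.Lock()
	defer c.mu.Unlock()
	return c.key
}

// RefreshAfterAuthError re-fetches the key after the caller has seen an
// authentication failure, but at most once per minInterval.
func (c *CachedKey) RefreshAfterAuthError() (string, error) {
	c.mu.Lock()
	defer c.mu.Unlock()
	if time.Since(c.lastFetch) < c.minInterval {
		return c.key, ErrRefreshThrottled
	}
	// Record the attempt even if it fails, so repeated errors cannot cause a call storm.
	c.lastFetch = time.Now()
	key, err := c.fetch()
	if err != nil {
		return c.key, err
	}
	c.key = key
	return key, nil
}
```

  In this sketch a caller would use Key() for every storage request and call RefreshAfterAuthError() only when a request fails authentication, for example after the end user has rotated the storage key.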


Version-Release number of selected component (if applicable):
  4.3.27


How reproducible:
  Always.


Steps to Reproduce:
1. Run IPI cluster.
2. Check audit logs on registry storage account in Azure portal.

Actual results:
  Lots of ListKeys calls seen.

Expected results:
  Few ListKeys calls seen.

Additional info:
  This BZ is closely related to https://bugzilla.redhat.com/show_bug.cgi?id=1853643.

Comment 4 Ricardo Maraschini 2020-08-21 12:37:15 UTC
No progress due to higher severity bugs.

Comment 8 XiuJuan Wang 2020-09-11 10:13:38 UTC
In the Azure portal console, the "List Storage Account Keys" event now appears once every 5 minutes instead of several times a minute.
Tested with 4.6.0-0.nightly-2020-09-10-195619 on an Azure cluster.
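The 5-minute interval is consistent with a time-based cache in front of the key lookup. A minimal sketch of such a cache follows. It assumes a 5-minute TTL and a stand-in fetch function, and TTLKeyCache is a hypothetical name rather than the operator's real type.

```go
package main

import (
	"sync"
	"time"
)

// TTLKeyCache keeps a fetched key for ttl before asking the remote side again.
type TTLKeyCache struct {
	mu      sync.Mutex
	fetch   func() (string, error) // stand-in for the remote ListKeys call
	ttl     time.Duration
	key     string
	expires time.Time
}

// NewTTLKeyCache returns an empty cache; the first Get always fetches.
func NewTTLKeyCache(fetch func() (string, error), ttl time.Duration) *TTLKeyCache {
	return &TTLKeyCache{fetch: fetch, ttl: ttl}
}

// Get returns the cached key while it is fresh and re-fetches it otherwise.
func (c *TTLKeyCache) Get() (string, error) {
	c.mu.Lock()
	defer c.mu.Unlock()
	if time.Now().Before(c.expires) {
		return c.key, nil // still fresh: no remote call
	}
	key, err := c.fetch()
	if err != nil {
		return "", err
	}
	c.key, c.expires = key, time.Now().Add(c.ttl)
	return key, nil
}
```

With NewTTLKeyCache(fetch, 5*time.Minute), any number of Get calls within a 5-minute window results in a single remote fetch.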

Comment 9 XiuJuan Wang 2020-09-14 03:08:20 UTC
The image_registry_operator_azure_key_cache_requests_total metric has also been added:
image_registry_operator_azure_key_cache_requests_total{endpoint="60000",instance="10.129.0.13:60000",job="image-registry-operator",namespace="openshift-image-registry",pod="cluster-image-registry-operator-f5f4469b4-l8k2f",result="hit",service="image-registry-operator"}	2586
image_registry_operator_azure_key_cache_requests_total{endpoint="60000",instance="10.129.0.13:60000",job="image-registry-operator",namespace="openshift-image-registry",pod="cluster-image-registry-operator-f5f4469b4-l8k2f",result="miss",service="image-registry-operator"}	18
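The two counters above can be combined into a cache hit ratio. The helper below is only an illustrative calculation and hitRatio is a hypothetical name; it is not part of the operator.

```go
package main

// hitRatio returns the fraction of cache lookups that were served
// from the cache, i.e. without a remote ListKeys call.
func hitRatio(hits, misses uint64) float64 {
	total := hits + misses
	if total == 0 {
		return 0 // no lookups recorded yet
	}
	return float64(hits) / float64(total)
}
```

For the sample values above, hitRatio(2586, 18) is 2586/2604, or roughly 0.993, meaning about 99.3% of key lookups were answered from the cache.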

Comment 11 errata-xmlrpc 2020-10-27 16:12:04 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (OpenShift Container Platform 4.6 GA Images), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2020:4196