Bug 1776665

Summary:	[4.2] Idle OpenShift Image registry queries Azure storage keys about 40 times per minute
Product:	OpenShift Container Platform	Reporter:	Joel Pearson <japearson>
Component:	Image Registry	Assignee:	Corey Daley <cdaley>
Status:	CLOSED ERRATA	QA Contact:	Wenjing Zheng <wzheng>
Severity:	urgent	Docs Contact:
Priority:	unspecified
Version:	4.2.z	CC:	adam.kaplan, aos-bugs, cdaley, esimard, mharri, obulatov, sdodson, wking
Target Milestone:	---	Keywords:	Reopened, Upgrades
Target Release:	4.5.0
Hardware:	All
OS:	Linux
Whiteboard:
Fixed In Version:		Doc Type:	If docs needed, set a value
Doc Text:		Story Points:	---
Clone Of:
Clones:	1808425 1808505 (view as bug list)		Environment:
Last Closed:	2020-07-13 17:12:18 UTC	Type:	Bug
Regression:	---	Mount Type:	---
Documentation:	---	CRM:
Verified Versions:		Category:	---
oVirt Team:	---	RHEL 7.3 requirements from Atomic Host:
Cloudforms Team:	---	Target Upstream Version:
Embargoed:
Bug Depends On:	1815539
Bug Blocks:	1808425

Description Joel Pearson 2019-11-26 06:14:54 UTC

Description of problem:
I've noticed a strange thing with the Image Registry running on Azure in OpenShift 4.2.7 (possibly all other versions too).

When the registry is idle, I'm seeing about 40 requests per minute for "List Storage Account Keys" per minute in Azure console, under the resource group "Activity log".  Each request is has a "Started" and "Succeeded" event, so it's 80 events in total per minute.

I looked in the logs of the "cluster-image-registry-operator" and these log entries seem to correlate within about a second of the azure activity log entries.

I1125 05:20:33.804388      14 controller.go:260] event from workqueue successfully processed
I1125 05:20:36.816148      14 controller.go:260] event from workqueue successfully processed
I1125 05:20:39.987651      14 controller.go:260] event from workqueue successfully processed


Version-Release number of selected component (if applicable): 4.2.7


How reproducible: Everytime


Steps to Reproduce:
1. Provision an OpenShift cluster on Azure
2. Go into the Azure console, and open the resource group that openshift is installed in, open the activity log

Actual results:

There is lots of "List Storage Account Keys" events, roughly 80 events per minute (2 events for each "List Storage Account Keys" request)

Expected results:

Significantly less "List Storage Account Keys" events, as this seems rather excessive, especially when the registry is idle.


Additional info:

Comment 1 Corey Daley 2020-01-28 17:40:25 UTC

Submitted pull request: https://github.com/openshift/cluster-image-registry-operator/pull/450

Comment 2 Joel Pearson 2020-02-05 04:44:30 UTC

I saw the above pull request got closed because the plan is to move to multiple controllers.  Would the target still be 4.4.0? Or are there no present plants to split into multiple controllers?

I'm asking because my Azure Admins are still complaining about the significant number of events per minute.

Comment 11 Wenjing Zheng 2020-03-16 11:15:14 UTC

I can see requests appearing every 5 minutes and no one is using the cluster with 4.5.0-0.nightly-2020-03-16-004817, is it expected? since when I verify on 4.3/4.2, not like this.

Comment 15 Scott Dodson 2020-03-16 18:56:30 UTC

We're asking the following questions to evaluate whether or not this bug warrants blocking an upgrade edge from either the previous X.Y or X.Y.Z. The ultimate goal is to avoid delivering an update which introduces new risk or reduces cluster functionality in any way. Sample answers are provided to give more context. The UpgradeBlocker flag has been added to this bug. It will be removed if the assessment indicates that this should not block upgrade edges.

Who is impacted?
  Customers upgrading from 4.2.99 to 4.3.z running on GCP with thousands of namespaces, approximately 5% of the subscribed fleet
  All customers upgrading from 4.2.z to 4.3.z fail approximately 10% of the time
What is the impact?
  Up to 2 minute disruption in edge routing
  Up to 90seconds of API downtime
  etcd loses quorum and you have to restore from backup
How involved is remediation?
  Issue resolves itself after five minutes
  Admin uses oc to fix things
  Admin must SSH to hosts, restore from backups, or other non standard admin activities
Is this a regression?
  No, it’s always been like this we just never noticed
  Yes, from 4.2.z and 4.3.1

Comment 16 W. Trevor King 2020-03-16 19:14:00 UTC

Tracking the removal of the install-config fallback when ProviderStatus is empty (which is where born-in-4.1 clusters will break, because they have an empty ProviderStatus [1]).

$ git --no-pager log -4 --first-parent --oneline origin/release-4.4
6516529c5 (origin/release-4.4) Merge pull request #483 from openshift-cherrypick-robot/cherry-pick-465-to-release-4.4
9fba4f340 Merge pull request #479 from ricardomaraschini/removed-as-available-4.4
5207d6e39 Merge pull request #478 from openshift-cherrypick-robot/cherry-pick-474-to-release-4.4
561a51010 Merge pull request #471 from openshift-cherrypick-robot/cherry-pick-470-to-release-4.4
$ git --no-pager log -1 --first-parent --oneline origin/release-4.3
f16764832 (origin/release-4.3) Merge pull request #475 from coreydaley/4_3_remove_kubesystem_watch

Checking vs. existing releases:

$ oc adm release info --commits quay.io/openshift-release-dev/ocp-release:4.4.0-rc.1-x86_64 | grep image-registry-operator
  cluster-image-registry-operator                https://github.com/openshift/cluster-image-registry-operator                6516529c5790e0b522a17980bb2186cb08c0b0e1
$ oc adm release info --commits quay.io/openshift-release-dev/ocp-release:4.4.0-rc.0-x86_64 | grep image-registry-operator
  cluster-image-registry-operator                https://github.com/openshift/cluster-image-registry-operator                6516529c5790e0b522a17980bb2186cb08c0b0e1

So both 4.4.0-rc.0 and 4.4.0-rc.1 will be broken for born-in-4.1 clusters.  

$ oc adm release info --commits quay.io/openshift-release-dev/ocp-release:4.3.5-x86_64 | grep image-registry-operator
  cluster-image-registry-operator               https://github.com/openshift/cluster-image-registry-operator               89794913d7dad5e3f2dd3b6211f0b31da8886508

So 4.3.5 still supports the install-config fallback needed for born-in-4.1 clusters.

[1]: https://github.com/openshift/cluster-image-registry-operator/pull/470#discussion_r392454218

Comment 17 W. Trevor King 2020-03-16 19:26:37 UTC

Oops, removed the Upgrades keyword by accident.  Putting it back...

Comment 19 Wenjing Zheng 2020-03-24 12:07:59 UTC

"List Storage Account Keys" events reduces magnificently(about 6 times one minutes and 5 minutes refresh once) with 4.5.0-0.nightly-2020-03-24-002404.

Comment 20 Wenjing Zheng 2020-03-24 13:10:54 UTC

Just verify on 4.5 fresh cluster, will check with born-in-4.1 clusters(but I will start from 4.2, since Azure is supported since 4.2)

Comment 21 Oleg Bulatov 2020-03-24 14:03:04 UTC

It have to be 4.1. As we don't support Azure on 4.1, we have two independent items to check:

1. number of the "List Storage Account Keys" events (VERIFIED)
2. the operator still works on born-in-4.1 clusters (you can pick any cloud provider)

Comment 22 Wenjing Zheng 2020-03-25 11:49:31 UTC

(In reply to Oleg Bulatov from comment #21)
> It have to be 4.1. As we don't support Azure on 4.1, we have two independent
> items to check:
> 
> 1. number of the "List Storage Account Keys" events (VERIFIED)
> 2. the operator still works on born-in-4.1 clusters (you can pick any cloud
> provider)

born-in-4.1 cluster is blocked by this bug now : (
https://bugzilla.redhat.com/show_bug.cgi?id=1815539

Comment 23 Wenjing Zheng 2020-04-01 11:00:37 UTC

Image registry operator is running well on born-4.1-cluster which is upgrade with below path:
4.1.38-x86_64,4.2.26-x86_64,4.3.9-x86_64,4.4.0-rc.4-x86_64,4.5.0-0.nightly-2020-03-31-225120

Comment 24 Joel Pearson 2020-04-02 20:05:14 UTC

Just wanted to say thanks for everyone who worked on this. I applied this fix to my dev 4.2 cluster a few weeks ago when it became available in the candidate-4.2 channel. It then allowed me to see real errors in the Azure Activity Log, and identity issues with the cluster which had confused me for a while.  So thanks again!

Comment 25 Corey Daley 2020-04-03 15:20:53 UTC

I'm glad that we could get it sorted out for you, and thanks for reporting it.
It was actually affecting all of the storage backends, we just didn't realize it until we were investigating this bug.

Comment 27 errata-xmlrpc 2020-07-13 17:12:18 UTC

Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2020:2409