Description of problem:
I've noticed a strange thing with the image registry running on Azure in OpenShift 4.2.7 (possibly all other versions too). When the registry is idle, I'm seeing about 40 "List Storage Account Keys" requests per minute in the Azure console, under the resource group's "Activity log". Each request has a "Started" and a "Succeeded" event, so that's 80 events per minute in total. I looked in the logs of the cluster-image-registry-operator, and these log entries correlate to within about a second of the Azure activity log entries:

I1125 05:20:33.804388 14 controller.go:260] event from workqueue successfully processed
I1125 05:20:36.816148 14 controller.go:260] event from workqueue successfully processed
I1125 05:20:39.987651 14 controller.go:260] event from workqueue successfully processed

Version-Release number of selected component (if applicable):
4.2.7

How reproducible:
Every time

Steps to Reproduce:
1. Provision an OpenShift cluster on Azure
2. In the Azure console, open the resource group that OpenShift is installed in, then open the Activity log

Actual results:
There are lots of "List Storage Account Keys" events, roughly 80 per minute (2 events for each "List Storage Account Keys" request).

Expected results:
Significantly fewer "List Storage Account Keys" events; the current rate seems excessive, especially when the registry is idle.

Additional info:
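The event rate can also be checked from the command line rather than the portal. A minimal sketch, with assumptions: the resource-group name and file names below are hypothetical, the commented-out `az` call assumes the Azure CLI with access to the cluster's subscription, and the sample JSON only mimics the shape of the real output so the `jq` filter can be demonstrated:

```shell
# On a real cluster you could dump the last hour of activity-log entries
# (resource-group name is hypothetical):
#   az monitor activity-log list --resource-group my-openshift-rg --offset 1h > activity.json
# For illustration, a minimal sample shaped like the az output:
cat > activity.json <<'EOF'
[
  {"operationName": {"value": "Microsoft.Storage/storageAccounts/listKeys/action"}, "status": {"value": "Started"}},
  {"operationName": {"value": "Microsoft.Storage/storageAccounts/listKeys/action"}, "status": {"value": "Succeeded"}},
  {"operationName": {"value": "Microsoft.Compute/virtualMachines/read"}, "status": {"value": "Succeeded"}}
]
EOF
# Count the "List Storage Account Keys" events. Note that each request logs
# both a "Started" and a "Succeeded" event, which is why ~40 requests per
# minute appear as ~80 events.
jq '[.[] | select(.operationName.value == "Microsoft.Storage/storageAccounts/listKeys/action")] | length' activity.json
```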
Submitted pull request: https://github.com/openshift/cluster-image-registry-operator/pull/450
I saw the above pull request got closed because the plan is to move to multiple controllers. Would the target still be 4.4.0? Or are there no present plans to split into multiple controllers? I'm asking because my Azure admins are still complaining about the significant number of events per minute.
With 4.5.0-0.nightly-2020-03-16-004817, I can see requests appearing every 5 minutes even though no one is using the cluster. Is that expected? When I verify on 4.3/4.2, it doesn't behave like this.
We're asking the following questions to evaluate whether or not this bug warrants blocking an upgrade edge from either the previous X.Y or X.Y.Z. The ultimate goal is to avoid delivering an update which introduces new risk or reduces cluster functionality in any way. Sample answers are provided to give more context. The UpgradeBlocker flag has been added to this bug. It will be removed if the assessment indicates that this should not block upgrade edges.

Who is impacted?
- Customers upgrading from 4.2.99 to 4.3.z running on GCP with thousands of namespaces, approximately 5% of the subscribed fleet
- All customers upgrading from 4.2.z to 4.3.z fail approximately 10% of the time

What is the impact?
- Up to 2 minute disruption in edge routing
- Up to 90 seconds of API downtime
- etcd loses quorum and you have to restore from backup

How involved is remediation?
- Issue resolves itself after five minutes
- Admin uses oc to fix things
- Admin must SSH to hosts, restore from backups, or other non-standard admin activities

Is this a regression?
- No, it's always been like this, we just never noticed
- Yes, from 4.2.z and 4.3.1
Tracking the removal of the install-config fallback when ProviderStatus is empty (which is where born-in-4.1 clusters will break, because they have an empty ProviderStatus [1]).

$ git --no-pager log -4 --first-parent --oneline origin/release-4.4
6516529c5 (origin/release-4.4) Merge pull request #483 from openshift-cherrypick-robot/cherry-pick-465-to-release-4.4
9fba4f340 Merge pull request #479 from ricardomaraschini/removed-as-available-4.4
5207d6e39 Merge pull request #478 from openshift-cherrypick-robot/cherry-pick-474-to-release-4.4
561a51010 Merge pull request #471 from openshift-cherrypick-robot/cherry-pick-470-to-release-4.4
$ git --no-pager log -1 --first-parent --oneline origin/release-4.3
f16764832 (origin/release-4.3) Merge pull request #475 from coreydaley/4_3_remove_kubesystem_watch

Checking vs. existing releases:

$ oc adm release info --commits quay.io/openshift-release-dev/ocp-release:4.4.0-rc.1-x86_64 | grep image-registry-operator
  cluster-image-registry-operator  https://github.com/openshift/cluster-image-registry-operator  6516529c5790e0b522a17980bb2186cb08c0b0e1
$ oc adm release info --commits quay.io/openshift-release-dev/ocp-release:4.4.0-rc.0-x86_64 | grep image-registry-operator
  cluster-image-registry-operator  https://github.com/openshift/cluster-image-registry-operator  6516529c5790e0b522a17980bb2186cb08c0b0e1

So both 4.4.0-rc.0 and 4.4.0-rc.1 will be broken for born-in-4.1 clusters.

$ oc adm release info --commits quay.io/openshift-release-dev/ocp-release:4.3.5-x86_64 | grep image-registry-operator
  cluster-image-registry-operator  https://github.com/openshift/cluster-image-registry-operator  89794913d7dad5e3f2dd3b6211f0b31da8886508

So 4.3.5 still supports the install-config fallback needed for born-in-4.1 clusters.

[1]: https://github.com/openshift/cluster-image-registry-operator/pull/470#discussion_r392454218
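To tell whether a given cluster is born-in-4.1 (and therefore relies on the install-config fallback), the oldest entry in the ClusterVersion history can be inspected. A sketch assuming `oc` and `jq` are available; the sample document below is illustrative, not output from a real cluster:

```shell
# On a live cluster:
#   oc get clusterversion version -o json > cv.json
# Illustrative sample: status.history is ordered newest-first, so the last
# entry is the version the cluster was born at.
cat > cv.json <<'EOF'
{"status": {"history": [
  {"version": "4.4.0-rc.1", "state": "Completed"},
  {"version": "4.2.7",  "state": "Completed"},
  {"version": "4.1.38", "state": "Completed"}
]}}
EOF
jq -r '.status.history[-1].version' cv.json
```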
Oops, removed the Upgrades keyword by accident. Putting it back...
The "List Storage Account Keys" events are reduced significantly (about 6 in one minute, then a refresh once every 5 minutes) with 4.5.0-0.nightly-2020-03-24-002404.
Just verified on a fresh 4.5 cluster; will check with born-in-4.1 clusters next (but I will start from 4.2, since Azure is only supported since 4.2).
It has to be 4.1. As we don't support Azure on 4.1, we have two independent items to check:

1. the number of "List Storage Account Keys" events (VERIFIED)
2. the operator still works on born-in-4.1 clusters (you can pick any cloud provider)
(In reply to Oleg Bulatov from comment #21)
> It have to be 4.1. As we don't support Azure on 4.1, we have two independent
> items to check:
>
> 1. number of the "List Storage Account Keys" events (VERIFIED)
> 2. the operator still works on born-in-4.1 clusters (you can pick any cloud
> provider)

A born-in-4.1 cluster is blocked by this bug now :(
https://bugzilla.redhat.com/show_bug.cgi?id=1815539
The image registry operator is running well on a born-in-4.1 cluster that was upgraded along the following path:
4.1.38-x86_64, 4.2.26-x86_64, 4.3.9-x86_64, 4.4.0-rc.4-x86_64, 4.5.0-0.nightly-2020-03-31-225120
Just wanted to say thanks to everyone who worked on this. I applied this fix to my dev 4.2 cluster a few weeks ago when it became available in the candidate-4.2 channel. It then allowed me to see real errors in the Azure Activity Log and identify issues with the cluster that had confused me for a while. So thanks again!
I'm glad that we could get it sorted out for you, and thanks for reporting it. It was actually affecting all of the storage backends; we just didn't realize it until we were investigating this bug.
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHBA-2020:2409