Bug 1776665
| Field | Value |
|---|---|
| Summary | [4.2] Idle OpenShift Image registry queries Azure storage keys about 40 times per minute |
| Product | OpenShift Container Platform |
| Component | Image Registry |
| Version | 4.2.z |
| Target Release | 4.5.0 |
| Status | CLOSED ERRATA |
| Severity | urgent |
| Priority | unspecified |
| Keywords | Reopened, Upgrades |
| Reporter | Joel Pearson <japearson> |
| Assignee | Corey Daley <cdaley> |
| QA Contact | Wenjing Zheng <wzheng> |
| CC | adam.kaplan, aos-bugs, cdaley, esimard, mharri, obulatov, sdodson, wking |
| Hardware | All |
| OS | Linux |
| Doc Type | If docs needed, set a value |
| Clones | 1808425, 1808505 (view as bug list) |
| Bug Depends On | 1815539 |
| Bug Blocks | 1808425 |
| Type | Bug |
| Last Closed | 2020-07-13 17:12:18 UTC |
Description
Joel Pearson, 2019-11-26 06:14:54 UTC
Submitted pull request: https://github.com/openshift/cluster-image-registry-operator/pull/450

I saw the above pull request got closed because the plan is to move to multiple controllers. Would the target still be 4.4.0? Or are there no present plans to split into multiple controllers? I'm asking because my Azure admins are still complaining about the significant number of events per minute.

With 4.5.0-0.nightly-2020-03-16-004817 I can see requests appearing every 5 minutes even though no one is using the cluster. Is that expected? When I verify on 4.3/4.2, it does not behave like this.

We're asking the following questions to evaluate whether or not this bug warrants blocking an upgrade edge from either the previous X.Y or X.Y.Z. The ultimate goal is to avoid delivering an update which introduces new risk or reduces cluster functionality in any way. Sample answers are provided to give more context. The UpgradeBlocker flag has been added to this bug. It will be removed if the assessment indicates that this should not block upgrade edges.

Who is impacted? Sample answers:
- Customers upgrading from 4.2.99 to 4.3.z running on GCP with thousands of namespaces, approximately 5% of the subscribed fleet
- All customers upgrading from 4.2.z to 4.3.z fail approximately 10% of the time

What is the impact? Sample answers:
- Up to 2 minutes of disruption in edge routing
- Up to 90 seconds of API downtime
- etcd loses quorum and you have to restore from backup

How involved is remediation? Sample answers:
- Issue resolves itself after five minutes
- Admin uses oc to fix things
- Admin must SSH to hosts, restore from backups, or perform other non-standard admin activities

Is this a regression? Sample answers:
- No, it's always been like this; we just never noticed
- Yes, from 4.2.z and 4.3.1

Tracking the removal of the install-config fallback when ProviderStatus is empty (which is where born-in-4.1 clusters will break, because they have an empty ProviderStatus [1]).
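The fallback being removed can be sketched as follows. This is a hypothetical Python illustration, not the operator's actual Go code; the function and parameter names (`resolve_platform`, `fallback_enabled`) are invented for the example. The point it demonstrates is why born-in-4.1 clusters break: their infrastructure `ProviderStatus` is empty, so once the install-config fallback is gone there is no remaining source for the platform.

```python
def resolve_platform(provider_status: dict, install_config: dict,
                     fallback_enabled: bool):
    """Return the cloud platform name, or None if it cannot be determined.

    Hypothetical sketch of the fallback logic discussed above.
    """
    platform = provider_status.get("platform")
    if platform:
        # Normal case on clusters installed with 4.2+: ProviderStatus is set.
        return platform
    if fallback_enabled:
        # Legacy path: consult the install-config captured at install time.
        return install_config.get("platform")
    # Fallback removed: born-in-4.1 clusters (empty ProviderStatus) end up
    # here with no way to determine the platform.
    return None
```

With the fallback removed, `resolve_platform({}, {"platform": "azure"}, False)` returns `None`, which models how a born-in-4.1 cluster would break on 4.4.0-rc.0/rc.1.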
```
$ git --no-pager log -4 --first-parent --oneline origin/release-4.4
6516529c5 (origin/release-4.4) Merge pull request #483 from openshift-cherrypick-robot/cherry-pick-465-to-release-4.4
9fba4f340 Merge pull request #479 from ricardomaraschini/removed-as-available-4.4
5207d6e39 Merge pull request #478 from openshift-cherrypick-robot/cherry-pick-474-to-release-4.4
561a51010 Merge pull request #471 from openshift-cherrypick-robot/cherry-pick-470-to-release-4.4
$ git --no-pager log -1 --first-parent --oneline origin/release-4.3
f16764832 (origin/release-4.3) Merge pull request #475 from coreydaley/4_3_remove_kubesystem_watch
```

Checking vs. existing releases:

```
$ oc adm release info --commits quay.io/openshift-release-dev/ocp-release:4.4.0-rc.1-x86_64 | grep image-registry-operator
cluster-image-registry-operator https://github.com/openshift/cluster-image-registry-operator 6516529c5790e0b522a17980bb2186cb08c0b0e1
$ oc adm release info --commits quay.io/openshift-release-dev/ocp-release:4.4.0-rc.0-x86_64 | grep image-registry-operator
cluster-image-registry-operator https://github.com/openshift/cluster-image-registry-operator 6516529c5790e0b522a17980bb2186cb08c0b0e1
```

So both 4.4.0-rc.0 and 4.4.0-rc.1 will be broken for born-in-4.1 clusters.

```
$ oc adm release info --commits quay.io/openshift-release-dev/ocp-release:4.3.5-x86_64 | grep image-registry-operator
cluster-image-registry-operator https://github.com/openshift/cluster-image-registry-operator 89794913d7dad5e3f2dd3b6211f0b31da8886508
```

So 4.3.5 still supports the install-config fallback needed for born-in-4.1 clusters.

[1]: https://github.com/openshift/cluster-image-registry-operator/pull/470#discussion_r392454218

Oops, removed the Upgrades keyword by accident. Putting it back...

"List Storage Account Keys" events are dramatically reduced with 4.5.0-0.nightly-2020-03-24-002404: about 6 events per refresh, with a refresh once every 5 minutes.
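The manual release checks above (running `oc adm release info --commits` and comparing the operator commit against the branch head) can be automated. The following is a sketch only: the helper name `operator_commit` is invented, and the sample line and hashes are copied from the transcript above.

```python
# Head of origin/release-4.4 at the time of the check (from the git log above).
HEAD_4_4 = "6516529c5790e0b522a17980bb2186cb08c0b0e1"

def operator_commit(release_info_line: str) -> str:
    """Parse a 'name repo commit' line as printed by
    `oc adm release info --commits | grep image-registry-operator`."""
    name, repo, commit = release_info_line.split()
    assert name == "cluster-image-registry-operator"
    return commit

# Sample line for 4.4.0-rc.1, taken verbatim from the transcript above.
line = ("cluster-image-registry-operator "
        "https://github.com/openshift/cluster-image-registry-operator "
        "6516529c5790e0b522a17980bb2186cb08c0b0e1")
print(operator_commit(line) == HEAD_4_4)  # True: rc.1 ships the release-4.4 head
```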
Just verified on a fresh 4.5 cluster; will check with born-in-4.1 clusters (but I will start from 4.2, since Azure has been supported since 4.2).

It has to be 4.1. As we don't support Azure on 4.1, we have two independent items to check:

1. the number of "List Storage Account Keys" events (VERIFIED)
2. the operator still works on born-in-4.1 clusters (you can pick any cloud provider)

(In reply to Oleg Bulatov from comment #21)
> It has to be 4.1. As we don't support Azure on 4.1, we have two independent
> items to check:
>
> 1. the number of "List Storage Account Keys" events (VERIFIED)
> 2. the operator still works on born-in-4.1 clusters (you can pick any cloud
> provider)

The born-in-4.1 cluster check is blocked by this bug now :( https://bugzilla.redhat.com/show_bug.cgi?id=1815539

The image registry operator is running well on a born-in-4.1 cluster upgraded along the path: 4.1.38-x86_64, 4.2.26-x86_64, 4.3.9-x86_64, 4.4.0-rc.4-x86_64, 4.5.0-0.nightly-2020-03-31-225120

Just wanted to say thanks to everyone who worked on this. I applied this fix to my dev 4.2 cluster a few weeks ago when it became available in the candidate-4.2 channel. It then allowed me to see real errors in the Azure Activity Log and identify issues with the cluster which had confused me for a while. So thanks again!

I'm glad that we could get it sorted out for you, and thanks for reporting it. It was actually affecting all of the storage backends; we just didn't realize it until we were investigating this bug.

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2020:2409