2026488 – openshift-controller-manager - delete event is repeating pathologically

Summary: openshift-controller-manager - delete event is repeating pathologically

Keywords:
Status:	CLOSED ERRATA
Alias:	None
Product:	OpenShift Container Platform
Classification:	Red Hat
Component:	openshift-controller-manager
Sub Component:
Version:	4.10
Hardware:	Unspecified
OS:	Unspecified
Priority:	unspecified
Severity:	medium
Target Milestone:	---
Target Release:	4.10.0
Assignee:	Adam Kaplan
QA Contact:	Jitendar Singh
Docs Contact:
URL:
Whiteboard:
Depends On:
Blocks:

Reported:	2021-11-24 20:08 UTC by Adam Kaplan
Modified:	2022-03-10 16:30 UTC
CC List:	3 users
Fixed In Version:
Doc Type:	If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed:	2022-03-10 16:30:34 UTC
Target Upstream Version:
Embargoed:

Links
System	ID	Private	Priority	Status	Summary	Last Updated
Github	openshift origin pull 26719	0	None	open	Bug 2026488: Drop Early/Late Tests for Build Suite	2021-12-22 16:02:19 UTC
Red Hat Product Errata	RHSA-2022:0056	0	None	None	None	2022-03-10 16:30:48 UTC

Internal Links: 2034984

Description Adam Kaplan 2021-11-24 20:08:15 UTC

Description of problem:

The "[sig-arch] events should not repeat pathologically" test occasionally flakes with the following failure:

```
event happened 22 times, something is wrong: ns/openshift-controller-manager daemonset/controller-manager - reason/SuccessfulDelete (combined from similar events): Deleted pod: controller-manager-74rw5
```
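
For anyone triaging similar flakes, here is a minimal sketch of pulling the repeat count and the event locator out of a failure line like the one above. The regular expression and the `parse_pathological_event` helper are hypothetical names written for illustration; they are not taken from the origin test code.

```python
import re

# Hypothetical pattern for the failure line quoted in this bug;
# it is an illustration, not the matcher used by the origin test suite.
EVENT_RE = re.compile(
    r"event happened (?P<count>\d+) times, something is wrong: "
    r"ns/(?P<namespace>\S+) (?P<kind>[^/\s]+)/(?P<name>\S+) - "
    r"reason/(?P<reason>\S+) (?P<message>.*)"
)


def parse_pathological_event(line):
    """Return the parsed fields as a dict, or None if the line does not match."""
    match = EVENT_RE.search(line)
    if match is None:
        return None
    info = match.groupdict()
    info["count"] = int(info["count"])
    return info


line = (
    "event happened 22 times, something is wrong: "
    "ns/openshift-controller-manager daemonset/controller-manager - "
    "reason/SuccessfulDelete (combined from similar events): "
    "Deleted pod: controller-manager-74rw5"
)
event = parse_pathological_event(line)
print(event["namespace"], event["kind"], event["name"], event["reason"], event["count"])
```

Grouping parsed lines by namespace, object, and reason across several runs would show whether the same DaemonSet is always the one being rolled out.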


Version-Release number of selected component (if applicable): 4.10


How reproducible: Sometimes


Steps to Reproduce:
1.
2.
3.

Actual results:

Test fails - controller-manager pods are repeatedly being deleted

Expected results:

controller-manager pods are relatively stable on cluster install.


Additional info:

See https://prow.ci.openshift.org/view/gs/origin-ci-test/pr-logs/pull/openshift_openshift-controller-manager/205/pull-ci-openshift-openshift-controller-manager-master-e2e-gcp-builds/1463500691103289344

Comment 1 W. Trevor King 2021-11-30 22:44:13 UTC

Sounds a lot like bug 2004127, which was fixed with some library-go bumps.  I dunno if it's the same root cause this time or not.

Comment 2 W. Trevor King 2021-11-30 22:48:43 UTC

Doesn't seem all that common, but there are a number of hits if I stretch back to the past 14d:

$ w3m -dump -cols 200 'https://search.ci.openshift.org/?search=ns%2Fopenshift-controller-manager+daemonset%2Fcontroller-manager+-+reason%2FSuccessfulDelete.*Deleted+pod%3A+controller-manager&maxAge=336h&type=junit' | grep 'failures match' | sort
pull-ci-openshift-builder-master-e2e-aws-builds (all) - 6 runs, 67% failed, 25% of failures match = 17% impact
pull-ci-openshift-builder-master-openshift-e2e-aws-builds-techpreview (all) - 5 runs, 60% failed, 33% of failures match = 20% impact
pull-ci-openshift-openshift-controller-manager-master-e2e-gcp-builds (all) - 7 runs, 71% failed, 40% of failures match = 29% impact
pull-ci-openshift-openshift-controller-manager-master-openshift-e2e-aws-builds-techpreview (all) - 6 runs, 67% failed, 25% of failures match = 17% impact
pull-ci-openshift-origin-master-e2e-gcp-builds (all) - 69 runs, 52% failed, 33% of failures match = 17% impact
pull-ci-openshift-origin-release-4.9-e2e-gcp-builds (all) - 9 runs, 33% failed, 33% of failures match = 11% impact
rehearse-23377-pull-ci-openshift-origin-release-4.9-e2e-gcp-builds (all) - 1 runs, 100% failed, 100% of failures match = 100% impact
rehearse-23961-pull-ci-openshift-origin-release-4.9-e2e-gcp-builds (all) - 3 runs, 33% failed, 100% of failures match = 33% impact
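
As a rough guide to reading the lines above, the impact figure appears to be the share of all runs whose failures match the search. A minimal sketch, assuming that reading: the `impact_summary` helper is hypothetical, and the failed and matching counts in the examples are inferred from the rounded percentages, not reported by the search tool.

```python
def impact_summary(runs, failed, matching):
    """Return (failed %, % of failures that match, impact %) as rounded integers.

    Assumes impact means matching runs divided by total runs.
    """
    failed_pct = round(100 * failed / runs)
    match_pct = round(100 * matching / failed) if failed else 0
    impact_pct = round(100 * matching / runs)
    return failed_pct, match_pct, impact_pct


# Counts below are inferred from the percentages in the search output.
# openshift-controller-manager e2e-gcp-builds: 7 runs, 5 failed, 2 matching
print(impact_summary(7, 5, 2))
# origin master e2e-gcp-builds: 69 runs, 36 failed, 12 matching
print(impact_summary(69, 36, 12))
```

Working from counts avoids the rounding drift that comes from multiplying the two displayed percentages.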

Comment 3 Adam Kaplan 2021-12-01 15:42:05 UTC

I suspect that the operator has logic that triggers unnecessary rollouts of the openshift-controller-manager (ocm) DaemonSet. As the failure log shows, we're hitting this in the OCP build suite. We should have only one rollout, after the internal registry publishes its hostname.

Comment 10 errata-xmlrpc 2022-03-10 16:30:34 UTC

Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Moderate: OpenShift Container Platform 4.10.3 security update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2022:0056
