Bug 2026488

Summary: openshift-controller-manager - delete event is repeating pathologically
Product: OpenShift Container Platform Reporter: Adam Kaplan <adam.kaplan>
Component: openshift-controller-manager Assignee: Adam Kaplan <adam.kaplan>
openshift-controller-manager sub component: controller-manager QA Contact: Jitendar Singh <jitsingh>
Status: CLOSED ERRATA Docs Contact:
Severity: medium    
Priority: unspecified CC: aos-bugs, gmontero, wking
Version: 4.10   
Target Milestone: ---   
Target Release: 4.10.0   
Hardware: Unspecified   
OS: Unspecified   
Last Closed: 2022-03-10 16:30:34 UTC Type: Bug

Description Adam Kaplan 2021-11-24 20:08:15 UTC
Description of problem:

"[sig-arch] events should not repeat pathologically" test occasionally fails with the following flake:

event happened 22 times, something is wrong: ns/openshift-controller-manager daemonset/controller-manager - reason/SuccessfulDelete (combined from similar events): Deleted pod: controller-manager-74rw5
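
For triage, the flake line above can be split into its fields with a short script. This is only a sketch: the regular expression is inferred from the single message quoted here, not taken from the test source, and parse_flake is a hypothetical helper name.

```python
import re

# Pattern inferred from the one flake line quoted in this report (an assumption,
# not the format definition used by the test itself).
FLAKE_RE = re.compile(
    r"event happened (?P<count>\d+) times, something is wrong: "
    r"ns/(?P<namespace>\S+) (?P<kind>\w+)/(?P<name>\S+) - "
    r"reason/(?P<reason>\S+) (?P<message>.*)"
)


def parse_flake(line):
    """Return the fields of a 'repeating event' flake line, or None if no match."""
    match = FLAKE_RE.search(line)
    if match is None:
        return None
    fields = match.groupdict()
    fields["count"] = int(fields["count"])  # repeat count as an integer
    return fields


if __name__ == "__main__":
    sample = (
        "event happened 22 times, something is wrong: "
        "ns/openshift-controller-manager daemonset/controller-manager - "
        "reason/SuccessfulDelete (combined from similar events): "
        "Deleted pod: controller-manager-74rw5"
    )
    print(parse_flake(sample))
```

Grouping the parsed results by namespace and reason across several job logs would show whether the repeats are always SuccessfulDelete on the controller-manager daemonset.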

Version-Release number of selected component (if applicable): 4.10

How reproducible: Sometimes

Steps to Reproduce:

Actual results:

Test fails - controller-manager pods are repeatedly being deleted

Expected results:

controller-manager pods are relatively stable on cluster install.

Additional info:

See https://prow.ci.openshift.org/view/gs/origin-ci-test/pr-logs/pull/openshift_openshift-controller-manager/205/pull-ci-openshift-openshift-controller-manager-master-e2e-gcp-builds/1463500691103289344

Comment 1 W. Trevor King 2021-11-30 22:44:13 UTC
Sounds a lot like bug 2004127, which was fixed with some library-go bumps.  I dunno if it's the same root cause this time or not.

Comment 2 W. Trevor King 2021-11-30 22:48:43 UTC
Doesn't seem all that common, but there are a number of hits if I stretch back to the past 14d:

$ w3m -dump -cols 200 'https://search.ci.openshift.org/?search=ns%2Fopenshift-controller-manager+daemonset%2Fcontroller-manager+-+reason%2FSuccessfulDelete.*Deleted+pod%3A+controller-manager&maxAge=336h&type=junit' | grep 'failures match' | sort
pull-ci-openshift-builder-master-e2e-aws-builds (all) - 6 runs, 67% failed, 25% of failures match = 17% impact
pull-ci-openshift-builder-master-openshift-e2e-aws-builds-techpreview (all) - 5 runs, 60% failed, 33% of failures match = 20% impact
pull-ci-openshift-openshift-controller-manager-master-e2e-gcp-builds (all) - 7 runs, 71% failed, 40% of failures match = 29% impact
pull-ci-openshift-openshift-controller-manager-master-openshift-e2e-aws-builds-techpreview (all) - 6 runs, 67% failed, 25% of failures match = 17% impact
pull-ci-openshift-origin-master-e2e-gcp-builds (all) - 69 runs, 52% failed, 33% of failures match = 17% impact
pull-ci-openshift-origin-release-4.9-e2e-gcp-builds (all) - 9 runs, 33% failed, 33% of failures match = 11% impact
rehearse-23377-pull-ci-openshift-origin-release-4.9-e2e-gcp-builds (all) - 1 runs, 100% failed, 100% of failures match = 100% impact
rehearse-23961-pull-ci-openshift-origin-release-4.9-e2e-gcp-builds (all) - 3 runs, 33% failed, 100% of failures match = 33% impact
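
For anyone reading the numbers above: the "impact" column appears to be the share of all runs that both failed and matched the search, i.e. the failure rate multiplied by the match rate. A minimal sketch of that arithmetic follows. The raw counts passed in (for example 5 failed and 2 matching out of 7 runs) are inferred from the rounded percentages, not reported by the search tool, and impact is a hypothetical helper name.

```python
def impact(runs, failed, matching):
    """Return (failed %, match % of failures, impact %) as whole percentages."""
    failed_pct = 100 * failed / runs
    # Guard against jobs with no failures at all.
    match_pct = 100 * matching / failed if failed else 0.0
    # Impact is the share of all runs that failed with a matching message.
    impact_pct = failed_pct * match_pct / 100
    return round(failed_pct), round(match_pct), round(impact_pct)


if __name__ == "__main__":
    # Counts inferred from "7 runs, 71% failed, 40% of failures match".
    print(impact(7, 5, 2))
    # Counts inferred from "6 runs, 67% failed, 25% of failures match".
    print(impact(6, 4, 1))
```

Working from counts rather than the rounded percentages explains why 71% x 40% is listed as 29% impact and not 28%.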

Comment 3 Adam Kaplan 2021-12-01 15:42:05 UTC
I suspect that we have logic in the operator that is triggering unnecessary rollouts of the ocm DaemonSet. As you can see in the failure log, we're hitting this in the OCP build suite. We should have only one rollout after the internal registry publishes its hostname.

Comment 10 errata-xmlrpc 2022-03-10 16:30:34 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Moderate: OpenShift Container Platform 4.10.3 security update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.