Bug 1921892

Summary: MAO: controller runtime manager closes event recorder
Product: OpenShift Container Platform
Reporter: Michael Gugino <mgugino>
Component: Cloud Compute
Assignee: Michael Gugino <mgugino>
Cloud Compute sub component: Other Providers
QA Contact: Milind Yadav <miyadav>
Status: CLOSED ERRATA
Docs Contact:
Severity: medium
Priority: medium
Version: 4.7
Target Milestone: ---
Target Release: 4.8.0
Hardware: Unspecified
OS: Unspecified
Whiteboard:
Fixed In Version:
Doc Type: No Doc Update
Doc Text:
Story Points: ---
Clone Of:
Environment:
Last Closed: 2021-07-27 22:37:10 UTC
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Embargoed:

Description Michael Gugino 2021-01-28 18:59:19 UTC
https://prow.ci.openshift.org/view/gs/origin-ci-test/pr-logs/pull/openshift_machine-api-operator/791/pull-ci-openshift-machine-api-operator-master-unit/1354463724173266944

E0127 16:21:57.389596   10599 controller.go:167] Failed to reconcile MachineSet "ms-test/foo": failed to update machine set status: MachineSet.machine.openshift.io "foo" not found
E0127 16:21:57.389740   10599 controller.go:267] controller-runtime/manager/controller/machineset_controller "msg"="Reconciler error" "error"="failed to update machine set status: MachineSet.machine.openshift.io \"foo\" not found" "name"="foo" "namespace"="ms-test"
E0127 16:21:57.389870   10599 runtime.go:78] Observed a panic: "send on closed channel" (send on closed channel)

This appears to be caused by two compounding issues, and it is affecting our CI unit tests.

First, the test cancels the manager's context: https://github.com/openshift/machine-api-operator/blob/release-4.7/pkg/controller/machineset/machineset_controller_test.go#L68

We don't wait for the manager to exit. What appears to happen is that one or more subsequent tests start running, the canceled manager shuts down the recorder before stopping the reconcile loop, and the canceled manager then reconciles an object from the next test (or possibly a requeued object from its own test). The recorder cannot accept the resulting event because the manager has already closed it, which matches the "send on closed channel" panic above.
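The sequence above can be reduced to a toy model. This is a minimal sketch, not controller-runtime or client-go code: `toyRecorder`, `emit`, and `runSafely` are hypothetical names, and the recorder is assumed to be backed by a Go channel that shutdown closes. Because a send on a closed channel panics in Go, recording an event after the recorder is closed produces the same "send on closed channel" message, while stopping the reconcile worker before closing the recorder avoids it.

```go
package main

import (
	"fmt"
	"sync"
)

// toyRecorder is a hypothetical, channel-backed event recorder. It only
// mimics the failure mode in the log; it is not client-go's recorder.
type toyRecorder struct {
	events chan string
}

func newToyRecorder(buf int) *toyRecorder {
	return &toyRecorder{events: make(chan string, buf)}
}

// Event sends on the channel; after Shutdown this panics with
// "send on closed channel".
func (r *toyRecorder) Event(msg string) { r.events <- msg }

// Shutdown closes the channel, as a stopping manager might.
func (r *toyRecorder) Shutdown() { close(r.events) }

// emit records an event for obj and converts a panic into a returned string.
func emit(r *toyRecorder, obj string) (panicMsg string) {
	defer func() {
		if p := recover(); p != nil {
			panicMsg = fmt.Sprint(p)
		}
	}()
	r.Event("reconciled " + obj)
	return ""
}

// runSafely reconciles objs on a worker goroutine and shuts down in the
// safe order: stop feeding work, wait for the worker, then close the
// recorder. It returns any panics the worker observed.
func runSafely(objs []string) []string {
	rec := newToyRecorder(len(objs)) // buffered so Event never blocks here
	queue := make(chan string)
	var panics []string
	var wg sync.WaitGroup
	wg.Add(1)
	go func() {
		defer wg.Done()
		for obj := range queue {
			if p := emit(rec, obj); p != "" {
				panics = append(panics, p)
			}
		}
	}()
	for _, obj := range objs {
		queue <- obj
	}
	close(queue)   // no more work for the reconcile loop
	wg.Wait()      // the reconcile loop has exited
	rec.Shutdown() // only now is it safe to close the recorder
	return panics
}

func main() {
	// Buggy order: the recorder is closed while a reconcile can still run.
	rec := newToyRecorder(1)
	rec.Shutdown()
	fmt.Println("buggy order:", emit(rec, "ms-test/foo")) // buggy order: send on closed channel

	// Safe order: the worker finishes before the recorder is closed.
	fmt.Println("safe order panics:", len(runSafely([]string{"ms-test/foo", "ms-test/bar"}))) // safe order panics: 0
}
```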

Comment 1 Michael Gugino 2021-01-28 19:04:32 UTC
This may be related to this change: https://github.com/kubernetes-sigs/controller-runtime/pull/1089

This is not a blocker, as it only affects our tests.
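Because the failure is confined to tests, one possible test-side mitigation is to wait for the manager to exit after canceling its context, so that no reconcile from a previous test can overlap the next one. A minimal sketch, assuming a manager with a context-based `Start(ctx context.Context) error` method like the one on controller-runtime's `manager.Manager` in newer releases; `starter`, `startInBackground`, and `fakeManager` are hypothetical names for illustration.

```go
package main

import (
	"context"
	"fmt"
	"sync"
	"time"
)

// starter is a hypothetical interface with the same shape as a
// context-based manager Start method.
type starter interface {
	Start(ctx context.Context) error
}

// startInBackground runs s.Start in a goroutine. The returned stop function
// cancels the context and then blocks until Start has returned, so the
// caller knows the manager has fully exited.
func startInBackground(s starter) (stop func() error) {
	ctx, cancel := context.WithCancel(context.Background())
	done := make(chan error, 1)
	go func() { done <- s.Start(ctx) }()
	return func() error {
		cancel()
		return <-done
	}
}

// fakeManager is a hypothetical manager that takes a moment to shut down
// after its context is canceled, like one draining in-flight reconciles.
type fakeManager struct {
	mu      sync.Mutex
	stopped bool
}

func (f *fakeManager) Start(ctx context.Context) error {
	<-ctx.Done()
	time.Sleep(50 * time.Millisecond) // simulated shutdown work
	f.mu.Lock()
	f.stopped = true
	f.mu.Unlock()
	return nil
}

func (f *fakeManager) Stopped() bool {
	f.mu.Lock()
	defer f.mu.Unlock()
	return f.stopped
}

func main() {
	mgr := &fakeManager{}
	stop := startInBackground(mgr)
	// ... a test would create objects and assert on reconciles here ...
	if err := stop(); err != nil {
		fmt.Println("manager exited with error:", err)
	}
	// The manager is no longer reconciling, so the next test can start.
	fmt.Println("manager stopped:", mgr.Stopped()) // manager stopped: true
}
```

In a Ginkgo or `testing` suite, the stop function would typically be called from per-test teardown (for example `AfterEach` or `t.Cleanup`), so each test waits for its own manager before the next one begins.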

Comment 8 errata-xmlrpc 2021-07-27 22:37:10 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Moderate: OpenShift Container Platform 4.8.2 bug fix and security update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2021:2438