Description of problem: The CNO panics with a concurrent map read/write Version-Release number of selected component (if applicable): How reproducible: Steps to Reproduce: 1. 2. 3. Actual results: It panics, see e.G. here: https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/pr-logs/pull/openshift_hypershift/1453/pull-ci-openshift-hypershift-main-e2e-aws/1535225055389159424/artifacts/e2e-aws/test-e2e/artifacts/TestUpgradeControlPlane_PreTeardownClusterDump/namespaces/e2e-clusters-kr5mb-example-5pbxw/core/pods/logs/cluster-network-operator-6b644ff8b9-zvxnq-cluster-network-operator-previous.log Expected results: Additional info:
Some initial analysis. This one is odd since the Go panic doesn't tell us anything about the writer. The map being accessed is typeToGVK from k8s.io/apimachinery/pkg/runtime/scheme.go. The only place the map is written is in scheme.go AddKnownTypeWithName(). That gets called from various places to register API types. One place is the CNO's pkg/client/client.go NewClusterClient() function which calls "RegisterTypes(c.Scheme())". When the CNO starts up, it registers a startFunc (operator.RunOperator) with a command created by NewControllerCommandConfig() in main.go. The command created by NewControllerCommandConfig() is then executed by main() via command.Execute(). This leads to StartController() from library-go/pkg/controller/controllercmd/cmd.go, which then does: builder := NewController(c.componentName, c.startFunc). WithKubeConfigFile(c.basicFlags.KubeConfigFile, nil). WithComponentNamespace(c.basicFlags.Namespace). WithLeaderElection(config.LeaderElection, c.basicFlags.Namespace, c.componentName+"-lock"). WithVersion(c.version). WithEventRecorderOptions(events.RecommendedClusterSingletonCorrelatorOptions()). WithRestartOnChange(exitOnChangeReactorCh, startingFileContent, observedFiles...). WithComponentOwnerReference(c.ComponentOwnerReference) if !c.DisableServing { builder = builder.WithServer(config.ServingInfo, config.Authentication, config.Authorization) } return builder.Run(controllerCtx, unstructuredConfig) That builder.Run() from library-go/pkg/controller/controllercmd/builder.go starts leader election. When this instance of the CNO acquires the lease it runs: func (b ControllerBuilder) getOnStartedLeadingFunc(controllerContext *ControllerContext, gracefulTerminationDuration time.Duration) func(ctx context.Context) { return func(ctx context.Context) { stoppedCh := make(chan struct{}) go func() { defer close(stoppedCh) ---> if err := b.startFunc(ctx, controllerContext); err != nil { b.nonZeroExitFn(fmt.Sprintf("graceful termination failed, controllers failed with error: %v", err)) } }() That startFunc executed in a goroutine is pkg/operator/operator.go::RunOperator() which calls pkg/client/client.go::NewClient() which calls NewClusterClient() which calls RegisterTypes(c.Scheme()) which writes to the map in question. After acquiring the lease and (eventually) calling the startFunc() via OnStartedLeading, leader election quickly attempts to renew the lease perhaps because the RetryPeriod default is very short, like 2 seconds (?) and isn't customized by the CNO. That could lead to accessing the Scheme map read-only at the same time as we are calling NewClusterClient() that writes to it. I don't think registering types after init time is supposed to be OK, but that's what the CNO is doing... so perhaps moving that to init is the best fix.
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (Important: OpenShift Container Platform 4.11.0 bug fix and security update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHSA-2022:5069