Bug 2095756

Summary: CNO panics with concurrent map read/write
Product: OpenShift Container Platform Reporter: aaleman
Component: NetworkingAssignee: Dan Williams <dcbw>
Networking sub component: ovn-kubernetes QA Contact: Anurag saxena <anusaxen>
Status: CLOSED ERRATA Docs Contact:
Severity: medium    
Priority: unspecified CC: dcbw, rravaiol
Version: 4.11   
Target Milestone: ---   
Target Release: 4.11.0   
Hardware: Unspecified   
OS: Unspecified   
Fixed In Version: Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of: Environment:
Last Closed: 2022-08-10 11:17:24 UTC Type: Bug
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: --- RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:

Description aaleman 2022-06-10 13:21:35 UTC
Description of problem:

The CNO panics with a concurrent map read/write

Version-Release number of selected component (if applicable):

How reproducible:

Steps to Reproduce:

Actual results:

It panics, see e.G. here: https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/pr-logs/pull/openshift_hypershift/1453/pull-ci-openshift-hypershift-main-e2e-aws/1535225055389159424/artifacts/e2e-aws/test-e2e/artifacts/TestUpgradeControlPlane_PreTeardownClusterDump/namespaces/e2e-clusters-kr5mb-example-5pbxw/core/pods/logs/cluster-network-operator-6b644ff8b9-zvxnq-cluster-network-operator-previous.log

Expected results:

Additional info:

Comment 1 Dan Williams 2022-06-10 14:50:28 UTC
Some initial analysis. This one is odd since the Go panic doesn't tell us anything about the writer.

The map being accessed is typeToGVK from k8s.io/apimachinery/pkg/runtime/scheme.go.

The only place the map is written is in scheme.go AddKnownTypeWithName(). That gets called from various places to register API types.

One place is the CNO's pkg/client/client.go NewClusterClient() function which calls "RegisterTypes(c.Scheme())".

When the CNO starts up, it registers a startFunc (operator.RunOperator) with a command created by NewControllerCommandConfig() in main.go.

The command created by NewControllerCommandConfig() is then executed by main() via command.Execute(). This leads to StartController() from library-go/pkg/controller/controllercmd/cmd.go, which then does:

	builder := NewController(c.componentName, c.startFunc).
		WithKubeConfigFile(c.basicFlags.KubeConfigFile, nil).
		WithLeaderElection(config.LeaderElection, c.basicFlags.Namespace, c.componentName+"-lock").
		WithRestartOnChange(exitOnChangeReactorCh, startingFileContent, observedFiles...).

	if !c.DisableServing {
		builder = builder.WithServer(config.ServingInfo, config.Authentication, config.Authorization)

	return builder.Run(controllerCtx, unstructuredConfig)

That builder.Run() from library-go/pkg/controller/controllercmd/builder.go starts leader election. When this instance of the CNO acquires the lease it runs:

func (b ControllerBuilder) getOnStartedLeadingFunc(controllerContext *ControllerContext, gracefulTerminationDuration time.Duration) func(ctx context.Context) {
	return func(ctx context.Context) {
		stoppedCh := make(chan struct{})
		go func() {
			defer close(stoppedCh)
--->			if err := b.startFunc(ctx, controllerContext); err != nil {
				b.nonZeroExitFn(fmt.Sprintf("graceful termination failed, controllers failed with error: %v", err))

That startFunc executed in a goroutine is pkg/operator/operator.go::RunOperator() which calls pkg/client/client.go::NewClient() which calls NewClusterClient() which calls RegisterTypes(c.Scheme()) which writes to the map in question.

After acquiring the lease and (eventually) calling the startFunc() via OnStartedLeading, leader election quickly attempts to renew the lease perhaps because the RetryPeriod default is very short, like 2 seconds (?) and isn't customized by the CNO. That could lead to accessing the Scheme map read-only at the same time as we are calling NewClusterClient() that writes to it.

I don't think registering types after init time is supposed to be OK, but that's what the CNO is doing... so perhaps moving that to init is the best fix.

Comment 6 errata-xmlrpc 2022-08-10 11:17:24 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Important: OpenShift Container Platform 4.11.0 bug fix and security update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.