Bug 1841828

Summary: ovnkube-master sometimes stops handling pod updates.
Product: OpenShift Container Platform
Reporter: Casey Callendrello <cdc>
Component: Networking
Assignee: Casey Callendrello <cdc>
Networking sub component: ovn-kubernetes
QA Contact: Anurag saxena <anusaxen>
Status: CLOSED ERRATA
Severity: urgent
Priority: unspecified
CC: avishnoi, wking, xtian
Version: 4.5
Target Release: 4.6.0
Hardware: Unspecified
OS: Unspecified
Doc Type: No Doc Update
Last Closed: 2020-10-27 16:01:56 UTC
Type: Bug
Bug Depends On:    
Bug Blocks: 1845202    
Attachments:
stacktrace with deadlock

Description Casey Callendrello 2020-05-29 15:00:49 UTC
Description of problem:
Sometimes ovnkube-master stops processing pod updates.

https://storage.googleapis.com/origin-ci-test/pr-logs/pull/openshift_ovn-kubernetes/174/pull-ci-openshift-ovn-kubernetes-master-e2e-gcp-ovn/1176/artifacts/e2e-gcp-ovn/pods/openshift-ovn-kubernetes_ovnkube-master-82blj_ovnkube-master.log at 15:14

Version-Release number of selected component (if applicable):
4.5

How reproducible:

Seems to happen in about 1-in-20 CI runs.

Comment 3 Casey Callendrello 2020-06-03 14:26:04 UTC
Found the deadlock; it has to do with nested handler adds.

Comment 4 Casey Callendrello 2020-06-03 14:29:33 UTC
Created attachment 1694838 [details]
stacktrace with deadlock

Comment 5 Casey Callendrello 2020-06-03 15:07:54 UTC
Found the issue:

One goroutine is blocked on send:

goroutine 61 [chan send, 32 minutes]:
github.com/ovn-org/ovn-kubernetes/go-controller/pkg/factory.(*informer).enqueueEvent(0xc0002eb180, 0x0, 0x0, 0x172c620, 0xc000b19400, 0xc00134f8a0)
	/home/c1/go/src/github.com/openshift/ovn-kubernetes/go-controller/pkg/factory/factory.go:192 +0x17f
github.com/ovn-org/ovn-kubernetes/go-controller/pkg/factory.(*informer).newFederatedQueuedHandler.func3(0x172c620, 0xc000b19400)
	/home/c1/go/src/github.com/openshift/ovn-kubernetes/go-controller/pkg/factory/factory.go:238 +0xd6
k8s.io/client-go/tools/cache.ResourceEventHandlerFuncs.OnDelete(...)
	/home/c1/go/src/github.com/openshift/ovn-kubernetes/go-controller/vendor/k8s.io/client-go/tools/cache/controller.go:212
k8s.io/client-go/tools/cache.(*processorListener).run.func1.1(0x18, 0xc0003563a0, 0xc00037c180)
	/home/c1/go/src/github.com/openshift/ovn-kubernetes/go-controller/vendor/k8s.io/client-go/tools/cache/shared_informer.go:660 +0x1d6
k8s.io/apimachinery/pkg/util/wait.ExponentialBackoff(0x989680, 0x3ff0000000000000, 0x3fb999999999999a, 0x5, 0x0, 0xc0004dbdd8, 0x0, 0x0)
	/home/c1/go/src/github.com/openshift/ovn-kubernetes/go-controller/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:292 +0x51
k8s.io/client-go/tools/cache.(*processorListener).run.func1()
	/home/c1/go/src/github.com/openshift/ovn-kubernetes/go-controller/vendor/k8s.io/client-go/tools/cache/shared_informer.go:652 +0x79

But the receiver is also blocked:

goroutine 11 [semacquire, 32 minutes]:
sync.runtime_SemacquireMutex(0xc0002eb18c, 0xc0014b5f00, 0x0)
	/home/c1/go/go/src/runtime/sema.go:71 +0x47
sync.(*RWMutex).RLock(...)
	/home/c1/go/go/src/sync/rwmutex.go:50
github.com/ovn-org/ovn-kubernetes/go-controller/pkg/factory.(*informer).forEachQueuedHandler(0xc0002eb180, 0xc00097ef10)
	/home/c1/go/src/github.com/openshift/ovn-kubernetes/go-controller/pkg/factory/factory.go:84 +0x23a
github.com/ovn-org/ovn-kubernetes/go-controller/pkg/factory.(*informer).newFederatedQueuedHandler.func2.1(0xc000a94fc0)
	/home/c1/go/src/github.com/openshift/ovn-kubernetes/go-controller/pkg/factory/factory.go:227 +0x52
github.com/ovn-org/ovn-kubernetes/go-controller/pkg/factory.(*informer).processEvents(0xc0002eb180, 0xc000432b40, 0xc0000a6660)
	/home/c1/go/src/github.com/openshift/ovn-kubernetes/go-controller/pkg/factory/factory.go:159 +0x3e
created by github.com/ovn-org/ovn-kubernetes/go-controller/pkg/factory.newQueuedInformer
	/home/c1/go/src/github.com/openshift/ovn-kubernetes/go-controller/pkg/factory/factory.go:343 +0xfa

Comment 7 Casey Callendrello 2020-06-03 16:36:20 UTC
No, that's wrong. It's a nested handler add:

goroutine 116 [semacquire]:
sync.runtime_SemacquireMutex(0xc000305658, 0x0, 0x0)
	/home/c1/go/go/src/runtime/sema.go:71 +0x47
sync.(*RWMutex).Lock(0xc000305650)
	/home/c1/go/go/src/sync/rwmutex.go:103 +0x88
github.com/ovn-org/ovn-kubernetes/go-controller/pkg/factory.(*WatchFactory).addHandler(0xc0003bb900, 0x1a05920, 0x172c620, 0xc001144500, 0x19, 0xc001188440, 0x19c8bc0, 0xc00089fd80, 0x0, 0x0, ...)
	/home/c1/go/src/github.com/openshift/ovn-kubernetes/go-controller/pkg/factory/factory.go:543 +0x247
github.com/ovn-org/ovn-kubernetes/go-controller/pkg/factory.(*WatchFactory).AddFilteredPodHandler(...)
	/home/c1/go/src/github.com/openshift/ovn-kubernetes/go-controller/pkg/factory/factory.go:578
github.com/ovn-org/ovn-kubernetes/go-controller/pkg/ovn.(*Controller).handlePeerNamespaceAndPodSelector.func1(0x172b720, 0xc000fb86e0)
	/home/c1/go/src/github.com/openshift/ovn-kubernetes/go-controller/pkg/ovn/policy.go:1258 +0x311
k8s.io/client-go/tools/cache.ResourceEventHandlerFuncs.OnAdd(...)
	/home/c1/go/src/github.com/openshift/ovn-kubernetes/go-controller/vendor/k8s.io/client-go/tools/cache/controller.go:198
k8s.io/client-go/tools/cache.FilteringResourceEventHandler.OnAdd(0xc00034a7c0, 0x19c8bc0, 0xc00089e100, 0x172b720, 0xc000fb86e0)
	/home/c1/go/src/github.com/openshift/ovn-kubernetes/go-controller/vendor/k8s.io/client-go/tools/cache/controller.go:230 +0x69
github.com/ovn-org/ovn-kubernetes/go-controller/pkg/factory.(*Handler).OnAdd(...)
	/home/c1/go/src/github.com/openshift/ovn-kubernetes/go-controller/pkg/factory/factory.go:38
github.com/ovn-org/ovn-kubernetes/go-controller/pkg/factory.newInformer.func1(0xc0008e2600, 0xc00114c200, 0x6, 0x8)
	/home/c1/go/src/github.com/openshift/ovn-kubernetes/go-controller/pkg/factory/factory.go:328 +0x92
github.com/ovn-org/ovn-kubernetes/go-controller/pkg/factory.(*informer).addHandler(0xc0003059d0, 0x3b, 0xc00034a7c0, 0x19c8bc0, 0xc00089e100, 0xc00114c200, 0x6, 0x8, 0x15cbfc0)
	/home/c1/go/src/github.com/openshift/ovn-kubernetes/go-controller/pkg/factory/factory.go:124 +0xaf
github.com/ovn-org/ovn-kubernetes/go-controller/pkg/factory.(*WatchFactory).addHandler(0xc0003bb900, 0x1a05920, 0x172b720, 0x0, 0x0, 0xc001188460, 0x19c8bc0, 0xc00089e100, 0x0, 0x0, ...)
	/home/c1/go/src/github.com/openshift/ovn-kubernetes/go-controller/pkg/factory/factory.go:559 +0x451
github.com/ovn-org/ovn-kubernetes/go-controller/pkg/factory.(*WatchFactory).AddFilteredNamespaceHandler(...)
	/home/c1/go/src/github.com/openshift/ovn-kubernetes/go-controller/pkg/factory/factory.go:628
github.com/ovn-org/ovn-kubernetes/go-controller/pkg/ovn.(*Controller).handlePeerNamespaceAndPodSelector(0xc0008c9440, 0xc000868180, 0xc000dc5d80, 0xc001188460, 0xc001188440, 0xc0010a29a0, 0x14, 0xc0014031d0, 0xc00037a9c0)
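The cycle in that trace can be reduced to a small sketch (hypothetical names, not the actual factory code): a dispatcher iterates its registered handlers while holding RLock, and one handler's callback registers a new handler on the same struct, which needs the write lock on the same RWMutex. The write lock waits on the dispatcher's read lock, and the dispatcher waits on the callback.

```go
package main

import (
	"fmt"
	"sync"
	"time"
)

// handlers is a toy stand-in for the informer's handler list.
type handlers struct {
	mu    sync.RWMutex
	funcs []func()
}

// add takes the exclusive lock, like WatchFactory.addHandler.
func (h *handlers) add(f func()) {
	h.mu.Lock()
	defer h.mu.Unlock()
	h.funcs = append(h.funcs, f)
}

// dispatch holds the read lock across all callbacks,
// like forEachQueuedHandler.
func (h *handlers) dispatch() {
	h.mu.RLock()
	defer h.mu.RUnlock()
	for _, f := range h.funcs {
		f()
	}
}

// nestedAddDeadlocks reports whether a handler that adds another handler
// from inside dispatch gets stuck (detected with a timeout).
func nestedAddDeadlocks() bool {
	h := &handlers{}
	h.add(func() {
		h.add(func() {}) // nested add: Lock() waits on dispatch's RLock()
	})

	done := make(chan struct{})
	go func() {
		h.dispatch()
		close(done)
	}()
	select {
	case <-done:
		return false // dispatch completed, no deadlock
	case <-time.After(500 * time.Millisecond):
		return true // dispatch never finished: deadlock
	}
}

func main() {
	fmt.Println(nestedAddDeadlocks()) // → true
}
```

sync.RWMutex is not reentrant, so even within a single goroutine the nested Lock() can never succeed while the outer RLock() is held.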

Comment 8 Casey Callendrello 2020-06-03 17:30:49 UTC
Aha, but Go read locks are more subtle: once another goroutine is waiting in Lock(), any new RLock() call blocks as well, even though the write lock has not yet been acquired.

Upstream PR, for backport: https://github.com/ovn-org/ovn-kubernetes/pull/1379
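That subtlety can be demonstrated in isolation (a minimal sketch, not the factory code): with one read lock held, queue a writer on Lock(), then attempt a second RLock() and detect with a timeout that it never gets through.

```go
package main

import (
	"fmt"
	"sync"
	"time"
)

// rlockBehindWaitingWriter reports whether a new RLock() acquisition gets
// stuck once a writer is queued, while an earlier read lock is still held.
func rlockBehindWaitingWriter() bool {
	var mu sync.RWMutex
	mu.RLock() // outer read lock held, as in forEachQueuedHandler

	go func() {
		mu.Lock() // writer queues up, as in addHandler
		mu.Unlock()
	}()
	time.Sleep(100 * time.Millisecond) // give the writer time to start waiting

	acquired := make(chan struct{})
	go func() {
		mu.RLock() // a second read acquisition blocks behind the waiting writer
		mu.RUnlock()
		close(acquired)
	}()

	select {
	case <-acquired:
		return false // the second read lock got through
	case <-time.After(500 * time.Millisecond):
		return true // stuck: a waiting writer blocks new readers
	}
}

func main() {
	fmt.Println(rlockBehindWaitingWriter())
}
```

This is documented behavior: sync.RWMutex blocks new readers once a writer is waiting, precisely so writers cannot be starved, which is why recursive read locking is prohibited.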

Comment 9 Casey Callendrello 2020-06-08 16:00:54 UTC
Changes are in 4.6; setting to Modified.

Comment 14 W. Trevor King 2020-06-17 16:32:02 UTC
Who is impacted?  If we have to block upgrade edges based on this issue, which edges would need blocking?
  example: Customers upgrading from 4.y.Z to 4.y+1.z running on GCP with thousands of namespaces, approximately 5% of the subscribed fleet
  example: All customers upgrading from 4.y.z to 4.y+1.z fail approximately 10% of the time
What is the impact?  Is it serious enough to warrant blocking edges?
  example: Up to 2 minute disruption in edge routing
  example: Up to 90 seconds of API downtime
  example: etcd loses quorum and you have to restore from backup
How involved is remediation (even moderately serious impacts might be acceptable if they are easy to mitigate)?
  example: Issue resolves itself after five minutes
  example: Admin uses oc to fix things
  example: Admin must SSH to hosts, restore from backups, or other non standard admin activities
Is this a regression (if all previous versions were also vulnerable, updating to the new, vulnerable version does not increase exposure)?
  example: No, it’s always been like this, we just never noticed
  example: Yes, from 4.y.z to 4.y+1.z, or 4.y.z to 4.y.z+1

Comment 15 W. Trevor King 2020-06-17 16:38:15 UTC
It's also not clear to me how this bug was temporarily ON_QA to unblock the 4.5 backport [1] but is now back in POST, despite the related patches not being in 4.6:

$ git log --oneline origin/release-4.6 | grep 'master: batch loadbalancer\|factory: remove unecessary'
$ git log --oneline origin/release-4.5 | grep 'master: batch loadbalancer\|factory: remove unecessary'
c481cd4e master: batch loadbalancer acl updates
ef7f3b2b factory: remove unecessary locking for enqueueEevent

How will [1,2] flow into release-4.6?

[1]: https://github.com/ovn-org/ovn-kubernetes/pull/1378
[2]: https://github.com/ovn-org/ovn-kubernetes/pull/1384

Comment 16 Scott Dodson 2020-06-18 15:54:42 UTC
I'd followed up on this in Slack because I too was confused by the state. The master branch of openshift/ovn-kubernetes will receive a full rebase from upstream, and the fixes have already been backported to 4.5. That seems like an abuse of our process, but I've been assured the fixes are in 4.5, and QE appears to have VERIFIED the 4.5 bug, so I'm not concerned beyond the risk of a regression in 4.6 if we somehow forget to rebase. Let's not do that.

In the future, even when a rebase from upstream is pending, I think it'd be best to pull an UPSTREAM commit into master ahead of the rebase so there's no risk.

ref https://coreos.slack.com/archives/CDCP2LA9L/p1592495186201700

Comment 17 Casey Callendrello 2020-06-23 13:09:22 UTC
Yeah, it was somewhat an abuse of process. However, the OpenShift processes and procedures don't interact well with upstream-first projects. We don't really use "our" master branch at all, just upstream master and the release cuts.

Comment 21 errata-xmlrpc 2020-10-27 16:01:56 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (OpenShift Container Platform 4.6 GA Images), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2020:4196

Comment 22 W. Trevor King 2021-04-05 17:47:10 UTC
Removing UpgradeBlocker from this older bug to remove it from the suspect queue described in [1].  If you feel this bug still needs to be a suspect, please add the keyword again.

[1]: https://github.com/openshift/enhancements/pull/475