Bug 1841828 - ovnkube-master sometimes stops handling pod updates.
Summary: ovnkube-master sometimes stops handling pod updates.
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Networking
Version: 4.5
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: urgent
Target Milestone: ---
Target Release: 4.6.0
Assignee: Casey Callendrello
QA Contact: Anurag saxena
URL:
Whiteboard:
Depends On:
Blocks: 1845202
 
Reported: 2020-05-29 15:00 UTC by Casey Callendrello
Modified: 2020-10-27 16:02 UTC
CC: 3 users

Fixed In Version:
Doc Type: No Doc Update
Doc Text:
Clone Of:
Environment:
Last Closed: 2020-10-27 16:01:56 UTC
Target Upstream Version:


Attachments
stacktrace with deadlock (94.91 KB, text/plain)
2020-06-03 14:29 UTC, Casey Callendrello


Links
System ID Priority Status Summary Last Updated
Github openshift ovn-kubernetes pull 190 None closed Bug 1841828: Upstream merge 2020-06-17 2020-08-10 06:13:33 UTC
Red Hat Product Errata RHBA-2020:4196 None None None 2020-10-27 16:02:28 UTC

Description Casey Callendrello 2020-05-29 15:00:49 UTC
Description of problem:
Sometimes the ovnkube-master stops processing pod updates

https://storage.googleapis.com/origin-ci-test/pr-logs/pull/openshift_ovn-kubernetes/174/pull-ci-openshift-ovn-kubernetes-master-e2e-gcp-ovn/1176/artifacts/e2e-gcp-ovn/pods/openshift-ovn-kubernetes_ovnkube-master-82blj_ovnkube-master.log at 15:14

Version-Release number of selected component (if applicable):
4.5

How reproducible:

Seems to happen in about 1-in-20 CI runs.

Comment 3 Casey Callendrello 2020-06-03 14:26:04 UTC
Found the deadlock; it has to do with nested handler adds.

Comment 4 Casey Callendrello 2020-06-03 14:29:33 UTC
Created attachment 1694838 [details]
stacktrace with deadlock

Comment 5 Casey Callendrello 2020-06-03 15:07:54 UTC
Found the issue:

One goroutine is blocked on send:

goroutine 61 [chan send, 32 minutes]:
github.com/ovn-org/ovn-kubernetes/go-controller/pkg/factory.(*informer).enqueueEvent(0xc0002eb180, 0x0, 0x0, 0x172c620, 0xc000b19400, 0xc00134f8a0)
	/home/c1/go/src/github.com/openshift/ovn-kubernetes/go-controller/pkg/factory/factory.go:192 +0x17f
github.com/ovn-org/ovn-kubernetes/go-controller/pkg/factory.(*informer).newFederatedQueuedHandler.func3(0x172c620, 0xc000b19400)
	/home/c1/go/src/github.com/openshift/ovn-kubernetes/go-controller/pkg/factory/factory.go:238 +0xd6
k8s.io/client-go/tools/cache.ResourceEventHandlerFuncs.OnDelete(...)
	/home/c1/go/src/github.com/openshift/ovn-kubernetes/go-controller/vendor/k8s.io/client-go/tools/cache/controller.go:212
k8s.io/client-go/tools/cache.(*processorListener).run.func1.1(0x18, 0xc0003563a0, 0xc00037c180)
	/home/c1/go/src/github.com/openshift/ovn-kubernetes/go-controller/vendor/k8s.io/client-go/tools/cache/shared_informer.go:660 +0x1d6
k8s.io/apimachinery/pkg/util/wait.ExponentialBackoff(0x989680, 0x3ff0000000000000, 0x3fb999999999999a, 0x5, 0x0, 0xc0004dbdd8, 0x0, 0x0)
	/home/c1/go/src/github.com/openshift/ovn-kubernetes/go-controller/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:292 +0x51
k8s.io/client-go/tools/cache.(*processorListener).run.func1()
	/home/c1/go/src/github.com/openshift/ovn-kubernetes/go-controller/vendor/k8s.io/client-go/tools/cache/shared_informer.go:652 +0x79

But the receiver is also blocked:

goroutine 11 [semacquire, 32 minutes]:
sync.runtime_SemacquireMutex(0xc0002eb18c, 0xc0014b5f00, 0x0)
	/home/c1/go/go/src/runtime/sema.go:71 +0x47
sync.(*RWMutex).RLock(...)
	/home/c1/go/go/src/sync/rwmutex.go:50
github.com/ovn-org/ovn-kubernetes/go-controller/pkg/factory.(*informer).forEachQueuedHandler(0xc0002eb180, 0xc00097ef10)
	/home/c1/go/src/github.com/openshift/ovn-kubernetes/go-controller/pkg/factory/factory.go:84 +0x23a
github.com/ovn-org/ovn-kubernetes/go-controller/pkg/factory.(*informer).newFederatedQueuedHandler.func2.1(0xc000a94fc0)
	/home/c1/go/src/github.com/openshift/ovn-kubernetes/go-controller/pkg/factory/factory.go:227 +0x52
github.com/ovn-org/ovn-kubernetes/go-controller/pkg/factory.(*informer).processEvents(0xc0002eb180, 0xc000432b40, 0xc0000a6660)
	/home/c1/go/src/github.com/openshift/ovn-kubernetes/go-controller/pkg/factory/factory.go:159 +0x3e
created by github.com/ovn-org/ovn-kubernetes/go-controller/pkg/factory.newQueuedInformer
	/home/c1/go/src/github.com/openshift/ovn-kubernetes/go-controller/pkg/factory/factory.go:343 +0xfa

Comment 7 Casey Callendrello 2020-06-03 16:36:20 UTC
No, that's wrong. It's a nested handler add:

goroutine 116 [semacquire]:
sync.runtime_SemacquireMutex(0xc000305658, 0x0, 0x0)
	/home/c1/go/go/src/runtime/sema.go:71 +0x47
sync.(*RWMutex).Lock(0xc000305650)
	/home/c1/go/go/src/sync/rwmutex.go:103 +0x88
github.com/ovn-org/ovn-kubernetes/go-controller/pkg/factory.(*WatchFactory).addHandler(0xc0003bb900, 0x1a05920, 0x172c620, 0xc001144500, 0x19, 0xc001188440, 0x19c8bc0, 0xc00089fd80, 0x0, 0x0, ...)
	/home/c1/go/src/github.com/openshift/ovn-kubernetes/go-controller/pkg/factory/factory.go:543 +0x247
github.com/ovn-org/ovn-kubernetes/go-controller/pkg/factory.(*WatchFactory).AddFilteredPodHandler(...)
	/home/c1/go/src/github.com/openshift/ovn-kubernetes/go-controller/pkg/factory/factory.go:578
github.com/ovn-org/ovn-kubernetes/go-controller/pkg/ovn.(*Controller).handlePeerNamespaceAndPodSelector.func1(0x172b720, 0xc000fb86e0)
	/home/c1/go/src/github.com/openshift/ovn-kubernetes/go-controller/pkg/ovn/policy.go:1258 +0x311
k8s.io/client-go/tools/cache.ResourceEventHandlerFuncs.OnAdd(...)
	/home/c1/go/src/github.com/openshift/ovn-kubernetes/go-controller/vendor/k8s.io/client-go/tools/cache/controller.go:198
k8s.io/client-go/tools/cache.FilteringResourceEventHandler.OnAdd(0xc00034a7c0, 0x19c8bc0, 0xc00089e100, 0x172b720, 0xc000fb86e0)
	/home/c1/go/src/github.com/openshift/ovn-kubernetes/go-controller/vendor/k8s.io/client-go/tools/cache/controller.go:230 +0x69
github.com/ovn-org/ovn-kubernetes/go-controller/pkg/factory.(*Handler).OnAdd(...)
	/home/c1/go/src/github.com/openshift/ovn-kubernetes/go-controller/pkg/factory/factory.go:38
github.com/ovn-org/ovn-kubernetes/go-controller/pkg/factory.newInformer.func1(0xc0008e2600, 0xc00114c200, 0x6, 0x8)
	/home/c1/go/src/github.com/openshift/ovn-kubernetes/go-controller/pkg/factory/factory.go:328 +0x92
github.com/ovn-org/ovn-kubernetes/go-controller/pkg/factory.(*informer).addHandler(0xc0003059d0, 0x3b, 0xc00034a7c0, 0x19c8bc0, 0xc00089e100, 0xc00114c200, 0x6, 0x8, 0x15cbfc0)
	/home/c1/go/src/github.com/openshift/ovn-kubernetes/go-controller/pkg/factory/factory.go:124 +0xaf
github.com/ovn-org/ovn-kubernetes/go-controller/pkg/factory.(*WatchFactory).addHandler(0xc0003bb900, 0x1a05920, 0x172b720, 0x0, 0x0, 0xc001188460, 0x19c8bc0, 0xc00089e100, 0x0, 0x0, ...)
	/home/c1/go/src/github.com/openshift/ovn-kubernetes/go-controller/pkg/factory/factory.go:559 +0x451
github.com/ovn-org/ovn-kubernetes/go-controller/pkg/factory.(*WatchFactory).AddFilteredNamespaceHandler(...)
	/home/c1/go/src/github.com/openshift/ovn-kubernetes/go-controller/pkg/factory/factory.go:628
github.com/ovn-org/ovn-kubernetes/go-controller/pkg/ovn.(*Controller).handlePeerNamespaceAndPodSelector(0xc0008c9440, 0xc000868180, 0xc000dc5d80, 0xc001188460, 0xc001188440, 0xc0010a29a0, 0x14, 0xc0014031d0, 0xc00037a9c0)

Comment 8 Casey Callendrello 2020-06-03 17:30:49 UTC
Aha, but Go read locks are more subtle: once another goroutine is waiting in Lock(), a subsequent RLock() blocks too, even though the lock is currently only read-held. So a nested RLock() can deadlock.

Upstream PR, for backport: https://github.com/ovn-org/ovn-kubernetes/pull/1379

Comment 9 Casey Callendrello 2020-06-08 16:00:54 UTC
Changes are in 4.6; setting to Modified.

Comment 14 W. Trevor King 2020-06-17 16:32:02 UTC
Who is impacted?  If we have to block upgrade edges based on this issue, which edges would need blocking?
  example: Customers upgrading from 4.y.Z to 4.y+1.z running on GCP with thousands of namespaces, approximately 5% of the subscribed fleet
  example: All customers upgrading from 4.y.z to 4.y+1.z fail approximately 10% of the time
What is the impact?  Is it serious enough to warrant blocking edges?
  example: Up to 2 minute disruption in edge routing
  example: Up to 90 seconds of API downtime
  example: etcd loses quorum and you have to restore from backup
How involved is remediation (even moderately serious impacts might be acceptable if they are easy to mitigate)?
  example: Issue resolves itself after five minutes
  example: Admin uses oc to fix things
  example: Admin must SSH to hosts, restore from backups, or other non standard admin activities
Is this a regression (if all previous versions were also vulnerable, updating to the new, vulnerable version does not increase exposure)?
  example: No, it’s always been like this; we just never noticed
  example: Yes, from 4.y.z to 4.y+1.z, or from 4.y.z to 4.y.z+1

Comment 15 W. Trevor King 2020-06-17 16:38:15 UTC
Also not clear to me is how this bug was temporarily ON_QA to unblock the 4.5 backport [1] but is now back in POST, despite not having the related patches in 4.6:

$ git log --oneline origin/release-4.6 | grep 'master: batch loadbalancer\|factory: remove unecessary'
$ git log --oneline origin/release-4.5 | grep 'master: batch loadbalancer\|factory: remove unecessary'
c481cd4e master: batch loadbalancer acl updates
ef7f3b2b factory: remove unecessary locking for enqueueEevent

How will [1,2] flow into release-4.6?

[1]: https://github.com/ovn-org/ovn-kubernetes/pull/1378
[2]: https://github.com/ovn-org/ovn-kubernetes/pull/1384

Comment 16 Scott Dodson 2020-06-18 15:54:42 UTC
I'd followed up on this in Slack because I too was confused by the state. The master branch of openshift/ovn-kubernetes will receive a full rebase from upstream, and the fixes have already been backported to 4.5. That seems like an abuse of our process, but I've been assured the fixes are in 4.5, and QE seems to have VERIFIED the 4.5 bug, so I'm not concerned beyond the risk of a regression in 4.6 if we somehow forget to rebase; let's not do that.

In the future, I think it'd be best, even if there's a pending rebase from upstream, to pull an UPSTREAM commit into master ahead of the rebase so there's no risk.

ref https://coreos.slack.com/archives/CDCP2LA9L/p1592495186201700

Comment 17 Casey Callendrello 2020-06-23 13:09:22 UTC
Yeah, it was somewhat of an abuse of process. However, the OpenShift processes and procedures don't interact well with upstream-first projects. We don't really use "our" master branch at all, just upstream master and the release cuts.

Comment 21 errata-xmlrpc 2020-10-27 16:01:56 UTC
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA.

For information on the advisory (OpenShift Container Platform 4.6 GA Images), and where to find the updated files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2020:4196

