Description of problem:

Sometimes the ovnkube-master stops processing pod updates:

https://storage.googleapis.com/origin-ci-test/pr-logs/pull/openshift_ovn-kubernetes/174/pull-ci-openshift-ovn-kubernetes-master-e2e-gcp-ovn/1176/artifacts/e2e-gcp-ovn/pods/openshift-ovn-kubernetes_ovnkube-master-82blj_ovnkube-master.log (at 15:14)

Version-Release number of selected component (if applicable):
4.5

How reproducible:
Seems to happen in about 1-in-20 CI runs.
Found the deadlock; it has to do with nested handler adds.
Created attachment 1694838 [details] stacktrace with deadlock
Found the issue: One goroutine is blocked on send:

goroutine 61 [chan send, 32 minutes]:
github.com/ovn-org/ovn-kubernetes/go-controller/pkg/factory.(*informer).enqueueEvent(0xc0002eb180, 0x0, 0x0, 0x172c620, 0xc000b19400, 0xc00134f8a0)
	/home/c1/go/src/github.com/openshift/ovn-kubernetes/go-controller/pkg/factory/factory.go:192 +0x17f
github.com/ovn-org/ovn-kubernetes/go-controller/pkg/factory.(*informer).newFederatedQueuedHandler.func3(0x172c620, 0xc000b19400)
	/home/c1/go/src/github.com/openshift/ovn-kubernetes/go-controller/pkg/factory/factory.go:238 +0xd6
k8s.io/client-go/tools/cache.ResourceEventHandlerFuncs.OnDelete(...)
	/home/c1/go/src/github.com/openshift/ovn-kubernetes/go-controller/vendor/k8s.io/client-go/tools/cache/controller.go:212
k8s.io/client-go/tools/cache.(*processorListener).run.func1.1(0x18, 0xc0003563a0, 0xc00037c180)
	/home/c1/go/src/github.com/openshift/ovn-kubernetes/go-controller/vendor/k8s.io/client-go/tools/cache/shared_informer.go:660 +0x1d6
k8s.io/apimachinery/pkg/util/wait.ExponentialBackoff(0x989680, 0x3ff0000000000000, 0x3fb999999999999a, 0x5, 0x0, 0xc0004dbdd8, 0x0, 0x0)
	/home/c1/go/src/github.com/openshift/ovn-kubernetes/go-controller/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:292 +0x51
k8s.io/client-go/tools/cache.(*processorListener).run.func1()
	/home/c1/go/src/github.com/openshift/ovn-kubernetes/go-controller/vendor/k8s.io/client-go/tools/cache/shared_informer.go:652 +0x79

But the receiver is also blocked:

goroutine 11 [semacquire, 32 minutes]:
sync.runtime_SemacquireMutex(0xc0002eb18c, 0xc0014b5f00, 0x0)
	/home/c1/go/go/src/runtime/sema.go:71 +0x47
sync.(*RWMutex).RLock(...)
	/home/c1/go/go/src/sync/rwmutex.go:50
github.com/ovn-org/ovn-kubernetes/go-controller/pkg/factory.(*informer).forEachQueuedHandler(0xc0002eb180, 0xc00097ef10)
	/home/c1/go/src/github.com/openshift/ovn-kubernetes/go-controller/pkg/factory/factory.go:84 +0x23a
github.com/ovn-org/ovn-kubernetes/go-controller/pkg/factory.(*informer).newFederatedQueuedHandler.func2.1(0xc000a94fc0)
	/home/c1/go/src/github.com/openshift/ovn-kubernetes/go-controller/pkg/factory/factory.go:227 +0x52
github.com/ovn-org/ovn-kubernetes/go-controller/pkg/factory.(*informer).processEvents(0xc0002eb180, 0xc000432b40, 0xc0000a6660)
	/home/c1/go/src/github.com/openshift/ovn-kubernetes/go-controller/pkg/factory/factory.go:159 +0x3e
created by github.com/ovn-org/ovn-kubernetes/go-controller/pkg/factory.newQueuedInformer
	/home/c1/go/src/github.com/openshift/ovn-kubernetes/go-controller/pkg/factory/factory.go:343 +0xfa
No, that's wrong. It's a nested handler add:

goroutine 116 [semacquire]:
sync.runtime_SemacquireMutex(0xc000305658, 0x0, 0x0)
	/home/c1/go/go/src/runtime/sema.go:71 +0x47
sync.(*RWMutex).Lock(0xc000305650)
	/home/c1/go/go/src/sync/rwmutex.go:103 +0x88
github.com/ovn-org/ovn-kubernetes/go-controller/pkg/factory.(*WatchFactory).addHandler(0xc0003bb900, 0x1a05920, 0x172c620, 0xc001144500, 0x19, 0xc001188440, 0x19c8bc0, 0xc00089fd80, 0x0, 0x0, ...)
	/home/c1/go/src/github.com/openshift/ovn-kubernetes/go-controller/pkg/factory/factory.go:543 +0x247
github.com/ovn-org/ovn-kubernetes/go-controller/pkg/factory.(*WatchFactory).AddFilteredPodHandler(...)
	/home/c1/go/src/github.com/openshift/ovn-kubernetes/go-controller/pkg/factory/factory.go:578
github.com/ovn-org/ovn-kubernetes/go-controller/pkg/ovn.(*Controller).handlePeerNamespaceAndPodSelector.func1(0x172b720, 0xc000fb86e0)
	/home/c1/go/src/github.com/openshift/ovn-kubernetes/go-controller/pkg/ovn/policy.go:1258 +0x311
k8s.io/client-go/tools/cache.ResourceEventHandlerFuncs.OnAdd(...)
	/home/c1/go/src/github.com/openshift/ovn-kubernetes/go-controller/vendor/k8s.io/client-go/tools/cache/controller.go:198
k8s.io/client-go/tools/cache.FilteringResourceEventHandler.OnAdd(0xc00034a7c0, 0x19c8bc0, 0xc00089e100, 0x172b720, 0xc000fb86e0)
	/home/c1/go/src/github.com/openshift/ovn-kubernetes/go-controller/vendor/k8s.io/client-go/tools/cache/controller.go:230 +0x69
github.com/ovn-org/ovn-kubernetes/go-controller/pkg/factory.(*Handler).OnAdd(...)
	/home/c1/go/src/github.com/openshift/ovn-kubernetes/go-controller/pkg/factory/factory.go:38
github.com/ovn-org/ovn-kubernetes/go-controller/pkg/factory.newInformer.func1(0xc0008e2600, 0xc00114c200, 0x6, 0x8)
	/home/c1/go/src/github.com/openshift/ovn-kubernetes/go-controller/pkg/factory/factory.go:328 +0x92
github.com/ovn-org/ovn-kubernetes/go-controller/pkg/factory.(*informer).addHandler(0xc0003059d0, 0x3b, 0xc00034a7c0, 0x19c8bc0, 0xc00089e100, 0xc00114c200, 0x6, 0x8, 0x15cbfc0)
	/home/c1/go/src/github.com/openshift/ovn-kubernetes/go-controller/pkg/factory/factory.go:124 +0xaf
github.com/ovn-org/ovn-kubernetes/go-controller/pkg/factory.(*WatchFactory).addHandler(0xc0003bb900, 0x1a05920, 0x172b720, 0x0, 0x0, 0xc001188460, 0x19c8bc0, 0xc00089e100, 0x0, 0x0, ...)
	/home/c1/go/src/github.com/openshift/ovn-kubernetes/go-controller/pkg/factory/factory.go:559 +0x451
github.com/ovn-org/ovn-kubernetes/go-controller/pkg/factory.(*WatchFactory).AddFilteredNamespaceHandler(...)
	/home/c1/go/src/github.com/openshift/ovn-kubernetes/go-controller/pkg/factory/factory.go:628
github.com/ovn-org/ovn-kubernetes/go-controller/pkg/ovn.(*Controller).handlePeerNamespaceAndPodSelector(0xc0008c9440, 0xc000868180, 0xc000dc5d80, 0xc001188460, 0xc001188440, 0xc0010a29a0, 0x14, 0xc0014031d0, 0xc00037a9c0)
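For the record, here is a minimal standalone sketch of the pattern the trace shows. The types, names, and locking layout below are simplified assumptions for illustration, not the actual ovn-kubernetes factory code: the outer handler add holds a non-reentrant handler mutex while it synchronously replays initial events, and one of those callbacks registers another handler, which tries to take the same mutex.

package main

import "sync"

// watchFactory is an illustrative stand-in, not the real WatchFactory.
type watchFactory struct {
	mu       sync.Mutex
	handlers []func(obj interface{})
}

// addHandler registers h and synchronously delivers initial events while
// still holding the handler mutex, as the outer addHandler frame in the
// trace appears to.
func (f *watchFactory) addHandler(h func(obj interface{}), initial []interface{}) {
	f.mu.Lock()
	defer f.mu.Unlock()
	f.handlers = append(f.handlers, h)
	for _, obj := range initial {
		h(obj) // callback runs with f.mu held
	}
}

func main() {
	f := &watchFactory{}
	// A namespace handler that registers a pod handler from inside its own
	// OnAdd callback, analogous to handlePeerNamespaceAndPodSelector above.
	f.addHandler(func(obj interface{}) {
		// Nested add: blocks forever on f.mu, which this goroutine already
		// holds -- the Go runtime reports "all goroutines are asleep".
		f.addHandler(func(obj interface{}) {}, nil)
	}, []interface{}{"namespace1"})
}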
Aha, but golang read locks are more subtle: once another goroutine is blocked waiting in Lock(), subsequent RLock() calls block too, even though read locks are otherwise shared. Upstream PR, for backport: https://github.com/ovn-org/ovn-kubernetes/pull/1379
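To illustrate that subtlety, a standalone sketch (not the ovn-kubernetes code): the sync.RWMutex documentation says that once a writer is blocked in Lock(), new readers are held back to avoid writer starvation, so a goroutine that re-acquires a read lock it already holds can deadlock if a writer queues up in between.

package main

import (
	"fmt"
	"sync"
	"time"
)

func main() {
	var mu sync.RWMutex

	mu.RLock() // outer read lock, e.g. held while delivering events to handlers

	go func() {
		// A writer (e.g. a handler add) queues up behind the read lock.
		mu.Lock()
		defer mu.Unlock()
	}()

	time.Sleep(100 * time.Millisecond) // give the writer time to block in Lock()

	// A nested read lock on the same goroutine now blocks behind the queued
	// writer, which is itself waiting for our outer RLock() to be released:
	// fatal error: all goroutines are asleep - deadlock!
	mu.RLock()
	fmt.Println("never reached")
}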
Changes are in 4.6; setting to Modified.
Who is impacted? If we have to block upgrade edges based on this issue, which edges would need blocking?
  example: Customers upgrading from 4.y.z to 4.y+1.z running on GCP with thousands of namespaces, approximately 5% of the subscribed fleet
  example: All customers upgrading from 4.y.z to 4.y+1.z fail approximately 10% of the time

What is the impact? Is it serious enough to warrant blocking edges?
  example: Up to 2 minute disruption in edge routing
  example: Up to 90 seconds of API downtime
  example: etcd loses quorum and you have to restore from backup

How involved is remediation (even moderately serious impacts might be acceptable if they are easy to mitigate)?
  example: Issue resolves itself after five minutes
  example: Admin uses oc to fix things
  example: Admin must SSH to hosts, restore from backups, or other non-standard admin activities

Is this a regression (if all previous versions were also vulnerable, updating to the new, vulnerable version does not increase exposure)?
  example: No, it's always been like this, we just never noticed
  example: Yes, from 4.y.z to 4.y+1.z, or from 4.y.z to 4.y.z+1
Also not clear to me is how this bug was temporarily ON_QA to unblock the 4.5 backport [1] but is now back in POST, despite not having the related patches in 4.6:

$ git log --oneline origin/release-4.6 | grep 'master: batch loadbalancer\|factory: remove unecessary'
$ git log --oneline origin/release-4.5 | grep 'master: batch loadbalancer\|factory: remove unecessary'
c481cd4e master: batch loadbalancer acl updates
ef7f3b2b factory: remove unecessary locking for enqueueEevent

How will [1,2] flow into release-4.6?

[1]: https://github.com/ovn-org/ovn-kubernetes/pull/1378
[2]: https://github.com/ovn-org/ovn-kubernetes/pull/1384
I'd followed up on this in Slack because I too was confused by the state. The master branch of openshift/ovn-kubernetes will receive a full rebase from upstream, and the fixes have already been backported to 4.5. That seems like an abuse of our process, but I've been assured the fixes are in 4.5, and QE seems to have VERIFIED the 4.5 bug, so I'm not concerned beyond the risk of a regression in 4.6 if we somehow forget to rebase; let's not do that. In the future, even when there's a pending rebase from upstream, I think it'd be best to pull an UPSTREAM commit into master ahead of the rebase so there's no risk.

ref https://coreos.slack.com/archives/CDCP2LA9L/p1592495186201700
Yeah, it was somewhat an abuse of process. However, the OpenShift processes and procedures don't interact well with upstream-first projects. We don't really use "our" master branch at all, just upstream master and the release cuts.
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (OpenShift Container Platform 4.6 GA Images), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2020:4196
Removing UpgradeBlocker from this older bug, to remove it from the suspect queue described in [1]. If you feel this bug still needs to be a suspect, please add the keyword again.

[1]: https://github.com/openshift/enhancements/pull/475