Bug 1440977
| Summary: | [3.4] Router hangs on deadlock | |||
|---|---|---|---|---|
| Product: | OpenShift Container Platform | Reporter: | Eric Rich <erich> | |
| Component: | Networking | Assignee: | Ben Bennett <bbennett> | |
| Networking sub component: | router | QA Contact: | zhaozhanqi <zzhao> | |
| Status: | CLOSED ERRATA | Docs Contact: | ||
| Severity: | urgent | |||
| Priority: | unspecified | CC: | aos-bugs, bbennett, bvincell, clichybi, cryan, dakini, dyocum, eparis, erjones, gkf42, jeder, jkaur, jswensso, kyankovi, lans.carstensen, maschmid, mrobson, mwhittin, myllynen, nnosenzo, nschuetz, pdwyer, rhowe, sreber, stwalter, tdawson, tkimura, tobias.brunner, vwalek, zhaliu | |
| Version: | 3.4.0 | Keywords: | OpsBlocker | |
| Target Milestone: | --- | |||
| Target Release: | 3.4.z | |||
| Hardware: | Unspecified | |||
| OS: | Unspecified | |||
| Whiteboard: | ||||
| Fixed In Version: | Doc Type: | If docs needed, set a value | ||
| Doc Text: | Story Points: | --- | ||
| Clone Of: | ||||
| : | 1442859 1442860 (view as bug list) | Environment: | ||
| Last Closed: | 2017-04-26 05:37:03 UTC | Type: | Bug | |
| Regression: | --- | Mount Type: | --- | |
| Documentation: | --- | CRM: | ||
| Verified Versions: | Category: | --- | ||
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | ||
| Cloudforms Team: | --- | Target Upstream Version: | ||
| Embargoed: | ||||
| Bug Depends On: | ||||
| Bug Blocks: | 1442859, 1442860, 1442863 | |||
|
Description
Eric Rich
2017-04-10 21:42:20 UTC
*** Bug 1441732 has been marked as a duplicate of this bug. ***

Commit pushed to master at https://github.com/openshift/origin

https://github.com/openshift/origin/commit/fad6679bcba9182a4bea9b6380b7312b730723f7

Prevent the router from deadlocking itself when calling Commit()

The router reload function (Commit()) was changed so that it could be rate limited. The rate limiter tracks changes in a kcache.Queue object. It coalesces changes to the same call, so that if three calls to Invoke() the rate-limited function happen before the next time it is allowed to process, only one will occur.

Our problem was that we were doing:

Thread 1 (the rate-limiter background process):
- Wake up
- Ratelimit sees there is work to be done and calls fifo.go's Pop() function
- Fifo.go acquires a fifo lock and calls the processing function
- Router.go's commitAndReload() function acquires a lock on the router object

Thread 2 (triggered by the event handler that commits changes to the router):
- Get the event and process it
- Since there are changes to be made, call router.Commit()
- Commit() grabs a lock on the router object
- Then calls the rate-limiter wrapper around commitAndReload() using Invoke() to queue work
- In order to queue the work, it acquires a lock on the fifo

So thread 1 locks fifo then router; thread 2 locks router then fifo. If you get unlucky, those threads deadlock and you never process another event.

The fix is to release the lock on the router object in our Commit() function before we call Invoke() on the rate-limited function. The lock is not actually protecting anything at that point, since the rate-limited function does its own locking and is run in a separate thread anyway.

Fixes bug 1440977 (https://bugzilla.redhat.com/show_bug.cgi?id=1440977)

*** Bug 1418124 has been marked as a duplicate of this bug. ***

*** Bug 1438033 has been marked as a duplicate of this bug. ***

This has been merged into ocp and is in OCP v3.4.1.18 or newer.
This problem is difficult to reproduce, and there is no good way to verify this bug. I will keep monitoring the routers' status and logs in the online free int environment. On the other hand, I will try to find a reproducer.

Verified this bug on v3.4.1.18 using the same steps as https://bugzilla.redhat.com/show_bug.cgi?id=1442860#c4. Please correct me if these steps are not enough.

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2017:1129 |