Bug 1440977

Summary: [3.4] Router hangs on deadlock
Product: OpenShift Container Platform
Reporter: Eric Rich <erich>
Component: Networking
Assignee: Ben Bennett <bbennett>
Networking sub component: router
QA Contact: zhaozhanqi <zzhao>
Status: CLOSED ERRATA
Docs Contact:
Severity: urgent
Priority: unspecified
CC: aos-bugs, bbennett, bvincell, clichybi, cryan, dakini, dyocum, eparis, erjones, gkf42, jeder, jkaur, jswensso, kyankovi, lans.carstensen, maschmid, mrobson, mwhittin, myllynen, nnosenzo, nschuetz, pdwyer, rhowe, sreber, stwalter, tdawson, tkimura, tobias.brunner, vwalek, zhaliu
Version: 3.4.0
Keywords: OpsBlocker
Target Milestone: ---   
Target Release: 3.4.z   
Hardware: Unspecified   
OS: Unspecified   
Whiteboard:
Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of:
: 1442859 1442860
Environment:
Last Closed: 2017-04-26 05:37:03 UTC
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Embargoed:
Bug Depends On:    
Bug Blocks: 1442859, 1442860, 1442863    

Description Eric Rich 2017-04-10 21:42:20 UTC
Description of problem:

Customers are reporting that routes are not being updated and that the router has stopped updating.

This is seen in the router logs (no new log entries in hours), and no changes to the config/map files have been seen in hours either.

Version-Release number of selected component (if applicable): 3.4

How reproducible: Undetermined

Steps to Reproduce:
1. We believe this happens more often the more frequently routes are updated.

Actual results: Applications (mainly on new routes) return 503 errors, showing the router's 503 error page.

Expected results: The router should not stop pulling updates from OCP, or it should restart if it is having issues pulling such routing information.

Additional info:

Comment 3 Ben Bennett 2017-04-12 17:29:57 UTC
*** Bug 1441732 has been marked as a duplicate of this bug. ***

Comment 6 openshift-github-bot 2017-04-14 13:40:21 UTC
Commit pushed to master at https://github.com/openshift/origin

https://github.com/openshift/origin/commit/fad6679bcba9182a4bea9b6380b7312b730723f7
Prevent the router from deadlocking itself when calling Commit()

The router reload function (Commit()) was changed so that it could be
rate limited.  The rate limiter tracks changes in a kcache.Queue
object.  It will coalesce changes to the same call, so that if three
calls to Invoke() on the rate-limited function happen before the next
time it is allowed to process, only one invocation will occur.
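
A minimal Go sketch of the coalescing behaviour described above. This is
not the kcache.Queue or origin rate-limiter API; the coalescingLimiter
type, its drain() method and the "commit" key are hypothetical names used
only for illustration. It shows only that several Invoke() calls made
before the next processing window collapse into one handler call.

```go
package main

import "sync"

// coalescingLimiter is a hypothetical stand-in for the rate limiter.
// It is NOT the kcache.Queue API; it only models the coalescing idea.
type coalescingLimiter struct {
	mu      sync.Mutex
	pending map[string]struct{} // keys with queued work; duplicates collapse
	handler func(key string)
}

func newCoalescingLimiter(handler func(key string)) *coalescingLimiter {
	return &coalescingLimiter{pending: map[string]struct{}{}, handler: handler}
}

// Invoke queues work for key. Repeated calls made before the next drain
// are coalesced into a single pending entry.
func (c *coalescingLimiter) Invoke(key string) {
	c.mu.Lock()
	defer c.mu.Unlock()
	c.pending[key] = struct{}{}
}

// drain is what a background goroutine would call each time the rate limit
// allows processing. It runs the handler once per distinct pending key.
func (c *coalescingLimiter) drain() {
	c.mu.Lock()
	keys := make([]string, 0, len(c.pending))
	for k := range c.pending {
		keys = append(keys, k)
	}
	c.pending = map[string]struct{}{}
	c.mu.Unlock()

	// In this sketch the handler runs after the lock has been released.
	for _, k := range keys {
		c.handler(k)
	}
}
```

Calling Invoke("commit") three times and then drain() once runs the
handler a single time; a second drain() with nothing pending does nothing.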

Our problem was that we were doing:
 Thread 1 (the rate-limiter background process):
   - Wake up
   - Ratelimit sees there is work to be done and calls fifo.go's Pop() function
   - Fifo.go acquires a fifo lock and calls the processing function
   - Router.go's commitAndReload() function acquires a lock on the router object
 Thread 2 (triggered by the event handler that commits changes to the router):
   - Get the event and process it
   - Since there are changes to be made, call router.Commit()
   - Commit() grabs a lock on the router object
   - Then calls the rate-limiter wrapper around commitAndReload() using Invoke() to queue work
   - In order to queue the work... it acquires a lock on the fifo

So thread 1 locks: fifo then router; thread 2 locks: router then fifo.
If you get unlucky, those threads deadlock and you never process
another event.
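
The lock-order inversion above can be replayed as a small Go sketch. It
is a simplified model, not the router code: fifoLock and routerLock are
hypothetical stand-ins for the two locks, and the unlucky interleaving is
replayed in a single goroutine, using sync.Mutex.TryLock (Go 1.18+) in
place of a blocking Lock() so that the example reports the stall instead
of hanging.

```go
package main

import "sync"

// replayInversion replays the unlucky interleaving step by step and reports
// whether each "thread" would block on its second lock. Both locks are
// hypothetical stand-ins for the fifo lock and the router object lock.
func replayInversion() (thread1Blocked, thread2Blocked bool) {
	var fifoLock, routerLock sync.Mutex

	fifoLock.Lock()   // thread 1: Pop() holds the fifo lock
	routerLock.Lock() // thread 2: Commit() holds the router lock

	// thread 1: commitAndReload() now needs the router lock.
	thread1Blocked = !routerLock.TryLock()
	// thread 2: Invoke() now needs the fifo lock.
	thread2Blocked = !fifoLock.TryLock()

	// Release the locks taken at the start so the sketch exits cleanly.
	routerLock.Unlock()
	fifoLock.Unlock()
	return thread1Blocked, thread2Blocked
}
```

Both return values are true: each side holds the lock the other needs,
which with blocking Lock() calls means neither can ever proceed.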

The fix is to release the lock on the router object in our Commit()
function before we call Invoke on the rate limited function.  The lock
is not actually protecting anything at that point since the rate
limited function does its own locking, and is run in a separate thread
anyway.
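
A sketch of the shape of this fix, under stated assumptions: the
rateLimitedFunc and router types, the stateChanged field and the
markChanged() and pop() helpers are hypothetical and much simpler than
the real origin code. The point shown is only the ordering: Commit()
drops the router lock before calling Invoke(), so the only nested
acquisition left is fifo then router.

```go
package main

import "sync"

// rateLimitedFunc is a hypothetical, simplified model of the rate-limited
// wrapper; it is not the real origin or kcache type.
type rateLimitedFunc struct {
	fifoLock sync.Mutex
	pending  bool
	handler  func()
}

// Invoke queues work; to do so it needs the fifo lock.
func (f *rateLimitedFunc) Invoke() {
	f.fifoLock.Lock()
	defer f.fifoLock.Unlock()
	f.pending = true
}

// pop models the background step: it holds the fifo lock while the
// processing function runs.
func (f *rateLimitedFunc) pop() {
	f.fifoLock.Lock()
	defer f.fifoLock.Unlock()
	if f.pending {
		f.pending = false
		f.handler()
	}
}

// router is a hypothetical stand-in for the router object.
type router struct {
	lock         sync.Mutex
	stateChanged bool
	reloads      int
	limited      *rateLimitedFunc
}

func newRouter() *router {
	r := &router{}
	r.limited = &rateLimitedFunc{handler: r.commitAndReload}
	return r
}

// markChanged records that there are changes waiting to be committed.
func (r *router) markChanged() {
	r.lock.Lock()
	defer r.lock.Unlock()
	r.stateChanged = true
}

// Commit reads router state under the router lock, then releases that lock
// BEFORE calling Invoke(), so the router lock is never held while waiting
// for the fifo lock.
func (r *router) Commit() {
	r.lock.Lock()
	needsCommit := r.stateChanged
	r.lock.Unlock()

	if needsCommit {
		r.limited.Invoke()
	}
}

// commitAndReload is the processing function; it is called from pop() with
// the fifo lock held and takes the router lock itself.
func (r *router) commitAndReload() {
	r.lock.Lock()
	defer r.lock.Unlock()
	r.stateChanged = false
	r.reloads++
}
```

With this ordering, one goroutine can call Commit() in a loop while
another calls pop() and neither waits on the other indefinitely.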

Fixes bug 1440977 (https://bugzilla.redhat.com/show_bug.cgi?id=1440977)

Comment 7 Stefanie Forrester 2017-04-17 14:23:22 UTC
*** Bug 1418124 has been marked as a duplicate of this bug. ***

Comment 9 Ben Bennett 2017-04-18 15:41:59 UTC
*** Bug 1438033 has been marked as a duplicate of this bug. ***

Comment 14 Eric Paris 2017-04-19 18:04:24 UTC
https://github.com/openshift/ose/pull/705

Comment 15 Troy Dawson 2017-04-20 15:57:56 UTC
This has been merged into ocp and is in OCP v3.4.1.18 or newer.

Comment 17 zhaliu 2017-04-21 08:33:43 UTC
This problem is difficult to reproduce, and there is no good way to verify this bug. I will keep monitoring the routers' status and logs in the online free int environment. Meanwhile, I will try to find a reproducer.

Comment 19 zhaozhanqi 2017-04-21 13:44:00 UTC
Verified this bug on v3.4.1.18

using the same steps as https://bugzilla.redhat.com/show_bug.cgi?id=1442860#c4

Please correct me if these steps are not enough.

Comment 23 errata-xmlrpc 2017-04-26 05:37:03 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2017:1129