Bug 1440977 - [3.4] Router hangs on deadlock
Summary: [3.4] Router hangs on deadlock
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Networking
Version: 3.4.0
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: urgent
Target Milestone: ---
Target Release: 3.4.z
Assignee: Ben Bennett
QA Contact: zhaozhanqi
URL:
Whiteboard:
Duplicates: 1418124 1438033 1441732
Depends On:
Blocks: 1442859 1442860 1442863
Reported: 2017-04-10 21:42 UTC by Eric Rich
Modified: 2022-08-04 22:20 UTC

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Clones: 1442859 1442860
Environment:
Last Closed: 2017-04-26 05:37:03 UTC
Target Upstream Version:
Embargoed:




Links
System | ID | Private | Priority | Status | Summary | Last Updated
Github | openshift origin pull 13717 | 0 | None | closed | Prevent the router from deadlocking itself when calling Commit() | 2020-12-14 16:59:10 UTC
Red Hat Knowledge Base (Solution) | 2995641 | 0 | None | None | None | 2017-04-10 21:46:14 UTC
Red Hat Product Errata | RHBA-2017:1129 | 0 | normal | SHIPPED_LIVE | OpenShift Container Platform 3.5, 3.4, 3.3, and 3.2 bug fix update | 2017-04-26 09:35:35 UTC

Description Eric Rich 2017-04-10 21:42:20 UTC
Description of problem:

Customers are reporting that routes are not updating and that the router is not picking up changes.

This is visible in the router logs (no new entries for hours), and no changes to the config/map files have been seen for hours either.

Version-Release number of selected component (if applicable): 3.4

How reproducible: Undetermined

Steps to Reproduce:
1. Not determined. We believe this happens more often the more frequent and regular route updates are.

Actual results: Applications (mainly on new routes) return 503 errors and show the router's 503 error page.

Expected results: The router should not stop pulling updates from OCP, or should restart if it is having issues pulling such routing information.

Additional info:

Comment 3 Ben Bennett 2017-04-12 17:29:57 UTC
*** Bug 1441732 has been marked as a duplicate of this bug. ***

Comment 6 openshift-github-bot 2017-04-14 13:40:21 UTC
Commit pushed to master at https://github.com/openshift/origin

https://github.com/openshift/origin/commit/fad6679bcba9182a4bea9b6380b7312b730723f7
Prevent the router from deadlocking itself when calling Commit()

The router reload function (Commit()) was changed so that it could be
rate limited.  The rate limiter tracks changes in a kcache.Queue
object.  It coalesces calls to the same function, so that if three
calls to Invoke() on the rate-limited function happen before the next
time it is allowed to process, only one call will occur.

Our problem was that we were doing:
 Thread 1 (the rate-limiter background process):
   - Wake up
   - Ratelimit sees there is work to be done and calls fifo.go's Pop() function
   - Fifo.go acquires a fifo lock and calls the processing function
   - Router.go's commitAndReload() function acquires a lock on the router object
 Thread 2 (triggered by the event handler that commits changes to the router):
   - Get the event and process it
   - Since there are changes to be made, call router.Commit()
   - Commit() grabs a lock on the router object
   - Then calls the rate-limiter wrapper around commitAndReload() using Invoke() to queue work
   - In order to queue the work... it acquires a lock on the fifo

So thread 1 locks: fifo then router; thread 2 locks: router then fifo.
If you get unlucky, those threads deadlock and you never process
another event.
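
The lock-order inversion above can be reproduced deterministically in a small Go sketch. This is an illustration only, not the router code: the mutex names fifoMu and routerMu and the function lockOrderInversion() are hypothetical, and sync.Mutex.TryLock (available from Go 1.18) is used so that the example reports the blocked state without hanging.

```go
package main

import (
	"fmt"
	"sync"
)

// lockOrderInversion replays the unlucky interleaving step by step.
// Each "thread" already holds its first lock and then tries to take the
// other one. TryLock is used in place of Lock so the function returns
// and reports the stuck state; with Lock, both sides would wait forever.
func lockOrderInversion() (thread1Blocked, thread2Blocked bool) {
	var fifoMu, routerMu sync.Mutex // hypothetical stand-ins for the two locks

	// Thread 1 (rate limiter): Pop() has taken the fifo lock.
	fifoMu.Lock()
	// Thread 2 (event handler): Commit() has taken the router lock.
	routerMu.Lock()

	// Thread 1 now needs the router lock inside commitAndReload().
	if routerMu.TryLock() {
		routerMu.Unlock()
	} else {
		thread1Blocked = true
	}

	// Thread 2 now needs the fifo lock inside Invoke().
	if fifoMu.TryLock() {
		fifoMu.Unlock()
	} else {
		thread2Blocked = true
	}

	// Release the locks taken at the start so the sketch exits cleanly.
	fifoMu.Unlock()
	routerMu.Unlock()
	return thread1Blocked, thread2Blocked
}

func main() {
	t1, t2 := lockOrderInversion()
	fmt.Println(t1, t2) // prints: true true
}
```

Both sides report that they cannot proceed: each holds the lock the other needs, which is the state the router was left in.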

The fix is to release the lock on the router object in our Commit()
function before we call Invoke on the rate limited function.  The lock
is not actually protecting anything at that point since the rate
limited function does its own locking, and is run in a separate thread
anyway.
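
A minimal sketch of the shape of this fix, under stated assumptions: the types router and rateLimitedFunc, the fields dirty and reloads, and the helpers markDirty(), process() and stress() are hypothetical and much simpler than the Origin code. Only the names Commit(), commitAndReload() and Invoke() come from the description above.

```go
package main

import (
	"fmt"
	"sync"
	"time"
)

// rateLimitedFunc is a hypothetical stand-in for the rate-limiter wrapper.
type rateLimitedFunc struct {
	mu      sync.Mutex // plays the role of the fifo lock
	pending bool
	handler func()
}

// Invoke queues work; it needs the fifo lock to do so.
func (f *rateLimitedFunc) Invoke() {
	f.mu.Lock()
	defer f.mu.Unlock()
	f.pending = true
}

// process mimics Pop(): the handler runs while the fifo lock is held.
func (f *rateLimitedFunc) process() {
	f.mu.Lock()
	defer f.mu.Unlock()
	if f.pending {
		f.pending = false
		f.handler()
	}
}

type router struct {
	mu         sync.Mutex // lock on the router object
	dirty      bool
	reloads    int
	commitFunc *rateLimitedFunc
}

func newRouter() *router {
	r := &router{}
	r.commitFunc = &rateLimitedFunc{handler: r.commitAndReload}
	return r
}

func (r *router) markDirty() {
	r.mu.Lock()
	r.dirty = true
	r.mu.Unlock()
}

// Commit releases the router lock BEFORE calling Invoke(), so this thread
// never holds the router lock while waiting for the fifo lock.
func (r *router) Commit() {
	r.mu.Lock()
	needsCommit := r.dirty
	r.mu.Unlock()
	if needsCommit {
		r.commitFunc.Invoke()
	}
}

// commitAndReload is called with the fifo lock held and takes the router lock.
func (r *router) commitAndReload() {
	r.mu.Lock()
	defer r.mu.Unlock()
	r.dirty = false
	r.reloads++
}

// stress runs both threads concurrently and reports whether they finished
// before a timeout (a deadlock would show up as a timeout).
func stress(iterations int) bool {
	r := newRouter()
	done := make(chan struct{})
	go func() {
		var wg sync.WaitGroup
		wg.Add(2)
		go func() { // thread 2: event handler
			defer wg.Done()
			for i := 0; i < iterations; i++ {
				r.markDirty()
				r.Commit()
			}
		}()
		go func() { // thread 1: rate-limiter background process
			defer wg.Done()
			for i := 0; i < iterations; i++ {
				r.commitFunc.process()
			}
		}()
		wg.Wait()
		close(done)
	}()
	select {
	case <-done:
		return true
	case <-time.After(5 * time.Second):
		return false
	}
}

func main() {
	fmt.Println("completed without deadlock:", stress(10000))
}
```

With this ordering, thread 1 still takes fifo then router, but thread 2 never holds both locks at once, so the circular wait cannot form.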

Fixes bug 1440977 (https://bugzilla.redhat.com/show_bug.cgi?id=1440977)

Comment 7 Stefanie Forrester 2017-04-17 14:23:22 UTC
*** Bug 1418124 has been marked as a duplicate of this bug. ***

Comment 9 Ben Bennett 2017-04-18 15:41:59 UTC
*** Bug 1438033 has been marked as a duplicate of this bug. ***

Comment 14 Eric Paris 2017-04-19 18:04:24 UTC
https://github.com/openshift/ose/pull/705

Comment 15 Troy Dawson 2017-04-20 15:57:56 UTC
This has been merged into ocp and is in OCP v3.4.1.18 or newer.

Comment 17 zhaliu 2017-04-21 08:33:43 UTC
This problem is difficult to reproduce, and there is no good way to verify this bug. I will keep monitoring the routers' status and logs in the online free int environment. In the meantime, I will try to find a reproducer.

Comment 19 zhaozhanqi 2017-04-21 13:44:00 UTC
Verified this bug on v3.4.1.18

Used the same steps as in https://bugzilla.redhat.com/show_bug.cgi?id=1442860#c4

Please correct me if these steps are not sufficient.

Comment 23 errata-xmlrpc 2017-04-26 05:37:03 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2017:1129

