1440977 – [3.4] Router hangs on deadlock

Keywords:
Status:	CLOSED ERRATA
Alias:	None
Product:	OpenShift Container Platform
Classification:	Red Hat
Component:	Networking
Sub Component:
Version:	3.4.0
Hardware:	Unspecified
OS:	Unspecified
Priority:	unspecified
Severity:	urgent
Target Milestone:	---
Target Release:	3.4.z
Assignee:	Ben Bennett
QA Contact:	zhaozhanqi
Docs Contact:
URL:
Whiteboard:
Duplicates (3):	1418124 1438033 1441732
Depends On:
Blocks:	1442859 1442860 1442863

Reported:	2017-04-10 21:42 UTC by Eric Rich
Modified:	2022-08-04 22:20 UTC
CC List:	30 users
Fixed In Version:
Doc Type:	If docs needed, set a value
Doc Text:
Clone Of:
Clones:	1442859 1442860
Environment:
Last Closed:	2017-04-26 05:37:03 UTC
Target Upstream Version:
Embargoed:

Links
System	ID	Priority	Status	Summary	Last Updated
Github	openshift origin pull 13717	None	closed	Prevent the router from deadlocking itself when calling Commit()	2020-12-14 16:59:10 UTC
Red Hat Knowledge Base (Solution)	2995641	None	None	None	2017-04-10 21:46:14 UTC
Red Hat Product Errata	RHBA-2017:1129	normal	SHIPPED_LIVE	OpenShift Container Platform 3.5, 3.4, 3.3, and 3.2 bug fix update	2017-04-26 09:35:35 UTC

Description Eric Rich 2017-04-10 21:42:20 UTC

Description of problem:

Customers are reporting that routes are not being updated and that the router itself has stopped updating.

This can be seen in the router logs (no new log entries for hours), and no changes to the config/map files have been seen for hours either.

Version-Release number of selected component (if applicable): 3.4

How reproducible: Undetermined

Steps to Reproduce:
1. We believe this happens more often the more frequent / regular route updates are.

Actual results: Applications (mainly on new routes) return 503 errors, showing the router's 503 error page.

Expected results: The router should not stop pulling updates from OCP, or should restart if it's having issues pulling such routing information.

Additional info:

Comment 3 Ben Bennett 2017-04-12 17:29:57 UTC

*** Bug 1441732 has been marked as a duplicate of this bug. ***

Comment 6 openshift-github-bot 2017-04-14 13:40:21 UTC

Commit pushed to master at https://github.com/openshift/origin

https://github.com/openshift/origin/commit/fad6679bcba9182a4bea9b6380b7312b730723f7
Prevent the router from deadlocking itself when calling Commit()

The router reload function (Commit()) was changed so that it could be
rate limited.  The rate limiter tracks changes in a kcache.Queue
object.  It coalesces changes to the same call, so that if three
calls to Invoke() on the rate-limited function happen before the next
time it is allowed to process, then only one will occur.
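
The coalescing behaviour described above can be illustrated with a minimal sketch. This is not the kcache.Queue-based rate limiter from origin: the type coalescingLimiter, its fields, and the helpers processOnce() and countReloads() are hypothetical names invented for illustration, and the timing side of rate limiting is left out entirely.

```go
package main

import (
	"fmt"
	"sync"
)

// coalescingLimiter is a hypothetical stand-in for the rate limiter described
// above; it is not the kcache.Queue-based implementation in origin.
type coalescingLimiter struct {
	mu      sync.Mutex
	pending bool   // true when at least one Invoke() is waiting to be processed
	handler func() // the rate-limited function
}

// Invoke records that work is needed. Repeated calls before the next
// processing window collapse into a single pending request.
func (l *coalescingLimiter) Invoke() {
	l.mu.Lock()
	defer l.mu.Unlock()
	l.pending = true
}

// processOnce stands in for one tick of the background loop: it runs the
// handler at most once, no matter how many Invoke() calls were queued.
func (l *coalescingLimiter) processOnce() {
	l.mu.Lock()
	run := l.pending
	l.pending = false
	l.mu.Unlock()
	if run {
		l.handler()
	}
}

// countReloads queues n Invoke() calls, then lets two ticks run.
func countReloads(n int) int {
	reloads := 0
	l := &coalescingLimiter{handler: func() { reloads++ }}
	for i := 0; i < n; i++ {
		l.Invoke()
	}
	l.processOnce()
	l.processOnce() // nothing is pending, so this tick does no work
	return reloads
}

func main() {
	fmt.Println("reloads after 3 Invoke() calls:", countReloads(3))
}
```

Three queued calls result in a single run of the handler, which is the "only one will occur" case from the commit message.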

Our problem was that we were doing:
 Thread 1 (the rate-limiter background process):
   - Wake up
   - Ratelimit sees there is work to be done and calls fifo.go's Pop() function
   - Fifo.go acquires a fifo lock and calls the processing function
   - Router.go's commitAndReload() function acquires a lock on the router object
 Thread 2 (triggered by the event handler that commits changes to the router):
   - Get the event and process it
   - Since there are changes to be made, call router.Commit()
   - Commit() grabs a lock on the router object
   - Then calls the rate-limiter wrapper around commitAndReload() using Invoke() to queue work
   - In order to queue the work... it acquires a lock on the fifo

So thread 1 locks: fifo then router; thread 2 locks: router then fifo.
If you get unlucky, those threads deadlock and you never process
another event.
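
The unlucky interleaving can be replayed step by step as a rough sketch. The two mutexes below are hypothetical stand-ins for the fifo lock and the router lock, not the real objects, and the function name lockOrderInversion() is invented here. sync.Mutex.TryLock (Go 1.18 or newer) is used so that the sketch reports the stuck state instead of hanging.

```go
package main

import (
	"fmt"
	"sync"
)

// lockOrderInversion replays the unlucky interleaving on a single goroutine.
// TryLock reports whether a lock could be taken, so the stuck state is
// observed without actually blocking.
func lockOrderInversion() (thread1CanProceed, thread2CanProceed bool) {
	var fifoLock, routerLock sync.Mutex // hypothetical stand-ins for the two locks

	fifoLock.Lock()   // thread 1: Pop() takes the fifo lock
	routerLock.Lock() // thread 2: Commit() takes the router lock

	// thread 1: commitAndReload() now wants the router lock
	thread1CanProceed = routerLock.TryLock()
	// thread 2: Invoke() now wants the fifo lock
	thread2CanProceed = fifoLock.TryLock()
	return
}

func main() {
	t1, t2 := lockOrderInversion()
	fmt.Printf("thread 1 can proceed: %v, thread 2 can proceed: %v\n", t1, t2)
}
```

Neither side can take its second lock, and with blocking Lock() calls in place of TryLock() both threads would wait on each other forever.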

The fix is to release the lock on the router object in our Commit()
function before we call Invoke on the rate limited function.  The lock
is not actually protecting anything at that point since the rate
limited function does its own locking, and is run in a separate thread
anyway.

Fixes bug 1440977 (https://bugzilla.redhat.com/show_bug.cgi?id=1440977)

Comment 7 Stefanie Forrester 2017-04-17 14:23:22 UTC

*** Bug 1418124 has been marked as a duplicate of this bug. ***

Comment 9 Ben Bennett 2017-04-18 15:41:59 UTC

*** Bug 1438033 has been marked as a duplicate of this bug. ***

Comment 14 Eric Paris 2017-04-19 18:04:24 UTC

https://github.com/openshift/ose/pull/705

Comment 15 Troy Dawson 2017-04-20 15:57:56 UTC

This has been merged into ocp and is in OCP v3.4.1.18 or newer.

Comment 17 zhaliu 2017-04-21 08:33:43 UTC

This problem is difficult to reproduce, and there is no good way to verify this bug. I will keep monitoring the routers' status and logs in the online free int environment. On the other hand, I will try to find a reproducer.

Comment 19 zhaozhanqi 2017-04-21 13:44:00 UTC

Verified this bug on v3.4.1.18, using the same steps as in https://bugzilla.redhat.com/show_bug.cgi?id=1442860#c4

Please correct me if these steps are not sufficient.

Comment 23 errata-xmlrpc 2017-04-26 05:37:03 UTC

Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2017:1129
