Description of problem:
Customers are reporting that routes are not updating and that the router itself is not updating. The router logs show no new entries for hours, and the config/map files have not changed for hours either.

Version-Release number of selected component (if applicable):
3.4

How reproducible:
Undetermined

Steps to Reproduce:
1. We believe this happens more often the more frequent and regular route updates are.

Actual results:
Applications (mainly those on new routes) return 503 errors and show the router's 503 error page.

Expected results:
The router should not stop pulling updates from OCP, or it should restart if it is having issues pulling such routing information.

Additional info:
*** Bug 1441732 has been marked as a duplicate of this bug. ***
Commit pushed to master at https://github.com/openshift/origin

https://github.com/openshift/origin/commit/fad6679bcba9182a4bea9b6380b7312b730723f7
Prevent the router from deadlocking itself when calling Commit()

The router reload function (Commit()) was changed so that it could be rate limited. The rate limiter tracks changes in a kcache.Queue object. It will coalesce changes to the same call, so that if three calls to Invoke() on the rate limited function happen before the next time it is allowed to process, then only one will occur.

Our problem was that we were doing:

Thread 1 (the rate-limiter background process):
 - Wake up
 - Ratelimit sees there is work to be done and calls fifo.go's Pop() function
 - Fifo.go acquires a fifo lock and calls the processing function
 - Router.go's commitAndReload() function acquires a lock on the router object

Thread 2 (triggered by the event handler that commits changes to the router):
 - Get the event and process it
 - Since there are changes to be made, call router.Commit()
 - Commit() grabs a lock on the router object
 - Then calls the rate-limiter wrapper around commitAndReload() using Invoke() to queue work
 - In order to queue the work... it acquires a lock on the fifo

So thread 1 locks: fifo then router; thread 2 locks: router then fifo. If you get unlucky, those threads deadlock and you never process another event.

The fix is to release the lock on the router object in our Commit() function before we call Invoke() on the rate limited function. The lock is not actually protecting anything at that point since the rate limited function does its own locking, and is run in a separate thread anyway.

Fixes bug 1440977 (https://bugzilla.redhat.com/show_bug.cgi?id=1440977)
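For readers who want to see the lock ordering in isolation, here is a minimal Go sketch of the fixed pattern described in the commit message. It is not the origin code: workQueue, router, their fields, and runDemo are hypothetical stand-ins for the kcache.Queue-backed rate limiter and the router object, and only Commit(), Invoke(), Pop(), and commitAndReload() borrow their names from the description above. Under those assumptions, Commit() drops the router lock before it queues work, so no thread ever holds the router lock while waiting for the fifo lock.

```go
package main

import (
	"fmt"
	"sync"
)

// workQueue is a hypothetical stand-in for the coalescing fifo behind the rate limiter.
type workQueue struct {
	lock    sync.Mutex
	pending bool
}

// Invoke queues work; repeated calls before the next Pop coalesce into one.
func (q *workQueue) Invoke() {
	q.lock.Lock()
	defer q.lock.Unlock()
	q.pending = true
}

// Pop runs process while holding the queue lock, as the commit message describes for fifo.go.
func (q *workQueue) Pop(process func()) {
	q.lock.Lock()
	defer q.lock.Unlock()
	if q.pending {
		q.pending = false
		process()
	}
}

// router is a hypothetical, heavily simplified model of the router object.
type router struct {
	lock    sync.Mutex
	queue   *workQueue
	commits int
	reloads int
}

// commitAndReload is invoked from Pop, so this path locks fifo first, then router.
func (r *router) commitAndReload() {
	r.lock.Lock()
	defer r.lock.Unlock()
	r.reloads++
}

// Commit releases the router lock BEFORE calling Invoke, so it never holds
// the router lock while waiting for the fifo lock.
func (r *router) Commit() {
	r.lock.Lock()
	r.commits++
	r.lock.Unlock()
	r.queue.Invoke()
}

// runDemo races an event handler against a background worker and returns the counters.
func runDemo(events int) (commits, reloads int) {
	r := &router{queue: &workQueue{}}
	stop := make(chan struct{})
	var wg sync.WaitGroup

	wg.Add(1)
	go func() { // thread 1: rate-limiter background worker
		defer wg.Done()
		for {
			select {
			case <-stop:
				return
			default:
				r.queue.Pop(r.commitAndReload)
			}
		}
	}()

	for i := 0; i < events; i++ { // thread 2: event handler
		r.Commit()
	}
	close(stop)
	wg.Wait()
	r.queue.Pop(r.commitAndReload) // drain anything still queued

	r.lock.Lock()
	defer r.lock.Unlock()
	return r.commits, r.reloads
}

func main() {
	commits, reloads := runDemo(100000)
	fmt.Printf("commits=%d reloads=%d (reloads are coalesced)\n", commits, reloads)
}
```

If Commit() instead kept the router lock held across the Invoke() call, the two goroutines would take the locks in opposite orders and could hang exactly as described in the commit message. The reload count varies from run to run because queued work is coalesced.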
*** Bug 1418124 has been marked as a duplicate of this bug. ***
*** Bug 1438033 has been marked as a duplicate of this bug. ***
https://github.com/openshift/ose/pull/705
This has been merged into OCP and is available in v3.4.1.18 and newer.
This problem is difficult to reproduce, and there is no good way to verify this bug. I will keep monitoring the routers' status and logs in the online free int environment. In the meantime, I will try to find a reproducer.
Verified this bug on v3.4.1.18 using the same steps as in https://bugzilla.redhat.com/show_bug.cgi?id=1442860#c4. Please correct me if these steps are not enough.
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHBA-2017:1129