Description of problem:
WebSocket notifications leak ESTABLISHED connections under certain conditions. The number of ESTABLISHED connections keeps growing until it exhausts the maximum number of concurrent connections, at which point the UI freezes.

Version-Release number of selected component (if applicable):
5.7.0.14

How reproducible:
Fairly regularly. QE has an environment where they can consistently see this error.

Steps to Reproduce:
1. At the moment, the steps consist of running through a set of QE's test suites.

Actual results:
During certain operations, apparently refreshing the requests page and creating service dialogs, the notification calls seem to remain incomplete, which increases the number of connections left open.

Expected results:
There should be no leaking connections.

Additional info:
So, we've been investigating this over the last week. To cut a long story short, what we originally thought could only be replicated in a QE environment turned out to be reproducible on any appliance using the new websocket connections. The problem seems to be worst in Firefox, but portions of it can be seen in Chrome too.

Here is what we observe. When loading a page, a connection is made to the /ws/notifications URI, which in turn initiates a connection from Apache to port 5000 on the internal Ruby process (rack). When navigating to a new page, this connection is not cleaned up and remains for a long time, even after the browser is closed. The more pages are visited, the more ESTABLISHED connections stack up internally.

We also saw that Firefox can rack up connections that seem to hang. When the number of these exceeds the browser's maximum number of connections, certain aspects of the UI refuse to function. There appear to be situations where Firefox and other browsers are unable to connect to the WebSocket service. When this happens, it often leaks a connection. Occasionally, this can lead to some kind of infinite retry, which fills up the connections with stale, hanging TCP connections. Often there seems to be one connection that is in use and another that is stale but kept alive by the browser.

Through some intensive debugging, we discovered that this seems to be down to an issue with Apache. Apache 2.4.6, which currently ships with RHEL7, shows the problem, and so does Apache 2.4.18, which is in SCL. Apache 2.4.23, which is currently in Fedora 24, does not seem to yield the same error: there are no hanging connections on port 5000 internally, and the issue seems to be gone in the browsers too. I'd like to do a little more testing, but it seems that a major upgrade of Apache is needed to solve this bug.
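For anyone who wants to watch the internal connections stack up while clicking through the UI, here is a minimal sketch that counts ESTABLISHED sockets on local port 5000 by parsing /proc/net/tcp on the appliance. The function names are made up for illustration, and the script assumes a Linux host where the rack process listens on port 5000 as described above; it is not part of the product.

```python
"""Count ESTABLISHED TCP sockets on a local port by parsing /proc/net/tcp (Linux only)."""

TCP_ESTABLISHED = "01"  # kernel state code for ESTABLISHED in /proc/net/tcp


def count_established(lines, port):
    """Return how many sockets in `lines` are ESTABLISHED with local port `port`.

    Only the local port is matched, so a loopback connection (Apache -> rack)
    is counted once, on the listening side, rather than once per endpoint.
    """
    count = 0
    for line in lines:
        fields = line.split()
        # Skip the header row and anything that is not a socket entry.
        if len(fields) < 4 or ":" not in fields[1]:
            continue
        local_port = int(fields[1].rsplit(":", 1)[1], 16)  # port is hex-encoded
        if local_port == port and fields[3] == TCP_ESTABLISHED:
            count += 1
    return count


def read_socket_tables(paths=("/proc/net/tcp", "/proc/net/tcp6")):
    """Read the kernel socket tables, ignoring any that are unavailable."""
    lines = []
    for path in paths:
        try:
            with open(path) as handle:
                lines.extend(handle)
        except OSError:
            pass  # table not available (for example, IPv6 disabled)
    return lines


if __name__ == "__main__":
    # Port 5000 is the internal rack port mentioned in this report.
    print(count_established(read_socket_tables(), 5000))
```

Running it repeatedly (for example under `watch`) while navigating between pages should show whether the count keeps climbing.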
Chris, can you reach out to the SCL team and see what it will take to get the newer Apache version available downstream?
So this is also causing remote connection failures. Putting the Fedora 24 container in front of the appliance to proxy the web connections removes the issue completely, for whatever reason.
Is there a way to work around this issue?
David, do you know if there is any workaround?
Unfortunately, there is no solution other than updating httpd, as the bug is in Apache's mod_proxy_wstunnel. The only thing that can help is disabling WebsocketWorker, but this will turn off asynchronous notifications and VM remote consoles.
Waiting for a scratch build of Apache with the updated module so we can test whether the new module does indeed fix the problem. If it does, we can discuss options for a hot-fix and z-stream release.
https://github.com/ManageIQ/manageiq-gems-pending/pull/217
https://github.com/ManageIQ/manageiq-appliance/pull/126
New commit detected on ManageIQ/manageiq-appliance/master:
https://github.com/ManageIQ/manageiq-appliance/commit/242ee1edddc890ee87a46488fa4a83cce9da97d1

commit 242ee1edddc890ee87a46488fa4a83cce9da97d1
Author:     Dávid Halász <dhalasz>
AuthorDate: Wed Jun 28 16:20:55 2017 +0200
Commit:     Dávid Halász <dhalasz>
CommitDate: Wed Jun 28 17:54:52 2017 +0200

    Disable connection reuse for WebSocket connections in Apache

    This is a temporary workaround for the issue described here:
    https://bugzilla.redhat.com/show_bug.cgi?id=1404354

    This can be reverted after httpd is updated to 2.4.25 or newer

 .../httpd/conf.d/manageiq-balancer-websocket.conf | 23 ++++++++++++----------
 1 file changed, 13 insertions(+), 10 deletions(-)
https://github.com/ManageIQ/manageiq-gems-pending/pull/219
New commit detected on ManageIQ/manageiq-gems-pending/fine:
https://github.com/ManageIQ/manageiq-gems-pending/commit/65842a4ca9a18ec0771aec0cfb2f4f416e3e91e8

commit 65842a4ca9a18ec0771aec0cfb2f4f416e3e91e8
Author:     Dávid Halász <dhalasz>
AuthorDate: Wed Jun 28 13:29:45 2017 +0200
Commit:     Dávid Halász <dhalasz>
CommitDate: Thu Jun 29 17:24:37 2017 +0200

    Disable connection reuse for WebSocket connections in Apache

    This is a temporary workaround for the issue described here:
    https://bugzilla.redhat.com/show_bug.cgi?id=1404354

    This can be reverted after httpd is updated to 2.4.25 or newer

 lib/gems/pending/util/miq_apache/miq_apache.rb | 10 ++++++++--
 spec/util/miq_apache/conf_spec.rb              | 25 ++++++++++++++++++++++++-
 2 files changed, 32 insertions(+), 3 deletions(-)
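To illustrate the idea behind the workaround: mod_proxy workers accept a `disablereuse=On` parameter, which makes Apache close the backend connection after each use instead of keeping it around. The sketch below is hypothetical and is not the code from miq_apache.rb. The function name and the member URL are assumptions, and whether the commits above use exactly this parameter is also an assumption. It only shows how a config generator might append the parameter to a balancer member line.

```python
def balancer_member_line(url, disable_reuse=False):
    """Build a mod_proxy BalancerMember directive for the given backend URL.

    Hypothetical helper: when `disable_reuse` is true, the mod_proxy worker
    parameter `disablereuse=On` is appended so backend connections are not reused.
    """
    parts = ["BalancerMember", url]
    if disable_reuse:
        parts.append("disablereuse=On")
    return " ".join(parts)


if __name__ == "__main__":
    # Example backend URL; the real appliance configuration may differ.
    print(balancer_member_line("ws://0.0.0.0:5000", disable_reuse=True))
```

Keeping the parameter behind a flag makes it straightforward to drop once httpd is updated to 2.4.25 or newer, as the commit message suggests.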
Verified