Bug 1404354 - websocket connection leaks causing failed connections
Status: CLOSED CURRENTRELEASE
Product: Red Hat CloudForms Management Engine
Classification: Red Hat
Component: Build
Version: 5.6.0
Hardware: Unspecified
OS: Unspecified
Priority: high
Severity: high
Target Milestone: GA
Target Release: 5.9.0
Assigned To: Gregg Tanzillo
QA Contact: Pete Savage
Keywords: TestOnly, ZStream
Depends On:
Blocks: 1395782 1468281 1468633
Reported: 2016-12-13 11:15 EST by Pete Savage
Modified: 2018-03-13 04:59 EDT

See Also:
Fixed In Version: 5.9.0.1
Doc Type: Known Issue
Doc Text:
Currently, connecting to virtual machines using HTML5 console access fails intermittently. This is due to an issue in the underlying Apache web server related to WebSocket connections, which are used for remote console access to virtual machines. As a workaround, retry the connection. If the connection fails again, wait a minute and retry. Engineering is investigating this issue and aims to provide a solution in the first update to Red Hat CloudForms 4.2.
Story Points: ---
Clone Of:
Clones: 1468281 1468633
Environment:
Last Closed: 2018-03-06 10:02:28 EST
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: CFME Core


Attachments: None
Description Pete Savage 2016-12-13 11:15:33 EST
Description of problem: Websocket notifications leak ESTABLISHED connections under certain conditions. This shows up as an increasing number of ESTABLISHED connections, which eventually exhausts the maximum number of concurrent connections and freezes the UI.


Version-Release number of selected component (if applicable): 5.7.0.14


How reproducible: Fairly regularly; QE has an environment where they can consistently see this error.


Steps to Reproduce:
1. At the moment, the steps consist of running through a set of QE's test suites.

Actual results: During certain operations, apparently refreshing the requests page and creating service dialogs, the notification calls seem to remain incomplete, leading to an increase in the number of connections left open.


Expected results: There should be no leaking connections.


Additional info:
Comment 3 Pete Savage 2016-12-16 05:38:16 EST
So, we've been investigating this over the last week and, to cut a long story short, what we originally thought we could only replicate in a QE environment turned out to be something we could replicate on any appliance using the new websocket connections. The problem seems to be worst in Firefox, but portions of it can be seen in Chrome too.

Here is what we observe. When loading a page, a connection is made to the /ws/notifications URI, which in turn initiates a connection from Apache to port 5000 on the internal Ruby process (Rack). When navigating to a new page, this connection is not cleaned up and remains open for a long period of time, even after the browser is closed. The more pages are visited, the more ESTABLISHED connections stack up internally.
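
One way to watch the leak described above is to count ESTABLISHED sockets whose local port is 5000. The following is a minimal Python sketch, not something taken from this bug: it assumes a Linux appliance and reads the kernel's /proc/net/tcp table (IPv4 only; /proc/net/tcp6 uses the same layout). The names count_established and count_from_proc are hypothetical.

```python
from pathlib import Path

TCP_ESTABLISHED = 0x01  # socket state code used in /proc/net/tcp


def count_established(proc_net_tcp_text, port):
    """Count ESTABLISHED sockets whose *local* port equals `port`.

    Only the local side is checked so that a loopback connection
    (Apache -> Rack on the same host) is counted once, not twice.
    """
    count = 0
    for line in proc_net_tcp_text.splitlines():
        fields = line.split()
        # Data rows start with an index such as "0:"; skip the header row.
        if len(fields) < 4 or not fields[0].endswith(":"):
            continue
        local_address, state = fields[1], fields[3]
        local_port = int(local_address.rsplit(":", 1)[1], 16)
        if local_port == port and int(state, 16) == TCP_ESTABLISHED:
            count += 1
    return count


def count_from_proc(port, path="/proc/net/tcp"):
    """Read the live kernel table and count matching sockets (Linux only)."""
    return count_established(Path(path).read_text(), port)


# Hand-written sample rows (only the first four columns are shown).
SAMPLE = (
    "  sl  local_address rem_address   st\n"
    "   0: 0100007F:1388 00000000:0000 0A\n"  # LISTEN on port 5000
    "   1: 0100007F:1388 0100007F:C350 01\n"  # ESTABLISHED, server side
    "   2: 0100007F:C350 0100007F:1388 01\n"  # ESTABLISHED, client side
    "   3: 0100007F:1388 0100007F:C351 06\n"  # TIME_WAIT
)
```

Calling count_from_proc(5000) repeatedly while navigating between pages should show whether the number keeps growing.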

We also saw that Firefox can rack up connections that seem to hang. When the number of these exceeds the browser's maximum number of connections, certain aspects of the UI refuse to function.

There appear to be situations where Firefox and other browsers are unable to make the connection to the WebSocket service. When this happens, it often leaks a connection. Occasionally, this can lead to some kind of infinite retry, which just fills up the connections with stale, hanging TCP connections.

Often there seems to be one connection that is used, and another that is just stale, though it is kept alive by the browser.

Through some intensive debugging, we discovered that it seems to be down to an issue with Apache. Apache 2.4.6, which currently ships with RHEL 7, shows the problem, and so does Apache 2.4.18, which is in SCL. Apache 2.4.23, which is currently in Fedora 24, does not seem to yield the same error: there are no hanging connections on port 5000 internally, and the issue seems to be gone in the browsers too.

I'd like to do a little more testing, but it seems like a major upgrade of Apache is needed to solve this bug.
Comment 4 Dave Johnson 2016-12-19 13:51:33 EST
Chris, can you reach out to the SCL team and see what it will take to get the newer Apache version available downstream?
Comment 5 Dave Johnson 2016-12-20 23:45:14 EST
So this is also causing remote connection failures. Adding the Fedora 24 container in front of the appliance to proxy the web connections removes the issue completely, for whatever reason.
Comment 11 Felix Dewaleyne 2017-02-10 11:23:42 EST
Is there a way to work around this issue?
Comment 12 Satoe Imaishi 2017-02-10 12:36:03 EST
David, do you know if there is any workaround?
Comment 13 Dávid Halász 2017-02-13 03:21:05 EST
Unfortunately, there is no solution other than updating httpd, as the bug is in Apache's mod_proxy_wstunnel. The only thing that can help is disabling WebsocketWorker, but this will turn off asynchronous notifications and VM remote consoles.
Comment 22 Marianne Feifer 2017-06-14 15:00:49 EDT
Waiting for a scratch build of Apache with the updated module to be tested, to see whether the new module indeed fixes the problem. If it does, we can discuss options for a hot-fix and z-stream release.
Comment 29 CFME Bot 2017-06-28 12:37:47 EDT
New commit detected on ManageIQ/manageiq-appliance/master:
https://github.com/ManageIQ/manageiq-appliance/commit/242ee1edddc890ee87a46488fa4a83cce9da97d1

commit 242ee1edddc890ee87a46488fa4a83cce9da97d1
Author:     Dávid Halász <dhalasz@redhat.com>
AuthorDate: Wed Jun 28 16:20:55 2017 +0200
Commit:     Dávid Halász <dhalasz@redhat.com>
CommitDate: Wed Jun 28 17:54:52 2017 +0200

    Disable connection reuse for WebSocket connections in Apache
    
    This is a temporary workaround for the issue described here:
    https://bugzilla.redhat.com/show_bug.cgi?id=1404354
    
    This can be reverted after httpd is updated to 2.4.25 or newer

 .../httpd/conf.d/manageiq-balancer-websocket.conf  | 23 ++++++++++++----------
 1 file changed, 13 insertions(+), 10 deletions(-)
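
The revert condition in the commit message above ("httpd is updated to 2.4.25 or newer") can be expressed as a simple version comparison. This is only a sketch under assumptions: the function names are hypothetical, the input is assumed to look like the output of `httpd -v`, and the 2.4.25 threshold is taken from the commit message (Comment 3 reports that 2.4.23 already seemed unaffected, so the threshold may be conservative). Distribution packages can also carry backported fixes, so an upstream version number alone may not settle the question.

```python
import re

# Threshold taken from the commit message; treat it as an assumption.
WORKAROUND_REVERT_MIN = (2, 4, 25)


def parse_httpd_version(text):
    """Extract the first x.y.z version found in `text` as a tuple of ints."""
    match = re.search(r"(\d+)\.(\d+)\.(\d+)", text)
    if match is None:
        raise ValueError(f"no version number found in {text!r}")
    return tuple(int(part) for part in match.groups())


def workaround_still_needed(version_text):
    """Return True while httpd is older than the revert threshold."""
    return parse_httpd_version(version_text) < WORKAROUND_REVERT_MIN
```

For example, passing the string "Server version: Apache/2.4.6 (CentOS)" yields a version below the threshold, so the workaround would stay in place.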
Comment 32 CFME Bot 2017-07-06 09:58:33 EDT
New commit detected on ManageIQ/manageiq-gems-pending/fine:
https://github.com/ManageIQ/manageiq-gems-pending/commit/65842a4ca9a18ec0771aec0cfb2f4f416e3e91e8

commit 65842a4ca9a18ec0771aec0cfb2f4f416e3e91e8
Author:     Dávid Halász <dhalasz@redhat.com>
AuthorDate: Wed Jun 28 13:29:45 2017 +0200
Commit:     Dávid Halász <dhalasz@redhat.com>
CommitDate: Thu Jun 29 17:24:37 2017 +0200

    Disable connection reuse for WebSocket connections in Apache
    
    This is a temporary workaround for the issue described here:
    https://bugzilla.redhat.com/show_bug.cgi?id=1404354
    
    This can be reverted after httpd is updated to 2.4.25 or newer

 lib/gems/pending/util/miq_apache/miq_apache.rb | 10 ++++++++--
 spec/util/miq_apache/conf_spec.rb              | 25 ++++++++++++++++++++++++-
 2 files changed, 32 insertions(+), 3 deletions(-)
Comment 35 Pete Savage 2018-02-26 02:58:23 EST
Verified
