Bug 1404354 - websocket connection leaks causing failed connections
Summary: websocket connection leaks causing failed connections
Keywords:
Status: CLOSED CURRENTRELEASE
Alias: None
Product: Red Hat CloudForms Management Engine
Classification: Red Hat
Component: Build
Version: 5.6.0
Hardware: Unspecified
OS: Unspecified
Priority: high
Severity: high
Target Milestone: GA
Target Release: 5.9.0
Assignee: Gregg Tanzillo
QA Contact: Pete Savage
URL:
Whiteboard:
Depends On:
Blocks: 1395782 1468281 1468633
 
Reported: 2016-12-13 16:15 UTC by Pete Savage
Modified: 2021-06-10 11:44 UTC (History)
CC List: 18 users

Fixed In Version: 5.9.0.1
Doc Type: Known Issue
Doc Text:
Currently, connections to virtual machines through HTML5 console access fail intermittently. This is due to an issue in the underlying Apache web server related to WebSocket connections, which are used for remote console access to virtual machines. As a workaround, retry the connection. If the connection fails again, wait a minute and retry. Engineering is investigating this issue and aims to have a solution in the first update to Red Hat CloudForms 4.2.
Clone Of:
Clones: 1468281 1468633
Environment:
Last Closed: 2018-03-06 15:02:28 UTC
Category: ---
Cloudforms Team: CFME Core
Target Upstream Version:
Embargoed:




Links
System ID Private Priority Status Summary Last Updated
Red Hat Bugzilla 1395782 0 unspecified CLOSED Trying to connect to VM console randomly fails on RHV environments 2021-02-22 00:41:40 UTC
Red Hat Bugzilla 1455312 0 low CLOSED HTML5 Console Does Not Connect Sometimes on First Attempt 2021-02-22 00:41:40 UTC

Internal Links: 1395782 1455312

Description Pete Savage 2016-12-13 16:15:33 UTC
Description of problem: WebSocket notifications leak ESTABLISHED connections under certain conditions. This shows up as an increasing number of ESTABLISHED connections, which eventually exhausts the maximum number of concurrent connections and freezes the UI.


Version-Release number of selected component (if applicable): 5.7.0.14


How reproducible: Fairly regularly. QE has an environment where they can consistently see this error.


Steps to Reproduce:
1. At the moment, the steps consist of running through a set of QE's test suites.

Actual results: During certain operations, seemingly refreshing the requests page and creating service dialogs, the notification calls appear to remain incomplete, which increases the number of connections left open.


Expected results: There should be no leaking connections.


Additional info:

Comment 3 Pete Savage 2016-12-16 10:38:16 UTC
So, we've been investigating this over the last week, and to cut a long story short, what we originally thought was something that we could only replicate in a QE environment, turned out to be something that we could replicate on any appliance using the new websocket connections. The problem seems to be worst in Firefox, but portions of it can be seen in Chrome too.

Here is what we observe. When loading a page, a connection is made to the /ws/notifications URI, which in turn initiates a connection from Apache to port 5000 on the internal Ruby process (Rack). When navigating to a new page, this connection is not cleaned up and remains for a long period of time, even after the browser is closed. The more pages that are visited, the more ESTABLISHED connections stack up internally.
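
One way to watch this build-up on an appliance is to count ESTABLISHED sockets whose peer port is 5000. The following is only a minimal sketch, not tooling from this bug: it assumes text in the column layout printed by `ss -tn` (State, Recv-Q, Send-Q, Local Address:Port, Peer Address:Port), and the function name and sample data are hypothetical. Counting by peer port only avoids double counting, since a loopback connection shows up once from each end.

```python
def count_established(ss_output: str, port: int = 5000) -> int:
    """Count ESTABLISHED TCP sockets whose peer port equals `port`.

    Assumes `ss -tn` style columns:
    State Recv-Q Send-Q Local-Address:Port Peer-Address:Port
    """
    count = 0
    for line in ss_output.splitlines():
        fields = line.split()
        # Skip the header, blank lines, and sockets in other states.
        if len(fields) < 5 or fields[0] != "ESTAB":
            continue
        peer = fields[4]
        # The port is whatever follows the last colon (also works for IPv6).
        if peer.rsplit(":", 1)[-1] == str(port):
            count += 1
    return count


# Hypothetical sample data, made up for illustration only.
SAMPLE = """\
State      Recv-Q Send-Q Local Address:Port   Peer Address:Port
ESTAB      0      0      127.0.0.1:48212      127.0.0.1:5000
ESTAB      0      0      127.0.0.1:5000       127.0.0.1:48212
ESTAB      0      0      127.0.0.1:48216      127.0.0.1:5000
TIME-WAIT  0      0      127.0.0.1:48100      127.0.0.1:5000
ESTAB      0      0      10.0.0.5:443         10.0.0.9:51234
"""
```

Feeding the function the real output of `ss -tn` at intervals while browsing between pages should show whether the count keeps growing, which is the symptom described above.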

We also saw that Firefox can rack up connections that seem to hang. When the number of these exceeds the browser's maximum number of connections, certain aspects of the UI refuse to function.

There appear to be situations where Firefox and other browsers are unable to make the connection to the WebSocket service. When this happens, it often leaks a connection. Occasionally, this can lead to some kind of infinite retry, which just fills up the connections with stale, hanging TCP connections.

Often there seems to be one connection that is used, and another that is just stale, though it is kept alive by the browser.

Through some intensive debugging, we discovered that it seems to be down to an issue with Apache. Apache 2.4.6, which currently ships with RHEL 7, shows the problem, and so does Apache 2.4.18, which is in SCL. Apache 2.4.23, which is currently in Fedora 24, does not seem to yield the same error: there are no hanging connections on port 5000 internally, and the issue seems to be gone in the browsers too.

I'd like to do a little more testing, but it seems like a major upgrade of Apache is needed to solve this bug.

Comment 4 Dave Johnson 2016-12-19 18:51:33 UTC
Chris, can you reach out to the SCL team and see what it will take to get the newer Apache version available downstream?

Comment 5 Dave Johnson 2016-12-21 04:45:14 UTC
So this is also causing remote connection failures. Putting a Fedora 24 container in front of the appliance to proxy the web connections removes the issue completely, for whatever reason.

Comment 11 Felix Dewaleyne 2017-02-10 16:23:42 UTC
Is there a way to work around this issue?

Comment 12 Satoe Imaishi 2017-02-10 17:36:03 UTC
David, do you know if there is any workaround?

Comment 13 Dávid Halász 2017-02-13 08:21:05 UTC
Unfortunately, there is no solution other than updating httpd, as the bug is in Apache's mod_proxy_wstunnel. The only thing that can help is disabling WebsocketWorker, but this will turn off asynchronous notifications and VM remote consoles.

Comment 22 Marianne Feifer 2017-06-14 19:00:49 UTC
Waiting for a scratch build of Apache with the updated module to be tested, to see whether the new module indeed fixes the problem. If it does, we can discuss options for a hot-fix and z-stream release.

Comment 29 CFME Bot 2017-06-28 16:37:47 UTC
New commit detected on ManageIQ/manageiq-appliance/master:
https://github.com/ManageIQ/manageiq-appliance/commit/242ee1edddc890ee87a46488fa4a83cce9da97d1

commit 242ee1edddc890ee87a46488fa4a83cce9da97d1
Author:     Dávid Halász <dhalasz>
AuthorDate: Wed Jun 28 16:20:55 2017 +0200
Commit:     Dávid Halász <dhalasz>
CommitDate: Wed Jun 28 17:54:52 2017 +0200

    Disable connection reuse for WebSocket connections in Apache
    
    This is a temporary workaround for the issue described here:
    https://bugzilla.redhat.com/show_bug.cgi?id=1404354
    
    This can be reverted after httpd is updated to 2.4.25 or newer

 .../httpd/conf.d/manageiq-balancer-websocket.conf  | 23 ++++++++++++----------
 1 file changed, 13 insertions(+), 10 deletions(-)
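
The commit message ties the workaround to the httpd version, so a small check can tell whether a given appliance is still below 2.4.25. This is only an illustrative sketch, not part of the commit: it assumes input in the form printed by `httpd -v` (a line such as `Server version: Apache/2.4.6 (...)`), and the function names and the `FIXED_IN` constant are hypothetical. The 2.4.25 threshold is taken from the commit message above. Comment 3 reports that 2.4.23 already did not show the issue in testing, so the threshold is the conservative choice.

```python
import re

# Threshold from the commit message: "2.4.25 or newer".
FIXED_IN = (2, 4, 25)


def parse_httpd_version(version_output: str) -> tuple:
    """Extract (major, minor, patch) from `httpd -v` style output."""
    match = re.search(r"Apache/(\d+)\.(\d+)\.(\d+)", version_output)
    if match is None:
        raise ValueError("no Apache version found in input")
    return tuple(int(part) for part in match.groups())


def workaround_still_needed(version_output: str) -> bool:
    """Return True while httpd is older than the version named in the commit."""
    # Tuples compare element by element, so (2, 4, 6) < (2, 4, 25) holds.
    return parse_httpd_version(version_output) < FIXED_IN
```

Comparing integer tuples rather than raw strings matters here: as strings, "2.4.6" would sort after "2.4.25".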

Comment 32 CFME Bot 2017-07-06 13:58:33 UTC
New commit detected on ManageIQ/manageiq-gems-pending/fine:
https://github.com/ManageIQ/manageiq-gems-pending/commit/65842a4ca9a18ec0771aec0cfb2f4f416e3e91e8

commit 65842a4ca9a18ec0771aec0cfb2f4f416e3e91e8
Author:     Dávid Halász <dhalasz>
AuthorDate: Wed Jun 28 13:29:45 2017 +0200
Commit:     Dávid Halász <dhalasz>
CommitDate: Thu Jun 29 17:24:37 2017 +0200

    Disable connection reuse for WebSocket connections in Apache
    
    This is a temporary workaround for the issue described here:
    https://bugzilla.redhat.com/show_bug.cgi?id=1404354
    
    This can be reverted after httpd is updated to 2.4.25 or newer

 lib/gems/pending/util/miq_apache/miq_apache.rb | 10 ++++++++--
 spec/util/miq_apache/conf_spec.rb              | 25 ++++++++++++++++++++++++-
 2 files changed, 32 insertions(+), 3 deletions(-)

Comment 35 Pete Savage 2018-02-26 07:58:23 UTC
Verified

