Bug 1404354

Summary: websocket connection leaks causing failed connections
Product: Red Hat CloudForms Management Engine Reporter: Pete Savage <psavage>
Component: Build Assignee: Gregg Tanzillo <gtanzill>
Status: CLOSED CURRENTRELEASE QA Contact: Pete Savage <psavage>
Severity: high Docs Contact:
Priority: high    
Version: 5.6.0 CC: adahms, cpelland, dajohnso, dhalasz, fdewaley, gekis, hkataria, jhardy, jorton, jpazdziora, jrafanie, mfeifer, mpovolny, obarenbo, psavage, rspagnol, simaishi, tachoi
Target Milestone: GA Keywords: TestOnly, ZStream
Target Release: 5.9.0   
Hardware: Unspecified   
OS: Unspecified   
Whiteboard:
Fixed In Version: 5.9.0.1 Doc Type: Known Issue
Doc Text:
Currently, connecting to virtual machines using HTML5 console access fails intermittently. This is due to an issue in the underlying Apache web server related to WebSocket connections, which are used for remote console access to virtual machines. As a workaround, retry the connection. If the connection fails again, wait a minute and retry. Engineering is investigating this issue and aims to provide a solution in the first update to Red Hat CloudForms 4.2.
Story Points: ---
Clone Of:
: 1468281 1468633 Environment:
Last Closed: 2018-03-06 15:02:28 UTC Type: Bug
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: --- RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: CFME Core Target Upstream Version:
Embargoed:
Bug Depends On:    
Bug Blocks: 1395782, 1468281, 1468633    

Description Pete Savage 2016-12-13 16:15:33 UTC
Description of problem: WebSocket notifications leak ESTABLISHED connections under certain conditions. This shows up as an increasing number of ESTABLISHED connections, which eventually exhausts the maximum number of concurrent connections and freezes the UI.


Version-Release number of selected component (if applicable): 5.7.0.14


How reproducible: Fairly regularly. QE has an environment where they can consistently see this error.


Steps to Reproduce:
1. At the moment, the steps consist of running through a set of QE's test suites.

Actual results: During certain operations, apparently refreshing the requests page and creating service dialogs, the notifications calls remain incomplete, which increases the number of connections left open.


Expected results: There should be no leaking connections.


Additional info:

Comment 3 Pete Savage 2016-12-16 10:38:16 UTC
So, we've been investigating this over the last week. To cut a long story short, what we originally thought we could only replicate in a QE environment turned out to be something we could replicate on any appliance using the new WebSocket connections. The problem seems to be worst in Firefox, but portions of it can be seen in Chrome too.

Here is what we observe. When loading a page, a connection is made to the /ws/notifications URI, which in turn initiates a connection from Apache to port 5000 on the internal Ruby process (Rack). When navigating to a new page, this connection is not cleaned up and remains for a long period of time, even after the browser is closed. The more pages that are visited, the more ESTABLISHED connections stack up internally.
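One way to watch these connections stack up is to count ESTABLISHED sockets whose peer port is 5000. The following is a minimal sketch, assuming the input is text in the column layout printed by `ss -tn`. The function name and the sample rows are hypothetical and were not captured from an appliance.

```python
def count_established(ss_output: str, backend_port: int = 5000) -> int:
    """Count ESTABLISHED TCP connections whose peer port is backend_port.

    Assumes the column layout of `ss -tn`:
    State Recv-Q Send-Q Local-Address:Port Peer-Address:Port
    """
    count = 0
    for line in ss_output.splitlines():
        fields = line.split()
        # Skip the header row and anything that is not an ESTAB data row.
        if len(fields) < 5 or fields[0] != "ESTAB":
            continue
        # Looking only at the peer column counts a loopback connection once,
        # even though both of its ends are listed on the same host.
        peer = fields[4]
        # rsplit keeps IPv6 peers such as [::1]:5000 intact.
        if peer.rsplit(":", 1)[-1] == str(backend_port):
            count += 1
    return count


# Hypothetical sample rows for illustration only.
SAMPLE = """\
State  Recv-Q Send-Q Local Address:Port  Peer Address:Port
ESTAB  0      0      127.0.0.1:48212     127.0.0.1:5000
ESTAB  0      0      127.0.0.1:48214     127.0.0.1:5000
ESTAB  0      0      10.0.0.5:443        10.0.0.9:51234
TIME-WAIT 0   0      127.0.0.1:48100     127.0.0.1:5000
"""

print(count_established(SAMPLE))
```

Running this periodically while navigating between pages should show whether the count keeps growing, which is the symptom described above.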

We also saw that Firefox can rack up connections that seem to hang. When the number of these exceeds the browser's maximum number of connections, certain aspects of the UI refuse to function.

There appear to be situations where Firefox and other browsers are unable to make the connection to the WebSocket service. When this happens, it often leaks a connection. Occasionally, this can lead to some kind of infinite retry that fills up the connections with stale, hanging TCP connections.

Often there seems to be one connection that is used, and another that is just stale, though it is kept alive by the browser.

Through some intensive debugging, we discovered that it seems to be down to an issue with Apache. Apache 2.4.6, which currently ships with RHEL 7, shows the problem, as does Apache 2.4.18, which is in SCL. Apache 2.4.23, which is currently in Fedora 24, does not seem to yield the same error: there are no hanging connections on port 5000 internally, and the issue seems to be gone in the browsers too.

I'd like to do a little more testing, but it seems like a major upgrade of Apache is needed to solve this bug.

Comment 4 Dave Johnson 2016-12-19 18:51:33 UTC
Chris, can you reach out to the SCL team and see what it will take to get the newer Apache version available downstream?

Comment 5 Dave Johnson 2016-12-21 04:45:14 UTC
So this is also causing remote connection failures. Putting the Fedora 24 container in front of the appliance to proxy the web connections removes the issue completely, for whatever reason.

Comment 11 Felix Dewaleyne 2017-02-10 16:23:42 UTC
Is there a way to work around this issue?

Comment 12 Satoe Imaishi 2017-02-10 17:36:03 UTC
David, do you know if there is any workaround?

Comment 13 Dávid Halász 2017-02-13 08:21:05 UTC
Unfortunately, there is no solution other than updating httpd, as the bug is in Apache's mod_proxy_wstunnel. The only thing that can help is disabling WebsocketWorker, but this will turn off asynchronous notifications and VM remote consoles.

Comment 22 Marianne Feifer 2017-06-14 19:00:49 UTC
Waiting for a scratch build of Apache with the updated module to be tested to see whether the new module indeed fixes the problem. If it does, we can discuss options for a hot-fix and z-stream release.

Comment 29 CFME Bot 2017-06-28 16:37:47 UTC
New commit detected on ManageIQ/manageiq-appliance/master:
https://github.com/ManageIQ/manageiq-appliance/commit/242ee1edddc890ee87a46488fa4a83cce9da97d1

commit 242ee1edddc890ee87a46488fa4a83cce9da97d1
Author:     Dávid Halász <dhalasz>
AuthorDate: Wed Jun 28 16:20:55 2017 +0200
Commit:     Dávid Halász <dhalasz>
CommitDate: Wed Jun 28 17:54:52 2017 +0200

    Disable connection reuse for WebSocket connections in Apache
    
    This is a temporary workaround for the issue described here:
    https://bugzilla.redhat.com/show_bug.cgi?id=1404354
    
    This can be reverted after httpd is updated to 2.4.25 or newer

 .../httpd/conf.d/manageiq-balancer-websocket.conf  | 23 ++++++++++++----------
 1 file changed, 13 insertions(+), 10 deletions(-)
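The diff itself is not reproduced here, so the following is only a sketch of what disabling connection reuse could look like. It assumes the change amounts to adding mod_proxy's `disablereuse=On` worker parameter to each `BalancerMember` line. The helper name, the balancer name, and the sample configuration are hypothetical, not the contents of the real file.

```python
def disable_reuse(conf_text: str) -> str:
    """Append disablereuse=On to each BalancerMember line that lacks it."""
    out = []
    for line in conf_text.splitlines():
        stripped = line.strip()
        is_member = stripped.startswith("BalancerMember ")
        # Leave lines alone if they already set the parameter either way.
        if is_member and "disablereuse=" not in stripped.lower():
            line = line.rstrip() + " disablereuse=On"
        out.append(line)
    return "\n".join(out)


# Hypothetical configuration fragment for illustration only.
SAMPLE_CONF = """\
<Proxy balancer://example_websocket/>
  BalancerMember ws://0.0.0.0:5000
</Proxy>"""

print(disable_reuse(SAMPLE_CONF))
```

With `disablereuse=On`, mod_proxy closes the backend connection after each use instead of keeping it in a pool, which would be consistent with the commit's description as a temporary workaround rather than a fix.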

Comment 32 CFME Bot 2017-07-06 13:58:33 UTC
New commit detected on ManageIQ/manageiq-gems-pending/fine:
https://github.com/ManageIQ/manageiq-gems-pending/commit/65842a4ca9a18ec0771aec0cfb2f4f416e3e91e8

commit 65842a4ca9a18ec0771aec0cfb2f4f416e3e91e8
Author:     Dávid Halász <dhalasz>
AuthorDate: Wed Jun 28 13:29:45 2017 +0200
Commit:     Dávid Halász <dhalasz>
CommitDate: Thu Jun 29 17:24:37 2017 +0200

    Disable connection reuse for WebSocket connections in Apache
    
    This is a temporary workaround for the issue described here:
    https://bugzilla.redhat.com/show_bug.cgi?id=1404354
    
    This can be reverted after httpd is updated to 2.4.25 or newer

 lib/gems/pending/util/miq_apache/miq_apache.rb | 10 ++++++++--
 spec/util/miq_apache/conf_spec.rb              | 25 ++++++++++++++++++++++++-
 2 files changed, 32 insertions(+), 3 deletions(-)
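Both commit messages tie reverting the workaround to httpd 2.4.25 or newer. A minimal sketch of such a check follows, assuming the version is read from the banner line printed by `httpd -v`. The function names are hypothetical and the sample banners are illustrative.

```python
import re

# Threshold taken from the commit message above.
MIN_FIXED = (2, 4, 25)


def parse_httpd_version(banner: str) -> tuple:
    """Extract (major, minor, patch) from a banner such as
    'Server version: Apache/2.4.6 (Red Hat Enterprise Linux)'."""
    match = re.search(r"Apache/(\d+)\.(\d+)\.(\d+)", banner)
    if match is None:
        raise ValueError(f"no Apache version found in: {banner!r}")
    return tuple(int(part) for part in match.groups())


def workaround_still_needed(banner: str) -> bool:
    """True while the installed httpd is older than MIN_FIXED."""
    # Tuples compare element by element, so 2.4.6 sorts before 2.4.25.
    return parse_httpd_version(banner) < MIN_FIXED


for banner in (
    "Server version: Apache/2.4.6 (Red Hat Enterprise Linux)",
    "Server version: Apache/2.4.25 (Fedora)",
):
    print(banner, "->", workaround_still_needed(banner))
```

Comparing integer tuples rather than strings matters here: as strings, "2.4.6" would sort after "2.4.25".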

Comment 35 Pete Savage 2018-02-26 07:58:23 UTC
Verified