Bug 1404354
| Summary: | websocket connection leaks causing failed connections | |||
|---|---|---|---|---|
| Product: | Red Hat CloudForms Management Engine | Reporter: | Pete Savage <psavage> | |
| Component: | Build | Assignee: | Gregg Tanzillo <gtanzill> | |
| Status: | CLOSED CURRENTRELEASE | QA Contact: | Pete Savage <psavage> | |
| Severity: | high | Docs Contact: | ||
| Priority: | high | |||
| Version: | 5.6.0 | CC: | adahms, cpelland, dajohnso, dhalasz, fdewaley, gekis, hkataria, jhardy, jorton, jpazdziora, jrafanie, mfeifer, mpovolny, obarenbo, psavage, rspagnol, simaishi, tachoi | |
| Target Milestone: | GA | Keywords: | TestOnly, ZStream | |
| Target Release: | 5.9.0 | |||
| Hardware: | Unspecified | |||
| OS: | Unspecified | |||
| Whiteboard: | ||||
| Fixed In Version: | 5.9.0.1 | Doc Type: | Known Issue | |
| Doc Text: | Currently, connecting to virtual machines through HTML5 console access fails intermittently. This is caused by an issue in the underlying Apache web server related to WebSocket connections, which are used for remote console access to virtual machines. As a workaround, retry the connection. If the connection fails again, wait a minute and retry. This issue is currently being investigated by engineering, who aim to provide a solution in the first update to Red Hat CloudForms 4.2. | Story Points: | --- | |
| Clone Of: | ||||
| Clones: | 1468281, 1468633 | Environment: | ||
| Last Closed: | 2018-03-06 15:02:28 UTC | Type: | Bug | |
| Regression: | --- | Mount Type: | --- | |
| Documentation: | --- | CRM: | ||
| Verified Versions: | Category: | --- | ||
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | ||
| Cloudforms Team: | CFME Core | Target Upstream Version: | ||
| Embargoed: | ||||
| Bug Depends On: | ||||
| Bug Blocks: | 1395782, 1468281, 1468633 | |||
Description
Pete Savage
2016-12-13 16:15:33 UTC
So, we've been investigating this over the last week. To cut a long story short, what we originally thought we could only replicate in a QE environment turned out to be reproducible on any appliance that uses the new WebSocket connections. The problem seems to be worst in Firefox, but parts of it can be seen in Chrome too.

Here is what we observe. When a page loads, the browser connects to the /ws/notifications URI. That connection in turn causes Apache to open a connection to port 5000 on the internal Ruby (Rack) process. When the user navigates to a new page, this connection is not cleaned up. It remains for a long time, even after the browser is closed. The more pages are visited, the more ESTABLISHED connections stack up internally.

We also saw that Firefox can accumulate connections that seem to hang. When the number of these exceeds the browser's maximum number of connections, parts of the UI stop working. There appear to be situations where Firefox and other browsers cannot connect to the WebSocket service, and when this happens a connection is often leaked. Occasionally this leads to some kind of infinite retry loop that fills the available connections with stale, hanging TCP connections. Often one connection is in use while another is stale but kept alive by the browser.

Through some intensive debugging, we traced the problem to Apache. Apache 2.4.6, currently shipped with RHEL 7, shows the problem, and Apache 2.4.18 from SCL shows it too. Apache 2.4.23, currently in Fedora 24, does not seem to produce the same error. Specifically, there are no hanging connections on internal port 5000, and the issue also appears to be gone in the browsers. I'd like to do a little more testing, but it looks like a major Apache upgrade is needed to fix this bug.
Chris, can you reach out to the SCL team and see what it will take to make the newer Apache version available downstream?

So this is also causing remote connection failures.

Adding the Fedora 24 container in front of the appliance to proxy the web connections removes the issue completely, for whatever reason.

Is there a way to work around this issue?

David, do you know if there is any workaround?

Unfortunately, there is no solution other than updating httpd, as the bug is in Apache's mod_proxy_wstunnel. The only thing that can help is disabling the WebsocketWorker, but this will turn off asynchronous notifications and VM remote consoles.

Waiting for a scratch build of Apache with the updated module, to test whether the new module indeed fixes the problem. If it does, we can discuss options for a hotfix and a z-stream release.

New commit detected on ManageIQ/manageiq-appliance/master:
https://github.com/ManageIQ/manageiq-appliance/commit/242ee1edddc890ee87a46488fa4a83cce9da97d1

commit 242ee1edddc890ee87a46488fa4a83cce9da97d1
Author: Dávid Halász <dhalasz>
AuthorDate: Wed Jun 28 16:20:55 2017 +0200
Commit: Dávid Halász <dhalasz>
CommitDate: Wed Jun 28 17:54:52 2017 +0200

Disable connection reuse for WebSocket connections in Apache

This is a temporary workaround for the issue described here:
https://bugzilla.redhat.com/show_bug.cgi?id=1404354

This can be reverted after httpd is updated to 2.4.25 or newer

.../httpd/conf.d/manageiq-balancer-websocket.conf | 23 ++++++++++++----------
1 file changed, 13 insertions(+), 10 deletions(-)

New commit detected on ManageIQ/manageiq-gems-pending/fine:
https://github.com/ManageIQ/manageiq-gems-pending/commit/65842a4ca9a18ec0771aec0cfb2f4f416e3e91e8

commit 65842a4ca9a18ec0771aec0cfb2f4f416e3e91e8
Author: Dávid Halász <dhalasz>
AuthorDate: Wed Jun 28 13:29:45 2017 +0200
Commit: Dávid Halász <dhalasz>
CommitDate: Thu Jun 29 17:24:37 2017 +0200

Disable connection reuse for WebSocket connections in Apache

This is a temporary workaround for the issue described here:
https://bugzilla.redhat.com/show_bug.cgi?id=1404354

This can be reverted after httpd is updated to 2.4.25 or newer

lib/gems/pending/util/miq_apache/miq_apache.rb | 10 ++++++++--
spec/util/miq_apache/conf_spec.rb | 25 ++++++++++++++++++++++++-
2 files changed, 32 insertions(+), 3 deletions(-)

Verified
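The commits above list only the files they touched (`manageiq-balancer-websocket.conf` and `miq_apache.rb`), not their contents. The sketch below illustrates one thing "disable connection reuse" can mean in Apache. It assumes the workaround is mod_proxy's `disablereuse=On` worker parameter on the WebSocket `BalancerMember` entries. The parameter is applied only while httpd is older than the 2.4.25 threshold named in the commit message. The sketch is written in Python rather than Ruby. The helper names, the balancer name `websocket_cluster`, and the member address are hypothetical; only port 5000 comes from the report.

```python
import re

# Version threshold quoted in the commit message: the workaround "can be
# reverted after httpd is updated to 2.4.25 or newer".
FIXED_HTTPD = (2, 4, 25)


def parse_httpd_version(banner: str) -> tuple[int, int, int]:
    """Extract (major, minor, patch) from `httpd -v` output such as 'Apache/2.4.6'."""
    match = re.search(r"Apache/(\d+)\.(\d+)\.(\d+)", banner)
    if match is None:
        raise ValueError(f"no Apache version found in {banner!r}")
    major, minor, patch = (int(part) for part in match.groups())
    return (major, minor, patch)


def websocket_balancer_conf(
    httpd_version: tuple[int, int, int],
    members: list[tuple[str, int]],
    name: str = "websocket_cluster",  # hypothetical balancer name
) -> str:
    """Render a <Proxy balancer://...> block for the WebSocket backends.

    On httpd older than FIXED_HTTPD, each member gets mod_proxy's
    `disablereuse=On`, so Apache closes the backend connection after use
    instead of keeping it pooled for reuse.
    """
    reuse_opt = " disablereuse=On" if httpd_version < FIXED_HTTPD else ""
    lines = [f"<Proxy balancer://{name}/>"]
    lines += [f"  BalancerMember ws://{host}:{port}{reuse_opt}" for host, port in members]
    lines.append("</Proxy>")
    return "\n".join(lines)


if __name__ == "__main__":
    version = parse_httpd_version("Server version: Apache/2.4.6 (Red Hat Enterprise Linux)")
    print(websocket_balancer_conf(version, [("127.0.0.1", 5000)]))
```

Running the module prints a three-line `<Proxy>` block whose single member carries `disablereuse=On`, because 2.4.6 is older than 2.4.25. The cost of this choice is a fresh backend connection for each proxied request. That cost likely matters little for long-lived WebSocket sessions.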