Bug 1257324 - novncproxy got wedged/unresponsive
Status: CLOSED ERRATA
Product: Red Hat OpenStack
Classification: Red Hat
Component: openstack-puppet-modules
Version: 7.0 (Kilo)
Hardware/OS: Unspecified / Unspecified
Priority: high / Severity: unspecified
Target Milestone: z2
Target Release: 7.0 (Kilo)
Assigned To: Emilien Macchi
QA Contact: Udi Shkalim
Keywords: Triaged, ZStream
Duplicates: 1257328 1257753
Depends On:
Blocks:
Reported: 2015-08-26 15:33 EDT by Jeremy
Modified: 2018-02-08 05:52 EST (History)
20 users

See Also:
Fixed In Version: openstack-puppet-modules-2015.1.8-14.el7ost
Doc Type: Bug Fix
Doc Text:
When a web browser connects to the novncproxy service load-balanced by HAProxy, the browser must stay connected to the same back-end server: if HAProxy balances a subsequent request to a different novncproxy node, the browser loses the connection, which then fails with a "Connection Reset By Peer" error. This update configures the novnc HAProxy listener to balance in "source" mode, ensuring the same server is used for as long as the connection is established.
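As an illustrative sketch of that fix (listener and server addresses are hypothetical, not taken from this deployment), the resulting stanza in /etc/haproxy/haproxy.cfg would look like:

```
listen nova_novncproxy
  bind 192.0.2.10:6080
  balance source            # pin each client source IP to one backend
  server controller-0 192.0.2.11:6080 check fall 5 inter 2000 rise 2
  server controller-1 192.0.2.12:6080 check fall 5 inter 2000 rise 2
```

With "balance source", HAProxy hashes the client's source address, so repeated requests from the same browser land on the same novncproxy node.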
Story Points: ---
Clone Of:
Environment:
Last Closed: 2015-10-08 08:23:52 EDT
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---


Attachments
novnc3 (978.05 KB, application/x-gzip) - 2015-08-28 14:33 EDT, Jeremy
novnc2 (3.19 MB, application/x-gzip) - 2015-08-28 14:33 EDT, Jeremy
novnc1 (13.91 MB, application/x-gzip) - 2015-08-28 14:34 EDT, Jeremy
Error broken pipe (4.97 KB, text/plain) - 2015-09-25 15:17 EDT, Marius Cornea
Error token expired (3.64 KB, text/plain) - 2015-09-25 15:18 EDT, Marius Cornea


External Trackers
OpenStack gerrit 219464
Description Jeremy 2015-08-26 15:33:43 EDT
Description of problem: novncproxy got wedged/unresponsive. Clients could not connect; killing off and restarting the service resolved it.


Version-Release number of selected component (if applicable):
7.0

How reproducible: unknown


Steps to Reproduce:
1. unknown

Actual results:
novncproxy got wedged/unresponsive

Expected results:
novncproxy does not get wedged/unresponsive

Additional info:
The customer ran into this issue, which prevented them from accessing the instance console via Horizon. Once we restarted novncproxy, the problem was resolved. We would like to know the root cause of why the service became unresponsive.
Comment 4 Benjamin Schmaus 2015-08-27 11:43:51 EDT
openstack-nova-novncproxy-clone wedges/stops working.

Workaround to resolve:

1. pcs resource restart openstack-nova-novncproxy-clone
2. Wait for pcs to time out (a few seconds); the resources will go into an unmanaged state.
3. systemctl status openstack-nova-novncproxy shows the service stuck deactivating; it will eventually go into a failed state.
4. pcs resource cleanup openstack-nova-novncproxy-clone
5. The console is functioning again.

The problem appears to recur within 24 hours; applying the workaround above resolves it each time.
Comment 5 Jeremy 2015-08-28 14:33:13 EDT
Created attachment 1068111 [details]
novnc3
Comment 6 Jeremy 2015-08-28 14:33:40 EDT
Created attachment 1068113 [details]
novnc2
Comment 7 Jeremy 2015-08-28 14:34:11 EDT
Created attachment 1068115 [details]
novnc1
Comment 9 Benjamin Schmaus 2015-09-01 16:33:10 EDT
This BZ could be related to BZ# 1257328
Comment 10 Eoghan Glynn 2015-09-01 16:41:37 EDT
There are many occurrences of the following traceback in the novnc logs:

2015-08-27 13:27:49.834 24068 DEBUG nova.console.websocketproxy [-] Reaing zombies, active child count is 0 vmsg /usr/lib/python2.7/site-packages/websockify/websocket.py:828
2015-08-27 13:27:50.730 24068 DEBUG nova.console.websocketproxy [-] 172.17.80.14: new handler Process vmsg /usr/lib/python2.7/site-packages/websockify/websocket.py:828
2015-08-27 13:27:50.739 24141 INFO nova.console.websocketproxy [-] handler exception: [Errno 104] Connection reset by peer
2015-08-27 13:27:50.740 24141 DEBUG nova.console.websocketproxy [-] exception vmsg /usr/lib/python2.7/site-packages/websockify/websocket.py:828
2015-08-27 13:27:50.740 24141 TRACE nova.console.websocketproxy Traceback (most recent call last):
2015-08-27 13:27:50.740 24141 TRACE nova.console.websocketproxy   File "/usr/lib/python2.7/site-packages/websockify/websocket.py", line 878, in top_new_client
2015-08-27 13:27:50.740 24141 TRACE nova.console.websocketproxy     client = self.do_handshake(startsock, address)
2015-08-27 13:27:50.740 24141 TRACE nova.console.websocketproxy   File "/usr/lib/python2.7/site-packages/websockify/websocket.py", line 813, in do_handshake
2015-08-27 13:27:50.740 24141 TRACE nova.console.websocketproxy     self.RequestHandlerClass(retsock, address, self)
2015-08-27 13:27:50.740 24141 TRACE nova.console.websocketproxy   File "/usr/lib/python2.7/site-packages/nova/console/websocketproxy.py", line 156, in __init__
2015-08-27 13:27:50.740 24141 TRACE nova.console.websocketproxy     websockify.ProxyRequestHandler.__init__(self, *args, **kwargs)
2015-08-27 13:27:50.740 24141 TRACE nova.console.websocketproxy   File "/usr/lib/python2.7/site-packages/websockify/websocket.py", line 112, in __init__
2015-08-27 13:27:50.740 24141 TRACE nova.console.websocketproxy     SimpleHTTPRequestHandler.__init__(self, req, addr, server)
2015-08-27 13:27:50.740 24141 TRACE nova.console.websocketproxy   File "/usr/lib64/python2.7/SocketServer.py", line 649, in __init__
2015-08-27 13:27:50.740 24141 TRACE nova.console.websocketproxy     self.handle()
2015-08-27 13:27:50.740 24141 TRACE nova.console.websocketproxy   File "/usr/lib/python2.7/site-packages/websockify/websocket.py", line 544, in handle
2015-08-27 13:27:50.740 24141 TRACE nova.console.websocketproxy     SimpleHTTPRequestHandler.handle(self)
2015-08-27 13:27:50.740 24141 TRACE nova.console.websocketproxy   File "/usr/lib64/python2.7/BaseHTTPServer.py", line 340, in handle
2015-08-27 13:27:50.740 24141 TRACE nova.console.websocketproxy     self.handle_one_request()
2015-08-27 13:27:50.740 24141 TRACE nova.console.websocketproxy   File "/usr/lib64/python2.7/BaseHTTPServer.py", line 328, in handle_one_request
2015-08-27 13:27:50.740 24141 TRACE nova.console.websocketproxy     method()
2015-08-27 13:27:50.740 24141 TRACE nova.console.websocketproxy   File "/usr/lib/python2.7/site-packages/websockify/websocket.py", line 514, in do_GET
2015-08-27 13:27:50.740 24141 TRACE nova.console.websocketproxy     SimpleHTTPRequestHandler.do_GET(self)
2015-08-27 13:27:50.740 24141 TRACE nova.console.websocketproxy   File "/usr/lib64/python2.7/SimpleHTTPServer.py", line 44, in do_GET
2015-08-27 13:27:50.740 24141 TRACE nova.console.websocketproxy     f = self.send_head()
2015-08-27 13:27:50.740 24141 TRACE nova.console.websocketproxy   File "/usr/lib64/python2.7/SimpleHTTPServer.py", line 91, in send_head
2015-08-27 13:27:50.740 24141 TRACE nova.console.websocketproxy     self.send_response(200)
2015-08-27 13:27:50.740 24141 TRACE nova.console.websocketproxy   File "/usr/lib64/python2.7/BaseHTTPServer.py", line 396, in send_response
2015-08-27 13:27:50.740 24141 TRACE nova.console.websocketproxy     self.send_header('Date', self.date_time_string())
2015-08-27 13:27:50.740 24141 TRACE nova.console.websocketproxy   File "/usr/lib64/python2.7/BaseHTTPServer.py", line 401, in send_header
2015-08-27 13:27:50.740 24141 TRACE nova.console.websocketproxy     self.wfile.write("%s: %s\r\n" % (keyword, value))
2015-08-27 13:27:50.740 24141 TRACE nova.console.websocketproxy   File "/usr/lib64/python2.7/socket.py", line 324, in write
2015-08-27 13:27:50.740 24141 TRACE nova.console.websocketproxy     self.flush()
2015-08-27 13:27:50.740 24141 TRACE nova.console.websocketproxy   File "/usr/lib64/python2.7/socket.py", line 303, in flush
2015-08-27 13:27:50.740 24141 TRACE nova.console.websocketproxy     self._sock.sendall(view[write_offset:write_offset+buffer_size])
2015-08-27 13:27:50.740 24141 TRACE nova.console.websocketproxy   File "/usr/lib/python2.7/site-packages/eventlet/greenio/base.py", line 376, in sendall
2015-08-27 13:27:50.740 24141 TRACE nova.console.websocketproxy     tail = self.send(data, flags)
2015-08-27 13:27:50.740 24141 TRACE nova.console.websocketproxy   File "/usr/lib/python2.7/site-packages/eventlet/greenio/base.py", line 359, in send
2015-08-27 13:27:50.740 24141 TRACE nova.console.websocketproxy     total_sent += fd.send(data[total_sent:], flags)
2015-08-27 13:27:50.740 24141 TRACE nova.console.websocketproxy error: [Errno 104] Connection reset by peer
Comment 11 Eoghan Glynn 2015-09-01 16:50:14 EDT
Is HAProxy being used in this deployment?

If so, please post the haproxy config file (/etc/haproxy/haproxy.cfg IIRC).

There are known issues with the configured httpchk breaking novnc.
Comment 12 Benjamin Schmaus 2015-09-01 16:53:49 EDT
haproxy is in use, and httpchk was used as the check type for the stanza.

Is there an alternative check (tcp) to use to resolve?

Fetching the sosreport to extract the haproxy.cfg
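For reference, a plain TCP connect check could replace the HTTP check by running the listener in TCP mode without an "option httpchk" line; the stanza below is a sketch with hypothetical addresses, not this deployment's config:

```
listen nova_novncproxy
  mode tcp
  bind 192.0.2.10:6080
  balance source
  # without "option httpchk", the "check" keyword performs a simple
  # TCP connect test instead of issuing an HTTP GET
  server controller-0 192.0.2.11:6080 check fall 5 inter 2000 rise 2
```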
Comment 13 Eoghan Glynn 2015-09-01 16:57:54 EDT
Also, can we clarify how this deployment was made?

e.g. via rhel-osp-director or custom deployment tooling?
Comment 15 Benjamin Schmaus 2015-09-01 17:01:02 EDT
OSP7 Director was used for deployment
Comment 16 Eoghan Glynn 2015-09-01 17:07:05 EDT
On the basis of:

  https://github.com/redhat-cip/puppet-openstack-cloud/issues/561

and the fix:

  https://github.com/redhat-cip/puppet-openstack-cloud/commit/1fcfb9b5

I suspect "balance source" may also be required here.
Comment 18 Emilien Macchi 2015-09-01 17:35:47 EDT
I did the patch upstream: https://review.openstack.org/219464
Comment 19 Chris Lincoln 2015-09-02 14:10:25 EDT
Hello,

The customer wants an explanation of why the upstream patch will solve the problem.

"Chris,
Can you provide more detail on why RH believes this is the fix?  We are not bouncing around to different controllers. The novncproxy process is getting wedged or hung. I need the push for the explanation and fix pushed harder. We have to restart all of these services multiple times each day right now."

Are we positive that HTTP load balancing, rather than TCP load balancing, is causing the hang? If so, is there an explanation I can use to communicate to the customer?

I plan to guide them through the haproxy.cfg changes being merged in this patch manually so they can get going right away. I was able to confirm that there are no regular Puppet runs that would change the cfg file back.
Comment 20 Ryan O'Hara 2015-09-02 14:55:16 EDT
Does the customer have logs from haproxy? It would be interesting to see whether a timeout is occurring in haproxy, which would be easy to identify in the logs. I think I have a fix for that.
Comment 21 Ryan O'Hara 2015-09-02 15:36:10 EDT
If the novncproxy connection is being closed by haproxy due to client/server timeout, you can fix this by setting 'timeout tunnel 1h' in the novncproxy section of haproxy.cfg.
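A minimal sketch of that change (addresses hypothetical):

```
listen nova_novncproxy
  bind 192.0.2.10:6080
  balance source
  timeout tunnel 1h   # keep long-lived websocket tunnels open past the
                      # default client/server timeouts
  server controller-0 192.0.2.11:6080 check fall 5 inter 2000 rise 2
```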
Comment 22 Emilien Macchi 2015-09-02 15:41:12 EDT
or... change the balance mode to 'source' so it does not time out.
Comment 25 Mike Burns 2015-09-03 08:53:21 EDT
Moving to the openstack-puppet-modules component because the proposed fix is in puppet-tripleo.
Comment 35 Udi Shkalim 2015-09-22 04:02:31 EDT
On Version
openstack-puppet-modules-2015.1.8-17.el7ost.noarch

[heat-admin@overcloud-controller-0 ~]$ cat /etc/haproxy/haproxy.cfg   | grep -i -A 6 novnc
listen nova_novncproxy
  bind 10.35.180.10:6080 
  bind 172.17.0.10:6080 
  balance source
  server overcloud-controller-0 172.17.0.14:6080 check fall 5 inter 2000 rise 2
  server overcloud-controller-1 172.17.0.17:6080 check fall 5 inter 2000 rise 2
  server overcloud-controller-2 172.17.0.16:6080 check fall 5 inter 2000 rise 2

Only "balance source" is present, in line with the upstream patch.

TCP keepalive values are as follows:
[heat-admin@overcloud-controller-0 ~]$ cat /proc/sys/net/ipv4/tcp_keepalive_intvl 
1
[heat-admin@overcloud-controller-0 ~]$ cat /proc/sys/net/ipv4/tcp_keepalive_time 
5
[heat-admin@overcloud-controller-0 ~]$ cat /proc/sys/net/ipv4/tcp_keepalive_probes 
5
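If those keepalive values were to be set persistently, a sysctl fragment of the following shape would do it (file path illustrative; values match the readouts above):

```
# /etc/sysctl.d/99-keepalive.conf
net.ipv4.tcp_keepalive_intvl = 1
net.ipv4.tcp_keepalive_time = 5
net.ipv4.tcp_keepalive_probes = 5
```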

Is this enough to verify, or am I missing anything?
Comment 37 Udi Shkalim 2015-09-24 04:37:38 EDT
Verified On Version
openstack-puppet-modules-2015.1.8-17.el7ost.noarch
haproxy configuration updated as stated above.
The instance console worked for 30 minutes after installation without issues.
Comment 38 Marius Cornea 2015-09-25 15:02:52 EDT
*** Bug 1257753 has been marked as a duplicate of this bug. ***
Comment 39 Marius Cornea 2015-09-25 15:15:36 EDT
I still get disconnects with the config provided by openstack-puppet-modules-2015.1.8-17.el7ost.noarch

I identified two types of disconnects:

1. The browser shows 'Server disconnected' and it can be recovered by refreshing the window. Error log shows [Errno 32] Broken pipe. Log attached. 

2. The browser shows 'Failed to connect to server (code: 1006)' which is related to the token being expired. Log attached.

Haproxy config below:

[root@overcloud-controller-0 ~]# grep -A6 novnc /etc/haproxy/haproxy.cfg 
listen nova_novncproxy
  bind 172.16.20.10:6080 
  bind 172.16.23.10:6080 
  balance source
  server overcloud-controller-0 172.16.20.15:6080 check fall 5 inter 2000 rise 2
  server overcloud-controller-1 172.16.20.13:6080 check fall 5 inter 2000 rise 2
  server overcloud-controller-2 172.16.20.14:6080 check fall 5 inter 2000 rise 2
Comment 40 Marius Cornea 2015-09-25 15:17 EDT
Created attachment 1077221 [details]
Error broken pipe
Comment 41 Marius Cornea 2015-09-25 15:18 EDT
Created attachment 1077222 [details]
Error token expired
Comment 42 Ryan O'Hara 2015-09-27 19:04:06 EDT
(In reply to Marius Cornea from comment #39)
> I still get disconnects with the config provided by
> openstack-puppet-modules-2015.1.8-17.el7ost.noarch
> 
> I identified 2 types of disconnects:
> 
> 1. The browser shows 'Server disconnected' and it can be recovered by
> refreshing the window. Error log shows [Errno 32] Broken pipe. Log attached. 
> 
> 2. The browser shows 'Failed to connect to server (code: 1006)' which is
> related to the token being expired. Log attached.
> 
> Haproxy config below:
> 
> [root@overcloud-controller-0 ~]# grep -A6 novnc /etc/haproxy/haproxy.cfg 
> listen nova_novncproxy
>   bind 172.16.20.10:6080 
>   bind 172.16.23.10:6080 
>   balance source
>   server overcloud-controller-0 172.16.20.15:6080 check fall 5 inter 2000
> rise 2
>   server overcloud-controller-1 172.16.20.13:6080 check fall 5 inter 2000
> rise 2
>   server overcloud-controller-2 172.16.20.14:6080 check fall 5 inter 2000
> rise 2

See comments 21, 30 and 31. I've suggested using a tunnel timeout. Did you try that?
Comment 43 Marius Cornea 2015-09-27 20:13:04 EDT
I've just tried it, and I can no longer reproduce the disconnects; nova-novncproxy.log looks clean now.

Here's my haproxy config:

listen nova_novncproxy
  bind 172.16.20.11:6080 
  bind 172.16.23.10:6080 
  balance source
  timeout tunnel 1h
  server overcloud-controller-0 172.16.20.14:6080 check fall 5 inter 2000 rise 2
  server overcloud-controller-1 172.16.20.12:6080 check fall 5 inter 2000 rise 2
  server overcloud-controller-2 172.16.20.15:6080 check fall 5 inter 2000 rise 2
Comment 44 Emilien Macchi 2015-09-28 09:23:49 EDT
It is still unclear how this got solved.
Could you please post your package versions?
Did you change something manually?
Comment 45 Marius Cornea 2015-09-28 09:39:52 EDT
I manually added 'timeout tunnel 1h' to the novnc haproxy config as suggested before. 

Package version is openstack-puppet-modules-2015.1.8-19.el7ost.noarch.
Comment 46 Benjamin Schmaus 2015-09-28 10:01:26 EDT
@Emilien,

We never added 'timeout tunnel 1h' for our client; however, we did remove the httpchk GET / statement, as the timing of that check seemed to correlate with the connection reset messages.

We did add a tcpka check, although in hindsight I do not think that is required.

My thought is that adding 'balance source' fixed the client's ability to access the console properly, and that removing the httpchk GET / cleared up the connection resets in the logs.

The client in our example has been running clean for a week now.
Comment 49 Marius Cornea 2015-09-28 19:01:00 EDT
I did some more testing, with and without the tunnel timeout, to understand the issue here. What I noticed is that without the tunnel timeout the VNC session would eventually end up disconnected, with the log showing the Broken pipe error; the console can be recovered by opening a new console session. With the tunnel timeout, the VNC session stayed connected over a long period.

Nevertheless, these are different issues from the ones reported in this ticket. Therefore I am going to move this bug back to verified and open a new one for setting the tunnel timeout in the novnc haproxy config.
Comment 50 Marius Cornea 2015-09-28 19:20:39 EDT
I filed BZ#1267043 for setting timeout tunnel in the nova_novncproxy section.
Comment 51 Benjamin Schmaus 2015-09-29 11:42:26 EDT
*** Bug 1257328 has been marked as a duplicate of this bug. ***
Comment 55 errata-xmlrpc 2015-10-08 08:23:52 EDT
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2015:1872
