Bug 1257324

Summary: novncproxy got wedged/unresponsive
Product: Red Hat OpenStack Reporter: Jeremy <jmelvin>
Component: openstack-puppet-modules Assignee: Emilien Macchi <emacchi>
Status: CLOSED ERRATA QA Contact: Udi Shkalim <ushkalim>
Severity: unspecified Docs Contact:
Priority: high    
Version: 7.0 (Kilo) CC: bschmaus, calfonso, clincoln, david.costakos, dh3, dmaley, eglynn, emacchi, jmelvin, kmorey, lbezdick, mburns, mcornea, oblaut, rbiba, rhel-osp-director-maint, rohara, sputhenp, yeylon
Target Milestone: z2 Keywords: Triaged, ZStream
Target Release: 7.0 (Kilo)   
Hardware: Unspecified   
OS: Unspecified   
Whiteboard:
Fixed In Version: openstack-puppet-modules-2015.1.8-14.el7ost Doc Type: Bug Fix
Doc Text:
When a web browser connects to the novncproxy service load-balanced by HAProxy, the browser must stay connected to the same back-end server; if HAProxy balances the connection to a different novncproxy node, the browser loses the connection, which then times out with a "Connection reset by peer" error. This update configures the novnc HAProxy stanza to balance in "source" mode, ensuring that the same server handles the connection for as long as it is established.
Story Points: ---
Clone Of: Environment:
Last Closed: 2015-10-08 12:23:52 UTC Type: Bug
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: --- RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Embargoed:
Attachments:
Description          Flags
novnc3               none
novnc2               none
novnc1               none
Error broken pipe    none
Error token expired  none

Description Jeremy 2015-08-26 19:33:43 UTC
Description of problem: novncproxy got wedged/unresponsive - the client could not connect; killing off and restarting the service resolved the issue.


Version-Release number of selected component (if applicable):
7.0

How reproducible: unknown


Steps to Reproduce:
1. Unknown
2.
3.

Actual results:
novncproxy got wedged/unresponsive

Expected results:
novncproxy does not get wedged/unresponsive

Additional info:
The customer ran into this issue, which prevented them from accessing the instance console via Horizon. Once we restarted novncproxy, the problem was resolved. We would like to know the root cause of why the service became unresponsive.

Comment 4 Benjamin Schmaus 2015-08-27 15:43:51 UTC
openstack-nova-novncproxy-clone wedges/stops working

Workaround to resolve (a consolidated sketch of these commands follows below):

1. pcs resource restart openstack-nova-novncproxy-clone
2. Wait for pcs to time out (a few seconds); the resources will go into an unmanaged state.
3. systemctl status openstack-nova-novncproxy
   The service is stuck deactivating, but will eventually go into a failed state.
4. pcs resource cleanup openstack-nova-novncproxy-clone
5. The console is functioning again.

The problem appears to recur within 24 hours; applying the workaround above resolves it.
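
For reference, the same workaround consolidated as shell commands (a sketch assuming a Pacemaker-managed controller, using only the commands already listed above):

  # Restart the cloned novncproxy resource; pcs may time out after a few
  # seconds and the resources will show as unmanaged.
  pcs resource restart openstack-nova-novncproxy-clone

  # The unit is typically stuck deactivating and eventually moves to a
  # failed state.
  systemctl status openstack-nova-novncproxy

  # Clear the failed state so Pacemaker brings the service back up.
  pcs resource cleanup openstack-nova-novncproxy-clone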

Comment 5 Jeremy 2015-08-28 18:33:13 UTC
Created attachment 1068111 [details]
novnc3

Comment 6 Jeremy 2015-08-28 18:33:40 UTC
Created attachment 1068113 [details]
novnc2

Comment 7 Jeremy 2015-08-28 18:34:11 UTC
Created attachment 1068115 [details]
novnc1

Comment 9 Benjamin Schmaus 2015-09-01 20:33:10 UTC
This BZ could be related to BZ# 1257328

Comment 10 Eoghan Glynn 2015-09-01 20:41:37 UTC
There are many occurrences of the following traceback in the novnc logs:

2015-08-27 13:27:49.834 24068 DEBUG nova.console.websocketproxy [-] Reaping zombies, active child count is 0 vmsg /usr/lib/python2.7/site-packages/websockify/websocket.py:828
2015-08-27 13:27:50.730 24068 DEBUG nova.console.websocketproxy [-] 172.17.80.14: new handler Process vmsg /usr/lib/python2.7/site-packages/websockify/websocket.py:828
2015-08-27 13:27:50.739 24141 INFO nova.console.websocketproxy [-] handler exception: [Errno 104] Connection reset by peer
2015-08-27 13:27:50.740 24141 DEBUG nova.console.websocketproxy [-] exception vmsg /usr/lib/python2.7/site-packages/websockify/websocket.py:828
2015-08-27 13:27:50.740 24141 TRACE nova.console.websocketproxy Traceback (most recent call last):
2015-08-27 13:27:50.740 24141 TRACE nova.console.websocketproxy   File "/usr/lib/python2.7/site-packages/websockify/websocket.py", line 878, in top_new_client
2015-08-27 13:27:50.740 24141 TRACE nova.console.websocketproxy     client = self.do_handshake(startsock, address)
2015-08-27 13:27:50.740 24141 TRACE nova.console.websocketproxy   File "/usr/lib/python2.7/site-packages/websockify/websocket.py", line 813, in do_handshake
2015-08-27 13:27:50.740 24141 TRACE nova.console.websocketproxy     self.RequestHandlerClass(retsock, address, self)
2015-08-27 13:27:50.740 24141 TRACE nova.console.websocketproxy   File "/usr/lib/python2.7/site-packages/nova/console/websocketproxy.py", line 156, in __init__
2015-08-27 13:27:50.740 24141 TRACE nova.console.websocketproxy     websockify.ProxyRequestHandler.__init__(self, *args, **kwargs)
2015-08-27 13:27:50.740 24141 TRACE nova.console.websocketproxy   File "/usr/lib/python2.7/site-packages/websockify/websocket.py", line 112, in __init__
2015-08-27 13:27:50.740 24141 TRACE nova.console.websocketproxy     SimpleHTTPRequestHandler.__init__(self, req, addr, server)
2015-08-27 13:27:50.740 24141 TRACE nova.console.websocketproxy   File "/usr/lib64/python2.7/SocketServer.py", line 649, in __init__
2015-08-27 13:27:50.740 24141 TRACE nova.console.websocketproxy     self.handle()
2015-08-27 13:27:50.740 24141 TRACE nova.console.websocketproxy   File "/usr/lib/python2.7/site-packages/websockify/websocket.py", line 544, in handle
2015-08-27 13:27:50.740 24141 TRACE nova.console.websocketproxy     SimpleHTTPRequestHandler.handle(self)
2015-08-27 13:27:50.740 24141 TRACE nova.console.websocketproxy   File "/usr/lib64/python2.7/BaseHTTPServer.py", line 340, in handle
2015-08-27 13:27:50.740 24141 TRACE nova.console.websocketproxy     self.handle_one_request()
2015-08-27 13:27:50.740 24141 TRACE nova.console.websocketproxy   File "/usr/lib64/python2.7/BaseHTTPServer.py", line 328, in handle_one_request
2015-08-27 13:27:50.740 24141 TRACE nova.console.websocketproxy     method()
2015-08-27 13:27:50.740 24141 TRACE nova.console.websocketproxy   File "/usr/lib/python2.7/site-packages/websockify/websocket.py", line 514, in do_GET
2015-08-27 13:27:50.740 24141 TRACE nova.console.websocketproxy     SimpleHTTPRequestHandler.do_GET(self)
2015-08-27 13:27:50.740 24141 TRACE nova.console.websocketproxy   File "/usr/lib64/python2.7/SimpleHTTPServer.py", line 44, in do_GET
2015-08-27 13:27:50.740 24141 TRACE nova.console.websocketproxy     f = self.send_head()
2015-08-27 13:27:50.740 24141 TRACE nova.console.websocketproxy   File "/usr/lib64/python2.7/SimpleHTTPServer.py", line 91, in send_head
2015-08-27 13:27:50.740 24141 TRACE nova.console.websocketproxy     self.send_response(200)
2015-08-27 13:27:50.740 24141 TRACE nova.console.websocketproxy   File "/usr/lib64/python2.7/BaseHTTPServer.py", line 396, in send_response
2015-08-27 13:27:50.740 24141 TRACE nova.console.websocketproxy     self.send_header('Date', self.date_time_string())
2015-08-27 13:27:50.740 24141 TRACE nova.console.websocketproxy   File "/usr/lib64/python2.7/BaseHTTPServer.py", line 401, in send_header
2015-08-27 13:27:50.740 24141 TRACE nova.console.websocketproxy     self.wfile.write("%s: %s\r\n" % (keyword, value))
2015-08-27 13:27:50.740 24141 TRACE nova.console.websocketproxy   File "/usr/lib64/python2.7/socket.py", line 324, in write
2015-08-27 13:27:50.740 24141 TRACE nova.console.websocketproxy     self.flush()
2015-08-27 13:27:50.740 24141 TRACE nova.console.websocketproxy   File "/usr/lib64/python2.7/socket.py", line 303, in flush
2015-08-27 13:27:50.740 24141 TRACE nova.console.websocketproxy     self._sock.sendall(view[write_offset:write_offset+buffer_size])
2015-08-27 13:27:50.740 24141 TRACE nova.console.websocketproxy   File "/usr/lib/python2.7/site-packages/eventlet/greenio/base.py", line 376, in sendall
2015-08-27 13:27:50.740 24141 TRACE nova.console.websocketproxy     tail = self.send(data, flags)
2015-08-27 13:27:50.740 24141 TRACE nova.console.websocketproxy   File "/usr/lib/python2.7/site-packages/eventlet/greenio/base.py", line 359, in send
2015-08-27 13:27:50.740 24141 TRACE nova.console.websocketproxy     total_sent += fd.send(data[total_sent:], flags)
2015-08-27 13:27:50.740 24141 TRACE nova.console.websocketproxy error: [Errno 104] Connection reset by peer

Comment 11 Eoghan Glynn 2015-09-01 20:50:14 UTC
Is HAProxy being used in this deployment?

If so, please post the haproxy config file (/etc/haproxy/haproxy.cfg IIRC).

There are known issues with the configured httpchk breaking novnc.

Comment 12 Benjamin Schmaus 2015-09-01 20:53:49 UTC
HAProxy is in use and httpchk was used as the check type for the stanza.

Is there an alternative check (TCP) we could use to resolve this?

Fetching the sosreport to extract the haproxy.cfg
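
For illustration only, a hedged sketch of what a plain TCP connect check would look like in the novnc stanza (addresses are placeholders, not taken from this deployment). With no 'option httpchk' line, the bare 'check' keyword on each server line performs a layer-4 connect check rather than an HTTP GET:

  listen nova_novncproxy
    bind <vip>:6080
    # no 'option httpchk GET /' here, so 'check' below is a plain
    # TCP connect health check instead of an HTTP request
    server controller-0 <controller-0-ip>:6080 check fall 5 inter 2000 rise 2
    server controller-1 <controller-1-ip>:6080 check fall 5 inter 2000 rise 2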

Comment 13 Eoghan Glynn 2015-09-01 20:57:54 UTC
Also, can we clarify how this deployment was made?

e.g. via rhel-osp-director or custom deployment tooling?

Comment 15 Benjamin Schmaus 2015-09-01 21:01:02 UTC
OSP7 Director was used for deployment

Comment 16 Eoghan Glynn 2015-09-01 21:07:05 UTC
On the basis of:

  https://github.com/redhat-cip/puppet-openstack-cloud/issues/561

and the fix:

  https://github.com/redhat-cip/puppet-openstack-cloud/commit/1fcfb9b5

I suspect "balance source" may also be required here.

Comment 18 Emilien Macchi 2015-09-01 21:35:47 UTC
I submitted the patch upstream: https://review.openstack.org/219464

Comment 19 Chris Lincoln 2015-09-02 18:10:25 UTC
Hello,

The customer wants an explanation of why the upstream patch will solve the problem.

"Chris,
Can you provide more detail on why RH believes this is the fix?  We are not bouncing around to different controllers. The novncproxy process is getting wedged or hung. I need the push for the explanation and fix pushed harder. We have to restart all of these services multiple times each day right now."

Are we positive that HTTP load balancing, as opposed to TCP load balancing, is causing the hang? If so, is there an explanation I can use to communicate to the customer?

I plan to guide them manually through the haproxy.cfg changes being merged in this patch, to get them going right away. I was able to confirm that there are no regular Puppet runs that would change the cfg file back.

Comment 20 Ryan O'Hara 2015-09-02 18:55:16 UTC
Does the customer have logs from haproxy? It would be interesting to see if a timeout is occurring in haproxy, which would be easily identified in the logs. I think I have a fix for that.

Comment 21 Ryan O'Hara 2015-09-02 19:36:10 UTC
If the novncproxy connection is being closed by haproxy due to client/server timeout, you can fix this by setting 'timeout tunnel 1h' in the novncproxy section of haproxy.cfg.
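
For illustration, a hedged sketch of where that directive would sit in the novnc stanza (addresses are placeholders; the configuration verified with this setting appears in comment 43 below):

  listen nova_novncproxy
    bind <vip>:6080
    balance source
    # keep long-lived websocket tunnels open instead of closing them on
    # the default client/server timeouts
    timeout tunnel 1h
    server controller-0 <controller-0-ip>:6080 check fall 5 inter 2000 rise 2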

Comment 22 Emilien Macchi 2015-09-02 19:41:12 UTC
Or... change the balance mode to 'source' so it does not time out.

Comment 25 Mike Burns 2015-09-03 12:53:21 UTC
Moving to the openstack-puppet-modules component because the proposed fix is in puppet-tripleo.

Comment 35 Udi Shkalim 2015-09-22 08:02:31 UTC
On Version
openstack-puppet-modules-2015.1.8-17.el7ost.noarch

[heat-admin@overcloud-controller-0 ~]$ cat /etc/haproxy/haproxy.cfg   | grep -i -A 6 novnc
listen nova_novncproxy
  bind 10.35.180.10:6080 
  bind 172.17.0.10:6080 
  balance source
  server overcloud-controller-0 172.17.0.14:6080 check fall 5 inter 2000 rise 2
  server overcloud-controller-1 172.17.0.17:6080 check fall 5 inter 2000 rise 2
  server overcloud-controller-2 172.17.0.16:6080 check fall 5 inter 2000 rise 2

Only "balance source" is present according to the upstream patch.

TCP keep-alive values are as stated above.
[heat-admin@overcloud-controller-0 ~]$ cat /proc/sys/net/ipv4/tcp_keepalive_intvl 
1
[heat-admin@overcloud-controller-0 ~]$ cat /proc/sys/net/ipv4/tcp_keepalive_time 
5
[heat-admin@overcloud-controller-0 ~]$ cat /proc/sys/net/ipv4/tcp_keepalive_probes 
5

Is this enough to verify, or am I missing anything?

Comment 37 Udi Shkalim 2015-09-24 08:37:38 UTC
Verified on version:
openstack-puppet-modules-2015.1.8-17.el7ost.noarch
The haproxy configuration was updated as stated above.
The instance console worked for 30 minutes after installation without issues.

Comment 38 Marius Cornea 2015-09-25 19:02:52 UTC
*** Bug 1257753 has been marked as a duplicate of this bug. ***

Comment 39 Marius Cornea 2015-09-25 19:15:36 UTC
I still get disconnects with the config provided by openstack-puppet-modules-2015.1.8-17.el7ost.noarch

I identified 2 types of disconnects:

1. The browser shows 'Server disconnected' and it can be recovered by refreshing the window. Error log shows [Errno 32] Broken pipe. Log attached. 

2. The browser shows 'Failed to connect to server (code: 1006)' which is related to the token being expired. Log attached.

Haproxy config below:

[root@overcloud-controller-0 ~]# grep -A6 novnc /etc/haproxy/haproxy.cfg 
listen nova_novncproxy
  bind 172.16.20.10:6080 
  bind 172.16.23.10:6080 
  balance source
  server overcloud-controller-0 172.16.20.15:6080 check fall 5 inter 2000 rise 2
  server overcloud-controller-1 172.16.20.13:6080 check fall 5 inter 2000 rise 2
  server overcloud-controller-2 172.16.20.14:6080 check fall 5 inter 2000 rise 2

Comment 40 Marius Cornea 2015-09-25 19:17:39 UTC
Created attachment 1077221 [details]
Error broken pipe

Comment 41 Marius Cornea 2015-09-25 19:18:16 UTC
Created attachment 1077222 [details]
Error token expired

Comment 42 Ryan O'Hara 2015-09-27 23:04:06 UTC
(In reply to Marius Cornea from comment #39)
> I still get disconnects with the config provided by
> openstack-puppet-modules-2015.1.8-17.el7ost.noarch
> 
> I identified 2 types of disconnects:
> 
> 1. The browser shows 'Server disconnected' and it can be recovered by
> refreshing the window. Error log shows [Errno 32] Broken pipe. Log attached. 
> 
> 2. The browser shows 'Failed to connect to server (code: 1006)' which is
> related to the token being expired. Log attached.
> 
> Haproxy config below:
> 
> [root@overcloud-controller-0 ~]# grep -A6 novnc /etc/haproxy/haproxy.cfg 
> listen nova_novncproxy
>   bind 172.16.20.10:6080 
>   bind 172.16.23.10:6080 
>   balance source
>   server overcloud-controller-0 172.16.20.15:6080 check fall 5 inter 2000
> rise 2
>   server overcloud-controller-1 172.16.20.13:6080 check fall 5 inter 2000
> rise 2
>   server overcloud-controller-2 172.16.20.14:6080 check fall 5 inter 2000
> rise 2

See comments 21, 30 and 31. I've suggested using a tunnel timeout. Did you try that?

Comment 43 Marius Cornea 2015-09-28 00:13:04 UTC
I've just tried it and I couldn't reproduce the disconnects anymore; the nova-novncproxy.log looks clean now.

Here's my haproxy config:

listen nova_novncproxy
  bind 172.16.20.11:6080 
  bind 172.16.23.10:6080 
  balance source
  timeout tunnel 1h
  server overcloud-controller-0 172.16.20.14:6080 check fall 5 inter 2000 rise 2
  server overcloud-controller-1 172.16.20.12:6080 check fall 5 inter 2000 rise 2
  server overcloud-controller-2 172.16.20.15:6080 check fall 5 inter 2000 rise 2

Comment 44 Emilien Macchi 2015-09-28 13:23:49 UTC
It is still unclear how this got solved.
Could you please post your package versions?
Did you change something manually?

Comment 45 Marius Cornea 2015-09-28 13:39:52 UTC
I manually added 'timeout tunnel 1h' to the novnc haproxy config as suggested before. 

Package version is openstack-puppet-modules-2015.1.8-19.el7ost.noarch.

Comment 46 Benjamin Schmaus 2015-09-28 14:01:26 UTC
@Emilien,

We never added 'timeout tunnel 1h' on our client; however, we did remove the 'httpchk GET /' statement, since the timing of that check seemed to correlate with the connection reset messages.

We did, however, add a tcpka check, though in hindsight I do not think that is required.

My thought is that adding 'balance source' fixed the issue of the client being able to access the console properly.

And that removing the 'httpchk GET /' check cleared up the connection resets in the logs.

The client in our example has been running clean for a week now.
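
For illustration, a hedged sketch of the stanza as described above (addresses are placeholders; 'option tcpka' is the TCP keep-alive option mentioned, which in hindsight is probably unnecessary):

  listen nova_novncproxy
    bind <vip>:6080
    # pin each client to one backend
    balance source
    # 'option httpchk GET /' removed; the bare 'check' keyword falls back
    # to a plain TCP connect health check
    option tcpka
    server controller-0 <controller-0-ip>:6080 check fall 5 inter 2000 rise 2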

Comment 49 Marius Cornea 2015-09-28 23:01:00 UTC
I did some more testing with and without the tunnel timeout to understand the issue here. What I noticed is that, without the tunnel timeout, the VNC session would eventually end up disconnected, with the log showing the Broken pipe error. The console can be recovered by opening a new console session. With the tunnel timeout, the VNC session stayed connected over a long period.

Nevertheless, these are different issues from the one reported in this ticket. Therefore I am going to move this bug back to VERIFIED and open a new one for setting the tunnel timeout in the novnc haproxy config.

Comment 50 Marius Cornea 2015-09-28 23:20:39 UTC
I filed BZ#1267043 for setting timeout tunnel in the nova_novncproxy section.

Comment 51 Benjamin Schmaus 2015-09-29 15:42:26 UTC
*** Bug 1257328 has been marked as a duplicate of this bug. ***

Comment 55 errata-xmlrpc 2015-10-08 12:23:52 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2015:1872