Bug 1257324
Summary: novncproxy got wedged/unresponsive
Product: Red Hat OpenStack | Reporter: Jeremy <jmelvin>
Component: openstack-puppet-modules | Assignee: Emilien Macchi <emacchi>
Status: CLOSED ERRATA | QA Contact: Udi Shkalim <ushkalim>
Severity: unspecified | Docs Contact:
Priority: high
Version: 7.0 (Kilo) | CC: bschmaus, calfonso, clincoln, david.costakos, dh3, dmaley, eglynn, emacchi, jmelvin, kmorey, lbezdick, mburns, mcornea, oblaut, rbiba, rhel-osp-director-maint, rohara, sputhenp, yeylon
Target Milestone: z2 | Keywords: Triaged, ZStream
Target Release: 7.0 (Kilo)
Hardware: Unspecified
OS: Unspecified
Whiteboard:
Fixed In Version: openstack-puppet-modules-2015.1.8-14.el7ost | Doc Type: Bug Fix
Doc Text:
When a web browser connects to the novncproxy service through HAProxy, the browser must stay connected to the same back-end server. If HAProxy balances a subsequent request to a different novncproxy node, the browser loses the connection and the session fails with a "Connection reset by peer" error. This update configures the novnc HAProxy listener to balance in "source" mode, so the same back-end server is used for as long as the connection is established.
Story Points: ---
Clone Of: | Environment:
Last Closed: 2015-10-08 12:23:52 UTC | Type: Bug
Regression: --- | Mount Type: ---
Documentation: --- | CRM:
Verified Versions: | Category: ---
oVirt Team: --- | RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- | Target Upstream Version:
Embargoed:
Attachments:
Description
Jeremy
2015-08-26 19:33:43 UTC
openstack-nova-novncproxy-clone wedges/stops working.

Workaround to resolve (a consolidated sketch of the same sequence follows the attachments below):
1. pcs resource restart openstack-nova-novncproxy-clone
2. Wait for pcs to time out (a few seconds); the resources will go into an unmanaged state.
3. systemctl status openstack-nova-novncproxy shows the service stuck deactivating, but it will eventually go into the failed state.
4. pcs resource cleanup openstack-nova-novncproxy-clone
5. The console is functioning again.

The problem appears to repeat within 24 hours, and using the workaround above resolves it each time.

Created attachment 1068111 [details]
novnc3
Created attachment 1068113 [details]
novnc2
Created attachment 1068115 [details]
novnc1
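A consolidated sketch of the recovery sequence described above, assuming the Pacemaker resource and systemd unit names used in this report; exact behavior (timeouts, unmanaged/failed states) will vary by environment:

    # Restart the wedged clone resource; pcs may time out and the resources
    # can briefly show as unmanaged
    pcs resource restart openstack-nova-novncproxy-clone

    # Check the local service; it is typically stuck "deactivating" before
    # eventually reaching the failed state
    systemctl status openstack-nova-novncproxy

    # Clear the failure history so Pacemaker manages the clone again
    pcs resource cleanup openstack-nova-novncproxy-clone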
This BZ could be related to BZ# 1257328.

There are many occurrences of the following traceback in the novnc logs:

2015-08-27 13:27:49.834 24068 DEBUG nova.console.websocketproxy [-] Reaping zombies, active child count is 0 vmsg /usr/lib/python2.7/site-packages/websockify/websocket.py:828
2015-08-27 13:27:50.730 24068 DEBUG nova.console.websocketproxy [-] 172.17.80.14: new handler Process vmsg /usr/lib/python2.7/site-packages/websockify/websocket.py:828
2015-08-27 13:27:50.739 24141 INFO nova.console.websocketproxy [-] handler exception: [Errno 104] Connection reset by peer
2015-08-27 13:27:50.740 24141 DEBUG nova.console.websocketproxy [-] exception vmsg /usr/lib/python2.7/site-packages/websockify/websocket.py:828
2015-08-27 13:27:50.740 24141 TRACE nova.console.websocketproxy Traceback (most recent call last):
2015-08-27 13:27:50.740 24141 TRACE nova.console.websocketproxy   File "/usr/lib/python2.7/site-packages/websockify/websocket.py", line 878, in top_new_client
2015-08-27 13:27:50.740 24141 TRACE nova.console.websocketproxy     client = self.do_handshake(startsock, address)
2015-08-27 13:27:50.740 24141 TRACE nova.console.websocketproxy   File "/usr/lib/python2.7/site-packages/websockify/websocket.py", line 813, in do_handshake
2015-08-27 13:27:50.740 24141 TRACE nova.console.websocketproxy     self.RequestHandlerClass(retsock, address, self)
2015-08-27 13:27:50.740 24141 TRACE nova.console.websocketproxy   File "/usr/lib/python2.7/site-packages/nova/console/websocketproxy.py", line 156, in __init__
2015-08-27 13:27:50.740 24141 TRACE nova.console.websocketproxy     websockify.ProxyRequestHandler.__init__(self, *args, **kwargs)
2015-08-27 13:27:50.740 24141 TRACE nova.console.websocketproxy   File "/usr/lib/python2.7/site-packages/websockify/websocket.py", line 112, in __init__
2015-08-27 13:27:50.740 24141 TRACE nova.console.websocketproxy     SimpleHTTPRequestHandler.__init__(self, req, addr, server)
2015-08-27 13:27:50.740 24141 TRACE nova.console.websocketproxy   File "/usr/lib64/python2.7/SocketServer.py", line 649, in __init__
2015-08-27 13:27:50.740 24141 TRACE nova.console.websocketproxy     self.handle()
2015-08-27 13:27:50.740 24141 TRACE nova.console.websocketproxy   File "/usr/lib/python2.7/site-packages/websockify/websocket.py", line 544, in handle
2015-08-27 13:27:50.740 24141 TRACE nova.console.websocketproxy     SimpleHTTPRequestHandler.handle(self)
2015-08-27 13:27:50.740 24141 TRACE nova.console.websocketproxy   File "/usr/lib64/python2.7/BaseHTTPServer.py", line 340, in handle
2015-08-27 13:27:50.740 24141 TRACE nova.console.websocketproxy     self.handle_one_request()
2015-08-27 13:27:50.740 24141 TRACE nova.console.websocketproxy   File "/usr/lib64/python2.7/BaseHTTPServer.py", line 328, in handle_one_request
2015-08-27 13:27:50.740 24141 TRACE nova.console.websocketproxy     method()
2015-08-27 13:27:50.740 24141 TRACE nova.console.websocketproxy   File "/usr/lib/python2.7/site-packages/websockify/websocket.py", line 514, in do_GET
2015-08-27 13:27:50.740 24141 TRACE nova.console.websocketproxy     SimpleHTTPRequestHandler.do_GET(self)
2015-08-27 13:27:50.740 24141 TRACE nova.console.websocketproxy   File "/usr/lib64/python2.7/SimpleHTTPServer.py", line 44, in do_GET
2015-08-27 13:27:50.740 24141 TRACE nova.console.websocketproxy     f = self.send_head()
2015-08-27 13:27:50.740 24141 TRACE nova.console.websocketproxy   File "/usr/lib64/python2.7/SimpleHTTPServer.py", line 91, in send_head
2015-08-27 13:27:50.740 24141 TRACE nova.console.websocketproxy     self.send_response(200)
2015-08-27 13:27:50.740 24141 TRACE nova.console.websocketproxy   File "/usr/lib64/python2.7/BaseHTTPServer.py", line 396, in send_response
2015-08-27 13:27:50.740 24141 TRACE nova.console.websocketproxy     self.send_header('Date', self.date_time_string())
2015-08-27 13:27:50.740 24141 TRACE nova.console.websocketproxy   File "/usr/lib64/python2.7/BaseHTTPServer.py", line 401, in send_header
2015-08-27 13:27:50.740 24141 TRACE nova.console.websocketproxy     self.wfile.write("%s: %s\r\n" % (keyword, value))
2015-08-27 13:27:50.740 24141 TRACE nova.console.websocketproxy   File "/usr/lib64/python2.7/socket.py", line 324, in write
2015-08-27 13:27:50.740 24141 TRACE nova.console.websocketproxy     self.flush()
2015-08-27 13:27:50.740 24141 TRACE nova.console.websocketproxy   File "/usr/lib64/python2.7/socket.py", line 303, in flush
2015-08-27 13:27:50.740 24141 TRACE nova.console.websocketproxy     self._sock.sendall(view[write_offset:write_offset+buffer_size])
2015-08-27 13:27:50.740 24141 TRACE nova.console.websocketproxy   File "/usr/lib/python2.7/site-packages/eventlet/greenio/base.py", line 376, in sendall
2015-08-27 13:27:50.740 24141 TRACE nova.console.websocketproxy     tail = self.send(data, flags)
2015-08-27 13:27:50.740 24141 TRACE nova.console.websocketproxy   File "/usr/lib/python2.7/site-packages/eventlet/greenio/base.py", line 359, in send
2015-08-27 13:27:50.740 24141 TRACE nova.console.websocketproxy     total_sent += fd.send(data[total_sent:], flags)
2015-08-27 13:27:50.740 24141 TRACE nova.console.websocketproxy error: [Errno 104] Connection reset by peer

Is HAProxy being used in this deployment? If so, please post the haproxy config file (/etc/haproxy/haproxy.cfg IIRC). There are known issues with the configured httpchk breaking novnc.

haproxy is in use and httpchk was used as the check type for the stanza. Is there an alternative check (tcp) we can use to resolve this?

Fetching the sosreport to extract the haproxy.cfg. Also, can we clarify how this deployment was made? e.g. via rhel-osp-director or custom deployment tooling?

OSP7 Director was used for deployment.

On the basis of https://github.com/redhat-cip/puppet-openstack-cloud/issues/561 and the fix https://github.com/redhat-cip/puppet-openstack-cloud/commit/1fcfb9b5, I suspect "balance source" may also be required here.

I did the patch upstream: https://review.openstack.org/219464

Hello, the customer wants an explanation of why the upstream patch will solve the problem: "Chris, can you provide more detail on why RH believes this is the fix? We are not bouncing around to different controllers. The novncproxy process is getting wedged or hung. I need the push for the explanation, and the fix pushed harder. We have to restart all of these services multiple times each day right now." Are we positive that HTTP load balancing instead of TCP load balancing is causing the hang? If so, is there an explanation I can use to communicate to the customer? I plan to guide them through the haproxy.cfg changes that are being merged into this patch manually to get them going right away; I was able to confirm that there are no regular Puppet runs that would change the cfg file back.

Does the customer have logs from haproxy? It would be interesting to see whether a timeout is occurring in haproxy, which would be easy to identify in the logs.

I think I have a fix for that. If the novncproxy connection is being closed by haproxy due to a client/server timeout, you can fix this by setting 'timeout tunnel 1h' in the novncproxy section of haproxy.cfg, or change the balance mode to 'source' so it does not time out.
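For illustration only, a minimal sketch of what a nova_novncproxy stanza could look like with both suggestions applied ('balance source' plus a long tunnel timeout) and a plain TCP health check rather than httpchk; the addresses below are placeholders, not values from this deployment:

    listen nova_novncproxy
      bind <vip-address>:6080
      balance source
      timeout tunnel 1h
      server controller-0 <controller-0-address>:6080 check fall 5 inter 2000 rise 2
      server controller-1 <controller-1-address>:6080 check fall 5 inter 2000 rise 2
      server controller-2 <controller-2-address>:6080 check fall 5 inter 2000 rise 2

Without an 'option httpchk' line, the 'check' keyword falls back to a plain TCP connect check, which avoids the httpchk issue mentioned above.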
moving to openstack-puppet-modules component b/c the proposed fix is in puppet-tripleo.

On version openstack-puppet-modules-2015.1.8-17.el7ost.noarch:

[heat-admin@overcloud-controller-0 ~]$ cat /etc/haproxy/haproxy.cfg | grep -i -A 6 novnc
listen nova_novncproxy
  bind 10.35.180.10:6080
  bind 172.17.0.10:6080
  balance source
  server overcloud-controller-0 172.17.0.14:6080 check fall 5 inter 2000 rise 2
  server overcloud-controller-1 172.17.0.17:6080 check fall 5 inter 2000 rise 2
  server overcloud-controller-2 172.17.0.16:6080 check fall 5 inter 2000 rise 2

Only "balance source" is present, as per the upstream patch. TCP keep-alive values are as stated above:

[heat-admin@overcloud-controller-0 ~]$ cat /proc/sys/net/ipv4/tcp_keepalive_intvl
1
[heat-admin@overcloud-controller-0 ~]$ cat /proc/sys/net/ipv4/tcp_keepalive_time
5
[heat-admin@overcloud-controller-0 ~]$ cat /proc/sys/net/ipv4/tcp_keepalive_probes
5

Is this enough to verify, or am I missing anything?

Verified on version openstack-puppet-modules-2015.1.8-17.el7ost.noarch. The haproxy configuration was updated as stated above, and the instance console worked for 30 minutes after installation without issues.

*** Bug 1257753 has been marked as a duplicate of this bug. ***

I still get disconnects with the config provided by openstack-puppet-modules-2015.1.8-17.el7ost.noarch. I identified two types of disconnects:

1. The browser shows 'Server disconnected' and can be recovered by refreshing the window. The error log shows [Errno 32] Broken pipe. Log attached.
2. The browser shows 'Failed to connect to server (code: 1006)', which is related to the token being expired. Log attached.

Haproxy config below:

[root@overcloud-controller-0 ~]# grep -A6 novnc /etc/haproxy/haproxy.cfg
listen nova_novncproxy
  bind 172.16.20.10:6080
  bind 172.16.23.10:6080
  balance source
  server overcloud-controller-0 172.16.20.15:6080 check fall 5 inter 2000 rise 2
  server overcloud-controller-1 172.16.20.13:6080 check fall 5 inter 2000 rise 2
  server overcloud-controller-2 172.16.20.14:6080 check fall 5 inter 2000 rise 2

Created attachment 1077221 [details]
Error broken pipe
Created attachment 1077222 [details]
Error token expired
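As an aside on the keep-alive values quoted in the verification comment above, a sketch of how one might inspect the kernel TCP keep-alive sysctls behind those /proc files (the values are environment-specific; this is not a recommendation):

    # Equivalent to reading the /proc/sys/net/ipv4/tcp_keepalive_* files quoted above
    sysctl net.ipv4.tcp_keepalive_time net.ipv4.tcp_keepalive_intvl net.ipv4.tcp_keepalive_probes

    # Hypothetical example of persisting site-specific values across reboots
    # echo 'net.ipv4.tcp_keepalive_time = 5' > /etc/sysctl.d/99-keepalive.conf
    # sysctl -p /etc/sysctl.d/99-keepalive.conf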
(In reply to Marius Cornea from comment #39)
> I still get disconnects with the config provided by
> openstack-puppet-modules-2015.1.8-17.el7ost.noarch
>
> I identified 2 types of disconnects:
>
> 1. The browser shows 'Server disconnected' and it can be recovered by
> refreshing the window. Error log shows [Errno 32] Broken pipe. Log attached.
>
> 2. The browser shows 'Failed to connect to server (code: 1006)' which is
> related to the token being expired. Log attached.
>
> Haproxy config below:
>
> [root@overcloud-controller-0 ~]# grep -A6 novnc /etc/haproxy/haproxy.cfg
> listen nova_novncproxy
>   bind 172.16.20.10:6080
>   bind 172.16.23.10:6080
>   balance source
>   server overcloud-controller-0 172.16.20.15:6080 check fall 5 inter 2000 rise 2
>   server overcloud-controller-1 172.16.20.13:6080 check fall 5 inter 2000 rise 2
>   server overcloud-controller-2 172.16.20.14:6080 check fall 5 inter 2000 rise 2

See comments 21, 30 and 31. I've suggested using a tunnel timeout. Did you try that?

I've just tried it and I can't reproduce the disconnects anymore; the nova-novncproxy.log looks clean now. Here's my haproxy config:

listen nova_novncproxy
  bind 172.16.20.11:6080
  bind 172.16.23.10:6080
  balance source
  timeout tunnel 1h
  server overcloud-controller-0 172.16.20.14:6080 check fall 5 inter 2000 rise 2
  server overcloud-controller-1 172.16.20.12:6080 check fall 5 inter 2000 rise 2
  server overcloud-controller-2 172.16.20.15:6080 check fall 5 inter 2000 rise 2

It is still unclear how this got solved. Could you please post your package versions? Did you change something manually?

I manually added 'timeout tunnel 1h' to the novnc haproxy config, as suggested before. The package version is openstack-puppet-modules-2015.1.8-19.el7ost.noarch.

@Emilien, we never added 'timeout tunnel 1h' on our client; however, we did remove the 'httpchk GET /' statement, since the timing of that check seemed to correlate with the connection reset messages. We did also add a tcpka check, although in hindsight I do not think that is required. My thinking is that adding 'balance source' fixed the issue of the client being able to access the console properly, and that removing the 'httpchk GET /' cleared up the connection resets in the logs. The client in our example has been running clean for a week now.

I did some more testing with and without the tunnel timeout to understand the issue here. What I noticed is that, without the tunnel timeout set, the VNC session would eventually end up disconnected, with the log showing the Broken pipe error; the console can be recovered by opening a new console session. With the tunnel timeout, the VNC session stayed connected over a long period. Nevertheless, these are different issues from the ones reported in this ticket, so I am going to move this bug back to verified and open a new one for setting the tunnel timeout in the novnc haproxy config.

I filed BZ#1267043 for setting timeout tunnel in the nova_novncproxy section.

*** Bug 1257328 has been marked as a duplicate of this bug. ***

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2015:1872