Bug 1326940
Summary: After ovirt-engine is restarted, hypervisors stop listening on 54321 until vdsm is restarted.

Product: Red Hat Enterprise Virtualization Manager | Reporter: Allie DeVolder <adevolder>
Component: vdsm | Assignee: Piotr Kliczewski <pkliczew>
Status: CLOSED ERRATA | QA Contact: Petr Matyáš <pmatyas>
Severity: high | Docs Contact:
Priority: high
Version: 3.6.3 | CC: adevolder, bazulay, bcholler, billcarlson, dondavis, fdeutsch, fsun, gklein, gveitmic, gwatson, jbelka, jhunsaker, lsurette, lsvaty, mgoldboi, mkalinin, mperina, myakove, nsoffer, olim, oourfali, pkliczew, rmcswain, srevivo, ycui, ykaul
Target Milestone: ovirt-4.1.0-alpha | Keywords: Reopened, ZStream
Target Release: ---
Hardware: Unspecified
OS: Unspecified
Whiteboard:
Fixed In Version: | Doc Type: Bug Fix
Doc Text: | Story Points: ---
Clone Of:
: 1391877 (view as bug list) | Environment:
Last Closed: 2017-04-25 00:46:55 UTC | Type: Bug
Regression: --- | Mount Type: ---
Documentation: --- | CRM:
Verified Versions: | Category: ---
oVirt Team: Infra | RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- | Target Upstream Version:
Embargoed:
Bug Depends On:
Bug Blocks: 1391877
Attachments: updated vdsm.log, updated supervdsm.log, updated messages
Description
Allie DeVolder
2016-04-13 19:52:25 UTC
Can you show the output of `netstat -lntp` on the non-responsive host? Does `vdsClient -s 0 getVdsCaps` work there? The only possible scenario I see is that Vdsm has a bug blocking new connections, which you notice after an engine restart. How many times have you seen this? Please contact pkliczew or ybronhei when this happens and grant them credentials to the affected host.

Hi Allan, can you provide the information in comment #5?

I restart ovirt-engine and everything is fine afterwards. vdsmd stays responsive and the engine re-establishes the connection to it with getVdsCaps.

Jake, any progress getting logs as described in Comment 19?

Created attachment 1183914 [details]
updated vdsm.log
Created attachment 1183915 [details]
updated supervdsm.log
Created attachment 1183916 [details]
updated messages
Closing the bug for now, but feel free to reopen if the issue occurs again, and please provide the output suggested in Comment 48. Thanks

Hello, is there any way to deal with the ARL's issue? Should there be a change on the RH side or on the customer side? Thanks & Regards, Oonkwee, Emerging Technologies, Red Hat Global Support

Hello Piotr, answers from the customer:

The socket is not blocked, it is closed. Running netstat on the hypervisor during a scan will show the socket drop. Restarting vdsm restores the socket. Additionally, the scan is agentless and authenticated as a non-privileged user. How would it block this socket without privileges? The fact that a non-privileged user can close the socket is a serious problem.

OK, in this case we can add logic to reopen the socket.

IMO there's no way a socket owned by the vdsm user can be closed by a non-privileged user. So the Nessus scanner is either executed under the root user or under a user with special privileges. Both of those cases should be solvable on the customer side by reconfiguring the Nessus scanner. On the other hand, reopening the socket will most probably not cause any additional issues, so I'm targeting this to 4.1.

Restoring needinfo

Vdsm is running in a safe and closed network; it is not designed to be exposed to malicious network scanners. I don't think we can support this type of usage.

Regarding comment 80 - the difference between ovirt-imageio and vdsm is the SSL stack; vdsm is using M2Crypto by default, and ovirt-imageio is using Python's builtin SSL support.

Jiri, can you change vdsm to use the builtin SSL stack and try to reproduce again with nmap? To change the ssl configuration, add a file named /etc/vdsm.conf.d/50_ssl.conf with this content:

~~~
[vars]
# Specifies which ssl implementation should be used.
# There are 2 options:
# "m2c" to use the m2crypto module
# "ssl" to use the standard python ssl module
ssl_implementation = ssl
~~~

Then restart vdsm and try to break it using nmap.

(In reply to Nir Soffer from comment #85)
> Jiri, can you change vdsm to use the builtin SSL stack and try to reproduce
> again with nmap?
> ...

This change caused vdsmd to listen only on an IPv6 TCP socket:

~~~
...storageRefresh::DEBUG::2016-11-16 07:10:24,946::hsm::383::Storage.HSM::(storageRefresh) HSM is ready
Reactor thread::INFO::2016-11-16 07:10:39,492::protocoldetector::72::ProtocolDetector.AcceptorImpl::(handle_accept) Accepting connection from ::1:47873
Reactor thread::ERROR::2016-11-16 07:10:39,511::sslutils::221::ProtocolDetector.SSLHandshakeDispatcher::(_handle_io) peer certificate does not match host name
Reactor thread::INFO::2016-11-16 07:10:39,513::protocoldetector::72::ProtocolDetector.AcceptorImpl::(handle_accept) Accepting connection from ::1:47874
...
~~~

~~~
[root@dell-r210ii-04 ~]# netstat -ap | grep 54321
tcp6       0      0 [::]:54321       [::]:*       LISTEN      21403/python
[root@dell-r210ii-04 ~]# lsof -nPp 21403 | grep -i tcp
vdsm    21403 vdsm   23u  IPv6 90565982   0t0  TCP *:54321 (LISTEN)
~~~

Compare with #76.

Hi folks, I found this extremely easy to reproduce, and it looks like a socket programming bug to me. See:

Host (maintenance mode):

~~~
# ss -l | grep 54321 | grep LISTEN
tcp    LISTEN     0      5      :::54321      :::*
~~~

"Attacker":

~~~
# nmap -sT -p 54321 192.168.0.110    (this is the Host IP)
~~~

Result:

~~~
# ss -l | grep 54321 | grep LISTEN
<empty>
~~~

What happens?

~~~
"Attacker"                 "Host"
   -----SYN----->   54321
   <---SYN+ACK---   54321
   -----ACK----->   54321
   ----RST+ACK-->   54321
~~~

Nothing really special; this is not an attack or some random garbage. It's a 3-way handshake + RST. AFAIK, this is a vdsm socket programming issue.

Using stap:

~~~
#!/usr/bin/stap

probe kernel.function("inet_csk_listen_stop") {
    state = tcp_ts_get_info_state($sk)
    if ($sk != 0)
        printf("inet_csk_listen_stop state %s %6d %16s %6d %6d\n",
               tcp_sockstate_str(state), uid(), execname(), pid(),
               inet_get_local_port($sk))
}

probe kernel.function("tcp_close") {
    state = tcp_ts_get_info_state($sk)
    if ($sk != 0)
        printf("tcp_close state %s %6d %16s %6d %6d\n",
               tcp_sockstate_str(state), uid(), execname(), pid(),
               inet_get_local_port($sk))
}

probe kernel.function("tcp_rcv_state_process") {
    state = tcp_ts_get_info_state($sk)
    if ($sk != 0)
        printf("tcp_rcv_state_process state %s %6d %16s %6d %6d\n",
               tcp_sockstate_str(state), uid(), execname(), pid(),
               inet_get_local_port($sk))
}
~~~

Result when "nmap -sT -p 54321 192.168.0.110" is run:

~~~
tcp_rcv_state_process state TCP_LISTEN     0   swapper/3      0  54321
tcp_rcv_state_process state TCP_SYN_RECV   0   swapper/3      0  54321
tcp_close             state TCP_LISTEN    36        vdsm   5635  54321   <--- *
inet_csk_listen_stop  state TCP_CLOSE     36        vdsm   5635  54321
~~~

* Why are we closing something which is in LISTEN? The child handler is already past the three ACKs and is not in LISTEN anymore. So this is us closing the listen socket, and that's the bug. The handler for the already ESTABLISHED connection (after the three ACKs) should have nothing to do with the main listening socket, so no matter what that handler does, it should not interfere with the LISTEN socket. Even if we send an RST to a LISTEN socket (which should not be the case here), the kernel won't close it (see tcp_rcv_state_process). The only thing left is something wrong in userspace. Looking at the code I can't point out exactly what is wrong, but something is, and it looks like it's in userspace (vdsm).

Thanks,
Germano
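The 3-way handshake followed by an RST described above can also be produced without nmap. A minimal sketch, not from the bug report, using the host IP from the example above as a placeholder: it completes the TCP handshake and then closes with SO_LINGER set to zero so the kernel sends an RST instead of a FIN.

~~~
# Minimal sketch of the "3-way handshake + RST" probe described above.
# HOST is a placeholder for the hypervisor address; 54321 is the vdsm port.
import socket
import struct

HOST = "192.168.0.110"
PORT = 54321

s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
# SO_LINGER with l_onoff=1 and l_linger=0 makes close() send RST, not FIN.
s.setsockopt(socket.SOL_SOCKET, socket.SO_LINGER, struct.pack("ii", 1, 0))
s.connect((HOST, PORT))   # SYN, SYN+ACK, ACK
s.close()                 # RST+ACK
~~~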
(In reply to Germano Veit Michel from comment #95)
> it looks like it's in userspace (vdsm).

Vdsm does not close the listen socket; this may be an issue with the SSL libraries used by vdsm. Please try the same when vdsm is configured to use Python's builtin ssl:

~~~
[vars]
# Specifies which ssl implementation should be used.
# There are 2 options:
# "m2c" to use the m2crypto module
# "ssl" to use the standard python ssl module
ssl_implementation = ssl
~~~

According to comment 91, there is an issue with this configuration listening only on IPv6, but we need to know whether the listen socket still gets closed with this configuration. This looks like a bug in a lower level component; we need to find the component and open a bug against it.

Hey Nir!

(In reply to Nir Soffer from comment #97)
> This looks like a bug in lower level component, we need to find the component
> and open a bug for this component.

Makes perfect sense to me; it does not necessarily need to be in vdsm. But that trace definitely shows that it is something under vdsm's PID that closes the socket, be it the interpreter, libraries or whatever.

I'm actually sorry, I'm not sure why I did not do this yesterday, but here is a traceback when the tcp_close is issued:

~~~
=> handle_accept in /usr/lib/python2.7/site-packages/vdsm/protocoldetector.py:64
=> accept in /usr/lib64/python2.7/socket.py:201
=> __init__ in /usr/lib64/python2.7/socket.py:185
=> meth in /usr/lib64/python2.7/socket.py:223
=> meth in /usr/lib64/python2.7/socket.py:223
=> handle_close in /usr/lib/python2.7/site-packages/yajsonrpc/betterAsyncore.py:48
=> _delegate_call in /usr/lib/python2.7/site-packages/yajsonrpc/betterAsyncore.py:141
=> handle_close in /usr/lib/python2.7/site-packages/vdsm/protocoldetector.py:75
=> close in /usr/lib64/python2.7/asyncore.py:403
=> del_channel in /usr/lib/python2.7/site-packages/yajsonrpc/betterAsyncore.py:136
=> del_channel in /usr/lib64/python2.7/asyncore.py:286
=> close in /usr/lib64/python2.7/socket.py:192
tcp_close state TCP_LISTEN 36 vdsm 18997 54321
~~~

Is it coming from yajsonrpc?

vdsm-yajsonrpc-4.18.13-1.el7ev.noarch
vdsm-4.18.13-1.el7ev.x86_64

(In reply to Nir Soffer from comment #97)
> Please try the same when vdsm is configured to use python builtin ssl:

I just did.

1) The socket listens on both IPv4 and IPv6.
2) The Manager cannot connect: vdsm ProtocolDetector.SSLHandshakeDispatcher ERROR peer certificate does not match host name
3) The nmap 3-way handshake makes userspace close the LISTEN socket.

So it looks like it's not related to SSL. By the way, here is the traceback when using Python's ssl. The only thing that differs is the vdsm pid (18997 vs 19840):

~~~
=> mutex in /usr/lib/python2.7/site-packages/pthread.py:92
=> readwrite in /usr/lib64/python2.7/asyncore.py:105
=> handle_read_event in /usr/lib64/python2.7/asyncore.py:439
=> handle_accept in /usr/lib/python2.7/site-packages/yajsonrpc/betterAsyncore.py:51
=> _delegate_call in /usr/lib/python2.7/site-packages/yajsonrpc/betterAsyncore.py:141
=> handle_accept in /usr/lib/python2.7/site-packages/vdsm/protocoldetector.py:64
=> accept in /usr/lib64/python2.7/socket.py:201
=> __init__ in /usr/lib64/python2.7/socket.py:185
=> meth in /usr/lib64/python2.7/socket.py:223
=> meth in /usr/lib64/python2.7/socket.py:223
=> handle_close in /usr/lib/python2.7/site-packages/yajsonrpc/betterAsyncore.py:48
=> _delegate_call in /usr/lib/python2.7/site-packages/yajsonrpc/betterAsyncore.py:141
=> handle_close in /usr/lib/python2.7/site-packages/vdsm/protocoldetector.py:75
=> close in /usr/lib64/python2.7/asyncore.py:403
=> del_channel in /usr/lib/python2.7/site-packages/yajsonrpc/betterAsyncore.py:136
=> del_channel in /usr/lib64/python2.7/asyncore.py:286
=> close in /usr/lib64/python2.7/socket.py:192
tcp_close state TCP_LISTEN 36 vdsm 19840 54321
~~~

Hopefully this helps you solve the bug! Cheers,

Some more useful info. I CANNOT reproduce it on:

RHEV-H 6.7 Host (20160219.0.el6ev) in a 3.5 environment:
vdsm-4.16.35-2.el6ev.x86_64
vdsm-yajsonrpc-4.16.35-2.el6ev.noarch

(In reply to Germano Veit Michel from comment #100)
> Some more useful info. I CANNOT reproduce it on:
> RHEV-H 6.7 Host (20160219.0.el6ev) in a 3.5 Environment:
> vdsm-4.16.35-2.el6ev.x86_64
> vdsm-yajsonrpc-4.16.35-2.el6ev.noarch

Thanks Germano, your research is very useful!
Indeed, the protocol detector implementation was "improved" in 3.6, which may explain the situation. Do you have vdsm logs from the 3.6 and 3.5 tests? The vdsm logs from 3.5 may be the most useful.

Turns out that this is a mix of bad error handling logic in the Python asyncore library and bad error handling in the vdsm protocol detector. Right after we accept a connection, we log the connection's remote address and create a dispatcher with the connection. In this flow, we call getpeername() on the client socket. If the remote side closed the connection immediately, for example during network scanning, getpeername() raises socket.error. Because of the bad error handling in asyncore, the socket.error is treated as a listen socket error, and the listen socket is closed.

This faulty vdsm code was introduced in this commit:

~~~
commit b25dbdcce417f366e0654c2c8456e8f0d0e3059e
Author: Saggi Mizrahi <smizrahi>
Date:   Mon Jan 19 19:24:41 2015 +0200

    Have the protocol detector use the Reactor

    The current implementation duplicates a lot of the code in the reactor.
    This removes a lot of the duplicate code and prepares the ground for a
    single reactor for VDSM.

    Change-Id: If1cc2a205cdad335e8c3af4a77ed49c8977a79ba
    Signed-off-by: Saggi Mizrahi <smizrahi>
    Signed-off-by: pkliczewski <piotr.kliczewski>
    Reviewed-on: https://gerrit.ovirt.org/37098
    Reviewed-by: Yaniv Bronhaim <ybronhei>
    Reviewed-by: Dan Kenigsberg <danken>
~~~

This code was released in vdsm-4.17.2. Patch https://gerrit.ovirt.org/67056 fixes this issue and can be easily ported to 4.0 and 3.6.

ok, vdsm-4.19.2-2.el7ev.x86_64

Will this be made available for 3.6? Also, may I get a reason why vdsm listens on interfaces other than the RHEV subnet?

(In reply to Bill Carlson from comment #106)

Hi Bill,

> Will this be made available for 3.6?

This was backported to vdsm-4.17.37.

> Also, may I get a reason why vdsm listens on interfaces other than RHEV
> subnet?

Please open a support case or a bugzilla about this, it is not related to this bug.
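To illustrate the failure mode described above - not the actual vdsm code - here is a minimal asyncore sketch, assuming Python 2.7 as shipped on the affected hosts: an exception raised while servicing a freshly accepted and already reset connection propagates to the listening dispatcher, whose default handle_error() calls handle_close(), so the LISTEN socket itself is closed.

~~~
# A minimal sketch, not the vdsm implementation: it only illustrates how an
# error on a just-accepted (and already reset) connection can close the
# LISTEN socket when the listener is an asyncore dispatcher.
import asyncore
import socket


class Listener(asyncore.dispatcher):
    def __init__(self, port=54321):
        asyncore.dispatcher.__init__(self)
        self.create_socket(socket.AF_INET, socket.SOCK_STREAM)
        self.set_reuse_addr()
        self.bind(("", port))
        self.listen(5)

    def handle_accept(self):
        pair = self.accept()
        if pair is None:
            return
        conn, addr = pair
        # If the peer completed the handshake and immediately sent RST
        # (the nmap -sT pattern shown above), operations on the accepted
        # socket such as getpeername() can raise socket.error here.
        print("connection from", conn.getpeername())

    # asyncore.loop() wraps handle_read_event() in a try/except; on an
    # exception it calls this dispatcher's handle_error(), whose default
    # implementation calls handle_close() -> close().  Since the exception
    # was raised while servicing the *listening* dispatcher, it is the
    # listening socket that gets closed: one reset client kills the
    # listener.  Catching socket.error around the per-connection work in
    # handle_accept() keeps the listener alive; the referenced fix takes a
    # similar per-connection error-handling approach (paraphrased here).


if __name__ == "__main__":
    Listener()
    asyncore.loop()
~~~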