Bug 1326940

Summary: After ovirt-engine is restarted, hypervisors stop listening on 54321 until vdsm is restarted.
Product: Red Hat Enterprise Virtualization Manager
Reporter: Allie DeVolder <adevolder>
Component: vdsm
Assignee: Piotr Kliczewski <pkliczew>
Status: CLOSED ERRATA
QA Contact: Petr Matyáš <pmatyas>
Severity: high
Priority: high
Docs Contact:
Version: 3.6.3
CC: adevolder, bazulay, bcholler, billcarlson, dondavis, fdeutsch, fsun, gklein, gveitmic, gwatson, jbelka, jhunsaker, lsurette, lsvaty, mgoldboi, mkalinin, mperina, myakove, nsoffer, olim, oourfali, pkliczew, rmcswain, srevivo, ycui, ykaul
Target Milestone: ovirt-4.1.0-alpha
Keywords: Reopened, ZStream
Target Release: ---
Hardware: Unspecified
OS: Unspecified
Whiteboard:
Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Story Points: ---
Clone Of:
Clones: 1391877 (view as bug list)
Environment:
Last Closed: 2017-04-25 00:46:55 UTC
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: Infra
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Embargoed:
Bug Depends On:
Bug Blocks: 1391877

Attachments:
- updated vdsm.log
- updated supervdsm.log
- updated messages

Description Allie DeVolder 2016-04-13 19:52:25 UTC
Description of problem:
When ovirt-engine is restarted on the RHEV-M, hypervisors will often stop listening on port 54321. A manual restart of vdsm will correct the issue.

Version-Release number of selected component (if applicable):
RHEV-M Host:  rhevm-3.6.3.4-0.1.el6.noarch
Hypervisor:  RHEL 7.2 server vdsm-4.17.23-0.el7ev.noarch

How reproducible:
Unknown

Steps to Reproduce:
1. Restart ovirt-engine on RHEV-M

Actual results:
In some circumstances, vdsm will stop listening.

Expected results:
All hypervisors remain functional

Comment 5 Dan Kenigsberg 2016-05-04 08:16:34 UTC
Can you show the output of `netstat -lntp` on the non-responsive host?
Does `vdsClient -s 0 getVdsCaps` work there?

The only possible scenario I see is that Vdsm has a bug blocking new connections, which you notice after an engine restart.

How many times have you seen this? Please contact pkliczew or ybronhei when this happens and grant them credentials to the affected host.
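[Editorial note] For anyone hitting this, a quick scripted equivalent of checking `netstat -lntp` is to attempt a TCP connection to port 54321. A minimal sketch (not vdsm code; the hostname below is a placeholder):

```python
import socket

def is_listening(host, port, timeout=3.0):
    """Return True if a TCP connection to host:port succeeds,
    i.e. something is still listening there."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False

# e.g. is_listening("hypervisor.example.com", 54321)
```

If this returns False while vdsmd is running, the host is in the state described in this bug and `vdsClient -s 0 getVdsCaps` will fail too.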

Comment 6 Yaniv Bronhaim 2016-05-15 08:29:47 UTC
Hi Allan, can you provide the information requested in comment #5? I restarted ovirt-engine and everything was fine afterwards: vdsmd stays responsive and the engine re-establishes the connection to it with getVdsCaps.

Comment 20 Martin Perina 2016-07-12 10:14:23 UTC
Jake, any progress getting logs as described in Comment 19?

Comment 23 Allie DeVolder 2016-07-25 18:49:18 UTC
Created attachment 1183914 [details]
updated vdsm.log

Comment 24 Allie DeVolder 2016-07-25 18:50:17 UTC
Created attachment 1183915 [details]
updated supervdsm.log

Comment 25 Allie DeVolder 2016-07-25 18:50:50 UTC
Created attachment 1183916 [details]
updated messages

Comment 49 Martin Perina 2016-09-21 11:42:49 UTC
Closing the bug for now, but feel free to reopen if the issue occurs again, and please provide the output suggested in Comment 48. Thanks

Comment 65 Oonkwee Lim 2016-10-11 18:14:30 UTC
Hello,

Is there any way to deal with the ARL's issue?

Should the change be on the RH side or on the customer side?

Thanks & Regards

Oonkwee
Emerging Technologies
RedHat Global Support

Comment 67 Oonkwee Lim 2016-10-14 22:54:50 UTC
Hello Piotr,

Answers from the customer:

The socket is not blocked, it is closed.  Running netstat on the hypervisor during a scan will show the socket drop.  Restarting vdsm restores the socket.

Additionally, the scan is agentless and authenticated as a non-privileged user.  How would it block this socket without privileges?  The fact that a non-privileged user can close the socket is a serious problem.

Comment 68 Piotr Kliczewski 2016-10-17 07:12:32 UTC
OK, in this case we can add logic to reopen the socket.

Comment 69 Martin Perina 2016-10-18 11:07:05 UTC
IMO there's no way a socket owned by the vdsm user can be closed by a non-privileged user. So the Nessus scanner is either executed as root or as a user with special privileges. Both of those cases should be solvable on the customer side by reconfiguring the Nessus scanner. On the other hand, reopening the socket will most probably not cause any additional issues, so I'm targeting this to 4.1.

Comment 82 Piotr Kliczewski 2016-11-01 19:33:11 UTC
Restoring needinfo

Comment 85 Nir Soffer 2016-11-08 21:07:44 UTC
Vdsm is running in a safe and closed network, it is not designed to be exposed to
malicious network scanners. I don't think we can support this type of usage.

Regarding comment 80 - the difference between ovirt-imageio and vdsm is the SSL
stack; vdsm is using M2Crypto by default, and ovirt-imageio is using python 
builtin SSL support.

Jiri, can you change vdsm to use the builtin SSL stack and try to reproduce again
with nmap?

To change ssl configuration, add a file named /etc/vdsm.conf.d/50_ssl.conf
with this content:

[vars]
# Specifies which ssl implementation should be used.
# There are 2 options:
# "m2c" to use the m2crypto module
# "ssl" to use the standard python ssl module
ssl_implementation = ssl

Then restart vdsm and try to break it using nmap.

Comment 91 Jiri Belka 2016-11-16 08:15:13 UTC
(In reply to Nir Soffer from comment #85)
> Vdsm is running in a safe and closed network, it is not designed to be
> exposed to malicious network scanners. I don't think we can support this
> type of usage.
> 
> Regarding comment 80 - the difference between ovirt-imageio and vdsm is
> the SSL stack; vdsm is using M2Crypto by default, and ovirt-imageio is
> using python builtin SSL support.
> 
> Jiri, can you change vdsm to use the builtin SSL stack and try to
> reproduce again with nmap?
> 
> To change ssl configuration, add a file named /etc/vdsm.conf.d/50_ssl.conf
> with this content:
> 
> [vars]
> # Specifies which ssl implementation should be used.
> # There are 2 options:
> # "m2c" to use the m2crypto module
> # "ssl" to use the standard python ssl module
> ssl_implementation = ssl
> 
> Then restart vdsm and try to break it using nmap.

This change caused vdsmd to listen only on an IPv6 TCP socket:

~~~
...storageRefresh::DEBUG::2016-11-16 07:10:24,946::hsm::383::Storage.HSM::(storageRefresh) HSM is ready
Reactor thread::INFO::2016-11-16 07:10:39,492::protocoldetector::72::ProtocolDetector.AcceptorImpl::(handle_accept) Accepting connection from ::1:47873
Reactor thread::ERROR::2016-11-16 07:10:39,511::sslutils::221::ProtocolDetector.SSLHandshakeDispatcher::(_handle_io) peer certificate does not match host name
Reactor thread::INFO::2016-11-16 07:10:39,513::protocoldetector::72::ProtocolDetector.AcceptorImpl::(handle_accept) Accepting connection from ::1:47874
...
~~~

[root@dell-r210ii-04 ~]# netstat -ap |grep 54321  
tcp6       0      0 [::]:54321              [::]:*                  LISTEN      21403/python        
[root@dell-r210ii-04 ~]# lsof -nPp 21403 | grep -i tcp
vdsm    21403 vdsm   23u     IPv6           90565982      0t0       TCP *:54321 (LISTEN)

Compare with #76.
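[Editorial note] A `tcp6`-only netstat line does not by itself mean IPv4 clients are locked out: an AF_INET6 socket with IPV6_V6ONLY cleared accepts IPv4 peers as v4-mapped addresses, which is consistent with the lsof line above showing `TCP *:54321`. A minimal sketch (not vdsm code) of such a dual-stack listener:

```python
import socket

def dual_stack_listener(port=0):
    """Single socket serving both IPv4 and IPv6 clients. netstat
    reports it as tcp6 only, but IPv4 peers still connect (they
    appear as v4-mapped ::ffff:a.b.c.d addresses)."""
    s = socket.socket(socket.AF_INET6, socket.SOCK_STREAM)
    s.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
    # 0 = accept IPv4 as well; 1 would make the socket IPv6-only
    s.setsockopt(socket.IPPROTO_IPV6, socket.IPV6_V6ONLY, 0)
    s.bind(("::", port))
    s.listen(5)
    return s
```

Whether vdsm's python-ssl path actually ends up IPv4-reachable in this setup is what the netstat/lsof outputs above are probing.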

Comment 95 Germano Veit Michel 2016-11-17 04:08:45 UTC
Hi folks,

I found this extremely easy to reproduce and it looks like a socket programming bug to me. See:

Host (maintenance mode):
# ss -l | grep 54321 | grep LISTEN
tcp    LISTEN     0      5      :::54321                :::*

"Attacker":
# nmap -sT -p 54321 192.168.0.110 (this is the Host IP)

Result:
# ss -l | grep 54321 | grep LISTEN
<empty>

What happens?

"Attacker"       "Host"
    -----SYN-----> 54321
    <---SYN+ACK--- 54321
    -----ACK-----> 54321
    ----RST+ACK--> 54321

Nothing really special: this is not an attack or random garbage, just a 3-way handshake followed by an RST. AFAIK, this is a vdsm socket programming issue.
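[Editorial note] The connect-then-RST sequence above can be reproduced without nmap: closing a plain TCP socket with SO_LINGER set to a zero timeout makes the kernel send an RST instead of a FIN. A sketch (hypothetical helper name):

```python
import socket
import struct

def rst_probe(host, port):
    """Connect, then close with an RST instead of a FIN, mimicking
    what the peer sees during an nmap -sT connect scan."""
    s = socket.create_connection((host, port))
    # SO_LINGER with l_onoff=1, l_linger=0: close() aborts with RST
    s.setsockopt(socket.SOL_SOCKET, socket.SO_LINGER,
                 struct.pack("ii", 1, 0))
    s.close()
```

A correctly written server's listen socket survives this probe and still shows up in `ss -l`; in the affected vdsm versions it disappears.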

Using stap:

#!/usr/bin/stap

probe kernel.function("inet_csk_listen_stop") {
  state = tcp_ts_get_info_state($sk)
  if ($sk != 0)
    printf("inet_csk_listen_stop state %s %6d %16s %6d %6d\n", tcp_sockstate_str(state), uid(), execname(), pid(),
           inet_get_local_port($sk))
}

probe kernel.function("tcp_close") {
  state = tcp_ts_get_info_state($sk)
  if ($sk != 0)
    printf("tcp_close state %s %6d %16s %6d %6d\n", tcp_sockstate_str(state), uid(), execname(), pid(),
           inet_get_local_port($sk))
}

probe kernel.function("tcp_rcv_state_process") {
  state = tcp_ts_get_info_state($sk)
  if ($sk != 0)
    printf("tcp_rcv_state_process state %s %6d %16s %6d %6d\n", tcp_sockstate_str(state), uid(), execname(), pid(),
           inet_get_local_port($sk))
}

Result when "nmap -sT -p 54321 192.168.0.110" is run:
tcp_rcv_state_process state TCP_LISTEN      0        swapper/3      0  54321
tcp_rcv_state_process state TCP_SYN_RECV    0        swapper/3      0  54321
tcp_close             state TCP_LISTEN     36             vdsm   5635  54321  <--- *
inet_csk_listen_stop  state TCP_CLOSE      36             vdsm   5635  54321

* Why are we closing something which is in LISTEN? The child socket for the accepted connection has already completed the 3-way handshake and is not in LISTEN anymore. So this is userspace closing the listen socket, and that's the bug.

The handler for the already ESTABLISHED connection (after the 3-way handshake) should have nothing to do with the main listening socket, so no matter what that handler does, it should not interfere with the LISTEN socket.

Even if we send an RST to a LISTEN socket (which should not be the case here), the kernel won't close it (see tcp_rcv_state_process). The only thing left is something wrong in userspace. Looking at the code I can't point out exactly what is wrong, but something is, and it looks like it's in userspace (vdsm).

Thanks,
Germano

Comment 97 Nir Soffer 2016-11-17 17:45:34 UTC
(In reply to Germano Veit Michel from comment #95)
> it looks like it's in userspace (vdsm).

Vdsm does not close the listen socket, this may be an issue with the SSL
libraries used by vdsm.

Please try the same when vdsm is configured to use python builtin ssl:

[vars]
# Specifies which ssl implementation should be used.
# There are 2 options:
# "m2c" to use the m2crypto module
# "ssl" to use the standard python ssl module
ssl_implementation = ssl

According to comment 91, there is an issue with this configuration
listening only to ipv6, but we need to know if the closing listen socket
happens with this configuration.

This looks like a bug in lower level component, we need to find the component
and open a bug for this component.

Comment 98 Germano Veit Michel 2016-11-18 02:34:22 UTC
Hey Nir!

(In reply to Nir Soffer from comment #97)
> This looks like a bug in lower level component, we need to find the component
> and open a bug for this component.

Makes perfect sense to me; it does not necessarily need to be in vdsm. But that trace definitely shows that something under vdsm's PID closes the socket, be it the interpreter, a library or whatever.

I'm actually sorry, I'm not sure why I did not do this yesterday, but here is a traceback from when tcp_close is issued:

=> handle_accept in /usr/lib/python2.7/site-packages/vdsm/protocoldetector.py:64
=> accept in /usr/lib64/python2.7/socket.py:201
=> __init__ in /usr/lib64/python2.7/socket.py:185
=> meth in /usr/lib64/python2.7/socket.py:223
=> meth in /usr/lib64/python2.7/socket.py:223
=> handle_close in /usr/lib/python2.7/site-packages/yajsonrpc/betterAsyncore.py:48
=> _delegate_call in /usr/lib/python2.7/site-packages/yajsonrpc/betterAsyncore.py:141
=> handle_close in /usr/lib/python2.7/site-packages/vdsm/protocoldetector.py:75
=> close in /usr/lib64/python2.7/asyncore.py:403
=> del_channel in /usr/lib/python2.7/site-packages/yajsonrpc/betterAsyncore.py:136
=> del_channel in /usr/lib64/python2.7/asyncore.py:286
=> close in /usr/lib64/python2.7/socket.py:192
tcp_close state TCP_LISTEN     36             vdsm  18997  54321

Is it coming from yajsonrpc?

vdsm-yajsonrpc-4.18.13-1.el7ev.noarch
vdsm-4.18.13-1.el7ev.x86_64

(In reply to Nir Soffer from comment #97)
> Please try the same when vdsm is configured to use python builtin ssl:

I just did.
1) The socket listens to both IPv4 and IPv6
2) The Manager cannot connect:
   vdsm ProtocolDetector.SSLHandshakeDispatcher ERROR peer certificate does not match host name
3) The nmap 3-way handshake makes userspace close the LISTEN socket.

So it looks like it's not related to SSL. By the way, here is the traceback when using python's ssl.
The only difference is the vdsm pid (18997 vs 19840):

=> mutex in /usr/lib/python2.7/site-packages/pthread.py:92
=> readwrite in /usr/lib64/python2.7/asyncore.py:105
=> handle_read_event in /usr/lib64/python2.7/asyncore.py:439
=> handle_accept in /usr/lib/python2.7/site-packages/yajsonrpc/betterAsyncore.py:51
=> _delegate_call in /usr/lib/python2.7/site-packages/yajsonrpc/betterAsyncore.py:141
=> handle_accept in /usr/lib/python2.7/site-packages/vdsm/protocoldetector.py:64
=> accept in /usr/lib64/python2.7/socket.py:201
=> __init__ in /usr/lib64/python2.7/socket.py:185
=> meth in /usr/lib64/python2.7/socket.py:223
=> meth in /usr/lib64/python2.7/socket.py:223
=> handle_close in /usr/lib/python2.7/site-packages/yajsonrpc/betterAsyncore.py:48
=> _delegate_call in /usr/lib/python2.7/site-packages/yajsonrpc/betterAsyncore.py:141
=> handle_close in /usr/lib/python2.7/site-packages/vdsm/protocoldetector.py:75
=> close in /usr/lib64/python2.7/asyncore.py:403
=> del_channel in /usr/lib/python2.7/site-packages/yajsonrpc/betterAsyncore.py:136
=> del_channel in /usr/lib64/python2.7/asyncore.py:286
=> close in /usr/lib64/python2.7/socket.py:192
tcp_close state TCP_LISTEN     36             vdsm  19840  54321

Hopefully this helps you solve the bug!

Cheers,

Comment 100 Germano Veit Michel 2016-11-18 03:06:43 UTC
Some more useful info. I CANNOT reproduce it on:
RHEV-H 6.7 Host (20160219.0.el6ev) in a 3.5 Environment:
vdsm-4.16.35-2.el6ev.x86_64
vdsm-yajsonrpc-4.16.35-2.el6ev.noarch

Comment 101 Nir Soffer 2016-11-18 09:43:00 UTC
(In reply to Germano Veit Michel from comment #100)
> Some more useful info. I CANNOT reproduce it on:
> RHEV-H 6.7 Host (20160219.0.el6ev) in a 3.5 Environment:
> vdsm-4.16.35-2.el6ev.x86_64
> vdsm-yajsonrpc-4.16.35-2.el6ev.noarch

Thanks Germano, your research is very useful!

Indeed, protocol detector implementation was "improved" in 3.6, which may
explain the situation.

Do you have vdsm logs from 3.6 and 3.5 tests? The vdsm logs from 3.5 may be the 
most useful.

Comment 102 Nir Soffer 2016-11-18 21:10:07 UTC
Turns out that this is a mix of bad error handling logic in python
asyncore library, and bad error handling in vdsm protocoldetector.

Right after we accept a connection, we log the connection's remote address
and create a dispatcher for the connection. In this flow, we call getpeername()
on the client socket. If the remote side closed the connection immediately,
for example during network scanning, getpeername() raises socket.error.

Because of bad error handling in asyncore, the socket.error is treated as a
listen socket error, and the listen socket is closed.

This faulty vdsm code was introduced in this commit:

commit b25dbdcce417f366e0654c2c8456e8f0d0e3059e
Author: Saggi Mizrahi <smizrahi>
Date:   Mon Jan 19 19:24:41 2015 +0200

    Have the protocol detector use the Reactor
    
    The current implementation duplicates a lot of the code in the reactor.
    This removes a lot of the duplicate code and prepares the ground for a
    single reactor for VDSM.
    
    Change-Id: If1cc2a205cdad335e8c3af4a77ed49c8977a79ba
    Signed-off-by: Saggi Mizrahi <smizrahi>
    Signed-off-by: pkliczewski <piotr.kliczewski>
    Reviewed-on: https://gerrit.ovirt.org/37098
    Reviewed-by: Yaniv Bronhaim <ybronhei>
    Reviewed-by: Dan Kenigsberg <danken>

This code was released in vdsm-4.17.2.

Patch https://gerrit.ovirt.org/67056 fixes this issue, and can be easily ported
to 4.0 and 3.6.
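[Editorial note] The shape of the fix described above, sketched with hypothetical names (the actual change is in https://gerrit.ovirt.org/67056): errors raised by the just-accepted client socket must drop only that connection, never the listener.

```python
import logging
import socket

log = logging.getLogger("ProtocolDetector")

def handle_accept(listener):
    """Hardened accept path: socket.error from the client socket
    (e.g. the peer reset the connection before getpeername() ran)
    drops only that connection; the listen socket keeps listening."""
    try:
        client, _ = listener.accept()
    except socket.error:
        return None  # transient accept failure, keep listening
    try:
        # raises socket.error if the peer already closed/reset
        peer = client.getpeername()
        log.info("Accepting connection from %s", peer)
    except socket.error:
        client.close()  # discard only this connection
        return None
    return client  # hand off to the protocol detector
```

The buggy flow let the second socket.error propagate up through asyncore, which then closed the dispatcher owning the listen socket.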

Comment 105 Jiri Belka 2017-01-24 11:10:06 UTC
ok, vdsm-4.19.2-2.el7ev.x86_64

Comment 106 Bill Carlson 2017-04-11 20:19:10 UTC
Will this be made available for 3.6?

Also, may I get a reason why vdsm listens on interfaces other than the RHEV subnet?

Comment 108 Germano Veit Michel 2017-04-28 02:57:15 UTC
(In reply to Bill Carlson from comment #106)

Hi Bill,

> Will this be made available for 3.6?
> 
This was backported to vdsm-4.17.37

> Also, may I get a reason why vdsm listens on interfaces other than RHEV
> subnet?
Please open a support case or a bugzilla about this, it is not related to this bug.