Description of problem: After a network cable exchange of an Ovirt node the VDSM consumed 100% cpu. The process did not recover although the network connectivity was fine again. It seems as if vdsm can run in endless error loop if communication from client breaks (during handshakte?). Version-Release number of selected component (if applicable): vdsm 4.16.10 How reproducible: 100% Steps to Reproduce: The following simplified setup should produce the error after a few seconds/minutes. 1. Create vdsm client script vdx.sh on Ovirt node. #!/bin/sh while [ 1 -eq 1 ]; do vdsClient -s 0 getVdsCapabilities > /dev/null echo -n . done 2. execute vdx.sh 3. during script output hit CTRL-C at random times. So that current client request is aborted but scripts runs on. Actual results: After 40-50 times VDSM CPU usage raises to 100%. vdsm log gives tons of the following error messages: Detector thread::ERROR::2015-06-09 21:14:01,515::protocoldetector::90::vds.MultiProtocolAcceptor::(serve_forever) Unhandled exception Traceback (most recent call last): File "/usr/share/vdsm/protocoldetector.py", line 86, in serve_forever self._process_events() File "/usr/share/vdsm/protocoldetector.py", line 105, in _process_events self._handle_connection_read(fd) File "/usr/share/vdsm/protocoldetector.py", line 225, in _handle_connection_read data = client_socket.recv(self._required_size, socket.MSG_PEEK) File "/usr/lib/python2.7/site-packages/vdsm/sslutils.py", line 58, in read self._data = self.connection.read(size) File "/usr/lib64/python2.7/site-packages/M2Crypto/SSL/Connection.py", line 229, in read return self._read_nbio(size) File "/usr/lib64/python2.7/site-packages/M2Crypto/SSL/Connection.py", line 218, in _read_nbio return m2.ssl_read_nbio(self.ssl, size) SSLError: unexpected eof Detector thread::ERROR::2015-06-09 21:14:01,516::protocoldetector::90::vds.MultiProtocolAcceptor::(serve_forever) Unhandled exception Traceback (most recent call last): File "/usr/share/vdsm/protocoldetector.py", line 86, in serve_forever self._process_events() File "/usr/share/vdsm/protocoldetector.py", line 105, in _process_events self._handle_connection_read(fd) File "/usr/share/vdsm/protocoldetector.py", line 225, in _handle_connection_read data = client_socket.recv(self._required_size, socket.MSG_PEEK) File "/usr/lib/python2.7/site-packages/vdsm/sslutils.py", line 58, in read self._data = self.connection.read(size) File "/usr/lib64/python2.7/site-packages/M2Crypto/SSL/Connection.py", line 229, in read return self._read_nbio(size) File "/usr/lib64/python2.7/site-packages/M2Crypto/SSL/Connection.py", line 218, in _read_nbio return m2.ssl_read_nbio(self.ssl, size) SSLError: unexpected eof ... top KiB Mem: 24682112 total, 5454764 used, 19227348 free, 88936 buffers KiB Swap: 10235900 total, 0 used, 10235900 free, 4402944 cached PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND 5821 vdsm 0 -20 2102668 62436 21796 S 101,8 0,3 0:44.14 vdsm 373 root 20 0 228164 119976 115472 S 32,6 0,5 481:22.23 systemd-journal 31680 root 20 0 198988 14168 7248 S 1,0 0,1 0:13.02 iotop 7 root 20 0 0 0 0 S 0,3 0,0 10:54.30 rcu_sched 9 root 20 0 0 0 0 S 0,3 0,0 4:24.23 rcuos/1 4678 root 20 0 0 0 0 S 0,3 0,0 0:00.17 kworker/0:0 1 root 20 0 50036 8192 5440 S 0,0 0,0 1:16.15 systemd Expected results: stable operation of vdsm
Please provide engine and vdsm logs.
Log attached error starts at 14:07:48,102 engine log is not needed. node is in maintenance mode.
Created attachment 1037253 [details] vdsm log
I would like to understand what happened with the engine. Please provide engine log as well.
Sorry for not clarifying the situation completely. We noticed that one of the nodes had high CPU usage after a maintenance. This was discovered a day later and we started to analyze the situation on the node. We could elaborate the initial situation only from brainstorming. So it might be not 100% correct. Afterwards it took us about two weeks to identify a simple reproducer. See script vdx.sh. Looking back at our logs I'm sorry that I'm currently no longer able to provide the logs of the time of first occurrence. I hope that you have enough data to test it yourself.
Piotr - did you try to reproduce?
Working on it now
*** Bug 1217896 has been marked as a duplicate of this bug. ***
Verified in vdsm-4.16.22-1.el7ev.x86_64
This is an automated message. oVirt 3.5.4 has been released on September 3rd 2015 and should include the fix for this BZ. Moving to closed current release.