Bug 1229859 - vdsm hangup 100% CPU
Summary: vdsm hangup 100% CPU
Keywords:
Status: CLOSED CURRENTRELEASE
Alias: None
Product: oVirt
Classification: Retired
Component: vdsm
Version: 3.5
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: high
Target Milestone: ---
Target Release: 3.5.4
Assignee: Piotr Kliczewski
QA Contact: Petr Kubica
URL:
Whiteboard: infra
Depends On:
Blocks: 1215236
 
Reported: 2015-06-09 19:21 UTC by Markus Stockhausen
Modified: 2016-02-10 19:32 UTC
CC List: 13 users

Fixed In Version: vdsm-4.16.21-1.el6ev
Doc Type: Bug Fix
Doc Text:
Clone Of:
Environment:
Last Closed: 2015-09-03 13:54:50 UTC
oVirt Team: Infra
Embargoed:


Attachments
vdsm log (344.74 KB, application/zip), attached 2015-06-10 12:13 UTC by Markus Stockhausen


Links
System ID Private Priority Status Summary Last Updated
oVirt gerrit 42206 0 ovirt-3.5 MERGED protocoldetecor: SSLError handled not correctly Never
oVirt gerrit 42284 0 master ABANDONED protocoldetecor: SSLError handled not correctly Never
oVirt gerrit 42351 0 ovirt-3.5 MERGED protocoldetector: Remove connection after handshake error Never

Description Markus Stockhausen 2015-06-09 19:21:56 UTC
Description of problem:

After a network cable was exchanged on an oVirt node, the VDSM process consumed 100% CPU. The process did not recover even though network connectivity was fine again. It seems that vdsm can run in an endless error loop if communication from a client breaks (during the handshake?).

Version-Release number of selected component (if applicable):

vdsm 4.16.10


How reproducible:

100%


Steps to Reproduce:

The following simplified setup should produce the error after a few seconds/minutes.

1. Create the vdsm client script vdx.sh on the oVirt node.

#!/bin/sh
# Query vdsm capabilities over SSL in a tight loop.
while true; do
  vdsClient -s 0 getVdsCapabilities > /dev/null
  echo -n .
done

2. Execute vdx.sh.

3. While the script is producing output, hit CTRL-C at random times, so that the current client request is aborted but the script keeps running.
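The condition the CTRL-C produces is a client that starts talking to the SSL port and then vanishes mid-handshake. The sketch below illustrates that pattern without vdsClient; both ends run locally here purely for demonstration, and against a real node the client would instead connect to vdsm's listening port (54321 by default). All of this is illustrative, not taken from the actual test setup.

```python
# Illustration: a client that sends a partial TLS record and aborts,
# leaving the accepting side with a truncated handshake followed by EOF.
import socket

# Local stand-in for the listening side (a real test would target the node).
server = socket.socket()
server.bind(("127.0.0.1", 0))
server.listen(1)
port = server.getsockname()[1]

client = socket.socket()
client.connect(("127.0.0.1", port))
conn, _ = server.accept()

# Send only the first bytes of a TLS record header, then drop the
# connection, which is roughly what an aborted vdsClient run produces.
client.send(b"\x16\x03\x01")
client.close()

data = conn.recv(16)   # the partial record arrives...
eof = conn.recv(16)    # ...followed by EOF (b"")
print(len(data), eof)
conn.close()
server.close()
```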

Actual results:

After 40-50 iterations, VDSM CPU usage rises to 100%. The vdsm log fills with the following error messages:

Detector thread::ERROR::2015-06-09 21:14:01,515::protocoldetector::90::vds.MultiProtocolAcceptor::(serve_forever) Unhandled exception
Traceback (most recent call last):
  File "/usr/share/vdsm/protocoldetector.py", line 86, in serve_forever
    self._process_events()
  File "/usr/share/vdsm/protocoldetector.py", line 105, in _process_events
    self._handle_connection_read(fd)
  File "/usr/share/vdsm/protocoldetector.py", line 225, in _handle_connection_read
    data = client_socket.recv(self._required_size, socket.MSG_PEEK)
  File "/usr/lib/python2.7/site-packages/vdsm/sslutils.py", line 58, in read
    self._data = self.connection.read(size)
  File "/usr/lib64/python2.7/site-packages/M2Crypto/SSL/Connection.py", line 229, in read
    return self._read_nbio(size)
  File "/usr/lib64/python2.7/site-packages/M2Crypto/SSL/Connection.py", line 218, in _read_nbio
    return m2.ssl_read_nbio(self.ssl, size)
SSLError: unexpected eof
Detector thread::ERROR::2015-06-09 21:14:01,516::protocoldetector::90::vds.MultiProtocolAcceptor::(serve_forever) Unhandled exception
Traceback (most recent call last):
  File "/usr/share/vdsm/protocoldetector.py", line 86, in serve_forever
    self._process_events()
  File "/usr/share/vdsm/protocoldetector.py", line 105, in _process_events
    self._handle_connection_read(fd)
  File "/usr/share/vdsm/protocoldetector.py", line 225, in _handle_connection_read
    data = client_socket.recv(self._required_size, socket.MSG_PEEK)
  File "/usr/lib/python2.7/site-packages/vdsm/sslutils.py", line 58, in read
    self._data = self.connection.read(size)
  File "/usr/lib64/python2.7/site-packages/M2Crypto/SSL/Connection.py", line 229, in read
    return self._read_nbio(size)
  File "/usr/lib64/python2.7/site-packages/M2Crypto/SSL/Connection.py", line 218, in _read_nbio
    return m2.ssl_read_nbio(self.ssl, size)
SSLError: unexpected eof
...
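The traceback explains the busy loop: the SSLError escapes the read handler, so the dead socket stays registered with the poller and is reported readable again on the next iteration, forever. The merged fix ("protocoldetector: Remove connection after handshake error", gerrit 42351) suggests the remedy is to catch the error and drop the connection. A minimal sketch of that pattern follows; the class and method names are illustrative, not the actual vdsm code, and the standard ssl module stands in for M2Crypto.

```python
# Hypothetical sketch of the fix pattern: on SSLError, unregister and
# close the connection instead of letting the exception propagate and
# leaving the fd in the poll set (which caused the 100% CPU loop).
import ssl


class Detector:
    def __init__(self):
        self.connections = {}  # fd -> connection object

    def add(self, fd, conn):
        self.connections[fd] = conn

    def _handle_connection_read(self, fd):
        conn = self.connections[fd]
        try:
            conn.recv(16)  # peek at the first bytes to detect the protocol
        except ssl.SSLError:
            # Handshake/read failed: remove the connection so the poll
            # loop stops seeing the dead fd as readable.
            del self.connections[fd]
            conn.close()


class BrokenConn:
    """Fake connection whose reads fail, like a client that
    disconnected mid-handshake (the CTRL-C in the reproducer)."""
    def __init__(self):
        self.closed = False

    def recv(self, size):
        raise ssl.SSLError("unexpected eof")

    def close(self):
        self.closed = True


detector = Detector()
conn = BrokenConn()
detector.add(4, conn)
detector._handle_connection_read(4)  # must not raise, must drop the fd
print(4 in detector.connections, conn.closed)
```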

top output:

KiB Mem:  24682112 total,  5454764 used, 19227348 free,    88936 buffers
KiB Swap: 10235900 total,        0 used, 10235900 free,  4402944 cached

  PID USER      PR  NI    VIRT    RES    SHR S  %CPU %MEM     TIME+ COMMAND
 5821 vdsm       0 -20 2102668  62436  21796 S 101,8  0,3   0:44.14 vdsm
  373 root      20   0  228164 119976 115472 S  32,6  0,5 481:22.23 systemd-journal
31680 root      20   0  198988  14168   7248 S   1,0  0,1   0:13.02 iotop
    7 root      20   0       0      0      0 S   0,3  0,0  10:54.30 rcu_sched
    9 root      20   0       0      0      0 S   0,3  0,0   4:24.23 rcuos/1
 4678 root      20   0       0      0      0 S   0,3  0,0   0:00.17 kworker/0:0
    1 root      20   0   50036   8192   5440 S   0,0  0,0   1:16.15 systemd



Expected results:

stable operation of vdsm

Comment 1 Piotr Kliczewski 2015-06-10 08:18:09 UTC
Please provide engine and vdsm logs.

Comment 2 Markus Stockhausen 2015-06-10 12:12:37 UTC
Log attached; the error starts at 14:07:48,102.

The engine log is not needed; the node is in maintenance mode.

Comment 3 Markus Stockhausen 2015-06-10 12:13:06 UTC
Created attachment 1037253 [details]
vdsm log

Comment 4 Piotr Kliczewski 2015-06-10 12:24:58 UTC
I would like to understand what happened with the engine. Please provide engine log as well.

Comment 5 Markus Stockhausen 2015-06-10 13:30:45 UTC
Sorry for not clarifying the situation completely.

We noticed that one of the nodes had high CPU usage after a maintenance. This was discovered a day later, and we started to analyze the situation on the node. We could reconstruct the initial situation only from brainstorming, so it might not be 100% correct. Afterwards it took us about two weeks to identify a simple reproducer; see the script vdx.sh.

Looking back at our logs, I'm sorry to say I'm no longer able to provide the logs from the time of the first occurrence. I hope you have enough data to test it yourself.

Comment 6 Oved Ourfali 2015-06-11 07:16:47 UTC
Piotr - did you try to reproduce?

Comment 7 Piotr Kliczewski 2015-06-11 07:29:37 UTC
Working on it now

Comment 8 Piotr Kliczewski 2015-06-17 07:41:35 UTC
*** Bug 1217896 has been marked as a duplicate of this bug. ***

Comment 9 Petr Kubica 2015-07-16 11:45:08 UTC
Verified in vdsm-4.16.22-1.el7ev.x86_64

Comment 10 Sandro Bonazzola 2015-09-03 13:54:50 UTC
This is an automated message.
oVirt 3.5.4 was released on September 3rd, 2015, and should include the fix for this BZ. Moving to CLOSED CURRENTRELEASE.

