Bug 1229859

Summary: vdsm hangup 100% CPU
Product: [Retired] oVirt
Reporter: Markus Stockhausen <mst>
Component: vdsm
Assignee: Piotr Kliczewski <pkliczew>
Status: CLOSED CURRENTRELEASE
QA Contact: Petr Kubica <pkubica>
Severity: high
Docs Contact:
Priority: unspecified
Version: 3.5
CC: bazulay, bugs, ecohen, gklein, lsurette, mgoldboi, mst, oourfali, pkliczew, pstehlik, rbalakri, usurse, yeylon
Target Milestone: ---
Target Release: 3.5.4
Hardware: Unspecified
OS: Unspecified
Whiteboard: infra
Fixed In Version: vdsm-4.16.21-1.el6ev
Doc Type: Bug Fix
Doc Text:
Story Points: ---
Clone Of:
Environment:
Last Closed: 2015-09-03 13:54:50 UTC
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: Infra
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Embargoed:
Bug Depends On:
Bug Blocks: 1215236
Attachments:
  vdsm log (flags: none)

Description Markus Stockhausen 2015-06-09 19:21:56 UTC
Description of problem:

After a network cable was exchanged on an oVirt node, VDSM consumed 100% CPU. The process did not recover even though network connectivity was fine again. It seems that vdsm can run in an endless error loop if communication from the client breaks (during the handshake?).
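
For illustration only (this is not vdsm's actual code), the sketch below shows the generic failure pattern such a loop suggests: a poll-based acceptor whose read handler keeps raising for a dead connection, while the file descriptor is never unregistered, so poll() returns immediately on every pass. The names serve_forever and handle_read merely mirror the traceback further down.

# Simplified sketch (not vdsm code): if a handler exception is only logged
# and the socket stays registered, a peer that went away keeps the fd
# readable, epoll.poll() returns immediately on every iteration, and the
# loop burns a full core while logging the same error over and over.
import select

def serve_forever(listener, handle_read):
    epoll = select.epoll()
    epoll.register(listener.fileno(), select.EPOLLIN)
    connections = {}
    while True:
        for fd, _events in epoll.poll():
            if fd == listener.fileno():
                conn, _addr = listener.accept()
                conn.setblocking(False)
                connections[conn.fileno()] = conn
                epoll.register(conn.fileno(), select.EPOLLIN)
                continue
            try:
                handle_read(connections[fd])  # may raise SSLError: unexpected eof
            except Exception:
                # Busy-loop pattern: log and continue; the fd stays registered.
                # A robust variant drops the broken connection instead:
                #   epoll.unregister(fd)
                #   connections.pop(fd).close()
                pass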

Version-Release number of selected component (if applicable):

vdsm 4.16.10


How reproducible:

100%


Steps to Reproduce:

The following simplified setup should produce the error after a few seconds/minutes.

1. Create the vdsm client script vdx.sh on the oVirt node.

#!/bin/sh
# call getVdsCapabilities over the secure channel in an endless loop,
# printing a dot per completed request
while true; do
  vdsClient -s 0 getVdsCapabilities > /dev/null
  echo -n .
done

2. Execute vdx.sh.

3. While the script is producing output, hit CTRL-C at random times, so that the current client request is aborted but the script keeps running (a scripted alternative to the manual abort is sketched below).
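
For anyone who wants to avoid the manual CTRL-C, an untested alternative is sketched here: it opens plain TCP connections to vdsm's SSL listener and closes them before any handshake completes. Two assumptions not in the original report: vdsm listens on the default port 54321 on the local host, and a bare connect-and-close exercises the same error path; the confirmed reproducer remains the procedure above.

# Hypothetical helper, not part of the original report: repeatedly open a
# TCP connection to vdsm's SSL port and drop it before the TLS handshake,
# mimicking a vdsClient call aborted with CTRL-C.
import socket
import time

for _ in range(100):
    conn = socket.create_connection(("localhost", 54321), timeout=5)
    conn.close()        # abort before the SSL handshake finishes
    time.sleep(0.05)    # brief pause between aborted connections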

Actual results:

After 40-50 aborts, VDSM CPU usage rises to 100%. The vdsm log fills up with the following error messages:

Detector thread::ERROR::2015-06-09 21:14:01,515::protocoldetector::90::vds.MultiProtocolAcceptor::(serve_forever) Unhandled exception
Traceback (most recent call last):
  File "/usr/share/vdsm/protocoldetector.py", line 86, in serve_forever
    self._process_events()
  File "/usr/share/vdsm/protocoldetector.py", line 105, in _process_events
    self._handle_connection_read(fd)
  File "/usr/share/vdsm/protocoldetector.py", line 225, in _handle_connection_read
    data = client_socket.recv(self._required_size, socket.MSG_PEEK)
  File "/usr/lib/python2.7/site-packages/vdsm/sslutils.py", line 58, in read
    self._data = self.connection.read(size)
  File "/usr/lib64/python2.7/site-packages/M2Crypto/SSL/Connection.py", line 229, in read
    return self._read_nbio(size)
  File "/usr/lib64/python2.7/site-packages/M2Crypto/SSL/Connection.py", line 218, in _read_nbio
    return m2.ssl_read_nbio(self.ssl, size)
SSLError: unexpected eof
Detector thread::ERROR::2015-06-09 21:14:01,516::protocoldetector::90::vds.MultiProtocolAcceptor::(serve_forever) Unhandled exception
Traceback (most recent call last):
  File "/usr/share/vdsm/protocoldetector.py", line 86, in serve_forever
    self._process_events()
  File "/usr/share/vdsm/protocoldetector.py", line 105, in _process_events
    self._handle_connection_read(fd)
  File "/usr/share/vdsm/protocoldetector.py", line 225, in _handle_connection_read
    data = client_socket.recv(self._required_size, socket.MSG_PEEK)
  File "/usr/lib/python2.7/site-packages/vdsm/sslutils.py", line 58, in read
    self._data = self.connection.read(size)
  File "/usr/lib64/python2.7/site-packages/M2Crypto/SSL/Connection.py", line 229, in read
    return self._read_nbio(size)
  File "/usr/lib64/python2.7/site-packages/M2Crypto/SSL/Connection.py", line 218, in _read_nbio
    return m2.ssl_read_nbio(self.ssl, size)
SSLError: unexpected eof
...

top

KiB Mem:  24682112 total,  5454764 used, 19227348 free,    88936 buffers
KiB Swap: 10235900 total,        0 used, 10235900 free,  4402944 cached

  PID USER      PR  NI    VIRT    RES    SHR S  %CPU %MEM     TIME+ COMMAND
 5821 vdsm       0 -20 2102668  62436  21796 S 101,8  0,3   0:44.14 vdsm
  373 root      20   0  228164 119976 115472 S  32,6  0,5 481:22.23 systemd-journal
31680 root      20   0  198988  14168   7248 S   1,0  0,1   0:13.02 iotop
    7 root      20   0       0      0      0 S   0,3  0,0  10:54.30 rcu_sched
    9 root      20   0       0      0      0 S   0,3  0,0   4:24.23 rcuos/1
 4678 root      20   0       0      0      0 S   0,3  0,0   0:00.17 kworker/0:0
    1 root      20   0   50036   8192   5440 S   0,0  0,0   1:16.15 systemd



Expected results:

stable operation of vdsm

Comment 1 Piotr Kliczewski 2015-06-10 08:18:09 UTC
Please provide engine and vdsm logs.

Comment 2 Markus Stockhausen 2015-06-10 12:12:37 UTC
Log attached; the error starts at 14:07:48,102.

The engine log is not needed; the node is in maintenance mode.

Comment 3 Markus Stockhausen 2015-06-10 12:13:06 UTC
Created attachment 1037253 [details]
vdsm log

Comment 4 Piotr Kliczewski 2015-06-10 12:24:58 UTC
I would like to understand what happened with the engine. Please provide engine log as well.

Comment 5 Markus Stockhausen 2015-06-10 13:30:45 UTC
Sorry for not clarifying the situation completely.

We noticed that one of the nodes had high CPU usage after a maintenance window. This was discovered a day later, and we started to analyze the situation on the node. We could only reconstruct the initial circumstances from recollection, so they might not be 100% correct. Afterwards it took us about two weeks to identify a simple reproducer; see the script vdx.sh.

Looking back at our logs, I'm sorry to say I can no longer provide logs from the time of the first occurrence. I hope you have enough data to test it yourself.

Comment 6 Oved Ourfali 2015-06-11 07:16:47 UTC
Piotr - did you try to reproduce?

Comment 7 Piotr Kliczewski 2015-06-11 07:29:37 UTC
Working on it now

Comment 8 Piotr Kliczewski 2015-06-17 07:41:35 UTC
*** Bug 1217896 has been marked as a duplicate of this bug. ***

Comment 9 Petr Kubica 2015-07-16 11:45:08 UTC
Verified in vdsm-4.16.22-1.el7ev.x86_64

Comment 10 Sandro Bonazzola 2015-09-03 13:54:50 UTC
This is an automated message.
oVirt 3.5.4 was released on September 3rd 2015 and should include the fix for this BZ. Moving to closed current release.