Bug 966133 - [vdsm] When running vdsm on a vm as fake host, vdsm initiates many processes of supervdsm in each run
Summary: [vdsm] When running vdsm on a vm as fake host, vdsm initiates many processes of supervdsm in each run
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat Enterprise Virtualization Manager
Classification: Red Hat
Component: vdsm
Version: 3.2.0
Hardware: x86_64
OS: Linux
Priority: unspecified
Severity: urgent
Target Milestone: ---
Target Release: 3.3.0
Assignee: Yaniv Bronhaim
QA Contact: yeylon@redhat.com
URL:
Whiteboard: infra
Depends On:
Blocks:
 
Reported: 2013-05-22 14:36 UTC by vvyazmin@redhat.com
Modified: 2016-04-18 06:57 UTC
CC List: 13 users

Fixed In Version: is2
Doc Type: Bug Fix
Doc Text:
Connecting to Super VDSM via the getDeviceList command failed. Each call to supervdsm checks the ctime of the process to verify that the supervdsm pid is still the original pid recorded when VDSM launched supervdsm; if it is not, the pid has been taken by another process and supervdsm must be restarted. When running on fake KVM, the ctime value changes while the process is running, so whenever the ctime differed from the value VDSM had saved, VDSM launched a new instance of supervdsm. This has been fixed by using field 21 (starttime) of /proc/<pid>/stat, the time the process started after system boot, to identify the process.
Clone Of:
Environment:
Last Closed: 2014-01-21 16:07:57 UTC
oVirt Team: Infra
Target Upstream Version:
Embargoed:


Attachments (Terms of Use)
## Logs rhevm, vdsm (1.39 MB, application/x-gzip)
2013-05-22 14:36 UTC, vvyazmin@redhat.com
no flags


Links
System ID Private Priority Status Summary Last Updated
Red Hat Product Errata RHBA-2014:0040 0 normal SHIPPED_LIVE vdsm bug fix and enhancement update 2014-01-21 20:26:21 UTC
oVirt gerrit 14998 0 None None None Never

Description vvyazmin@redhat.com 2013-05-22 14:36:33 UTC
Created attachment 751740 [details]
## Logs rhevm, vdsm

Description of problem: 
Failed to connect to Super Vdsm via the getDeviceList command

Version-Release number of selected component (if applicable):
RHEVM 3.2 - SF17 environment: 

RHEVM: rhevm-3.2.0-10.26.rc.el6ev.noarch
VDSM: vdsm-4.10.2-18.0.el6ev.x86_64 
LIBVIRT: libvirt-0.10.2-18.el6_4.4.x86_64 
QEMU & KVM: qemu-kvm-rhev-0.12.1.2-2.355.el6_4.3.x86_64 
SANLOCK: sanlock-2.6-2.el6.x86_64

How reproducible:
100%

Steps to Reproduce:
1. Create an iSCSI DC with 150 hosts (in my case, 150 fake hosts) connected to one SD
2. Try to extend the SD, or try to create a new SD in the same DC
  
Actual results:
1. Failed to run the GetDeviceListVDS command
2. Failed to add or edit the SD

Expected results:
Extending the SD or adding a new SD succeeds

Additional info:

/var/log/ovirt-engine/engine.log
2013-05-22 15:08:18,279 ERROR [org.ovirt.engine.core.vdsbroker.vdsbroker.BrokerCommandBase] (ajp-/127.0.0.1:8702-4) Failed in GetDeviceListVDS method
2013-05-22 15:08:18,279 ERROR [org.ovirt.engine.core.vdsbroker.vdsbroker.BrokerCommandBase] (ajp-/127.0.0.1:8702-4) Error code BlockDeviceActionError and error message VDSGenericException: VDSErrorException: Failed to GetDeviceListVDS, error = Error block device action: ()
2013-05-22 15:08:18,279 INFO  [org.ovirt.engine.core.vdsbroker.vdsbroker.BrokerCommandBase] (ajp-/127.0.0.1:8702-4) Command org.ovirt.engine.core.vdsbroker.vdsbroker.GetDeviceListVDSCommand return value 
 
LUNListReturnForXmlRpc [mStatus=StatusForXmlRpc [mCode=600, mMessage=Error block device action: ()]]

2013-05-22 15:08:18,279 INFO  [org.ovirt.engine.core.vdsbroker.vdsbroker.BrokerCommandBase] (ajp-/127.0.0.1:8702-4) HostName = Fake_Host_046
2013-05-22 15:08:18,280 ERROR [org.ovirt.engine.core.vdsbroker.VDSCommandBase] (ajp-/127.0.0.1:8702-4) Command GetDeviceListVDS execution failed. Exception: VDSErrorException: VDSGenericException: VDSErrorException: Failed to GetDeviceListVDS, error = Error block device action: ()
2013-05-22 15:08:18,280 INFO  [org.ovirt.engine.core.vdsbroker.vdsbroker.GetDeviceListVDSCommand] (ajp-/127.0.0.1:8702-4) FINISH, GetDeviceListVDSCommand, log id: 6cde584b
2013-05-22 15:08:18,280 ERROR [org.ovirt.engine.core.bll.GetLunsByVgIdQuery] (ajp-/127.0.0.1:8702-4) Query GetLunsByVgIdQuery failed. Exception message is VdcBLLException: org.ovirt.engine.core.vdsbroker.vdsbroker.VDSErrorException: VDSGenericException: VDSErrorException: Failed to GetDeviceListVDS, error = Error block device action: ()

/var/log/vdsm/vdsm.log

Thread-1145::DEBUG::2013-05-22 16:45:04,488::supervdsm::117::SuperVdsmProxy::(_start) Launching Super Vdsm
Thread-1145::DEBUG::2013-05-22 16:45:04,490::misc::83::Storage.Misc.excCmd::(<lambda>) '/usr/bin/sudo -n /usr/bin/python /usr/share/vdsm/supervdsmServer.py 89ee77db-771a-4e4c-9352-539eb7d3e328 8543 /var/run/vdsm/svdsm.pid /var/run/vdsm/svdsm.time /var/run/vdsm/svdsm.sock 36' (cwd None)
Thread-1145::DEBUG::2013-05-22 16:45:06,536::supervdsm::179::SuperVdsmProxy::(_connect) Trying to connect to Super Vdsm
Thread-1145::ERROR::2013-05-22 16:45:06,729::task::850::TaskManager.Task::(_setError) Task=`553ee6ca-ba54-444d-a6c5-6e0a302c7af6`::Unexpected error
Traceback (most recent call last):
  File "/usr/share/vdsm/storage/task.py", line 857, in _run
    return fn(*args, **kargs)
  File "/usr/share/vdsm/logUtils.py", line 41, in wrapper
    res = f(*args, **kwargs)
  File "/usr/share/vdsm/storage/hsm.py", line 1806, in getDeviceList
    devices = self._getDeviceList(storageType)
  File "/usr/share/vdsm/storage/hsm.py", line 1838, in _getDeviceList
    for dev in multipath.pathListIter(guids):
  File "/usr/share/vdsm/storage/multipath.py", line 226, in pathListIter
    pathStatuses = devicemapper.getPathsStatus()
  File "/usr/share/vdsm/storage/devicemapper.py", line 197, in getPathsStatus
    return getProxy().getPathsStatus()
  File "/usr/share/vdsm/supervdsm.py", line 76, in __call__
    return callMethod()
  File "/usr/share/vdsm/supervdsm.py", line 67, in <lambda>
    **kwargs)
  File "<string>", line 2, in getPathsStatus
  File "/usr/lib64/python2.6/multiprocessing/managers.py", line 740, in _callmethod
    raise convert_to_error(kind, result)
RemoteError: 
---------------------------------------------------------------------------
Traceback (most recent call last):
  File "/usr/lib64/python2.6/multiprocessing/managers.py", line 216, in serve_client
    obj, exposed, gettypeid = id_to_obj[ident]
KeyError: '255ea90'
---------------------------------------------------------------------------
Thread-1145::DEBUG::2013-05-22 16:45:06,794::task::869::TaskManager.Task::(_run) Task=`553ee6ca-ba54-444d-a6c5-6e0a302c7af6`::Task._run: 553ee6ca-ba54-444d-a6c5-6e0a302c7af6 (None,) {} failed - stopping task
Thread-1145::DEBUG::2013-05-22 16:45:06,796::task::1194::TaskManager.Task::(stop) Task=`553ee6ca-ba54-444d-a6c5-6e0a302c7af6`::stopping in state preparing (force False)
Thread-1145::DEBUG::2013-05-22 16:45:06,802::task::974::TaskManager.Task::(_decref) Task=`553ee6ca-ba54-444d-a6c5-6e0a302c7af6`::ref 1 aborting True
Thread-1145::INFO::2013-05-22 16:45:06,807::task::1151::TaskManager.Task::(prepare) Task=`553ee6ca-ba54-444d-a6c5-6e0a302c7af6`::aborting: Task is aborted: u'\n---------------------------------------------------------------------------\nTraceback (most recent call last):\n  File "/usr/lib64/python2.6/multiprocessing/managers.py", line 216, in serve_client\n    obj, exposed, gettypeid = id_to_obj[ident]\nKeyError: \'255ea90\'\n---------------------------------------------------------------------------' - code 100
Thread-1145::DEBUG::2013-05-22 16:45:06,809::task::1156::TaskManager.Task::(prepare) Task=`553ee6ca-ba54-444d-a6c5-6e0a302c7af6`::Prepare: aborted: 
---------------------------------------------------------------------------
Traceback (most recent call last):
  File "/usr/lib64/python2.6/multiprocessing/managers.py", line 216, in serve_client
    obj, exposed, gettypeid = id_to_obj[ident]
KeyError: '255ea90'
---------------------------------------------------------------------------
Thread-1145::DEBUG::2013-05-22 16:45:06,811::task::974::TaskManager.Task::(_decref) Task=`553ee6ca-ba54-444d-a6c5-6e0a302c7af6`::ref 0 aborting True

Comment 2 Yaniv Bronhaim 2013-05-23 10:11:16 UTC
To see the consequences, just run "ps aux | grep supervdsm" after vdsm starts and note that more than one instance of supervdsm is running. Each new call to supervdsm leads to a new process.

This happens because currently (this will change in 3.3, when moving to 2 separate services for supervdsm) each call to supervdsm checks the ctime of the process (http://en.wikipedia.org/wiki/Stat_(system_call)#ctime). The check verifies that the supervdsm pid is still the same original pid that vdsm recorded when it executed supervdsm; otherwise the pid was taken by another process and we need to restart supervdsm.

Apparently, when running over fake kvm the ctime value changes during the process run. When the ctime is different from the original that vdsm saved, vdsm relaunches a new instance of supervdsm.
To avoid using ctime, the attached patch uses the starttime value of /proc/<pid>/stat (field 21 counting from zero; see http://linux.die.net/man/5/proc), which is the time the process started after system boot.
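
For illustration, here is a minimal Python sketch of the starttime-based check (hypothetical helper names, not the actual patch):

    def getProcessStartTime(pid):
        """Return field 21 (0-indexed; 'starttime' in proc(5)) of
        /proc/<pid>/stat: when the process started, in clock ticks
        since boot. Unlike ctime, this never changes while the
        process is running."""
        with open('/proc/%d/stat' % pid) as f:
            data = f.read()
        # comm (field 1) may contain spaces, so split after the last ')'
        rest = data.rsplit(')', 1)[1].split()
        # rest[0] is the state field, so starttime lands at index 19
        return rest[19]

    def isSameProcess(pid, savedStartTime):
        """True only while pid still refers to the original supervdsm
        process; a reused pid gets a different starttime."""
        try:
            return getProcessStartTime(pid) == savedStartTime
        except IOError:  # the process no longer exists
            return False

vdsm would record the start time once when it launches supervdsm (presumably in /var/run/vdsm/svdsm.time, which the launch command in the vdsm.log above already passes) and compare it on every subsequent call instead of the ctime.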

Comment 6 vvyazmin@redhat.com 2013-08-20 11:46:34 UTC
Tested on iSCSI Data Center with 150 hosts

Verified, tested on RHEVM 3.3 - IS10 environment:

RHEVM:  rhevm-3.3.0-0.15.master.el6ev.noarch
PythonSDK:  rhevm-sdk-python-3.3.0.10-1.el6ev.noarch
VDSM:  vdsm-4.12.0-61.git8178ec2.el6ev.x86_64
LIBVIRT:  libvirt-0.10.2-18.el6_4.9.x86_64
QEMU & KVM:  qemu-kvm-rhev-0.12.1.2-2.355.el6_4.5.x86_64
SANLOCK:  sanlock-2.8-1.el6.x86_64

Comment 9 Charlie 2013-11-28 00:35:17 UTC
This bug is currently attached to errata RHBA-2013:15291. If this change is not to be documented in the text for this errata, please either remove it from the errata, set the requires_doc_text flag to minus (-), or leave a "Doc Text" value of "--no tech note required" if you do not have permission to alter the flag.

Otherwise, to aid in the development of relevant and accurate release documentation, please fill out the "Doc Text" field above with these four (4) pieces of information:

* Cause: What actions or circumstances cause this bug to present.
* Consequence: What happens when the bug presents.
* Fix: What was done to fix the bug.
* Result: What now happens when the actions or circumstances above occur. (NB: this is not the same as 'the bug doesn't present anymore')

Once filled out, please set the "Doc Type" field to the appropriate value for the type of change made and submit your edits to the bug.

For further details on the Cause, Consequence, Fix, Result format please refer to:

https://bugzilla.redhat.com/page.cgi?id=fields.html#cf_release_notes 

Thanks in advance.

Comment 11 errata-xmlrpc 2014-01-21 16:07:57 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

http://rhn.redhat.com/errata/RHBA-2014-0040.html

