Bug 991091

Summary: [vdsm] after a crash of libvirt, vdsm is stuck in recovery phase and unable to initialize
Product: Red Hat Enterprise Virtualization Manager
Reporter: Elad <ebenahar>
Component: vdsm
Assignee: Yaniv Bronhaim <ybronhei>
Status: CLOSED ERRATA
QA Contact: Barak Dagan <bdagan>
Severity: urgent
Docs Contact:
Priority: unspecified
Version: 3.3.0
CC: abaron, aberezin, acathrow, bazulay, danken, iheim, lpeer, michal.skrivanek, pstehlik, ybronhei, yeylon
Target Milestone: ---
Keywords: Regression
Target Release: 3.3.0
Hardware: x86_64
OS: Unspecified
Whiteboard: infra
Fixed In Version: is12
Doc Type: Bug Fix
Doc Text:
Previously, after a libvirt crash and an exception on the _recoverExistingVms thread, VDSM was stuck in the recovery phase and unable to initialize. Now, when this happens, VDSM attempts to restart until it succeeds.
Story Points: ---
Clone Of:
Environment:
Last Closed: 2014-01-21 16:30:01 UTC
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: Infra
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Embargoed:
Attachments:
logs (flags: none)

Description Elad 2013-08-01 15:27:11 UTC
Created attachment 781657 [details]
logs

Description of problem:
vdsm gets stuck in the recovery phase if it went down (with SIGTERM) after its socket to libvirt was closed.

Version-Release number of selected component (if applicable):
Red Hat Enterprise Linux Server release 6.4 (Santiago)
vdsm-4.12.0-rc3.12.git139ec2f.el6ev.x86_64
libvirt-0.10.2-18.el6_4.9.x86_64
qemu-kvm-rhev-0.12.1.2-2.355.el6_4.5.x86_64
sanlock-2.6-2.el6.x86_64


How reproducible:
100%

Steps to Reproduce:
1. Start with a host that has running VMs
2. Stop the libvirt daemon with: initctl stop libvirtd
3. vdsm receives signal 15 (SIGTERM), because it had running VMs, and goes down


Actual results:
libvirt crashes, and vdsm's sampling thread raises a libvirtError:

Traceback (most recent call last):
  File "/usr/share/vdsm/sampling.py", line 351, in collect
    statsFunction()
  File "/usr/share/vdsm/sampling.py", line 226, in __call__
    retValue = self._function(*args, **kwargs)
  File "/usr/share/vdsm/vm.py", line 553, in _sampleNet
    netSamples[nic.name] = self._vm._dom.interfaceStats(nic.name)
  File "/usr/share/vdsm/vm.py", line 824, in f
    ret = attr(*args, **kwargs)
  File "/usr/lib64/python2.6/site-packages/vdsm/libvirtconnection.py", line 76, in wrapper
    ret = f(*args, **kwargs)
  File "/usr/lib64/python2.6/site-packages/libvirt.py", line 1867, in interfaceStats
    if ret is None: raise libvirtError ('virDomainInterfaceStats() failed', dom=self)
libvirtError: internal error client socket is closed

vdsm receives SIGTERM and performs prepareForShutdown:

MainThread::DEBUG::2013-08-01 15:42:11,532::taskManager::80::TaskManager::(prepareForShutdown) Request to stop all tasks
MainThread::INFO::2013-08-01 15:42:11,532::taskManager::86::TaskManager::(prepareForShutdown) 2bb6040c-3903-420c-8a44-186e9992169b
MainThread::DEBUG::2013-08-01 15:42:11,532::vdsm::45::vds::(sigtermHandler) Received signal 15

vdsm comes up:

MainThread::INFO::2013-08-01 15:42:12,886::vdsm::101::vds::(run) (PID: 24225) I am the actual vdsm 4.12.0-rc3.12.git139ec2f.el6ev nott-vds2.qa.lab.tlv.redhat.com (2.6.32-358.14.1.el6.x86_64)


vdsm is stuck in the recovery phase:

Thread-33::DEBUG::2013-08-01 15:42:23,389::task::869::TaskManager.Task::(_run) Task=`40784773-e784-4d78-8b69-6b68344f27a5`::Task._run: 40784773-e784-4d78-8b69-6b68344f27a5 () {} failed - stopping task
Thread-33::DEBUG::2013-08-01 15:42:23,389::task::1194::TaskManager.Task::(stop) Task=`40784773-e784-4d78-8b69-6b68344f27a5`::stopping in state preparing (force False)
Thread-33::DEBUG::2013-08-01 15:42:23,389::task::974::TaskManager.Task::(_decref) Task=`40784773-e784-4d78-8b69-6b68344f27a5`::ref 1 aborting True
Thread-33::INFO::2013-08-01 15:42:23,389::task::1151::TaskManager.Task::(prepare) Task=`40784773-e784-4d78-8b69-6b68344f27a5`::aborting: Task is aborted: 'Not SPM' - code 654
Thread-33::DEBUG::2013-08-01 15:42:23,390::task::1156::TaskManager.Task::(prepare) Task=`40784773-e784-4d78-8b69-6b68344f27a5`::Prepare: aborted: Not SPM
Thread-33::DEBUG::2013-08-01 15:42:23,390::task::974::TaskManager.Task::(_decref) Task=`40784773-e784-4d78-8b69-6b68344f27a5`::ref 0 aborting True
Thread-33::DEBUG::2013-08-01 15:42:23,390::task::909::TaskManager.Task::(_doAbort) Task=`40784773-e784-4d78-8b69-6b68344f27a5`::Task._doAbort: force False
Thread-33::DEBUG::2013-08-01 15:42:23,390::resourceManager::976::ResourceManager.Owner::(cancelAll) Owner.cancelAll requests {}
Thread-33::DEBUG::2013-08-01 15:42:23,390::task::579::TaskManager.Task::(_updateState) Task=`40784773-e784-4d78-8b69-6b68344f27a5`::moving from state preparing -> state aborting
Thread-33::DEBUG::2013-08-01 15:42:23,390::task::534::TaskManager.Task::(__state_aborting) Task=`40784773-e784-4d78-8b69-6b68344f27a5`::_aborting: recover policy none
Thread-33::DEBUG::2013-08-01 15:42:23,390::task::579::TaskManager.Task::(_updateState) Task=`40784773-e784-4d78-8b69-6b68344f27a5`::moving from state aborting -> state failed
Thread-33::DEBUG::2013-08-01 15:42:23,391::resourceManager::939::ResourceManager.Owner::(releaseAll) Owner.releaseAll requests {} resources {}
Thread-33::DEBUG::2013-08-01 15:42:23,391::resourceManager::976::ResourceManager.Owner::(cancelAll) Owner.cancelAll requests {}
Thread-33::ERROR::2013-08-01 15:42:23,391::dispatcher::67::Storage.Dispatcher.Protect::(run) {'status': {'message': 'Not SPM: ()', 'code': 654}}


Thread-35::INFO::2013-08-01 15:42:24,563::logUtils::44::dispatcher::(wrapper) Run and protect: getAllTasksInfo(spUUID=None, options=None)
Thread-35::ERROR::2013-08-01 15:42:24,563::task::850::TaskManager.Task::(_setError) Task=`025c38b1-8790-4cef-84e2-464aac88ea00`::Unexpected error
Traceback (most recent call last):
  File "/usr/share/vdsm/storage/task.py", line 857, in _run
    return fn(*args, **kargs)
  File "/usr/share/vdsm/logUtils.py", line 45, in wrapper
    res = f(*args, **kwargs)
  File "/usr/share/vdsm/storage/hsm.py", line 2165, in getAllTasksInfo
    allTasksInfo = sp.getAllTasksInfo()
  File "/usr/share/vdsm/storage/securable.py", line 66, in wrapper
    raise SecureError()
SecureError

on the engine:

2013-08-01 18:05:02,669 INFO  [org.ovirt.engine.core.vdsbroker.vdsbroker.GetCapabilitiesVDSCommand] (DefaultQuartzScheduler_Worker-66) HostName = nott-vds2
2013-08-01 18:05:02,669 ERROR [org.ovirt.engine.core.vdsbroker.vdsbroker.GetCapabilitiesVDSCommand] (DefaultQuartzScheduler_Worker-66) Command GetCapabilitiesVDS execution failed. Exception: VDSRecoveringException: Recovering from crash or Initializing
2013-08-01 18:05:02,678 INFO  [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (DefaultQuartzScheduler_Worker-66) Correlation ID: null, Call Stack: org.ovirt.engine.core.vdsbroker.vdsbroker.VDSRecoveringException: Recovering from crash or Initializing


Expected results:
'Recovering from crash or Initializing' should end successfully 

Additional info:
logs

Comment 1 Elad 2013-08-01 15:28:43 UTC
Expected results:
'Recovering from crash or Initializing' should end successfully when libvirt comes up

Comment 4 Yaniv Bronhaim 2013-08-04 09:44:49 UTC
This happens only with SysV. With the systemd dependency, vdsm is restarted when libvirt stops, and vdsm starts libvirt when it starts up.
With SysV, the init script only takes care of starting the libvirt service when the vdsm service is restarted; if libvirt is stopped and vdsm gets SIGTERM due to a communication error, vdsm starts again without the libvirt service and gets stuck in initialization for good.
Is it a regression? I don't see any changes in this area; it looks like it has always worked that way.

To reproduce it, run vdsm, then "initctl stop libvirtd", then run the setupNetworks verb (for example: vdsClient -s 0 setupNetworks nic=eth0 bootproto=dhcp STP=no bridged=true).
Vdsm will restart and will be stuck in initialization, without any informative logs.
IMHO, vdsm should catch the connection error and signal itself with SIGTERM until communication with libvirt is established. This will lead to repeated restarts until someone starts the libvirt service manually. We do the same with the supervdsm service, and it should be equivalent here.
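
A rough sketch of that suggestion (illustrative only, not the actual vdsm code; libvirtconnection.get() is the helper referenced in comment 5, everything else here is a placeholder):

import os
import signal
import logging

import libvirt
from vdsm import libvirtconnection

def _connect_or_restart():
    # Try to reach libvirt; if the connection fails, log the error and send
    # ourselves SIGTERM so the init system restarts vdsm, repeating until
    # libvirt is reachable again.
    try:
        return libvirtconnection.get()
    except libvirt.libvirtError:
        logging.exception("Cannot connect to libvirt, restarting vdsm")
        os.kill(os.getpid(), signal.SIGTERM)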

Comment 5 Yaniv Bronhaim 2013-08-04 11:37:35 UTC
Small correction: all of the above refers to upstart, not SysV, on RHEL.

Additionally, the case here is less relevant, as the error is raised only if the libvirtd service is stopped manually, which cannot happen without manual intervention.

As part of the change http://gerrit.ovirt.org/#/c/14018/5/vdsm/clientIF.py we moved the part that calls libvirtconnection.get() out of the try scope, and currently we do not catch the exception that is raised if the connection fails. We are left without any hint of what went wrong during initialization, and vdsm is stuck until a manual restart. This should be fixed. The regression was introduced by that commit in 3.2.
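
To illustrate what "out of the try scope" means here (a sketch under assumptions; the function names are placeholders and this is not the real clientIF.py code):

import logging

import libvirt
from vdsm import libvirtconnection

def init_before_the_change():
    try:
        conn = libvirtconnection.get()   # a failure here was caught and logged
        # ... rest of the initialization ...
    except libvirt.libvirtError:
        logging.exception("libvirt connection failed during initialization")
        raise

def init_after_the_change():
    conn = libvirtconnection.get()       # a failure here now escapes unlogged,
    try:                                 # leaving vdsm stuck without any hint
        # ... rest of the initialization ...
        pass
    except libvirt.libvirtError:
        logging.exception("libvirt connection failed during initialization")
        raise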

Comment 6 Yaniv Bronhaim 2013-08-15 14:54:33 UTC
*** Bug 852008 has been marked as a duplicate of this bug. ***

Comment 7 Barak 2013-08-21 14:16:48 UTC
Just making sure it is tested correctly:

There is a difference between a libvirt crash and a libvirt stop.

With a manual libvirt stop, it is expected that the user who stopped libvirt will start it manually; otherwise they risk the host moving to non-operational.

With a libvirt crash there are no such expectations; however, this patch does not bring libvirt up after a crash, nor can we distinguish between a crash and a stop.

It is highly unconventional to start a service from within the code of another service (we do that in the vdsm init script, and that is unconventional as well).

While systemd takes care of the right dependency, SysV does not, so on RHEL, if the libvirt process crashes it will remain down and vdsm will remain in initialization forever.

Comment 8 Barak 2013-08-21 14:18:01 UTC
(In reply to Barak from comment #7)
> Just making sure it is tested correctly:
> 
> There is a difference between a libvirt crash and a libvirt stop.
> 
> With a manual libvirt stop, it is expected that the user who stopped
> libvirt will start it manually; otherwise they risk the host moving to
> non-operational.
> 
> With a libvirt crash there are no such expectations; however, this patch
> does not bring libvirt up after a crash, nor can we distinguish between a
> crash and a stop.
> 
> It is highly unconventional to start a service from within the code of
> another service (we do that in the vdsm init script, and that is
> unconventional as well).
> 
> While systemd takes care of the right dependency, SysV does not, so on
> RHEL, if the libvirt process crashes it will remain down and vdsm will
> remain in initialization forever.

Well, not forever: after some timeouts in the engine it will move to non-operational, which is expected.

Comment 9 Yaniv Bronhaim 2013-08-22 08:03:41 UTC
Before the attached patch, vdsm stayed in initialization forever; each request was answered with a "Recovering from initialization" message.
Now, if libvirtd is started or recovered manually, vdsm tries every 5 seconds to re-establish communication with it and returns to the operational stage once it succeeds.
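
One possible shape of that retry loop (an illustrative sketch only, not the actual patch; the 5-second interval is taken from this comment and the helper names follow the modules seen in the logs above):

import time
import logging

import libvirt
from vdsm import libvirtconnection

RETRY_INTERVAL = 5  # seconds, per the behaviour described above

def wait_for_libvirt():
    # Keep retrying the libvirt connection until it succeeds, so vdsm can
    # return to the operational stage once libvirtd is started or recovers.
    while True:
        try:
            return libvirtconnection.get()
        except libvirt.libvirtError:
            logging.warning("libvirt is unreachable, retrying in %d seconds",
                            RETRY_INTERVAL)
            time.sleep(RETRY_INTERVAL)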

Comment 10 Barak Dagan 2013-09-08 12:43:42 UTC
Verified on is13.

When libvirt is up, the vdsm connection is recovered and the host is back in the up (active) state.

Comment 11 Charlie 2013-11-28 00:27:43 UTC
This bug is currently attached to errata RHBA-2013:15291. If this change is not to be documented in the text for this errata, please either remove it from the errata, set the requires_doc_text flag to minus (-), or leave a "Doc Text" value of "--no tech note required" if you do not have permission to alter the flag.

Otherwise to aid in the development of relevant and accurate release documentation, please fill out the "Doc Text" field above with these four (4) pieces of information:

* Cause: What actions or circumstances cause this bug to present.
* Consequence: What happens when the bug presents.
* Fix: What was done to fix the bug.
* Result: What now happens when the actions or circumstances above occur. (NB: this is not the same as 'the bug doesn't present anymore')

Once filled out, please set the "Doc Type" field to the appropriate value for the type of change made and submit your edits to the bug.

For further details on the Cause, Consequence, Fix, Result format please refer to:

https://bugzilla.redhat.com/page.cgi?id=fields.html#cf_release_notes 

Thanks in advance.

Comment 12 errata-xmlrpc 2014-01-21 16:30:01 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

http://rhn.redhat.com/errata/RHBA-2014-0040.html