Bug 991091 - [vdsm] after a crash of libvirt, vdsm is stuck in recovery phase and unable to initialize
Status: CLOSED ERRATA
Product: Red Hat Enterprise Virtualization Manager
Classification: Red Hat
Component: vdsm
3.3.0
x86_64 Unspecified
unspecified Severity urgent
: ---
: 3.3.0
Assigned To: Yaniv Bronhaim
Barak Dagan
infra
: Regression
: 852008
Depends On:
Blocks:
 
Reported: 2013-08-01 11:27 EDT by Elad
Modified: 2016-02-10 14:30 EST (History)
11 users

See Also:
Fixed In Version: is12
Doc Type: Bug Fix
Doc Text:
Previously, VDSM became stuck in the recovery phase and was unable to initialize after a libvirt crash, with an exception on the _recoverExistingVms thread. Now, when this happens, VDSM attempts to restart until it succeeds.
Story Points: ---
Clone Of:
Environment:
Last Closed: 2014-01-21 11:30:01 EST
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: Infra
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---


Attachments (Terms of Use)
logs (2.96 MB, application/x-gzip)
2013-08-01 11:27 EDT, Elad


External Trackers
Tracker ID Priority Status Summary Last Updated
oVirt gerrit 17617 None None None Never

Description Elad 2013-08-01 11:27:11 EDT
Created attachment 781657 [details]
logs

Description of problem:
vdsm gets stuck in the recovery phase after it goes down (with SIGTERM) because its socket to libvirt was closed.

Version-Release number of selected component (if applicable):
Red Hat Enterprise Linux Server release 6.4 (Santiago)
vdsm-4.12.0-rc3.12.git139ec2f.el6ev.x86_64
libvirt-0.10.2-18.el6_4.9.x86_64
qemu-kvm-rhev-0.12.1.2-2.355.el6_4.5.x86_64
sanlock-2.6-2.el6.x86_64


How reproducible:
100%

Steps to Reproduce:
1. on a host with running VMs
2. stop libvirt daemon with: initctl stop libvirtd
3. vdsm will receive signal 15 (SIG-TERM) (because it had running VMs) and will go down


Actual results:
libvirt crashes, and vdsm's sampling thread fails with a libvirtError:

Traceback (most recent call last):
  File "/usr/share/vdsm/sampling.py", line 351, in collect
    statsFunction()
  File "/usr/share/vdsm/sampling.py", line 226, in __call__
    retValue = self._function(*args, **kwargs)
  File "/usr/share/vdsm/vm.py", line 553, in _sampleNet
    netSamples[nic.name] = self._vm._dom.interfaceStats(nic.name)
  File "/usr/share/vdsm/vm.py", line 824, in f
    ret = attr(*args, **kwargs)
  File "/usr/lib64/python2.6/site-packages/vdsm/libvirtconnection.py", line 76, in wrapper
    ret = f(*args, **kwargs)
  File "/usr/lib64/python2.6/site-packages/libvirt.py", line 1867, in interfaceStats
    if ret is None: raise libvirtError ('virDomainInterfaceStats() failed', dom=self)
libvirtError: internal error client socket is closed

vdsm receives SIGTERM and performs prepareForShutdown:

MainThread::DEBUG::2013-08-01 15:42:11,532::taskManager::80::TaskManager::(prepareForShutdown) Request to stop all tasks
MainThread::INFO::2013-08-01 15:42:11,532::taskManager::86::TaskManager::(prepareForShutdown) 2bb6040c-3903-420c-8a44-186e9992169b
MainThread::DEBUG::2013-08-01 15:42:11,532::vdsm::45::vds::(sigtermHandler) Received signal 15

vdsm comes up:

MainThread::INFO::2013-08-01 15:42:12,886::vdsm::101::vds::(run) (PID: 24225) I am the actual vdsm 4.12.0-rc3.12.git139ec2f.el6ev nott-vds2.qa.lab.tlv.redhat.com (2.6.32-358.14.1.el6.x86_64)


vdsm is stuck in the recovery phase:

Thread-33::DEBUG::2013-08-01 15:42:23,389::task::869::TaskManager.Task::(_run) Task=`40784773-e784-4d78-8b69-6b68344f27a5`::Task._run: 40784773-e784-4d78-8b69-6b68344f27a5 () {} failed - stopping task
Thread-33::DEBUG::2013-08-01 15:42:23,389::task::1194::TaskManager.Task::(stop) Task=`40784773-e784-4d78-8b69-6b68344f27a5`::stopping in state preparing (force False)
Thread-33::DEBUG::2013-08-01 15:42:23,389::task::974::TaskManager.Task::(_decref) Task=`40784773-e784-4d78-8b69-6b68344f27a5`::ref 1 aborting True
Thread-33::INFO::2013-08-01 15:42:23,389::task::1151::TaskManager.Task::(prepare) Task=`40784773-e784-4d78-8b69-6b68344f27a5`::aborting: Task is aborted: 'Not SPM' - code 654
Thread-33::DEBUG::2013-08-01 15:42:23,390::task::1156::TaskManager.Task::(prepare) Task=`40784773-e784-4d78-8b69-6b68344f27a5`::Prepare: aborted: Not SPM
Thread-33::DEBUG::2013-08-01 15:42:23,390::task::974::TaskManager.Task::(_decref) Task=`40784773-e784-4d78-8b69-6b68344f27a5`::ref 0 aborting True
Thread-33::DEBUG::2013-08-01 15:42:23,390::task::909::TaskManager.Task::(_doAbort) Task=`40784773-e784-4d78-8b69-6b68344f27a5`::Task._doAbort: force False
Thread-33::DEBUG::2013-08-01 15:42:23,390::resourceManager::976::ResourceManager.Owner::(cancelAll) Owner.cancelAll requests {}
Thread-33::DEBUG::2013-08-01 15:42:23,390::task::579::TaskManager.Task::(_updateState) Task=`40784773-e784-4d78-8b69-6b68344f27a5`::moving from state preparing -> state aborting
Thread-33::DEBUG::2013-08-01 15:42:23,390::task::534::TaskManager.Task::(__state_aborting) Task=`40784773-e784-4d78-8b69-6b68344f27a5`::_aborting: recover policy none
Thread-33::DEBUG::2013-08-01 15:42:23,390::task::579::TaskManager.Task::(_updateState) Task=`40784773-e784-4d78-8b69-6b68344f27a5`::moving from state aborting -> state failed
Thread-33::DEBUG::2013-08-01 15:42:23,391::resourceManager::939::ResourceManager.Owner::(releaseAll) Owner.releaseAll requests {} resources {}
Thread-33::DEBUG::2013-08-01 15:42:23,391::resourceManager::976::ResourceManager.Owner::(cancelAll) Owner.cancelAll requests {}
Thread-33::ERROR::2013-08-01 15:42:23,391::dispatcher::67::Storage.Dispatcher.Protect::(run) {'status': {'message': 'Not SPM: ()', 'code': 654}}


Thread-35::INFO::2013-08-01 15:42:24,563::logUtils::44::dispatcher::(wrapper) Run and protect: getAllTasksInfo(spUUID=None, options=None)
Thread-35::ERROR::2013-08-01 15:42:24,563::task::850::TaskManager.Task::(_setError) Task=`025c38b1-8790-4cef-84e2-464aac88ea00`::Unexpected error
Traceback (most recent call last):
  File "/usr/share/vdsm/storage/task.py", line 857, in _run
    return fn(*args, **kargs)
  File "/usr/share/vdsm/logUtils.py", line 45, in wrapper
    res = f(*args, **kwargs)
  File "/usr/share/vdsm/storage/hsm.py", line 2165, in getAllTasksInfo
    allTasksInfo = sp.getAllTasksInfo()
  File "/usr/share/vdsm/storage/securable.py", line 66, in wrapper
    raise SecureError()
SecureError

on engine:

2013-08-01 18:05:02,669 INFO  [org.ovirt.engine.core.vdsbroker.vdsbroker.GetCapabilitiesVDSCommand] (DefaultQuartzScheduler_Worker-66) HostName = nott-vds2
2013-08-01 18:05:02,669 ERROR [org.ovirt.engine.core.vdsbroker.vdsbroker.GetCapabilitiesVDSCommand] (DefaultQuartzScheduler_Worker-66) Command GetCapabilitiesVDS execution failed. Exception: VDSRecoveringException
: Recovering from crash or Initializing
2013-08-01 18:05:02,678 INFO  [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (DefaultQuartzScheduler_Worker-66) Correlation ID: null, Call Stack: org.ovirt.engine.core.vdsbroker.vdsbroker.V
DSRecoveringException: Recovering from crash or Initializing


Expected results:
'Recovering from crash or Initializing' should end successfully 

Additional info:
logs
Comment 1 Elad 2013-08-01 11:28:43 EDT
Expected results:
'Recovering from crash or Initializing' should end successfully when libvirt comes up
Comment 4 Yaniv Bronhaim 2013-08-04 05:44:49 EDT
This happens only with SysV. With the systemd dependency, vdsm is restarted when libvirt stops, and vdsm starts libvirt when starting up.
With SysV, the init script takes care of starting the libvirt service only when the vdsm service is restarted; if libvirt is stopped and vdsm gets SIGTERM due to a communication error, vdsm starts again without the libvirt service and gets stuck in initialization for good.
Is it a regression? I don't see any changes in this area; it looks like it has always worked that way.

To reproduce it, run vdsm, then "initctl stop libvirtd", then run the setupNetworks verb (for example: vdsClient -s 0 setupNetworks nic=eth0 bootproto=dhcp STP=no bridged=true).
Vdsm will restart and will be stuck in initialization, without any informative logs.
IMHO, vdsm should catch the connection error and signal itself with SIGTERM until communication is established. This leads to repeated restarts until someone starts the libvirt service manually. We do the same with the supervdsm service, and it should be equivalent here.
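The catch-and-terminate behavior proposed above could be sketched roughly as follows; `init_or_die` and its `connect`/`terminate` parameters are hypothetical illustrations, not vdsm's actual API:

```python
import os
import signal


def init_or_die(connect, terminate=None):
    """Hypothetical sketch: attempt the libvirt connection during
    startup; on failure, terminate instead of hanging in
    initialization, so the service manager's respawn retries the
    daemon until libvirtd is back."""
    if terminate is None:
        # Default: deliver SIGTERM to ourselves, as proposed above.
        terminate = lambda: os.kill(os.getpid(), signal.SIGTERM)
    try:
        return connect()
    except Exception:
        terminate()
        return None
```

With an Upstart/systemd respawn policy in place, each self-SIGTERM turns into another startup attempt, which is the "infinite restarts until someone starts libvirt" behavior described above.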
Comment 5 Yaniv Bronhaim 2013-08-04 07:37:35 EDT
Small correction: all of the above refers to Upstart, not SysV, on RHEL.

Additionally, the case here is less relevant, as the error is raised only if the libvirtd service is stopped manually, which cannot happen without manual intervention.

As part of the change http://gerrit.ovirt.org/#/c/14018/5/vdsm/clientIF.py we moved the part that calls libvirtconnection.get() out of the try scope, and currently we don't catch the exception raised if the connection fails. We're left without any hint of what went wrong during initialization, and vdsm is stuck until a manual restart. This should be fixed. The regression appeared after that commit, in 3.2.
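The fix direction implied here (keeping the connection attempt inside a try scope so a failure is at least logged) might look like this sketch; `create_client` and the `libvirt_get` parameter are stand-in names, not the actual clientIF code:

```python
import logging

log = logging.getLogger("clientIF")


def create_client(libvirt_get):
    """Sketch only: keep the libvirt connection attempt inside the
    try scope so a failure is logged rather than silently stalling
    initialization. `libvirt_get` stands in for
    libvirtconnection.get()."""
    try:
        return libvirt_get()
    except Exception:
        # Leave a hint in the log of what went wrong, then let the
        # caller decide whether to retry or shut down.
        log.exception("cannot connect to libvirt during initialization")
        raise
```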
Comment 6 Yaniv Bronhaim 2013-08-15 10:54:33 EDT
*** Bug 852008 has been marked as a duplicate of this bug. ***
Comment 7 Barak 2013-08-21 10:16:48 EDT
Just making sure it is tested correctly:

There is a difference between a libvirt crash and a libvirt stop.

On a manual libvirt stop, it is expected that the user who stopped libvirt will start it manually; otherwise they risk the host moving to non-operational.

On a libvirt crash there is no such expectation; however, this patch does not bring libvirt up after a crash, nor can we distinguish between a crash and a stop.

It is highly unconventional to start a service from within the code of another service (we do that in the vdsm init script, and that is unconventional as well).

While systemd takes care of the right dependency, SysV does not, so on RHEL, if the libvirt process crashes it will remain down and vdsm will remain in initialization forever.
Comment 8 Barak 2013-08-21 10:18:01 EDT
(In reply to Barak from comment #7)
> [...] if the libvirt process crashes it will remain down and vdsm will
> remain in initialization forever.

Well, not forever: after some timeouts in the engine it will move to non-operational, which is expected.
Comment 9 Yaniv Bronhaim 2013-08-22 04:03:41 EDT
Before the attached patch, vdsm stayed in initialization forever; each request was answered with a "Recovering from initialization" message.
Now, once libvirtd is started/recovered manually, vdsm tries every 5 seconds to re-establish communication with it, and returns to the operational state once it succeeds.
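The 5-second retry described here could be sketched as a simple loop; `wait_for_libvirt`, the injectable `sleep`, and the interval constant are illustrative assumptions, not vdsm's real implementation:

```python
import time

RETRY_INTERVAL = 5  # seconds between reconnection attempts (assumed)


def wait_for_libvirt(connect, interval=RETRY_INTERVAL, sleep=time.sleep):
    """Keep retrying the libvirt connection until it succeeds.

    `connect` stands in for libvirtconnection.get(); it raises on
    failure and returns a connection object on success."""
    while True:
        try:
            return connect()
        except Exception:
            # libvirtd is still down; wait and try again.
            sleep(interval)
```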
Comment 10 Barak Dagan 2013-09-08 08:43:42 EDT
Verified on is13.

When libvirt is up, vdsm's connection is recovered and the host is back in the up (active) state.
Comment 11 Charlie 2013-11-27 19:27:43 EST
This bug is currently attached to errata RHBA-2013:15291. If this change is not to be documented in the text for this errata please either remove it from the errata, set the requires_doc_text flag to 
minus (-), or leave a "Doc Text" value of "--no tech note required" if you do not have permission to alter the flag.

Otherwise to aid in the development of relevant and accurate release documentation, please fill out the "Doc Text" field above with these four (4) pieces of information:

* Cause: What actions or circumstances cause this bug to present.
* Consequence: What happens when the bug presents.
* Fix: What was done to fix the bug.
* Result: What now happens when the actions or circumstances above occur. (NB: this is not the same as 'the bug doesn't present anymore')

Once filled out, please set the "Doc Type" field to the appropriate value for the type of change made and submit your edits to the bug.

For further details on the Cause, Consequence, Fix, Result format please refer to:

https://bugzilla.redhat.com/page.cgi?id=fields.html#cf_release_notes 

Thanks in advance.
Comment 12 errata-xmlrpc 2014-01-21 11:30:01 EST
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

http://rhn.redhat.com/errata/RHBA-2014-0040.html
