Bug 991091
Summary: [vdsm] after a crash of libvirt, vdsm is stuck in recovery phase and unable to initialize

Product: Red Hat Enterprise Virtualization Manager
Component: vdsm
Version: 3.3.0
Target Milestone: ---
Target Release: 3.3.0
Hardware: x86_64
OS: Unspecified
Status: CLOSED ERRATA
Severity: urgent
Priority: unspecified
Keywords: Regression
Whiteboard: infra
Fixed In Version: is12
Reporter: Elad <ebenahar>
Assignee: Yaniv Bronhaim <ybronhei>
QA Contact: Barak Dagan <bdagan>
CC: abaron, aberezin, acathrow, bazulay, danken, iheim, lpeer, michal.skrivanek, pstehlik, ybronhei, yeylon
Doc Type: Bug Fix
Doc Text:
Previously, after a libvirt crash, VDSM hit an exception on the _recoverExistingVms thread and became stuck in the recovery phase, unable to initialize. Now, when this happens, VDSM attempts to restart until it succeeds.
Story Points: ---
Last Closed: 2014-01-21 16:30:01 UTC
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
Category: ---
oVirt Team: Infra
Cloudforms Team: ---
Expected results: 'Recovering from crash or Initializing' should end successfully when libvirt comes up.

This happens only in sysV. With the systemd dependency, vdsm is restarted when libvirt stops, and vdsm starts libvirt when starting up. With sysV, the init script takes care of starting the libvirt service only when the service is restarted; but if libvirt is stopped and vdsm gets SIGTERM due to a communication error, vdsm starts again without the libvirt service and gets stuck in initialization for good.

Is it a regression? I don't see any changes in this area; it looks like it has always worked that way.

To reproduce it, run vdsm, then "initctl stop libvirtd", then run the setupNetworks verb (for example: vdsClient -s 0 setupNetworks nic=eth0 bootproto=dhcp STP=no bridged=true). vdsm will restart and will be stuck in initialization, without any informative logs.

IMHO, vdsm should catch the connection error and signal itself with SIGTERM until communication is established. This will lead to repeated restarts until someone starts the libvirt service manually. We do the same with the supervdsm service, and it should be equivalent here.

Small correction: all of the above refers to upstart, not sysV, on RHEL. Additionally, the case here is less relevant, as the error is raised only if the libvirtd service is stopped manually, which cannot happen without manual intervention.

As part of the change http://gerrit.ovirt.org/#/c/14018/5/vdsm/clientIF.py we took the part that does libvirtconnection.get() out of the try scope, and currently we don't catch the exception that is raised if the connection fails. We're left without any hint of what went wrong during initialization, and vdsm is stuck until a manual restart. This should be fixed. The regression appeared after that commit in 3.2.

*** Bug 852008 has been marked as a duplicate of this bug.
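The proposal above (catch the connection error and signal itself with SIGTERM, as is already done for supervdsm) could be sketched roughly as below. This is an illustration only, not vdsm's actual code: `connect` and `terminate` are hypothetical stand-ins for `libvirtconnection.get()` and for sending SIGTERM to the own process.

```python
# Sketch only: catch a failed libvirt connection during startup, log the
# real cause instead of hanging in "Recovering from crash or Initializing",
# and terminate so the service manager (upstart/systemd) restarts vdsm
# until libvirtd is available again.
import logging


def initialize(connect, terminate):
    """Try to establish the libvirt connection once.

    connect:   callable returning a live connection, raising on failure
               (stand-in for libvirtconnection.get())
    terminate: callable that makes the process exit, e.g.
               os.kill(os.getpid(), signal.SIGTERM), leaving the restart
               to the init system
    """
    try:
        return connect()
    except Exception:
        # This is the exception that, per the comment above, currently
        # escapes the try scope and leaves vdsm stuck without any hint.
        logging.exception("could not connect to libvirtd")
        terminate()
        return None
```

In the real service, `terminate` would send SIGTERM to the process itself; the init system's restart policy then brings vdsm back up repeatedly until libvirtd is running.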
***

Just making sure it is tested correctly.

There is a difference between a libvirt crash and a libvirt stop: on a manual libvirt stop, it is expected that the user who stopped libvirt will start it manually; otherwise he is risking moving the host to non-operational. On a libvirt crash there are no such expectations; however, this patch does not bring libvirt up after a crash, nor can we distinguish between a crash and a stop.

It is highly unconventional to start a service from within the code of another service (we do that in the vdsm init script, and that is unconventional as well). While systemd takes care of the right dependency, systemV does not, so on RHEL, if the libvirt process crashes it will remain down and vdsm will remain in initialization forever.

(In reply to Barak from comment #7)
> Just making sure it is tested correctly,
>
> There is a difference between libvirt crash and libvirt stop,
>
> while on libvirt manual stop it is expected that the user that had stopped
> libvirt will start it manually - otherwise he is risking the host on moving
> to non-operational.
>
> On libvirt crash there are no such expectations, however this patch does not
> bring libvirt up after crash nor can we distinct between the crash and stop.
>
> It is highly unconventional to start a service from within the code of
> another service (we do that in vdsm init script, and this is unconventional
> as well).
>
> While systemd takes care of the right dependency systemV does not, so in
> RHEL if libvirt process crashes it will remain down and it vdsm will remain
> in initialization forever.

Well, not forever... after some timeouts in the engine it will move to non-operational, which is expected.

Before the attached patch, vdsm stayed in initialization forever; each request was answered with a "Recovering from initialization" message.
Now, if libvirtd is manually started/recovered, vdsm tries every 5 seconds to re-establish communication with it, and returns to the operational stage if it succeeds.

Verified on is13. When libvirt is up, the vdsm connection is recovered and the host is back in the Up (active) state.

This bug is currently attached to errata RHBA-2013:15291. If this change is not to be documented in the text for this errata, please either remove it from the errata, set the requires_doc_text flag to minus (-), or leave a "Doc Text" value of "--no tech note required" if you do not have permission to alter the flag. Otherwise, to aid in the development of relevant and accurate release documentation, please fill out the "Doc Text" field above with these four (4) pieces of information:

* Cause: What actions or circumstances cause this bug to present.
* Consequence: What happens when the bug presents.
* Fix: What was done to fix the bug.
* Result: What now happens when the actions or circumstances above occur. (NB: this is not the same as 'the bug doesn't present anymore')

Once filled out, please set the "Doc Type" field to the appropriate value for the type of change made and submit your edits to the bug. For further details on the Cause, Consequence, Fix, Result format, please refer to: https://bugzilla.redhat.com/page.cgi?id=fields.html#cf_release_notes

Thanks in advance.

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

http://rhn.redhat.com/errata/RHBA-2014-0040.html
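The verified behavior above (retry the libvirt connection every 5 seconds until it succeeds) amounts to a simple polling loop. The sketch below is an assumption based on that description, not vdsm's actual implementation; the function names and the injectable `sleep` parameter are illustrative.

```python
# Sketch only: poll for a libvirt connection on a fixed interval until
# it comes back, matching the "tries every 5 seconds" behavior verified
# in this bug. connect() is a hypothetical stand-in for the real
# connection call; sleep is injectable so the loop can be tested.
import time
import logging


def wait_for_libvirt(connect, interval=5.0, sleep=time.sleep):
    """Retry until a libvirt connection can be established.

    connect:  callable returning a live connection, raising while
              libvirtd is still down
    interval: seconds between attempts (5s per the comment above)
    """
    while True:
        try:
            return connect()
        except Exception:
            # Keep retrying; the host stays non-operational until
            # libvirtd is back, then vdsm recovers on its own.
            logging.warning("libvirtd not reachable, retrying in %ss",
                            interval)
            sleep(interval)
```

With this loop in place, once libvirtd is started manually the next attempt succeeds and the host can return to the Up state without restarting vdsm by hand.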
Created attachment 781657 [details]
logs

Description of problem:
vdsm is stuck in the recovery phase in case it went down (with SIGTERM) after its socket with libvirt closed.

Version-Release number of selected component (if applicable):
Red Hat Enterprise Linux Server release 6.4 (Santiago)
vdsm-4.12.0-rc3.12.git139ec2f.el6ev.x86_64
libvirt-0.10.2-18.el6_4.9.x86_64
qemu-kvm-rhev-0.12.1.2-2.355.el6_4.5.x86_64
sanlock-2.6-2.el6.x86_64

How reproducible:
100%

Steps to Reproduce:
1. On a host with running VMs
2. Stop the libvirt daemon with: initctl stop libvirtd
3. vdsm will receive signal 15 (SIGTERM) (because it had running VMs) and will go down

Actual results:

libvirt crashes:

Traceback (most recent call last):
  File "/usr/share/vdsm/sampling.py", line 351, in collect
    statsFunction()
  File "/usr/share/vdsm/sampling.py", line 226, in __call__
    retValue = self._function(*args, **kwargs)
  File "/usr/share/vdsm/vm.py", line 553, in _sampleNet
    netSamples[nic.name] = self._vm._dom.interfaceStats(nic.name)
  File "/usr/share/vdsm/vm.py", line 824, in f
    ret = attr(*args, **kwargs)
  File "/usr/lib64/python2.6/site-packages/vdsm/libvirtconnection.py", line 76, in wrapper
    ret = f(*args, **kwargs)
  File "/usr/lib64/python2.6/site-packages/libvirt.py", line 1867, in interfaceStats
    if ret is None: raise libvirtError ('virDomainInterfaceStats() failed', dom=self)
libvirtError: internal error client socket is closed

vdsm receives SIGTERM and performs prepareForShutdown:

MainThread::DEBUG::2013-08-01 15:42:11,532::taskManager::80::TaskManager::(prepareForShutdown) Request to stop all tasks
MainThread::INFO::2013-08-01 15:42:11,532::taskManager::86::TaskManager::(prepareForShutdown) 2bb6040c-3903-420c-8a44-186e9992169b
MainThread::DEBUG::2013-08-01 15:42:11,532::vdsm::45::vds::(sigtermHandler) Received signal 15

vdsm comes up:

MainThread::INFO::2013-08-01 15:42:12,886::vdsm::101::vds::(run) (PID: 24225) I am the actual vdsm 4.12.0-rc3.12.git139ec2f.el6ev nott-vds2.qa.lab.tlv.redhat.com (2.6.32-358.14.1.el6.x86_64)

vdsm is stuck in the recovery phase:

Thread-33::DEBUG::2013-08-01 15:42:23,389::task::869::TaskManager.Task::(_run) Task=`40784773-e784-4d78-8b69-6b68344f27a5`::Task._run: 40784773-e784-4d78-8b69-6b68344f27a5 () {} failed - stopping task
Thread-33::DEBUG::2013-08-01 15:42:23,389::task::1194::TaskManager.Task::(stop) Task=`40784773-e784-4d78-8b69-6b68344f27a5`::stopping in state preparing (force False)
Thread-33::DEBUG::2013-08-01 15:42:23,389::task::974::TaskManager.Task::(_decref) Task=`40784773-e784-4d78-8b69-6b68344f27a5`::ref 1 aborting True
Thread-33::INFO::2013-08-01 15:42:23,389::task::1151::TaskManager.Task::(prepare) Task=`40784773-e784-4d78-8b69-6b68344f27a5`::aborting: Task is aborted: 'Not SPM' - code 654
Thread-33::DEBUG::2013-08-01 15:42:23,390::task::1156::TaskManager.Task::(prepare) Task=`40784773-e784-4d78-8b69-6b68344f27a5`::Prepare: aborted: Not SPM
Thread-33::DEBUG::2013-08-01 15:42:23,390::task::974::TaskManager.Task::(_decref) Task=`40784773-e784-4d78-8b69-6b68344f27a5`::ref 0 aborting True
Thread-33::DEBUG::2013-08-01 15:42:23,390::task::909::TaskManager.Task::(_doAbort) Task=`40784773-e784-4d78-8b69-6b68344f27a5`::Task._doAbort: force False
Thread-33::DEBUG::2013-08-01 15:42:23,390::resourceManager::976::ResourceManager.Owner::(cancelAll) Owner.cancelAll requests {}
Thread-33::DEBUG::2013-08-01 15:42:23,390::task::579::TaskManager.Task::(_updateState) Task=`40784773-e784-4d78-8b69-6b68344f27a5`::moving from state preparing -> state aborting
Thread-33::DEBUG::2013-08-01 15:42:23,390::task::534::TaskManager.Task::(__state_aborting) Task=`40784773-e784-4d78-8b69-6b68344f27a5`::_aborting: recover policy none
Thread-33::DEBUG::2013-08-01 15:42:23,390::task::579::TaskManager.Task::(_updateState) Task=`40784773-e784-4d78-8b69-6b68344f27a5`::moving from state aborting -> state failed
Thread-33::DEBUG::2013-08-01 15:42:23,391::resourceManager::939::ResourceManager.Owner::(releaseAll) Owner.releaseAll requests {} resources {}
Thread-33::DEBUG::2013-08-01 15:42:23,391::resourceManager::976::ResourceManager.Owner::(cancelAll) Owner.cancelAll requests {}
Thread-33::ERROR::2013-08-01 15:42:23,391::dispatcher::67::Storage.Dispatcher.Protect::(run) {'status': {'message': 'Not SPM: ()', 'code': 654}}
Thread-35::INFO::2013-08-01 15:42:24,563::logUtils::44::dispatcher::(wrapper) Run and protect: getAllTasksInfo(spUUID=None, options=None)
Thread-35::ERROR::2013-08-01 15:42:24,563::task::850::TaskManager.Task::(_setError) Task=`025c38b1-8790-4cef-84e2-464aac88ea00`::Unexpected error
Traceback (most recent call last):
  File "/usr/share/vdsm/storage/task.py", line 857, in _run
    return fn(*args, **kargs)
  File "/usr/share/vdsm/logUtils.py", line 45, in wrapper
    res = f(*args, **kwargs)
  File "/usr/share/vdsm/storage/hsm.py", line 2165, in getAllTasksInfo
    allTasksInfo = sp.getAllTasksInfo()
  File "/usr/share/vdsm/storage/securable.py", line 66, in wrapper
    raise SecureError()
SecureError

On the engine:

2013-08-01 18:05:02,669 INFO [org.ovirt.engine.core.vdsbroker.vdsbroker.GetCapabilitiesVDSCommand] (DefaultQuartzScheduler_Worker-66) HostName = nott-vds2
2013-08-01 18:05:02,669 ERROR [org.ovirt.engine.core.vdsbroker.vdsbroker.GetCapabilitiesVDSCommand] (DefaultQuartzScheduler_Worker-66) Command GetCapabilitiesVDS execution failed. Exception: VDSRecoveringException : Recovering from crash or Initializing
2013-08-01 18:05:02,678 INFO [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (DefaultQuartzScheduler_Worker-66) Correlation ID: null, Call Stack: org.ovirt.engine.core.vdsbroker.vdsbroker.VDSRecoveringException: Recovering from crash or Initializing

Expected results:
'Recovering from crash or Initializing' should end successfully

Additional info:
logs