Bug 732914

Summary: [vdsm][libvirtconnection] vdsm does not recover when it fails to connect to libvirtd upon startup.
Product: Red Hat Enterprise Linux 6 Reporter: David Naori <dnaori>
Component: vdsm Assignee: Federico Simoncelli <fsimonce>
Status: CLOSED ERRATA QA Contact: David Naori <dnaori>
Severity: high Docs Contact:
Priority: unspecified    
Version: 6.1 CC: abaron, bazulay, dnaori, fsimonce, hateya, iheim, ilvovsky, mgoldboi, ykaul
Target Milestone: rc   
Target Release: ---   
Hardware: x86_64   
OS: Linux   
Whiteboard:
Fixed In Version: vdsm-4.9-97.el6 Doc Type: Bug Fix
Doc Text:
Story Points: ---
Clone Of: Environment:
Last Closed: 2011-12-06 07:25:32 UTC Type: ---
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: --- RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Embargoed:
Attachments:
Description Flags
vdsm log none

Description David Naori 2011-08-24 07:03:56 UTC
Created attachment 519562 [details]
vdsm log

Description of problem:
When vdsm fails to connect to libvirtd upon startup (because libvirtd is not running at that moment), it stays in "recovering from crash or initializing" forever and does not shut itself down.

clientIFinit::ERROR::2011-08-24 02:11:20,634::clientIF::938::vds::(_recoverExistingVms) Vm's recovery failed
Traceback (most recent call last):
  File "/usr/share/vdsm/clientIF.py", line 903, in _recoverExistingVms
    vdsmVms = self.getVDSMVms()
  File "/usr/share/vdsm/clientIF.py", line 964, in getVDSMVms
    conn = libvirtconnection.get(self)
  File "/usr/share/vdsm/libvirtconnection.py", line 106, in get
    conn = libvirt.openAuth('qemu:///system', auth, 0)
  File "/usr/lib64/python2.6/site-packages/libvirt.py", line 102, in openAuth
    if ret is None:raise libvirtError('virConnectOpenAuth() failed')
libvirtError: Failed to connect socket to '/var/run/libvirt/libvirt-sock': No such file or directory


Version-Release number of selected component (if applicable):
vdsm-4.9-95.el6.x86_64
libvirt-0.9.4-4.el6.x86_64

How reproducible:
100%

Steps to Reproduce:
(On a host with running vms)
1. `/etc/init.d/vdsmd restart && initctl stop libvirtd`
  

Actual results:


Expected results:


Additional info:

Comment 2 Dan Kenigsberg 2011-08-26 22:17:32 UTC
David, would the author of http://gerrit.usersys.redhat.com/699 be nice to add "libvirtError: Failed to connect socket" to the errors expected to kill vdsm?

Yeylon, why did you add the blocker flag? Is it urgent to address this for 3.0?

Comment 3 David Naori 2011-08-27 20:49:21 UTC
(In reply to comment #2)
> David, would the author of http://gerrit.usersys.redhat.com/699 be nice to add
> "libvirtError: Failed to connect socket" to the errors expected to kill vdsm?

I'm afraid that does not apply here; the failing call is not inside the try/except block:

    if not conn:
        conn = libvirt.openAuth('qemu:///system', auth, 0)

* I tried putting it in a try/except block and calling prepareForShutdown if it fails, but that does not do the job in this case.

Comment 4 Federico Simoncelli 2011-08-29 11:05:42 UTC
David, are you sure?
This works for me:

commit 067f769de4df00cf4015e82acd16c1319938a14f
Author: Federico Simoncelli <fsimonce>
Date:   Mon Aug 29 11:01:16 2011 +0000

    BZ#732914 VDSM must exit if libvirt is not running
    
    Change-Id: I673184b8e5d765a9397f3fc14a70f7c31b907b3e

http://gerrit.usersys.redhat.com/861

Comment 5 Federico Simoncelli 2011-08-31 17:07:12 UTC
The problem here is quite tricky. If vdsm was previously connected to a pool and we issue prepareForShutdown at startup, there are no stoppable threads running yet, but storageRefresh is still trying to reconnect to the pool:

 threading.Thread(target=storageRefresh).start()
 (hsm.py:192)

When it eventually succeeds, it starts new non-daemon threads, which are then never stopped.
I have already tried several solutions but have not found a definitive one.
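
For context on why those threads matter: the Python interpreter does not exit until every non-daemon thread has finished, so a worker started after the shutdown request keeps the process alive. The following is a minimal sketch of that behaviour, not vdsm code; `start_worker` and `blocking_threads` are hypothetical helper names introduced only for illustration.

```python
import threading


def start_worker(stop_event, daemon):
    """Start a thread that blocks until stop_event is set (hypothetical helper)."""
    worker = threading.Thread(target=stop_event.wait)
    worker.daemon = daemon
    worker.start()
    return worker


def blocking_threads():
    """Return live non-daemon threads other than the caller.

    The interpreter waits for each of these before the process can exit.
    """
    current = threading.current_thread()
    return [t for t in threading.enumerate()
            if t is not current and t.is_alive() and not t.daemon]
```

A worker created with `daemon=False` appears in `blocking_threads()` until its stop event is set; one created with `daemon=True` never does.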

Comment 6 Federico Simoncelli 2011-09-05 16:44:02 UTC
commit dc34ed11fe964fe2cdcc89e4df7f7f96cb639332
Author: Federico Simoncelli <fsimonce>
Date:   Mon Sep 5 16:25:53 2011 +0000

    BZ#732914 Check libvirt connection on startup
    
    Change-Id: I913acefd3d41bc34e831783687f287d92c7aa282

http://gerrit.usersys.redhat.com/896
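
The change itself is not reproduced in this report. As a rough sketch of the general idea behind a startup check, assuming the connection is opened in the main thread before any worker thread is started, a failure can simply propagate and end the process. `ConnectionFailed`, `serve` and `main` are hypothetical names; in vdsm the real error would be the `libvirtError` raised by `libvirt.openAuth`.

```python
import sys


class ConnectionFailed(Exception):
    """Stand-in for libvirt.libvirtError in this sketch (assumption)."""


def serve(open_connection, start_workers):
    """Open the connection first; start workers only if that succeeded."""
    conn = open_connection()  # may raise; no worker thread exists yet
    start_workers(conn)
    return conn


def main(open_connection, start_workers):
    """Return a process exit status: 0 on success, 1 if the connection failed."""
    try:
        serve(open_connection, start_workers)
    except ConnectionFailed as exc:
        sys.stderr.write("startup failed: %s\n" % exc)
        return 1
    return 0
```

Because nothing else is running when the open fails, there are no threads left to stop, and a supervising init system could then restart the daemon.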

Comment 8 David Naori 2011-09-13 09:11:23 UTC
Clearing the needinfo flag since this is already fixed.

Comment 9 David Naori 2011-09-19 15:56:02 UTC
MainThread::INFO::2011-09-19 21:56:01,703::vdsm::71::vds::(run) I am the actual vdsm 4.9-100
MainThread::ERROR::2011-09-19 21:56:01,896::vdsm::74::vds::(run) Traceback (most recent call last):
  File "/usr/share/vdsm//vdsm", line 72, in run
    serve_clients(log)
  File "/usr/share/vdsm//vdsm", line 40, in serve_clients
    cif = clientIF.clientIF(log)
  File "/usr/share/vdsm/clientIF.py", line 92, in __init__
    self._libvirt = libvirtconnection.get()
  File "/usr/share/vdsm/libvirtconnection.py", line 94, in get
    conn = libvirt.openAuth('qemu:///system', auth, 0)
  File "/usr/lib64/python2.6/site-packages/libvirt.py", line 102, in openAuth
    if ret is None:raise libvirtError('virConnectOpenAuth() failed')
libvirtError: Failed to connect socket to '/var/run/libvirt/libvirt-sock': No such file or directory

MainThread::INFO::2011-09-19 21:56:01,896::vdsm::76::vds::(run) VDSM main thread ended. Waiting for 1 other threads...
MainThread::INFO::2011-09-19 21:56:01,896::vdsm::79::vds::(run) <_MainThread(MainThread, started 140116261517056)>
MainThread::INFO::2011-09-19 21:56:01,896::vdsm::79::vds::(run) <Thread(libvirtEventLoop, started daemon 140116177499904)>
MainThread::INFO::2011-09-19 21:56:01,965::vdsm::71::vds::(run) I am the actual vdsm 4.9-100

Verified.

Comment 10 errata-xmlrpc 2011-12-06 07:25:32 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

http://rhn.redhat.com/errata/RHEA-2011-1782.html