Description of problem:

Built from hash fecb09b, vdsm fails to start and host deploy fails.

Running Transaction
  Installing : vdsm-python-4.13.0-201.gitfecb09b.el6.x86_64                1/6
  Installing : vdsm-xmlrpc-4.13.0-201.gitfecb09b.el6.noarch                2/6
  Installing : vdsm-cli-4.13.0-201.gitfecb09b.el6.noarch                   3/6
  Installing : vdsm-python-zombiereaper-4.13.0-201.gitfecb09b.el6.noarch   4/6
  Installing : vdsm-python-cpopen-4.13.0-0.10.beta1.el6ev.x86_64           5/6
  Installing : vdsm-4.13.0-201.gitfecb09b.el6.x86_64                       6/6
  Verifying  : vdsm-python-cpopen-4.13.0-0.10.beta1.el6ev.x86_64           1/6
  Verifying  : vdsm-cli-4.13.0-201.gitfecb09b.el6.noarch                   2/6
  Verifying  : vdsm-python-zombiereaper-4.13.0-201.gitfecb09b.el6.noarch   3/6
  Verifying  : vdsm-python-4.13.0-201.gitfecb09b.el6.x86_64                4/6
  Verifying  : vdsm-xmlrpc-4.13.0-201.gitfecb09b.el6.noarch                5/6
  Verifying  : vdsm-4.13.0-201.gitfecb09b.el6.x86_64                       6/6

Installed:
  vdsm.x86_64 0:4.13.0-201.gitfecb09b.el6

Dependency Installed:
  vdsm-cli.noarch 0:4.13.0-201.gitfecb09b.el6
  vdsm-python.x86_64 0:4.13.0-201.gitfecb09b.el6
  vdsm-python-cpopen.x86_64 0:4.13.0-0.10.beta1.el6ev
  vdsm-python-zombiereaper.noarch 0:4.13.0-201.gitfecb09b.el6
  vdsm-xmlrpc.noarch 0:4.13.0-201.gitfecb09b.el6

Complete!

[root@cinteg05 yum.repos.d]# service vdsmd start
vdsm: Running mkdirs
vdsm: Running configure_coredump
vdsm: Running run_init_hooks
vdsm: Running gencerts
vdsm: Running check_is_configured
libvirt is not configured for vdsm yet
sanlock service is already configured

Modules libvirt are not configured
Traceback (most recent call last):
  File "/usr/bin/vdsm-tool", line 145, in <module>
    sys.exit(main())
  File "/usr/bin/vdsm-tool", line 142, in main
    return tool_command[cmd]["command"](*args[1:])
  File "/usr/lib64/python2.6/site-packages/vdsm/tool/configurator.py", line 234, in isconfigured
    raise RuntimeError("Not configured. Try 'vdsm-tool configure'")
RuntimeError: Not configured. Try 'vdsm-tool configure'
vdsm: failed to execute check_is_configured, error code 1
vdsm start [FAILED]
Created attachment 831567 [details] hostdeploy log
2013-12-02 11:09:49 DEBUG otopi.plugins.ovirt_host_deploy.vdsm.packages plugin.execute:446 execute-output: ('/usr/bin/vdsm-tool', 'configure', '--force') stderr:
service-status: ServiceOperationError: _serviceStatus failed
Super VDSM daemon is not running
service-status: ServiceOperationError: _serviceStatus failed
VDS daemon is not running
Traceback (most recent call last):
  File "/usr/bin/vdsm-tool", line 145, in <module>
    sys.exit(main())
  File "/usr/bin/vdsm-tool", line 142, in main
    return tool_command[cmd]["command"](*args[1:])
  File "/usr/lib64/python2.6/site-packages/vdsm/tool/configurator.py", line 205, in configure
    service.service_stop(s)
  File "/usr/lib64/python2.6/site-packages/vdsm/tool/service.py", line 369, in service_stop
    return _runAlts(_srvStopAlts, srvName)
  File "/usr/lib64/python2.6/site-packages/vdsm/tool/service.py", line 350, in _runAlts
    "%s failed" % alt.func_name, out, err)
vdsm.tool.service.ServiceOperationError: ServiceOperationError: _serviceStop failed
Sending stop signal sanlock (12883): [  OK  ]
Waiting for sanlock (12883) to stop: [FAILED]
The issue is a failing sanlock service stop operation. The configure part fails if it can't stop the service it needs to configure; currently this occurs on cinteg05.ci.lab.tlv.redhat.com, where service sanlock fails to stop.

There is nothing in sanlock.log, sanlock.pid is filled with the right pid, and the SIGTERM is sent, but nothing happens in sanlock. Sending kill -s SIGTERM manually doesn't help either.

Federico, can you take a look?
(In reply to Yaniv Bronhaim from comment #11)
> The issue is about failing sanlock service stop operation. The configure
> part fails if it can't stop the service to configure, currently it occurs on
> cinteg05.ci.lab.tlv.redhat.com - service sanlock fails to be stopped.
>
> Nothing in sanlock.log, sanlock.pid is filled with the right pid, the
> sigterm is sent but nothing happen in sanlock. sending kill -s SIGTERM
> manually doesn't help.
>
> Federico, can you take a look?

Sanlock refuses to restart when it is holding lockspaces. So there are two issues here: one is that the environment was not properly cleaned, and the second is that if sanlock was in use it didn't need to be restarted.
Besides the environment not being properly cleaned up (and sanlock still holding lockspaces), I think there's something we could improve on the vdsm side as well. As the log reports:

2013-12-02 11:09:59 DEBUG otopi.plugins.ovirt_host_deploy.vdsm.packages plugin.execute:441 execute-output: ['/etc/init.d/vdsmd', 'reconfigure'] stdout:
libvirt is not configured for vdsm yet
sanlock service is already configured

Modules libvirt are not configured
SUCCESS: ssl configured to true. No conflicts

sanlock is already configured, and therefore it doesn't need to be restarted (and fail). Before change 89d70a9 ("Introducing configurator package in vdsm-tool"), sanlock was restarted only when needed, which is why we don't see this happening in 3.3.
The solution provided is to not configure what is already configured; this behaved differently when we passed --force to the configure verb.
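The fix described above, skipping modules that are already configured, can be sketched roughly as follows. This is a minimal illustration, not the actual vdsm code (the real logic lives in vdsm/tool/configurator.py), and the class and function names here are hypothetical:

```python
# Minimal sketch of an idempotent "configure" verb: a module is stopped
# and reconfigured only when it is not yet configured, or when --force
# is given. Names below are illustrative, not vdsm's real API.

class Module(object):
    def __init__(self, name, configured):
        self.name = name
        self.configured = configured
        self.restarted = False

    def isconfigured(self):
        return self.configured

    def configure(self):
        # In the real tool this would stop the service, rewrite its
        # configuration and start it again; here we just record it.
        self.restarted = True
        self.configured = True


def configure_all(modules, force=False):
    """Configure only what needs configuring; return the touched names."""
    touched = []
    for module in modules:
        if force or not module.isconfigured():
            module.configure()
            touched.append(module.name)
    return touched
```

With a guard like this, a host where sanlock is already configured (and possibly busy holding lockspaces) is left alone instead of being restarted and failing, while --force preserves the old reconfigure-everything behaviour.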
Federico, how are we to make sure that sanlock no longer touches outdated storage connections? `service sanlock stop` fails for a reason. We should not just ignore it and go on.
(In reply to Dan Kenigsberg from comment #20)
> Federico, how are we to make sure that sanlock no longer touches outdated
> storage connections? `service sanlock stop` fails for a reason. We should
> not just ignore it and go on.

Yes, this was mentioned in comment 12 and again in comment 13. But that's unrelated to this bug.
Well, it's not mentioned clearly enough, imo. I cannot find a clear procedure for stopping all current sanlock users.

It's related to this bug, since we'd better not fix it before CI cleans up their hosts properly. Otherwise, this dangerous condition would be forgotten, until sanlock eats some precious data.
(In reply to Dan Kenigsberg from comment #22)
> Well, it's not mentioned clearly enough, imo. I cannot find a clear
> procedure of how to stop all current sanlock users.
>
> It's related to this bug, since we'd better not fix it, before CI clean up
> their hosts properly. Otherwise, this dangerous condition would be forgotten
> - until sanlock eats some precious data.

Yes, we definitely have to be able to clean up a host after we're done with it, even if the engine has a bug or crashed and therefore the host wasn't properly put in maintenance.

The easy way to remove all the lockspaces is:

  sanlock client status \
    | awk -vORS='\0' '$1=="s"{print $2}' \
    | xargs -0 -r -n1 sanlock rem_lockspace -s

(Note: the command uses null-terminated lines, -0, because NFS lockspace names contain a colon ":", which is a special character for sanlock and is escaped with "\".)
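To make it clearer what the awk stage of the pipeline above selects, here is a small Python sketch performing the same extraction on sample `sanlock client status`-style output. The sample lines below are illustrative, not captured from a real host; the only assumption taken from the pipeline itself is that lockspace lines start with an "s" token and the lockspace name is the second field:

```python
def lockspace_names(status_output):
    """Return lockspace names from `sanlock client status`-style output.

    Mirrors the awk stage `$1=="s"{print $2}`: keep only lines whose
    first whitespace-separated token is "s" and take the second field.
    """
    names = []
    for line in status_output.splitlines():
        fields = line.split()
        if len(fields) >= 2 and fields[0] == "s":
            names.append(fields[1])
    return names


# Illustrative sample; a real status dump also carries daemon and
# process ("p") lines, which the filter skips.
SAMPLE = """\
daemon 3a4e2f0c
p -1 helper
s 76d52b79-c255-458e-b08e-ca8d017bae3c:1:/rhev/path:0
s e6058c97-3b8e-44be-89c1-de7aede9c06a:1:/rhev/path:0
"""
```

Each extracted name is then fed to `sanlock rem_lockspace -s`, one per invocation (the `xargs -0 -r -n1` part), with NUL separators because the names themselves contain colons.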
I just looked at a test that finished and removed vdsm but failed to clean sanlock; shouldn't that be fixed on the vdsm side? This means that each run actually keeps the sanlock connection active, failing the next run and requiring the command from comment #23. (This is after I manually ran the rem_lockspace command on that host.)

MainThread::DEBUG::2013-12-22 12:27:35,755::task::579::TaskManager.Task::(_updateState) Task=`5cc33113-770c-4899-84c7-e4c7816a2db7`::moving from state init -> state preparing
MainThread::INFO::2013-12-22 12:27:35,755::logUtils::44::dispatcher::(wrapper) Run and protect: prepareForShutdown(options=None)
Thread-11::DEBUG::2013-12-22 12:27:35,757::storageServer::698::ConnectionMonitor::(_monitorConnections) Monitoring stopped
VM Channels Listener::INFO::2013-12-22 12:27:35,946::vmChannels::178::vds::(run) VM channels listener thread has ended.
Thread-124::DEBUG::2013-12-22 12:27:36,345::fileSD::225::Storage.Misc.excCmd::(getReadDelay) '/bin/dd iflag=direct if=/rhev/data-center/mnt/10.35.148.12:_vol_vfilers_nfs01_jenkins-automation-rpm-vm30__nfs__2013__12__22__12__23__22__62163/d1e88bcb-6c0a-4ef9-a3b3-a774a606d3ca/dom_md/metadata bs=4096 count=1' (cwd None)
Thread-124::DEBUG::2013-12-22 12:27:36,356::fileSD::225::Storage.Misc.excCmd::(getReadDelay) SUCCESS: <err> = '0+1 records in\n0+1 records out\n409 bytes (409 B) copied, 0.000226305 s, 1.8 MB/s\n'; <rc> = 0
MainThread::INFO::2013-12-22 12:27:36,692::domainMonitor::125::Storage.DomainMonitor::(close) Stopping domain monitors
MainThread::INFO::2013-12-22 12:27:36,692::domainMonitor::113::Storage.DomainMonitor::(stopMonitoring) Stop monitoring 76d52b79-c255-458e-b08e-ca8d017bae3c
Thread-161::DEBUG::2013-12-22 12:27:36,693::domainMonitor::175::Storage.DomainMonitorThread::(_monitorLoop) Stopping domain monitor for 76d52b79-c255-458e-b08e-ca8d017bae3c
Thread-161::DEBUG::2013-12-22 12:27:36,693::clusterlock::86::SafeLease::(releaseHostId) Host id for domain 76d52b79-c255-458e-b08e-ca8d017bae3c released successfully (id: 1)
MainThread::INFO::2013-12-22 12:27:36,693::domainMonitor::113::Storage.DomainMonitor::(stopMonitoring) Stop monitoring e6058c97-3b8e-44be-89c1-de7aede9c06a
Thread-48::DEBUG::2013-12-22 12:27:36,693::domainMonitor::175::Storage.DomainMonitorThread::(_monitorLoop) Stopping domain monitor for e6058c97-3b8e-44be-89c1-de7aede9c06a
Thread-48::INFO::2013-12-22 12:27:36,694::clusterlock::197::SANLock::(releaseHostId) Releasing host id for domain e6058c97-3b8e-44be-89c1-de7aede9c06a (id: 1)
Thread-48::DEBUG::2013-12-22 12:27:36,694::domainMonitor::184::Storage.DomainMonitorThread::(_monitorLoop) Unable to release the host id 1 for domain e6058c97-3b8e-44be-89c1-de7aede9c06a
Traceback (most recent call last):
  File "/usr/share/vdsm/storage/domainMonitor.py", line 181, in _monitorLoop
    self.domain.releaseHostId(self.hostId, unused=True)
  File "/usr/share/vdsm/storage/sd.py", line 461, in releaseHostId
    self._clusterLock.releaseHostId(hostId, async, unused)
  File "/usr/share/vdsm/storage/clusterlock.py", line 204, in releaseHostId
    raise se.ReleaseHostIdFailure(self._sdUUID, e)
ReleaseHostIdFailure: Cannot release host id: ('e6058c97-3b8e-44be-89c1-de7aede9c06a', SanlockException(16, 'Sanlock lockspace remove failure', 'Device or resource busy'))
MainThread::INFO::2013-12-22 12:27:36,695::domainMonitor::113::Storage.DomainMonitor::(stopMonitoring) Stop monitoring d1e88bcb-6c0a-4ef9-a3b3-a774a606d3ca
Thread-124::DEBUG::2013-12-22 12:27:36,696::domainMonitor::175::Storage.DomainMonitorThread::(_monitorLoop) Stopping domain monitor for d1e88bcb-6c0a-4ef9-a3b3-a774a606d3ca
Thread-124::INFO::2013-12-22 12:27:36,696::clusterlock::197::SANLock::(releaseHostId) Releasing host id for domain d1e88bcb-6c0a-4ef9-a3b3-a774a606d3ca (id: 1)
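For readers of the log above: the ReleaseHostIdFailure is vdsm translating sanlock's error 16 (EBUSY, "Device or resource busy"), raised because the lockspace cannot be removed while resources on it are still held. A rough, self-contained sketch of that error wrapping follows; the exception classes here are stand-ins, not the real sanlock python binding or vdsm's storage code:

```python
# Stand-in for the exception raised by the real `sanlock` python binding.
class SanlockException(Exception):
    def __init__(self, err, msg, strerror):
        super(SanlockException, self).__init__(err, msg, strerror)
        self.err = err


# Stand-in for vdsm's storage-level error.
class ReleaseHostIdFailure(Exception):
    pass


def release_host_id(sd_uuid, host_id, rem_lockspace):
    """Sketch of the releaseHostId pattern seen in the traceback:
    attempt the lockspace removal and translate a sanlock failure
    (e.g. EBUSY while resources are held) into a storage error."""
    try:
        rem_lockspace(sd_uuid, host_id)
    except SanlockException as e:
        raise ReleaseHostIdFailure(sd_uuid, e)
```

The point of the sketch is that the busy condition surfaces only as a wrapped exception; nothing at this level forces the lockspace free, which is why the manual cleanup from comment #23 (or a proper maintenance flow) is needed.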
When VDSM is not in maintenance it doesn't clean up VMs and sanlock lockspaces, as we don't want to hurt the running VMs. The error that you see in comment 24 comes from the VDSM service being stopped (during yum remove) while the SPM is still active.

If a test is interrupted for any reason (e.g. ovirt-engine bug/crash, test canceled, etc.) and you take a shortcut, reinstalling vdsm without stopping the SPM and putting the host in maintenance, there's nothing that VDSM can do.
Eyal - I can add this cleanup stuff to the art host cleanup plugin; however, it will mask the bug. Currently it seems that we are facing the same issue with the downstream RC. I can see this message during add host (maybe related to this bug):

Fails because of:

2014-03-05 09:48:37,656 - MainThread - plmanagement.error_fetcher - ERROR - Errors fetched from VDC(jenkins-vm-27.scl.lab.tlv.redhat.com):
2014-03-05 09:48:27,774 ERROR [org.ovirt.engine.core.bll.InstallerMessages] (VdsDeploy) Installation 10.35.160.69: Failed to execute stage 'Closing up': Command '/sbin/service' failed to execute
2014-03-05 09:48:27,782 INFO [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (VdsDeploy) Correlation ID: 110629f6, Call Stack: null, Custom Event ID: -1, Message: Failed to install Host 10.35.160.69. Failed to execute stage 'Closing up': Command '/sbin/service' failed to execute.
2014-03-05 09:48:28,350 ERROR [org.ovirt.engine.core.utils.ssh.SSHDialog] (org.ovirt.thread.pool-4-thread-4) SSH error running command root.160.69:'umask 0077; MYTMP="$(mktemp -t ovirt-XXXXXXXXXX)"; trap "chmod -R u+rwX \"${MYTMP}\" > /dev/null 2>&1; rm -fr \"${MYTMP}\" > /dev/null 2>&1" 0; rm -fr "${MYTMP}" && mkdir "${MYTMP}" && tar --warning=no-timestamp -C "${MYTMP}" -x && "${MYTMP}"/setup DIALOG/dialect=str:machine DIALOG/customization=bool:True'

Can the QE owner of this bug recheck it? Setting blocker flags.
The vdsm service can be stopped gracefully without errors.
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. http://rhn.redhat.com/errata/RHBA-2014-0504.html