Bug 1035847 - vdsm-tool configure fails stopping service [NEEDINFO]
Summary: vdsm-tool configure fails stopping service
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat Enterprise Virtualization Manager
Classification: Red Hat
Component: vdsm
Version: 3.4.0
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: urgent
Target Milestone: ---
Target Release: 3.4.0
Assignee: Yaniv Bronhaim
QA Contact: Leonid Natapov
URL:
Whiteboard: infra
Depends On:
Blocks:
 
Reported: 2013-11-28 15:53 UTC by Ohad Basan
Modified: 2016-02-10 19:43 UTC
CC List: 19 users

Fixed In Version: ovirt-3.4.0-beta2
Doc Type: Bug Fix
Doc Text:
Previously, the vdsm-tool configuration process failed while stopping the vdsm service. VDSM now shuts down gracefully, and vdsm-tool configuration continues as expected.
Clone Of:
Environment:
Last Closed: 2014-06-09 13:26:45 UTC
oVirt Team: Infra
Target Upstream Version:
eedri: needinfo?


Attachments
hostdeploy log (283.34 KB, text/plain)
2013-12-02 13:03 UTC, Ohad Basan
no flags


Links
System ID Private Priority Status Summary Last Updated
Red Hat Product Errata RHBA-2014:0504 0 normal SHIPPED_LIVE vdsm 3.4.0 bug fix and enhancement update 2014-06-09 17:21:35 UTC
oVirt gerrit 22492 0 None None None Never

Description Ohad Basan 2013-11-28 15:53:59 UTC
Description of problem:
vdsm built from hash fecb09b fails to start, and host deploy fails as a result.

Running Transaction
  Installing : vdsm-python-4.13.0-201.gitfecb09b.el6.x86_64                  1/6
  Installing : vdsm-xmlrpc-4.13.0-201.gitfecb09b.el6.noarch                  2/6
  Installing : vdsm-cli-4.13.0-201.gitfecb09b.el6.noarch                     3/6
  Installing : vdsm-python-zombiereaper-4.13.0-201.gitfecb09b.el6.noarch     4/6
  Installing : vdsm-python-cpopen-4.13.0-0.10.beta1.el6ev.x86_64             5/6
  Installing : vdsm-4.13.0-201.gitfecb09b.el6.x86_64                         6/6
  Verifying  : vdsm-python-cpopen-4.13.0-0.10.beta1.el6ev.x86_64             1/6
  Verifying  : vdsm-cli-4.13.0-201.gitfecb09b.el6.noarch                     2/6
  Verifying  : vdsm-python-zombiereaper-4.13.0-201.gitfecb09b.el6.noarch     3/6
  Verifying  : vdsm-python-4.13.0-201.gitfecb09b.el6.x86_64                  4/6
  Verifying  : vdsm-xmlrpc-4.13.0-201.gitfecb09b.el6.noarch                  5/6
  Verifying  : vdsm-4.13.0-201.gitfecb09b.el6.x86_64                         6/6

Installed:
  vdsm.x86_64 0:4.13.0-201.gitfecb09b.el6                                                                                                                                                                          

Dependency Installed:
  vdsm-cli.noarch 0:4.13.0-201.gitfecb09b.el6     vdsm-python.x86_64 0:4.13.0-201.gitfecb09b.el6  vdsm-python-cpopen.x86_64 0:4.13.0-0.10.beta1.el6ev  vdsm-python-zombiereaper.noarch 0:4.13.0-201.gitfecb09b.el6 
  vdsm-xmlrpc.noarch 0:4.13.0-201.gitfecb09b.el6 

Complete!
[root@cinteg05 yum.repos.d]# service vdsmd start
vdsm: Running mkdirs
vdsm: Running configure_coredump
vdsm: Running run_init_hooks
vdsm: Running gencerts
vdsm: Running check_is_configured
libvirt is not configured for vdsm yet
sanlock service is already configured
Modules libvirt are not configured
 Traceback (most recent call last):
  File "/usr/bin/vdsm-tool", line 145, in <module>
    sys.exit(main())
  File "/usr/bin/vdsm-tool", line 142, in main
    return tool_command[cmd]["command"](*args[1:])
  File "/usr/lib64/python2.6/site-packages/vdsm/tool/configurator.py", line 234, in isconfigured
    raise RuntimeError("Not configured. Try 'vdsm-tool configure'")
RuntimeError: Not configured. Try 'vdsm-tool configure'
vdsm: failed to execute check_is_configured, error code 1
vdsm start                                                 [FAILED]
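
For reference, the check_is_configured step above boils down to asking each configurable module whether it is configured and aborting if any is not. A minimal sketch in Python, assuming a simple name-to-state mapping (the real logic lives in /usr/lib64/python2.6/site-packages/vdsm/tool/configurator.py):

  # Minimal sketch of a check like check_is_configured (module names and the
  # mapping are illustrative, not the actual configurator.py code).
  def isconfigured(modules):
      """Raise if any module still needs 'vdsm-tool configure'."""
      unconfigured = [name for name, done in modules.items() if not done]
      if unconfigured:
          print("Modules %s are not configured" % " ".join(unconfigured))
          raise RuntimeError("Not configured. Try 'vdsm-tool configure'")

  # Matches the output above: libvirt not configured yet, sanlock already done.
  isconfigured({"libvirt": False, "sanlock": True})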

Comment 9 Ohad Basan 2013-12-02 13:03:41 UTC
Created attachment 831567 [details]
hostdeploy log

Comment 10 Alon Bar-Lev 2013-12-02 13:08:21 UTC
2013-12-02 11:09:49 DEBUG otopi.plugins.ovirt_host_deploy.vdsm.packages plugin.execute:446 execute-output: ('/usr/bin/vdsm-tool', 'configure', '--force') stderr:
service-status: ServiceOperationError: _serviceStatus failed
Super VDSM daemon is not running

service-status: ServiceOperationError: _serviceStatus failed
VDS daemon is not running

Traceback (most recent call last):
  File "/usr/bin/vdsm-tool", line 145, in <module>
    sys.exit(main())
  File "/usr/bin/vdsm-tool", line 142, in main
    return tool_command[cmd]["command"](*args[1:])
  File "/usr/lib64/python2.6/site-packages/vdsm/tool/configurator.py", line 205, in configure
    service.service_stop(s)
  File "/usr/lib64/python2.6/site-packages/vdsm/tool/service.py", line 369, in service_stop
    return _runAlts(_srvStopAlts, srvName)
  File "/usr/lib64/python2.6/site-packages/vdsm/tool/service.py", line 350, in _runAlts
    "%s failed" % alt.func_name, out, err)
vdsm.tool.service.ServiceOperationError: ServiceOperationError: _serviceStop failed
Sending stop signal sanlock (12883): [  OK  ]
Waiting for sanlock (12883) to stop:[FAILED]

Comment 11 Yaniv Bronhaim 2013-12-08 13:23:46 UTC
The issue is the failing sanlock service stop operation. The configure step fails if it cannot stop the service it needs to configure; currently this occurs on cinteg05.ci.lab.tlv.redhat.com, where the sanlock service fails to stop.

There is nothing in sanlock.log, sanlock.pid contains the right pid, and the SIGTERM is sent, but nothing happens in sanlock. Sending kill -s SIGTERM manually doesn't help either.

Federico, can you take a look?
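
For context, the flow that fails in comment 10 is roughly: the configure verb stops every service it is about to reconfigure, and a single failed stop aborts the whole run. A rough sketch, assuming a hypothetical service_stop helper in place of vdsm.tool.service.service_stop and an illustrative service list:

  import subprocess

  class ServiceOperationError(RuntimeError):
      pass

  def service_stop(name):
      # Hypothetical stand-in for vdsm.tool.service.service_stop.
      if subprocess.call(["service", name, "stop"]) != 0:
          raise ServiceOperationError("_serviceStop failed")

  def configure(services=("supervdsmd", "vdsmd", "sanlock")):
      for name in services:
          # A sanlock that is still holding lockspaces refuses to stop,
          # so the exception raised here aborts the configure run.
          service_stop(name)
      # ... write the new configuration and restart the services ...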

Comment 12 Federico Simoncelli 2013-12-09 17:19:04 UTC
(In reply to Yaniv Bronhaim from comment #11)
> The issue is the failing sanlock service stop operation. The configure step
> fails if it cannot stop the service it needs to configure; currently this
> occurs on cinteg05.ci.lab.tlv.redhat.com, where the sanlock service fails to
> stop.
> 
> There is nothing in sanlock.log, sanlock.pid contains the right pid, and the
> SIGTERM is sent, but nothing happens in sanlock. Sending kill -s SIGTERM
> manually doesn't help either.
> 
> Federico, can you take a look?

Sanlock refuses to restart when it is holding lockspaces, so there are two issues here: first, the environment was not properly cleaned, and second, since sanlock was already in use it did not need to be restarted at all.

Comment 14 Federico Simoncelli 2013-12-10 08:50:31 UTC
Besides the environment not being properly cleaned up (and sanlock still holding lockspaces), I think there's something we could improve on the vdsm side as well. As the log reports:

2013-12-02 11:09:59 DEBUG otopi.plugins.ovirt_host_deploy.vdsm.packages plugin.execute:441 execute-output: ['/etc/init.d/vdsmd', 'reconfigure'] stdout:
libvirt is not configured for vdsm yet
sanlock service is already configured
Modules libvirt are not configured
 SUCCESS: ssl configured to true. No conflicts

sanlock is already configured and therefore does not need to be restarted (which is the operation that fails).

Before change:

 89d70a9 Introducing configurator package in vdsm-tool

sanlock was restarted only when needed, and that's why we don't see this happening in 3.3.

Comment 16 Yaniv Bronhaim 2013-12-12 17:18:55 UTC
The solution provided is to skip configuring what is already configured; this used to work differently when --force was passed to the configure verb.
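
A hedged sketch of that solution (class and function names are illustrative, not the actual change tracked in oVirt gerrit 22492): modules that already report themselves as configured are skipped, so an already-configured sanlock is never stopped or restarted.

  class Module(object):
      def __init__(self, name, configured):
          self.name = name
          self.configured = configured

      def isconfigured(self):
          return self.configured

      def configure(self):
          print("stopping, configuring and restarting %s" % self.name)
          self.configured = True

  def configure_all(modules):
      for mod in modules:
          if mod.isconfigured():
              print("%s is already configured, skipping restart" % mod.name)
              continue
          mod.configure()

  # Matches the state in comment 14: libvirt needs work, sanlock is left alone.
  configure_all([Module("libvirt", False), Module("sanlock", True)])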

Comment 20 Dan Kenigsberg 2013-12-18 16:19:46 UTC
Federico, how are we to make sure that sanlock no longer touches outdated storage connections? `service sanlock stop` fails for a reason. We should not just ignore it and go on.

Comment 21 Federico Simoncelli 2013-12-18 20:02:55 UTC
(In reply to Dan Kenigsberg from comment #20)
> Federico, how are we to make sure that sanlock no longer touches outdated
> storage connections? `service sanlock stop` fails for a reason. We should
> not just ignore it and go on.

Yes, this was mentioned in comment 12 and again in comment 13. But that's unrelated to this bug.

Comment 22 Dan Kenigsberg 2013-12-18 23:47:17 UTC
Well, it's not mentioned clearly enough, imo. I cannot find a clear procedure for stopping all current sanlock users.

It's related to this bug, since we'd better not fix it before CI cleans up their hosts properly. Otherwise, this dangerous condition would be forgotten until sanlock eats some precious data.

Comment 23 Federico Simoncelli 2013-12-19 11:04:12 UTC
(In reply to Dan Kenigsberg from comment #22)
> Well, it's not mentioned clearly enough, imo. I cannot find a clear
> procedure for stopping all current sanlock users.
> 
> It's related to this bug, since we'd better not fix it before CI cleans up
> their hosts properly. Otherwise, this dangerous condition would be forgotten
> until sanlock eats some precious data.

Yes, we definitely have to be able to clean up a host after we're done with it, even if the engine has a bug or crashed and therefore the host wasn't properly put in maintenance.

The easy way to remove all the lockspaces is:

 sanlock client status \
     | awk -vORS='\0' '$1=="s"{print $2}' \
     | xargs -0 -r -n1 sanlock rem_lockspace -s

(Note: the command uses null-terminated lines (-0) because NFS lockspaces contain a colon ":", which is a special character for sanlock and is escaped with "\".)
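
For CI cleanup scripts, the same thing can be wrapped in a few lines of Python. A sketch that relies only on the sanlock invocations quoted above and assumes lockspace lines in the 'sanlock client status' output start with "s", as in the awk filter:

  import subprocess

  def remove_all_lockspaces():
      out = subprocess.check_output(["sanlock", "client", "status"]).decode()
      for line in out.splitlines():
          fields = line.split()
          if fields and fields[0] == "s":
              # fields[1] is the lockspace name as printed by 'sanlock client status'.
              subprocess.check_call(["sanlock", "rem_lockspace", "-s", fields[1]])

  if __name__ == "__main__":
      remove_all_lockspaces()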

Comment 24 Eyal Edri 2013-12-22 10:35:21 UTC
I just looked at a test that finished, removed VDSM, and failed to clean up sanlock. Shouldn't that be fixed on the vdsm side?

This means that each run actually keeps the sanlock connection active, which makes the next run fail and requires the command from comment #23.

(This is after I manually ran the rem_lockspace command on that host.)


MainThread::DEBUG::2013-12-22 12:27:35,755::task::579::TaskManager.Task::(_updateState) Task=`5cc33113-770c-4899-84c7-e4c7816a2db7`::moving from state init -> state preparing
MainThread::INFO::2013-12-22 12:27:35,755::logUtils::44::dispatcher::(wrapper) Run and protect: prepareForShutdown(options=None)
Thread-11::DEBUG::2013-12-22 12:27:35,757::storageServer::698::ConnectionMonitor::(_monitorConnections) Monitoring stopped
VM Channels Listener::INFO::2013-12-22 12:27:35,946::vmChannels::178::vds::(run) VM channels listener thread has ended.
Thread-124::DEBUG::2013-12-22 12:27:36,345::fileSD::225::Storage.Misc.excCmd::(getReadDelay) '/bin/dd iflag=direct if=/rhev/data-center/mnt/10.35.148.12:_vol_vfilers_nfs01_jenkins-automation-rpm-vm30__nfs__2013__12__22__12__23__22__62163/d1e88bcb-6c0a-4ef9-a3b3-a774a606d3ca/dom_md/metadata bs=4096 count=1' (cwd None)
Thread-124::DEBUG::2013-12-22 12:27:36,356::fileSD::225::Storage.Misc.excCmd::(getReadDelay) SUCCESS: <err> = '0+1 records in\n0+1 records out\n409 bytes (409 B) copied, 0.000226305 s, 1.8 MB/s\n'; <rc> = 0
MainThread::INFO::2013-12-22 12:27:36,692::domainMonitor::125::Storage.DomainMonitor::(close) Stopping domain monitors
MainThread::INFO::2013-12-22 12:27:36,692::domainMonitor::113::Storage.DomainMonitor::(stopMonitoring) Stop monitoring 76d52b79-c255-458e-b08e-ca8d017bae3c
Thread-161::DEBUG::2013-12-22 12:27:36,693::domainMonitor::175::Storage.DomainMonitorThread::(_monitorLoop) Stopping domain monitor for 76d52b79-c255-458e-b08e-ca8d017bae3c
Thread-161::DEBUG::2013-12-22 12:27:36,693::clusterlock::86::SafeLease::(releaseHostId) Host id for domain 76d52b79-c255-458e-b08e-ca8d017bae3c released successfully (id: 1)
MainThread::INFO::2013-12-22 12:27:36,693::domainMonitor::113::Storage.DomainMonitor::(stopMonitoring) Stop monitoring e6058c97-3b8e-44be-89c1-de7aede9c06a
Thread-48::DEBUG::2013-12-22 12:27:36,693::domainMonitor::175::Storage.DomainMonitorThread::(_monitorLoop) Stopping domain monitor for e6058c97-3b8e-44be-89c1-de7aede9c06a
Thread-48::INFO::2013-12-22 12:27:36,694::clusterlock::197::SANLock::(releaseHostId) Releasing host id for domain e6058c97-3b8e-44be-89c1-de7aede9c06a (id: 1)
Thread-48::DEBUG::2013-12-22 12:27:36,694::domainMonitor::184::Storage.DomainMonitorThread::(_monitorLoop) Unable to release the host id 1 for domain e6058c97-3b8e-44be-89c1-de7aede9c06a
Traceback (most recent call last):
  File "/usr/share/vdsm/storage/domainMonitor.py", line 181, in _monitorLoop
    self.domain.releaseHostId(self.hostId, unused=True)
  File "/usr/share/vdsm/storage/sd.py", line 461, in releaseHostId
    self._clusterLock.releaseHostId(hostId, async, unused)
  File "/usr/share/vdsm/storage/clusterlock.py", line 204, in releaseHostId
    raise se.ReleaseHostIdFailure(self._sdUUID, e)
ReleaseHostIdFailure: Cannot release host id: ('e6058c97-3b8e-44be-89c1-de7aede9c06a', SanlockException(16, 'Sanlock lockspace remove failure', 'Device or resource busy'))
MainThread::INFO::2013-12-22 12:27:36,695::domainMonitor::113::Storage.DomainMonitor::(stopMonitoring) Stop monitoring d1e88bcb-6c0a-4ef9-a3b3-a774a606d3ca
Thread-124::DEBUG::2013-12-22 12:27:36,696::domainMonitor::175::Storage.DomainMonitorThread::(_monitorLoop) Stopping domain monitor for d1e88bcb-6c0a-4ef9-a3b3-a774a606d3ca
Thread-124::INFO::2013-12-22 12:27:36,696::clusterlock::197::SANLock::(releaseHostId) Releasing host id for domain d1e88bcb-6c0a-4ef9-a3b3-a774a606d3ca (id: 1)

Comment 25 Federico Simoncelli 2013-12-22 12:20:42 UTC
When VDSM is not in maintenance it doesn't clean up VMs and sanlock lockspaces as we don't want to hurt the running VMs.

The error that you see in comment 24 is coming from the VDSM service being stopped (during yum remove) while the SPM is still active.

If a test is interrupted for any reason (e.g. ovirt-engine bug/crash, test canceled, etc.) and you take the shortcut of reinstalling vdsm without stopping the SPM and putting the host in maintenance, there's nothing that VDSM can do.

Comment 27 Ilia Meerovich 2014-03-05 10:13:31 UTC
Eyal - I can add this cleanup to the art host cleanup plugin; however, it will mask the bug.
Currently it seems that we are facing the same issue with downstream RC.

I can see this message during add host (maybe related to this bug):

Fails because of:
2014-03-05 09:48:37,656 - MainThread - plmanagement.error_fetcher - ERROR - Errors fetched from VDC(jenkins-vm-27.scl.lab.tlv.redhat.com): 2014-03-05 09:48:27,774 ERROR [org.ovirt.engine.core.bll.InstallerMessages] (VdsDeploy) Installation 10.35.160.69: Failed to execute stage 'Closing up': Command '/sbin/service' failed to execute
2014-03-05 09:48:27,782 INFO  [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (VdsDeploy) Correlation ID: 110629f6, Call Stack: null, Custom Event ID: -1, Message: Failed to install Host 10.35.160.69. Failed to execute stage 'Closing up': Command '/sbin/service' failed to execute.
2014-03-05 09:48:28,350 ERROR [org.ovirt.engine.core.utils.ssh.SSHDialog] (org.ovirt.thread.pool-4-thread-4) SSH error running command root.160.69:'umask 0077; MYTMP="$(mktemp -t ovirt-XXXXXXXXXX)"; trap "chmod -R u+rwX \"${MYTMP}\" > /dev/null 2>&1; rm -fr \"${MYTMP}\" > /dev/null 2>&1" 0; rm -fr "${MYTMP}" && mkdir "${MYTMP}" && tar --warning=no-timestamp -C "${MYTMP}" -x &&  "${MYTMP}"/setup DIALOG/dialect=str:machine DIALOG/customization=bool:True'

Can the QE owner of this bug recheck it?
Setting blocker flags.

Comment 29 Leonid Natapov 2014-03-13 09:56:43 UTC
The vdsm service can be stopped gracefully without errors.

Comment 30 errata-xmlrpc 2014-06-09 13:26:45 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

http://rhn.redhat.com/errata/RHBA-2014-0504.html

