Bug 1057225 - sanlock not configured with no error which fails vdsm service start and fails host installation
Summary: sanlock not configured with no error which fails vdsm service start and fails...
Keywords:
Status: CLOSED CURRENTRELEASE
Alias: None
Product: oVirt
Classification: Retired
Component: vdsm
Version: 3.4
Hardware: x86_64
OS: Linux
unspecified
medium
Target Milestone: ---
: 3.4.0
Assignee: Yaniv Bronhaim
QA Contact: sefi litmanovich
URL:
Whiteboard: infra
: 1059254 1063886 (view as bug list)
Depends On:
Blocks: 1024889
TreeView+ depends on / blocked
 
Reported: 2014-01-23 16:34 UTC by Dafna Ron
Modified: 2020-09-14 07:55 UTC (History)
16 users (show)

Fixed In Version: ovirt-3.4.0-beta3
Doc Type: Bug Fix
Doc Text:
Clone Of:
Environment:
Last Closed: 2014-03-31 12:28:13 UTC
oVirt Team: ---


Attachments (Terms of Use)
logs (101.32 KB, application/x-gzip)
2014-01-23 16:34 UTC, Dafna Ron
no flags Details


Links
System ID Private Priority Status Summary Last Updated
oVirt gerrit 24007 0 None None None Never
oVirt gerrit 24153 0 None None None Never
oVirt gerrit 25162 0 None MERGED Use same constants for configuring and check configure of sanlock Never

Description Dafna Ron 2014-01-23 16:34:25 UTC
Created attachment 854474 [details]
logs

Description of problem:

host failed install with following error: 

2014-01-23 17:55:34 DEBUG otopi.context context._executeMethod:152 method exception
Traceback (most recent call last):
  File "/tmp/ovirt-NTFpH4YDW0/pythonlib/otopi/context.py", line 142, in _executeMethod
    method['method']()
  File "/tmp/ovirt-NTFpH4YDW0/otopi-plugins/ovirt-host-deploy/vdsm/packages.py", line 219, in _start
    self.services.state('vdsmd', True)
  File "/tmp/ovirt-NTFpH4YDW0/otopi-plugins/otopi/services/rhel.py", line 188, in state
    'start' if state else 'stop'
  File "/tmp/ovirt-NTFpH4YDW0/otopi-plugins/otopi/services/rhel.py", line 96, in _executeServiceCommand
    raiseOnError=raiseOnError
  File "/tmp/ovirt-NTFpH4YDW0/pythonlib/otopi/plugin.py", line 451, in execute
    command=args[0],
RuntimeError: Command '/sbin/service' failed to execute
2014-01-23 17:55:34 ERROR otopi.context context._executeMethod:161 Failed to execute stage 'Closing up': Command '/sbin/service' failed to execute

Alon check it and determined the following: 
host-deploy executes: /usr/bin/vdsm-tool --force -> this succeed, 
Done configuring modules to VDSM with no error

but Modules sanlock are not configured -> vdsm cannot be started. 

Version-Release number of selected component (if applicable):

[root@cougar06 ~]# rpm -qa |grep vdsm 
vdsm-python-4.14.1-2.el6.x86_64
vdsm-cli-4.14.1-2.el6.noarch
vdsm-python-zombiereaper-4.14.1-2.el6.noarch
vdsm-xmlrpc-4.14.1-2.el6.noarch
vdsm-4.14.1-2.el6.x86_64

[root@vm-18-201 host-deploy]# rpm -qa |grep otopi
otopi-java-1.2.0-0.1.beta.el6.noarch
otopi-1.2.0-0.1.beta.el6.noarch

How reproducible:


Steps to Reproduce:
1. install vdsm on the host (yum install vdsm) 
2. install host from UI
3.

Actual results:

host fails to be installed

Expected results:

if we fail to configure sanlock we should fail installation. 

Additional info:

opening on medium since I rebooted the host and re-installed and it solved the issue

Comment 1 Itamar Heim 2014-01-26 08:12:05 UTC
Setting target release to current version for consideration and review. please
do not push non-RFE bugs to an undefined target release to make sure bugs are
reviewed for relevancy, fix, closure, etc.

Comment 2 Alon Bar-Lev 2014-01-29 13:54:37 UTC
*** Bug 1059254 has been marked as a duplicate of this bug. ***

Comment 3 Douglas Schilling Landgraf 2014-01-29 14:41:08 UTC
We have found this problem here: https://bugzilla.redhat.com/show_bug.cgi?id=1055153 too

Comment 4 Federico Simoncelli 2014-02-03 15:03:18 UTC
Looking at the logs:

2014-01-23 17:55:34 DEBUG otopi.plugins.otopi.services.rhel plugin.execute:446 execute-output: ('/sbin/service', 'vdsmd', 'start') stderr:
initctl: Job is already running: libvirtd
Traceback (most recent call last):
  File "/usr/bin/vdsm-tool", line 145, in <module>
    sys.exit(main())
  File "/usr/bin/vdsm-tool", line 142, in main
    return tool_command[cmd]["command"](*args[1:])
  File "/usr/lib64/python2.6/site-packages/vdsm/tool/configurator.py", line 261, in isconfigured
    raise RuntimeError(msg)
RuntimeError:

I see that initctl is called (I don't think it should), and when the output is "initctl: Job is already running: libvirtd" the exit code is 1:

# initctl start libvirtd
initctl: Job is already running: libvirtd
# echo $?
1

Are we still using initctl also on fedora?

Anyway is it possible that it's the initctl failure that triggers this issue? If so sanlock is not involved at all.

Comment 5 Douglas Schilling Landgraf 2014-02-03 15:13:29 UTC
(In reply to Federico Simoncelli from comment #4)
> Looking at the logs:
> 
> 2014-01-23 17:55:34 DEBUG otopi.plugins.otopi.services.rhel
> plugin.execute:446 execute-output: ('/sbin/service', 'vdsmd', 'start')
> stderr:
> initctl: Job is already running: libvirtd
> Traceback (most recent call last):
>   File "/usr/bin/vdsm-tool", line 145, in <module>
>     sys.exit(main())
>   File "/usr/bin/vdsm-tool", line 142, in main
>     return tool_command[cmd]["command"](*args[1:])
>   File "/usr/lib64/python2.6/site-packages/vdsm/tool/configurator.py", line
> 261, in isconfigured
>     raise RuntimeError(msg)
> RuntimeError:
> 
> I see that initctl is called (I don't think it should), and when the output
> is "initctl: Job is already running: libvirtd" the exit code is 1:
> 
> # initctl start libvirtd
> initctl: Job is already running: libvirtd
> # echo $?
> 1
> 
> Are we still using initctl also on fedora?

No, Fedora 19 doesn't contain it. However you got the log probably from the CentOS user. 

> 
> Anyway is it possible that it's the initctl failure that triggers this
> issue? If so sanlock is not involved at all.

I don't think so, we are facing this on CentOS and Fedora.
Please see this as well: https://bugzilla.redhat.com/show_bug.cgi?id=1055153

Comment 6 Yaniv Bronhaim 2014-02-03 15:29:39 UTC
I suspect that the problem is that when sanlock is down we assume configured, and you see in the log [1] the print "sanlock service is not running" and then "sanlock service is already configured". those prints relate to "vdsm-tool configure --force" or "vdsmd reconfigure" output that check isconfigured on each module and configure if not. 

then vdsmd starts sanlock service as required service and isconfigured fails on the group check (part of the pre-start tasks before vdsm starts). which is odd because vdsmd installation should add all the groups info (iiuc).

so the problem is somewhere in the spec, where vdsm configures the groups for sanlock process.
Federico , am i missing something?

[1] ovirt-20140123105534-cougar06.scl.lab.tlv.redhat.com-74d5bd50.log.xz

Comment 7 Yaniv Bronhaim 2014-02-03 16:19:02 UTC
ok, now its all clear. in patch [1] i moved the "/usr/sbin/usermod -a -G %{qemu_group},%{vdsm_group} %{snlk_user}" from spec %pre section to the configurator module. BUT, when we check isconfigured as I stated in previous comment we assume configured when sanlock is not running. then we end up with fail start of vdsm because sanlock configure did not run at first 'vdsm-tool configure' run.

this bug relates only to vdsm 3.4 which includes this commit and exposes a regression. it happens 100% if sanlock is stopped when host deploy runs and install vdsm on host.

its either [2] or moving the usermod action to the spec back as it was.

which alternative is preferred? 

[1] http://gerrit.ovirt.org/20100
[2] http://gerrit.ovirt.org/24007

Comment 8 Yaniv Bronhaim 2014-02-12 08:53:41 UTC
*** Bug 1063886 has been marked as a duplicate of this bug. ***

Comment 9 sefi litmanovich 2014-03-05 10:03:23 UTC
Reproduced with ovirt3.4.0-0.5beta1, otopi-1.2.0-0.1.beta.el6.noarch , vdsm-4.14.1-2.el6.x86_64

Verified with ovirt-engine-3.4.0-0.11.beta3.el6.noarch, otopi-1.2.0-0.3.beta2.el6.noarch.rpm, vdsm-4.14.3-0.el6.x86_64

Comment 10 Sandro Bonazzola 2014-03-31 12:28:13 UTC
this is an automated message: moving to Closed CURRENT RELEASE since oVirt 3.4.0 has been released

Comment 11 bakgatnet 2014-07-27 02:19:29 UTC
Issue persists.
Installed as per (rather poorly) documented methods.
* http://www.ovirt.org/Quick_Start_Guide#Install_oVirt_Engine_.28Fedora_.2F_Red_Hat_Enterprise_Linux_.2F_CentOS.29
* http://www.ovirt.org/Feature/AllInOne

# uname -a
Linux $REDACTED 2.6.32-431.20.5.el6.centos.plus.x86_64 #1 SMP Fri Jul 25 07:10:09 UTC 2014 x86_64 x86_64 x86_64 GNU/Linux

# cat /etc/redhat-release 
CentOS release 6.5 (Final)

# rpm -qa |grep -i ovirt
ovirt-release34-1.0.1-1.noarch
ovirt-host-deploy-1.2.2-1.el6.noarch
ovirt-engine-restapi-3.4.3-1.el6.noarch
ovirt-engine-sdk-python-3.4.3.0-1.el6.noarch
ovirt-engine-websocket-proxy-3.4.3-1.el6.noarch
ovirt-engine-cli-3.4.0.5-1.el6.noarch
ovirt-engine-setup-plugin-ovirt-engine-common-3.4.3-1.el6.noarch
ovirt-engine-webadmin-portal-3.4.3-1.el6.noarch
ovirt-engine-userportal-3.4.3-1.el6.noarch
ovirt-engine-3.4.3-1.el6.noarch
ovirt-engine-setup-plugin-websocket-proxy-3.4.3-1.el6.noarch
ovirt-host-deploy-java-1.2.2-1.el6.noarch
ovirt-engine-setup-plugin-ovirt-engine-3.4.3-1.el6.noarch
ovirt-engine-backend-3.4.3-1.el6.noarch
ovirt-host-deploy-offline-1.2.2-1.el6.x86_64
ovirt-engine-lib-3.4.3-1.el6.noarch
ovirt-engine-setup-base-3.4.3-1.el6.noarch
ovirt-iso-uploader-3.4.3-1.el6.noarch
ovirt-engine-setup-3.4.3-1.el6.noarch
ovirt-engine-dbscripts-3.4.3-1.el6.noarch
ovirt-image-uploader-3.4.2-1.el6.noarch
ovirt-engine-tools-3.4.3-1.el6.noarch

          --== END OF SUMMARY ==--
[ INFO  ] Starting engine service
[ ERROR ] Failed to execute stage 'Closing up': Command '/sbin/service' failed to execute
[ INFO  ] Stage: Clean up
          Log file is located at /var/log/ovirt-engine/setup/ovirt-engine-setup-20140727140628-g0z922.log
[ INFO  ] Generating answer file '/var/lib/ovirt-engine/setup/answers/20140727141521-setup.conf'
[ INFO  ] Stage: Pre-termination
[ INFO  ] Stage: Termination
[ ERROR ] Execution of setup failed


2014-07-27 13:25:01 DEBUG otopi.context context._executeMethod:152 method exception
Traceback (most recent call last):
  File "/usr/lib/python2.6/site-packages/otopi/context.py", line 142, in _executeMethod
    method['method']()
  File "/usr/share/ovirt-engine/setup/bin/../plugins/ovirt-engine-setup/ovirt-engine/system/engine.py", line 51, in _closeup
    state=True,
  File "/usr/share/otopi/plugins/otopi/services/rhel.py", line 188, in state
    'start' if state else 'stop'
  File "/usr/share/otopi/plugins/otopi/services/rhel.py", line 96, in _executeServiceCommand
    raiseOnError=raiseOnError
  File "/usr/lib/python2.6/site-packages/otopi/plugin.py", line 871, in execute
    command=args[0],
RuntimeError: Command '/sbin/service' failed to execute
2014-07-27 13:25:01 ERROR otopi.context context._executeMethod:161 Failed to execute stage 'Closing up': Command '/sbin/service' failed to execute

Comment 12 bakgatnet 2014-07-27 05:46:52 UTC
Tried upgrading to the 3.5 development release - same issue.

2014-07-27 17:40:58 DEBUG otopi.plugins.otopi.services.rhel plugin.execute:866 execute-output: ('/sbin/service', 'ovirt-engine', 'start') stderr:


2014-07-27 17:40:58 DEBUG otopi.context context._executeMethod:152 method exception
Traceback (most recent call last):
  File "/usr/lib/python2.6/site-packages/otopi/context.py", line 142, in _executeMethod
    method['method']()
  File "/usr/share/ovirt-engine/setup/bin/../plugins/ovirt-engine-setup/ovirt-engine/system/engine.py", line 57, in _closeup
    state=True,
  File "/usr/share/otopi/plugins/otopi/services/rhel.py", line 188, in state
    'start' if state else 'stop'
  File "/usr/share/otopi/plugins/otopi/services/rhel.py", line 96, in _executeServiceCommand
    raiseOnError=raiseOnError
  File "/usr/lib/python2.6/site-packages/otopi/plugin.py", line 871, in execute
    command=args[0],
RuntimeError: Command '/sbin/service' failed to execute
2014-07-27 17:40:58 ERROR otopi.context context._executeMethod:161 Failed to execute stage 'Closing up': Command '/sbin/service' failed to execute
2014-07-27 17:40:58 DEBUG otopi.context context.dumpEnvironment:468 ENVIRONMENT DUMP - BEGIN
2014-07-27 17:40:58 DEBUG otopi.context context.dumpEnvironment:478 ENV BASE/error=bool:'True'
2014-07-27 17:40:58 DEBUG otopi.context context.dumpEnvironment:478 ENV BASE/exceptionInfo=list:'[(<type 'exceptions.RuntimeError'>, RuntimeError("Command '/sbin/service' failed to execute",), <traceback object at 0x27fdf80>)]'
2014-07-27 17:40:58 DEBUG otopi.context context.dumpEnvironment:482 ENVIRONMENT DUMP - END
2014-07-27 17:40:58 INFO otopi.context context.runSequence:395 Stage: Clean up
2014-07-27 17:40:58 DEBUG otopi.context context.runSequence:399 STAGE cleanup
2014-07-27 17:40:58 DEBUG otopi.context context._executeMethod:138 Stage cleanup METHOD otopi.plugins.ovirt_engine_common.base.core.misc.Plugin._cleanup

Comment 13 Yaniv Bronhaim 2014-07-27 12:59:26 UTC
hey,
we need logs to investigate - host-deploy log, syslog (messages) and vdsm.log should be enough. please attach if it still occurs


Note You need to log in before you can comment on or make changes to this bug.