Bug 1120209
Summary: | Failed to install Host alma04.qa.lab.tlv.redhat.com. Failed to execute stage 'Closing up': Command '/bin/systemctl' failed to execute. | ||||||||||
---|---|---|---|---|---|---|---|---|---|---|---|
Product: | Red Hat Enterprise Virtualization Manager | Reporter: | Nikolai Sednev <nsednev> | ||||||||
Component: | vdsm | Assignee: | Nir Soffer <nsoffer> | ||||||||
Status: | CLOSED CURRENTRELEASE | QA Contact: | Ori Gofen <ogofen> | ||||||||
Severity: | high | Docs Contact: | |||||||||
Priority: | high | ||||||||||
Version: | 3.4.0 | CC: | acanan, alonbl, amureini, bazulay, danken, ecohen, gklein, iheim, lpeer, nlevinki, nsednev, oourfali, rbalakri, Rhev-m-bugs, scohen, tnisan, yeylon, ykaplan | ||||||||
Target Milestone: | --- | Keywords: | Triaged, ZStream | ||||||||
Target Release: | 3.5.0 | ||||||||||
Hardware: | x86_64 | ||||||||||
OS: | Linux | ||||||||||
Whiteboard: | storage | ||||||||||
Fixed In Version: | v4.16.2 | Doc Type: | Bug Fix | ||||||||
Doc Text: | Story Points: | --- | |||||||||
Clone Of: | |||||||||||
: | 1125178 (view as bug list) | Environment: | |||||||||
Last Closed: | 2015-02-16 13:40:03 UTC | Type: | Bug | ||||||||
Regression: | --- | Mount Type: | --- | ||||||||
Documentation: | --- | CRM: | |||||||||
Verified Versions: | Category: | --- | |||||||||
oVirt Team: | Storage | RHEL 7.3 requirements from Atomic Host: | |||||||||
Cloudforms Team: | --- | Target Upstream Version: | |||||||||
Embargoed: | |||||||||||
Bug Depends On: | |||||||||||
Bug Blocks: | 1125178, 1142923, 1156165 | ||||||||||
Attachments: |
|
I don't think this is a regression. I have seen this error in users logs in the past in the mailing list. Please prove that this is a regression by testing again with: 1. RHEL 6.5 2. RHEV 3.4.0 Please also specify exactly which RHEL 7.0 image you used, and which repositories were used to install vdsm. I tried to add a RHEL 7.0 host to engine yesterday, and the deployment fails when trying to install vdsm, I surprised that you reached a state where you can run vdsm. When you test, make sure you start with clean image, the same image a customer would start with. Also lowering severity, this does not look like urgent issue, as this happens only on the first time you deploy on a clean image. This look like this issue on the mailing list: http://lists.ovirt.org/pipermail/users/2014-February/021904.html Which seems to be caused by this change: http://gerrit.ovirt.org/17487 fixing bug 988995 This change is available rhev-3.3: $ git log --grep I57583eb8931411d40bcce8f6a676621f6ee1eebd commit 620343d6317c849fc985a5083cebb68c995f7c15 Author: Yeela Kaplan <ykaplan> Date: Tue Jul 30 16:27:07 2013 +0300 multipath: Reload multipathd instead of restart on setup Multipathd is required by vdsmd. We should not take care of starting it, but only reloading new configuration. Change-Id: I57583eb8931411d40bcce8f6a676621f6ee1eebd Bug-Url: https://bugzilla.redhat.com/show_bug.cgi?id=988995 Signed-off-by: Yeela Kaplan <ykaplan> Reviewed-on: http://gerrit.ovirt.org/17487 Reviewed-by: Dan Kenigsberg <danken> So this may be a regression in rhev-3.3, but not in 3.4. I think that what happens here is this: 1. vdsm requires multipathd, so multipathd is installed 2. When installing multipathd, it is not started 3. When vdsm change multipath configuration, it used to restart multipathd, but since rhev-3.3 it is reloaded instead. 4. Because multipath is not running, reload fails, exposing the issue that multipathd is not running when vdsm is installed. It seems that the right way to handle this is to ensure that multipathd is running before installing vdsm. Alon, how do you suggest to solve this deployment issue? (In reply to Nir Soffer from comment #9) > It seems that the right way to handle this is to ensure that multipathd > is running before installing vdsm. > > Alon, how do you suggest to solve this deployment issue? what do you mean ensure? if user installs vdsm manually? if that is required vdsmd should start multipathd when it tries to start up. Proof that this is not a rhev-3.4.1 regression: https://bugzilla.redhat.com/show_bug.cgi?id=1079707#c7 That comment show same problem when rhev-3.4.0 was tested. Unfortunately the issue was ignored. Probably nobody tried to install on a clean image. Dan, can we solve this on systemd using a proper unit configuration as you suggest in https://bugzilla.redhat.com/show_bug.cgi?id=988995#c1 ? In my opinion, the proper fix is to avoid messing up with multipath.conf while Vdsm is running (a la http://gerrit.ovirt.org/#/c/26123/). We have an explicit Requires=multipathd.service, so the service should have been running when we call reload. I do not have another guess why reload failed. Nikolay, could you attach the output of `journalctl` and /var/log/messages, which may have hints on which service was running when? (In reply to Dan Kenigsberg from comment #13) > In my opinion, the proper fix is to avoid messing up with multipath.conf > while Vdsm is running (a la http://gerrit.ovirt.org/#/c/26123/). > > We have an explicit Requires=multipathd.service, so the service should have > been running when we call reload. I do not have another guess why reload > failed. I don't have logs from this case, but in the mailing list post in comment 9, we can see that the service was inactive: Feb 26 01:21:32 test-101022114 systemd[1]: Unit multipathd.service cannot be reloaded because it is inactive. Feb 26 01:21:32 test-101022114 python[2866]: detected unhandled Python exception in '/usr/bin/vdsm-tool' Feb 26 01:21:33 test-101022114 dbus-daemon[572]: dbus[572]: [system] Activating service name='org.freedesktop.problems' (using servicehelper) Feb 26 01:21:33 test-101022114 dbus[572]: [system] Activating service name='org.freedesktop.problems' (using servicehelper) Feb 26 01:21:33 test-101022114 dbus-daemon[572]: dbus[572]: [system] Successfully activated service 'org.freedesktop.problems' Feb 26 01:21:33 test-101022114 dbus[572]: [system] Successfully activated service 'org.freedesktop.problems' Feb 26 01:21:33 test-101022114 vdsm vds ERROR Error initializing IRS#012Traceback (most recent call last):#012 File "/usr/share/vdsm/clientIF.py", line 258, in _initIRS#012 self.irs = Dispatcher(HSM())#012 File "/usr/share/vdsm/storage/hsm.py", line 352, in __init__#012 multipath.setupMultipath()#012 File "/usr/share/vdsm/storage/multipath.py", line 182, in setupMultipath#012 raise se.MultipathReloadError()#012MultipathReloadError: Multipath service reload error: () Maybe systemd is starting the service but does not wait until the service is up? Yes, apparently you were right. systemd.unit(5): If a unit foo.service requires a unit bar.service as configured with Requires= and no ordering is configured with After= or Before=, then both units will be started simultaneously and without any delay between them if foo.service is activated. The issue here is not required services order. We can see during startup that multipathd is started before vdsm, however, during the first time vdsm is being run, we change the network configuration (unified_network_persistence_upgrade, restore_nets). These network configuration probably cause multipathd to fail and exit, so when the network changes are finished, and vdsm start to initialize, multipathd is not running. I tested this with master - once with master (see messages-before.log(, and once with the attached patch (see messages-after.log). In both of them we get the same results. It seems that vdsm init script cannot change netowrk configuration since vdsm required services depend on the the network. So the long term fix is to separate the network configuration from vdsm startup. For short term fix, we can check in setupMultipath if multipath is running, and if start it instead of reloading, but this check is racy. Antoni, can you look at this? Created attachment 920652 [details]
log from fresh fedora 20 installation using master from Jul 23
Created attachment 920653 [details]
log from fresh fedora 20 installation using attach patch
The attached patch is correct but does not fix this issue. This is caused by missing multipath.conf file on a fresh install on Fedora 20 and RHEL 7. When the file does not exists, multipathd service will fail to start silently, leading to failure to reload multipathd configuration in vdsm. [root@reserved-0-247 ~]# systemctl start multipathd [root@reserved-0-247 ~]# systemctl status multipathd multipathd.service - Device-Mapper Multipath Device Controller Loaded: loaded (/usr/lib/systemd/system/multipathd.service; enabled) Active: inactive (dead) since Fri 2014-07-25 13:55:48 IDT; 14min ago start condition failed at Fri 2014-07-25 14:10:04 IDT; 2s ago ConditionPathExists=/etc/multipath.conf was not met Process: 11908 ExecStart=/sbin/multipathd (code=exited, status=0/SUCCESS) The first run of vdsm, vdsm installs a new multipath.conf file but will fail to reload multipathd service (because it was started but exited silently) and exit. The second run of vdsm will succeed as multipathd will start normally using the new configuration installed by vdsm. Actually there is also no multipath.conf file in fresh install of device-mapper-multipath on RHEL 6.5, but multipathd does not exit, and can be reconfigured later. When running without multipath.conf, multipathd log this to syslog: Jul 25 15:25:28 dhcp-1-210 multipathd: /etc/multipath.conf does not exist, blacklisting all devices. Jul 25 15:25:28 dhcp-1-210 multipathd: A sample multipath.conf file is located at Jul 25 15:25:28 dhcp-1-210 multipathd: /usr/share/doc/device-mapper-multipath-0.4.9/multipath.conf Jul 25 15:25:28 dhcp-1-210 multipathd: You can run /sbin/mpathconf to create or modify /etc/multipath.conf Jul 25 15:25:28 dhcp-1-210 multipathd: path checkers start up So this bug is caused by change in behavior in multipathd, requiring /etc/multipath.conf as as start condition. verified on vt3.1 |
Created attachment 918397 [details] logs Description of problem: On latest RHEVM3.4.1-av10.1 clean RHEL7.0 host fails to be attached to RHEVM with error: "Failed to install Host alma04.qa.lab.tlv.redhat.com. Failed to execute stage 'Closing up': Command '/bin/systemctl' failed to execute." Error taken from engine's log: 2014-07-16 14:41:10,445 INFO [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (VdsDeploy) Correlation ID: 31c895f9, Call Stack: null, Custom Event ID: -1, Message: Failed to install Host alma04.qa.lab.tlv.redhat.com. Failed to execute stage 'Closing up': Command '/bin/systemctl' failed to execute. Error taken from both hosts around the same time frame: MainThread::ERROR::2014-07-16 14:14:11,196::utils::1107::root::(panic) Panic: Error initializing IRS Traceback (most recent call last): File "/usr/share/vdsm/vdsm", line 75, in serve_clients irs = Dispatcher(HSM()) File "/usr/share/vdsm/storage/hsm.py", line 360, in __init__ multipath.setupMultipath() File "/usr/share/vdsm/storage/multipath.py", line 179, in setupMultipath raise se.MultipathReloadError() MultipathReloadError: Multipath service reload error: () MainThread::ERROR::2014-07-16 14:41:14,125::utils::1107::root::(panic) Panic: Error initializing IRS Traceback (most recent call last): File "/usr/share/vdsm/vdsm", line 75, in serve_clients irs = Dispatcher(HSM()) File "/usr/share/vdsm/storage/hsm.py", line 360, in __init__ multipath.setupMultipath() File "/usr/share/vdsm/storage/multipath.py", line 179, in setupMultipath raise se.MultipathReloadError() MultipathReloadError: Multipath service reload error: () As work around it's possible to remove and add several times the same host, so finally it will be attached to the engine. Version-Release number of selected component (if applicable): rhevm-3.3.4-0.53.el6ev.noarch qemu-kvm-rhev-1.5.3-60.el7ev_0.2.x86_64 vdsm-4.14.7-7.el7ev.x86_64 qemu-kvm-tools-rhev-1.5.3-60.el7ev_0.2.x86_64 qemu-kvm-common-rhev-1.5.3-60.el7ev_0.2.x86_64 libvirt-1.1.1-29.el7.x86_64 sanlock-3.1.0-2.el7.x86_64 How reproducible: 100% Steps to Reproduce: 1.Establish RHEVM av10.1 somewhere. 2.Clean image install RHEL7.0 to the host and after installation edit repositories for RHEL7 accordingly. 3.Try adding the host to engine and receive errors. Actual results: Host not being added properly via engine from the first or even second time. Expected results: Host have to be added fast and without errors. Additional info: engine's, vdsm and super-vdsm logs attached.