Bug 1120209 - Failed to install Host alma04.qa.lab.tlv.redhat.com. Failed to execute stage 'Closing up': Command '/bin/systemctl' failed to execute.
Summary: Failed to install Host alma04.qa.lab.tlv.redhat.com. Failed to execute stage ...
Keywords:
Status: CLOSED CURRENTRELEASE
Alias: None
Product: Red Hat Enterprise Virtualization Manager
Classification: Red Hat
Component: vdsm
Version: 3.4.0
Hardware: x86_64
OS: Linux
Priority: high
Severity: high
Target Milestone: ---
Target Release: 3.5.0
Assignee: Nir Soffer
QA Contact: Ori Gofen
URL:
Whiteboard: storage
Depends On:
Blocks: 1125178 rhev3.5beta 1156165
 
Reported: 2014-07-16 12:43 UTC by Nikolai Sednev
Modified: 2016-05-26 01:49 UTC (History)
CC: 18 users

Fixed In Version: v4.16.2
Doc Type: Bug Fix
Doc Text:
Clone Of:
Clones: 1125178 (view as bug list)
Environment:
Last Closed: 2015-02-16 13:40:03 UTC
oVirt Team: Storage
Target Upstream Version:
Embargoed:


Attachments
logs (214.83 KB, application/x-gzip)
2014-07-16 12:43 UTC, Nikolai Sednev
log from fresh fedora 20 installation using master from Jul 23 (332.19 KB, text/x-log)
2014-07-24 18:02 UTC, Nir Soffer
log from fresh fedora 20 installation using the attached patch (321.55 KB, text/x-log)
2014-07-24 18:03 UTC, Nir Soffer


Links
System ID Private Priority Status Summary Last Updated
oVirt gerrit 30704 0 master MERGED spec: Create default multipath.conf during install Never
oVirt gerrit 30982 0 ovirt-3.5 MERGED spec: Create default multipath.conf during install Never

Description Nikolai Sednev 2014-07-16 12:43:14 UTC
Created attachment 918397 [details]
logs

Description of problem:
On the latest RHEVM 3.4.1-av10.1, a clean RHEL 7.0 host fails to be attached to RHEVM with the error: "Failed to install Host alma04.qa.lab.tlv.redhat.com. Failed to execute stage 'Closing up': Command '/bin/systemctl' failed to execute."

Error taken from engine's log:
2014-07-16 14:41:10,445 INFO  [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (VdsDeploy) Correlation ID:
31c895f9, Call Stack: null, Custom Event ID: -1, Message: Failed to install Host alma04.qa.lab.tlv.redhat.com. Failed to execute
stage 'Closing up': Command '/bin/systemctl' failed to execute.

Error taken from both hosts around the same time frame:
MainThread::ERROR::2014-07-16 14:14:11,196::utils::1107::root::(panic) Panic: Error initializing IRS
Traceback (most recent call last):
  File "/usr/share/vdsm/vdsm", line 75, in serve_clients
    irs = Dispatcher(HSM())
  File "/usr/share/vdsm/storage/hsm.py", line 360, in __init__
    multipath.setupMultipath()
  File "/usr/share/vdsm/storage/multipath.py", line 179, in setupMultipath
    raise se.MultipathReloadError()
MultipathReloadError: Multipath service reload error: ()


MainThread::ERROR::2014-07-16 14:41:14,125::utils::1107::root::(panic) Panic: Error initializing IRS
Traceback (most recent call last):
  File "/usr/share/vdsm/vdsm", line 75, in serve_clients
    irs = Dispatcher(HSM())
  File "/usr/share/vdsm/storage/hsm.py", line 360, in __init__
    multipath.setupMultipath()
  File "/usr/share/vdsm/storage/multipath.py", line 179, in setupMultipath
    raise se.MultipathReloadError()
MultipathReloadError: Multipath service reload error: ()



As a workaround, it is possible to remove and re-add the same host several times until it is finally attached to the engine.

Version-Release number of selected component (if applicable):
rhevm-3.3.4-0.53.el6ev.noarch
qemu-kvm-rhev-1.5.3-60.el7ev_0.2.x86_64
vdsm-4.14.7-7.el7ev.x86_64
qemu-kvm-tools-rhev-1.5.3-60.el7ev_0.2.x86_64
qemu-kvm-common-rhev-1.5.3-60.el7ev_0.2.x86_64
libvirt-1.1.1-29.el7.x86_64
sanlock-3.1.0-2.el7.x86_64


How reproducible:
100%

Steps to Reproduce:
1. Set up RHEVM av10.1 somewhere.
2. Install RHEL 7.0 on the host from a clean image and, after installation, configure the RHEL 7 repositories accordingly.
3. Try adding the host to the engine and observe the errors.

Actual results:
The host is not added properly via the engine on the first or even the second attempt.

Expected results:
The host should be added quickly and without errors.

Additional info:
Engine, vdsm, and supervdsm logs are attached.

Comment 8 Nir Soffer 2014-07-17 12:33:40 UTC
I don't think this is a regression. I have seen this error in user logs on the mailing list in the past.

Please prove that this is a regression by testing again with:

1. RHEL 6.5
2. RHEV 3.4.0

Please also specify exactly which RHEL 7.0 image you used, and which repositories were used to install vdsm. I tried to add a RHEL 7.0 host to the engine yesterday, and the deployment failed when trying to install vdsm; I am surprised that you reached a state where you could run vdsm.

When you test, make sure you start with a clean image, the same image a customer would start with.

I am also lowering the severity; this does not look like an urgent issue, as it happens only the first time you deploy on a clean image.

Comment 9 Nir Soffer 2014-07-17 13:49:40 UTC
This looks like the issue reported on the mailing list:
http://lists.ovirt.org/pipermail/users/2014-February/021904.html

Which seems to be caused by this change:
http://gerrit.ovirt.org/17487 fixing bug 988995

This change is available in rhev-3.3:

$ git log --grep I57583eb8931411d40bcce8f6a676621f6ee1eebd
commit 620343d6317c849fc985a5083cebb68c995f7c15
Author: Yeela Kaplan <ykaplan>
Date:   Tue Jul 30 16:27:07 2013 +0300

    multipath: Reload multipathd instead of restart on setup
    
    Multipathd is required by vdsmd.
    We should not take care of starting it, but only reloading
    new configuration.
    
    Change-Id: I57583eb8931411d40bcce8f6a676621f6ee1eebd
    Bug-Url: https://bugzilla.redhat.com/show_bug.cgi?id=988995
    Signed-off-by: Yeela Kaplan <ykaplan>
    Reviewed-on: http://gerrit.ovirt.org/17487
    Reviewed-by: Dan Kenigsberg <danken>

So this may be a regression in rhev-3.3, but not in 3.4.

I think that what happens here is this:

1. vdsm requires multipathd, so multipathd is installed.
2. When multipathd is installed, it is not started.
3. When vdsm changes the multipath configuration, it used to restart multipathd,
   but since rhev-3.3 it reloads it instead.
4. Because multipathd is not running, the reload fails, exposing the fact that
   multipathd is not running when vdsm is installed (a rough sketch of this
   reload step follows the list).
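
For reference, the failing reload amounts to roughly the following. This is a minimal sketch reconstructed from the traceback in the description, not vdsm's actual implementation; the function and exception are simplified stand-ins for what storage/multipath.py does:

    # Minimal sketch of the failing reload step (not vdsm's actual code).
    import subprocess

    class MultipathReloadError(Exception):
        """Stand-in for vdsm's se.MultipathReloadError."""

    def setup_multipath():
        # vdsm has just written /etc/multipath.conf and asks the (supposedly
        # running) multipathd service to pick up the new configuration.
        rc = subprocess.call(["/bin/systemctl", "reload", "multipathd.service"])
        if rc != 0:
            # If multipathd is not running, systemd refuses the reload:
            # "Unit multipathd.service cannot be reloaded because it is inactive."
            raise MultipathReloadError()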

It seems that the right way to handle this is to ensure that multipathd
is running before installing vdsm.

Alon, how do you suggest we solve this deployment issue?

Comment 10 Alon Bar-Lev 2014-07-17 14:07:16 UTC
(In reply to Nir Soffer from comment #9)
> It seems that the right way to handle this is to ensure that multipathd
> is running before installing vdsm.
> 
> Alon, how do you suggest to solve this deployment issue?

What do you mean by "ensure"? What if a user installs vdsm manually? If multipathd is required, vdsmd should start it when it tries to start up.

Comment 11 Nir Soffer 2014-07-17 14:51:28 UTC
Proof that this is not a rhev-3.4.1 regression:
https://bugzilla.redhat.com/show_bug.cgi?id=1079707#c7

That comment shows the same problem when rhev-3.4.0 was tested. Unfortunately, the issue was ignored; probably nobody tried to install on a clean image.

Comment 12 Nir Soffer 2014-07-17 16:01:07 UTC
Dan, can we solve this on systemd with a proper unit configuration, as you suggest in https://bugzilla.redhat.com/show_bug.cgi?id=988995#c1 ?

Comment 13 Dan Kenigsberg 2014-07-18 14:21:08 UTC
In my opinion, the proper fix is to avoid messing with multipath.conf while Vdsm is running (a la http://gerrit.ovirt.org/#/c/26123/).

We have an explicit Requires=multipathd.service, so the service should have been running when we called reload. I have no other guess as to why the reload failed.

Nikolai, could you attach the output of `journalctl` and /var/log/messages, which may have hints about which service was running when?

Comment 14 Nir Soffer 2014-07-18 19:09:26 UTC
(In reply to Dan Kenigsberg from comment #13)
> In my opinion, the proper fix is to avoid messing up with multipath.conf
> while Vdsm is running (a la http://gerrit.ovirt.org/#/c/26123/).
> 
> We have an explicit Requires=multipathd.service, so the service should have
> been running when we call reload. I do not have another guess why reload
> failed.

I don't have logs from this case, but in the mailing list post in comment 9, we can see that the service was inactive:

Feb 26 01:21:32 test-101022114 systemd[1]: Unit multipathd.service cannot be reloaded because it is inactive.
Feb 26 01:21:32 test-101022114 python[2866]: detected unhandled Python exception in '/usr/bin/vdsm-tool'
Feb 26 01:21:33 test-101022114 dbus-daemon[572]: dbus[572]: [system] Activating service name='org.freedesktop.problems' (using servicehelper)
Feb 26 01:21:33 test-101022114 dbus[572]: [system] Activating service name='org.freedesktop.problems' (using servicehelper)
Feb 26 01:21:33 test-101022114 dbus-daemon[572]: dbus[572]: [system] Successfully activated service 'org.freedesktop.problems'
Feb 26 01:21:33 test-101022114 dbus[572]: [system] Successfully activated service 'org.freedesktop.problems'
Feb 26 01:21:33 test-101022114 vdsm vds ERROR Error initializing IRS#012Traceback (most recent call last):#012  File "/usr/share/vdsm/clientIF.py", line 258, in _initIRS#012    self.irs = Dispatcher(HSM())#012  File "/usr/share/vdsm/storage/hsm.py", line 352, in __init__#012    multipath.setupMultipath()#012  File "/usr/share/vdsm/storage/multipath.py", line 182, in setupMultipath#012    raise se.MultipathReloadError()#012MultipathReloadError: Multipath service reload error: ()

Maybe systemd is starting the service but does not wait until the service is up?

Comment 15 Dan Kenigsberg 2014-07-23 22:12:48 UTC
Yes, apparently you were right. From systemd.unit(5):

  If a unit foo.service requires a unit bar.service as configured
  with Requires= and no ordering is configured with After= or Before=,
  then both units will be started simultaneously and without any
  delay between them if foo.service is activated.

Comment 16 Nir Soffer 2014-07-24 18:00:49 UTC
The issue here is not the ordering of required services. We can see during startup that multipathd
is started before vdsm; however, the first time vdsm is run, we change the network
configuration (unified_network_persistence_upgrade, restore_nets). These network
configuration changes probably cause multipathd to fail and exit, so when the network
changes are finished and vdsm starts to initialize, multipathd is not running.

I tested this twice - once with plain master (see messages-before.log), and once
with the attached patch (see messages-after.log). In both cases we get the same
results.

It seems that the vdsm init script cannot change the network configuration, since vdsm's
required services depend on the network. So the long-term fix is to separate
the network configuration from vdsm startup.

As a short-term fix, we can check in setupMultipath whether multipathd is running and,
if not, start it instead of reloading, but this check is racy (a sketch follows).
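
Something along these lines; a hedged sketch only, not a patch, and the window between the check and the action is exactly the race mentioned above:

    # Sketch of the suggested short-term workaround: start multipathd if it
    # is not running, otherwise reload it. Racy: the service can change
    # state between the "is-active" check and the start/reload.
    import subprocess

    def reload_or_start_multipathd():
        active = subprocess.call(
            ["/bin/systemctl", "is-active", "--quiet", "multipathd.service"]) == 0
        action = "reload" if active else "start"
        subprocess.check_call(["/bin/systemctl", action, "multipathd.service"])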

Antoni, can you look at this?

Comment 17 Nir Soffer 2014-07-24 18:02:05 UTC
Created attachment 920652 [details]
log from fresh fedora 20 installation using master from Jul 23

Comment 18 Nir Soffer 2014-07-24 18:03:05 UTC
Created attachment 920653 [details]
log from fresh fedora 20 installation using the attached patch

Comment 19 Nir Soffer 2014-07-24 18:08:47 UTC
The attached patch is correct but does not fix this issue.

Comment 20 Nir Soffer 2014-07-25 11:43:30 UTC
This is caused by a missing multipath.conf file on a fresh install of Fedora 20
or RHEL 7. When the file does not exist, the multipathd service silently fails to
start, which leads to the failure to reload the multipathd configuration in vdsm.

    [root@reserved-0-247 ~]# systemctl start multipathd
    [root@reserved-0-247 ~]# systemctl status multipathd
    multipathd.service - Device-Mapper Multipath Device Controller
       Loaded: loaded (/usr/lib/systemd/system/multipathd.service; enabled)
       Active: inactive (dead) since Fri 2014-07-25 13:55:48 IDT; 14min ago
               start condition failed at Fri 2014-07-25 14:10:04 IDT; 2s ago
               ConditionPathExists=/etc/multipath.conf was not met
      Process: 11908 ExecStart=/sbin/multipathd (code=exited, status=0/SUCCESS)

On the first run of vdsm, vdsm installs a new multipath.conf file but fails to
reload the multipathd service (because it was started but exited silently) and exits.
The second run of vdsm succeeds, as multipathd starts normally using the
new configuration installed by vdsm.
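
The merged fix (oVirt gerrit 30704/30982, "spec: Create default multipath.conf during install") addresses this at package install time. Purely as an illustration of the idea, not the actual packaging change, the following sketch ensures /etc/multipath.conf exists before multipathd is started; the file content is a placeholder, not the configuration vdsm ships:

    # Illustrative sketch only: guarantee /etc/multipath.conf exists so that
    # multipathd's ConditionPathExists check passes and the service stays up.
    # The real fix creates the default file from the vdsm RPM spec at install
    # time; the content below is a placeholder.
    import os
    import subprocess

    PLACEHOLDER_CONF = "defaults {\n    user_friendly_names no\n}\n"

    def ensure_multipathd_running(path="/etc/multipath.conf"):
        if not os.path.exists(path):
            with open(path, "w") as f:
                f.write(PLACEHOLDER_CONF)
        # With a configuration file in place, starting multipathd succeeds,
        # and a later "systemctl reload" no longer fails as inactive.
        subprocess.check_call(["/bin/systemctl", "start", "multipathd.service"])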

Comment 21 Nir Soffer 2014-07-25 12:33:10 UTC
Actually, there is also no multipath.conf file in a fresh install of device-mapper-multipath on
RHEL 6.5, but there multipathd does not exit and can be reconfigured later.

When running without multipath.conf, multipathd logs this to syslog:

Jul 25 15:25:28 dhcp-1-210 multipathd: /etc/multipath.conf does not exist, blacklisting all devices.
Jul 25 15:25:28 dhcp-1-210 multipathd: A sample multipath.conf file is located at
Jul 25 15:25:28 dhcp-1-210 multipathd: /usr/share/doc/device-mapper-multipath-0.4.9/multipath.conf
Jul 25 15:25:28 dhcp-1-210 multipathd: You can run /sbin/mpathconf to create or modify /etc/multipath.conf
Jul 25 15:25:28 dhcp-1-210 multipathd: path checkers start up

So this bug is caused by a change of behavior in multipathd, which now requires /etc/multipath.conf
as a start condition.

Comment 23 Ori Gofen 2014-09-15 12:55:36 UTC
Verified on vt3.1.

