Bug 1120209

Summary: Failed to install Host alma04.qa.lab.tlv.redhat.com. Failed to execute stage 'Closing up': Command '/bin/systemctl' failed to execute.
Product: Red Hat Enterprise Virtualization Manager
Reporter: Nikolai Sednev <nsednev>
Component: vdsm
Assignee: Nir Soffer <nsoffer>
Status: CLOSED CURRENTRELEASE
QA Contact: Ori Gofen <ogofen>
Severity: high
Docs Contact:
Priority: high
Version: 3.4.0
CC: acanan, alonbl, amureini, bazulay, danken, ecohen, gklein, iheim, lpeer, nlevinki, nsednev, oourfali, rbalakri, Rhev-m-bugs, scohen, tnisan, yeylon, ykaplan
Target Milestone: ---
Keywords: Triaged, ZStream
Target Release: 3.5.0
Hardware: x86_64
OS: Linux
Whiteboard: storage
Fixed In Version: v4.16.2
Doc Type: Bug Fix
Doc Text:
Story Points: ---
Clone Of:
: 1125178 (view as bug list)
Environment:
Last Closed: 2015-02-16 13:40:03 UTC
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: Storage
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Embargoed:
Bug Depends On:
Bug Blocks: 1125178, 1142923, 1156165
Attachments:
Description                                                      Flags
logs                                                             none
log from fresh fedora 20 installation using master from Jul 23   none
log from fresh fedora 20 installation using attached patch       none

Description Nikolai Sednev 2014-07-16 12:43:14 UTC
Created attachment 918397 [details]
logs

Description of problem:
On the latest RHEVM 3.4.1-av10.1, a clean RHEL 7.0 host fails to be attached to RHEVM with the error: "Failed to install Host alma04.qa.lab.tlv.redhat.com. Failed to execute stage 'Closing up': Command '/bin/systemctl' failed to execute."

Error taken from engine's log:
2014-07-16 14:41:10,445 INFO  [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (VdsDeploy) Correlation ID:
31c895f9, Call Stack: null, Custom Event ID: -1, Message: Failed to install Host alma04.qa.lab.tlv.redhat.com. Failed to execute
stage 'Closing up': Command '/bin/systemctl' failed to execute.

Error taken from both hosts' logs around the same time frame:
MainThread::ERROR::2014-07-16 14:14:11,196::utils::1107::root::(panic) Panic: Error initializing IRS
Traceback (most recent call last):
  File "/usr/share/vdsm/vdsm", line 75, in serve_clients
    irs = Dispatcher(HSM())
  File "/usr/share/vdsm/storage/hsm.py", line 360, in __init__
    multipath.setupMultipath()
  File "/usr/share/vdsm/storage/multipath.py", line 179, in setupMultipath
    raise se.MultipathReloadError()
MultipathReloadError: Multipath service reload error: ()


MainThread::ERROR::2014-07-16 14:41:14,125::utils::1107::root::(panic) Panic: Error initializing IRS
Traceback (most recent call last):
  File "/usr/share/vdsm/vdsm", line 75, in serve_clients
    irs = Dispatcher(HSM())
  File "/usr/share/vdsm/storage/hsm.py", line 360, in __init__
    multipath.setupMultipath()
  File "/usr/share/vdsm/storage/multipath.py", line 179, in setupMultipath
    raise se.MultipathReloadError()
MultipathReloadError: Multipath service reload error: ()
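
For context, the pattern behind this traceback is roughly the following; this is a simplified, illustrative sketch (the function and exception are stand-ins, not vdsm's actual storage/multipath.py):

    # Illustrative only -- not vdsm's actual implementation.
    import subprocess

    class MultipathReloadError(Exception):
        """Raised when multipathd cannot reload its configuration."""

    def setup_multipath():
        # ... write the vdsm-managed /etc/multipath.conf here ...
        # Ask the already-running multipathd daemon to pick up the new file;
        # if multipathd is not running, the reload fails and the error above is raised.
        if subprocess.call(["systemctl", "reload", "multipathd.service"]) != 0:
            raise MultipathReloadError()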



As a workaround, it is possible to remove and re-add the same host several times until it is finally attached to the engine.

Version-Release number of selected component (if applicable):
rhevm-3.3.4-0.53.el6ev.noarch
qemu-kvm-rhev-1.5.3-60.el7ev_0.2.x86_64
vdsm-4.14.7-7.el7ev.x86_64
qemu-kvm-tools-rhev-1.5.3-60.el7ev_0.2.x86_64
qemu-kvm-common-rhev-1.5.3-60.el7ev_0.2.x86_64
libvirt-1.1.1-29.el7.x86_64
sanlock-3.1.0-2.el7.x86_64


How reproducible:
100%

Steps to Reproduce:
1. Set up RHEVM av10.1 somewhere.
2. Do a clean-image install of RHEL 7.0 on the host and, after installation, configure the RHEL 7 repositories accordingly.
3. Try adding the host to the engine and receive the errors.

Actual results:
The host is not added properly via the engine on the first or even the second attempt.

Expected results:
The host should be added quickly and without errors.

Additional info:
The engine, vdsm, and supervdsm logs are attached.

Comment 8 Nir Soffer 2014-07-17 12:33:40 UTC
I don't think this is a regression; I have seen this error in user logs on the mailing list in the past.

Please prove that this is a regression by testing again with:

1. RHEL 6.5
2. RHEV 3.4.0

Please also specify exactly which RHEL 7.0 image you used and which repositories were used to install vdsm. I tried to add a RHEL 7.0 host to the engine yesterday, and the deployment failed when trying to install vdsm; I am surprised that you reached a state where you could run vdsm.

When you test, make sure you start with a clean image, the same image a customer would start with.

I am also lowering the severity; this does not look like an urgent issue, as it happens only the first time you deploy on a clean image.

Comment 9 Nir Soffer 2014-07-17 13:49:40 UTC
This looks like the same issue reported on the mailing list:
http://lists.ovirt.org/pipermail/users/2014-February/021904.html

Which seems to be caused by this change:
http://gerrit.ovirt.org/17487 fixing bug 988995

This change is available in rhev-3.3:

$ git log --grep I57583eb8931411d40bcce8f6a676621f6ee1eebd
commit 620343d6317c849fc985a5083cebb68c995f7c15
Author: Yeela Kaplan <ykaplan>
Date:   Tue Jul 30 16:27:07 2013 +0300

    multipath: Reload multipathd instead of restart on setup
    
    Multipathd is required by vdsmd.
    We should not take care of starting it, but only reloading
    new configuration.
    
    Change-Id: I57583eb8931411d40bcce8f6a676621f6ee1eebd
    Bug-Url: https://bugzilla.redhat.com/show_bug.cgi?id=988995
    Signed-off-by: Yeela Kaplan <ykaplan>
    Reviewed-on: http://gerrit.ovirt.org/17487
    Reviewed-by: Dan Kenigsberg <danken>

So this may be a regression in rhev-3.3, but not in 3.4.

I think that what happens here is this:

1. vdsm requires multipathd, so multipathd is installed.
2. When multipathd is installed, it is not started.
3. When vdsm changes the multipath configuration, it used to restart multipathd,
   but since rhev-3.3 it reloads it instead.
4. Because multipathd is not running, the reload fails, exposing the fact that
   multipathd is not running when vdsm is installed.
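
The difference in step 3 is the key point; a plain systemctl illustration (outside vdsm's service wrappers) of why reload fails where restart used to work:

    # restart starts the unit even when it is currently stopped
    systemctl restart multipathd.service

    # reload only signals an already-running daemon; when the unit is inactive
    # it fails, which is what vdsm's setupMultipath() runs into on a fresh host
    systemctl reload multipathd.service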

It seems that the right way to handle this is to ensure that multipathd
is running before installing vdsm.

Alon, how do you suggest to solve this deployment issue?

Comment 10 Alon Bar-Lev 2014-07-17 14:07:16 UTC
(In reply to Nir Soffer from comment #9)
> It seems that the right way to handle this is to ensure that multipathd
> is running before installing vdsm.
> 
> Alon, how do you suggest to solve this deployment issue?

What do you mean by "ensure"? What if the user installs vdsm manually? If that is required, vdsmd should start multipathd when it tries to start up.

Comment 11 Nir Soffer 2014-07-17 14:51:28 UTC
Proof that this is not a rhev-3.4.1 regression:
https://bugzilla.redhat.com/show_bug.cgi?id=1079707#c7

That comment shows the same problem when rhev-3.4.0 was tested. Unfortunately, the issue was ignored; probably nobody tried to install on a clean image.

Comment 12 Nir Soffer 2014-07-17 16:01:07 UTC
Dan, can we solve this on systemd with a proper unit configuration, as you suggest in https://bugzilla.redhat.com/show_bug.cgi?id=988995#c1 ?

Comment 13 Dan Kenigsberg 2014-07-18 14:21:08 UTC
In my opinion, the proper fix is to avoid messing with multipath.conf while Vdsm is running (a la http://gerrit.ovirt.org/#/c/26123/).

We have an explicit Requires=multipathd.service, so the service should have been running when we call reload. I do not have another guess why reload failed.

Nikolay, could you attach the output of `journalctl` and /var/log/messages, which may have hints on which service was running when?

Comment 14 Nir Soffer 2014-07-18 19:09:26 UTC
(In reply to Dan Kenigsberg from comment #13)
> In my opinion, the proper fix is to avoid messing up with multipath.conf
> while Vdsm is running (a la http://gerrit.ovirt.org/#/c/26123/).
> 
> We have an explicit Requires=multipathd.service, so the service should have
> been running when we call reload. I do not have another guess why reload
> failed.

I don't have logs from this case, but in the mailing list post in comment 9, we can see that the service was inactive:

Feb 26 01:21:32 test-101022114 systemd[1]: Unit multipathd.service cannot be reloaded because it is inactive.
Feb 26 01:21:32 test-101022114 python[2866]: detected unhandled Python exception in '/usr/bin/vdsm-tool'
Feb 26 01:21:33 test-101022114 dbus-daemon[572]: dbus[572]: [system] Activating service name='org.freedesktop.problems' (using servicehelper)
Feb 26 01:21:33 test-101022114 dbus[572]: [system] Activating service name='org.freedesktop.problems' (using servicehelper)
Feb 26 01:21:33 test-101022114 dbus-daemon[572]: dbus[572]: [system] Successfully activated service 'org.freedesktop.problems'
Feb 26 01:21:33 test-101022114 dbus[572]: [system] Successfully activated service 'org.freedesktop.problems'
Feb 26 01:21:33 test-101022114 vdsm vds ERROR Error initializing IRS#012Traceback (most recent call last):#012  File "/usr/share/vdsm/clientIF.py", line 258, in _initIRS#012    self.irs = Dispatcher(HSM())#012  File "/usr/share/vdsm/storage/hsm.py", line 352, in __init__#012    multipath.setupMultipath()#012  File "/usr/share/vdsm/storage/multipath.py", line 182, in setupMultipath#012    raise se.MultipathReloadError()#012MultipathReloadError: Multipath service reload error: ()

Maybe systemd is starting the service but does not wait until the service is up?

Comment 15 Dan Kenigsberg 2014-07-23 22:12:48 UTC
Yes, apparently you were right. Quoting systemd.unit(5):

  If a unit foo.service requires a unit bar.service as configured
  with Requires= and no ordering is configured with After= or Before=,
  then both units will be started simultaneously and without any
  delay between them if foo.service is activated.
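
In other words, Requires= alone does not impose ordering; that has to be requested explicitly. A minimal illustrative unit fragment (not the vdsmd.service actually shipped) would be:

    [Unit]
    # Requires= only pulls multipathd.service in when this unit is activated.
    Requires=multipathd.service
    # After= adds the ordering, so this unit is started only once
    # multipathd.service's start job has completed.
    After=multipathd.service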

Comment 16 Nir Soffer 2014-07-24 18:00:49 UTC
The issue here is not the ordering of required services. We can see during startup that multipathd
is started before vdsm; however, the first time vdsm is run, we change
the network configuration (unified_network_persistence_upgrade, restore_nets). These
network configuration changes probably cause multipathd to fail and exit, so when the network
changes are finished and vdsm starts to initialize, multipathd is not running.

I tested this with master twice: once with plain master (see messages-before.log), and once
with the attached patch (see messages-after.log). In both cases we get the same
results.

It seems that the vdsm init script cannot change the network configuration, since the services
vdsm requires depend on the network. So the long-term fix is to separate
the network configuration from vdsm startup.

As a short-term fix, we can check in setupMultipath whether multipathd is running,
and if it is not, start it instead of reloading it, but this check is racy (see the sketch below).
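
A minimal sketch of that short-term check, assuming plain systemctl calls; the helper is illustrative and not vdsm's actual service API:

    # Illustrative sketch of the proposed short-term fix -- not vdsm code.
    import subprocess

    def reload_or_start_multipathd():
        # Check whether multipathd is currently active.
        running = subprocess.call(
            ["systemctl", "is-active", "--quiet", "multipathd.service"]) == 0
        # Racy: multipathd may stop or start between this check and the command
        # below (for example while vdsm reconfigures the network).
        cmd = "reload" if running else "start"
        if subprocess.call(["systemctl", cmd, "multipathd.service"]) != 0:
            raise RuntimeError("multipathd %s failed" % cmd)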

Antoni, can you look at this?

Comment 17 Nir Soffer 2014-07-24 18:02:05 UTC
Created attachment 920652 [details]
log from fresh fedora 20 installation using master from Jul 23

Comment 18 Nir Soffer 2014-07-24 18:03:05 UTC
Created attachment 920653 [details]
log from fresh fedora 20 installation using attached patch

Comment 19 Nir Soffer 2014-07-24 18:08:47 UTC
The attached patch is correct but does not fix this issue.

Comment 20 Nir Soffer 2014-07-25 11:43:30 UTC
This is caused by the missing multipath.conf file on a fresh install of Fedora 20
or RHEL 7. When the file does not exist, the multipathd service silently fails to
start, leading to the failure to reload the multipathd configuration in vdsm.

    [root@reserved-0-247 ~]# systemctl start multipathd
    [root@reserved-0-247 ~]# systemctl status multipathd
    multipathd.service - Device-Mapper Multipath Device Controller
       Loaded: loaded (/usr/lib/systemd/system/multipathd.service; enabled)
       Active: inactive (dead) since Fri 2014-07-25 13:55:48 IDT; 14min ago
               start condition failed at Fri 2014-07-25 14:10:04 IDT; 2s ago
               ConditionPathExists=/etc/multipath.conf was not met
      Process: 11908 ExecStart=/sbin/multipathd (code=exited, status=0/SUCCESS)

On the first run of vdsm, vdsm installs a new multipath.conf file but fails to
reload the multipathd service (because it was started but exited silently) and exits.
The second run of vdsm succeeds, as multipathd starts normally using the
new configuration installed by vdsm.
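
One way to check the start condition by hand on such a host, and to satisfy it without waiting for vdsm's second run; a sketch assuming mpathconf (shipped with device-mapper-multipath) is available:

    # the unit's start condition on a fresh Fedora 20 / RHEL 7 install
    ls -l /etc/multipath.conf

    # assumption: mpathconf can create a default /etc/multipath.conf,
    # after which multipathd starts normally
    mpathconf --enable
    systemctl start multipathd
    systemctl status multipathd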

Comment 21 Nir Soffer 2014-07-25 12:33:10 UTC
Actually, there is also no multipath.conf file in a fresh install of device-mapper-multipath on
RHEL 6.5, but there multipathd does not exit and can be reconfigured later.

When running without multipath.conf, multipathd logs this to syslog:

Jul 25 15:25:28 dhcp-1-210 multipathd: /etc/multipath.conf does not exist, blacklisting all devices.
Jul 25 15:25:28 dhcp-1-210 multipathd: A sample multipath.conf file is located at
Jul 25 15:25:28 dhcp-1-210 multipathd: /usr/share/doc/device-mapper-multipath-0.4.9/multipath.conf
Jul 25 15:25:28 dhcp-1-210 multipathd: You can run /sbin/mpathconf to create or modify /etc/multipath.conf
Jul 25 15:25:28 dhcp-1-210 multipathd: path checkers start up

So this bug is caused by a change in behavior of multipathd, which now requires /etc/multipath.conf
as a start condition.

Comment 23 Ori Gofen 2014-09-15 12:55:36 UTC
verified on vt3.1