Bug 1576479

Summary: Fails to start ovirt-imageio-daemon service on upgrade of host
Product: [oVirt] ovirt-imageio Reporter: Sahina Bose <sabose>
Component: Daemon    Assignee: Daniel Erez <derez>
Status: CLOSED DUPLICATE QA Contact: Elad <ebenahar>
Severity: high Docs Contact:
Priority: high    
Version: 1.2.2    CC: amureini, bugs, danken, derez, mflanner, nsoffer, sabose, sasundar, tnisan, ylavi
Target Milestone: ovirt-4.3.0    Keywords: Reopened
Target Release: ---    Flags: rule-engine: ovirt-4.3+
Hardware: Unspecified   
OS: Unspecified   
Whiteboard:
Fixed In Version: Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of:
: 1630788 (view as bug list) Environment:
Last Closed: 2018-10-23 08:33:07 UTC Type: Bug
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: Storage    RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Embargoed:
Bug Depends On:    
Bug Blocks: 1630788    
Attachments:
  vdsm.log (flags: none)
  daemon.log (flags: none)
  daemon log -2 (flags: none)

Description Sahina Bose 2018-05-09 14:50:35 UTC
Description of problem:

While upgrading a host from ovirt-4.2.2 to ovirt-4.2.3, the upgrade fails because the ovirt-imageio-daemon service fails to start.

Error from daemon.log
2018-05-09 20:04:25,812 INFO    (MainThread) [server] Starting (pid=30963, version=1.3.1.2)
2018-05-09 20:04:25,815 ERROR   (MainThread) [server] Service failed (image_server=<ovirt_imageio_daemon.server.ThreadedWSGIServer instance at 0x7ff2b952e998>, ticket_server=None, running=True)
Traceback (most recent call last):
  File "/usr/lib/python2.7/site-packages/ovirt_imageio_daemon/server.py", line 60, in main
    start(config)
  File "/usr/lib/python2.7/site-packages/ovirt_imageio_daemon/server.py", line 101, in start
    UnixWSGIRequestHandler)
  File "/usr/lib64/python2.7/SocketServer.py", line 419, in __init__
    self.server_bind()
  File "/usr/lib/python2.7/site-packages/ovirt_imageio_daemon/uhttp.py", line 72, in server_bind
    self.socket.bind(self.server_address)
  File "/usr/lib64/python2.7/socket.py", line 224, in meth
    return getattr(self._sock,name)(*args)
error: [Errno 2] No such file or directory


Version-Release number of selected component (if applicable):


How reproducible:
Encountered this on 1 of 3 hosts.

Steps to Reproduce:
Upgrade the host from the engine UI.

Comment 1 Daniel Erez 2018-05-10 11:09:10 UTC
Hi Sahina,

* Did the host upgrade fail due to this failure?
* Can you try to restart the service manually to see if it mitigates the issue?
* Please also attach relevant logs: /var/log/messages, host deploy logs, vdsm and daemon.

Comment 2 Sahina Bose 2018-05-10 11:19:20 UTC
(In reply to Daniel Erez from comment #1)
> Hi Sahina,
> 
> * Did the host upgrade fail due to this failure?

Yes, and the host is now non-operational. Re-installing the host also fails at the same point, i.e.:
An error has occurred during installation of Host rhsdev-grafton2.lab.eng.blr.redhat.com: Failed to execute stage 'Closing up': Failed to start service 'ovirt-imageio-daemon'.

> * Can you try to restart the service manually to see if it mitigates the
> issue.

Tried restarting the service and also rebooting the node - same issue.

> * Please also attach relevant logs: /var/log/messages, host deploy logs,
> vdsm and daemon.

Comment 3 Sahina Bose 2018-05-10 11:28:01 UTC
Created attachment 1434322 [details]
vdsm.log

Comment 4 Sahina Bose 2018-05-10 11:28:47 UTC
Created attachment 1434323 [details]
daemon.log

Comment 5 Sahina Bose 2018-05-10 11:46:43 UTC
I think the issue was with storage. Once the storage issue was resolved, I see that the daemon is starting.

Closing this bug.

Comment 6 Nir Soffer 2018-05-10 11:50:15 UTC
Sahina, just to make sure we are not missing a real bug related to this, can you
attach /var/log/messages from that host?

Comment 8 Nir Soffer 2018-05-18 15:02:15 UTC
Thanks Sahina. Looking at the log, it seems ovirt-imageio-daemon is started very
early - not by vdsm. This is wrong - the ovirt-imageio-daemon service
should be disabled; it is started by vdsm.

May  4 15:14:26 rhsdev-grafton2 kernel: uds: modprobe: loaded version 6.1.0.41
May  4 15:14:26 rhsdev-grafton2 kernel: kvdo: modprobe: loaded version 6.1.0.153
May  4 15:14:26 rhsdev-grafton2 systemd: Started firewalld - dynamic firewall daemon.
May  4 15:14:26 rhsdev-grafton2 systemd: Reached target Network (Pre).
May  4 15:14:26 rhsdev-grafton2 systemd: Starting Network (Pre).
May  4 15:14:26 rhsdev-grafton2 systemd: Dependency failed for Network Manager Wait Online.
May  4 15:14:26 rhsdev-grafton2 systemd: Job NetworkManager-wait-online.service/start failed with result 'dependency'.
May  4 15:14:26 rhsdev-grafton2 systemd: Starting Open vSwitch Database Unit...
May  4 15:14:26 rhsdev-grafton2 kernel: kvdo0:dmsetup: starting device 'vdo_sdd' device instantiation 0 write policy auto
May  4 15:14:26 rhsdev-grafton2 kernel: kvdo0:dmsetup: underlying device, REQ_FLUSH: not supported, REQ_FUA: not supported
May  4 15:14:26 rhsdev-grafton2 kernel: kvdo0:dmsetup: Using mode sync automatically.
May  4 15:14:26 rhsdev-grafton2 kernel: kvdo0:dmsetup: zones: 1 logical, 1 physical, 1 hash; base threads: 5
May  4 15:14:26 rhsdev-grafton2 systemd: Started NTP client/server.
May  4 15:14:26 rhsdev-grafton2 kernel: ip6_tables: (C) 2000-2006 Netfilter Core Team
May  4 15:14:26 rhsdev-grafton2 kernel: Ebtables v2.0 registered
May  4 15:14:26 rhsdev-grafton2 kernel: nf_conntrack version 0.5.0 (65536 buckets, 262144 max)
May  4 15:14:26 rhsdev-grafton2 systemd: ovirt-imageio-daemon.service: main process exited, code=exited, status=1/FAILURE
May  4 15:14:26 rhsdev-grafton2 systemd: Failed to start oVirt ImageIO Daemon.
May  4 15:14:26 rhsdev-grafton2 systemd: Unit ovirt-imageio-daemon.service entered failed state.
May  4 15:14:26 rhsdev-grafton2 systemd: ovirt-imageio-daemon.service failed.
May  4 15:14:27 rhsdev-grafton2 kernel: Netfilter messages via NETLINK v0.30.
May  4 15:14:27 rhsdev-grafton2 kernel: ip_set: protocol 6
May  4 15:14:27 rhsdev-grafton2 kernel: XFS (dm-2): Ending clean mount
May  4 15:14:27 rhsdev-grafton2 systemd: Mounted /home.
May  4 15:14:27 rhsdev-grafton2 systemd: ovirt-imageio-daemon.service holdoff time over, scheduling restart.
...


Later we see that vdsm-tool fails with:

    No such file or directory: '/var/run/vdsm/bonding-defaults.json'

May  4 15:14:52 rhsdev-grafton2 vdsm-tool: Traceback (most recent call last):
May  4 15:14:52 rhsdev-grafton2 vdsm-tool: File "/usr/bin/vdsm-tool", line 219, in main
May  4 15:14:52 rhsdev-grafton2 vdsm-tool: return tool_command[cmd]["command"](*args)
May  4 15:14:52 rhsdev-grafton2 vdsm-tool: File "/usr/lib/python2.7/site-packages/vdsm/tool/network.py", line 94, in dump_bonding_options
May  4 15:14:52 rhsdev-grafton2 vdsm-tool: sysfs_options_mapper.dump_bonding_options()
May  4 15:14:52 rhsdev-grafton2 vdsm-tool: File "/usr/lib/python2.7/site-packages/vdsm/network/link/bond/sysfs_options_mapper.py", line 46, in dump_bonding_options
May  4 15:14:52 rhsdev-grafton2 vdsm-tool: with open(sysfs_options.BONDING_DEFAULTS, 'w') as f:
May  4 15:14:52 rhsdev-grafton2 vdsm-tool: IOError: [Errno 2] No such file or directory: '/var/run/vdsm/bonding-defaults.json'

So this looks like ovirt-imageio-daemon was started incorrectly,
before /run/vdsm/ was created.
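
For reference, both failures above are consistent with /run/vdsm not existing
at that point. A minimal check that could be run on a test host (illustrative
commands, not output from the affected host):

    # expect ENOENT if the directory is missing:
    ls -ld /run/vdsm

    # on a host where the directory is missing, binding a unix socket under it
    # fails exactly like the daemon traceback above (errno 2, "No such file or directory"):
    python -c 'import socket; socket.socket(socket.AF_UNIX).bind("/run/vdsm/test.sock")'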

Please check that ovirt-imageio-daemon service is not enabled in this setup.
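
For example (illustrative commands; "disabled" is the expected state):

    # anything other than "disabled" means the unit was enabled locally:
    systemctl is-enabled ovirt-imageio-daemon.service

    # list what pulls the unit in (reverse dependencies):
    systemctl list-dependencies --reverse ovirt-imageio-daemon.service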

Let's reopen the bug when we know how to reproduce it.

Comment 9 Mike 2018-08-08 11:38:16 UTC
I seem to be running into this as well.  I have a 4-node RHHI system that was installed with a layered install (not with the RHHI or RHV iso).  I am also using VDO for the partitions that make up the Gluster bricks.  When I do an upgrade, 1 or more of the 4 nodes will become non-responsive.  Rebooting the node from the command line allows it to become active again.  Unfortunately, it's not always the same node(s) that become non-responsive.  Here are the steps I go through to upgrade (all done through the ovirt GUI):

1. evacuate all vms from the node to be upgraded
2. put the node into maintenance mode making sure to check the box to stop the gluster services but not checking the box about ignoring gluster quorum
3. upgrade the node and check the box to reboot the node after upgrade
4. wait until the reboot completes and the node will eventually go back into maintenance (from upgrade)
5. activate the node.  If the problem occurs, activating will immediately fail and the node will be marked non-responsive. 

remedy: 
Reboot the non-responsive node and it will come back into the cluster as active all by itself.

Mike

Comment 10 Sahina Bose 2018-08-08 11:53:36 UTC
Re-opening as this bug is reproducible intermittently. I saw this again in one of the dev setups as well on rebooting a node. Attaching the /var/log/messages from the node

Comment 12 Daniel Erez 2018-08-08 12:05:18 UTC
(In reply to Sahina Bose from comment #10)
> Re-opening as this bug is reproducible intermittently. I saw this again in
> one of the dev setups as well on rebooting a node. Attaching the
> /var/log/messages from the node

Can you please also attach daemon and vdsm logs? Which version of imageio-daemon was used?

Comment 13 Sahina Bose 2018-08-08 13:45:03 UTC
Created attachment 1474351 [details]
daemon log -2

There was no vdsm.log generated.

Comment 14 Sahina Bose 2018-08-09 07:24:59 UTC
(In reply to Daniel Erez from comment #12)
> (In reply to Sahina Bose from comment #10)
> > Re-opening as this bug is reproducible intermittently. I saw this again in
> > one of the dev setups as well on rebooting a node. Attaching the
> > /var/log/messages from the node
> 
> Can you please also attach daemon and vdsm logs? Which version of
> imageio-daemon was used?

ovirt-imageio-daemon-1.4.1-0.el7ev.noarch
ovirt-imageio-common-1.4.1-0.el7ev.noarch

Comment 15 Daniel Erez 2018-08-09 13:45:40 UTC
This depends on the RFE to register the port of the daemon at IANA (i.e. avoid port conflict on starting the daemon) - bug 1528972

Comment 16 Sahina Bose 2018-08-09 14:35:29 UTC
(In reply to Daniel Erez from comment #15)
> This depends on the RFE to register the port of the daemon at IANA (i.e.
> avoid port conflict on starting the daemon) - bug 1528972

Is there a workaround? 

On the host where I encountered this, ports 54323 & 54322 are not taken by any other process. 
[root@rhsdev-grafton3-nic2 ~]# lsof -i :54323
[root@rhsdev-grafton3-nic2 ~]# lsof -i :54322

Which ports are conflicting?

Comment 17 Nir Soffer 2018-08-09 15:01:26 UTC
(In reply to Mike from comment #9)
Mike I don't see any connection between your issue and this bug. Please file a new
RHHI bug for this.

In the new bug please attach engine and host logs.

Comment 18 Nir Soffer 2018-08-09 15:03:53 UTC
(In reply to Sahina Bose from comment #10)
> Re-opening as this bug is reproducible intermittently. I saw this again in
> one of the dev setups as well on rebooting a node. Attaching the
> /var/log/messages from the node

Sahina, I don't see how Mike's report is related to this bug. This bug is caused
by running ovirt-imageio-daemon before vdsm was installed, so /run/vdsm/ is 
missing. This is not an ovirt-imageio bug, but probably an RHHI bug.

Please close this bug, or move it to RHHI.

Comment 19 Nir Soffer 2018-08-09 15:09:14 UTC
(In reply to Sahina Bose from comment #16)
> (In reply to Daniel Erez from comment #15)
> > This depends on the RFE to register the port of the daemon at IANA (i.e.
> > avoid port conflict on starting the daemon) - bug 1528972
> 
> Is there a workaround? 
> 
> In the host, where I encountered this the ports 54323 & 54322 are not taken
> by any other process. 
> [root@rhsdev-grafton3-nic2 ~]# lsof -i :54323
> [root@rhsdev-grafton3-nic2 ~]# lsof -i :54322
> 
> Which ports are conflicting?

We have a rare issue where the imageio daemon port is already in use when the daemon is starting, but
it is not related to this bug.

I'm removing the dependency on bug 1528972.

Comment 20 Sahina Bose 2018-08-10 04:59:13 UTC
(In reply to Nir Soffer from comment #18)
> (In reply to Sahina Bose from comment #10)
> > Re-opening as this bug is reproducible intermittently. I saw this again in
> > one of the dev setups as well on rebooting a node. Attaching the
> > /var/log/messages from the node
> 
> Sahina, I don't see how Mike report is related to this bug. This bug is
> caused
> by running ovirt-imageio-daemon before vdsm was installed, so /run/vdsm/ is 
> missing. This is not ovirt-imageio bug, but probably RHHI bug.

From Mike's email (http://post-office.corp.redhat.com/archives/rhev-tech/2018-August/msg00075.html), the issue looked very similar:

Aug 07 10:12:28 rhv3.cluster.net vdsm-tool[8610]: Traceback (most recent call last):
Aug 07 10:12:28 rhv3.cluster.net vdsm-tool[8610]: File "/usr/bin/vdsm-tool", line 219, in main
Aug 07 10:12:28 rhv3.cluster.net vdsm-tool[8610]: return tool_command[cmd]["command"](*args)
Aug 07 10:12:28 rhv3.cluster.net vdsm-tool[8610]: File "/usr/lib/python2.7/site-packages/vdsm/tool/network.py", line 94, in dump_bonding_options
Aug 07 10:12:28 rhv3.cluster.net vdsm-tool[8610]: sysfs_options_mapper.dump_bonding_options()
Aug 07 10:12:28 rhv3.cluster.net vdsm-tool[8610]: File "/usr/lib/python2.7/site-packages/vdsm/network/link/bond/sysfs_options_mapper.py", line 46, in dump_bonding_options
Aug 07 10:12:28 rhv3.cluster.net vdsm-tool[8610]: with open(sysfs_options.BONDING_DEFAULTS, 'w') as f:
Aug 07 10:12:28 rhv3.cluster.net vdsm-tool[8610]: IOError: [Errno 2] No such file or directory: '/var/run/vdsm/bonding-defaults.json'
Aug 07 10:12:28 rhv3.cluster.net systemd[1]: ovirt-ha-agent.service: main process exited, code=exited, status=157/n/a
> 
> Please close this bug, or move it to RHHI.

RHHI is a layered product install and does not have different logic for starting services than what's available in RHV.

When you say ovirt-imageio-daemon was started before vdsm, isn't there a dependency created in this service on vdsm?

Here's what I tried:
[root@rhsdev-grafton3-nic2 ~]# systemctl stop ovirt-imageio-daemon
[root@rhsdev-grafton3-nic2 ~]# vdsm-tool configure --force

Checking configuration status...

abrt is already configured for vdsm
lvm is configured for vdsm
libvirt is already configured for vdsm
SUCCESS: ssl configured to true. No conflicts
Manual override for multipath.conf detected - preserving current configuration
This manual override for multipath.conf was based on downrevved template. You are strongly advised to contact your support representatives

Running configure...
Reconfiguration of abrt is done.
Reconfiguration of passwd is done.
Reconfiguration of libvirt is done.
Traceback (most recent call last):
  File "/usr/bin/vdsm-tool", line 219, in main
    return tool_command[cmd]["command"](*args)
  File "/usr/lib/python2.7/site-packages/vdsm/tool/__init__.py", line 38, in wrapper
    func(*args, **kwargs)
  File "/usr/lib/python2.7/site-packages/vdsm/tool/configurator.py", line 141, in configure
    _configure(c)
  File "/usr/lib/python2.7/site-packages/vdsm/tool/configurator.py", line 88, in _configure
    getattr(module, 'configure', lambda: None)()
  File "/usr/lib/python2.7/site-packages/vdsm/tool/configurators/bond_defaults.py", line 37, in configure
    sysfs_options_mapper.dump_bonding_options()
  File "/usr/lib/python2.7/site-packages/vdsm/network/link/bond/sysfs_options_mapper.py", line 46, in dump_bonding_options
    with open(sysfs_options.BONDING_DEFAULTS, 'w') as f:
IOError: [Errno 2] No such file or directory: '/var/run/vdsm/bonding-defaults.json'
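
For what it's worth, a hypothetical way to confirm the missing-directory theory
on an affected host (the ownership/mode below are assumptions; this is only a
diagnostic step, not a supported fix):

    # recreate the runtime directory with the assumed vdsm ownership, then retry:
    install -d -o vdsm -g kvm -m 0755 /var/run/vdsm
    vdsm-tool configure --force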

Comment 21 Nir Soffer 2018-08-10 22:41:10 UTC
(In reply to Sahina Bose from comment #20)
> (In reply to Nir Soffer from comment #18)
> > (In reply to Sahina Bose from comment #10)
> > > Re-opening as this bug is reproducible intermittently. I saw this again in
> > > one of the dev setups as well on rebooting a node. Attaching the
> > > /var/log/messages from the node
> > 
> > Sahina, I don't see how Mike report is related to this bug. This bug is
> > caused
> > by running ovirt-imageio-daemon before vdsm was installed, so /run/vdsm/ is 
> > missing. This is not ovirt-imageio bug, but probably RHHI bug.
> 
> From Mike's email
> (http://post-office.corp.redhat.com/archives/rhev-tech/2018-August/msg00075.
> html), the issue looked very similar:
...
> Aug 07 10:12:28 rhv3.cluster.net vdsm-tool[8610]: with
> open(sysfs_options.BONDING_DEFAULTS, 'w') as f:
> Aug 07 10:12:28 rhv3.cluster.net vdsm-tool[8610]: IOError: [Errno 2] No such
> file or directory: '/var/run/vdsm/bonding-defaults.json'

Yes, it shows the same error we have seen in the logs you added to this bug.

But no, it does not mean his issue is related to the failure to start
ovirt-imageio-daemon, or to the ovirt-imageio project.

In comment 8 I wrote:

    So this looks like ovirt-imageio-daemon was started incorrectly,
    before /run/vdsm/ was created.

This is not an error in ovirt-imageio; it is a deployment issue. Maybe the imageio
service was enabled by mistake during deployment?

> Aug 07 10:12:28 rhv3.cluster.net systemd[1]: ovirt-ha-agent.service: main
> process exited, code=exited, status=157/n/a
> > 
> > Please close this bug, or move it to RHHI.
> 
> RHHI is a layered product install, and does not have different logic of
> starting services than what's available in RHV.

If the imageio daemon service is started before vdsm was installed and /run/vdsm
was created, it is not an RHV issue. We do not enable the imageio daemon and we start
it only from vdsm, which means /run/vdsm always exists when the imageio daemon tries
to bind to the unix socket at /run/vdsm/ovirt-imageio-daemon.sock.

> When you say ovirt-image-io-daemon was started before vdsm, isn't there a
> dependency created in this service on vdsm?

There is a dependency in vdsm service:

[Unit]                                                                                                                                                                                        
Description=Virtual Desktop Server Manager
Requires=multipathd.service time-sync.target \
         iscsid.service rpcbind.service supervdsmd.service sanlock.service \
         vdsm-network.service
After=multipathd.service iscsid.service rpcbind.service supervdsmd.service \
      sanlock.service vdsm-network.service
Before=libvirt-guests.service
Wants=mom-vdsm.service ovirt-imageio-daemon.service abrtd.service \
      dev-hugepages1G.mount libvirt-guests.service
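
On an installed host, the same relationship can be inspected with systemctl
(assuming the unit is installed as vdsmd.service):

    # Wants= pulls ovirt-imageio-daemon in when vdsm starts, but it is not an
    # ordering dependency, so the imageio unit must not be enabled on its own:
    systemctl show -p Wants vdsmd.service
    systemctl list-dependencies --reverse ovirt-imageio-daemon.service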
 
> Here's what I tried:
> [root@rhsdev-grafton3-nic2 ~]# systemctl stop ovirt-imageio-daemon
> [root@rhsdev-grafton3-nic2 ~]# vdsm-tool configure --force
> 
> Checking configuration status...
> 
> abrt is already configured for vdsm
> lvm is configured for vdsm
> libvirt is already configured for vdsm
> SUCCESS: ssl configured to true. No conflicts
> Manual override for multipath.conf detected - preserving current
> configuration
> This manual override for multipath.conf was based on downrevved template.
> You are strongly advised to contact your support representatives
> 
> Running configure...
> Reconfiguration of abrt is done.
> Reconfiguration of passwd is done.
> Reconfiguration of libvirt is done.
> Traceback (most recent call last):
>   File "/usr/bin/vdsm-tool", line 219, in main
>     return tool_command[cmd]["command"](*args)
>   File "/usr/lib/python2.7/site-packages/vdsm/tool/__init__.py", line 38, in
> wrapper
>     func(*args, **kwargs)
>   File "/usr/lib/python2.7/site-packages/vdsm/tool/configurator.py", line
> 141, in configure
>     _configure(c)
>   File "/usr/lib/python2.7/site-packages/vdsm/tool/configurator.py", line
> 88, in _configure
>     getattr(module, 'configure', lambda: None)()
>   File
> "/usr/lib/python2.7/site-packages/vdsm/tool/configurators/bond_defaults.py",
> line 37, in configure
>     sysfs_options_mapper.dump_bonding_options()
>   File
> "/usr/lib/python2.7/site-packages/vdsm/network/link/bond/
> sysfs_options_mapper.py", line 46, in dump_bonding_options
>     with open(sysfs_options.BONDING_DEFAULTS, 'w') as f:
> IOError: [Errno 2] No such file or directory:
> '/var/run/vdsm/bonding-defaults.json'

This error is not related to ovirt-imageio. If this is what breaks the upgrade,
please file a vdsm/network bug.

Comment 22 Tal Nisan 2018-08-14 10:13:48 UTC
Dan, can someone from network please have a look? (see comment #21)

Comment 23 SATHEESARAN 2018-09-19 09:21:35 UTC
With RHHI testing ( RHV 4.2.6 & RHV 4.2.7 ), post reboot sometimes, I found that the /var/run/vdsm directory was missing. I am using RHVH for all my testing.

Anything that I could help here with logs or setup ?

Comment 24 SATHEESARAN 2018-09-24 06:34:58 UTC
(In reply to SATHEESARAN from comment #23)
> With RHHI testing ( RHV 4.2.6 & RHV 4.2.7 ), post reboot sometimes, I found
> that the /var/run/vdsm directory was missing. I am using RHVH for all my
> testing.
> 
> Anything that I could help here with logs or setup ?

This issue should be raised as a blocker for RHV 4.2.7, as it leads to a non-responsive host in the RHV Manager UI.

Comment 25 Sahina Bose 2018-09-24 07:09:40 UTC
Dan, can you provide your inputs?

Comment 26 Dan Kenigsberg 2018-09-24 22:17:03 UTC
My input is to echo Nir's request: this should be a separate bug. Not filing a fresh bug only wastes the time of new people looking into this one.

(In reply to Nir Soffer from comment #17)
> (In reply to Mike from comment #9)
> Mike I don't see any connection between your issue and this bug. Please file
> a new
> RHHI bug for this.
> 
> In the new bug please attach engine and host logs.

I don't know why /var/run/vdsm does not exist after your upgrade. It does not smell related to network - more like an RHV-H issue.

Comment 27 Sahina Bose 2018-10-16 10:52:41 UTC
I have opened a new Bug 1639667 as requested. Please take a look

Comment 28 Nir Soffer 2018-10-23 08:23:19 UTC
(In reply to Sahina Bose from comment #27)
> I have opened a new Bug 1639667 as requested. Please take a look

Can we close this bug?

Comment 29 Sahina Bose 2018-10-23 08:33:07 UTC

*** This bug has been marked as a duplicate of bug 1639667 ***