Bug 1630788

Summary: Host goes non-responsive post reboot, as /var/run/vdsm directory is missing
Product: [Red Hat Storage] Red Hat Gluster Storage
Reporter: bipin <bshetty>
Component: rhhi
Assignee: Sachidananda Urs <surs>
Status: CLOSED CURRENTRELEASE
QA Contact: SATHEESARAN <sasundar>
Severity: high
Docs Contact:
Priority: high
Version: rhhiv-1.5
CC: amureini, bugs, danken, derez, ebenahar, guillaume.pavese, mflanner, nsoffer, rcyriac, rhs-bugs, sabose, sankarshan, sasundar, tnisan
Target Milestone: ---
Keywords: Reopened, ZStream
Target Release: RHHI-V 1.5.z Async
Hardware: x86_64
OS: Linux
Whiteboard:
Fixed In Version: gdeploy-2.0.2-31
Doc Type: Known Issue
Doc Text:
Cause: When a host is rebooted, file systems on VDO devices are not mounted, because their fstab entries do not order the mount after the vdo service.
Consequence: The gluster bricks and the vdsm service cannot be started.
Workaround (if any): Update the fstab entries for devices using VDO with "_netdev,x-systemd.device-timeout=0,x-systemd.requires=vdo.service" (see the example entry below and comment 11).
Result: The file system is mounted on reboot.
Story Points: ---
Clone Of: 1576479
: 1654584 (view as bug list)
Environment:
Last Closed: 2019-05-20 04:54:29 UTC
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: Storage
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Embargoed:
Bug Depends On: 1576479, 1654584
Bug Blocks: 1639667
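
For reference, a brick mount on a VDO-backed device ends up with an fstab entry along these lines (taken from the verified snippet in comment 11; the device and mount point names will differ per setup):

/dev/gluster_vg_sdc/gluster_lv_data /gluster_bricks/data xfs inode64,noatime,nodiratime,_netdev,x-systemd.device-timeout=0,x-systemd.requires=vdo.service 0 0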

Comment 1 SATHEESARAN 2018-09-24 06:33:49 UTC
Comment 0 carries some information from the dependent bug. Here is the issue observed with this bug.

When the RHVH node is rebooted, the /var/run/vdsm directory is sometimes missing, which leaves the host non-responsive.

The workaround for this problem is to manually recreate the directories with the ownership and permissions vdsm expects:

# mkdir /var/run/vdsm
# mkdir /var/run/vdsm/trackedInterfaces
# chmod 755 /var/run/vdsm
# chown -R vdsm:kvm /var/run/vdsm
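
The same can be done in a single step (assuming the default 0755 mode is also acceptable for the trackedInterfaces subdirectory; install -d creates both directories with the given owner, group and mode):

# install -d -o vdsm -g kvm -m 755 /var/run/vdsm /var/run/vdsm/trackedInterfaces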

Comment 2 Nir Soffer 2018-10-04 20:52:26 UTC
(In reply to SATHEESARAN from comment #1)
> Comment0 carries some information from the dependent bug. Here is the issue
> observed with this bug.
> 
> When the RHVH node is rebooted, sometimes, /var/run/vdsm directory is
> missing, which leaves the host non-responsive.

creating /run/vdsm is the first thing vdsm does when started, see:
https://github.com/oVirt/vdsm/blob/ece859806fb531492e1ac54d11fc78f0b5d33e1c/init/vdsmd_init_common.sh.in#L209

As part of ExecStartPre - see:
https://github.com/oVirt/vdsm/blob/master/static/usr/lib/systemd/system/vdsmd.service.in
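
Roughly, the relevant directive looks like this (paraphrased; the linked unit file is authoritative and the path may differ by version):

ExecStartPre=/usr/libexec/vdsm/vdsmd_init_common.sh --pre-start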

ovirt-imageio-daemon.service is *not* enabled - and it is started by vdsm using:

Wants=mom-vdsm.service ovirt-imageio-daemon.service abrtd.service \
      dev-hugepages1G.mount libvirt-guests.service kdump.service 

# systemctl status ovirt-imageio-daemon
● ovirt-imageio-daemon.service - oVirt ImageIO Daemon
   Loaded: loaded (/usr/lib/systemd/system/ovirt-imageio-daemon.service; disabled; vendor preset: disabled)
...

Is it possible that ovirt-imageio-daemon.service is enabled by mistake on RHHI?
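
This can be checked on an affected host with:

# systemctl is-enabled ovirt-imageio-daemon.service

The expected output is "disabled" when nothing has re-enabled it.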

Comment 3 Sahina Bose 2018-10-08 13:40:37 UTC
(In reply to Nir Soffer from comment #2)
> (In reply to SATHEESARAN from comment #1)
> > Comment0 carries some information from the dependent bug. Here is the issue
> > observed with this bug.
> > 
> > When the RHVH node is rebooted, sometimes, /var/run/vdsm directory is
> > missing, which leaves the host non-responsive.
> 
> creating /run/vdsm is the first thing vdsm does when started, see:
> https://github.com/oVirt/vdsm/blob/ece859806fb531492e1ac54d11fc78f0b5d33e1c/
> init/vdsmd_init_common.sh.in#L209
> 
> As part of ExecStartPre - see:
> https://github.com/oVirt/vdsm/blob/master/static/usr/lib/systemd/system/
> vdsmd.service.in
> 
> ovirt-imageio-daemon.service is *not* enabled - and it is started by vdsm
> using:
> 
> Wants=mom-vdsm.service ovirt-imageio-daemon.service abrtd.service \
>       dev-hugepages1G.mount libvirt-guests.service kdump.service 
> 
> # systemctl status ovirt-imageio-daemon
> ● ovirt-imageio-daemon.service - oVirt ImageIO Daemon
>    Loaded: loaded (/usr/lib/systemd/system/ovirt-imageio-daemon.service;
> disabled; vendor preset: disabled)
> ...
> 
> Is it possible that ovirt-imageio-daemon.service is enabled by mistake on
> RHHI?

To install RHHI, we install RHV-H, deploy Hosted Engine, and add the nodes to RHV-M. There is no additional step, unless selecting the ovirt-imageio service during engine-setup enables the daemon on the nodes?

Comment 4 Nir Soffer 2018-10-09 16:43:48 UTC
(In reply to Sahina Bose from comment #3)
The daemon should not be enabled by anything. Maybe you replace the certificates
during deploy or upgrade? This may try to restart the daemon, but in that flow
/run/vdsm must already exist.

It would help if you could reproduce the issue without RHHI, with a host connected to
a normal engine.

Comment 5 Sahina Bose 2018-10-10 08:45:45 UTC
(In reply to Nir Soffer from comment #4)
> (In reply to Sahina Bose from comment #3)
> The daemon should not be enabled by anything. Maybe you replace the
> certificates
> during deploy or upgrade? this may try to restart the daemon. But in this
> flow
> /run/vdsm must exists.

No - we do not.

> 
> It can help if you can reproduce the issue without RHHI, with a host
> connected to
> normal engine.

We do not have a non-RHHI setup to reproduce this on. Raz, can you help? Has RHV QE encountered this error on HE deployments?

Comment 6 Raz Tamir 2018-10-10 09:02:42 UTC
(In reply to Sahina Bose from comment #5)
> (In reply to Nir Soffer from comment #4)
> > (In reply to Sahina Bose from comment #3)
> > The daemon should not be enabled by anything. Maybe you replace the
> > certificates
> > during deploy or upgrade? this may try to restart the daemon. But in this
> > flow
> > /run/vdsm must exists.
> 
> No - we do not.
> 
> > 
> > It can help if you can reproduce the issue without RHHI, with a host
> > connected to
> > normal engine.
> 
> We donot have a non-RHHI setup to reproduce. Raz, can you help? Has RHV QE
> encountered this error on HE deployments?
It seems this is not happening on non-RHHI HE deployments.
So far, from the replies I got, no one has seen it.

Comment 7 SATHEESARAN 2018-11-13 08:11:21 UTC
Sahina,

Denis Keefe has come up with the workaround in https://bugzilla.redhat.com/show_bug.cgi?id=1639667#c18. I have tested it and it worked. After reboot, the /var/run/vdsm directory was intact.

Should this now be documented as a known issue?

Comment 8 Sahina Bose 2018-11-20 07:02:25 UTC
(In reply to SATHEESARAN from comment #7)
> Sahina,
> 
> Denis Keefe has come up with the workaround in
> https://bugzilla.redhat.com/show_bug.cgi?id=1639667#c18. I have tested it
> and it worked. After reboot, there were /var/run/vdsm directory was intact.
> 
> Should this be called as the known_issue now ?

Yes, I've updated the doc_text.

Comment 9 Sachidananda Urs 2018-11-30 05:16:44 UTC
The issue has been addressed in this bug: https://bugzilla.redhat.com/show_bug.cgi?id=1654584#c2

Comment 10 SATHEESARAN 2018-12-07 14:17:39 UTC
The dependent bug is ON_QA

Comment 11 SATHEESARAN 2018-12-07 14:17:59 UTC

Tested with gdeploy-2.0.2-31.el7rhgs

Additional mount options (_netdev,x-systemd.device-timeout=0,x-systemd.requires=vdo.service) are now added to /etc/fstab for XFS filesystems (gluster bricks) created on top of VDO volumes:

<snip>
/dev/gluster_vg_sdb/gluster_lv_engine /gluster_bricks/engine xfs inode64,noatime,nodiratime 0 0
/dev/gluster_vg_sdc/gluster_lv_data /gluster_bricks/data xfs inode64,noatime,nodiratime,_netdev,x-systemd.device-timeout=0,x-systemd.requires=vdo.service 0 0
/dev/gluster_vg_sdc/gluster_lv_vmstore /gluster_bricks/vmstore xfs inode64,noatime,nodiratime,_netdev,x-systemd.device-timeout=0,x-systemd.requires=vdo.service 0 0
/dev/gluster_vg_sdd/gluster_lv_newvol /gluster_bricks/newvol xfs inode64,noatime,nodiratime 0 0

</snip>
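
One way to verify that the generated mount unit picked up the dependency on vdo.service (using /gluster_bricks/data from the snippet above; the unit name is derived with systemd-escape):

# systemctl show -p Requires,After "$(systemd-escape -p --suffix=mount /gluster_bricks/data)"

vdo.service should appear in both Requires= and After= when the x-systemd.requires option has been applied.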