Bug 1639667 - Sometimes host is non-responsive on reboot when gluster bricks are on vdo volumes, due to missing /var/run/vdsm directory
Summary: Sometimes host is non-responsive on reboot when gluster bricks are on vdo volumes, due to missing /var/run/vdsm directory
Keywords:
Status: CLOSED CURRENTRELEASE
Alias: None
Product: vdsm
Classification: oVirt
Component: Gluster
Version: 4.30.0
Hardware: Unspecified
OS: Unspecified
Importance: high urgent with 1 vote
Target Milestone: ovirt-4.3.1
Target Release: ---
Assignee: Parth Dhanjal
QA Contact: SATHEESARAN
URL:
Whiteboard:
Duplicates: 1576479
Depends On: 1630788
Blocks:
 
Reported: 2018-10-16 10:46 UTC by Sahina Bose
Modified: 2019-03-13 16:39 UTC (History)
CC List: 8 users

Fixed In Version:
Clone Of:
Environment:
Last Closed: 2019-03-13 16:39:33 UTC
oVirt Team: Gluster
Embargoed:
rule-engine: ovirt-4.3+
ylavi: testing_plan_complete?



Description Sahina Bose 2018-10-16 10:46:08 UTC
Description of problem:
On rebooting a host that was added to a hyperconverged environment, the host is marked Non-responsive in the UI because the vdsm service fails to start.

[root@tendrl26 ~]# service vdsmd status
Redirecting to /bin/systemctl status vdsmd.service
● vdsmd.service - Virtual Desktop Server Manager
   Loaded: loaded (/usr/lib/systemd/system/vdsmd.service; enabled; vendor preset: enabled)
   Active: inactive (dead)

Oct 16 16:07:44 tendrl26.lab.eng.blr.redhat.com systemd[1]: Dependency failed for Virtual Desktop Se...r.
Oct 16 16:07:44 tendrl26.lab.eng.blr.redhat.com systemd[1]: Job vdsmd.service/start failed with resu...'.
Oct 16 16:07:54 tendrl26.lab.eng.blr.redhat.com systemd[1]: Dependency failed for Virtual Desktop Se...r.
Oct 16 16:07:54 tendrl26.lab.eng.blr.redhat.com systemd[1]: Job vdsmd.service/start failed with resu...'.
Oct 16 16:08:05 tendrl26.lab.eng.blr.redhat.com systemd[1]: Dependency failed for Virtual Desktop Se...r.
Oct 16 16:08:05 tendrl26.lab.eng.blr.redhat.com systemd[1]: Job vdsmd.service/start failed with resu...'.
Oct 16 16:08:16 tendrl26.lab.eng.blr.redhat.com systemd[1]: Dependency failed for Virtual Desktop Se...r.
Oct 16 16:08:16 tendrl26.lab.eng.blr.redhat.com systemd[1]: Job vdsmd.service/start failed with resu...'.
Oct 16 16:08:27 tendrl26.lab.eng.blr.redhat.com systemd[1]: Dependency failed for Virtual Desktop Se...r.
Oct 16 16:08:27 tendrl26.lab.eng.blr.redhat.com systemd[1]: Job vdsmd.service/start failed with resu...'.
Hint: Some lines were ellipsized, use -l to show in full.
[root@tendrl26 ~]# vdsm-tool configure --force

Checking configuration status...

abrt is already configured for vdsm
lvm is configured for vdsm
libvirt is already configured for vdsm
SUCCESS: ssl configured to true. No conflicts
Manual override for multipath.conf detected - preserving current configuration
This manual override for multipath.conf was based on downrevved template. You are strongly advised to contact your support representatives

Running configure...
Reconfiguration of abrt is done.
Reconfiguration of passwd is done.
Reconfiguration of libvirt is done.
Traceback (most recent call last):
  File "/usr/bin/vdsm-tool", line 219, in main
    return tool_command[cmd]["command"](*args)
  File "/usr/lib/python2.7/site-packages/vdsm/tool/__init__.py", line 38, in wrapper
    func(*args, **kwargs)
  File "/usr/lib/python2.7/site-packages/vdsm/tool/configurator.py", line 141, in configure
    _configure(c)
  File "/usr/lib/python2.7/site-packages/vdsm/tool/configurator.py", line 88, in _configure
    getattr(module, 'configure', lambda: None)()
  File "/usr/lib/python2.7/site-packages/vdsm/tool/configurators/bond_defaults.py", line 37, in configure
    sysfs_options_mapper.dump_bonding_options()
  File "/usr/lib/python2.7/site-packages/vdsm/network/link/bond/sysfs_options_mapper.py", line 46, in dump_bonding_options
    with open(sysfs_options.BONDING_DEFAULTS, 'w') as f:
IOError: [Errno 2] No such file or directory: '/var/run/vdsm/bonding-defaults.json'
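
Note that /var/run is a symlink to /run, a tmpfs that systemd-tmpfiles repopulates at boot, and vdsm's runtime directory is normally recreated that way. A minimal check/workaround on an affected host could look like the following (the tmpfiles.d drop-in name vdsm.conf and its contents are an assumption, not taken from this report):

ls -ld /var/run/vdsm
grep -r vdsm /usr/lib/tmpfiles.d/ /etc/tmpfiles.d/   # locate the drop-in that should create the directory
systemd-tmpfiles --create vdsm.conf                  # re-apply just that drop-in (assumed file name)
vdsm-tool configure --force                          # then retry the step that failed above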


Version-Release number of selected component (if applicable):


How reproducible:
Sometimes

Steps to Reproduce:
1. Reboot a host added to HE (in HC environment)

Comment 2 Sahina Bose 2018-10-16 11:00:53 UTC
FYI - this is a RHEL 7.6 host (not a RHV-H host like the earlier reported Bug 1576479)
[root@tendrl26 ~]# cat /etc/redhat-release 
Red Hat Enterprise Linux Server release 7.6 (Maipo)

Comment 3 Sahina Bose 2018-10-16 11:12:25 UTC
I was able to work around the issue by doing a "yum reinstall vdsm".

Comment 4 Yaniv Lavi 2018-10-17 07:32:18 UTC
Can you provide an estimate of how often this happens?

Comment 5 Michael Burman 2018-10-17 07:49:44 UTC
Could you please check why /var/run/vdsm is missing after upgrade?

Comment 6 Sahina Bose 2018-10-17 08:13:54 UTC
(In reply to Michael Burman from comment #5)
> Could you please check why /var/run/vdsm is missing after upgrade?

This was not an upgrade. It was a reboot of the server without moving to maintenance mode.

Comment 7 Sahina Bose 2018-10-17 08:20:18 UTC
(In reply to Yaniv Lavi from comment #4)
> Can you provide an estimation on the percentage of this happening?

I would say ~30% - I've seen it 2 out of 6 times. And this is not specific to RHEL 7.6 - we have faced it with RHV-H and RHEL 7.5 as well (see https://bugzilla.redhat.com/show_bug.cgi?id=1576479#c20)

Comment 8 Michael Burman 2018-10-17 08:26:22 UTC
We never saw it in RHV QE.

Comment 9 Martin Perina 2018-10-17 13:30:48 UTC
(In reply to Michael Burman from comment #5)
> Could you please check why /var/run/vdsm is missing after upgrade?

The /var/run directory is populated at boot by systemd-tmpfiles, but the tmpfiles setup service couldn't be started due to dependency issues:

Oct 16 15:16:14 tendrl26 systemd: Found ordering cycle on sysinit.target/start
Oct 16 15:16:14 tendrl26 systemd: Found dependency on systemd-tmpfiles-setup.service/start
Oct 16 15:16:14 tendrl26 systemd: Found dependency on local-fs.target/start
Oct 16 15:16:14 tendrl26 systemd: Found dependency on gluster_bricks-engine.mount/start
Oct 16 15:16:14 tendrl26 systemd: Found dependency on vdo.service/start
Oct 16 15:16:14 tendrl26 systemd: Found dependency on basic.target/start
Oct 16 15:16:14 tendrl26 systemd: Found dependency on sockets.target/start
Oct 16 15:16:14 tendrl26 systemd: Found dependency on iscsiuio.socket/start
Oct 16 15:16:14 tendrl26 systemd: Found dependency on sysinit.target/start
Oct 16 15:16:14 tendrl26 systemd: Breaking ordering cycle by deleting job systemd-tmpfiles-setup.service/start

But I have no idea which of those dependencies is wrong; we probably need a systemd expert to figure that out.
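
For reference, standard systemd tooling can help pinpoint which edge closes the cycle; a minimal inspection sketch using the unit names from the journal excerpt above:

systemctl list-dependencies --after gluster_bricks-engine.mount   # what the mount is ordered after
systemctl show -p After,Before,Requires gluster_bricks-engine.mount
systemctl show -p After,Before vdo.service
journalctl -b -u systemd-tmpfiles-setup.service                   # confirm whether tmpfiles setup ran on this boot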

Comment 10 Martin Perina 2018-10-17 14:49:19 UTC
Sahina, one of the dependencies mentioned above is gluster_bricks-engine.mount. As Michael mentioned, this issue was never observed by RHV QE, so could this be the reason for the failure?

Comment 11 Sahina Bose 2018-10-17 15:27:30 UTC
(In reply to Martin Perina from comment #10)
> Sahina, one of the dependencies mentioned above is
> gluster_bricks-engine-mount. As Michael mentioned this issue was never
> observed by RHV QE, so can't this be a reason of the failure?

Possible. The brick mount has the following entry in /etc/fstab:
/dev/vg_sda3/gluster_lv_engine /gluster_bricks/engine xfs inode64,noatime,nodiratime,x-systemd.requires=vdo.service 0 0

Could this be similar to Bug 1552242?

Setting needinfo on Dennis.
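
For context on how that entry feeds the cycle from comment 9: per systemd.mount(5), x-systemd.requires= adds Requires= and After= dependencies to the generated mount unit, so gluster_bricks-engine.mount (pulled in via local-fs.target) waits for vdo.service, and the journal shows vdo.service in turn ordered after basic.target, closing the loop back to sysinit.target. The generated unit can be inspected directly:

systemctl cat gluster_bricks-engine.mount            # shows the unit written by systemd-fstab-generator
systemctl show -p Requires,After gluster_bricks-engine.mount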

Comment 12 Sahina Bose 2018-10-18 12:32:28 UTC
Removing the blocker flag, as this is not consistent and is seen only when a VDO volume is in the stack.

Comment 16 Sahina Bose 2018-10-23 08:33:07 UTC
*** Bug 1576479 has been marked as a duplicate of this bug. ***

Comment 22 SATHEESARAN 2019-01-14 02:49:17 UTC
The dependent bug 1630788 is already verified, as the problem is resolved by changing the mount options of the filesystem in /etc/fstab.

@Sahina, based on the above reasoning, could you move this bug to ON_QA, so that it can be verified?

Comment 23 Sandro Bonazzola 2019-01-28 09:41:03 UTC
This bug has not been marked as a blocker for oVirt 4.3.0.
Since we are releasing it tomorrow, January 29th, this bug has been re-targeted to 4.3.1.

Comment 24 SATHEESARAN 2019-03-12 17:29:30 UTC
Tested with cockpit-ovirt-dashboard-0.12.4 and RHV 4.3.2.

The bricks/XFS filesystems created on the VDO volume now have an entry in /etc/fstab with
special mount options - "_netdev,x-systemd.device-timeout=0" - which solves this problem.
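
For reference, combining the entry from comment 11 with those options would give an fstab line along these lines (a sketch only; the exact option set written by the installer is not quoted in this bug):

/dev/vg_sda3/gluster_lv_engine /gluster_bricks/engine xfs inode64,noatime,nodiratime,_netdev,x-systemd.device-timeout=0 0 0

The _netdev option makes systemd treat the mount as a remote filesystem (ordered under remote-fs.target rather than local-fs.target), which avoids the local-fs.target -> gluster_bricks-engine.mount -> vdo.service leg of the ordering cycle.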

Comment 25 Sandro Bonazzola 2019-03-13 16:39:33 UTC
This bugzilla is included in the oVirt 4.3.1 release, published on February 28th 2019.

Since the problem described in this bug report should be
resolved in the oVirt 4.3.1 release, it has been closed with a resolution of CURRENT RELEASE.

If the solution does not work for you, please open a new bug report.

