Bug 1172905 - [HC] restarting vdsmd on a centos 7 host remounts gluster volumes, irrevocably pausing any running VMs
Summary: [HC] restarting vdsmd on a centos 7 host remounts gluster volumes, irrevocably pausing any running VMs
Keywords:
Status: CLOSED CURRENTRELEASE
Alias: None
Product: vdsm
Classification: oVirt
Component: General
Version: ---
Hardware: All
OS: Linux
Priority: urgent
Severity: urgent
Target Milestone: ovirt-3.6.0-rc
Target Release: 4.17.8
Assignee: Nir Soffer
QA Contact: SATHEESARAN
URL:
Whiteboard:
Depends On:
Blocks: Hosted_Engine_HC
 
Reported: 2014-12-11 03:47 UTC by Darrell
Modified: 2016-03-11 07:20 UTC (History)
CC List: 18 users

Fixed In Version: v4.17.0.4
Doc Type: Bug Fix
Doc Text:
Clone Of:
Environment:
Last Closed: 2016-03-11 07:20:04 UTC
oVirt Team: Gluster
Embargoed:
rule-engine: ovirt-3.6.0+
ylavi: planning_ack+
rule-engine: devel_ack+
rule-engine: testing_ack+


Attachments
log files from vm pause (60.95 KB, application/zip)
2014-12-12 00:37 UTC, Darrell


Links
System ID Private Priority Status Summary Last Updated
Red Hat Bugzilla 1201355 0 unspecified CLOSED [4.0] [HC] Hosted Engine storage domains disappear while running ovirt-host-deploy in Hyper Converged configuration 2021-02-22 00:41:40 UTC
oVirt gerrit 40239 0 None None None Never
oVirt gerrit 40240 0 None None None Never

Internal Links: 1201355

Description Darrell 2014-12-11 03:47:44 UTC
Description of problem: Restarting vdsmd (either manually or automatically after a crash) unmounts and remounts the gluster-mounted volumes, causing any running VMs on those volumes to pause due to IO errors. Those VMs cannot be restarted.


Version-Release number of selected component (if applicable):
oVirt 3.5.0 initial release through the nightlies of 2014-12-10, specifically
vdsm-4.16.7-* -> vdsm-4.16.8-6.gitc240f5c.el7.x86_64

Components from latest test:
vdsm-xmlrpc-4.16.8-6.gitc240f5c.el7.noarch
vdsm-4.16.8-6.gitc240f5c.el7.x86_64
vdsm-python-4.16.8-6.gitc240f5c.el7.noarch
vdsm-yajsonrpc-4.16.8-6.gitc240f5c.el7.noarch
vdsm-cli-4.16.8-6.gitc240f5c.el7.noarch
vdsm-python-zombiereaper-4.16.8-6.gitc240f5c.el7.noarch
vdsm-jsonrpc-4.16.8-6.gitc240f5c.el7.noarch
glusterfs-api-3.5.2-1.el7.x86_64
glusterfs-fuse-3.5.2-1.el7.x86_64
glusterfs-3.5.2-1.el7.x86_64
glusterfs-rdma-3.5.2-1.el7.x86_64
glusterfs-cli-3.5.2-1.el7.x86_64
glusterfs-libs-3.5.2-1.el7.x86_64

This only occurs on CentOS 7 hosts; on a CentOS 6 host with the same component levels, the VMs do not pause and continue running as expected.

How reproducible: Always on CentOS 7 hosts.


Steps to Reproduce:
1. Run some VMs on a CentOS 7 oVirt 3.5.x host node
2. Restart vdsmd from the command line

Actual results: VMs pause due to IO errors and cannot be restarted


Expected results: restarting vdsmd has no effect on running VMs


Additional info: 
Appears similar to https://bugzilla.redhat.com/show_bug.cgi?id=1162640; I have uploaded logs from an event there and can provide more if needed. In a recent test, the host was slow enough restarting vdsmd that I could see the gluster volume had been unmounted. I have not tested with a pure NFS mount.
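For reference, a minimal sketch of how the remount can be observed around a vdsmd restart (assuming root on a CentOS 7 host with a gluster storage domain already mounted; the 10 second wait is an arbitrary grace period):

#!/usr/bin/env python
# Hypothetical reproduction helper: record the gluster FUSE mounts before and
# after restarting vdsmd; a changed mount ID means the volume was unmounted
# and mounted again underneath any running VMs.
import subprocess
import time


def gluster_mount_ids():
    """Map mount point -> kernel mount ID for fuse.glusterfs mounts.

    The mount ID changes whenever a mount point is torn down and recreated.
    """
    ids = {}
    with open('/proc/self/mountinfo') as f:
        for line in f:
            left, _, right = line.partition(' - ')
            if right.split()[0] == 'fuse.glusterfs':
                fields = left.split()
                ids[fields[4]] = fields[0]  # fields[4]=mount point, fields[0]=mount ID
    return ids


before = gluster_mount_ids()
subprocess.check_call(['systemctl', 'restart', 'vdsmd'])
time.sleep(10)  # give vdsm time to reconnect its storage domains
after = gluster_mount_ids()

for mountpoint, old_id in before.items():
    new_id = after.get(mountpoint)
    if new_id is None:
        print('%s: gone after the restart' % mountpoint)
    elif new_id != old_id:
        print('%s: remounted (mount ID %s -> %s)' % (mountpoint, old_id, new_id))
    else:
        print('%s: untouched' % mountpoint)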

Comment 1 Dan Kenigsberg 2014-12-11 09:37:06 UTC
Please try to reproduce with vdsm-4.16.8, which has bug 1162640 fixed.

Then, please share your {super,}vdsm.log after the vdsmd restart. The glusterfs logs may prove useful, too.

Comment 2 Darrell 2014-12-12 00:21:11 UTC
Reproduction above was with vdsm-4.16.8-6.gitc240f5c.el7.x86_64. Will upload logs from this test for you.

Comment 3 Darrell 2014-12-12 00:37:42 UTC
Created attachment 967447 [details]
log files from vm pause

Comment 4 Christopher Pereira 2015-03-26 21:01:24 UTC
That's probably because QEMU doesn't reopen the file descriptors after they have been invalidated.
See my comments here: https://bugzilla.redhat.com/show_bug.cgi?id=1058300
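
For illustration, a minimal sketch of that stale-descriptor effect (to be run only on a disposable test host, as root; the mount point, image path, and pgrep pattern are illustrative assumptions, and killing the FUSE client here merely stands in for what the vdsmd restart did to mounts living in its control group):

# Hypothetical illustration: a process holding a descriptor opened through a
# FUSE mount keeps referencing the dead mount once the FUSE client is gone,
# and further I/O on that descriptor fails; QEMU never reopens it.
import os
import signal
import subprocess

MOUNTPOINT = '/rhev/data-center/mnt/glusterSD/server:_volume'  # illustrative
IMAGE = os.path.join(MOUNTPOINT, 'test-image')                 # illustrative

fd = os.open(IMAGE, os.O_RDONLY)   # stands in for QEMU's open disk image
os.read(fd, 512)                   # works while the FUSE client is alive

# Kill the glusterfs client backing this mount point (the pattern is a guess
# at its command line); restarting vdsmd had the same effect on its mounts.
pid = int(subprocess.check_output(
    ['pgrep', '-f', 'glusterfs.*%s' % MOUNTPOINT]).decode().split()[0])
os.kill(pid, signal.SIGKILL)

os.lseek(fd, 1024 * 1024, os.SEEK_SET)  # seek past anything the page cache may hold
try:
    os.read(fd, 512)               # the old descriptor now points at a dead mount
except OSError as e:               # typically EIO or "Transport endpoint is not connected"
    print('read through the stale descriptor failed: %s' % e)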

Comment 5 Christopher Pereira 2015-05-01 19:39:41 UTC
Fixed in Gerrit:
https://gerrit.ovirt.org/#/c/40239/
https://gerrit.ovirt.org/#/c/40240/

Tested on CentOS 7

Comment 6 Christopher Pereira 2015-05-25 21:21:13 UTC
Merged and working fine in 3.6 Alpha.
Can be closed.

Comment 7 Christopher Pereira 2015-06-26 22:51:59 UTC
Sorry, the patches seem not to be present in the 3.6 alpha branch, only in master.
Please include them in alpha-2, since killing the storage is too dangerous.

Comment 8 Allon Mureinik 2015-06-28 09:29:22 UTC
(In reply to Christopher Pereira from comment #7)
> Sorry, the patches seem not to be present in the 3.6 alpha branch, only in
> master.
> Please include them in alpha-2, since killing the storage is too dangerous.
Since the patches are merged, this bug should be in MODIFIED. It will be included in the next upstream official build, and should already be available in the nightly builds (for the last month or so).

Comment 9 Christopher Pereira 2015-07-07 21:26:50 UTC
I just tested alpha-2 and the patches are now included.

Comment 10 Allon Mureinik 2015-07-08 10:59:32 UTC
Moving to ON_QA so this can be formally verified

Comment 11 Christopher Pereira 2015-10-14 11:00:24 UTC
Tested on 3.6-rc1 on CentOS 7.
Patches verified in production for some months.
It can easily be verified by checking that the glusterd service is NOT running inside the vdsmd control group.
The main issue was:
https://bugzilla.redhat.com/show_bug.cgi?id=1201355#c7
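
A minimal sketch of that check (assuming systemd on CentOS 7; it simply looks for gluster processes whose /proc/<pid>/cgroup entries mention vdsmd.service, of which there should be none once the fix is in place):

# Hypothetical verification helper: list gluster processes that live inside
# the vdsmd.service control group; an empty list means a vdsmd restart can no
# longer kill the gluster mounts out from under running VMs.
import subprocess

try:
    pids = subprocess.check_output(['pgrep', 'gluster']).decode().split()
except subprocess.CalledProcessError:  # pgrep exits non-zero when nothing matches
    pids = []

offenders = []
for pid in pids:
    try:
        with open('/proc/%s/cgroup' % pid) as f:
            if 'vdsmd.service' in f.read():
                offenders.append(pid)
    except IOError:  # the process exited between pgrep and the read
        pass

if offenders:
    print('gluster processes still inside the vdsmd cgroup: %s' % ', '.join(offenders))
else:
    print('no gluster process runs inside the vdsmd cgroup')

The CGroup tree printed by 'systemctl status vdsmd' gives the same answer at a glance.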

Comment 12 Red Hat Bugzilla Rules Engine 2015-10-18 08:34:49 UTC
Bug tickets that are moved to testing must have target release set to make sure tester knows what to test. Please set the correct target release before moving to ON_QA.

Comment 13 Nir Soffer 2015-10-18 09:22:26 UTC
Yaniv, what info do you need?

Comment 14 Yaniv Lavi 2015-10-18 09:44:02 UTC
See comment #12

Comment 15 Nir Soffer 2015-10-18 10:24:10 UTC
(In reply to Yaniv Dary from comment #14)

The fix has been available since 4.17.1; setting the target release to 4.17.8 since no other
version is available.

Comment 16 SATHEESARAN 2016-02-29 06:19:06 UTC
Tested with RHEV 3.6.3.3 and RHGS 3.1.2 RC by adding an RHGS node to a 3.5-compatible cluster.

1. Launched the VM with its disk image on a gluster storage domain
2. Restarted vdsmd (vdsm-4.17.20-0.1.el7ev.noarch)

The app VM kept running uninterrupted.
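
A quick spot check for the same result on a test host (the guest name is illustrative, and it assumes a local read-only libvirt connection works, i.e. 'virsh -r'):

# Hypothetical spot check: the guest's libvirt state should stay "running",
# not flip to "paused", across a vdsmd restart.
import subprocess
import time

VM = 'appvm1'  # illustrative guest name


def state():
    return subprocess.check_output(['virsh', '-r', 'domstate', VM]).decode().strip()


print('before restart: %s' % state())
subprocess.check_call(['systemctl', 'restart', 'vdsmd'])
time.sleep(10)
print('after restart:  %s' % state())  # expected: still "running"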

