Bug 1201355 - [4.0] [HC] Hosted Engine storage domains disappear while running ovirt-host-deploy in Hyper Converged configuration
Keywords:
Status: CLOSED CURRENTRELEASE
Alias: None
Product: vdsm
Classification: oVirt
Component: General
Version: ---
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: high
Target Milestone: ovirt-4.0.0-alpha
Target Release: 4.17.999
Assignee: Sandro Bonazzola
QA Contact: SATHEESARAN
URL:
Whiteboard:
Duplicates: 1220310
Depends On: 1022961 1253939
Blocks: oVirt_Hosted_Engine_GlusterFS Hosted_Engine_HC Gluster-HC-1
 
Reported: 2015-03-12 14:30 UTC by Sandro Bonazzola
Modified: 2016-08-25 12:42 UTC (History)
27 users

Fixed In Version: ovirt 4.0.0 alpha1
Doc Type: Bug Fix
Doc Text:
Clone Of:
Environment:
Last Closed: 2016-08-25 12:42:36 UTC
oVirt Team: Gluster
ylavi: ovirt-4.0.0?
ylavi: planning_ack+
rule-engine: devel_ack+
sasundar: testing_ack+


Attachments
host logs (1.82 MB, application/x-xz)
2015-03-12 14:30 UTC, Sandro Bonazzola
hosted engine vm logs (294.57 KB, application/x-xz)
2015-03-12 14:30 UTC, Sandro Bonazzola


Links
System ID Priority Status Summary Last Updated
Red Hat Bugzilla 1172905 urgent CLOSED [HC] restarting vdsmd on a centos 7 host remounts gluster volumes, irrevocably pausing any running VMs 2020-10-14 00:28:05 UTC
Red Hat Bugzilla 1221486 urgent CLOSED [HC-HE]Failed to deploy HC-HE with | [ ERROR ] Failed to execute stage 'Closing up': The VM is not powering up: please c... 2020-10-14 00:28:05 UTC
oVirt gerrit 40179 master ABANDONED storage: exec mount in different cgroup 2020-10-20 04:28:25 UTC
oVirt gerrit 40183 master ABANDONED systemd: set KillMode to process for vdsmd 2020-10-20 04:28:25 UTC
oVirt gerrit 40240 master MERGED storageServer: Run glusterfs in its own cgroup 2020-10-20 04:28:25 UTC

Internal Links: 1172905 1221486

Description Sandro Bonazzola 2015-03-12 14:30:23 UTC
Created attachment 1001003 [details]
host logs

In the Hyper Converged configuration, the gluster daemon is configured and running on the host.
Hosted Engine Setup creates a local brick and uses it to create a master data storage domain on which to deploy the Hosted Engine VM.
Once the VM is deployed and the engine is running, the setup uses the SDK to add the host to the cluster.
The engine then executes ovirt-host-deploy.

At this point the mount point under /rhev used by the data storage domain disappears and the Hosted Engine VM enters the paused state.

In the mount logs:
[2015-02-12 15:33:45.223019] W [glusterfsd.c:1194:cleanup_and_exit] (--> 0-: received signum (15), shutting down
[2015-02-12 15:33:45.223040] I [fuse-bridge.c:5599:fini] 0-fuse: Unmounting '/rhev/data-center/mnt/glusterSD/mididell.home:_hosted__engine__glusterfs'.

It seems that something is terminating glusterd with SIGTERM.
Restarting glusterd shouldn't cause this, and ovirt-host-deploy and ovirt-hosted-engine-setup do not send SIGTERM.

Comment 1 Sandro Bonazzola 2015-03-12 14:30:56 UTC
Created attachment 1001005 [details]
hosted engine vm logs

Comment 2 Sandro Bonazzola 2015-03-12 14:35:20 UTC
Host is RHEL 7.1
fabric script for installing rpms used for the test:

import fabric.api

def testing():
    fabric.api.run('yum -y install glusterfs-server python-magic')
    urls = [
        ('http://jenkins.ovirt.org/job/'
        'ovirt-hosted-engine-setup_any_create-rpms_manual/11/ARCH=x86_64,'
        'DISTRIBUTION=epel-7/artifact/exported-artifacts/'
        'ovirt-hosted-engine-setup-1.2.3-0.0.master.el7.centos.noarch.rpm')
    ]
    for package in (
        'vdsm-4.16.12-7.gita30da75.el7.centos.x86_64.rpm',
        'vdsm-cli-4.16.12-7.gita30da75.el7.centos.noarch.rpm',
        'vdsm-python-zombiereaper-4.16.12-7.gita30da75.el7.centos.noarch.rpm',
        'vdsm-jsonrpc-4.16.12-7.gita30da75.el7.centos.noarch.rpm',
        'vdsm-xmlrpc-4.16.12-7.gita30da75.el7.centos.noarch.rpm',
        'vdsm-yajsonrpc-4.16.12-7.gita30da75.el7.centos.noarch.rpm',
        'vdsm-python-4.16.12-7.gita30da75.el7.centos.noarch.rpm',
        'vdsm-gluster-4.16.12-7.gita30da75.el7.centos.noarch.rpm',
    ):
        urls.append(
            'http://jenkins.ovirt.org/job/manual-build-tarball/490/label=el7/'
            'artifact/exported-artifacts/%s' % package
        )
    fabric.api.run('rpm -Uvh --force ' + ' '.join(urls))

gluster rpms come from http://download.gluster.org/pub/gluster/glusterfs/3.6/3.6.2/

other ovirt rpms are from nightly master snapshot.

Comment 3 Sandro Bonazzola 2015-03-12 14:38:49 UTC
Using the VDSM component, since it seems that the removal of the storage is due to FUSE termination, while libgfapi is supposed to be enabled in the rpms used.

Comment 4 Christopher Pereira 2015-03-26 20:49:18 UTC
I experienced a similar behaviour.

While hosted-engine-setup was adding the host to the newly created cluster, VDSM crashed.

On the hosted-engine-setup logs we have:

	[...]

	2015-03-26 18:36:48 DEBUG otopi.plugins.ovirt_hosted_engine_setup.engine.add_host add_host._closeup:635 Adding the host to the cluster

	2015-03-26 18:36:49 INFO otopi.plugins.ovirt_hosted_engine_setup.engine.add_host add_host._wait_host_ready:169 Waiting for the host to become operational in the engine. This may take several minutes...

	2015-03-26 18:36:50 DEBUG otopi.plugins.ovirt_hosted_engine_setup.engine.add_host add_host._wait_host_ready:189 VDSM host in installing state
	2015-03-26 18:36:54 DEBUG otopi.plugins.ovirt_hosted_engine_setup.engine.add_host add_host._wait_host_ready:189 VDSM host in installing state
	2015-03-26 18:36:57 DEBUG otopi.plugins.ovirt_hosted_engine_setup.engine.add_host add_host._wait_host_ready:189 VDSM host in installing state

	2015-03-26 18:54:43 DEBUG otopi.plugins.ovirt_hosted_engine_setup.engine.add_host add_host._wait_host_ready:183 Error fetching host state: [ERROR]::oVirt API connection failure, (7, 'I/O operation timed out')
	2015-03-26 18:54:43 DEBUG otopi.plugins.ovirt_hosted_engine_setup.engine.add_host add_host._wait_host_ready:189 VDSM host in  state
	2015-03-26 18:56:51 DEBUG otopi.plugins.ovirt_hosted_engine_setup.engine.add_host add_host._wait_host_ready:183 Error fetching host state: [ERROR]::oVirt API connection failure, (7, 'Failed connect to m5.imatronix.com:443; Connection timed out')
	2015-03-26 18:56:51 DEBUG otopi.plugins.ovirt_hosted_engine_setup.engine.add_host add_host._wait_host_ready:189 VDSM host in  state
	2015-03-26 18:59:00 DEBUG otopi.plugins.ovirt_hosted_engine_setup.engine.add_host add_host._wait_host_ready:183 Error fetching host state: [ERROR]::oVirt API connection failure, (7, 'Failed connect to m5.imatronix.com:443; Connection timed out')
	[...]
	
On the VDSM logs, we have:

	[...]

	MainThread::DEBUG::2015-03-26 18:36:56,767::vdsm::66::vds::(sigtermHandler) Received signal 15

	MainThread::DEBUG::2015-03-26 18:36:56,768::protocoldetector::158::vds.MultiProtocolAcceptor::(stop) Stopping Acceptor
	MainThread::INFO::2015-03-26 18:36:56,768::__init__::565::jsonrpc.JsonRpcServer::(stop) Stopping JsonRPC Server
	Detector thread::DEBUG::2015-03-26 18:36:56,768::protocoldetector::130::vds.MultiProtocolAcceptor::(_cleanup) Cleaning up
	Thread-88::DEBUG::2015-03-26 18:36:56,769::fileSD::262::Storage.Misc.excCmd::(getReadDelay) FAILED: <err> = ''; <rc> = -15
	MainThread::DEBUG::2015-03-26 18:36:56,769::vmchannels::192::vds::(stop) VM channels listener was stopped.
	Thread-88::ERROR::2015-03-26 18:36:56,769::misc::208::Storage.Misc::(readspeed) Unable to read file '/rhev/data-center/mnt/glusterSD/h2.imatronix.com:engine/7ba46e75-51af-4648-becc-5a469cb8e9c2/dom_md/metadata'
	Thread-88::ERROR::2015-03-26 18:36:56,769::monitor::250::Storage.Monitor::(_monitorDomain) Error monitoring domain 7ba46e75-51af-4648-becc-5a469cb8e9c2
	Traceback (most recent call last):
	  File "/usr/share/vdsm/storage/monitor.py", line 247, in _monitorDomain
		self._checkReadDelay()
	  File "/usr/lib/python2.7/site-packages/vdsm/utils.py", line 726, in wrapper
		value = meth(self, *a, **kw)
	  File "/usr/share/vdsm/storage/monitor.py", line 319, in _checkReadDelay
		self.nextStatus.readDelay = self.domain.getReadDelay()
	  File "/usr/share/vdsm/storage/fileSD.py", line 262, in getReadDelay
		stats = misc.readspeed(self.metafile, 4096)
	  File "/usr/share/vdsm/storage/misc.py", line 209, in readspeed
		raise se.MiscFileReadException(path)
	MiscFileReadException: Internal file read failure: (u'/rhev/data-center/mnt/glusterSD/h2.imatronix.com:engine/7ba46e75-51af-4648-becc-5a469cb8e9c2/dom_md/metadata',)


	ioprocess communication (3617)::ERROR::2015-03-26 18:36:56,768::__init__::152::IOProcessClient::(_communicate) IOProcess failure
	Traceback (most recent call last):
	  File "/usr/lib/python2.7/site-packages/ioprocess/__init__.py", line 107, in _communicate
		raise Exception("FD closed")
	Exception: FD closed
	MainThread::INFO::2015-03-26 18:36:56,769::momIF::87::MOM::(stop) Shutting down MOM
	Thread-88::INFO::2015-03-26 18:36:56,770::monitor::273::Storage.Monitor::(_notifyStatusChanges) Domain 7ba46e75-51af-4648-becc-5a469cb8e9c2 became INVALID
	Thread-147::DEBUG::2015-03-26 18:36:56,771::misc::813::Storage.Event.Storage.DomainMonitor.onDomainStateChange::(_emit) Emitting event
	MainThread::DEBUG::2015-03-26 18:36:56,771::task::592::Storage.TaskManager.Task::(_updateState) Task=`aac0fcbf-3a3c-48be-bd72-26e4bcc2baf6`::moving from state init -> state preparing
	Thread-147::DEBUG::2015-03-26 18:36:56,771::misc::823::Storage.Event.Storage.DomainMonitor.onDomainStateChange::(_emit) Calling registered method `contEIOVms`
	ioprocess communication (3617)::DEBUG::2015-03-26 18:36:56,771::__init__::298::IOProcessClient::(_run) Starting IOProcess...
	MainThread::INFO::2015-03-26 18:36:56,771::logUtils::48::dispatcher::(wrapper) Run and protect: prepareForShutdown(options=None)
	Thread-147::DEBUG::2015-03-26 18:36:56,771::misc::833::Storage.Event.Storage.DomainMonitor.onDomainStateChange::(_emit) Event emitted
	Thread-11::DEBUG::2015-03-26 18:36:56,772::storageServer::725::Storage.ConnectionMonitor::(_monitorConnections) Monitoring stopped
	MainThread::INFO::2015-03-26 18:36:56,782::monitor::134::Storage.Monitor::(close) Stop monitoring all domains
	MainThread::INFO::2015-03-26 18:36:56,782::monitor::147::Storage.Monitor::(_stopMonitors) Stop monitoring 7ba46e75-51af-4648-becc-5a469cb8e9c2
	Thread-88::DEBUG::2015-03-26 18:36:56,782::monitor::209::Storage.Monitor::(_run) Domain monitor for 7ba46e75-51af-4648-becc-5a469cb8e9c2 stopped
	MainThread::DEBUG::2015-03-26 18:36:56,782::monitor::153::Storage.Monitor::(_stopMonitors) Waiting for monitor 7ba46e75-51af-4648-becc-5a469cb8e9c2
	Thread-88::INFO::2015-03-26 18:36:56,782::clusterlock::245::Storage.SANLock::(releaseHostId) Releasing host id for domain 7ba46e75-51af-4648-becc-5a469cb8e9c2 (id: 1)
	Thread-88::ERROR::2015-03-26 18:36:56,782::monitor::366::Storage.Monitor::(_releaseHostId) Error releasing host id 1 for domain 7ba46e75-51af-4648-becc-5a469cb8e9c2
	Traceback (most recent call last):
	  File "/usr/share/vdsm/storage/monitor.py", line 363, in _releaseHostId
		self.domain.releaseHostId(self.hostId, unused=True)
	  File "/usr/share/vdsm/storage/sd.py", line 480, in releaseHostId
		self._clusterLock.releaseHostId(hostId, async, unused)
	  File "/usr/share/vdsm/storage/clusterlock.py", line 252, in releaseHostId
		raise se.ReleaseHostIdFailure(self._sdUUID, e)
	ReleaseHostIdFailure: Cannot release host id: (u'7ba46e75-51af-4648-becc-5a469cb8e9c2', SanlockException(16, 'Sanlock lockspace remove failure', 'Device or resource busy'))

	MainThread::DEBUG::2015-03-26 18:36:56,783::taskManager::90::Storage.TaskManager::(prepareForShutdown) Request to stop all tasks
	MainThread::INFO::2015-03-26 18:36:56,783::taskManager::96::Storage.TaskManager::(prepareForShutdown) cdb4f72b-1789-4a9f-adc5-69f7488f5992
	MainThread::INFO::2015-03-26 18:36:56,783::logUtils::51::dispatcher::(wrapper) Run and protect: prepareForShutdown, Return response: None
	MainThread::DEBUG::2015-03-26 18:36:56,783::task::1188::Storage.TaskManager.Task::(prepare) Task=`aac0fcbf-3a3c-48be-bd72-26e4bcc2baf6`::finished: None
	MainThread::DEBUG::2015-03-26 18:36:56,783::task::592::Storage.TaskManager.Task::(_updateState) Task=`aac0fcbf-3a3c-48be-bd72-26e4bcc2baf6`::moving from state preparing -> state finished

	[...]

When trying to restart VDSM, we get:

	Thread-46::ERROR::2015-03-26 19:24:31,843::vm::1237::vm.Vm::(_startUnderlyingVm) vmId=`191045ac-79e4-4ce8-aad7-52cc9af313c5`::The vm start process failed
	Traceback (most recent call last):
	  File "/usr/share/vdsm/virt/vm.py", line 1185, in _startUnderlyingVm
		self._run()
	  File "/usr/share/vdsm/virt/vm.py", line 2253, in _run
		self._connection.createXML(domxml, flags),
	  File "/usr/lib/python2.7/site-packages/vdsm/libvirtconnection.py", line 126, in wrapper
		ret = f(*args, **kwargs)
	  File "/usr/lib64/python2.7/site-packages/libvirt.py", line 3427, in createXML
		if ret is None:raise libvirtError('virDomainCreateXML() failed', conn=self)
	libvirtError: Failed to acquire lock: No space left on device
	Thread-46::INFO::2015-03-26 19:24:31,844::vm::1709::vm.Vm::(setDownStatus) vmId=`191045ac-79e4-4ce8-aad7-52cc9af313c5`::Changed state to Down: Failed to acquire lock: No space left on device (code=1)
	Thread-46::DEBUG::2015-03-26 19:24:31,844::vmchannels::214::vds::(unregister) Delete fileno 60 from listener.
	VM Channels Listener::DEBUG::2015-03-26 19:24:32,346::vmchannels::121::vds::(_do_del_channels) fileno 60 was removed from listener.

In sanlock.log we have:

	2015-03-26 19:24:30+0000 7589 [752]: cmd 9 target pid 9559 not found
	2015-03-26 19:24:31+0000 7589 [764]: r7 cmd_acquire 2,8,9559 invalid lockspace found -1 failed 935819904 name 7ba46e75-51af-4648-becc-5a469cb8e9c2

All 3 lease files are present.

Comment 5 Christopher Pereira 2015-03-26 21:20:09 UTC
Facts:

1) The engine storage (/rhev/data-center/mnt/...) was unmounted during this process.

2) Another mount of the same volume (I mounted it before the setup) was still mounted after the VDSM crash, so maybe the problem is not related to gluster itself.

3) After reconnecting the storage with "hosted-engine --connect-storage", the volume is mounted again.
Now, when trying to restart VDSM, we get an "invalid lockspace" error:

	Thread-46::ERROR::2015-03-26 19:24:31,843::vm::1237::vm.Vm::(_startUnderlyingVm) vmId=`191045ac-79e4-4ce8-aad7-52cc9af313c5`::The vm start process failed
	Traceback (most recent call last):
	  File "/usr/share/vdsm/virt/vm.py", line 1185, in _startUnderlyingVm
		self._run()
	  File "/usr/share/vdsm/virt/vm.py", line 2253, in _run
		self._connection.createXML(domxml, flags),
	  File "/usr/lib/python2.7/site-packages/vdsm/libvirtconnection.py", line 126, in wrapper
		ret = f(*args, **kwargs)
	  File "/usr/lib64/python2.7/site-packages/libvirt.py", line 3427, in createXML
		if ret is None:raise libvirtError('virDomainCreateXML() failed', conn=self)
	libvirtError: Failed to acquire lock: No space left on device
	Thread-46::INFO::2015-03-26 19:24:31,844::vm::1709::vm.Vm::(setDownStatus) vmId=`191045ac-79e4-4ce8-aad7-52cc9af313c5`::Changed state to Down: Failed to acquire lock: No space left on device (code=1)
	Thread-46::DEBUG::2015-03-26 19:24:31,844::vmchannels::214::vds::(unregister) Delete fileno 60 from listener.
	VM Channels Listener::DEBUG::2015-03-26 19:24:32,346::vmchannels::121::vds::(_do_del_channels) fileno 60 was removed from listener.

In sanlock.log we have:

	2015-03-26 19:24:30+0000 7589 [752]: cmd 9 target pid 9559 not found
	2015-03-26 19:24:31+0000 7589 [764]: r7 cmd_acquire 2,8,9559 invalid lockspace found -1 failed 935819904 name 7ba46e75-51af-4648-becc-5a469cb8e9c2

(All 3 lease files are present)

Comment 6 Christopher Pereira 2015-03-26 21:21:25 UTC
About the hosted-engine VM not being resumed after restarting VDSM: I confirmed that QEMU does not reopen the file descriptors when resuming a paused VM.

See also:

- https://bugzilla.redhat.com/show_bug.cgi?id=1172905
- https://bugzilla.redhat.com/show_bug.cgi?id=1058300

Comment 7 Christopher Pereira 2015-03-26 23:05:06 UTC
I found out that the gluster engine storage disappears when VDSM is restarted during the activation process, because the glusterfs mount process is a child of vdsmd.service:

   CGroup: /system.slice/vdsmd.service
           ├─14895 /usr/bin/python /usr/share/vdsm/vdsm
           ├─14964 /usr/libexec/ioprocess --read-pipe-fd 43 --write-pipe-fd 42 --max-threads 10 --ma
           ├─15836 /usr/libexec/ioprocess --read-pipe-fd 49 --write-pipe-fd 48 --max-threads 10 --ma
           ├─15911 /usr/sbin/glusterfs --volfile-server=h2.imatronix.com --volfile-id=engine /rhev/d
           └─15922 /usr/libexec/ioprocess --read-pipe-fd 61 --write-pipe-fd 60 --max-threads 10 --ma

A workaround is to mount the engine storage manually, so that when vdsmd is restarted on the host, it doesn't kill the storage mount and the engine VM.
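To make the failure mode concrete, a cgroup membership check along these lines can reveal the problem before a restart. This is a sketch assuming the cgroup v1/systemd layout used on EL7; the helper name is hypothetical, and the paths are parameters only so the snippet can be exercised without root:

```shell
# List glusterfs mount helpers that live inside vdsmd's cgroup and would
# therefore be killed when vdsmd.service is stopped or restarted.
check_vdsmd_cgroup() {
    # Defaults match the EL7 cgroup v1 hierarchy; both are overridable
    # so the sketch can be tested against a fake layout.
    local tasks="${1:-/sys/fs/cgroup/systemd/system.slice/vdsmd.service/tasks}"
    local proc="${2:-/proc}"
    local pid
    while read -r pid; do
        if [ "$(cat "$proc/$pid/comm" 2>/dev/null)" = "glusterfs" ]; then
            echo "glusterfs mount $pid is inside vdsmd's cgroup"
        fi
    done < "$tasks"
}
```

If this prints anything, a vdsmd restart will take the storage domain mount down with it.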

Comment 8 Doron Fediuck 2015-03-31 06:49:46 UTC
(In reply to Christopher Pereira from comment #7)
> I found out that the gluster engine storage disappears when VDSM is
> restarted during the activation process, because it is a child-process of
> the vdsmd.service:
> 
>    CGroup: /system.slice/vdsmd.service
>            ├─14895 /usr/bin/python /usr/share/vdsm/vdsm
>            ├─14964 /usr/libexec/ioprocess --read-pipe-fd 43 --write-pipe-fd
> 42 --max-threads 10 --ma
>            ├─15836 /usr/libexec/ioprocess --read-pipe-fd 49 --write-pipe-fd
> 48 --max-threads 10 --ma
>            ├─15911 /usr/sbin/glusterfs --volfile-server=h2.imatronix.com
> --volfile-id=engine /rhev/d
>            └─15922 /usr/libexec/ioprocess --read-pipe-fd 61 --write-pipe-fd
> 60 --max-threads 10 --ma
> 
> A workarround is to mount the engine-storage manually, so when vdsmd is
> restarted on the host, it doesn't kill the storage and the engine VM.

Thanks Christopher.
The one thing that bothers me is that this issue only happens when we're
working with Gluster storage. Any other storage type works fine with the same
flow. Any idea why?

Comment 9 Christopher Pereira 2015-04-06 08:18:21 UTC
Doron,

Gluster FDs get invalidated because the glusterfs process is killed together with vdsmd.
With NFS, the process hosting those FDs is presumably not inside the same cgroup, so it is not killed during a vdsmd restart.

Comment 10 Christopher Pereira 2015-04-06 08:18:59 UTC
Guys, an easy workaround to keep the hosted-engine VM alive during the setup is to move the glusterfs process back to glusterd.service's cgroup with this command:

echo $glusterfspid > /sys/fs/cgroup/systemd/system.slice/glusterd.service/tasks
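Wrapped as a small helper, the same workaround might look like this. A sketch, assuming a single glusterfs fuse mount process and the EL7 cgroup v1 layout; the cgroup root is a parameter only so the snippet can be exercised without root:

```shell
move_gluster_mount() {
    # Find the fuse mount helper and move it into glusterd.service's
    # cgroup so a vdsmd restart cannot kill it.
    local cgroup_root="${1:-/sys/fs/cgroup/systemd}"
    local pid
    pid=$(pgrep -f '/usr/sbin/glusterfs' | head -n1)
    if [ -z "$pid" ]; then
        echo "no glusterfs mount process found" >&2
        return 1
    fi
    echo "$pid" > "$cgroup_root/system.slice/glusterd.service/tasks"
}
```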

Comment 11 Pranith Kumar K 2015-04-08 01:39:37 UTC
hi Christopher,
     Will restarting glusterd kill the glusterfs process when this is done?
Pranith

Comment 12 Christopher Pereira 2015-04-21 19:19:48 UTC
No.
Actually, even after stopping glusterd, I still see glusterfs and glusterfsd processes running. I'm not sure if this is normal, but it happens with 3.7dev.
But if you kill those processes, QEMU's file descriptors will be invalidated and the VM won't start anymore.
You will see a "block I/O error in device 'drive-virtio-disk0': Transport endpoint is not connected (107)".
But AFAIR there was a trick: virsh suspend + resume.
This probably works because QEMU reopens the files (this is possible because libvirt passes file paths, not file descriptors).
In the future, when we pass file descriptors to QEMU instead of file paths, we will need to reinitialize them in libvirt or wherever they are received from.
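The suspend/resume trick mentioned above might be scripted like this. A sketch only; it relies on the assumption stated in the comment that libvirt passes file paths, so QEMU reopens the image on resume:

```shell
recover_paused_vm() {
    # Suspend and immediately resume the domain so QEMU reopens its
    # image files, restoring valid file descriptors after a mount bounce.
    local domain="$1"
    virsh suspend "$domain" && virsh resume "$domain"
}
```

For example, `recover_paused_vm HostedEngine` (the domain name is hypothetical).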

Comment 13 Kaushal 2015-04-23 14:11:47 UTC
(In reply to Christopher Pereira from comment #9)
> Doron,
> 
> Gluster FDs get invalidated because the glusterfs process is killed together
> with vdsmd.
> Maybe with NFS the process hosting those FDs is not inside the same cgroup
> and is not killed during vdsmd restart.

This could be because the GlusterFS mount process is a user-space process, unlike NFS mounts which are handled by the kernel.

(In reply to Christopher Pereira from comment #12)
> No.
> Actually even after stopping glusterd, I still see glusterfs and glusterfsd
> processes running. I'm not sure if this is normal, but it happens with
> 3.7dev.

This is actually surprising to me. I always thought that systemd would keep track of processes under a service's cgroup and terminate all of them when the service was stopped.
Did you stop glusterd manually or via systemctl?

Comment 14 Kaushal 2015-04-24 06:37:12 UTC
The best solution here would be to have QEMU use gfapi to access the disk image. This would free it from the VDSM mount, and VDSM would be free to do as it pleases with the mount.

Comment 15 Kaushal 2015-04-24 06:40:30 UTC
(In reply to Kaushal from comment #13)
> (In reply to Christopher Pereira from comment #12)
> > No.
> > Actually even after stopping glusterd, I still see glusterfs and glusterfsd
> > processes running. I'm not sure if this is normal, but it happens with
> > 3.7dev.
> 
> This is actually surprising to me. I always thought that systemd would keep
> track of processes under a service's cgroup and terminate all of them when
> the service was stopped.
> Did you stop glusterd manually or via systemctl?

I checked the glusterd.service unit file. It has KillMode=process, which instructs systemd to kill only the main glusterd process and not its child processes. This is why the mount wasn't being killed when you stopped GlusterD.
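For comparison, the abandoned vdsm patch (gerrit 40183) proposed giving vdsmd the same KillMode. A sketch of such a drop-in override, with the target directory parameterized only so the snippet can run without root:

```shell
install_killmode_dropin() {
    # Write a systemd drop-in that sets KillMode=process for vdsmd,
    # mirroring glusterd's unit: only the main process is killed on
    # stop/restart, leaving child mount helpers alive.
    local etc_dir="${1:-/etc/systemd/system}"
    mkdir -p "$etc_dir/vdsmd.service.d"
    printf '[Service]\nKillMode=process\n' \
        > "$etc_dir/vdsmd.service.d/10-killmode.conf"
}
```

Followed by `systemctl daemon-reload` and a vdsmd restart. Note this approach was abandoned in favor of running glusterfs in its own cgroup (gerrit 40240).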

Comment 16 Christopher Pereira 2015-04-26 06:48:52 UTC
Why is libgfapi not being used here?

Would it be possible to patch VDSM so that the glusterfs process is not executed inside the VDSM cgroup?

As a workaround, to keep the gluster volume alive, I'm currently:

Stopping VDSM, mounting the /rhev share manually, and then restarting VDSM.
I then check that the glusterfs process stays outside the VDSM cgroup, i.e. "systemctl status vdsmd" must not show the glusterfs process under the vdsmd cgroup.
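That manual workaround can be sketched as one helper (server, volume, and mountpoint are hypothetical arguments; this is not vdsm's own code path):

```shell
premount_engine_storage() {
    local server="$1" volume="$2" mnt="$3"
    systemctl stop vdsmd
    mkdir -p "$mnt"
    # Mount the volume outside vdsmd's cgroup, before vdsm comes back.
    mount -t glusterfs "${server}:/${volume}" "$mnt"
    systemctl start vdsmd
    # The mount helper must NOT show up under vdsmd's cgroup afterwards.
    if systemctl status vdsmd | grep -q glusterfs; then
        echo "WARNING: glusterfs is still inside the vdsmd cgroup" >&2
    fi
}
```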

Comment 17 Christopher Pereira 2015-04-30 18:16:06 UTC
Please change status to fixed.
Thanks

Comment 18 Nir Soffer 2015-04-30 18:19:04 UTC
(In reply to Christopher Pereira from comment #17)
> Please change status to fixed.

Hi Christopher, the patches fixing this have not been merged upstream yet, so we cannot
change the status to fixed yet.

Comment 19 Doron Fediuck 2015-05-12 07:26:36 UTC
*** Bug 1220310 has been marked as a duplicate of this bug. ***

Comment 21 Christopher Pereira 2015-05-25 21:22:42 UTC
Working fine in 3.6 Alpha (tested on CentOS 7).
Can be closed.

Comment 22 Nikolai Sednev 2015-09-07 14:41:43 UTC
Hi Sandro, 
Could you change the target release from 3.6.0 to 4.0, please?

Comment 23 Sandro Bonazzola 2015-09-11 12:23:50 UTC
Dropping Bug #1083025 from the blocked bugs, this bug shouldn't affect external glusterfs support

Comment 24 Yaniv Kaul 2015-09-22 12:35:13 UTC
Sandro, when will this bug move to ON_QA?

Comment 25 Sandro Bonazzola 2015-09-22 14:55:03 UTC
I'm going to move it to ON_QA when we release a 4.0 milestone build.
We may have a "checkpoint" build after 3.6.0 is out if we don't want to wait until the first alpha.

Comment 26 Sahina Bose 2016-03-31 14:10:36 UTC
Sandro, can this be closed?

Comment 27 Sandro Bonazzola 2016-04-05 14:37:35 UTC
(In reply to Sahina Bose from comment #26)
> Sandro, can this be closed?

Closed, no; moved to QA, yes.

Comment 28 Sahina Bose 2016-04-05 14:51:40 UTC
Isn't this fix available in 3.6.x?

Comment 29 Sandro Bonazzola 2016-04-18 12:47:06 UTC
(In reply to Sahina Bose from comment #28)
> Isn't this fix available in 3.6.x?

As far as I can tell, no, it's not in 3.6 (change-id: I79714a9a6ecbde17c396eca6b17b9e3605fad688 )

Comment 30 Allon Mureinik 2016-04-18 12:53:35 UTC
(In reply to Sandro Bonazzola from comment #29)
> (In reply to Sahina Bose from comment #28)
> > Isn't this fix available in 3.6.x?
> 
> As far as I can tell, no, it's not in 3.6 (change-id:
> I79714a9a6ecbde17c396eca6b17b9e3605fad688 )

This patch was backported to the 3.6 branch - see commit hash 48841c0956ca91af5cd6aeb66782c8d2115f1d1e.
It's included in 4.17.0.4

Comment 32 SATHEESARAN 2016-08-23 04:22:17 UTC
Tested with gluster 3.7.9 and RHV 4.0.2-7.

While deploying the hosted engine, I used a glusterfs storage domain backed by a replica 3 volume. No problems found.

The glusterfs mount process is under system.slice:

└─system.slice
  ├─run-18300.scope
  │ └─18363 /usr/sbin/glusterfs --volfile-server=10.70.37.157 --volfile-server=10.70.37.162 --volfile-server=10.70.37.187 --volfile-id=/rep2vol /rhev/data-center/mnt/glusterSD/10.70.37.157:_rep2vol
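The merged fix (gerrit 40240, "storageServer: Run glusterfs in its own cgroup") is what produces the transient run-*.scope shown above. A shell-level approximation of the idea, not the actual vdsm code path; the names are hypothetical, and the optional prefix argument exists only so the sketch can be dry-run:

```shell
mount_gluster_in_scope() {
    # Launch the fuse mount helper inside its own transient systemd
    # scope, so that stopping or restarting vdsmd cannot kill it.
    # $4 is an optional command prefix (e.g. "echo") for dry runs.
    local server="$1" volume="$2" mnt="$3" run="${4:-}"
    $run systemd-run --scope -- /usr/sbin/glusterfs \
        --volfile-server="$server" --volfile-id="$volume" "$mnt"
}
```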

Comment 33 SATHEESARAN 2016-08-23 04:31:23 UTC
Also found that the gluster FUSE mount persisted even after restarting glusterd.

Comment 34 SATHEESARAN 2016-08-23 04:31:55 UTC
(In reply to SATHEESARAN from comment #33)
> Also found that the gluster fuse mount persisted, even after restarting
> glusterd

And also after restarting vdsmd

