Bug 1550016 - [GlusterFS] HA VM is stuck in unknown status after ungraceful shutdown when libgfapi is disabled
Summary: [GlusterFS] HA VM is stuck in unknown status after ungraceful shutdown when li...
Keywords:
Status: CLOSED NOTABUG
Alias: None
Product: ovirt-engine
Classification: oVirt
Component: BLL.Storage
Version: 4.2.0
Hardware: Unspecified
OS: Unspecified
Priority: urgent
Severity: urgent
Target Milestone: ovirt-4.2.2
Assignee: Nir Soffer
QA Contact: Elad
URL:
Whiteboard:
Depends On: 1547095
Blocks:
 
Reported: 2018-02-28 09:59 UTC by Lilach Zitnitski
Modified: 2018-11-28 13:49 UTC
CC List: 20 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2018-03-12 15:30:22 UTC
oVirt Team: Storage
Embargoed:
rule-engine: ovirt-4.1+
rule-engine: ovirt-4.2+
rule-engine: blocker+


Attachments
logs (451.54 KB, application/zip)
2018-02-28 10:00 UTC, Lilach Zitnitski
host poweroff logs (629.95 KB, application/zip)
2018-03-06 13:30 UTC, Lilach Zitnitski
locking test script (874 bytes, text/plain)
2018-03-06 16:26 UTC, Nir Soffer

Description Lilach Zitnitski 2018-02-28 09:59:58 UTC
Description of problem:
I have HA VM with a lease on gluster storage domain.
When the host running the VM shuts down unexpectedly, the VM gets stuck in unknown status instead of failing over to another active host.

I tried the same test both on an engine with libgfapi enabled and disabled.
In both cases, the HA VM failed to start on another host.

Version-Release number of selected component (if applicable):
rhvm-4.2.2.1-0.1.el7.noarch
vdsm-4.20.19-1.el7ev.x86_64
qemu-kvm-rhev-2.10.0-21.el7.x86_64
libvirt-3.9.0-13.el7.x86_64
glusterfs-3.8.4-54.el7.x86_64
glusterfs-api-3.8.4-54.el7.x86_64

How reproducible:
100%

Steps to Reproduce:
1. create VM with a disk on gluster domain and HA lease on gluster domain
2. start the VM
3. block connection between engine and host running the VM, and between the same host and the storage domain with the VM's disk and lease

Actual results:
VM is stuck in UNKNOWN status

Expected results:
VM should start on another active host

Additional info:

engine.log

2018-02-27 19:57:31,003+02 WARN  [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (EE-ManagedThreadFactory-engine-Thread-4418) [] EVENT_ID: VM_SET_TO_UNKNOWN_STATUS(142), VM vm_0_TestCase17618_2719513854 was set to the Unknown status.

vdsm.log

2018-02-27 19:59:36,244+0200 ERROR (vm/4bea0481) [virt.vm] (vmId='4bea0481-65d3-461a-88f0-c78ed49a81ef') The vm start process failed (vm:940)
Traceback (most recent call last):
  File "/usr/lib/python2.7/site-packages/vdsm/virt/vm.py", line 869, in _startUnderlyingVm
    self._run()
  File "/usr/lib/python2.7/site-packages/vdsm/virt/vm.py", line 2813, in _run
    dom.createWithFlags(flags)
  File "/usr/lib/python2.7/site-packages/vdsm/common/libvirtconnection.py", line 130, in wrapper
    ret = f(*args, **kwargs)
  File "/usr/lib/python2.7/site-packages/vdsm/common/function.py", line 92, in wrapper
    return func(inst, *args, **kwargs)
  File "/usr/lib64/python2.7/site-packages/libvirt.py", line 1099, in createWithFlags
    if ret == -1: raise libvirtError ('virDomainCreateWithFlags() failed', dom=self)
libvirtError: internal error: process exited while connecting to monitor: 2018-02-27T17:59:32.923114Z qemu-kvm: -drive file=/rhev/data-center/mnt/glusterSD/gluster01.scl.lab.tlv.redhat.com:_storage__jenkins__ge5__volume__2/fd6e9ad0-51ff-447e-bc91-038ef7cb736e/images/99de1119-aa46-4d10-a1a4-f6371d24a8f9/565d618f-800e-4749-9404-5d526e1c5f5e,format=qcow2,if=none,id=drive-virtio-disk0,serial=99de1119-aa46-4d10-a1a4-f6371d24a8f9,cache=none,werror=stop,rerror=stop,aio=threads: 'serial' is deprecated, please use the corresponding option of '-device' instead
2018-02-27T17:59:33.199650Z qemu-kvm: -drive file=/rhev/data-center/mnt/glusterSD/gluster01.scl.lab.tlv.redhat.com:_storage__jenkins__ge5__volume__2/fd6e9ad0-51ff-447e-bc91-038ef7cb736e/images/99de1119-aa46-4d10-a1a4-f6371d24a8f9/565d618f-800e-4749-9404-5d526e1c5f5e,format=qcow2,if=none,id=drive-virtio-disk0,serial=99de1119-aa46-4d10-a1a4-f6371d24a8f9,cache=none,werror=stop,rerror=stop,aio=threads: Failed to get "write" lock
Is another process using the image?

Comment 1 Lilach Zitnitski 2018-02-28 10:00:22 UTC
Created attachment 1401747 [details]
logs

Comment 2 Allon Mureinik 2018-02-28 10:02:16 UTC
Seems like the same issue as bug 1547095 [only here it's gluster and there it's NFS v3]

Comment 3 Nir Soffer 2018-02-28 16:35:34 UTC
(In reply to Lilach Zitnitski from comment #0)
> Steps to Reproduce:
> 1. create VM with a disk on gluster domain and HA lease on gluster domain
> 2. start the VM
> 3. block connection between engine and host running the VM, and between the
> same host and the storage domain with the VM's disk and lease

This is similar to, but not, the test we requested.

Blocking access to storage may cause the qemu process to be killed, and in that
case the lock may be released, hiding the issue we want to test. We want to
make sure the lock taken on the client side still exists on the original host.

Please also test this:

3. Power off the host, simulating power loss. Using a nested host would be
   best for this test.

In this case another host can detect that the sanlock lease has expired, and
the engine will start the VM on another host.

Comment 4 Nir Soffer 2018-02-28 16:51:11 UTC
Seems that the root cause is bug 1547095. Setting the bug to depend on that until
we find a workaround.

Comment 5 Nir Soffer 2018-02-28 17:03:12 UTC
Sahina, can you get someone from the gluster team to explain the expected
behavior of glusterfs in this case?

We have two use cases:
- The host becomes disconnected from the network, so the host cannot reach any
  of the gluster nodes, and the engine cannot reach the host. This is what was
  tested here, see comment 0.
- The host loses power. The VM is no longer running on the host, but no code
  on the host can update the gluster nodes about this.

We want to know the behavior of the fcntl-based locks used by qemu on the
images. Do we have NFSv3 behavior, where the locks remain until the client
recovers and reconnects to the storage nodes, or NFSv4 behavior, where the
locks expire after some time without any activity from the client?

Finally, do we have a way to disable remote locks, similar to the NFSv3
"nolock" mount option?

Comment 6 Jeff Cody 2018-02-28 22:21:02 UTC
I've done some testing with this from the qemu command line, and it is readily reproducible with qemu when using fuse-mounted gluster shares (as the description shows: "[...] file=/rhev/data-center/mnt/glusterSD/gluster01.scl.la[...]").

For instance, when doing the following:

1. Launch qemu against a raw image on a fuse-mounted share:

# qemu-kvm -enable-kvm -drive file=/home/jcody/gfs/f25.img,format=raw,if=virtio,aio=native,cache=none,discard=unmap [...]

2. Disconnect host from network


3. Run qemu-io against that image from a different machine, with the same fuse mount:

# qemu-io /home/jcody/gfs/f25.img

can't open device /home/jcody/gfs/f25.img: Failed to get "write" lock
Is another process using the image?


However, if I wait ~10-15 minutes (sorry, I did not time it exactly), the fcntl lock does time out, and I am then able to open the image from the machine in step 3.

Does this accurately capture the issue you are encountering?


From the Description:
> I tried the same test both on an engine with libgfapi enabled and disabled.
> In both cases, the HA VM failed to start on another host.

Does this mean the HA VM was using the gluster:// protocol and still encountering this issue? The file locking is via fcntl in the posix file layer, so it does not make sense to me that it would be seen if qemu is using the native gluster protocol.

Comment 7 Nir Soffer 2018-03-01 00:37:19 UTC
(In reply to Lilach Zitnitski from comment #0)
> I tried the same test both on an engine with libgfapi enabled and disabled.
> In both cases, the HA VM failed to start on another host.

Lilach, the logs seem to contain only the libgfapi_disabled tests:

$ ls -lh libgfapi_disabled
total 5.4M
-rw-r--r--. 1 nsoffer nsoffer 182K Feb 27 20:15 engine.log
-rw-r--r--. 1 nsoffer nsoffer 1.8M Feb 27 20:15 host_mixed_1_vdsm.log
-rw-r--r--. 1 nsoffer nsoffer 3.4M Feb 27 20:16 host_mixed_2_vdsm.log

Looking at the logs, the VM is *not* using libgfapi:

        <disk device="disk" snapshot="no" type="file">
            <source file="/rhev/data-center/mnt/glusterSD/gluster01.scl.lab.tlv.redhat.com:_storage__jenkins__ge5__volume__2/fd6e9ad0-51ff-447e-bc91-038ef7cb736e/images/99de1119-aa46-4d10-a1a4-f6371d24a8f9/565d618f-800e-4749-9404-5d526e1c5f5e"/>
            <target bus="virtio" dev="vda"/>
            <serial>99de1119-aa46-4d10-a1a4-f6371d24a8f9</serial>
            <boot order="1"/>
            <driver cache="none" error_policy="stop" io="threads" name="qemu" type="qcow2"/>
        </disk>

Please check again that the VM was started with a *network* disk and not with
a *file* disk.

If this fails with libgfapi, let's file another bug for it.

Comment 8 Pranith Kumar K 2018-03-01 06:03:15 UTC
(In reply to Nir Soffer from comment #5)
> Sahina, can you get someone from the gluster team to explain the expected
> behavior of glusterfs in this case?
> 
> We have two use cases:
> - The host becomes disconnected from the network, so the host cannot reach
>   any of the gluster nodes, and the engine cannot reach the host. This is
>   what was tested here, see comment 0.
> - The host loses power. The VM is no longer running on the host, but no code
>   on the host can update the gluster nodes about this.
> 
> We want to know the behavior of the fcntl-based locks used by qemu on the
> images. Do we have NFSv3 behavior, where the locks remain until the client
> recovers and reconnects to the storage nodes, or NFSv4 behavior, where the
> locks expire after some time without any activity from the client?

On a fuse mount, locks are relinquished the moment the client dies/disconnects. Do you have brick logs, or a setup where we can get a statedump of the bricks when this issue happens? It looks like that didn't happen when the host died in your setup.

> 
> Finally, do we have a way to disable remote locks, similar to the NFSv3
> "nolock" mount option?

Not that I know of, but it would not be difficult to implement.

Comment 9 Raghavendra G 2018-03-01 06:20:21 UTC
Can this be relevant?
https://review.gluster.org/#/c/16731/

Note that a hard shutdown of the client may not be identified by the brick without keep-alive/user-timeout monitoring of the connection status on the brick side.

Comment 10 Anoop C S 2018-03-01 06:29:20 UTC
(In reply to Nir Soffer from comment #5)
> We want to know the behavior of the fcntl-based locks used by qemu on the
> images. Do we have NFSv3 behavior, where the locks remain until the client
> recovers and reconnects to the storage nodes, or NFSv4 behavior, where the
> locks expire after some time without any activity from the client?

In GlusterFS we definitely do not remove locks on account of an idle client connection. But the behaviour on disconnection is based on two volume set options, namely 'lock-heal' and 'grace-timeout', which are off by default. These options and the corresponding logic have been removed (from v4.0) due to the incompleteness of the implementation and lack of test coverage. The logic that existed until v4.0 is briefly explained in the following doc under the sections "What happens on client side after disconnect?:" and "What happens on server side after disconnect?:"

https://anoopcs.fedorapeople.org/Lock%20recovery%20in%20GlusterFS.txt 

> Finally, do we have a way to disable remote locks, similar to the NFSv3
> "nolock" mount option?

No.

@Pranith,
Please feel free to add/comment.

Comment 11 Pranith Kumar K 2018-03-01 06:36:14 UTC
(In reply to Anoop C S from comment #10)
> @Pranith,
> Please feel free to add/comment.

Nope, I am just waiting for the setup to see what the issue is.

Comment 12 Pranith Kumar K 2018-03-01 06:36:41 UTC
(In reply to Pranith Kumar K from comment #11)
> Nope, I am just waiting for the setup to see what the issue is.

I mean I don't have any further comments :-)

Comment 13 Anoop C S 2018-03-01 06:57:40 UTC
(In reply to Anoop C S from comment #10)
> These options and the corresponding logic have been removed (from v4.0) due
> to the incompleteness of the implementation and lack of test coverage.
v4.0 in my reply above refers to GlusterFS v4.0, *NOT* an NFS protocol version. Sorry for the confusion.

Comment 14 Allon Mureinik 2018-03-01 09:49:20 UTC
Retargeting to RHV 4.1.10 (tentatively), as we did with bug 1550127.

If the research here results in a conclusion that platform changes in RHEL 7.5 require vdsm changes, they should also be backported to 4.1.z.

Comment 15 Nir Soffer 2018-03-01 13:14:46 UTC
Looking at the logs from Lilach, this issue is related only to using fuse mounts
and not to the gluster native connection (libgfapi) which is the default.

Lilach, please update the bug if you can reproduce this with libgfapi enabled.

Comment 16 Nir Soffer 2018-03-01 13:28:10 UTC
Based on the comments from the glusterfs team (starting at comment 8):
- glusterfs locks do not behave like NFSv4 leases
- I could not understand how they do behave
- I don't think we can fix anything in vdsm
- I think qemu should use local locks for glusterfs fuse mounts

Comment 17 Nir Soffer 2018-03-01 13:47:00 UTC
(In reply to Nir Soffer from comment #15)
> and not to the gluster native connection (libgfapi) which is the default.

Turns out that libgfapi is the default only on RHHI systems (hyperconverged setup).
On oVirt/RHV the default is still a fuse mount, because live storage migration with
a gluster native disk is partly broken because of bug 1306562.

So this bug affects all oVirt/RHV installations using glusterfs.

Comment 18 Nir Soffer 2018-03-04 14:13:31 UTC
Lilach, in comment 6 Jeff reports that he could acquire a lock on another host
after 10-15 minutes. How much time did you wait in your test for the VM to start
on another host?

Can you run the test again, and wait 20 minutes after powering off the original
host? I think the engine will retry running the VM on another host every few
minutes.

Comment 19 Sahina Bose 2018-03-05 13:24:55 UTC
Sas, could you provide a reproducer to Pranith as asked for in comment 11?

Comment 20 Raz Tamir 2018-03-05 16:24:54 UTC
Nir,

Can you please change the bug to be on qemu-kvm-rhev? as this is on RHEL side

Comment 21 Raz Tamir 2018-03-05 17:08:24 UTC
(In reply to Raz Tamir from comment #20)
> Nir,
> 
> Can you please change the bug to be on qemu-kvm-rhev? as this is on RHEL side
sorry for typo - *if* this is on RHEL side

Comment 22 Nir Soffer 2018-03-05 20:41:29 UTC
Ademar, this bug is similar to 1547095. I don't think we can do anything
about this in oVirt, and it should move to qemu.

Comment 23 Lilach Zitnitski 2018-03-06 13:29:03 UTC
(In reply to Nir Soffer from comment #3)
> Please also test this:
> 
> 3. Power off the host, simulating power loss. Using a nested host would be
>    best for this test.
> 
> In this case another host can detect that the sanlock lease has expired, and
> the engine will start the VM on another host.

I ran some tests.
When powering off the host, the engine starts the VM on a different host. I saw
this both when libgfapi was enabled and when it was disabled.
I also tested killing the qemu process of the VM on the host (using a nested
host), and the VM started on a different host as well.

I'm adding the logs for all those tests.

Comment 24 Lilach Zitnitski 2018-03-06 13:30:42 UTC
Created attachment 1404803 [details]
host poweroff logs

Name of tested VM for searching the logs - vm_test_glusterfs

Comment 25 Nir Soffer 2018-03-06 15:52:02 UTC
(In reply to Lilach Zitnitski from comment #23)
> When powering off the host, the engine starts the VM on a different host. I
> saw it when LibgfApi was enabled as well when it was disabled. 
> I also tested it with killing the qemu process of the host's VM (using
> nested host)
> and the VM started on a different host as well.

Thanks, Lilach. I also tested this on my setup, and oVirt starts the VM
successfully after a host is powered off (simulating a fatal failure like power loss).

Here is my gluster volume state (taken from gluster get-state):

Volume73.name: nir1
Volume73.id: c1228d3f-2998-485b-8fb6-d5149a351a3b
Volume73.type: Replicate
Volume73.transport_type: tcp
Volume73.status: Started
Volume73.brickcount: 3
Volume73.Brick1.path: gluster01.scl.lab.tlv.redhat.com:/gluster_volumes/nir1
Volume73.Brick1.hostname: gluster01.scl.lab.tlv.redhat.com
Volume73.Brick1.port: 49224
Volume73.Brick1.rdma_port: 0
Volume73.Brick1.status: Started
Volume73.Brick1.spacefree: 168958840832Bytes
Volume73.Brick1.spacetotal: 696518180864Bytes
Volume73.Brick2.path: gluster02.scl.lab.tlv.redhat.com:/gluster_volumes/nir1
Volume73.Brick2.hostname: gluster02.scl.lab.tlv.redhat.com
Volume73.Brick3.path: gluster03.scl.lab.tlv.redhat.com:/gluster_volumes/nir1
Volume73.Brick3.hostname: gluster03.scl.lab.tlv.redhat.com
Volume73.snap_count: 0
Volume73.stripe_count: 1
Volume73.replica_count: 3
Volume73.subvol_count: 1
Volume73.arbiter_count: 0
Volume73.disperse_count: 0
Volume73.redundancy_count: 0
Volume73.quorum_status: not_applicable
Volume73.snapd_svc.online_status: Offline
Volume73.snapd_svc.inited: True
Volume73.rebalance.id: 00000000-0000-0000-0000-000000000000
Volume73.rebalance.status: not_started
Volume73.rebalance.failures: 0
Volume73.rebalance.skipped: 0
Volume73.rebalance.lookedup: 0
Volume73.rebalance.files: 0
Volume73.rebalance.data: 0Bytes
Volume73.time_left: 0
Volume73.gsync_count: 0
Volume73.options.storage.owner-uid: 36
Volume73.options.storage.owner-gid: 36
Volume73.options.transport.address-family: inet
Volume73.options.nfs.disable: on

Seems that gluster has a mechanism to break locks when the host that took
them is powered off.

Is this behavior expected? If so, what timeout does gluster use to break a
lock after a client goes idle?

This seems to contradict comment 10, so maybe there was some misunderstanding.

Comment 27 Nir Soffer 2018-03-06 16:26:34 UTC
Created attachment 1404933 [details]
locking test script

I tested locking behavior when a host is powered off:

Run lock_test.py on host 1:

# python lock_test.py glusterfs
locking '/rhev/data-center/mnt/glusterSD/gluster01.scl.lab.tlv.redhat.com:_nir1/lock'
got the lock in 0.00 seconds, sleeping

Run same command on another host:

# date; python lock_test.py glusterfs
Tue Mar  6 18:19:10 IST 2018
locking '/rhev/data-center/mnt/glusterSD/gluster01.scl.lab.tlv.redhat.com:_nir1/lock'

[Power off host 1 here]

got the lock in 38.92 seconds, sleeping

So it seems that glusterfs breaks the lock about 40 seconds after a client
stops communicating with the server.
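
For readers without access to the attachment, here is a minimal sketch of what
such a locking test could look like (the actual attached script may differ; the
mount path, argument handling and hold time below are assumptions for
illustration). It takes an exclusive fcntl lock on a file under the fuse mount,
analogous to the fcntl-based "write" lock qemu takes on images, and reports how
long acquiring the lock took:

# lock_test.py -- hedged sketch, not necessarily the attached script
import fcntl
import sys
import time

# Assumed mapping from the command-line argument to a lock file path.
PATHS = {
    "glusterfs": "/rhev/data-center/mnt/glusterSD/"
                 "gluster01.scl.lab.tlv.redhat.com:_nir1/lock",
}

path = PATHS[sys.argv[1]]
print("locking %r" % path)

lock_file = open(path, "w")
start = time.time()
# Block until the exclusive (write) lock is granted.
fcntl.lockf(lock_file, fcntl.LOCK_EX)
print("got the lock in %.2f seconds, sleeping" % (time.time() - start))

# Hold the lock so another host can try to acquire it.
time.sleep(600)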

Comment 28 Nir Soffer 2018-03-06 16:31:18 UTC
(In reply to Lilach Zitnitski from comment #0)
> Steps to Reproduce:
> 1. create VM with a disk on gluster domain and HA lease on gluster domain
> 2. start the VM
> 3. block connection between engine and host running the VM, and between the
> same host and the storage domain with the VM's disk and lease

How did you block the connection to storage?

Comment 29 Nir Soffer 2018-03-06 17:54:41 UTC
I tested locking behavior when blocking access to glusterfs storage.

The gluster volume contains 3 bricks:

Volume73.Brick1.hostname: gluster01.scl.lab.tlv.redhat.com
Volume73.Brick2.hostname: gluster02.scl.lab.tlv.redhat.com
Volume73.Brick3.hostname: gluster03.scl.lab.tlv.redhat.com

1. I started the locking test script (see attachment 1404933 [details]) on one host:

[root@voodoo8 ~]# date; python lock_test.py glusterfs
Tue Mar  6 19:27:26 IST 2018
locking '/rhev/data-center/mnt/glusterSD/gluster01.scl.lab.tlv.redhat.com:_nir1/lock'
got the lock in 0.02 seconds, sleeping

2. Then started same script trying to acquire the lock on another host:

[root@voodoo6 ~]# date; python lock_test.py glusterfs
Tue Mar  6 19:27:43 IST 2018
locking '/rhev/data-center/mnt/glusterSD/gluster01.scl.lab.tlv.redhat.com:_nir1/lock'

3. Then blocked access to *all* bricks on the first host:

[root@voodoo8 ~]# date; iptables -A OUTPUT -p tcp -d gluster01.scl.lab.tlv.redhat.com,gluster02.scl.lab.tlv.redhat.com,gluster03.scl.lab.tlv.redhat.com -j DROP
Tue Mar  6 19:27:41 IST 2018
[root@voodoo8 ~]# iptables -L
...
Chain OUTPUT (policy ACCEPT)
target     prot opt source               destination         
ACCEPT     udp  --  anywhere             anywhere             udp dpt:6081
DROP       tcp  --  anywhere             gluster01.scl.lab.tlv.redhat.com 
DROP       tcp  --  anywhere             gluster02.scl.lab.tlv.redhat.com 
DROP       tcp  --  anywhere             gluster03.scl.lab.tlv.redhat.com 

4. After about 40 seconds, the second host got the lock:

got the lock in 36.63 seconds, sleeping

So it seems that the test described in comment 0 was incorrect.


I did the same test with a running VM:

1. Started a VM on a host

2. On the engine host, blocked access to the host:

[root@voodoo4 ~]# date; iptables -A OUTPUT -p tcp -d voodoo8.tlv.redhat.com -j DROP
Tue Mar  6 19:43:49 IST 2018

3. On the host, block access to all bricks:

[root@voodoo8 ~]# date; iptables -A OUTPUT -p tcp -d gluster01.scl.lab.tlv.redhat.com,gluster02.scl.lab.tlv.redhat.com,gluster03.scl.lab.tlv.redhat.com -j DROP
Tue Mar  6 19:44:29 IST 2018

4. On the host running the VM, we could see that the sanlock lease was removed
   after about 80 seconds

5. The vm was paused:

[root@voodoo8 ~]# virsh -r list
 Id    Name                           State
----------------------------------------------------
 1     v42-ha-glusterfs               paused

6. Engine started the vm on another host at 19:46:58


Lilach, can you confirm that blocking access to storage was not correct in your test?

Comment 30 SATHEESARAN 2018-03-07 02:54:07 UTC
(In reply to Sahina Bose from comment #19)
> Sas, could you provide a reproducer to Pranith as asked for in Comment 11

Sahina, 

I have tried to reproduce with qemu-kvm-rhev & gluster with the following steps.

1. Created 3 VMs to run RHGS 3.3.1 async, and created a replica 3 volume
2. Provisioned 2 bare-metal machines with RHEL 7.5 RC
3. Fuse-mounted the replica 3 volume on the 2 physical hosts provisioned in step 2
4. Created a VM and started OS installation
5. While installation was in progress, turned off the network (systemctl stop network) [later repeated the experiment, powering off the node abruptly]
6. Right after that, when I ran 'qemu-io' on the same disk image from another host, I saw the lock:

# qemu-io /mnt/test/rhel75.img

can't open device /mnt/test/rhel75.img: Failed to get "write" lock
Is another process using the image?

But on the very next attempt, the lock had been released and I could access the file.

I also reran the test by powering off the host abruptly and by blocking all I/O from one of the hosts to all gluster bricks. In all such cases, I did not see any locks.

Comment 31 SATHEESARAN 2018-03-07 02:56:28 UTC
(In reply to SATHEESARAN from comment #30)
Tested with glusterfs-3.8.4-54.el7rhgs
qemu-kvm-rhev-2.10.0-21.el7.x86_64
libvirt-daemon-driver-qemu-3.9.0-13.el7.x86_64
qemu-kvm-common-rhev-2.10.0-21.el7.x86_64
qemu-img-rhev-2.10.0-21.el7.x86_64
kernel-3.10.0-858.el7.x86_64

Comment 32 Anoop C S 2018-03-07 06:57:54 UTC
(In reply to Nir Soffer from comment #27)
> Created attachment 1404933 [details]
> locking test script
> 
> I tested locking behavior when a host is powered off:
> 
> Run lock_test.py on host 1:
> 
> # python lock_test.py glusterfs
> locking
> '/rhev/data-center/mnt/glusterSD/gluster01.scl.lab.tlv.redhat.com:_nir1/lock'
> got the lock in 0.00 seconds, sleeping
> 
> Run same command on another host:
> 
> # date; python lock_test.py glusterfs
> Tue Mar  6 18:19:10 IST 2018
> locking
> '/rhev/data-center/mnt/glusterSD/gluster01.scl.lab.tlv.redhat.com:_nir1/lock'
> 
> [Power off host 1 here]

This indicates a clear situation where the client's TCP connection with the server is lost (the client being idle is a different case).

> got the lock in 38.92 seconds, sleeping
> So it seems that glusterfs breaks the lock about 40 seconds after a client
> stops communicating with the server.

Locks are not flushed unless the server receives an RPCSVC_EVENT_DISCONNECT event after a client disconnection. I think the delay in the server receiving the *_DISCONNECT event is due to the socket keepalive interval following the client being powered off (there may not be enough time for a graceful socket cleanup).

Are you able to match the time at which the host was powered off against entries in the brick logs under /var/log/glusterfs/bricks/ containing the keywords "..disconnecting connection from .."? If you can see a time difference between the two, then my assumption is confirmed.
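
For example (a hedged illustration; the exact message text and log file names
may vary by gluster version), such a check on one of the brick nodes could be:

# grep -n "disconnecting connection" /var/log/glusterfs/bricks/*.log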

Comment 33 Lilach Zitnitski 2018-03-07 13:48:45 UTC
(In reply to Nir Soffer from comment #29)
> Lilach, can you confirm that blocking access to storage was not correct in
> your test?

Yes, you're right - in the automation test only one brick was blocked instead of all of them. I'll run it again, this time blocking all of the bricks.

Comment 34 Yaniv Kaul 2018-03-08 12:52:51 UTC
Lilach, any updates?

Comment 35 Avihai 2018-03-12 15:10:08 UTC
(In reply to Yaniv Kaul from comment #34)
> Lilach, any updates?

Tested on 4.2.2.2-0.1

I see that with the revised test (see details below) the VM is moved to another host successfully.

1) Block all 3 gluster bricks on the VDSM host running the VM

Blocked bricks:
gluster01.scl.lab.tlv.redhat.com
gluster02.scl.lab.tlv.redhat.com
gluster03.scl.lab.tlv.redhat.com 

2) Block the connection from the engine to the VDSM host running the VM

Result:
The VM is moved to another host successfully.

Comment 36 Nir Soffer 2018-03-12 15:17:39 UTC
(In reply to Avihai from comment #35)
Based on this, I think we can safely close this as NOTABUG.

Comment 37 Elad 2018-03-12 15:20:05 UTC
Fine by me

