| Summary: | File lock propagates on failover but not failback |
|---|---|
| Product: | Red Hat Gluster Storage |
| Reporter: | Mike Watkins <mwatkins> |
| Component: | glusterfs |
| Assignee: | Bug Updates Notification Mailing List <rhs-bugs> |
| Status: | CLOSED EOL |
| QA Contact: | storage-qa-internal <storage-qa-internal> |
| Severity: | medium |
| Priority: | unspecified |
| Version: | 2.1 |
| CC: | chrisw, dowoods, mwatkins, rramamoo, vbellur |
| Hardware: | x86_64 |
| OS: | Linux |
| Doc Type: | Bug Fix |
| Last Closed: | 2015-12-03 17:10:51 UTC |
| Type: | Bug |
Description
Mike Watkins
2013-11-27 22:17:13 UTC
I had a look at the videos. From what I understood, failover and failback of IO and locks is happening on the distributed-replicated volume (as expected), but not working over the distributed (non-replicated) volume. This is all as expected: depending on how the hashing algorithm has distributed the files, locktest will surely fail when the node storing the file is powered off. For high availability, a distributed-replicated volume is a requirement, and from what I understand of the videos, that is working. Unless I have missed something, we can close this bug as NOTABUG.

Anand, I tested 2-node replicated and 2-node distributed volumes for my initial tests (and videos); this is evident from my gluster volume info output in this BZ. However, based on your comment that full HA requires a distributed-replicated volume, I re-did my testing with this volume type.

```
[root@host120 ~]# gluster volume info amq-dist-rep-volume

Volume Name: amq-dist-rep-volume
Type: Distributed-Replicate
Volume ID: f55427c1-ef24-4c2c-86d4-2afc92675ee5
Status: Started
Number of Bricks: 2 x 2 = 4
Transport-type: tcp
Bricks:
Brick1: 192.168.122.135:/gluster_brick/distrep
Brick2: 192.168.122.237:/gluster_brick/distrep
Brick3: 192.168.122.53:/gluster_brick/distrep
Brick4: 192.168.122.178:/gluster_brick/distrep
[root@host120 ~]#
```

Note the following host/address mapping:

```
host120 = 192.168.122.135
host217 = 192.168.122.237
host300 = 192.168.122.53
host400 = 192.168.122.178
```

I then mounted the client:

```
[root@vm3 ~]# mount -t glusterfs 192.168.122.135:/amq-dist-rep-volume /mnt/amqrhss
```

Note that I'm using two client nodes, vm1 and vm3, to run the locktest C program. I ran locktest on vm3, then on vm1 (waiting for the lock), and powered off/on RHS hosts host120, host217, host300 and host400, with the locktest program behaving as expected.
Please look at the newly uploaded video in the gdrive link, named "C locktest with distrib-replic gluster volume + testing multiple single-node failures.webm". So... this does work :) But I need a very technical explanation as to why this works with distributed-replicated volumes only, yet fails with the 2-node distributed and 2-node replicated setups.

Thanks, Mike

Mike, I think the video of the replicated volume with failover (and failback) shows that it works? But you mention "And, fails with distributed volume and with replicated volume 2-nodes setups", and I could not find any failures in the replicated video. Can you point to the exact filename and timestamp in the video which shows the failure?

Avati

DISTRIBUTED VOLUME TEST
-----------------------

Video: C locktest with distributed gluster volume.webm
Status: Doesn't show a failure, since I only power off (then back on) the 1st RHS node and don't run the test long enough. I can run this test longer to see if the ./locktest program on the other VM picks up the lock (as expected).

Video: C locktest with distributed gluster volume + failback.webm
Status: Fails at the ~4:38 mark.

REPLICATED VOLUME TEST
----------------------

Video: C locktest with replicated gluster volume.webm
Status: Doesn't show a failure, since I only power off (then back on) the 1st RHS node and don't run the test long enough. I can run this test longer to see if the ./locktest program on the other VM picks up the lock (as expected).

Video: C locktest with replicated gluster volume + failback.webm
Status: Fails at the ~5:57 mark. The locktest program is running on vm3 (holds the lock, counts to 200). During this time, locktest is also running on vm1 (waiting to grab the lock when vm3 releases it after count=200). I power off/on the first RHS node (host120); the lock holds. Then I power off/on the 2nd RHS node (host217). The lock appears to still be held, but when the locktest counter reaches 200, vm1 DOES NOT grab the lock as expected.

DISTRIBUTED-REPLICATED VOLUME TEST
----------------------------------

Video: C locktest with distrib-replic gluster volume + testing multiple single-node failures.webm
Status: The active-lock VM (vm3) and the waiting-for-lock VM (vm1) work as expected, EVEN after powering off/on (i.e. failing) SEVERAL RHS nodes individually.

If you can explain how lock behavior works on the different volume types (distributed, replicated and distributed-replicated), that will help. From my tests, you can see that the locktest program behaves differently for the different volume types.

Thanks, Mike

Thank you for submitting this issue for consideration in Red Hat Gluster Storage. The release for which you requested a review is now End of Life. Please see https://access.redhat.com/support/policy/updates/rhs/

If you can reproduce this bug against a currently maintained version of Red Hat Gluster Storage, please feel free to file a new report against the current release.