Bug 963409 - [QEMU/KVM-RHS] Fuse mount crash when quorum ratio is not met
Status: CLOSED ERRATA
Product: Red Hat Gluster Storage
Classification: Red Hat
Component: glusterfs (Show other bugs)
2.1
Unspecified Unspecified
high Severity urgent
: ---
: ---
Assigned To: raghav
SATHEESARAN
: TestBlocker
: 960836 (view as bug list)
Depends On:
Blocks:
 
Reported: 2013-05-15 14:25 EDT by shilpa
Modified: 2014-06-24 20:51 EDT (History)
8 users (show)

See Also:
Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Story Points: ---
Clone Of:
Environment:
virt qemu integration
Last Closed: 2013-09-23 18:29:50 EDT
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---


Attachments (Terms of Use)
Screenshot of corruption captured on an Application VM (78.20 KB, image/png)
2013-05-15 14:25 EDT, shilpa
strace of "virsh list --all" command after quorum is met, where 'virsh' commands hang indefinitely (5.88 KB, text/plain)
2013-07-30 07:26 EDT, SATHEESARAN

Description shilpa 2013-05-15 14:25:03 EDT
Created attachment 748426 [details]
Screenshot of corruption captured on an Application VM

Description of problem: Corruption on application VMs when the quorum ratio is not met.


Version-Release number of selected component (if applicable): RHS2.1 - glusterfs 3.4.0.8rhs


How reproducible:
Always

Steps to Reproduce:
1. Create a 6x2 distributed-replicate volume [ Vol name: dist-rep ]
2. Fuse mount the volume and create 8 disk images [ 5 qcow2 + 3 raw ], each of 20GB.
3. Create application VMs (RHEL 6.3) with the created disk images as backing storage.
4. Mount the gluster volume in all the application VMs.
5. Set the quorum type to server:
   gluster volume set <volume-name> cluster.server-quorum-type server
6. Set the quorum ratio to 70%.
7. Disconnect a few servers, so that less than 70% of the servers are online.
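With server-side quorum, glusterd kills the bricks on a node once fewer than the configured ratio of servers in the trusted storage pool are reachable. A minimal sketch of the ratio arithmetic only (not the actual glusterd implementation; the function name is illustrative):

```python
import math

def quorum_met(total_servers: int, online_servers: int, ratio_percent: float) -> bool:
    """Server-side quorum check: the number of reachable servers in the
    trusted storage pool must be at least ratio_percent of the total."""
    required = math.ceil(total_servers * ratio_percent / 100.0)
    return online_servers >= required

# With 4 servers and a 70% ratio, at least ceil(4 * 0.70) = 3 servers
# must stay online, so losing 2 of 4 servers breaks quorum.
```

This makes the reproduction step concrete: in the 4-node setup described here, disconnecting any two servers drops the pool below the 70% ratio.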
 
  
Actual results:

Corruption on all VMs including the ones that are on the online RHS nodes



Expected results:

Application VMs should be healthy.


Additional info:

1. Hostname of hypervisor:

rhs-client44.lab.eng.blr.redhat.com/ 10.70.33.209




2. RHS nodes:
rhs-vm1: 10.70.37.81
rhs-vm2: 10.70.37.196
rhs-vm3: 10.70.37.81
rhs-vm4: 10.70.37.101

3. Commands executed on rhs-vm1.

4. Volume name dist-rep

5. Volume info:

Volume Name: dist-rep
Type: Distributed-Replicate
Volume ID: b4cb28d9-7a6e-415f-a925-be8369d30be6
Status: Started
Number of Bricks: 6 x 2 = 12
Transport-type: tcp
Bricks:
Brick1: 10.70.37.41:/rhs/brick1/b1
Brick2: 10.70.37.196:/rhs/brick1/b2
Brick3: 10.70.37.41:/rhs/brick1/b3
Brick4: 10.70.37.196:/rhs/brick1/b4
Brick5: 10.70.37.41:/rhs/brick1/b5
Brick6: 10.70.37.196:/rhs/brick1/b6
Brick7: 10.70.37.81:/rhs/brick1/b7
Brick8: 10.70.37.101:/rhs/brick1/b8
Brick9: 10.70.37.81:/rhs/brick1/b9
Brick10: 10.70.37.101:/rhs/brick1/b10
Brick11: 10.70.37.81:/rhs/brick1/b11
Brick12: 10.70.37.101:/rhs/brick1/b12
Options Reconfigured:
cluster.server-quorum-type: server
performance.quick-read: off
performance.read-ahead: off
performance.io-cache: off
performance.stat-prefetch: off
cluster.eager-lock: enable
network.remote-dio: on
storage.owner-uid: 107
cluster.server-quorum-ratio: 70%


6. After shutting down rhs-vm3 and rhs-vm4:

gluster volume status
Status of volume: dist-rep
Gluster process						Port	Online	Pid
------------------------------------------------------------------------------
Brick 10.70.37.41:/rhs/brick1/b1			N/A	N	N/A
Brick 10.70.37.196:/rhs/brick1/b2			N/A	N	N/A
Brick 10.70.37.41:/rhs/brick1/b3			N/A	N	N/A
Brick 10.70.37.196:/rhs/brick1/b4			N/A	N	N/A
Brick 10.70.37.41:/rhs/brick1/b5			N/A	N	N/A
Brick 10.70.37.196:/rhs/brick1/b6			N/A	N	N/A
NFS Server on localhost					2049	Y	31504
Self-heal Daemon on localhost				N/A	Y	31512
NFS Server on 4e0a0a37-1744-4a30-9646-2712cdf579a6	2049	Y	20023
Self-heal Daemon on 4e0a0a37-1744-4a30-9646-2712cdf579a
 

7. xfs_info /dev/mapper/RHS_vgvdb-RHS_lv1
meta-data=/dev/mapper/RHS_vgvdb-RHS_lv1 isize=512    agcount=64, agsize=348160 blks
         =                       sectsz=512   attr=2, projid32bit=0
data     =                       bsize=4096   blocks=22281216, imaxpct=15
         =                       sunit=64     swidth=640 blks
naming   =version 2              bsize=8192   ascii-ci=0
log      =internal               bsize=4096   blocks=10880, version=2
         =                       sectsz=512   sunit=64 blks, lazy-count=1
realtime =none                   extsz=4096   blocks=0, rtextents=0

8. [rhs-client44:/tmp] # df -kh
Filesystem            Size  Used Avail Use% Mounted on
/dev/sda2             1.8T   11G  1.7T   1% /
tmpfs                 7.8G     0  7.8G   0% /dev/shm
df: `/var/lib/libvirt/images': Transport endpoint is not connected


9. Mount point on hypervisor: /var/lib/libvirt/images
   Mount point on VM's: /mnt/appstore
Comment 4 shilpa 2013-05-16 07:01:11 EDT
Brought up the two offline nodes. Though the screenshots show error messages on the VMs, Ctrl-C could get me the prompt. The VMs show a running state. Could not reboot the VMs or read /var/log/dmesg on them; the commands would hang. Tried force off; that did not work.

Manually remounting the fuse mount and restarting libvirtd did not work.


df -h &
[1] 5369
[root@rhs-client44 glusterfs]# Filesystem            Size  Used Avail Use% Mounted on
/dev/sda2             1.8T   12G  1.7T   1% /
tmpfs                 7.8G     0  7.8G   0% /dev/shm

[root@rhs-client44 glusterfs]# mount
/dev/sda2 on / type ext4 (rw)
proc on /proc type proc (rw)
sysfs on /sys type sysfs (rw)
devpts on /dev/pts type devpts (rw,gid=5,mode=620)
tmpfs on /dev/shm type tmpfs (rw,rootcontext="system_u:object_r:tmpfs_t:s0")
none on /proc/sys/fs/binfmt_misc type binfmt_misc (rw)
sunrpc on /var/lib/nfs/rpc_pipefs type rpc_pipefs (rw)
10.70.37.41:dist-rep on /var/lib/libvirt/images type fuse.glusterfs (rw,default_permissions,allow_other,max_read=131072)
[root@rhs-client44 glusterfs]# umount /var/lib/libvirt/images
umount: /var/lib/libvirt/images: device is busy.
        (In some cases useful info about processes that use
         the device is found by lsof(8) or fuser(1))

[root@rhs-client44 ~]# umount -l /var/lib/libvirt/images
[root@rhs-client44 ~]# df -h
Filesystem            Size  Used Avail Use% Mounted on
/dev/sda2             1.8T   12G  1.7T   1% /
tmpfs                 7.8G     0  7.8G   0% /dev/shm
[root@rhs-client44 ~]# df -h
Filesystem            Size  Used Avail Use% Mounted on
/dev/sda2             1.8T   12G  1.7T   1% /
tmpfs                 7.8G     0  7.8G   0% /dev/shm
[root@rhs-client44 ~]# mount -t glusterfs 10.70.37.41:dist-rep /var/lib/libvirt/images
[root@rhs-client44 ~]# df -h &
[2] 5420
[root@rhs-client44 ~]# Filesystem            Size  Used Avail Use% Mounted on
/dev/sda2             1.8T   12G  1.7T   1% /
tmpfs                 7.8G     0  7.8G   0% /dev/shm
10.70.37.41:dist-rep  510G  297G  213G  59% /var/lib/libvirt/images

[2]+  Done                    df -h
[root@rhs-client44 ~]# 
[root@rhs-client44 ~]# 
[root@rhs-client44 ~]# 
[root@rhs-client44 ~]# cd /var/lib/libvirt/images
[root@rhs-client44 images]# ls
<No output, hangs>

 
Rebooting the hypervisor worked. The fuse mount got remounted. The VMs were in shut-off state. I could recover the VMs by rebooting them. Tried read/write on them; works fine.

[root@rhs-client44 ~]# cd /var/lib/libvirt/images
[root@rhs-client44 images]# ls
appvm1.qcow2  appvm2.qcow2  appvm3.qcow2  appvm4.qcow2  appvm5.qcow2  appvm6.img  appvm7.img  appvm8.img
Comment 5 raghav 2013-05-17 08:47:24 EDT
Two points here:

1) It looks like, w.r.t. the VMs, the hypervisor did not have a proper recovery mechanism for when a volume goes offline. That is why the hypervisor had to be rebooted to get the VMs working. But since they did restart properly, it looks like the VM disks were not corrupted.
So I suggest you remove the statement about corruption of the application VMs.

2) Also, when the servers were offline, df -h as you reported gives the error "Transport endpoint is not connected". This is fine, since the volume is not in a connected state.
However, as soon as the servers are brought online, df should show the glusterfs volume mounted. From your logs that does not seem to be the case, and you had to manually unmount/mount.
Can you repeat this test case without the involvement of VMs and check if it works, i.e.:
(a) create a similar volume setup
(b) detach some nodes. df should show "transport endpoint not connected"
(c) attach them again. df should show the proper volume immediately.
If it works, then we need to investigate whether the involvement of the hypervisor is causing df to misbehave; though I cannot think of any reason why :(
Comment 6 shilpa 2013-05-20 06:30:10 EDT
Tested on a volume without VMs attached to it, as suggested above by Raghavan.

It works:
(a) create a similar volume setup:

gluster volume info quorum-vol 
 
Volume Name: quorum-vol
Type: Distributed-Replicate
Volume ID: f6970821-ba07-46a9-ba8d-899ec0b4e5e5
Status: Started
Number of Bricks: 6 x 2 = 12
Transport-type: tcp
Bricks:
Brick1: 10.70.37.41:/rhs/brick2/b1
Brick2: 10.70.37.196:/rhs/brick2/b2
Brick3: 10.70.37.41:/rhs/brick2/b3
Brick4: 10.70.37.196:/rhs/brick2/b4
Brick5: 10.70.37.41:/rhs/brick2/b5
Brick6: 10.70.37.196:/rhs/brick2/b6
Brick7: 10.70.37.81:/rhs/brick2/b7
Brick8: 10.70.37.101:/rhs/brick2/b8
Brick9: 10.70.37.81:/rhs/brick2/b9
Brick10: 10.70.37.101:/rhs/brick2/b10
Brick11: 10.70.37.81:/rhs/brick2/b11
Brick12: 10.70.37.101:/rhs/brick2/b12
Options Reconfigured:
storage.owner-gid: 107
storage.owner-uid: 107
network.remote-dio: enable
cluster.eager-lock: enable
performance.stat-prefetch: off
performance.io-cache: off
performance.read-ahead: off
performance.quick-read: off
cluster.server-quorum-type: server
cluster.server-quorum-ratio: 70%

(b) detach some nodes. df should show "transport endpoint not connected"

df -kh
Filesystem            Size  Used Avail Use% Mounted on
/dev/sda2             1.8T   12G  1.7T   1% /
tmpfs                 7.8G     0  7.8G   0% /dev/shm
10.70.37.41:dist-rep  510G   82G  428G  17% /var/lib/libvirt/images
df: `/mnt/quorum-test': Transport endpoint is not connected

(c) attach them again. df should show the proper volume immediately.

 df -kh
Filesystem            Size  Used Avail Use% Mounted on
/dev/sda2             1.8T   12G  1.7T   1% /
tmpfs                 7.8G     0  7.8G   0% /dev/shm
10.70.37.41:dist-rep  510G   82G  428G  17% /var/lib/libvirt/images
10.70.37.41:/quorum-vol
                      153G  219M  153G   1% /mnt/quorum-test


After bringing the servers back online, the fuse mount point on the client remounted itself and the "ls" command works without any delay.
Comment 7 raghav 2013-05-21 04:35:04 EDT
So it looks like an issue when VMs are present.
Can you please rerun with VMs and provide the following:
1) sos reports
2) strace from the client side (df, ls, ...)
Comment 8 shilpa 2013-05-30 05:24:27 EDT
Re-ran the tests on 3.4.0.8rhs. Saw a new issue where the bricks don't go offline for a long time after bringing down the servers below the quorum ratio.

Brought down 2 out of 4 servers (rhs-vm2 and rhs-vm4) with the server quorum ratio set to 70%. It took a very long time, almost 10 minutes, for the gluster volume status command to show output. Also, during this time the volume was still available on the client side even though the servers were down. It looks like it took a long time to bring all the bricks offline.

During the first ten minutes, these were the outputs recorded:


[root@rhs-vm1 /]# gluster volume status
Another transaction is in progress. Please try again after sometime.
 
[root@rhs-vm1 brick1]# gluster peer status
Number of Peers: 3

Hostname: 10.70.37.75
Uuid: 5c523025-42f6-481d-b9b7-6f7a1e8722b3
State: Peer in Cluster (Disconnected)

Hostname: 10.70.37.208
Uuid: 9ee5530e-aaf7-49d0-aced-5bd5861ba227
State: Peer in Cluster (Connected)

Hostname: 10.70.37.77
Uuid: 9ca181ea-dfac-4d8e-9736-2e810570a620
State: Peer in Cluster (Disconnected)


[root@rhs-vm3 glusterfs]# gluster volume status
Status of volume: dist-rep
Gluster process						Port	Online	Pid
------------------------------------------------------------------------------
Brick 10.70.37.93:/rhs/brick1/b1			49152	Y	2163
Brick 10.70.37.93:/rhs/brick1/b3			49153	Y	2172
Brick 10.70.37.93:/rhs/brick1/b5			49154	Y	2181
Brick 10.70.37.208:/rhs/brick1/b7			N/A	N	N/A
Brick 10.70.37.208:/rhs/brick1/b9			N/A	N	N/A
Brick 10.70.37.208:/rhs/brick1/b11			N/A	N	N/A
Brick 10.70.37.93:/rhs/brick1/b13			49155	Y	6699
NFS Server on localhost					2049	Y	6909
Self-heal Daemon on localhost				N/A	Y	6916
NFS Server on 974ebd1f-15db-4b56-beb9-cced60bb9ac8	2049	Y	7603
Self-heal Daemon on 974ebd1f-15db-4b56-beb9-cced60bb9ac8	N/A	Y	6722



[root@rhs-client44 images]# df -kh
Filesystem            Size  Used Avail Use% Mounted on
/dev/sda2             1.8T   16G  1.7T   1% /
tmpfs                 7.8G     0  7.8G   0% /dev/shm
10.70.37.93:dist-rep  340G  165G  176G  49% /var/lib/libvirt/images

--------------------------------------------------

After 10 minutes, the bricks showed offline and the gluster volume was no longer available on the client:

[root@rhs-vm3 glusterfs]# gluster volume status
Status of volume: dist-rep
Gluster process						Port	Online	Pid
------------------------------------------------------------------------------
Brick 10.70.37.93:/rhs/brick1/b1			N/A	N	N/A
Brick 10.70.37.93:/rhs/brick1/b3			N/A	N	N/A
Brick 10.70.37.93:/rhs/brick1/b5			N/A	N	N/A
Brick 10.70.37.208:/rhs/brick1/b7			N/A	N	N/A
Brick 10.70.37.208:/rhs/brick1/b9			N/A	N	N/A
Brick 10.70.37.208:/rhs/brick1/b11			N/A	N	N/A
Brick 10.70.37.93:/rhs/brick1/b13			N/A	N	N/A
NFS Server on localhost					2049	Y	6909
Self-heal Daemon on localhost				N/A	Y	6916
NFS Server on 974ebd1f-15db-4b56-beb9-cced60bb9ac8	2049	Y	7603
Self-heal Daemon on 974ebd1f-15db-4b56-beb9-cced60bb9ac8	N/A	Y	6722


[root@rhs-client44 images]# df -kh
Filesystem            Size  Used Avail Use% Mounted on
/dev/sda2             1.8T   16G  1.7T   1% /
tmpfs                 7.8G     0  7.8G   0% /dev/shm
df: `/var/lib/libvirt/images': Transport endpoint is not connected

----------------------------------------------

However, the issue reported in comment 4, where the gluster volume was not immediately shown as mounted on the client after bringing the servers back above the quorum ratio, was not seen this time. I could see the volume mounted using df -kh almost immediately.
Comment 9 raghav 2013-06-05 04:07:05 EDT
Shilpa,

Can you please attach the sos reports for the remaining bricks and the client.
Comment 10 raghav 2013-06-05 06:23:14 EDT
*** Bug 960836 has been marked as a duplicate of this bug. ***
Comment 11 Gowrishankar Rajaiyan 2013-06-20 01:01:35 EDT
Please update "Fixed in version" field.
Comment 12 SATHEESARAN 2013-07-30 05:29:15 EDT
Tested this case with RHS 2.1 - glusterfs-3.4.0.12rhs.beta6-1.
Below are my observations:

All the cases below were tested with the following setup:
1. A 4-node cluster for the trusted storage pool
2. Quorum type set to server
   (i.e.) gluster volume set <vol-name> cluster.server-quorum-type server
3. Quorum ratio set to 75%
   (i.e.) gluster volume set all cluster.server-quorum-ratio 75
4. A 6x2 distributed-replicate volume, fuse mounted

Case 1 : With normal files on glusterfs volume
==============================================
In this case, I created a 6x2 distributed-replicate volume, mounted it on the client, and tested the scenario both when quorum is met and when it is not.
1. When quorum is not met, "df -Th" gives the error "Transport endpoint not connected".
2. When one of the RHS nodes is brought up, so that quorum is met, the mountpoint becomes active again, i.e. "df -Th" and "ls" on the mountpoint succeed with correct information.

Case 2 : With VM image files on glusterfs volume, without VMs accessing them
=============================================================================
In this case, I used the same 6x2 volume, but tagged it with group virt:
(i.e.) gluster volume set <vol-name> group virt

Set storage.owner-uid and storage.owner-gid to 107/107 (corresponding to qemu/qemu):
(i.e.) gluster volume set <vol-name> storage.owner-uid 107
       gluster volume set <vol-name> storage.owner-gid 107

Created 5 qcow2 VM images with preallocation=metadata, each of size 20G:

(i.e.) qemu-img create -f qcow2 -o preallocation=metadata <Img-file> 20G

This case also works like case 1, both when quorum is met and when it is not.
No issues or crashes were seen.

Case 3 : With VM image files on glusterfs volume, with VMs accessing them
===========================================================================
In this case, I used the same 6x2 distributed-replicate volume, tagged with group virt and with uid/gid set to 107/107.

(i.e.) gluster volume set <vol-name> group virt
       gluster volume set <vol-name> storage.owner-uid 107
       gluster volume set <vol-name> storage.owner-gid 107


Created 5 qcow2 VM images with preallocation=metadata, each of size 20G:

(i.e.) qemu-img create -f qcow2 -o preallocation=metadata <Img-file> 20G

Created VMs using the previously created image files:
(i.e.) virt-install --name <vm-name> --ram 2048 --vcpus 2 --pxe --disk <img-created-on-gluster-volume>

1. When one node goes down, there are no issues. VMs are healthy and seem to work fine.

2. When quorum is not met, "df -Th" throws the error "transport endpoint not connected", and "ls" also fails on the mountpoint.

3. When quorum is not met, the VMs are not healthy, which is expected, since the image files are not available. I could see EXT4 FS errors in the VMs.

4. When one of the nodes is powered up again, the mountpoint becomes active; "df -Th" and "ls" on the mountpoint work normally as before.

5. Restarted the VMs using virsh commands, but the virsh commands hung for a long time.

An strace of the virsh command is captured and attached.

6. While restarting libvirtd, stopping libvirtd failed,
but I could start libvirtd again.

<Console_logs>
[Tue Jul 30 07:51:52 UTC 2013 root@10.70.36.32:~ ] # service libvirtd restart
Stopping libvirtd daemon:                                  [FAILED]
[Tue Jul 30 07:52:11 UTC 2013 root@10.70.36.32:~ ] # service libvirtd status
libvirtd dead but subsys locked
[Tue Jul 30 07:52:17 UTC 2013 root@10.70.36.32:~ ] # service libvirtd start
Starting libvirtd daemon:                                  [  OK  ]
[Tue Jul 30 07:52:22 UTC 2013 root@10.70.36.32:~ ] # virsh list --all
 Id    Name                           State
----------------------------------------------------
 8     appvm3                         running
 9     appvm4                         running
 10    appvm5                         running
 -     appvm1                         shut off
 -     appvm2                         shut off
</Console_logs>

libvirtd version :
[Tue Jul 30 08:32:17 UTC 2013 root@10.70.36.32:~ ] # libvirtd --version
libvirtd (libvirt) 0.10.2

7. Checked the qcow2 images for possible corruption:
   (i.e.) qemu-img check <image-on-glusterfs-volume>
   All images were found to be good.

8. After restarting libvirtd, I could power down all VMs (which had seen errors while quorum was not met and the mount point was unavailable) and restart them.
All VMs came up and seemed to be in a perfectly healthy state.
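The per-image corruption check in step 7 can be scripted when there are many images. A minimal sketch, assuming the usual behavior of `qemu-img check` (exit status 0 and a "No errors were found" summary line for a clean image); the function names and image paths are illustrative:

```python
import subprocess

def check_output_is_clean(output: str, returncode: int) -> bool:
    """Interpret `qemu-img check` results: exit status 0 plus the
    "No errors were found" summary line means the image is clean."""
    return returncode == 0 and "No errors were found" in output

def image_is_clean(image_path: str) -> bool:
    """Run `qemu-img check` on one image (requires qemu-img installed)."""
    result = subprocess.run(["qemu-img", "check", image_path],
                            capture_output=True, text=True)
    return check_output_is_clean(result.stdout, result.returncode)

# Hypothetical usage over the images from this report:
# for img in ["appvm1.qcow2", "appvm2.qcow2"]:
#     print(img, "clean" if image_is_clean(img) else "needs attention")
```

Splitting the output parsing from the subprocess call keeps the interpretation logic testable without a qemu-img binary or real images.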

In all my observations:
1. No glusterfs fuse mount crash.
2. EXT4 FS errors inside the VMs when quorum is not met. This is expected, as the backing image file is no longer available when quorum is not met.
3. After bringing up a node, so that quorum is met again, the mountpoint comes back, but virsh commands seem to hang. Restarting libvirtd solves this issue.

Considering all the points above, moving this bug to VERIFIED.
Comment 13 SATHEESARAN 2013-07-30 07:26:09 EDT
Created attachment 780571 [details]
strace of "virsh list --all" command after quorum is met, where 'virsh' commands hang indefinitely
Comment 14 Scott Haines 2013-09-23 18:29:50 EDT
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. 

For information on the advisory, and where to find the updated files, follow the link below.

If the solution does not work for you, open a new bug report.

http://rhn.redhat.com/errata/RHBA-2013-1262.html
