Created attachment 667
Created attachment 668
Created attachment 669
I was able to duplicate these results using 3 KVM servers running CentOS 5 and GlusterFS 3.2.3.
I have three test servers that also function as clients. Each one mounted the volume vol_mainhome via NFS from itself. I then had each server create 250 10MB files from /dev/zero. I was first bitten by bug 764964; after disabling io-cache I found another problem. One server created its 250 files with no error. The other two hung in the middle of creating them. df, ps, ls, and other commands all hang on the two problem servers when trying to get any information from Gluster. This only happens when I run an NFS client on the servers themselves. If I use three separate boxes as NFS clients, I have no problem. If I use FUSE on any box, I have no problem. Joe Julian stated he was able to reproduce the problem in a similar setup.

[root@gfs-dev-01b ~]# gluster peer status
Number of Peers: 2

Hostname: gfs-dev-02b.cssd.pitt.edu
Uuid: af4fc22b-fa19-407e-828f-8b455ce9113e
State: Peer in Cluster (Connected)

Hostname: gfs-dev-03b.cssd.pitt.edu
Uuid: a6ac1947-b6f3-4894-85dc-9f1144016729
State: Peer in Cluster (Connected)

[root@gfs-dev-01b ~]# gluster volume info
No volumes present

Note: Each brick is a logical volume in LVM using the same volume group the OS is installed to. These are VMware VMs with a single VMDK.

[root@gfs-dev-01b ~]# gluster volume create vol_mainhome replica 2 gfs-dev-01b.cssd.pitt.edu:/bricks/lv_brick_01_001a gfs-dev-03b.cssd.pitt.edu:/bricks/lv_brick_03_001b gfs-dev-01b.cssd.pitt.edu:/bricks/lv_brick_01_002a gfs-dev-03b.cssd.pitt.edu:/bricks/lv_brick_03_002b gfs-dev-02b.cssd.pitt.edu:/bricks/lv_brick_02_003a gfs-dev-03b.cssd.pitt.edu:/bricks/lv_brick_03_003b
Creation of volume vol_mainhome has been successful. Please start the volume to access data.

[root@gfs-dev-01b ~]# gluster volume set vol_mainhome performance.io-cache off

[root@gfs-dev-01b ~]# gluster volume info vol_mainhome

Volume Name: vol_mainhome
Type: Distributed-Replicate
Status: Created
Number of Bricks: 3 x 2 = 6
Transport-type: tcp
Bricks:
Brick1: gfs-dev-01b.cssd.pitt.edu:/bricks/lv_brick_01_001a
Brick2: gfs-dev-03b.cssd.pitt.edu:/bricks/lv_brick_03_001b
Brick3: gfs-dev-01b.cssd.pitt.edu:/bricks/lv_brick_01_002a
Brick4: gfs-dev-03b.cssd.pitt.edu:/bricks/lv_brick_03_002b
Brick5: gfs-dev-02b.cssd.pitt.edu:/bricks/lv_brick_02_003a
Brick6: gfs-dev-03b.cssd.pitt.edu:/bricks/lv_brick_03_003b
Options Reconfigured:
performance.io-cache: off

[root@gfs-dev-01b ~]# gluster volume start vol_mainhome
Starting volume vol_mainhome has been successful

On gfs-dev-01b.cssd.pitt.edu:

mount -t nfs -o tcp,vers=3 gfs-dev-01b.cssd.pitt.edu:/vol_mainhome /mnt/nfs/vol_mainhome
num=0; while [ $num -le 250 ]; do dd bs=1024 count=10000 if=/dev/zero of=/mnt/nfs/vol_mainhome/foo.$num; num=$(($num + 1)); done

On gfs-dev-02b.cssd.pitt.edu:

mount -t nfs -o tcp,vers=3 gfs-dev-02b.cssd.pitt.edu:/vol_mainhome /mnt/nfs/vol_mainhome
num=251; while [ $num -le 500 ]; do dd bs=1024 count=10000 if=/dev/zero of=/mnt/nfs/vol_mainhome/foo.$num; num=$(($num + 1)); done

On gfs-dev-03b.cssd.pitt.edu:

mount -t nfs -o tcp,vers=3 gfs-dev-03b.cssd.pitt.edu:/vol_mainhome /mnt/nfs/vol_mainhome
num=501; while [ $num -le 750 ]; do dd bs=1024 count=10000 if=/dev/zero of=/mnt/nfs/vol_mainhome/foo.$num; num=$(($num + 1)); done

On this attempt, in the attached logs, two servers completed the loop while gfs-dev-02b.cssd.pitt.edu hung.
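For anyone re-running this: the per-server steps above differ only in the local hostname and the file-number range, so they can be collapsed into one small helper. This is just a sketch of the steps already listed, not part of the original report; START and END are hypothetical parameters, and everything else is taken verbatim from the commands above.

#!/bin/bash
# Repro helper: mount vol_mainhome via NFS from this host, then write
# files foo.$START .. foo.$END (10MB each from /dev/zero), as above.
START=${1:-0}
END=${2:-250}
HOST=$(hostname -f)
mount -t nfs -o tcp,vers=3 $HOST:/vol_mainhome /mnt/nfs/vol_mainhome
num=$START
while [ $num -le $END ]; do
    dd bs=1024 count=10000 if=/dev/zero of=/mnt/nfs/vol_mainhome/foo.$num
    num=$(($num + 1))
done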
> On this attempt, in the attached logs, two servers completed the loop while
> gfs-dev-02b.cssd.pitt.edu hung.

I see the following log in nfs.log of gfs-dev-02b:

[2011-09-21 10:23:03.133892] C [client-handshake.c:121:rpc_client_ping_timer_expired] 0-vol_mainhome-client-3: server 192.168.10.243:24010 has not responded in the last 42 seconds, disconnecting.

Can you check if there was a network problem?
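The 42-second window in that message is GlusterFS's default network.ping-timeout. As an aside not from the original thread, one way to check whether the disconnect is simply a slow-but-alive server is to inspect the volume options and, as an experiment, raise the timeout; the value 120 below is an arbitrary example.

# Show current volume options (ping-timeout is 42s unless reconfigured):
gluster volume info vol_mainhome
# Experimentally raise the timeout to rule out a slow-but-alive server:
gluster volume set vol_mainhome network.ping-timeout 120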
(In reply to comment #4)
> I was able to duplicate these results using 3 KVM servers running CentOS 5
> and GlusterFS 3.2.3.

Can you give us nfs.log? I was not able to reproduce the problem.
(In reply to comment #5)
> Can you check if there was a network problem?

These were 3 VMs on the same physical host, so a network problem would be pretty much impossible. I saw no issues with any physical host either. I'm not sure how you couldn't replicate it, as I can cause the lockup over and over again. What distro were you using? I was using RHEL 6.1.
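For completeness, a quick connectivity sanity check against the brick named in the ping-timer message would look something like the following; the host/port pair is taken from the nfs.log excerpt in comment #5, and this is only a suggested check, not something run in the original thread.

# From gfs-dev-02b, probe the server named in the ping-timer message:
ping -c 3 192.168.10.243
# Test that the brick's TCP port accepts connections (-z: scan only):
nc -z -w 5 192.168.10.243 24010 && echo "port open" || echo "no connection"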
Mine were also all on the same physical host.
I am able to reproduce the hang on VMs but not on physical machines. Investigating.
(In reply to comment #9)
> I am able to reproduce the hang on VMs but not on physical machines.
> Investigating.

I see the following in the logs, i.e. afr self-heal gets triggered, causing the NFS client application to hang. This can be reproduced with a single loop of:

num=0; while [ $num -le 50 ]; do dd bs=1024 count=10000 if=/dev/zero of=/mnt/foo.$num; num=$(($num + 1)); done

[2011-09-29 21:52:59.651353] I [afr-common.c:649:afr_lookup_self_heal_check] 0-test-replicate-0: size differs for /foo.48
[2011-09-29 21:52:59.651753] I [afr-common.c:811:afr_lookup_done] 0-test-replicate-0: background data self-heal triggered. path: /foo.48
[2011-09-29 21:52:59.791785] I [afr-self-heal-algorithm.c:520:sh_diff_loop_driver_done] 0-test-replicate-0: diff self-heal on /foo.48: completed. (0 blocks of 78 were different (0.00%))
[2011-09-29 21:52:59.793055] I [client3_1-fops.c:1640:client3_1_setattr_cbk] 0-test-client-0: remote operation failed: No such file or directory
[2011-09-29 21:52:59.793075] I [afr-self-heal-data.c:102:afr_sh_data_flush_cbk] 0-test-replicate-0: flush or setattr failed on /foo.48 on subvolume test-client-0: No such file or directory
[2011-09-29 21:52:59.793524] I [client3_1-fops.c:1640:client3_1_setattr_cbk] 0-test-client-1: remote operation failed: No such file or directory
[2011-09-29 21:52:59.793571] I [afr-self-heal-data.c:102:afr_sh_data_flush_cbk] 0-test-replicate-0: flush or setattr failed on /foo.48 on subvolume test-client-1: No such file or directory
[2011-09-29 21:52:59.793616] I [afr-self-heal-common.c:1557:afr_self_heal_completion_cbk] 0-test-replicate-0: background data data self-heal completed on /foo.48
[2011-09-29 21:59:20.87126] I [nfs.c:704:init] 0-nfs: NFS service started
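To watch the self-heal trigger correlate with the hang while reproducing, something like the following should work; this is a sketch, not part of the original comment, and it assumes the Gluster NFS server logs to the default /var/log/glusterfs/nfs.log (adjust if configured differently).

# Terminal 1: watch for self-heal messages in the Gluster NFS log:
tail -f /var/log/glusterfs/nfs.log | grep -i 'self-heal'
# Terminal 2: run the repro loop against the local NFS mount:
num=0; while [ $num -le 50 ]; do dd bs=1024 count=10000 if=/dev/zero of=/mnt/foo.$num; num=$(($num + 1)); done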
This behavior is expected: this setup, with an NFS client mounting from the Gluster NFS server on the same machine, leads to a deadlock. More info in the logs of bug 764052 and bug 763983.

*** This bug has been marked as a duplicate of bug 764052 ***
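Since the reporter noted in comment #4 that FUSE mounts on the same boxes do not hit the problem, the obvious workaround is a native mount instead of a local NFS mount. A sketch, reusing the mount point from the repro steps; this is not part of the closing comment.

# Replace the local NFS mount with a native GlusterFS (FUSE) mount,
# which the reporter found does not hang:
umount /mnt/nfs/vol_mainhome
mount -t glusterfs gfs-dev-01b.cssd.pitt.edu:/vol_mainhome /mnt/nfs/vol_mainhome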