Red Hat Bugzilla – Full Text Bug Listing
Summary: NFS client on servers causes lock up
Component: nfs
Assignee: Vinayaga Raman <vraman>
Status: CLOSED DUPLICATE
Version: 3.2.3
CC: gluster-bugs, joe, rwheeler
Doc Type: Bug Fix
Story Points: ---
Last Closed: 2012-08-14 08:11:57 EDT
Type: ---
oVirt Team: ---
Description jaw171 2011-09-21 08:22:23 EDT
Created attachment 667
Comment 1 jaw171 2011-09-21 08:22:35 EDT
Created attachment 668
Comment 2 jaw171 2011-09-21 08:22:46 EDT
Created attachment 669
Comment 3 Joe Julian 2011-09-21 08:26:12 EDT
I was able to duplicate these results using three KVM servers running CentOS 5 and GlusterFS 3.2.3.
Comment 4 jaw171 2011-09-21 11:21:21 EDT
I have three test servers also functioning as clients. Each had mounted the volume vol_mainhome via NFS to itself. I then had each server create 250 10 MB files from /dev/zero. I was first bitten by bug 764964 and disabled io-cache, then found another problem. One server created the 250 files with no error; the other two hung in the middle of creating them. df, ps, ls, and others all hang on the two problem servers when trying to get any information from Gluster. This only happens when I run an NFS client on the servers themselves. If I use three separate boxes as NFS clients I have no problem. If I use FUSE on any box I have no problem. Joe Julian stated he was able to reproduce the problem in a similar setup.

[root@gfs-dev-01b ~]# gluster peer status
Number of Peers: 2

Hostname: gfs-dev-02b.cssd.pitt.edu
Uuid: af4fc22b-fa19-407e-828f-8b455ce9113e
State: Peer in Cluster (Connected)

Hostname: gfs-dev-03b.cssd.pitt.edu
Uuid: a6ac1947-b6f3-4894-85dc-9f1144016729
State: Peer in Cluster (Connected)

[root@gfs-dev-01b ~]# gluster volume info
No volumes present

Note: Each brick is a logical volume in LVM using the same volume group the OS is installed to. These are VMware VMs with a single VMDK.

[root@gfs-dev-01b ~]# gluster volume create vol_mainhome replica 2 gfs-dev-01b.cssd.pitt.edu:/bricks/lv_brick_01_001a gfs-dev-03b.cssd.pitt.edu:/bricks/lv_brick_03_001b gfs-dev-01b.cssd.pitt.edu:/bricks/lv_brick_01_002a gfs-dev-03b.cssd.pitt.edu:/bricks/lv_brick_03_002b gfs-dev-02b.cssd.pitt.edu:/bricks/lv_brick_02_003a gfs-dev-03b.cssd.pitt.edu:/bricks/lv_brick_03_003b
Creation of volume vol_mainhome has been successful. Please start the volume to access data.
[root@gfs-dev-01b ~]# gluster volume set vol_mainhome performance.io-cache off
[root@gfs-dev-01b ~]# gluster volume info vol_mainhome

Volume Name: vol_mainhome
Type: Distributed-Replicate
Status: Created
Number of Bricks: 3 x 2 = 6
Transport-type: tcp
Bricks:
Brick1: gfs-dev-01b.cssd.pitt.edu:/bricks/lv_brick_01_001a
Brick2: gfs-dev-03b.cssd.pitt.edu:/bricks/lv_brick_03_001b
Brick3: gfs-dev-01b.cssd.pitt.edu:/bricks/lv_brick_01_002a
Brick4: gfs-dev-03b.cssd.pitt.edu:/bricks/lv_brick_03_002b
Brick5: gfs-dev-02b.cssd.pitt.edu:/bricks/lv_brick_02_003a
Brick6: gfs-dev-03b.cssd.pitt.edu:/bricks/lv_brick_03_003b
Options Reconfigured:
performance.io-cache: off

[root@gfs-dev-01b ~]# gluster volume start vol_mainhome
Starting volume vol_mainhome has been successful

On gfs-dev-01b.cssd.pitt.edu:
mount -t nfs -o tcp,vers=3 gfs-dev-01b.cssd.pitt.edu:/vol_mainhome /mnt/nfs/vol_mainhome
num=0; while [ $num -le 250 ]; do dd bs=1024 count=10000 if=/dev/zero of=/mnt/nfs/vol_mainhome/foo.$num; num=$(($num + 1)); done

On gfs-dev-02b.cssd.pitt.edu:
mount -t nfs -o tcp,vers=3 gfs-dev-02b.cssd.pitt.edu:/vol_mainhome /mnt/nfs/vol_mainhome
num=251; while [ $num -le 500 ]; do dd bs=1024 count=10000 if=/dev/zero of=/mnt/nfs/vol_mainhome/foo.$num; num=$(($num + 1)); done

On gfs-dev-03b.cssd.pitt.edu:
mount -t nfs -o tcp,vers=3 gfs-dev-03b.cssd.pitt.edu:/vol_mainhome /mnt/nfs/vol_mainhome
num=501; while [ $num -le 750 ]; do dd bs=1024 count=10000 if=/dev/zero of=/mnt/nfs/vol_mainhome/foo.$num; num=$(($num + 1)); done

On this attempt in the attached logs, two servers completed the loop while gfs-dev-02b.cssd.pitt.edu hung.
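[Editor's note] The three per-server loops above differ only in their file-number range. A parameterized sketch of that loop follows; the TARGET/START/END variables and the small file count are illustrative stand-ins (the report writes ~250 10 MB files per server directly to the NFS mount /mnt/nfs/vol_mainhome):

```shell
#!/bin/sh
# Hedged sketch of the per-server write loop from the reproduction steps above.
# TARGET stands in for the NFS mount of vol_mainhome; any writable directory
# works for demonstration. START/END give each server its own number range.
TARGET="${TARGET:-$(mktemp -d)}"
START="${START:-0}"
END="${END:-9}"   # the report used ranges of ~250 files and count=10000

num=$START
while [ "$num" -le "$END" ]; do
    # Write a zero-filled file foo.$num (10 KB here; 10 MB in the report).
    dd bs=1024 count=10 if=/dev/zero of="$TARGET/foo.$num" 2>/dev/null
    num=$((num + 1))
done
echo "wrote $((END - START + 1)) files to $TARGET"
```

Running three copies with START/END set to 0/250, 251/500, and 501/750 against each server's own NFS mount reproduces the layout described above.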
Comment 5 Krishna Srinivas 2011-09-26 03:30:27 EDT
> On this attempt in the attached logs 2 servers completed the loop while
> gfs-dev-02b.cssd.pitt.edu hung.

I see the following log in nfs.log of gfs-dev-02b:

[2011-09-21 10:23:03.133892] C [client-handshake.c:121:rpc_client_ping_timer_expired] 0-vol_mainhome-client-3: server 192.168.10.243:24010 has not responded in the last 42 seconds, disconnecting.

Can you check if there was a network problem?
Comment 6 Krishna Srinivas 2011-09-26 03:32:05 EDT
(In reply to comment #4)
> I was able to duplicate these results using 3 kvm servers running centos 5 and
> GlusterFS 3.2.3

Can you give us nfs.log? I was not able to reproduce the problem.
Comment 7 jaw171 2011-09-26 05:41:51 EDT
(In reply to comment #5)
> Can you check if there was a network problem?

These were 3 VMs on the same physical host, so a network problem would be pretty much impossible. I saw no issues with any physical host either. I'm not sure how you couldn't replicate it, as I can cause the lockup over and over again. What distro were you using? I was using RHEL 6.1.
Comment 8 Joe Julian 2011-09-26 07:24:30 EDT
Mine were also all on the same physical host.
Comment 9 Krishna Srinivas 2011-09-28 04:55:55 EDT
I am able to reproduce the hang on VMs but not on physical machines. Investigating.
Comment 10 Krishna Srinivas 2011-09-30 03:06:59 EDT
(In reply to comment #9)
> I am able to reproduce hang on VMs but not on physical machines. Investigating.

I see the following in the logs, i.e. AFR self-heal gets triggered, causing the NFS client application to hang. This can be reproduced by a single loop of:

num=0; while [ $num -le 50 ]; do dd bs=1024 count=10000 if=/dev/zero of=/mnt/foo.$num; num=$(($num + 1)); done

[2011-09-29 21:52:59.651353] I [afr-common.c:649:afr_lookup_self_heal_check] 0-test-replicate-0: size differs for /foo.48
[2011-09-29 21:52:59.651753] I [afr-common.c:811:afr_lookup_done] 0-test-replicate-0: background data self-heal triggered. path: /foo.48
[2011-09-29 21:52:59.791785] I [afr-self-heal-algorithm.c:520:sh_diff_loop_driver_done] 0-test-replicate-0: diff self-heal on /foo.48: completed. (0 blocks of 78 were different (0.00%))
[2011-09-29 21:52:59.793055] I [client3_1-fops.c:1640:client3_1_setattr_cbk] 0-test-client-0: remote operation failed: No such file or directory
[2011-09-29 21:52:59.793075] I [afr-self-heal-data.c:102:afr_sh_data_flush_cbk] 0-test-replicate-0: flush or setattr failed on /foo.48 on subvolume test-client-0: No such file or directory
[2011-09-29 21:52:59.793524] I [client3_1-fops.c:1640:client3_1_setattr_cbk] 0-test-client-1: remote operation failed: No such file or directory
[2011-09-29 21:52:59.793571] I [afr-self-heal-data.c:102:afr_sh_data_flush_cbk] 0-test-replicate-0: flush or setattr failed on /foo.48 on subvolume test-client-1: No such file or directory
[2011-09-29 21:52:59.793616] I [afr-self-heal-common.c:1557:afr_self_heal_completion_cbk] 0-test-replicate-0: background data data self-heal completed on /foo.48
[2011-09-29 21:59:20.87126] I [nfs.c:704:init] 0-nfs: NFS service started