Bug 765335 - (GLUSTER-3603) NFS client on servers causes lock up
Status: CLOSED DUPLICATE of bug 764052
Product: GlusterFS
Classification: Community
Component: nfs
Version: 3.2.3
Hardware: x86_64 Linux
Priority: medium
Severity: medium
Assigned To: Vinayaga Raman
Reported: 2011-09-21 11:21 EDT by jaw171
Modified: 2015-12-01 11:45 EST
CC: 3 users

Doc Type: Bug Fix
Last Closed: 2012-08-14 08:11:57 EDT


Attachments
/var/log/glusterfs from gfs-dev-01b (7.33 KB, application/x-compressed-tar)
2011-09-21 08:22 EDT, jaw171
/var/log/glusterfs from gfs-dev-02b (6.39 KB, application/x-compressed-tar)
2011-09-21 08:22 EDT, jaw171
/var/log/glusterfs from gfs-dev-03b (7.44 KB, application/x-compressed-tar)
2011-09-21 08:22 EDT, jaw171

Description jaw171 2011-09-21 08:22:23 EDT
Created attachment 667
Comment 1 jaw171 2011-09-21 08:22:35 EDT
Created attachment 668
Comment 2 jaw171 2011-09-21 08:22:46 EDT
Created attachment 669
Comment 3 Joe Julian 2011-09-21 08:26:12 EDT
I was able to duplicate these results using 3 KVM servers running CentOS 5 and GlusterFS 3.2.3.
Comment 4 jaw171 2011-09-21 11:21:21 EDT
I have three test servers also functioning as clients. Each had mounted the volume vol_mainhome via NFS to itself. I then had each server create 250 10 MB files from /dev/zero. I was first bitten by bug 764964, so I disabled io-cache, and then found another problem. One server created the 250 files with no error. The other two hung in the middle of creating them. df, ps, ls, and other commands all hang on the two problem servers when trying to get any information from Gluster.

This only happens when I run an NFS client on the servers themselves.  If I use three separate boxes with NFS I have no problem.  If I use FUSE on any box I have no problem.  Joe Julian stated he was able to reproduce the problem in a similar setup.
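
A quick way to confirm this kind of hang (a sketch, not part of the run described above) is to look for processes stuck in uninterruptible sleep and to test whether the mount still answers at all:

# Illustrative commands only; the mount point matches the one used below.
ps -eo pid,stat,wchan:32,args | awk '$2 ~ /^D/'
# If this blocks or times out, the NFS mount is wedged.
timeout 10 stat /mnt/nfs/vol_mainhome || echo "mount appears hung"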

[root@gfs-dev-01b ~]# gluster peer status
Number of Peers: 2

Hostname: gfs-dev-02b.cssd.pitt.edu
Uuid: af4fc22b-fa19-407e-828f-8b455ce9113e
State: Peer in Cluster (Connected)

Hostname: gfs-dev-03b.cssd.pitt.edu
Uuid: a6ac1947-b6f3-4894-85dc-9f1144016729
State: Peer in Cluster (Connected)


[root@gfs-dev-01b ~]# gluster volume info
No volumes present


Note: Each brick is a logical volume in LVM using the same volume group the OS is installed to.  These are VMware VMs with a single VMDK.
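
For context, a brick of that kind would be prepared roughly as follows (the volume group name and size here are illustrative, not taken from this setup):

# Illustrative only: carve a logical volume out of the OS volume group,
# put a filesystem on it, and mount it at the brick path.
lvcreate -L 10G -n lv_brick_01_001a vg_os      # vg_os is an assumed VG name
mkfs.ext4 /dev/vg_os/lv_brick_01_001a
mkdir -p /bricks/lv_brick_01_001a
mount /dev/vg_os/lv_brick_01_001a /bricks/lv_brick_01_001a
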
[root@gfs-dev-01b ~]# gluster volume create vol_mainhome replica 2 gfs-dev-01b.cssd.pitt.edu:/bricks/lv_brick_01_001a gfs-dev-03b.cssd.pitt.edu:/bricks/lv_brick_03_001b gfs-dev-01b.cssd.pitt.edu:/bricks/lv_brick_01_002a gfs-dev-03b.cssd.pitt.edu:/bricks/lv_brick_03_002b gfs-dev-02b.cssd.pitt.edu:/bricks/lv_brick_02_003a gfs-dev-03b.cssd.pitt.edu:/bricks/lv_brick_03_003b
Creation of volume vol_mainhome has been successful. Please start the volume to access data.


[root@gfs-dev-01b ~]# gluster volume set vol_mainhome performance.io-cache off

 
[root@gfs-dev-01b ~]# gluster volume info vol_mainhome

Volume Name: vol_mainhome
Type: Distributed-Replicate
Status: Created
Number of Bricks: 3 x 2 = 6
Transport-type: tcp
Bricks:
Brick1: gfs-dev-01b.cssd.pitt.edu:/bricks/lv_brick_01_001a
Brick2: gfs-dev-03b.cssd.pitt.edu:/bricks/lv_brick_03_001b
Brick3: gfs-dev-01b.cssd.pitt.edu:/bricks/lv_brick_01_002a
Brick4: gfs-dev-03b.cssd.pitt.edu:/bricks/lv_brick_03_002b
Brick5: gfs-dev-02b.cssd.pitt.edu:/bricks/lv_brick_02_003a
Brick6: gfs-dev-03b.cssd.pitt.edu:/bricks/lv_brick_03_003b
Options Reconfigured:
performance.io-cache: off

 
[root@gfs-dev-01b ~]# gluster volume start vol_mainhome
Starting volume vol_mainhome has been successful


On gfs-dev-01b.cssd.pitt.edu:
mount -t nfs -o tcp,vers=3 gfs-dev-01b.cssd.pitt.edu:/vol_mainhome /mnt/nfs/vol_mainhome
num=0;while [ $num -le 250 ];do dd bs=1024 count=10000 if=/dev/zero of=/mnt/nfs/vol_mainhome/foo.$num ;num=$(($num + 1));done

On gfs-dev-02b.cssd.pitt.edu:
mount -t nfs -o tcp,vers=3 gfs-dev-02b.cssd.pitt.edu:/vol_mainhome /mnt/nfs/vol_mainhome
num=251;while [ $num -le 500 ];do dd bs=1024 count=10000 if=/dev/zero of=/mnt/nfs/vol_mainhome/foo.$num ;num=$(($num + 1));done

On gfs-dev-03b.cssd.pitt.edu:
mount -t nfs -o tcp,vers=3 gfs-dev-03b.cssd.pitt.edu:/vol_mainhome /mnt/nfs/vol_mainhome
num=501;while [ $num -le 750 ];do dd bs=1024 count=10000 if=/dev/zero of=/mnt/nfs/vol_mainhome/foo.$num ;num=$(($num + 1));done

On this attempt (the one in the attached logs), two servers completed the loop while gfs-dev-02b.cssd.pitt.edu hung.
Comment 5 Krishna Srinivas 2011-09-26 03:30:27 EDT
> On this attempt in the attached logs 2 servers completed the loop while
> gfs-dev-02b.cssd.pitt.edu hung.

I see the following log in nfs.log of gfs-dev-02b:

[2011-09-21 10:23:03.133892] C [client-handshake.c:121:rpc_client_ping_timer_expired] 0-vol_mainhome-client-3: server 192.168.10.243:24010 has not responded in the last 42 seconds, disconnecting.

Can you check if there was a network problem?
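
For example (illustrative commands, not from the report), from the node that logged the timeout you could verify that the brick address and port named in the message are reachable:

ping -c 3 192.168.10.243
# 24010 is the brick port named in the ping-timeout message above.
timeout 5 bash -c 'echo > /dev/tcp/192.168.10.243/24010' && echo "port 24010 reachable"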
Comment 6 Krishna Srinivas 2011-09-26 03:32:05 EDT
(In reply to comment #4)
> I was able to duplicate these results using 3 kvm servers running centos 5 and
> GlusterFS 3.2.3

Can you give us nfs.log? I was not able to reproduce the problem.
Comment 7 jaw171 2011-09-26 05:41:51 EDT
(In reply to comment #5)
> Can you check if there was a network problem?

These were 3 VMs on the same physical host, so a network problem would be pretty much impossible. I saw no issues with any physical host either.

I'm not sure why you couldn't reproduce it, as I can cause the lockup over and over again. What distro were you using? I was using RHEL 6.1.
Comment 8 Joe Julian 2011-09-26 07:24:30 EDT
Mine were also all on the same physical host.
Comment 9 Krishna Srinivas 2011-09-28 04:55:55 EDT
I am able to reproduce hang on VMs but not on physical machines. Investigating.
Comment 10 Krishna Srinivas 2011-09-30 03:06:59 EDT
(In reply to comment #9)
> I am able to reproduce hang on VMs but not on physical machines. Investigating.

I see the following in the logs, i.e. AFR self-heal gets triggered, causing the NFS client application to hang.

This can be reproduced by a single loop of:
num=0;while [ $num -le 50 ];do dd bs=1024 count=10000 if=/dev/zero of=/mnt/foo.$num ;num=$(($num + 1));done

[2011-09-29 21:52:59.651353] I [afr-common.c:649:afr_lookup_self_heal_check] 0-test-replicate-0: size differs for /foo.48 
[2011-09-29 21:52:59.651753] I [afr-common.c:811:afr_lookup_done] 0-test-replicate-0: background  data self-heal triggered. path: /foo.48
[2011-09-29 21:52:59.791785] I [afr-self-heal-algorithm.c:520:sh_diff_loop_driver_done] 0-test-replicate-0: diff self-heal on /foo.48: completed. (0 blocks of 78 were different (0.00%))
[2011-09-29 21:52:59.793055] I [client3_1-fops.c:1640:client3_1_setattr_cbk] 0-test-client-0: remote operation failed: No such file or directory
[2011-09-29 21:52:59.793075] I [afr-self-heal-data.c:102:afr_sh_data_flush_cbk] 0-test-replicate-0: flush or setattr failed on /foo.48 on subvolume test-client-0: No such file or directory
[2011-09-29 21:52:59.793524] I [client3_1-fops.c:1640:client3_1_setattr_cbk] 0-test-client-1: remote operation failed: No such file or directory
[2011-09-29 21:52:59.793571] I [afr-self-heal-data.c:102:afr_sh_data_flush_cbk] 0-test-replicate-0: flush or setattr failed on /foo.48 on subvolume test-client-1: No such file or directory
[2011-09-29 21:52:59.793616] I [afr-self-heal-common.c:1557:afr_self_heal_completion_cbk] 0-test-replicate-0: background  data data self-heal completed on /foo.48
[2011-09-29 21:59:20.87126] I [nfs.c:704:init] 0-nfs: NFS service started
Comment 11 Krishna Srinivas 2012-08-14 08:11:57 EDT
This behavior is expected: mounting the volume over NFS on the same server that runs the Gluster NFS process can lead to a deadlock.
More info in bug logs: 764052 763983

*** This bug has been marked as a duplicate of bug 764052 ***
