Description of problem:
While doing NFS fail-over tests on 2 replicated GlusterFS nodes, the NFS clients produce errors:

ls: reading directory test1/: Too many levels of symbolic links
Jan 30 13:41:58 xx kernel: [492498.250168] NFS: directory content/test1 contains a readdir loop.Please contact your server vendor. The file: 1359547768463822182 has duplicate cookie 1150239009500529518

Version-Release number of selected component (if applicable):
kernel-2.6.32-279.19.1.el6
glusterfs-3.3.1-1.el6.x86_64
glusterfs-geo-replication-3.3.1-1.el6.x86_64
glusterfs-server-3.3.1-1.el6.x86_64
glusterfs-rdma-3.3.1-1.el6.x86_64
glusterfs-fuse-3.3.1-1.el6.x86_64

How reproducible:
Running a 'touch' loop on the clients while moving the VIPs around.

Steps to Reproduce:
start state:
gluster node1 has vip1
gluster node2 has vip2
nfs test client1 mounts from vip1
nfs test client2 mounts from vip2
nfs test client3 mounts from vip1
nfs test client4 mounts from vip2
nfs test client1 does a while true touch loop in nfs folder test1
nfs test client2 does a while true touch loop in nfs folder test2
nfs test client3 does a ls count watch loop in nfs folders test1 and test2
nfs test client4 does a ls count watch loop in nfs folders test1 and test2

test commands:
while true; do touch `date +%s%N`; sleep 1 ;done
watch 'echo -n "test1 ";ls test1/ | wc -l; echo -n "test2 "; ls test2/ | wc -l'

tests:
move vip1 to gluster node2
move vip2 to gluster node1
move vip1 to gluster node1
move vip2 to gluster node2

Actual results:
Clients do not lose their NFS mount, even if it sometimes takes a few minutes to recover while hanging.
ls: reading directory test1/: Too many levels of symbolic links
ls: reading directory test1/: Too many levels of symbolic links
ls: reading directory test1/: Too many levels of symbolic links
ls: reading directory test1/: Too many levels of symbolic links
ls: reading directory test2/: Too many levels of symbolic links
ls: reading directory test2/: Too many levels of symbolic links
ls: reading directory test2/: Too many levels of symbolic links
ls: reading directory test2/: Too many levels of symbolic links
ls: reading directory test2/: Too many levels of symbolic links

Jan 30 13:38:40 xx kernel: [492300.447793] NFS: directory content/test1 contains a readdir loop.Please contact your server vendor. The file: 1359548658577264637-64.so.2 has duplicate cookie 2975285876436019500
Jan 30 13:38:40 xx kernel: [492300.448100] NFS: directory content/test1 contains a readdir loop.Please contact your server vendor. The file: 1359548658577264637-64.so.2 has duplicate cookie 2975285876436019500
Jan 30 13:41:58 xx kernel: [492498.250168] NFS: directory content/test1 contains a readdir loop.Please contact your server vendor. The file: 1359547768463822182 has duplicate cookie 1150239009500529518
Jan 30 13:41:58 xx kernel: [492498.250377] NFS: directory content/test1 contains a readdir loop.Please contact your server vendor. The file: 1359547768463822182 has duplicate cookie 1150239009500529518
Jan 30 13:42:47 xx kernel: [492547.683880] NFS: directory content/test1 contains a readdir loop.Please contact your server vendor. The file: 1359549713888590789<F9>Lq<99>><FB><CF><BA> ^X<98><AD>^\1359549372849716119^A<A0><C3>^D<9E><FC>f<B8>[ has duplicate cookie 5323278086631868562
Jan 30 13:42:47 xx kernel: [492547.683987] NFS: directory content/test1 contains a readdir loop.Please contact your server vendor. The file: 1359549713888590789<F9>Lq<99>><FB><CF><BA> ^X<98><AD>^\1359549372849716119^A<A0><C3>^D<9E><FC>f<B8>[ has duplicate cookie 5323278086631868562
Jan 30 13:42:50 xx kernel: [492550.672743] NFS: directory content/test1 contains a readdir loop.Please contact your server vendor. The file: 1359549516038123212<E6> has duplicate cookie 7680743801850642182
Jan 30 13:42:50 xx kernel: [492550.672904] NFS: directory content/test1 contains a readdir loop.Please contact your server vendor. The file: 1359549516038123212<E6> has duplicate cookie 7680743801850642182
Jan 30 13:42:51 xx kernel: [492551.182458] NFS: directory content/test2 contains a readdir loop.Please contact your server vendor. The file: 1359547974591455205-64.so.2 has duplicate cookie 8076066796481300716
Jan 30 13:42:51 xx kernel: [492551.182569] NFS: directory content/test2 contains a readdir loop.Please contact your server vendor. The file: 1359547974591455205-64.so.2 has duplicate cookie 8076066796481300716
Jan 30 13:43:00 xx kernel: [492560.072294] NFS: directory content/test2 contains a readdir loop.Please contact your server vendor. The file: 1359548346453256184 has duplicate cookie 1703179529881495508
Jan 30 13:43:00 xx kernel: [492560.072600] NFS: directory content/test2 contains a readdir loop.Please contact your server vendor. The file: 1359548346453256184 has duplicate cookie 1703179529881495508

Expected results:
Maybe some hiccups during fail-over, but no errors.

Additional info:
I'm testing on Amazon EC2, so the IP fail-over is done by detaching and attaching secondary IPs. The file system on the bricks is ext4.
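For anyone comparing runs, the kernel messages above can be summarised per directory. This is only a rough sketch, not part of the original test setup: the function name summarize_readdir_loops is made up for illustration, and it assumes each kernel message sits on a single line and that directory names contain no whitespace.

```shell
#!/bin/sh
# Summarise "readdir loop" kernel messages read from stdin.
# Output, one line per directory: <directory> <messages> <distinct cookies>
# Assumption: one kernel message per line, directory names without spaces.
summarize_readdir_loops() {
    awk '
        /contains a readdir loop/ {
            dir = ""; cookie = ""
            for (i = 1; i <= NF; i++) {
                # The directory follows the word "directory",
                # the cookie follows the word "cookie".
                if ($i == "directory") dir = $(i + 1)
                if ($i == "cookie")    cookie = $(i + 1)
            }
            if (dir == "") next
            msgs[dir]++
            if (cookie != "" && !((dir, cookie) in seen)) {
                seen[dir, cookie] = 1
                cookies[dir]++
            }
        }
        END {
            for (d in msgs) printf "%s %d %d\n", d, msgs[d], cookies[d]
        }
    ' | sort
}
```

It could be fed from the client's syslog, for example with "grep 'readdir loop' /var/log/messages | summarize_readdir_loops" (the log path is an assumption and differs between distributions).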
We switched to detaching and attaching a dedicated NFS network interface (ENI) on failover to the other server, instead of only detaching and attaching the IP address. This way of failing over does not show the problem. If anyone is still interested in debugging the issue and wants me to test a version of GlusterFS that includes a patch potentially fixing the initial problem, I'm willing to test it.
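For readers who want to try the same approach, moving an ENI between instances could look roughly like the sketch below. It is an illustration only: it assumes the current AWS CLI ("aws ec2 ..."), which is not necessarily the tooling used in the tests above, and the function name move_eni, the ENI ID and the instance ID are hypothetical.

```shell
#!/bin/sh
# Sketch: move a network interface (ENI) to another EC2 instance.
# Usage: move_eni <eni-id> <target-instance-id> [device-index]
# Assumption: the AWS CLI is installed and has credentials configured.
move_eni() {
    eni_id=$1
    target_instance=$2
    device_index=${3:-1}

    # Look up the current attachment; the CLI prints "None" if detached.
    attachment_id=$(aws ec2 describe-network-interfaces \
        --network-interface-ids "$eni_id" \
        --query 'NetworkInterfaces[0].Attachment.AttachmentId' \
        --output text) || return 1

    if [ -n "$attachment_id" ] && [ "$attachment_id" != "None" ]; then
        # Detach from the current instance and wait until the ENI is free.
        aws ec2 detach-network-interface --attachment-id "$attachment_id" || return 1
        aws ec2 wait network-interface-available --network-interface-ids "$eni_id" || return 1
    fi

    # Attach to the node that takes over the NFS service.
    aws ec2 attach-network-interface \
        --network-interface-id "$eni_id" \
        --instance-id "$target_instance" \
        --device-index "$device_index"
}
```

A call would then be something like "move_eni eni-0123 i-0456", with both IDs being placeholders.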
We investigated this and did not find it to be an issue: we tested with kernel-2.6.32-358.el6.x86_64 and the problem is not seen there. The "Too many levels of symbolic links" error is most likely due to a bug in the NFS client. Can you please check whether you still face the issue with the latest kernel?
This could indeed be an issue with the NFS client (provided by the Linux kernel). Could you let us know if you can still reproduce this problem on more recent kernel versions? If you can reproduce it, can you let us know the exact steps to do so? One of the most important things would be the number of files in the directory. Capturing a tcpdump on the NFS client (with "-s 0") and the matching logs should help in analysing this behaviour too.
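The requested details could be gathered on the NFS client along these lines. This is a sketch under assumptions: the function names, interface name, server address and output path are placeholders, and tcpdump normally needs to be run as root.

```shell
#!/bin/sh
# Count the entries in a directory (hidden files included, "." and ".." not).
count_entries() {
    echo $(( $(ls -A "$1" | wc -l) ))
}

# Capture full packets ("-s 0") exchanged with the NFS server into a file.
# Usage: capture_nfs <interface> <server-ip> <output-file>
# Filtering on the host instead of a port avoids guessing which port the
# NFS server is listening on.
capture_nfs() {
    iface=$1
    server=$2
    outfile=$3
    tcpdump -i "$iface" -s 0 -w "$outfile" host "$server"
}
```

For example, "count_entries test1" gives the file count asked for, and "capture_nfs eth0 192.0.2.10 /tmp/nfs-failover.pcap" (all three arguments are placeholders) would be started before moving the VIP and stopped with Ctrl-C once the error has shown up.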
I no longer have a test setup to reproduce these tests. For me the problem was solved by using a different way of moving the IP between the NFS servers, as described in the comment from 2013-09-12 04:54:19.
Okay, thanks. I'll close this out for now. If the issue returns, please open a new bug.