Description of problem:
========================
I see more than 700 sockets under /proc/$(pgrep glusterfsd)/fd which are possibly stale. I don't think this is expected. Below is a sample set:

lrwx------. 1 root root 64 Jul 10 17:44 12385 -> socket:[3399901]
lrwx------. 1 root root 64 Jul 10 17:44 12386 -> socket:[3399902]
lrwx------. 1 root root 64 Jul 10 17:44 12387 -> socket:[3399903]
lrwx------. 1 root root 64 Jul 10 17:44 12388 -> socket:[3399904]
lrwx------. 1 root root 64 Jul 10 17:44 12389 -> socket:[3399916]
lrwx------. 1 root root 64 Jul 10 17:44 12390 -> socket:[3399917]
lrwx------. 1 root root 64 Jul 10 17:44 12391 -> socket:[3399918]
lrwx------. 1 root root 64 Jul 10 17:44 12392 -> socket:[3399919]
lrwx------. 1 root root 64 Jul 10 17:44 12393 -> socket:[3399920]
lrwx------. 1 root root 64 Jul 10 17:44 12394 -> socket:[3399921]
lrwx------. 1 root root 64 Jul 10 17:44 12395 -> socket:[3399922]
lrwx------. 1 root root 64 Jul 10 17:44 12396 -> socket:[3399923]
lrwx------. 1 root root 64 Jul 10 17:44 12397 -> socket:[3399924]
lrwx------. 1 root root 64 Jul 10 17:44 12398 -> socket:[3399925]
lrwx------. 1 root root 64 Jul 10 17:44 12399 -> socket:[3399926]
lrwx------. 1 root root 64 Jul 10 17:44 12400 -> socket:[3399927]
lrwx------. 1 root root 64 Jul 10 17:44 12401 -> socket:[3399928]
lrwx------. 1 root root 64 Jul 10 17:44 12402 -> socket:[3399929]
lrwx------. 1 root root 64 Jul 10 17:44 12403 -> socket:[3399930]

Version-Release number of selected component (if applicable):
----------------
3.8.4-54.14

How reproducible:
==============
Have run the below steps once and am seeing it on all nodes.

Steps to Reproduce:
=====================
1. Have a 6-node cluster with an 8x3 volume started.
2. Run glusterd restart in a loop from one terminal of n1 and run the gluster v heal command from another terminal of n1 (as part of bz#1595752 verification).
3. Now, from 3 clients simultaneously, keep mounting the same volume using the n1 IP on 1000 directories, i.e. from each client:
   for i in {1..1000}; do mkdir /mnt/vol.$i; mount -t glusterfs n1:vol /mnt/vol.$i; done
4. Wait for the loops in step #3 to complete.
5. Now, in a loop, restart glusterd from terminal t1 of n1 and run quota enable/disable from terminal t2 of n1; you will hit issue bz#1599702 on n1.
6. Now unmount all the mounts on the 3 clients simultaneously, i.e.:
   for i in {1..1000}; do umount /mnt/vol.$i; done
   This will succeed.
7. Stop step 5 and bring the cluster to an idle state.

Actual results:
=============
Look at /proc/<glusterfsd pid>/fd and you will see many sockets, i.e. more than 700.

Workaround:
==========
Reboot the node, or kill all gluster processes and restart glusterd.
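For quickly checking the "Actual results" above, here is a minimal sketch (assuming pgrep, ls and ss are available, and that per-brick glusterfsd processes should be checked individually) that counts the socket fds held by each glusterfsd process and compares them against the connections the kernel still shows for glusterfsd; a large gap suggests most of the fds are stale:

   # Count socket fds held by each glusterfsd process (sketch).
   for pid in $(pgrep glusterfsd); do
       socks=$(ls -l /proc/$pid/fd 2>/dev/null | grep -c 'socket:')
       echo "glusterfsd pid $pid holds $socks socket fds"
   done

   # Number of TCP connections the kernel still attributes to glusterfsd;
   # compare this with the fd counts printed above.
   ss -tnp 2>/dev/null | grep -c glusterfsd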
Logs are the same as for bz#1599702: http://rhsqe-repo.lab.eng.blr.redhat.com/sosreports/nchilaka/bug.1599702/
The lists are available under each server's log directory as glusterfsd.proc.fd.list and lsof_fd.list.
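For anyone re-collecting this data, a rough sketch of how such lists can be regenerated on a server (the file names simply mirror the ones above, /var/log/glusterfs is assumed as the log directory, and a single glusterfsd per node is assumed; with multiple bricks, loop over the pgrep output instead):

   pid=$(pgrep -o glusterfsd)
   # Snapshot of the fd table and of lsof output for the brick process.
   ls -l /proc/$pid/fd > /var/log/glusterfs/glusterfsd.proc.fd.list
   lsof -p $pid        > /var/log/glusterfs/lsof_fd.list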
A similar test is required on top of RHGS 3.4.0 (once released) to see if we should consider this for a batch update. I am not aware whether this is still happening.
Setting a needinfo on Nag based on comment 7.
cleared needinfo accidentally, placing it back
The problem still exists even on 3.12.2-29. I am seeing anywhere between 15 and 75 stale sockets with the same steps as mentioned in the description (the only difference is that I reduced the number of mounts to 500).
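To separate the stale sockets from those still backing live connections, one rough check (a sketch, assuming the brick's connections are TCP and visible to ss with extended info, and a single glusterfsd pid) is to compare the socket inodes in /proc/<pid>/fd against the inodes ss reports for that pid:

   pid=$(pgrep -o glusterfsd)
   # Socket inodes currently held as fds by the brick process.
   ls -l /proc/$pid/fd | grep -o 'socket:\[[0-9]*\]' | tr -d 'socket:[]' | sort > /tmp/fd_inodes
   # Socket inodes the kernel still associates with live connections of that pid.
   ss -tnpe 2>/dev/null | grep "pid=$pid," | grep -o 'ino:[0-9]*' | cut -d: -f2 | sort > /tmp/live_inodes
   # Inodes present as fds but not tied to a live connection are candidates for stale sockets.
   comm -23 /tmp/fd_inodes /tmp/live_inodes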
sosreports and health reports @ http://rhsqe-repo.lab.eng.blr.redhat.com/sosreports/nchilaka/bug.1599769
Upstream patch: https://review.gluster.org/#/c/glusterfs/+/21966/
@Mohit, what's the state of this?
I have asked Sanju to look into the same.