Description of problem:
-----------------------
4-node Ganesha cluster, 4 clients; each client mounted from one particular server via its VIP. Ran I/O from different mounts on 3 different clients and ran "find" from the 4th client. find made no progress even 36 hours after it was started from the command line, and the dds hung on one of the clients as well.

Tried to mount the volume on 4 new clients. Mounts from the server whose VIP the find-hung client had mounted are unsuccessful; they eventually time out.

Shared the setup with Soumya. She suspects the find hangs are causing the dd hangs (BZ#1379673), but the find hangs and mount failures might need further investigation.

To reiterate, this is the impact/observation:

* Application-side hang - find.
* Unable to mount the volume from the server VIP/physical IP (the same server the find-hung client had mounted from).

[root@gqac030 ~]# mount -t nfs -o vers=4 192.168.79.153:/testvol /gluster-mount/ -v
mount.nfs: timeout set for Thu Oct 6 10:59:05 2016
mount.nfs: trying text-based options 'vers=4,addr=192.168.79.152,clientaddr=10.16.157.87'
^C
[root@gqac030 ~]#

[root@gqac030 ~]# ping 192.168.79.153
PING 192.168.79.153 (192.168.79.153) 56(84) bytes of data.
64 bytes from 192.168.79.153: icmp_seq=1 ttl=64 time=0.151 ms
64 bytes from 192.168.79.153: icmp_seq=2 ttl=64 time=0.096 ms
64 bytes from 192.168.79.153: icmp_seq=3 ttl=64 time=0.091 ms

Mounts from the other servers in the cluster are successful, though. pcs status was OK all along. Unable to take a BT.

Setup and workload details are in the comments.

Version-Release number of selected component (if applicable):
-------------------------------------------------------------
nfs-ganesha-2.4.0-2.el7rhgs.x86_64
glusterfs-ganesha-3.8.4-2.el7rhgs.x86_64

How reproducible:
-----------------
Reporting the first occurrence.

Steps to Reproduce:
-------------------
1. Mount the gluster volume via Ganesha.
2. Run dd from different clients.
3. Run find on the mount point from one of the clients while I/O is in progress. Check for progress continuously.
4. From another client, check whether mounts succeed from the same server the find-hung client had mounted from (see the sketch after this section).

Actual results:
---------------
* find hangs.
* Mounts from the server fail (the same server the find-hung client had mounted from).

Expected results:
-----------------
No hangs and successful mounts.

Additional info:
----------------
* mount vers=4
* Client/Server OS: RHEL 7.2

*Vol Config*:

Volume Name: testvol
Type: Distributed-Replicate
Volume ID: b93b99bd-d1d2-4236-98bc-08311f94e7dc
Status: Started
Snapshot Count: 0
Number of Bricks: 2 x 2 = 4
Transport-type: tcp
Bricks:
Brick1: gqas013.sbu.lab.eng.bos.redhat.com:/bricks/testvol_brick0
Brick2: gqas005.sbu.lab.eng.bos.redhat.com:/bricks/testvol_brick1
Brick3: gqas006.sbu.lab.eng.bos.redhat.com:/bricks/testvol_brick2
Brick4: gqas011.sbu.lab.eng.bos.redhat.com:/bricks/testvol_brick3
Options Reconfigured:
diagnostics.count-fop-hits: on
diagnostics.latency-measurement: on
ganesha.enable: on
features.cache-invalidation: off
nfs.disable: on
performance.readdir-ahead: on
performance.stat-prefetch: off
server.allow-insecure: on
nfs-ganesha: enable
cluster.enable-shared-storage: enable
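A minimal sketch of the step-4 check, assuming the suspect server's VIP is 192.168.79.153 (as in the output above); wrapping the mount in timeout(1) bounds the attempt instead of letting it hang at the prompt:

#!/bin/bash
# Hypothetical check: try a fresh vers=4 mount from the suspect VIP.
# A healthy server should mount within seconds; 120s here is an arbitrary bound.
VIP=192.168.79.153      # VIP of the server the find-hung client mounted from
MNT=/gluster-mount

mkdir -p "$MNT"
if timeout 120 mount -t nfs -o vers=4 "$VIP:/testvol" "$MNT"; then
    echo "mount from $VIP succeeded"
    umount "$MNT"
else
    echo "mount from $VIP failed/timed out (rc=$?)"
fi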
EXACT WORKLOAD:
---------------
*Data* - for i in {1..1000000}; do dd if=/dev/urandom of=stressc3$i conv=fdatasync bs=100 count=10000; done
*Metadata* - find . -mindepth 1 -type f
glibc version on clients and servers: glibc-2.17-149.el7.x86_64
As noted in https://bugzilla.redhat.com/show_bug.cgi?id=1383559#c5, please also collect process stack traces while the tests are being run.
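One possible way to capture those traces, assuming gstack (shipped in the RHEL gdb package) is available on the servers; a sketch, not necessarily the exact procedure requested in the linked BZ:

#!/bin/bash
# Periodically dump userspace stacks of ganesha.nfsd while the dd/find
# workload is running; one file per sample for later comparison.
PID=$(pidof ganesha.nfsd) || { echo "ganesha.nfsd not running"; exit 1; }
for i in $(seq 1 10); do
    gstack "$PID" > "/var/tmp/ganesha-stack-$(date +%s).txt"
    sleep 30
done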
I managed to delete the "Triaged" keyword added by Jiffin during a mid-air collision. Re-added.
Per the triage discussion, we all agree that this BZ has to be fixed in rhgs-3.2.0. Providing qa_ack.
Upstream fixes:
https://review.gerrithub.io/304278
https://review.gerrithub.io/304279
Raised a new BZ for the find hangs: https://bugzilla.redhat.com/show_bug.cgi?id=1403757
Verified on 2.4.1-6/3.8.4-13. find still hangs, which is expected (tracked in https://bugzilla.redhat.com/show_bug.cgi?id=1403757). Subsequent mounts were successful.
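The mount portion of this verification can be scripted; a minimal sketch, assuming a hypothetical list of the four server VIPs (only 192.168.79.153 appears earlier in this report):

#!/bin/bash
# Verify a fresh vers=4 mount succeeds from every server VIP while find is
# hung; timeout(1) bounds each attempt so a bad VIP cannot stall the loop.
for vip in 192.168.79.150 192.168.79.151 192.168.79.152 192.168.79.153; do  # hypothetical VIPs
    if timeout 120 mount -t nfs -o vers=4 "$vip:/testvol" /gluster-mount; then
        echo "$vip: mount OK"
        umount /gluster-mount
    else
        echo "$vip: mount failed/timed out"
    fi
done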
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://rhn.redhat.com/errata/RHEA-2017-0493.html