Description of problem:
-----------------------
Setup consists of a 4-node cluster with the volume mounted on 6 clients - 4 via VIP and 2 via physical IP. An "rm -rf *" from one of my clients hung (with continuous writes from the other clients) for almost 10 hours after it began executing. A normal keyboard interrupt did not break out of it. I/O was not affected, though.

pcs status shows OK:

[root@gqas013 ~]# pcs status
Cluster name: G1474623742.03
Last updated: Mon Oct 10 23:49:36 2016
Last change: Mon Oct 10 13:14:53 2016 by root via cibadmin on gqas006.sbu.lab.eng.bos.redhat.com
Stack: corosync
Current DC: gqas006.sbu.lab.eng.bos.redhat.com (version 1.1.13-10.el7-44eb2dd) - partition with quorum
4 nodes and 16 resources configured

Online: [ gqas005.sbu.lab.eng.bos.redhat.com gqas006.sbu.lab.eng.bos.redhat.com gqas011.sbu.lab.eng.bos.redhat.com gqas013.sbu.lab.eng.bos.redhat.com ]

Full list of resources:

 Clone Set: nfs_setup-clone [nfs_setup]
     Started: [ gqas005.sbu.lab.eng.bos.redhat.com gqas006.sbu.lab.eng.bos.redhat.com gqas011.sbu.lab.eng.bos.redhat.com gqas013.sbu.lab.eng.bos.redhat.com ]
 Clone Set: nfs-mon-clone [nfs-mon]
     Started: [ gqas005.sbu.lab.eng.bos.redhat.com gqas006.sbu.lab.eng.bos.redhat.com gqas011.sbu.lab.eng.bos.redhat.com gqas013.sbu.lab.eng.bos.redhat.com ]
 Clone Set: nfs-grace-clone [nfs-grace]
     Started: [ gqas005.sbu.lab.eng.bos.redhat.com gqas006.sbu.lab.eng.bos.redhat.com gqas011.sbu.lab.eng.bos.redhat.com gqas013.sbu.lab.eng.bos.redhat.com ]
 gqas013.sbu.lab.eng.bos.redhat.com-cluster_ip-1 (ocf::heartbeat:IPaddr): Started gqas013.sbu.lab.eng.bos.redhat.com
 gqas005.sbu.lab.eng.bos.redhat.com-cluster_ip-1 (ocf::heartbeat:IPaddr): Started gqas005.sbu.lab.eng.bos.redhat.com
 gqas006.sbu.lab.eng.bos.redhat.com-cluster_ip-1 (ocf::heartbeat:IPaddr): Started gqas006.sbu.lab.eng.bos.redhat.com
 gqas011.sbu.lab.eng.bos.redhat.com-cluster_ip-1 (ocf::heartbeat:IPaddr): Started gqas011.sbu.lab.eng.bos.redhat.com

PCSD Status:
  gqas013.sbu.lab.eng.bos.redhat.com: Online
  gqas005.sbu.lab.eng.bos.redhat.com: Online
  gqas006.sbu.lab.eng.bos.redhat.com: Online
  gqas011.sbu.lab.eng.bos.redhat.com: Online

Daemon Status:
  corosync: active/disabled
  pacemaker: active/enabled
  pcsd: active/disabled
[root@gqas013 ~]#

ganesha, pacemaker, corosync, pcsd and glusterd were active and alive at all times. I could not take a backtrace of the hung process, as gdb itself hung (because of the hung process). The sosreport, ganesha logs and tcpdump locations are in the comments. Since the issue is pretty consistent, I can work with Dev on whatever else they may need.

Version-Release number of selected component (if applicable):
-------------------------------------------------------------
nfs-ganesha-2.4.0-2.el7rhgs.x86_64
glusterfs-ganesha-3.8.4-2.el7rhgs.x86_64

How reproducible:
----------------
2/2

Steps to Reproduce:
------------------
1. Mount the volume on the client via v4.
2. Run I/O. An hour into the workload, trigger "rm -rf *" from one of the mounts.

Actual results:
---------------
rm hangs and cannot be interrupted from the keyboard via Ctrl+C/X/Z.

Expected results:
-----------------
No hangs.

Additional info:
----------------
* mount vers=4
* Client/Server OS: RHEL 7.2

*Vol Config*:

Volume Name: testvol
Type: Distributed-Replicate
Volume ID: b93b99bd-d1d2-4236-98bc-08311f94e7dc
Status: Started
Snapshot Count: 0
Number of Bricks: 2 x 2 = 4
Transport-type: tcp
Bricks:
Brick1: gqas013.sbu.lab.eng.bos.redhat.com:/bricks/testvol_brick0
Brick2: gqas005.sbu.lab.eng.bos.redhat.com:/bricks/testvol_brick1
Brick3: gqas006.sbu.lab.eng.bos.redhat.com:/bricks/testvol_brick2
Brick4: gqas011.sbu.lab.eng.bos.redhat.com:/bricks/testvol_brick3
Options Reconfigured:
diagnostics.count-fop-hits: on
diagnostics.latency-measurement: on
ganesha.enable: on
features.cache-invalidation: off
nfs.disable: on
performance.readdir-ahead: on
performance.stat-prefetch: off
server.allow-insecure: on
nfs-ganesha: enable
cluster.enable-shared-storage: enable
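The reproduction steps above can be sketched roughly as follows. This is illustrative only: the VIP and export path are taken from this report, the mount point and file sizes are assumptions (kept tiny here), and the destructive rm is left commented out.

```shell
#!/bin/bash
# Rough reproduction sketch (illustrative values; assumes a ganesha export
# /testvol reachable at the VIP 192.168.79.152 from this report).
MNT=${MNT:-/tmp/testvol-mnt}   # hypothetical mount point
mkdir -p "$MNT"

# Step 1: mount the volume on the client via NFSv4
mount -t nfs -o vers=4 192.168.79.152:/testvol "$MNT" \
    || echo "mount failed (expected when run outside the test cluster)"

# Step 2: generate sustained writes (run similar loops on the other clients)
for i in $(seq 1 5); do
    dd if=/dev/zero of="$MNT/file_$i" bs=1M count=1 2>/dev/null || true
done

# After ~1 hour of load, from one of the mounts, trigger the deletion
# (left commented out because it removes everything under $MNT):
#   cd "$MNT" && rm -rf *
```

With the real cluster, the dd loops would run with much larger files, and the final rm is what hangs in this report.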
Last few lines from dmesg on the client where rm was hung:

[15787.291198] nfs: server 192.168.79.152 not responding, still trying
[15829.189303] nfs: server 192.168.79.152 OK
[16009.508639] nfs: server 192.168.79.152 not responding, still trying
[16049.900337] nfs: server 192.168.79.152 OK
[16230.190385] nfs: server 192.168.79.152 not responding, still trying
[16266.302110] nfs: server 192.168.79.152 OK
[16446.519894] nfs: server 192.168.79.152 not responding, still trying
[16481.261059] nfs: server 192.168.79.152 OK

[... the same "not responding, still trying" / "OK" pair repeats every few minutes, with the recovery interval gradually lengthening, for the whole duration of the hang ...]

[39818.035587] nfs: server 192.168.79.152 not responding, still trying
[40127.927662] nfs: server 192.168.79.152 OK
[40308.040787] nfs: server 192.168.79.152 not responding, still trying
[root@gqac015 ~]#
As discussed, please check the following:

* Collect a packet trace for some time (from both client and server) to check whether there is active I/O going on.
* Collect thread backtraces of the ganesha processes at 2-3 intervals and provide them to us. Also collect a core, if possible, every time you collect the trace.

I suspect that "rm -rf *" invokes READDIR, which ends up taking a lot of time (as observed in bug 1382912) if the mount point contains millions of entries. I am working on optimizing that code path and will provide test patches once they are ready.
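The data collection requested above could be scripted along these lines. This is a sketch: the output directory, interval, and iteration count are assumptions, and gdb/gcore (and tcpdump for the trace) must be installed on the server.

```shell
#!/bin/bash
# Sketch of the requested data collection: periodic thread backtraces and
# cores of ganesha.nfsd, plus a packet trace (paths/intervals assumed).
OUTDIR=${OUTDIR:-/tmp/ganesha-debug}
mkdir -p "$OUTDIR"

# Packet trace of NFS traffic; run the same on the client side and stop it
# with Ctrl+C once enough has been captured (commented out in this sketch):
#   tcpdump -i any -w "$OUTDIR/nfs.pcap" port 2049

PID=$(pidof ganesha.nfsd || true)
if [ -z "$PID" ]; then
    echo "ganesha.nfsd not running; nothing to capture"
else
    for i in 1 2 3; do
        # Dump every thread's stack without stopping the daemon for long
        gdb -p "$PID" -batch -ex 'thread apply all bt' \
            > "$OUTDIR/backtrace_$i.txt" 2>&1
        # Optional core alongside each backtrace, as requested
        gcore -o "$OUTDIR/core_$i" "$PID"
        sleep 60
    done
fi
```

Comparing the three backtrace snapshots should show whether any ganesha thread is stuck in the same frame (e.g. inside the READDIR path) across the interval.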
As per the triaging, we all agree that this BZ has to be fixed in rhgs-3.2.0. Providing qa_ack.
Upstream fix: https://review.gerrithub.io/304278 https://review.gerrithub.io/304279
POST with rebase to nfs-ganesha-2.5.x
Verified this with:

# rpm -qa | grep ganesha
nfs-ganesha-2.5.5-8.el7rhgs.x86_64
glusterfs-ganesha-3.12.2-14.el7rhgs.x86_64
nfs-ganesha-gluster-2.5.5-8.el7rhgs.x86_64
nfs-ganesha-debuginfo-2.5.5-8.el7rhgs.x86_64

Steps performed for verification:
1. Created a 6 x 3 Distributed-Replicate volume.
2. Exported the volume via ganesha.
3. Mounted the volume on 6 clients via v4 - 4 with VIP and 2 with the physical IP of the server.
4. Performed I/O from all 6 mount points.
5. After 1 hour, while I/O was still in progress, performed "rm -rf *" from one of the clients.

All files were deleted from the mount point and no hangs were observed. Moving this BZ to verified state.
This should be moved out of 3.4, since dirent chunk is removed.
Verified this with:

# rpm -qa | grep ganesha
nfs-ganesha-2.7.3-7.el7rhgs.x86_64
nfs-ganesha-debuginfo-2.7.3-7.el7rhgs.x86_64
glusterfs-ganesha-6.0-11.el7rhgs.x86_64
nfs-ganesha-gluster-2.7.3-7.el7rhgs.x86_64

Steps:
1. Create a 4-node ganesha cluster.
2. Create 1 Distributed-Disperse volume, 2 x (4 + 2) = 12.
3. Mount the volume on 6 clients via v4.1.
4. Run I/O from 5 clients.
5. Wait for around 1 hour, then run "rm -rf *" from another client with I/O still ongoing.

No hangs were observed while I/O was running. Moving this BZ to verified state.
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHEA-2019:3252